Some SoX commands, and audio data augmentation with sox in Kaldi

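    • 1 Installing SoX on Windows 10 and Linux
    • 2 sox commands
      • 2.1 Basic audio information
      • 2.2 Converting the sample rate
      • 2.3 Converting between wav and pcm
      • 2.4 Trimming audio
      • 2.5 Extracting a specific channel
      • 2.6 Merging two mono files into one stereo file
      • 2.7 Adjusting the speaking rate
      • 2.8 Concatenating audio
      • 2.9 Audio statistics and volume adjustment
      • 2.10 Computing audio duration with sox
    • 3 Batch audio operations
    • 4 Audio data augmentation with sox in Kaldi
      • 4.1 Speed perturbation (sp): utils/data/perturb_data_dir_speed_3way.sh
      • 4.2 Volume perturbation (vp): utils/data/perturb_data_dir_volume.sh
      • 4.3 Adding reverberation: steps/data/reverberate_data_dir.py
      • 4.4 Additive noise: steps/data/augment_data_dir.py
      • 4.5 wav-reverberate in Kaldi
    • Reference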


If you prefer FFmpeg commands, see "Some FFmpeg commands".


For batch operations on audio files, see "5.1 Batch-listing all audio files under an absolute path and renaming them".

1 Installing SoX on Windows 10 and Linux

yum install sox
apt install sox

SoX download for Windows 10, and an introduction to SoX on Windows 10

Add the SoX directory, C:\JAVA\sox-14.4.2-win32, to the PATH environment variable:

C:\JAVA\sox-14.4.2-win32;
C:\ProgramFiles\cmake-3.7.0-win64-x64\bin;
C:\JAVA\ffmpeg-20200504-5767a2e-win64-static\bin;
C:\ProgramFiles\node-v14.0.0-win-x64;
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.2\bin;
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.2\libnvvp;
C:\Program Files\Java\jdk1.8.0_121\bin;
C:\Program Files\Java\jdk1.8.0_121\jre\bin;
C:\Python37;
C:\Python37\Scripts;

2 sox commands

2.1 Basic audio information

# Either command works; on Windows only sox --i is available
soxi demo.wav
sox --i demo.wav

Input File     : 'demo.wav'
Channels       : 1
Sample Rate    : 16000
Precision      : 16-bit
Duration       : 00:00:04.88 = 78031 samples ~ 365.77 CDDA sectors
File Size      : 156k
Bit Rate       : 256k
Sample Encoding: 16-bit Signed Integer PCM
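
The Duration and Bit Rate fields follow from the header values: duration is the sample count divided by the sample rate, and the PCM bit rate is sample rate × bit depth × channels. A minimal Python sketch using only the standard-library wave module, assuming an uncompressed PCM wav file (the helper name wav_info is hypothetical, not part of sox):

```python
import wave


def wav_info(path):
    """Return basic header information for an uncompressed PCM wav file.

    wav_info is a hypothetical helper name used only for this sketch.
    """
    with wave.open(path, "rb") as wf:
        channels = wf.getnchannels()
        rate = wf.getframerate()
        bits = wf.getsampwidth() * 8
        frames = wf.getnframes()
    return {
        "channels": channels,
        "sample_rate": rate,
        "precision_bits": bits,
        "samples": frames,
        # Duration in seconds = number of samples / sample rate
        "duration_s": frames / rate,
        # Uncompressed PCM bit rate = rate * bit depth * channels
        "bit_rate": rate * bits * channels,
    }
```

For the file above (78031 samples at 16000 Hz, 16-bit, mono) this gives 78031 / 16000 ≈ 4.877 s and 16000 × 16 × 1 = 256000 bit/s, consistent with the soxi output.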

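2.2 Converting the sample rate

# Re-encode the mono file demo.wav as a wav with the sample rate given by -r (8000 Hz here)
# -c 1: mono; -b: bit depth; -r: sample rate. Any of the three can be changed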
sox demo.wav -c 1 -b 16 -r 8000 output.wav

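2.3 Converting between wav and pcm

# Convert the mono file demo.wav to a 16000 Hz pcm file; the source must already be stored as little-endian, signed 16-bit samples
sox demo.wav -b 16 -e signed-integer --endian little -c 1 -r 16000 -t raw output.pcm
# Convert the mono file demo.pcm to a 16000 Hz wav file; the source must already be stored as little-endian, signed 16-bit samples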
sox -b 16 -e signed-integer --endian little -c 1 -r 16000 -t raw demo.pcm output.wav
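
A raw pcm file has no header, which is why the sample format has to be given on the command line. The same idea can be sketched in Python with the standard-library wave module. This is only an illustration under the assumption that the data is already little-endian signed 16-bit PCM; unlike sox it does no resampling or channel conversion, and the function names are hypothetical:

```python
import wave


def pcm_to_wav(pcm_path, wav_path, rate=16000, channels=1, sample_width=2):
    """Wrap headerless little-endian signed PCM in a wav container."""
    with open(pcm_path, "rb") as f:
        data = f.read()
    with wave.open(wav_path, "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(sample_width)  # 2 bytes per sample = 16 bits
        wf.setframerate(rate)
        wf.writeframes(data)           # the sample bytes are copied unchanged


def wav_to_pcm(wav_path, pcm_path):
    """Drop the wav header and write only the raw sample bytes."""
    with wave.open(wav_path, "rb") as wf:
        data = wf.readframes(wf.getnframes())
    with open(pcm_path, "wb") as f:
        f.write(data)
```

Because the header carries the rate, channel count and sample width, wav_to_pcm needs no parameters, while pcm_to_wav has to be told all three.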

2.4 Trimming audio

# Keep the [0, 10] second interval (start at 0 s, length 10 s)
sox demo.wav output.wav trim 0 10

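2.5 Extracting a specific channel

# Extract channel 2; to extract channel 1, set the remix value to 1
sox demo.wav output.wav remix 2

2.6 Merging two mono files into one stereo file

# Merge the left and right channels into a single two-channel file; output.wav is as long as the longer of the two inputs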
sox -M demo.left.wav demo.right.wav output.wav

2.7 Adjusting the speaking rate

[Note] Keep the speed factor within [0.85, 1.25]; outside that range the audio becomes badly distorted.

sox demo.wav output.wav speed 0.85

2.8 Concatenating audio

This joins the two files end to end; the duration of output.wav is the sum of the durations of demo.left.wav and demo.right.wav.

sox demo.left.wav demo.right.wav output.wav

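2.9 Audio statistics and volume adjustment

The "Volume adjustment" field of the stat output shows the largest factor by which the volume can be scaled.

sox demo.wav -n stat

Samples read:             78031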
Length (seconds):      4.876938
Scaled by:         2147483647.0
Maximum amplitude:     0.159882
Minimum amplitude:    -0.134613
Midline amplitude:     0.012634
Mean    norm:          0.016386
Mean    amplitude:    -0.000025
RMS     amplitude:     0.024805
Maximum delta:         0.076752
Minimum delta:         0.000000
Mean    delta:         0.004386
RMS     delta:         0.008032
Rough   frequency:          824
Volume adjustment:        6.255

# Scale the volume down by a factor of 0.8 (a value greater than 1 amplifies)
sox -v 0.8 demo.wav output.wav
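
The reported Volume adjustment is consistent with the reciprocal of the peak absolute amplitude, i.e. the largest gain that keeps every sample within full scale. A small Python sketch of that arithmetic, using the Maximum and Minimum amplitude values printed above (the function name is hypothetical):

```python
def volume_adjustment(max_amplitude, min_amplitude):
    """Largest gain that keeps the peak within full scale (|x| <= 1.0)."""
    peak = max(abs(max_amplitude), abs(min_amplitude))
    if peak == 0:
        raise ValueError("silent audio has no finite volume adjustment")
    return 1.0 / peak


# Values taken from the stat output above
print(round(volume_adjustment(0.159882, -0.134613), 3))  # 6.255
```

Any -v value above this factor would clip the loudest sample of demo.wav.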

2.10 Computing audio duration with sox

soxi -D filename.wav

3 Batch audio operations

See "5.1 Batch-listing all audio files under an absolute path and renaming them".
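
As a rough illustration of batch processing with the commands from section 2, the Python sketch below walks a directory and builds one sox resampling command per wav file. It is not the method from the linked section; the function names and the flat output layout are assumptions made for this example, and files with the same name in different subdirectories would overwrite each other.

```python
import os
import subprocess


def build_resample_commands(src_dir, dst_dir, rate=16000):
    """Return one sox argument list per .wav file found under src_dir."""
    commands = []
    for root, _dirs, files in os.walk(src_dir):
        for name in sorted(files):
            if not name.lower().endswith(".wav"):
                continue
            src = os.path.abspath(os.path.join(root, name))
            dst = os.path.abspath(os.path.join(dst_dir, name))
            # Same options as in section 2.2: mono, 16-bit, given sample rate
            commands.append(
                ["sox", src, "-c", "1", "-b", "16", "-r", str(rate), dst]
            )
    return commands


def run_all(commands):
    """Run the commands one by one; requires sox to be on PATH."""
    for cmd in commands:
        os.makedirs(os.path.dirname(cmd[-1]), exist_ok=True)
        subprocess.run(cmd, check=True)
```

Building the command lists separately from running them makes it easy to print and inspect them before anything is written to disk.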


4 Audio data augmentation with sox in Kaldi

4.1 Speed perturbation (sp): utils/data/perturb_data_dir_speed_3way.sh

Invocation (run_ivector_common.sh#L43):

utils/data/perturb_data_dir_speed_3way.sh data/train data/train_sp

Inside utils/data/perturb_data_dir_speed_3way.sh you will find:

utils/data/perturb_data_dir_speed.sh 0.9 ${srcdir} ${destdir}_speed0.9
utils/data/perturb_data_dir_speed.sh 1.1 ${srcdir} ${destdir}_speed1.1
# Then combine them with the original data
utils/data/combine_data.sh $destdir ${srcdir} ${destdir}_speed0.9 ${destdir}_speed1.1 || exit 1
rm -r ${destdir}_speed0.9 ${destdir}_speed1.1
# Then check that the resulting data directory is valid
utils/validate_data_dir.sh --no-feats --no-text $destdir

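utils/data/perturb_data_dir_speed.sh is essentially utils/perturb_data_dir_speed.sh; internally it calls sox to perturb the speed.

So the sox speed factors Kaldi uses for speed perturbation are 0.9 and 1.1.

Speed perturbation (sp) therefore creates a new data directory.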
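
To make the role of sox concrete, the Python sketch below rewrites a wav.scp entry so that the audio is piped through the sox speed effect. It is modeled loosely on what the speed perturbation script does, not copied from it: the "sp0.9-" id prefix and the exact command layout are assumptions, and the function name is hypothetical.

```python
def speed_perturb_entry(line, factor):
    """Rewrite one 'id path-or-command' wav.scp line for a speed factor.

    The 'sp<factor>-' id prefix and the command layout are assumptions
    made for this sketch.
    """
    rec_id, rest = line.strip().split(None, 1)
    new_id = "sp{0}-{1}".format(factor, rec_id)
    if rest.endswith("|"):
        # Already a pipeline: append a sox stage that reads from stdin
        cmd = "{0} sox -t wav - -t wav - speed {1} |".format(rest, factor)
    else:
        # Plain file path: let sox read the file directly
        cmd = "sox -t wav {0} -t wav - speed {1} |".format(rest, factor)
    return "{0} {1}".format(new_id, cmd)


print(speed_perturb_entry("HAO-001 /data1/speechResource/HAO/HAO-001.wav", 0.9))
```

Because the entry becomes a command ending in a pipe, the perturbed audio is generated on the fly when features are extracted, and no new wav files are written.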


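4.2 Volume perturbation (vp): utils/data/perturb_data_dir_volume.sh

Invocation (run_ivector_common.sh#L73):

Internally this also uses sox. It modifies the wav.scp file in train_sp_hires in place and does not create a new data directory.

utils/data/perturb_data_dir_volume.sh data/train_sp_hires

For example: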

HAO-001 sox --vol 1.54213288279 -t wav /data1/speechResource/HAO/HAO-001.wav -t wav - |
HAO-002 sox --vol 1.28444186877 -t wav /data1/speechResource/HAO/HAO-002.wav -t wav - |
HAO-003 sox --vol 0.653445958249 -t wav /data1/speechResource/HAO/HAO-003.wav -t wav - |
...
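
Entries like these can be produced by drawing a random volume factor per recording and wrapping the path in a sox --vol command. The Python sketch below shows the idea; the range [0.125, 2.0] for the factor is an assumption for this example (the sample values above fall inside it), and the function name is hypothetical.

```python
import random


def volume_perturb_entry(line, rng, scale_low=0.125, scale_high=2.0):
    """Rewrite one wav.scp line with a random sox --vol factor.

    scale_low and scale_high are assumed values for this sketch.
    """
    rec_id, rest = line.strip().split(None, 1)
    vol = rng.uniform(scale_low, scale_high)
    if rest.endswith("|"):
        # Already a pipeline: append a sox stage that reads from stdin
        cmd = "{0} sox --vol {1} -t wav - -t wav - |".format(rest, vol)
    else:
        # Plain file path: let sox read the file directly
        cmd = "sox --vol {0} -t wav {1} -t wav - |".format(vol, rest)
    return "{0} {1}".format(rec_id, cmd)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
print(volume_perturb_entry("HAO-001 /data1/speechResource/HAO/HAO-001.wav", rng))
```

The recording id is left unchanged, which matches the observation that volume perturbation edits wav.scp in place rather than adding a new data directory.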

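4.3 Adding reverberation: steps/data/reverberate_data_dir.py

rirs_noises.zip
This set of room impulse responses contains both real and simulated ones. The simulated small-room and medium-room subsets are used most often; their room sizes are 1 to 10 m and 10 to 30 m respectively.
Invocation (mobvoihotwords/v1/run.sh#L116-L135):

  steps/data/reverberate_data_dir.py \
    "${rvb_opts[@]}" \
    --speech-rvb-probability 1 \
    --prefix "rev" \
    --pointsource-noise-addition-probability 0 \
    --isotropic-noise-addition-probability 0 \
    --num-replications 1 \
    --source-sampling-rate 16000 \
    data/train_shorter data/train_shorter_reverb
  cat data/train_shorter/utt2dur | awk -v name=rev1 '{print name"-"$0}' >data/train_shorter_reverb/utt2dur

The script uses sox to resample each noise file to the target sampling rate (fragment):

    # File path: let sox read and resample the file
    noise.noise_rspecifier = "sox {0} -r {1} -t wav - |".format(noise.noise_rspecifier, sampling_rate)
else:
    # Existing pipe: append a sox resampling stage
    noise.noise_rspecifier = "{0} sox -t wav - -r {1} -t wav - |".format(noise.noise_rspecifier, sampling_rate)

# Reverberation augmentation (resulting wav.scp entries)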
rev1-0003b40554dd83e95d9f77a86d9c28a6-0003b40554dd83e95d9f77a86d9c28a6-00000000-00000210 wav-copy data/train_shorter/data/wav_segments.1.ark:625364 - | wav-reverberate --shift-output=true --impulse-response="sox RIRS_NOISES/simulated_rirs/smallroom/Room060/Room060-00042.wav -r 16000 -t wav - |" - - |
rev1-000414cb940dc155a7d4cc9b66538607-000414cb940dc155a7d4cc9b66538607-00000000-00000211 wav-copy data/train_shorter/data/wav_segments.1.ark:692692 - | wav-reverberate --shift-output=true --impulse-response="sox RIRS_NOISES/simulated_rirs/mediumroom/Room046/Room046-00011.wav -r 16000 -t wav - |" - - |
rev1-00043653948102852b37edcf558ca535-00043653948102852b37edcf558ca535-00000000-00000056 wav-copy data/train_shorter/data/wav_segments.1.ark:760340 - | wav-reverberate --shift-output=true --impulse-response="sox RIRS_NOISES/simulated_rirs/mediumroom/Room130/Room130-00034.wav -r 16000 -t wav - |" - - |

4.4 Additive noise: steps/data/augment_data_dir.py

musan.tar.gz
This additive-noise corpus contains babble speech, background music, and real-world noise. Both noise corpora (RIRS_NOISES and MUSAN) work well. Beyond them, you can also change the speed and volume with sox; this kind of perturbation (dither) is worth considering too.
Invocation (mobvoihotwords/v1/run.sh#L139-L155):

# Generate Kaldi-format data directories for MUSAN: one combined data/musan plus data/musan_music, data/musan_noise and data/musan_speech
steps/data/make_musan.sh /data1/kaldi/corpora/musan data
# Get the duration of the MUSAN recordings.  This will be used by the
# script augment_data_dir.py.
for name in speech noise music; do
  utils/data/get_utt2dur.sh data/musan_${name}
  cp data/musan_${name}/utt2dur data/musan_${name}/reco2dur
done
# Augment with musan_noise
export LC_ALL=en_US.UTF-8
steps/data/augment_data_dir.py --utt-prefix "noise" --modify-spk-id true --fg-interval 1 --fg-snrs "15:10:5:0" --fg-noise-dir "data/musan_noise" data/train_shorter data/train_shorter_noise
# Augment with musan_music
steps/data/augment_data_dir.py --utt-prefix "music" --modify-spk-id true --bg-snrs "15:10:8:5" --num-bg-noises "1" --bg-noise-dir "data/musan_music" data/train_shorter data/train_shorter_music
# Augment with musan_speech
steps/data/augment_data_dir.py --utt-prefix "babble" --modify-spk-id true --bg-snrs "20:17:15:13" --num-bg-noises "3:4:5:6:7" --bg-noise-dir "data/musan_speech" data/train_shorter data/train_shorter_babble
export LC_ALL=C

# Additive noise augmentation (resulting wav.scp entries)
noise-00022952a0c58c9aa6c85b57a06e0fd3-00022952a0c58c9aa6c85b57a06e0fd3-00000000-00000148 wav-copy data/train_shorter/data/wav_segments.1.ark:519700 - | wav-reverberate --shift-output=true --additive-signals='/data1/kaldi/corpora/musan/noise/free-sound/noise-free-sound-0718.wav' --start-times='0' --snrs='10' - - |
noise-000232ce10717642b15469276c065154-000232ce10717642b15469276c065154-00000000-00000076 wav-copy data/train_shorter/data/wav_segments.1.ark:567188 - | wav-reverberate --shift-output=true --additive-signals='/data1/kaldi/corpora/musan/noise/free-sound/noise-free-sound-0167.wav' --start-times='0' --snrs='15' - - |
noise-00023dad939ab3c63b515bf6fdccfd83-00023dad939ab3c63b515bf6fdccfd83-00000000-00000105 wav-copy data/train_shorter/data/wav_segments.1.ark:591636 - | wav-reverberate --shift-output=true --additive-signals='/data1/kaldi/corpora/musan/noise/sound-bible/noise-sound-bible-0084.wav' --start-times='0' --snrs='0' - - |
noise-0003b40554dd83e95d9f77a86d9c28a6-0003b40554dd83e95d9f77a86d9c28a6-00000000-00000210 wav-copy data/train_shorter/data/wav_segments.1.ark:625364 - | wav-reverberate --shift-output=true --additive-signals='/data1/kaldi/corpora/musan/noise/free-sound/noise-free-sound-0792.wav' --start-times='0' --snrs='15' - - |
# Additive music augmentation
music-0002218b62ea517d039f3cb529eff05d-0002218b62ea517d039f3cb529eff05d-00000237-00000404 wav-copy data/train_shorter/data/wav_segments.1.ark:465992 - | wav-reverberate --shift-output=true --additive-signals='wav-reverberate --duration=1.674375 "/data1/kaldi/corpora/musan/music/fma/music-fma-0092.wav" - |' --start-times='0' --snrs='5' - - |
music-00022952a0c58c9aa6c85b57a06e0fd3-00022952a0c58c9aa6c85b57a06e0fd3-00000000-00000148 wav-copy data/train_shorter/data/wav_segments.1.ark:519700 - | wav-reverberate --shift-output=true --additive-signals='wav-reverberate --duration=1.48 "/data1/kaldi/corpora/musan/music/fma/music-fma-0006.wav" - |' --start-times='0' --snrs='8' - - |
music-000232ce10717642b15469276c065154-000232ce10717642b15469276c065154-00000000-00000076 wav-copy data/train_shorter/data/wav_segments.1.ark:567188 - | wav-reverberate --shift-output=true --additive-signals='wav-reverberate --duration=0.76 "/data1/kaldi/corpora/musan/music/fma/music-fma-0107.wav" - |' --start-times='0' --snrs='15' - - |
# Additive babble (overlapping speech) augmentation
babble-00017d340fe9b684011e1639d06fa3b6-00017d340fe9b684011e1639d06fa3b6-00000000-00000250 wav-copy data/train_shorter/data/wav_segments.1.ark:124500 - | wav-reverberate --shift-output=true --additive-signals='wav-reverberate --duration=2.5 "/data1/kaldi/corpora/musan/speech/us-gov/speech-us-gov-0114.wav" - |,wav-reverberate --duration=2.5 "/data1/kaldi/corpora/musan/speech/us-gov/speech-us-gov-0186.wav" - |,wav-reverberate --duration=2.5 "/data1/kaldi/corpora/musan/speech/librivox/speech-librivox-0083.wav" - |,wav-reverberate --duration=2.5 "/data1/kaldi/corpora/musan/speech/us-gov/speech-us-gov-0050.wav" - |,wav-reverberate --duration=2.5 "/data1/kaldi/corpora/musan/speech/us-gov/speech-us-gov-0132.wav" - |' --start-times='0,0,0,0,0' --snrs='15,17,20,20,13' - - |
babble-00018c7f9e3f4d4a709b9a43ccd6943b-00018c7f9e3f4d4a709b9a43ccd6943b-00000000-00000166 wav-copy data/train_shorter/data/wav_segments.1.ark:204628 - | wav-reverberate --shift-output=true --additive-signals='wav-reverberate --duration=1.66975 "/data1/kaldi/corpora/musan/speech/librivox/speech-librivox-0003.wav" - |,wav-reverberate --duration=1.66975 "/data1/kaldi/corpora/musan/speech/us-gov/speech-us-gov-0201.wav" - |,wav-reverberate --duration=1.66975 "/data1/kaldi/corpora/musan/speech/librivox/speech-librivox-0052.wav" - |' --start-times='0,0,0' --snrs='15,13,20' - - |
babble-00018c7f9e3f4d4a709b9a43ccd6943b-00018c7f9e3f4d4a709b9a43ccd6943b-00000137-00000519 wav-copy data/train_shorter/data/wav_segments.1.ark:258188 - | wav-reverberate --shift-output=true --additive-signals='wav-reverberate --duration=3.82025 "/data1/kaldi/corpora/musan/speech/us-gov/speech-us-gov-0168.wav" - |,wav-reverberate --duration=3.82025 "/data1/kaldi/corpora/musan/speech/librivox/speech-librivox-0064.wav" - |,wav-reverberate --duration=3.82025 "/data1/kaldi/corpora/musan/speech/librivox/speech-librivox-0149.wav" - |' --start-times='0,0,0' --snrs='17,20,13' - - |
babble-0002218b62ea517d039f3cb529eff05d-0002218b62ea517d039f3cb529eff05d-00000000-00000266 wav-copy data/train_shorter/data/wav_segments.1.ark:380564 - | wav-reverberate --shift-output=true --additive-signals='wav-reverberate --duration=2.665625 "/data1/kaldi/corpora/musan/speech/us-gov/speech-us-gov-0071.wav" - |,wav-reverberate --duration=2.665625 "/data1/kaldi/corpora/musan/speech/us-gov/speech-us-gov-0067.wav" - |,wav-reverberate --duration=2.665625 "/data1/kaldi/corpora/musan/speech/us-gov/speech-us-gov-0220.wav" - |,wav-reverberate --duration=2.665625 "/data1/kaldi/corpora/musan/speech/us-gov/speech-us-gov-0002.wav" - |,wav-reverberate --duration=2.665625 "/data1/kaldi/corpora/musan/speech/librivox/speech-librivox-0105.wav" - |,wav-reverberate --duration=2.665625 "/data1/kaldi/corpora/musan/speech/librivox/speech-librivox-0006.wav" - |,wav-reverberate --duration=2.665625 "/data1/kaldi/corpora/musan/speech/us-gov/speech-us-gov-0220.wav" - |' --start-times='0,0,0,0,0,0,0' --snrs='15,20,15,13,15,13,13' - - |

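4.5 wav-reverberate in Kaldi

This command is only available after completing the debugging setup described in "Kaldi model information analysis".
Working through pipes, it takes the wav waveform supplied on its input and augments it with a room impulse response (rir_matrix) and additive-noise distortions
(each specified by the corresponding file).

# --duration=20.25: duration of the output in seconds, set here to the duration of the piped input audio
# --snrs: signal-to-noise ratio in dB for each additive signal; the larger the value, the quieter the noise
# --start-times='0,17.8': time offsets (in seconds) in the input audio at which each additive signal is added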
wav-reverberate --duration=20.25 \
--impulse-response=rir.wav \
--additive-signals='noise1.wav,noise2.wav' \
--snrs='20.0,15.0' --start-times='0,17.8' input.wav output.wav

Reverberation alone does not need the --snrs values.
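
Why a larger --snrs value means quieter noise can be seen from the usual definition SNR = 10·log10(P_signal / P_noise) in dB. The Python sketch below scales a noise signal to reach a requested SNR and adds it at a given start offset. It illustrates the standard definition only and is not a transcription of the wav-reverberate source; all names are hypothetical.

```python
import math


def power(samples):
    """Mean squared amplitude of a sequence of samples."""
    return sum(x * x for x in samples) / len(samples)


def noise_gain_for_snr(signal, noise, snr_db):
    """Gain for the noise so that 10*log10(P_signal / P_noise) == snr_db."""
    p_noise = power(noise)
    if p_noise == 0:
        raise ValueError("noise is silent")
    return math.sqrt(power(signal) / (p_noise * 10 ** (snr_db / 10.0)))


def add_noise(signal, noise, snr_db, start=0):
    """Add scaled noise to signal, beginning at sample index start."""
    gain = noise_gain_for_snr(signal, noise, snr_db)
    out = list(signal)
    for i, n in enumerate(noise):
        j = start + i
        if j >= len(out):
            break  # noise that runs past the end of the signal is dropped
        out[j] += gain * n
    return out
```

With a signal power of 1 and a noise power of 0.25, an SNR of 20 dB gives a noise gain of 0.2, while 0 dB gives 2.0, so raising the SNR lowers the level of the added noise.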


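Reference

Installing sox and common commands
SOX: the do-everything audio processing command on Linux
What methods are there for speech data augmentation in Kaldi
Adding noise, reverberation, or energy attenuation in Kaldi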