For audio research, speech and noise corpora are commonly stored in an uncompressed format, generally WAV. This results in very large file sizes; for instance, the CSTR VCTK Corpus is 10 GB.
However, common lossy compression formats like MP3 or Ogg Vorbis would distort the noise signals, rendering them useless for comparison against results obtained from the original files.
Conversion
As an exercise, I wrote a script to convert them all into the lossless FLAC format, using FFmpeg and GNU Parallel. Unfortunately (see below), some files refuse to convert losslessly, but I still consider it good practice in shell programming.
The script creates a parallel set of folders under flac48, since the folder structure encodes the meaning of each file in the corpus. It should be easy to modify, but I don't have much practice with bash, so I can't offer support.
Gist below; GNU Parallel requests that you cite it if you use it (and, by extension, this script) in your research:
Verifying
After conversion, we want to check that when the FLAC audio is decompressed we get back the exact same audio.
Verifying can be done with audiodiff, but it only supports Python 2, shows no progress, and is single-threaded, which results in half an hour of thumb-twiddling on my machine.
After some research, this helpful Stack Overflow answer shows that FFmpeg can hash the decoded audio directly to MD5, without needing any intermediate files.
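For a single pair of files, the technique looks like this: FFmpeg's `md5` muxer hashes the decoded samples, so container metadata is excluded from the comparison. Forcing `pcm_s16le` (assuming 16-bit source audio) keeps both hashes in the same sample format.

```shell
# Hash the decoded PCM of a file; identical hashes mean identical audio.
audio_md5() {
  ffmpeg -loglevel error -i "$1" -map 0:a -c:a pcm_s16le -f md5 - \
    | sed 's/^MD5=//'
}

# Usage: compare the original WAV against its FLAC conversion.
# if [ "$(audio_md5 original.wav)" = "$(audio_md5 converted.flac)" ]; then
#   echo "lossless"
# fi
```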
Therefore, I implemented the following script using FFmpeg's MD5 output, with GNU Parallel again for multiprocessing and a progress bar. It also uses the excellent tput utility (standard on most Linux installations) for text colors.
Unfortunately, on running this script, I found that about 1% of the few hundred audio files differed in their MD5 values. Reconverting them manually did not fix the discrepancy, so either FFmpeg is subtly failing to compress losslessly, or (probably less likely) there is some error in the MD5 output. Therefore, I will stick with WAV for now.