Dataset Preparation
It's your responsibility to obtain your data legally and ethically.
A dataset is a collection of raw data that you want the algorithm to learn from. In our case, it's a collection of audio files of your target speaker.
In this section, we will cover the requirements for the dataset and some common procedures for preparing it.
Preparing a dataset may take the most manual work. Be patient, and keep in mind that the higher the quality of the dataset, the higher the quality of the resulting model is likely to be.
The audio files should be the target speaker's dry vocals:
Without background music
Without other voices
Preferably without excessive reverb
You can use either speaking or singing data in any language.
In general, singing data gives the model a wider vocal range to work with.
The audio files should be in .wav or .ogg format.
The audio files should be sliced into segments of 5 to 15 seconds.
The sampling rate should preferably be higher than 24 kHz and should not be lower than 16 kHz.
The program will automatically resample files during preprocessing and handle the number of channels.
Do NOT leave spaces in the folder or file name.
Files should NOT have the same name even if they are in different folders.
There should be at least 6 files in the dataset.
The folder structure of the dataset does not matter.
Keeping audio length between 5 and 15 seconds is a best practice. You can have longer files, but excessively long files may lead to issues later.
Non-English characters in folder or file names may also lead to problems; avoid them.
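If you want, you can automate these checks with a short script. The sketch below is only an illustration, not part of the official tooling; it assumes the third-party soundfile package and a dataset folder named dataset_raw (both assumptions), and it reports files that break the format, sampling-rate, naming, or count requirements above.

```python
# check_dataset.py - a rough sanity check for the rules above (illustrative only).
# Assumes: pip install soundfile
from pathlib import Path
import soundfile as sf

DATASET_DIR = Path("dataset_raw")   # hypothetical dataset folder; adjust to yours
ALLOWED_EXT = {".wav", ".ogg"}
MIN_FILES = 6

files = [p for p in DATASET_DIR.rglob("*") if p.suffix.lower() in ALLOWED_EXT]
problems = []

if len(files) < MIN_FILES:
    problems.append(f"only {len(files)} audio files found, need at least {MIN_FILES}")

seen = {}
for path in files:
    rel = str(path.relative_to(DATASET_DIR))
    if " " in rel:
        problems.append(f"space in path: {path}")
    if not rel.isascii():
        problems.append(f"non-ASCII characters in path: {path}")
    if path.name in seen:
        problems.append(f"duplicate file name: {path} and {seen[path.name]}")
    seen[path.name] = path
    if sf.info(str(path)).samplerate < 16000:
        problems.append(f"sampling rate below 16 kHz: {path}")

print("\n".join(problems) if problems else "No obvious problems found.")
```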
If your data is not clean, consider de-noising and/or de-reverbing it; otherwise, the model's output may also contain those unwanted parts.
There are many tools, both free and paid, that you can use to accomplish this. Some of them are:
Adobe Podcast (free)
iZotope RX (paid)
Adobe Audition (paid)
Audacity (free)
You can search online for tutorials on them.
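For light, steady background noise you can also de-noise from a script. The sketch below uses the third-party noisereduce and soundfile packages and a hypothetical file path (all assumptions, not tools this guide requires); heavy reverb or background music is still better handled by the dedicated tools above.

```python
# denoise_example.py - a minimal spectral-gating de-noise sketch.
# Assumes: pip install noisereduce soundfile
import noisereduce as nr
import soundfile as sf

IN_PATH = "raw/take_01.wav"      # hypothetical input file
OUT_PATH = "clean/take_01.wav"   # hypothetical output file

audio, sr = sf.read(IN_PATH)     # audio samples and sampling rate
if audio.ndim > 1:
    audio = audio.mean(axis=1)   # mix down to mono for simplicity

# Spectral gating estimates a noise profile from the signal itself and
# suppresses frequency bands that stay close to that profile.
reduced = nr.reduce_noise(y=audio, sr=sr)
sf.write(OUT_PATH, reduced, sr)
```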
Loudness normalization helps address the problem of the output audio's loudness being very unstable.
The programs mentioned above (except Adobe Podcast) can also perform loudness normalization.
iZotope RX: Modules -> Loudness Control
Adobe Audition: Window -> Match Loudness
Audacity: Effect -> Loudness Normalization
Try to find and use the batch-processing module of your program to process a large number of files. You can also look for command-line tools to batch-process this step if you are able to.
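If you are comfortable with scripting, here is a rough batch loudness-normalization sketch using the third-party pyloudnorm and soundfile packages, with hypothetical folder names and loudness target (none of these are required by this guide). It measures each file's integrated loudness and normalizes it to a common target.

```python
# loudnorm_batch.py - batch loudness normalization sketch (BS.1770 via pyloudnorm).
# Assumes: pip install pyloudnorm soundfile
from pathlib import Path
import soundfile as sf
import pyloudnorm as pyln

IN_DIR = Path("clean")        # hypothetical folder of cleaned clips
OUT_DIR = Path("normalized")  # hypothetical output folder
TARGET_LUFS = -18.0           # example target; adjust to taste

OUT_DIR.mkdir(exist_ok=True)
for path in IN_DIR.glob("*.wav"):
    audio, sr = sf.read(str(path))
    meter = pyln.Meter(sr)                       # BS.1770 loudness meter
    loudness = meter.integrated_loudness(audio)  # measured loudness in LUFS
    normalized = pyln.normalize.loudness(audio, loudness, TARGET_LUFS)
    sf.write(str(OUT_DIR / path.name), normalized, sr)
```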
Since it's recommended to keep audio files between 5 and 15 seconds, audio slicing is usually required.
There are many ways to do this, including using the tools above to slice audio files manually, but you can always use some batch-cutting tools to save some time.
It's recommended to cut audio only at silent gaps, but no significant negative effects have been found so far from not doing so.
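To illustrate the idea, here is a minimal sketch that slices one file at silent gaps using the third-party librosa and soundfile packages; the file names and the 40 dB silence threshold are arbitrary assumptions, and a dedicated tool (see below) will do a better job of merging segments to the target length.

```python
# slice_example.py - slice one file at silent gaps (illustrative sketch only).
# Assumes: pip install librosa soundfile
from pathlib import Path
import librosa
import soundfile as sf

IN_PATH = "normalized/take_01.wav"   # hypothetical input file
OUT_DIR = Path("sliced")
OUT_DIR.mkdir(exist_ok=True)

audio, sr = librosa.load(IN_PATH, sr=None)   # sr=None keeps the original sampling rate

# Treat anything more than 40 dB below the peak as silence and keep the rest.
intervals = librosa.effects.split(audio, top_db=40)

for i, (start, end) in enumerate(intervals):
    segment = audio[start:end]
    if len(segment) / sr < 1.0:              # skip very short fragments
        continue
    sf.write(str(OUT_DIR / f"take_01_{i:03d}.wav"), segment, sr)
```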
audio-slicer is a Python script by the openvpi team specifically for this type of task. You can download the GUI version from here.
After slicing, double-check if there are any files that exceed 15 seconds. If there are, discard them or slice them manually.
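A short script can do that double-check for you. The sketch below assumes the soundfile package and that your sliced clips are in a folder named sliced (both assumptions):

```python
# find_long_clips.py - list sliced clips longer than 15 seconds.
# Assumes: pip install soundfile
from pathlib import Path
import soundfile as sf

for path in Path("sliced").rglob("*.wav"):
    duration = sf.info(str(path)).duration   # length in seconds
    if duration > 15.0:
        print(f"{path} is {duration:.1f} s long - re-slice or discard it")
```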
Now you can proceed to the next stage: preprocessing.