# Dataset Preparation

{% hint style="danger" %} <mark style="color:red;">**It's your responsibility to obtain your data legally and ethically.**</mark>
{% endhint %}

A dataset is a collection of raw data that you want the algorithm to learn from. In our case, it's a collection of audio files of your target speaker.

This section covers the requirements for the dataset and some common procedures for preparing it.

![](/files/JmcVBpFjVvW3AxrXyqPt)

Preparing a dataset is often the most labor-intensive step. Be patient, and keep in mind that a higher-quality dataset tends to produce a higher-quality model.

## Notes

* The audio files should be the target speaker's dry vocals:
  * Without background music
  * Without other voices
  * Preferably without excessive reverb
* You can use either speaking or singing data in any language.
  * > In general, singing data would give you more range.
* The audio files should be in `.wav` or `.ogg` format.
* The audio files should be sliced into segments of **5 to 15** seconds.
* A sampling rate above 24 kHz is preferable, and it should not be lower than 16 kHz.
  * > The program will automatically resample files during preprocessing and handle the number of channels.
* Do <mark style="color:red;">**NOT**</mark> leave spaces in folder or file names.
* Files should **NOT** have the same name even if they are in different folders.
* There should be <mark style="color:red;">**at least**</mark> <mark style="color:red;">**6**</mark> <mark style="color:red;">**files**</mark> in the dataset.
* The folder structure of the dataset does not matter.

{% hint style="info" %}
Keeping each file between 5 and 15 seconds long is a best practice. Longer files are allowed, but excessively long files may lead to [CUDA OOM](#user-content-fn-1)[^1] issues later.
{% endhint %}

{% hint style="warning" %}
Non-English characters in folder or file names may also cause problems, so avoid them.
{% endhint %}
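
The naming and count rules above can be checked automatically before preprocessing. The following is a minimal sketch using only the Python standard library; the function name `check_dataset` is hypothetical and not part of Diff-SVC. It assumes that "same name" means the full file name including the extension, and it approximates "non-English characters" as non-ASCII characters. It only inspects names below the dataset root, not the path leading to it.

```python
from collections import Counter
from pathlib import Path

ALLOWED_EXTENSIONS = {".wav", ".ogg"}
MIN_FILES = 6  # minimum number of files stated in the notes above


def check_dataset(root):
    """Return a list of human-readable problems found under `root`."""
    root = Path(root)
    problems = []

    # The folder structure does not matter, so search recursively.
    audio_files = [
        p for p in sorted(root.rglob("*"))
        if p.is_file() and p.suffix.lower() in ALLOWED_EXTENSIONS
    ]

    if len(audio_files) < MIN_FILES:
        problems.append(
            f"only {len(audio_files)} audio files, need at least {MIN_FILES}"
        )

    # File names must be unique across the whole dataset, not just per folder.
    name_counts = Counter(p.name for p in audio_files)
    for name, count in sorted(name_counts.items()):
        if count > 1:
            problems.append(f"duplicate file name: {name} ({count} copies)")

    for path in audio_files:
        relative = path.relative_to(root)
        if any(" " in part for part in relative.parts):
            problems.append(f"space in folder or file name: {relative}")
        # Assumption: "non-English characters" is approximated as non-ASCII.
        if not str(relative).isascii():
            problems.append(f"non-ASCII character in folder or file name: {relative}")

    return problems
```

Calling `check_dataset("path/to/dataset")` returns an empty list when none of these checks fail. It does not verify audio content, sampling rate, or duration.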

## General Process

### De-noise & De-reverb (optional)

If your data is not clean, consider de-noising and/or de-reverbing it. Otherwise, the [output audio](#user-content-fn-2)[^2] may contain the same unwanted noise and reverb.

There are many tools, both free and paid, that you can use to accomplish this. Some of them are:

> * [Adobe Podcast](https://podcast.adobe.com/enhance) (free)
> * [iZotope RX](https://www.izotope.com/en/products/rx.html) (paid)
> * [Adobe Audition](https://www.adobe.com/products/audition.html) (paid)
> * [Audacity](https://www.audacityteam.org/) (free)

You can search online for tutorials on them.

### Loudness Normalization (recommended)

Loudness normalization helps prevent the loudness of the output audio from fluctuating heavily.

The programs mentioned above (except Adobe Podcast) can also perform Loudness Normalization.

> * [iZotope RX](https://www.izotope.com/en/products/rx.html): Modules -> [Loudness Control](https://docs.izotope.com/rx10/en/loudness/index.html)
> * [Adobe Audition](https://www.adobe.com/products/audition.html): Window -> [Match Loudness](https://helpx.adobe.com/audition/using/match-loudness.html)
> * [Audacity](https://www.audacityteam.org/): Effect -> [Loudness Normalization](https://manual.audacityteam.org/man/loudness_normalization.html)

{% hint style="info" %}
Try to find and use the batch process module of your program to process a large number of files. You can also look for command line tools to batch-process this step if you are able to.
{% endhint %}
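
As one possible command-line route, the sketch below builds `ffmpeg` commands that apply its `loudnorm` filter to every `.wav` file in a folder. This guide does not prescribe ffmpeg; it is an assumption here, as are the function names and the target of -16 LUFS. Note that `loudnorm` may change the output sampling rate, so you may want to add an `-ar` option to pin it.

```python
import subprocess
from pathlib import Path


def build_loudnorm_command(src, dst, target_lufs=-16.0):
    """Build an ffmpeg command that loudness-normalizes `src` into `dst`."""
    return [
        "ffmpeg",
        "-y",                                # overwrite dst if it already exists
        "-i", str(src),
        "-af", f"loudnorm=I={target_lufs}",  # integrated loudness target in LUFS
        str(dst),
    ]


def plan_batch(in_dir, out_dir, target_lufs=-16.0):
    """Return one command per .wav file found directly inside `in_dir`."""
    in_dir, out_dir = Path(in_dir), Path(out_dir)
    return [
        build_loudnorm_command(src, out_dir / src.name, target_lufs)
        for src in sorted(in_dir.glob("*.wav"))
    ]


def run_batch(in_dir, out_dir, target_lufs=-16.0):
    """Run the planned commands; requires ffmpeg to be installed and on PATH."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for command in plan_batch(in_dir, out_dir, target_lufs):
        subprocess.run(command, check=True)
```

Separating `plan_batch` from `run_batch` lets you print and inspect the commands before anything is written to disk.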

### Audio Slicing (usually required)

Since audio files of 5 to 15 seconds are recommended, audio slicing is usually required.

There are many ways to do this, including using the tools above to slice audio files manually, but you can always use some batch-cutting tools to save some time.

{% hint style="info" %}
It's recommended to cut audio only at silent gaps, but no significant negative effects of cutting elsewhere have been found so far.
{% endhint %}
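
To illustrate what cutting at silent gaps means, here is a minimal sketch that finds candidate cut points in mono audio given as a list of floats between -1.0 and 1.0. It is not the algorithm used by any particular tool. The function name `find_cut_points` and all default values (frame size, threshold, minimum silence length) are assumptions chosen for illustration.

```python
import math


def find_cut_points(samples, sample_rate, frame_ms=20, threshold=0.01,
                    min_silence_ms=300):
    """Return sample indices in the middle of each long enough silent gap.

    A frame counts as silent when its RMS level is below `threshold`.
    Silence at the very start or end of the audio is ignored, because a
    cut there would only split off an empty piece.
    """
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    min_frames = max(1, math.ceil(min_silence_ms / frame_ms))
    n_frames = len(samples) // frame_len

    cut_points = []
    run_start = None  # index of the first frame of the current silent run

    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / len(frame))
        silent = rms < threshold

        if silent and run_start is None:
            run_start = i
        elif not silent and run_start is not None:
            # The silent run covers frames run_start .. i-1.
            if run_start > 0 and i - run_start >= min_frames:
                middle_frame = (run_start + i) // 2
                cut_points.append(middle_frame * frame_len)
            run_start = None

    return cut_points
```

Placing the cut in the middle of a gap leaves a little silence on both sides, so neither neighboring segment starts or ends abruptly.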

[audio-slicer](https://github.com/openvpi/audio-slicer) is a Python script by the openvpi team specifically for this type of task. You can download the GUI version from [here](https://github.com/flutydeer/audio-slicer/releases).

After slicing, double-check if there are any files that exceed 15 seconds. If there are, discard them or slice them manually.
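
This check can be scripted. The sketch below uses Python's standard `wave` module to list `.wav` files longer than 15 seconds; the function names are hypothetical. It assumes uncompressed PCM `.wav` files, since the `wave` module cannot read `.ogg` files or some other WAV encodings.

```python
import wave
from pathlib import Path


def wav_duration_seconds(path):
    """Return the duration of a PCM .wav file in seconds."""
    with wave.open(str(path), "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def find_long_files(root, max_seconds=15.0):
    """Return (path, duration) pairs for .wav files longer than `max_seconds`."""
    flagged = []
    for path in sorted(Path(root).rglob("*.wav")):
        duration = wav_duration_seconds(path)
        if duration > max_seconds:
            flagged.append((path, duration))
    return flagged
```

For example, looping over `find_long_files("path/to/dataset")` and printing each pair gives a list of files to discard or slice manually.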

Now you can proceed to the next stage: preprocessing.

[^1]: CUDA Out Of Memory

[^2]: the audio generated by your model