How to Split Image Data into Train/Val/Test Sets with Python’s split-folders

In machine learning tasks involving image data, it’s crucial to split your dataset into separate training, validation, and test sets. The model is trained on the training set, tuned and monitored on the validation set, and finally evaluated on a test set it has never seen. Manually splitting large image datasets is tedious and error-prone, especially when dealing with thousands or millions of images.

Fortunately, the Python library split-folders can automatically split image folders into training, validation, and test sets with customizable ratios. In this article, we’ll walk through how to use it with a practical example.

Step 1: Setting up a Virtual Environment

If you’re using a virtual environment, make sure to activate it before proceeding. To activate your virtual environment:

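On Unix-based systems (e.g., macOS, Linux), assuming the environment lives in a directory named env:

source env/bin/activate

On Windows:

env\Scripts\activate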

You should see (env) preceding your command prompt, indicating that the virtual environment is active.
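
If the environment does not exist yet, it has to be created once before it can be activated. Here is a minimal sketch using Python’s standard venv module, assuming the environment directory is named env to match the (env) prompt. The create_env helper is a hypothetical name used only for illustration; the same result is usually achieved from the shell with python -m venv env.

```python
import venv
from pathlib import Path


def create_env(env_dir="env", with_pip=True):
    """Create a virtual environment in env_dir and return its path."""
    # with_pip=True bootstraps pip inside the new environment so that
    # packages can be installed into it afterwards.
    venv.create(env_dir, with_pip=with_pip)
    return Path(env_dir)


# Example: create_env() creates ./env in the current working directory.
```

Creating the environment is a one-time step; activating it is needed in every new terminal session.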

Step 2: Installing split-folders

With your virtual environment activated, install the split-folders library using pip, Python’s package installer:

pip install split-folders

Alternatively, if you’re using Anaconda or Miniconda, you can install the library via conda:

conda install -c conda-forge split-folders
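
To confirm that the installation landed in the active environment, you can query the installed package metadata. This is a small sketch using only the standard library; installed_version is a hypothetical helper name, and split-folders is the distribution name used in the pip command.

```python
from importlib import metadata


def installed_version(package="split-folders"):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
print(f"split-folders {version}" if version else "split-folders is not installed")
```

If the script reports that the package is not installed, the most likely cause is that pip ran outside the activated environment.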

Step 3: Understanding the split-folders Command

The split-folders command is structured as follows:

split-folders --output OUTPUT_DIR --ratio TRAIN_RATIO VAL_RATIO TEST_RATIO -- INPUT_DIR

Let’s break down the different components of this command:

  • --output OUTPUT_DIR: This argument specifies the directory where the split datasets will be stored. In our example, we use --output dataset.
  • --ratio TRAIN_RATIO VAL_RATIO TEST_RATIO: This argument specifies the ratios for splitting the data into training, validation, and test sets, respectively. In our example, we use --ratio .7 .1 .2, which means 70% of the data will be used for training, 10% for validation, and 20% for testing.
  • --: This double hyphen marks the end of the options, so that the input directory path is not read as another value for --ratio.
  • INPUT_DIR: This is the path to the directory containing the images you want to split. The library expects one subfolder per class inside this directory (for example, INPUT_DIR/class1/img1.jpg).

In our example, we used the following command:

split-folders --output dataset --ratio .7 .1 .2 -- PlantVillage

This command splits the images in the PlantVillage directory into three sets: 70% for training, 10% for validation, and 20% for testing. The split datasets are stored in the dataset directory.
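
The same split can also be run from a Python script. This is a minimal sketch, assuming the package is installed and exposes the splitfolders.ratio function described in its README. The wrapper name run_split is hypothetical, and the explicit seed is an illustrative choice that the command-line example does not set.

```python
def run_split(input_dir="PlantVillage", output_dir="dataset", ratios=(0.7, 0.1, 0.2)):
    """Split input_dir into train/val/test folders under output_dir."""
    # Imported inside the function so a missing package raises ImportError
    # only when the split is actually requested.
    import splitfolders

    # A fixed seed makes the shuffle, and therefore the split, reproducible.
    splitfolders.ratio(input_dir, output=output_dir, seed=1337, ratio=ratios)


# Example: run_split() mirrors the command-line call shown above.
```

Calling the function from a script is handy when the split is one step in a larger data preparation pipeline.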

Step 4: Executing the Command

When you execute the split-folders command, you should see an output similar to the following:

Copying files: 2152 files [00:00, 2177.87 files/s]

This progress line shows the library copying the files from the input directory (PlantVillage) into the specified sets. Because the files are copied by default, the originals in PlantVillage are left in place.
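
To make the splitting step concrete, here is a rough sketch of a ratio-based split: shuffle the file list with a fixed seed, then cut it at the cumulative ratio boundaries. It illustrates the idea only and is not the library’s actual implementation; the split_by_ratio function and the file names are hypothetical.

```python
import random


def split_by_ratio(items, ratios=(0.7, 0.1, 0.2), seed=1337):
    """Shuffle items reproducibly and cut them into len(ratios) parts."""
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")

    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)

    parts, start, cumulative = [], 0, 0.0
    for ratio in ratios:
        cumulative += ratio
        # round() guards against float error such as 0.7 + 0.1 = 0.7999...
        end = round(cumulative * len(shuffled))
        parts.append(shuffled[start:end])
        start = end
    return parts


files = [f"img_{i:03d}.jpg" for i in range(100)]  # made-up file names
train, val, test = split_by_ratio(files)
print(len(train), len(val), len(test))  # 70 10 20
```

Using a seeded shuffle means that rerunning the split on the same file list yields the same three sets.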

Step 5: Exploring the Output

After the command has finished executing, you can explore the output directory (dataset in our example) and its subdirectories:

dataset
├── train
├── val
└── test

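Each of these directories mirrors the class subfolders of the input directory. The train directory contains approximately 70% of the original images (in our example, around 1,506 files), the val directory contains approximately 10% (around 215 files), and the test directory contains approximately 20% (around 430 files). The counts are approximate because the split is applied to each class folder separately.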
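
A quick way to check these proportions is to count the files in each split. Here is a minimal sketch using only the standard library, assuming the output directory is called dataset as in the example. The helper names count_split_files and report are hypothetical. Files are counted recursively, since each split normally contains one subfolder per class.

```python
from pathlib import Path


def count_split_files(output_dir="dataset", splits=("train", "val", "test")):
    """Return {split: file_count}, counting files in nested class folders too."""
    counts = {}
    for split in splits:
        split_dir = Path(output_dir) / split
        if split_dir.is_dir():
            # rglob("*") walks every subfolder below the split directory.
            counts[split] = sum(1 for p in split_dir.rglob("*") if p.is_file())
        else:
            counts[split] = 0  # a missing split directory counts as empty
    return counts


def report(counts):
    """Print each split's file count and its share of the total."""
    total = sum(counts.values()) or 1  # avoid dividing by zero
    for split, n in counts.items():
        print(f"{split}: {n} files ({n / total:.1%})")


# Example: report(count_split_files("dataset"))
```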

Customizing the Command

You can customize the split ratios to suit your needs. For example, for a 60-20-20 split across the train, validation, and test sets, respectively, use the following command:

split-folders --output my_dataset --ratio .6 .2 .2 -- my_image_folder

This command will split the images in the my_image_folder directory into three sets: 60% for training (my_dataset/train), 20% for validation (my_dataset/val), and 20% for testing (my_dataset/test).
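
When the ratios come from a script or a configuration file, it helps to validate them before invoking the command. The sketch below builds the same command line as an argument list. build_split_command is a hypothetical helper that only handles the three-way split used in this article, and the requirement that the ratios sum to 1 is an assumption about what makes a sensible split rather than documented library behavior.

```python
import math


def build_split_command(input_dir, output_dir, ratios):
    """Return the split-folders command as an argument list for subprocess."""
    if len(ratios) != 3:
        raise ValueError("expected three ratios: train, val, test")
    if any(r < 0 for r in ratios) or not math.isclose(sum(ratios), 1.0):
        raise ValueError(f"ratios must be non-negative and sum to 1, got {ratios}")
    return [
        "split-folders",
        "--output", output_dir,
        "--ratio", *(str(r) for r in ratios),
        "--", input_dir,
    ]


cmd = build_split_command("my_image_folder", "my_dataset", (0.6, 0.2, 0.2))
print(" ".join(cmd))
# To execute it: import subprocess; subprocess.run(cmd, check=True)
```

Passing the command as a list rather than a single string avoids quoting problems when directory names contain spaces.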

The --output argument sets the name of the output directory, which is my_dataset in the example above. If you omit it, the results are written to a directory named output.

Conclusion

The split-folders library is a convenient tool for splitting image datasets into training, validation, and test sets with customizable ratios. By following the steps outlined in this article, you can easily split your image folders and prepare your data for machine learning tasks.

Explore the split-folders library further and consider using it in your own machine learning projects involving image data.
