"
This article is part of in the series

computer codes

Model evaluation and validation are two essential aspects of supervised machine learning. Both help ensure that your estimate of a model's predictive performance remains unbiased.

The train_test_split() method in the scikit-learn library allows you to split a dataset into subsets, thereby reducing the odds of bias during evaluation and validation.

train_test_split(): Prerequisites 

We are using scikit-learn version 0.23.1 below. Also known as sklearn, the library includes several packages for machine learning and data science.

In this guide, we will discuss how the model_selection package works, specifically the train_test_split() method in the package. 

To install sklearn, you can run:

$ python -m pip install -U "scikit-learn==0.23.1"

If you've used the Anaconda distribution of Python before, there's a chance you already have sklearn installed. We recommend using a fresh environment and ensuring you have the specified version of sklearn. 
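For instance, you could create and activate a fresh virtual environment before installing; the environment name sklearn-env below is just a placeholder:

$ python -m venv sklearn-env
$ source sklearn-env/bin/activate  # on Windows: sklearn-env\Scripts\activate
$ python -m pip install -U "scikit-learn==0.23.1"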

You can also use Miniconda and install the module from Anaconda Cloud, like so:

$ conda install -c anaconda scikit-learn=0.23

train_test_split(): How It Works

Begin by importing NumPy and the train_test_split() method from the module:

import numpy as np
from sklearn.model_selection import train_test_split

You're now ready to split datasets into test and training sets. You can split inputs and outputs simultaneously with a single function call.

To use the method, you supply the sequences you want to split along with any optional arguments. The method returns a list of sequences, such as NumPy arrays or SciPy sparse matrices, holding the train and test splits.

sklearn.model_selection.train_test_split(*arrays, **options) -> list

In the code above, "arrays" represents a sequence of lists, pandas DataFrames, NumPy arrays, or other array-like objects that can hold the data you wish to split. 

The dataset comprises such objects, and all of them have to be the same length. Since we are discussing supervised machine learning applications, you can expect to work with both of the following sequences:

  1. A two-dimensional array with the inputs 
  2. A one-dimensional array with the outputs 
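Note that you can pass more than two sequences in a single call; the returned list then contains a train and a test split for each input, in order. Here's a minimal sketch with two illustrative arrays:

>>> import numpy as np
>>> from sklearn.model_selection import train_test_split
>>> a = np.arange(6)        # six sample inputs
>>> b = np.arange(6) * 10   # six matching outputs
>>> parts = train_test_split(a, b, test_size=2, random_state=0)
>>> len(parts)  # a_train, a_test, b_train, b_test
4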

The "options" in the code above represent the optional keyword arguments available. These include:

Option Description
random_state Either an int or a RandomState instance, this object controls randomization during splitting. Its default value is None.
shuffle True by default, the shuffle option indicates whether the dataset is shuffled before the split is applied.
stratify Having a default value of None by default, stratify is an array-like object that defines a stratified split.
train_size A numeric value that specifies the size of the training set. You can supply a float value between 0.0 and 1.0 to define the share of the dataset you want to test. On the other hand, if you supply an int value, the value will represent the total number of training samples. The default train_size value is None.
test_size A numeric value that defines the test set's size. It is similar to train_size, and you must supply this option or the train_size option to use the method. If you supply neither, the method will assign 25 percent of the dataset for testing purposes.
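As a quick illustration of these options together, here's a hypothetical call on placeholder arrays X and y that sets all of them at once:

>>> import numpy as np
>>> from sklearn.model_selection import train_test_split
>>> X = np.arange(20).reshape(10, 2)  # ten two-column input samples
>>> y = np.array([0, 1] * 5)          # ten binary outputs
>>> X_train, X_test, y_train, y_test = train_test_split(
...     X, y, test_size=0.2, random_state=42, shuffle=True, stratify=y
... )
>>> X_test.shape  # 20 percent of ten samples
(2, 2)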


Let's see the method in action. Create a dataset with basic inputs in a two-dimensional array. The outputs must be recorded in a one-dimensional array.

>>> twoDimensionalArray = np.arange(1, 25).reshape(12, 2)
>>> oneDimensionalArray = np.array([0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0])
>>> twoDimensionalArray
array([[ 1,  2],
       [ 3,  4],
       [ 5,  6],
       [ 7,  8],
       [ 9, 10],
       [11, 12],
       [13, 14],
       [15, 16],
       [17, 18],
       [19, 20],
       [21, 22],
       [23, 24]])
>>> oneDimensionalArray
array([0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0])

The arange() function returns an array of evenly spaced values from a range. Above, we alter the returned array's shape using .reshape(), which turns the flat range into a two-dimensional array.

Now, let's split the input and output datasets:

>>> twoDimensionalArray_train, twoDimensionalArray_test, oneDimensionalArray_train, oneDimensionalArray_test = train_test_split(twoDimensionalArray, oneDimensionalArray)
>>> twoDimensionalArray_train
array([[15, 16],
       [21, 22],
       [11, 12],
       [17, 18],
       [13, 14],
       [ 9, 10],
       [ 1,  2],
       [ 3,  4],
       [19, 20]])
>>> twoDimensionalArray_test
array([[ 5,  6],
       [ 7,  8],
       [23, 24]])
>>> oneDimensionalArray_train
array([1, 1, 0, 1, 0, 1, 0, 1, 0])
>>> oneDimensionalArray_test
array([1, 0, 0])

As you can see, it takes only one function call to split the datasets. Above, we're working with only two sequences: twoDimensionalArray and oneDimensionalArray. 

The train_test_split() function splits the data and returns the following, in order:

  • twoDimensionalArray_train: The training part of the first sequence 
  • twoDimensionalArray_test: The test part of the first sequence 
  • oneDimensionalArray_train: The training part of the second sequence 
  • oneDimensionalArray_test: The test part of the second sequence 

Your output will likely differ from what you see here because splitting is random by default, so you'll get a different result each time you run the function.

Some applications demand the generation of reproducible tests. To do this, you must use a random split that supplies the same output for every function call. You can use the random_state parameter to make your tests reproducible. 

Interestingly, the value of random_state doesn't matter as long as it's a non-negative integer. You can take the more complex approach of using numpy.random.RandomState, though it's not necessary.
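As a quick sketch, both forms below yield the identical split; with a RandomState instance, reproducibility holds as long as you pass a freshly seeded instance on each call:

>>> splits_a = train_test_split(twoDimensionalArray, oneDimensionalArray, random_state=4)
>>> splits_b = train_test_split(
...     twoDimensionalArray, oneDimensionalArray,
...     random_state=np.random.RandomState(4)
... )
>>> all(np.array_equal(p, q) for p, q in zip(splits_a, splits_b))
True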

It's also good practice to define the test size explicitly, and experimenting with different test_size or train_size values can be beneficial. 

Let's refine the code we wrote above:

>>> twoDimensionalArray_train, twoDimensionalArray_test, oneDimensionalArray_train, oneDimensionalArray_test = train_test_split(
...     twoDimensionalArray, oneDimensionalArray, test_size=4, random_state=4
... )
>>> twoDimensionalArray_train
array([[17, 18],
       [ 5,  6],
       [23, 24],
       [ 1,  2],
       [ 3,  4],
       [11, 12],
       [15, 16],
       [21, 22]])
>>> twoDimensionalArray_test
array([[ 7,  8],
       [ 9, 10],
       [13, 14],
       [19, 20]])
>>> oneDimensionalArray_train
array([1, 1, 0, 0, 1, 0, 1, 1])
>>> oneDimensionalArray_test
array([0, 1, 0, 0])

Since we've changed the arguments, running this code gives a different result from before. Previously, the test set had three items and the training set had nine. 

Here, because test_size is set to four, the test set has four items and the training set has eight. 

You'd see the same result if you set test_size to 0.33, since 33 percent of twelve is roughly four.
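You can check this quickly; only the test set's length matters here, so we ignore the other returned sequences:

>>> _, test_inputs, _, _ = train_test_split(
...     twoDimensionalArray, oneDimensionalArray, test_size=0.33, random_state=4
... )
>>> len(test_inputs)
4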

There's one more major difference between the previous two examples: since the random_state argument is set to four, the result of the example above is always the same. 

The code shuffles the dataset samples and splits them into test and training sets depending on the defined size.

oneDimensionalArray has six zeroes and six ones, but three of the four values in the test set are zeroes. To retain the proportion of values in oneDimensionalArray across the test and training sets, you can use the stratify argument and set it to oneDimensionalArray.

This will split the dataset in a stratified manner, and oneDimensionalArray_test and oneDimensionalArray_train will have the same ratio of zeroes and ones as the original array.
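For example, reusing test_size=4 and random_state=4 from above, the stratified split keeps two zeroes and two ones in the test set:

>>> twoDimensionalArray_train, twoDimensionalArray_test, oneDimensionalArray_train, oneDimensionalArray_test = train_test_split(
...     twoDimensionalArray, oneDimensionalArray,
...     test_size=4, random_state=4, stratify=oneDimensionalArray
... )
>>> list(oneDimensionalArray_test).count(0), list(oneDimensionalArray_test).count(1)
(2, 2)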

Stratified splits can be excellent solutions in certain circumstances. For instance, stratified splits can help classify imbalanced datasets, which are datasets with major differences in the number of samples in the classes. 

To turn off random split and data shuffling, you can set the shuffle parameter to False, like so:

>>> twoDimensionalArray_train, twoDimensionalArray_test, oneDimensionalArray_train, oneDimensionalArray_test = train_test_split(twoDimensionalArray, oneDimensionalArray, test_size=0.33, shuffle=False)
>>> twoDimensionalArray_train
array([[ 1,  2],
       [ 3,  4],
       [ 5,  6],
       [ 7,  8],
       [ 9, 10],
       [11, 12],
       [13, 14],
       [15, 16]])
>>> twoDimensionalArray_test
array([[17, 18],
       [19, 20],
       [21, 22],
       [23, 24]])
>>> oneDimensionalArray_train
array([0, 1, 1, 0, 1, 0, 0, 1])
>>> oneDimensionalArray_test
array([1, 0, 1, 0])

This code creates a split that assigns the first two-thirds of the samples in the original arrays to the training set and the final third to the test set, removing randomness and shuffling from the process. 

train_test_split(): Supervised Machine Learning 

Let's solve a small regression problem with linear regression to see supervised machine learning in action with train_test_split(). 

Begin by importing the essential functions, classes, and packages:

>>> import numpy as np
>>> from sklearn.linear_model import LinearRegression
>>> from sklearn.model_selection import train_test_split

Now, create two arrays and then split them into the two sets like we did before:

>>> x = np.arange(20).reshape(-1, 1)
>>> y = np.array([5, 12, 11, 19, 30, 29, 23, 40, 51, 54, 74,
...               62, 68, 73, 89, 84, 89, 101, 99, 106])
>>> x_train, x_test, y_train, y_test = train_test_split(
...     x, y, test_size=8, random_state=0
... )

The dataset has twenty x-y pairs, and since we've defined test_size to be 8, the test set is assigned eight pairs, and the training set is assigned twelve pairs.
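You can confirm the sizes of the resulting sets:

>>> x_train.shape, x_test.shape
((12, 1), (8, 1))
>>> y_train.shape, y_test.shape
((12,), (8,))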

You can now proceed to fit the model:

>>> model = LinearRegression().fit(x_train, y_train)
>>> model.intercept_
3.1617195496417523
>>> model.coef_
array([5.53121801])

Finally, use .score() to get the coefficient of determination and check the goodness of the fit:

>>> model.score(x_train, y_train)
0.9868175024574795
>>> model.score(x_test, y_test)
0.9465896927715023
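From here, a natural next step is to apply the fitted model to the unseen test inputs with .predict(), the standard scikit-learn method for generating predictions:

>>> predictions = model.predict(x_test)
>>> predictions.shape  # one prediction per test sample
(8,)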