The Importance of Data Splitting
By Jared Wilber & Brent Werness
In most supervised machine learning tasks, best practice is to split your data into three independent sets: a training set, a validation set, and a test set.
To learn why, let's pretend that we have a dataset of two types of pets: cats and dogs.
Each pet in our dataset has two features: weight and fluffiness.
Our goal is to identify and evaluate suitable models for classifying a given pet as either a cat or a dog. We'll use train/validation/test splits to do this!
Train, Test, and Validation Splits
The first step in our classification task is to randomly split our pets into three independent sets:
Training Set: The dataset that we feed our model to learn potential underlying patterns and relationships.
Validation Set: The dataset that we use to understand our model's performance across different model types and hyperparameter choices.
Test Set: The dataset that we use to approximate our model's unbiased accuracy in the wild.
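The random three-way split described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code behind the article: the function name `train_val_test_split`, the 60/20/20 proportions, and the toy `pets` rows are all assumptions made for the example.

```python
import random

def train_val_test_split(rows, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle rows and cut them into three disjoint lists."""
    if not 0 < val_frac + test_frac < 1:
        raise ValueError("val_frac + test_frac must be between 0 and 1")
    shuffled = list(rows)                  # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_test = round(len(shuffled) * test_frac)
    n_val = round(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]      # everything left over is for training
    return train, val, test

# Hypothetical pets: (weight, fluffiness, label)
pets = [(float(i), float(i % 5), "cat" if i % 2 else "dog") for i in range(20)]
train, val, test = train_val_test_split(pets)
print(len(train), len(val), len(test))
```

Because each pet lands in exactly one list, nothing the model sees during training can leak into the validation or test sets.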
The Training Set
The training set is the dataset that we use to train our model. From it, the model learns any underlying patterns or relationships that will enable it to make predictions later on.
The training set should be as representative as possible of the population that we are trying to model. Additionally, we need to be careful and ensure that it is as unbiased as possible, as any bias at this stage may be propagated downstream during inference.
Building Our Model
Our goal (to determine whether a given pet is a cat or a dog) is a binary classification task, so we will use a simple but effective model appropriate for this task: logistic regression.
Logistic regression will learn a decision boundary to best separate the cats from dogs in our training data, using the selected feature (None, Weight, Fluffiness, or both Weight and Fluffiness).
Select the feature to visualize the corresponding logistic regression model's decision boundary. Drag each animal in the training set to a new position to see how the boundary updates!
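To make the decision boundary concrete, here is a rough sketch of logistic regression fit by batch gradient descent on the weight feature alone. The training values, the scaling to [0, 1], the learning rate, and the helper names (`fit_logistic`, `select`) are assumptions for illustration, not the data or code used in the article.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(X, y, lr=1.0, epochs=2000):
    """Batch gradient descent on the mean log loss; returns (weights, bias)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            grad_b += err
            for j in range(d):
                grad_w[j] += err * xi[j]
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict(w, b, xi):
    # Class 1 (dog) when the predicted probability is at least 0.5.
    return 1 if sum(wj * xj for wj, xj in zip(w, xi)) + b >= 0 else 0

def select(X, columns):
    # Keep only the chosen feature columns; () gives a bias-only model.
    return [tuple(row[c] for c in columns) for row in X]

# Hypothetical training pets: (weight, fluffiness) scaled to [0, 1]; 1 = dog, 0 = cat.
train_X = [(0.10, 0.90), (0.20, 0.70), (0.25, 0.80), (0.30, 0.60),
           (0.70, 0.30), (0.75, 0.40), (0.85, 0.20), (0.90, 0.35)]
train_y = [0, 0, 0, 0, 1, 1, 1, 1]

X_weight = select(train_X, (0,))   # the "Weight" model uses column 0 only
w, b = fit_logistic(X_weight, train_y)
boundary = -b / w[0]               # weight at which the predicted probability is 0.5
train_acc = sum(predict(w, b, x) == y for x, y in zip(X_weight, train_y)) / len(train_y)
print(round(boundary, 2), train_acc)
```

Passing `(1,)`, `(0, 1)`, or `()` to `select` would give the fluffiness-only, both-features, and no-feature variants of the same model.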
The Validation Set
We can build four different logistic regression models (one for each feature possibility), so how do we decide which one to select?
We could compare the accuracy of each model on the training set, but if we use the exact same dataset for both training and tuning, we will tend to favor a model that overfits the training data and won't generalize well.
This is where the validation set comes in — it acts as an independent, unbiased dataset for comparing the performance of different algorithms trained on our training set.
Select a feature to view the model's performance on the validation set in the table below. Drag the pets across the line to see how the model performance updates!
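Model selection on the validation set boils down to scoring each candidate on the same held-out pets and keeping the best one. In the sketch below, hand-written threshold rules stand in for the four trained logistic regression models so the example stays short. The rules, their thresholds, and the validation rows are all hypothetical.

```python
def accuracy(predict, rows):
    """Fraction of rows whose predicted label matches the true label."""
    return sum(predict(w, f) == label for w, f, label in rows) / len(rows)

# Hypothetical validation pets: (weight, fluffiness, label); 1 = dog, 0 = cat.
val_rows = [
    (0.15, 0.90, 0),
    (0.35, 0.40, 0),
    (0.55, 0.80, 0),
    (0.45, 0.30, 1),
    (0.80, 0.35, 1),
    (0.90, 0.60, 1),
]

# Stand-ins for the four models trained on the training set (assumed rules).
candidates = {
    "no features": lambda w, f: 1,                     # always predicts dog
    "weight":      lambda w, f: 1 if w > 0.5 else 0,
    "fluffiness":  lambda w, f: 1 if f < 0.5 else 0,
    "both":        lambda w, f: 1 if w - f > 0 else 0,
}

# Score every candidate on the same validation set, then keep the best.
scores = {name: accuracy(rule, val_rows) for name, rule in candidates.items()}
best_name = max(scores, key=scores.get)
print(scores)
print("selected:", best_name)
```

Note that the test set plays no part here: only training data built the candidates, and only validation data ranks them.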
The Testing Set
Once we have used the validation set to determine the algorithm and parameter choices that we would like to use in production, the test set is used to approximate the model's true performance in the wild. It is the final step in evaluating our model's performance on unseen data.
We should never, under any circumstance, look at the test set's performance before selecting a model.
Peeking at our test set performance ahead of time is a form of overfitting, and will likely lead to unreliable performance expectations in production. It should only be checked as the final form of evaluation, after the validation set has been used to identify the best model.
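A small simulation can show why peeking misleads. In this sketch, which rests entirely on assumed toy data, the labels are coin flips unrelated to the features, so every candidate model has a true accuracy of 50%. If we nonetheless pick the candidate with the highest test accuracy, its test score looks better than chance, while its accuracy on fresh pets falls back toward 50%.

```python
import random

rng = random.Random(0)

def make_pets(n):
    # Hypothetical pets whose labels are coin flips, unrelated to the features.
    return [((rng.uniform(-1, 1), rng.uniform(-1, 1)), rng.randint(0, 1))
            for _ in range(n)]

def make_model():
    # A random linear rule; since labels are noise, its true accuracy is 50%.
    a, b, c = rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)
    return lambda x: 1 if a * x[0] + b * x[1] + c > 0 else 0

def accuracy(model, rows):
    return sum(model(x) == y for x, y in rows) / len(rows)

test_set = make_pets(20)
candidates = [make_model() for _ in range(200)]

# "Peeking": choose the model by its test accuracy.
best = max(candidates, key=lambda m: accuracy(m, test_set))
peeked_acc = accuracy(best, test_set)

# Accuracy of the same model on plenty of fresh, unseen pets.
fresh_acc = accuracy(best, make_pets(5000))
print(peeked_acc, round(fresh_acc, 3))
```

The gap between the two printed numbers is the optimism introduced by selecting on the test set; it appears here even though no candidate has learned anything real.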
You may have noticed that the test accuracy of the "just fluffiness" model was higher than that of the "both features" model, even though the validation set selected the latter as the best. Validation performance will not always match test performance exactly, and that is not a bad thing. Remember that test performance is not a number to optimize over; it is a metric for assessing future performance. It lets us estimate, with reasonable confidence, that our model can distinguish between cats and dogs with 87.5% accuracy.
It is best practice in machine learning to split our data into the following three groups:
Training Set: For training of the model.
Validation Set: For unbiased evaluation of candidate models during model selection.
Test Set: For final evaluation of the model.
Adhering to this setup helps ensure that we have a realistic understanding of our model's performance, and that we (hopefully) built a model that generalizes well to unseen data.
Thanks for reading! To learn more about machine learning, check out our website, watch our videos, or read the D2L book.
Animal icons by Adrien Coquet (cat) & Maurício Brito (dog).