In part 1, we talked about scraping transactional data of apartments. In part 2, let’s talk about splitting the data into a training set and a testing set.
Why do we need to split?
The short answer is that we want to prevent overfitting (memorizing) and ensure the trained model can generalize to unseen cases. For example, if we had the final exam papers in advance, we wouldn’t need to learn at all; we could just memorize (overfit to) all the answers and score 100%.
The error the model makes on the test set approximates the true out-of-sample error. So we train the model on the training set, then evaluate it on the test set, which is the best approximation of out-of-sample data we have.
How to split?
There is no single ideal train/test ratio, but very often 80% of the data is used for training.
One important detail to consider is whether the data includes a time component. For example, in housing transaction data, the registration date is a time component. What we want is to train on past data and predict the future, so we need to split by time.
If you believe there is no time component in the data, or that the time component does not affect the distribution of the data, then you can simply split the data uniformly at random.
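For the random case, a minimal sketch with pandas looks like this (the `price` column and the 80/20 ratio are just illustrative assumptions; `random_state` makes the split reproducible):

```python
import pandas as pd

# Toy data standing in for a real dataset without a time component.
df = pd.DataFrame({"price": range(100)})

# Sample 80% of the rows uniformly at random for training;
# the rows not sampled become the test set.
df_train = df.sample(frac=0.8, random_state=42)
df_test = df.drop(df_train.index)

print(len(df_train), len(df_test))  # 80 20
```

`scikit-learn`’s `train_test_split` does the same thing with an explicit `test_size` argument, if you prefer that API.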
```python
import pandas as pd
import numpy as np

df = pd.read_csv('../inputs/all_data.csv')

# Parse the registration date and sort chronologically
df['Reg. Date'] = pd.to_datetime(df['Reg. Date'], format='%Y-%m-%d')
df = df.sort_values(by='Reg. Date', ascending=True)

# Train/test split is done by time: first 80% train, last 20% test
df_train, df_test = np.split(df, [int(0.8 * len(df))])

df_train.to_csv("../inputs/train.csv", index=False)
df_test.to_csv("../inputs/test.csv", index=False)
```
In the code, I sorted the data by date, then used the first 80% of the rows as the training set.
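A quick way to convince yourself the time-based split is correct is to check that no training date is later than any test date. Here is a sketch on synthetic dates (the `Reg. Date` column name matches the script above; the date range is made up):

```python
import pandas as pd
import numpy as np

# Synthetic registration dates standing in for the real 'Reg. Date' column.
df = pd.DataFrame({"Reg. Date": pd.date_range("2020-01-01", periods=100)})
df = df.sample(frac=1, random_state=0)  # shuffle to simulate unsorted raw data

# Same procedure as the script: sort by date, split first 80% / last 20%.
df = df.sort_values(by="Reg. Date", ascending=True)
df_train, df_test = np.split(df, [int(0.8 * len(df))])

# Every training date precedes every test date: no future leakage.
assert df_train["Reg. Date"].max() <= df_test["Reg. Date"].min()
```

If this assertion fails on your own data, the model is being trained on information from the future, which inflates test performance.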
What’s next? Let’s do some exploratory analysis on the training set. You can keep track of the whole project’s progress in this Git repository.