Solution to Kaggle Intro to Machine Learning: Model Validation
Recap
You’ve built a model. In this exercise, you will test how good your model is.
Run the cell below to set up your coding environment, picking up where the previous exercise left off.
# Code you have previously used to load data
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
# Path of the file to read
iowa_file_path = '../input/home-data-for-ml-course/train.csv'
home_data = pd.read_csv(iowa_file_path)
y = home_data.SalePrice
feature_columns = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
X = home_data[feature_columns]
# Specify Model
iowa_model = DecisionTreeRegressor()
# Fit Model
iowa_model.fit(X, y)
print("First in-sample predictions:", iowa_model.predict(X.head()))
print("Actual target values for those homes:", y.head().tolist())
# Set up code checking
from learntools.core import binder
binder.bind(globals())
from learntools.machine_learning.ex4 import *
print("Setup Complete")
First in-sample predictions: [208500. 181500. 223500. 140000. 250000.]
Actual target values for those homes: [208500, 181500, 223500, 140000, 250000]
Setup Complete
Exercises
Step 1: Split Your Data
Use the train_test_split function to split up your data. Give it the argument random_state=1 so the check functions know what to expect when verifying your code. Recall, your features are loaded in the DataFrame X and your target is loaded in y.
If you pass random_state=some_number, you guarantee that the output of Run 1 equals the output of Run 2; in other words, your split will always be the same. The actual value does not matter (42, 0, 21, ...); what matters is that every time you use the same number, you get the same split. This is useful for reproducible results, for example in documentation, so that everybody sees the same numbers when they run the examples. In practice, set random_state to a fixed number while you test things, then remove it in production if you really need a random (rather than fixed) split.
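As a quick illustration (a minimal sketch, not part of the graded exercise), the snippet below splits the data twice with the same random_state and confirms that the resulting training sets are identical:

# Minimal sketch: two splits with the same random_state produce identical rows
from sklearn.model_selection import train_test_split

first_train_X, _, first_train_y, _ = train_test_split(X, y, random_state=1)
second_train_X, _, second_train_y, _ = train_test_split(X, y, random_state=1)
print(first_train_X.equals(second_train_X))  # True: same rows, same order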
# Import the train_test_split function
from sklearn.model_selection import train_test_split
# Split the data into training and validation sets
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
# Check your answer
step_1.check()
# The lines below will show you a hint or the solution.
# step_1.hint()
# step_1.solution()
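A side note on proportions: when you do not pass test_size, train_test_split holds out 25% of the rows for validation by default. The sketch below simply makes that default explicit; passing a different fraction (e.g. test_size=0.2) would change the holdout size:

# Optional: make the default 25% validation fraction explicit
train_X, val_X, train_y, val_y = train_test_split(X, y, test_size=0.25, random_state=1)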
Step 2: Specify and Fit the Model
Create a DecisionTreeRegressor model and fit it to the relevant data. Set random_state to 1 again when creating the model.
# You imported DecisionTreeRegressor in your last exercise
# and that code has been copied to the setup code above. So, no need to
# import it again
# Specify the model
iowa_model = DecisionTreeRegressor(random_state=1)
# Fit iowa_model with the training data.
iowa_model.fit(train_X, train_y)
# Check your answer
step_2.check()
[186500. 184000. 130000.  92000. 164500. 220000. 335000. 144152. 215000. 262000.]
[186500. 184000. 130000.  92000. 164500. 220000. 335000. 144152. 215000. 262000.]
# step_2.hint()
# step_2.solution()
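Why does a decision tree need a random_state at all? Roughly speaking, when several candidate splits are equally good, scikit-learn breaks the tie using its random number generator, so fixing random_state makes the fitted tree deterministic and the checker's expected output stable. A minimal sketch of that determinism:

# Minimal sketch: identical random_state values yield identical fitted trees
import numpy as np

model_a = DecisionTreeRegressor(random_state=1).fit(train_X, train_y)
model_b = DecisionTreeRegressor(random_state=1).fit(train_X, train_y)
print(np.array_equal(model_a.predict(val_X), model_b.predict(val_X)))  # True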
Step 3: Make Predictions with Validation Data
# Predict with all validation observations
val_predictions = iowa_model.predict(val_X)
# Check your answer
step_3.check()
# step_3.hint()
# step_3.solution()
# print the top few validation predictions
print(val_predictions[:5])
# print the top few actual prices from validation data
print(val_y[:5])
[186500. 184000. 130000.  92000. 164500.]
258     231500
267     179500
288     122000
649      84500
1233    142000
Name: SalePrice, dtype: int64
Do you remember why validation predictions differ from in-sample (or training) predictions? This is an important idea from the last lesson.
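One concrete way to see the difference (a small sketch that runs slightly ahead of Step 4) is to score the model on the data it was trained on. A deep, unconstrained decision tree can nearly memorize its training set, so its in-sample error is misleadingly small compared with its validation error:

# Sketch: in-sample MAE is near zero because the tree has memorized the training data
from sklearn.metrics import mean_absolute_error

in_sample_mae = mean_absolute_error(train_y, iowa_model.predict(train_X))
print("In-sample MAE:", in_sample_mae)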
Step 4: Calculate the Mean Absolute Error in Validation Data
from sklearn.metrics import mean_absolute_error
val_mae = mean_absolute_error(val_y, val_predictions)
# Print val_mae to see the validation MAE
print(val_mae)
# Check your answer
step_4.check()
# step_4.hint()
# step_4.solution()
Is that MAE good? There is no general rule for what counts as a good value; it depends on the application. But you'll see how to use (and improve) this number in the next step.
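If you want an informal sense of scale (an illustrative sketch, not an official rule of thumb), you can compare the MAE to the typical sale price in the validation data:

# Sketch: express the validation MAE relative to the average sale price
print("Mean sale price:", val_y.mean())
print("MAE as a fraction of the mean price:", val_mae / val_y.mean())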