Solution to Kaggle Intro to Machine Learning: Machine Learning Competitions
Introduction
Machine learning competitions are a great way to improve your data science skills and measure your progress.
In this exercise, you will create and submit predictions for a Kaggle competition. You can then improve your model (e.g. by adding features) and see how you stack up against others taking this micro-course.
The steps in this notebook are:
- Build a Random Forest model with all of your data (X and y)
- Read in the “test” data, which doesn’t include values for the target. Predict home values in the test data with your Random Forest model.
- Submit those predictions to the competition and see your score.
- Optionally, come back to see if you can improve your model by adding features or changing your model. Then you can resubmit to see how that stacks up on the competition leaderboard.
Recap
Here’s the code you’ve written so far. Start by running it again.
In [1]:
# Code you have previously used to load data
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
# Set up code checking
import os
if not os.path.exists("../input/train.csv"):
    os.symlink("../input/home-data-for-ml-course/train.csv", "../input/train.csv")
    os.symlink("../input/home-data-for-ml-course/test.csv", "../input/test.csv")
from learntools.core import binder
binder.bind(globals())
from learntools.machine_learning.ex7 import *
# Path of the file to read. We changed the directory structure to simplify submitting to a competition
iowa_file_path = '../input/train.csv'
home_data = pd.read_csv(iowa_file_path)
# Create target object and call it y
y = home_data.SalePrice
# Create X
features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
X = home_data[features]
# Split into validation and training data
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
# Specify Model
iowa_model = DecisionTreeRegressor(random_state=1)
# Fit Model
iowa_model.fit(train_X, train_y)
# Make validation predictions and calculate mean absolute error
val_predictions = iowa_model.predict(val_X)
val_mae = mean_absolute_error(val_y, val_predictions)
print("Validation MAE when not specifying max_leaf_nodes: {:,.0f}".format(val_mae))
# Using best value for max_leaf_nodes
iowa_model = DecisionTreeRegressor(max_leaf_nodes=100, random_state=1)
iowa_model.fit(train_X, train_y)
val_predictions = iowa_model.predict(val_X)
val_mae = mean_absolute_error(val_y, val_predictions)
print("Validation MAE for best value of max_leaf_nodes: {:,.0f}".format(val_mae))
# Define the model. Set random_state to 1
rf_model = RandomForestRegressor(random_state=1)
rf_model.fit(train_X, train_y)
rf_val_predictions = rf_model.predict(val_X)
rf_val_mae = mean_absolute_error(val_y, rf_val_predictions)
print("Validation MAE for Random Forest Model: {:,.0f}".format(rf_val_mae))
Validation MAE when not specifying max_leaf_nodes: 29,653
Validation MAE for best value of max_leaf_nodes: 27,283
Validation MAE for Random Forest Model: 21,857
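The MAE figures above are in the same units as the target (sale price): each is the average absolute gap between a predicted and an actual price on the validation rows. As a minimal sketch of that calculation in plain Python, with made-up prices and a hypothetical helper named mean_absolute_error_by_hand (not part of scikit-learn):

```python
def mean_absolute_error_by_hand(y_true, y_pred):
    """Average of |actual - predicted| over all rows."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up sale prices and predictions, for illustration only
actual = [200_000, 150_000, 320_000]
predicted = [210_000, 140_000, 300_000]

# Absolute errors are 10,000, 10,000 and 20,000, so the mean is 40,000 / 3
mae = mean_absolute_error_by_hand(actual, predicted)
print("{:,.0f}".format(mae))  # prints 13,333
```

Because the absolute difference is symmetric, swapping the two arguments gives the same value.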
Creating a Model For the Competition
Build a Random Forest model and train it on all of X and y.
In [3]:
# To improve accuracy, create a new Random Forest model which you will train on all training data
rf_model_on_full_data = RandomForestRegressor()
# fit rf_model_on_full_data on all data from the training data
rf_model_on_full_data.fit(X, y)
Out[3]:
RandomForestRegressor()
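Unlike the recap code, this model is created without random_state, so its predictions (and your leaderboard score) can change slightly each time the cell is rerun. If you want repeatable submissions, one option is to fix the seed. The sketch below illustrates the effect on synthetic data; X_demo and y_demo are stand-ins invented for this example, not the Iowa housing data, and n_estimators=20 is chosen only to keep it fast.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for X and y (assumption: not the Iowa housing data)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 100, size=(200, 3))
y_demo = 3 * X_demo[:, 0] + 2 * X_demo[:, 1] + rng.normal(0, 5, size=200)

# Two forests created with the same random_state and fit on the same data
model_a = RandomForestRegressor(n_estimators=20, random_state=1).fit(X_demo, y_demo)
model_b = RandomForestRegressor(n_estimators=20, random_state=1).fit(X_demo, y_demo)

# Compare their predictions element by element
same_predictions = np.array_equal(model_a.predict(X_demo), model_b.predict(X_demo))
print(same_predictions)
```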
Make Predictions
Read the file of “test” data and apply your model to make predictions.
In [18]:
# path to file you will use for predictions
test_data_path = '../input/test.csv'
# read test data file using pandas
test_data = pd.read_csv(test_data_path)
# create test_X which comes from test_data but includes only the columns you used for prediction.
# The list of columns is stored in a variable called features
print(features)
test_X = test_data[features]
print(test_X)
# make predictions which we will submit.
test_preds = rf_model_on_full_data.predict(test_X)
# The lines below show how to save predictions in the format used for competition scoring
output = pd.DataFrame({'Id': test_data.Id,
                       'SalePrice': test_preds})
output.to_csv('submission.csv', index=False)
['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
      LotArea  YearBuilt  1stFlrSF  2ndFlrSF  FullBath  BedroomAbvGr  \
0       11622       1961       896         0         1             2
1       14267       1958      1329         0         1             3
2       13830       1997       928       701         2             3
3        9978       1998       926       678         2             3
4        5005       1992      1280         0         2             2
...       ...        ...       ...       ...       ...           ...
1454     1936       1970       546       546         1             3
1455     1894       1970       546       546         1             3
1456    20000       1960      1224         0         1             4
1457    10441       1992       970         0         1             3
1458     9627       1993       996      1004         2             3

      TotRmsAbvGrd
0                5
1                6
2                6
3                7
4                5
...            ...
1454             5
1455             6
1456             7
1457             6
1458             9

[1459 rows x 7 columns]
Before submitting, run a check to make sure your test_preds have the right format.
In [16]:
# Check your answer
step_1.check()
step_1.solution()
Check: When you’ve updated the starter code, check() will tell you whether your code is correct. You need to update the code that creates variable test_preds.
Solution:
# In previous code cell
rf_model_on_full_data = RandomForestRegressor()
rf_model_on_full_data.fit(X, y)
# Then in last code cell
test_data_path = '../input/home-data-for-ml-course/test.csv'
test_data = pd.read_csv(test_data_path)
test_X = test_data[features]
test_preds = rf_model_on_full_data.predict(test_X)
output = pd.DataFrame({'Id': test_data.Id,
                       'SalePrice': test_preds})
output.to_csv('submission.csv', index=False)
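Before uploading, it can help to sanity-check the frame that was written to submission.csv. The sketch below assumes the two-column layout used above (Id and SalePrice, one row per test row); check_submission is a hypothetical helper written for this example, not part of learntools or Kaggle, and the demo values are made up.

```python
import pandas as pd


def check_submission(submission, test_ids):
    """Return a list of problems found; an empty list means none were detected."""
    problems = []
    if list(submission.columns) != ["Id", "SalePrice"]:
        # Without the expected columns the remaining checks cannot run
        return ["columns should be exactly ['Id', 'SalePrice']"]
    if list(submission["Id"]) != list(test_ids):
        problems.append("Id values differ from the test data")
    if submission["SalePrice"].isna().any():
        problems.append("SalePrice contains missing values")
    return problems


# Made-up Ids and predictions, for illustration only
demo_ids = [1, 2, 3]
demo = pd.DataFrame({"Id": demo_ids, "SalePrice": [120000.0, 155000.0, 180000.0]})
print(check_submission(demo, demo_ids))  # prints []
```

In the notebook, the equivalent call would be check_submission(output, test_data.Id).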