Solution to Kaggle Intro to Machine Learning: Underfitting and Overfitting
Recap
You’ve built your first model, and now it’s time to optimize the size of the tree to make better predictions. Run this cell to set up your coding environment where the previous step left off.
# Code you have previously used to load data
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
# Path of the file to read
iowa_file_path = '../input/home-data-for-ml-course/train.csv'
home_data = pd.read_csv(iowa_file_path)
# Create target object and call it y
y = home_data.SalePrice
# Create X
features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
X = home_data[features]
# Split into validation and training data
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
# Specify Model
iowa_model = DecisionTreeRegressor(random_state=1)
# Fit Model
iowa_model.fit(train_X, train_y)
# Make validation predictions and calculate mean absolute error
val_predictions = iowa_model.predict(val_X)
val_mae = mean_absolute_error(val_y, val_predictions)
print("Validation MAE: {:,.0f}".format(val_mae))
# Set up code checking
from learntools.core import binder
binder.bind(globals())
from learntools.machine_learning.ex5 import *
print("\nSetup complete")
Validation MAE: 29,653
Setup complete
Exercises
You could write the function get_mae yourself. For now, we’ll supply it. This is the same function you read about in the previous lesson. Just run the cell below.
def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    # Fit a tree capped at max_leaf_nodes and return its validation MAE
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae
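As a quick optional sanity check, a single call looks like this (the variable name example_mae is just illustrative, and the exact number depends on your train/validation split):
# Example: validation MAE for a tree capped at 50 leaf nodes
example_mae = get_mae(50, train_X, val_X, train_y, val_y)
print("MAE with 50 leaf nodes: {:,.0f}".format(example_mae))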
Step 1: Compare Different Tree Sizes
Write a loop that tries each of the candidate values for max_leaf_nodes below. Call the get_mae function on each value and store the results in a way that lets you select the value of max_leaf_nodes that gives the most accurate model on your data.
candidate_max_leaf_nodes = [5, 25, 50, 100, 250, 500]
# Write loop to find the ideal tree size from candidate_max_leaf_nodes
list_mae = []
for num_nodes in candidate_max_leaf_nodes:
    result = get_mae(num_nodes, train_X, val_X, train_y, val_y)
    list_mae.append(result)

# Store the best value of max_leaf_nodes (it will be either 5, 25, 50, 100, 250 or 500)
best_tree_size = candidate_max_leaf_nodes[list_mae.index(min(list_mae))]
# Check your answer
step_1.check()
# The lines below will show you a hint or the solution.
# step_1.hint()
step_1.solution()
# Here is a short solution with a dict comprehension.
# The lesson gives an example of how to do this with an explicit loop.
scores = {leaf_size: get_mae(leaf_size, train_X, val_X, train_y, val_y) for leaf_size in candidate_max_leaf_nodes}
best_tree_size = min(scores, key=scores.get)
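To see what the comprehension produced, a quick optional inspection (not required by the checker) is:
# Print each candidate's MAE and the winning tree size
for leaf_size, mae in scores.items():
    print("max_leaf_nodes={}: MAE={:,.0f}".format(leaf_size, mae))
print("Best tree size:", best_tree_size)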
import matplotlib.pyplot as plt

# Plot validation MAE against tree size: markers for the data points, a line to show the trend
plt.plot(candidate_max_leaf_nodes, list_mae, 'o')
plt.plot(candidate_max_leaf_nodes, list_mae, '-')
plt.xlabel('Number of leaf nodes')
plt.ylabel('Mean Absolute Error')
plt.title('Tree size vs. validation error')
plt.grid(color='grey', linestyle='-', linewidth=0.25, alpha=0.5)
plt.show()
Results (max_leaf_nodes → validation MAE): 5 → 35,045; 25 → 29,016; 50 → 27,406; 100 → 27,283; 250 → 27,894; 500 → 29,454. A tree with 100 leaf nodes gives the lowest error.
Step 2: Fit Model Using All Data
You know the best tree size. If you were going to deploy this model in practice, you would make it even more accurate by using all of the data and keeping that tree size. That is, you don’t need to hold out the validation data now that you’ve made all your modeling decisions.
# Fill in the best tree size found above and fit the final model on all of the data
final_model = DecisionTreeRegressor(max_leaf_nodes=best_tree_size, random_state=0)
final_model.fit(X, y)
# Check your answer
step_2.check()
# step_2.hint()
# step_2.solution()
You’ve tuned this model and improved your results. But we are still using Decision Tree models, which are not very sophisticated by modern machine learning standards. In the next step you will learn to use Random Forests to improve your models even more.
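As a small preview of that next step (a sketch, not part of this exercise’s checked answer), scikit-learn’s RandomForestRegressor can be swapped in with the same split; your exact MAE will differ:
from sklearn.ensemble import RandomForestRegressor

# A Random Forest averages many trees, usually beating a single tuned tree out of the box
rf_model = RandomForestRegressor(random_state=1)
rf_model.fit(train_X, train_y)
rf_val_mae = mean_absolute_error(val_y, rf_model.predict(val_X))
print("Validation MAE for Random Forest: {:,.0f}".format(rf_val_mae))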