A standard machine learning workflow from the official Kaggle website.
Recap
You've built your first model, and now it's time to optimize the size of the tree to make better predictions. Run this cell to set up your coding environment where the previous step left off.
# Code you have previously used to load data
import pandas as pd
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
# Path of the file to read
iowa_file_path = '../input/home-data-for-ml-course/train.csv'
home_data = pd.read_csv(iowa_file_path)
# Create target object and call it y
y = home_data.SalePrice
# Create X
features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
X = home_data[features]
# Split into validation and training data
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
# Specify Model
iowa_model = DecisionTreeRegressor(random_state=1)
# Fit Model
iowa_model.fit(train_X, train_y)
# Make validation predictions and calculate mean absolute error
val_predictions = iowa_model.predict(val_X)
val_mae = mean_absolute_error(val_predictions, val_y)
print("Validation MAE: {:,.0f}".format(val_mae))
# Set up code checking
from learntools.core import binder
binder.bind(globals())
from learntools.machine_learning.ex5 import *
print("\nSetup complete")
val_mae
print("Validation MAE:{:.2f}".format(val_mae))
Validation MAE: 29,653
Setup complete
Validation MAE:29652.93
Exercises
You could write the function get_mae yourself. For now, we'll supply it. This is the same function you read about in the previous lesson. Just run the cell below.
def get_mae(max_leaf_nodes, train_X, val_X, train_y, val_y):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_X, train_y)
    preds_val = model.predict(val_X)
    mae = mean_absolute_error(val_y, preds_val)
    return mae
Step 1: Compare Different Tree Sizes
Write a loop that tries each of the candidate values for max_leaf_nodes listed below. Call the get_mae function on each value of max_leaf_nodes, and store the output in a way that lets you select the value of max_leaf_nodes that gives the most accurate model on your data.
Next, write a loop that finds, and stores, the max_leaf_nodes value that gives the smallest MAE.
candidate_max_leaf_nodes = [5, 25, 50, 100, 250, 500]
# Write loop to find the ideal tree size from candidate_max_leaf_nodes
for i in candidate_max_leaf_nodes:
    mae_ = get_mae(i, train_X, val_X, train_y, val_y)
    print("when max_leaf_nodes is %d the MAE is %d" % (i, mae_))
This isn't good enough: this time we want to both print and store the results. For pairs like (leaf size, MAE), a dictionary is the natural container, and a comprehension is usually a bit faster than an explicit loop body.
Note that to pull the key with the largest value out of a dictionary you would use max(d, key=d.get); since we want the leaf count with the smallest MAE, the code below uses min(scores, key=scores.get) instead.
max(iterable, key=...) (and likewise min) walks the iterable, passes each item to the key function, compares the items by the key function's return values, and returns the item with the largest (or smallest) key value.
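As a quick illustration, here is a minimal sketch with a made-up toy dictionary (the values are hypothetical, not from the exercise):

toy_scores = {5: 35044.5, 50: 27405.9, 500: 32662.5}  # hypothetical leaf_size -> MAE pairs
best = min(toy_scores, key=toy_scores.get)            # compares the keys by their MAE values
print(best)                                           # -> 50, the key with the smallest value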
candidate_max_leaf_nodes = [5, 25, 50, 100, 250, 500]
# Write loop to find the ideal tree size from candidate_max_leaf_nodes
scores = {leaf_size: get_mae(leaf_size, train_X, val_X, train_y, val_y)
          for leaf_size in candidate_max_leaf_nodes}
# Store the best value of max_leaf_nodes (it will be either 5, 25, 50, 100, 250 or 500)
best_tree_size = min(scores, key=scores.get)
# Check your answer
step_1.check()
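If you also want to see the scores, you could print them after the check. This is an optional sketch that assumes the scores dictionary and best_tree_size defined above:

# Optional: inspect the MAE for every candidate and the chosen tree size
for leaf_size, mae in scores.items():
    print("max_leaf_nodes = %d -> validation MAE = %.0f" % (leaf_size, mae))
print("best_tree_size =", best_tree_size)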
Step 2: Fit Model Using All Data
You know the best tree size. If you were going to deploy this model in practice, you would make it even more accurate by using all of the data and keeping that tree size. That is, you don't need to hold out the validation data now that you've made all your modeling decisions.
# Fill in argument to make optimal size and uncomment
final_model = DecisionTreeRegressor(max_leaf_nodes=best_tree_size, random_state=1)
# fit the final model and uncomment the next two lines
final_model.fit(X, y)
# Check your answer
step_2.check()
You've tuned this model and improved your results. But we are still using Decision Tree models, which are not very sophisticated by modern machine learning standards. In the next step you will learn to use Random Forests to improve your models even more.
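As a preview, here is a minimal sketch (not part of this exercise) of how a Random Forest could be swapped in with scikit-learn's RandomForestRegressor, assuming the same train/validation split created above:

from sklearn.ensemble import RandomForestRegressor

# Fit a forest of trees on the training split and score it the same way as before
rf_model = RandomForestRegressor(random_state=1)
rf_model.fit(train_X, train_y)
rf_val_mae = mean_absolute_error(val_y, rf_model.predict(val_X))
print("Random Forest validation MAE: {:,.0f}".format(rf_val_mae))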
Keep Going
You are ready for Random Forests.