AML2019
# Install XGBoost
!pip3 install --user 'xgboost'
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
from sklearn.preprocessing import normalize, LabelEncoder, StandardScaler
from sklearn.model_selection import cross_validate, GridSearchCV, RandomizedSearchCV
from sklearn.metrics import make_scorer
from sklearn.metrics import mean_squared_error
from sklearn.feature_selection import RFECV
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import ElasticNetCV
from xgboost import XGBRegressor
from math import sqrt
import time
import warnings
warnings.filterwarnings("ignore")
FIRST, let's take an overall look at the data, along with some summary statistics.
trainDF = pd.read_csv('challenge_data/train.csv')
trainDF.head(5)
So, at a quick glance, this suggests we should spend quite some time investigating each column of the dataset, so that pre-processing does not blindly discard information.
trainDF.shape
testDF = pd.read_csv('challenge_data/test.csv')
testDF.shape
testDF.head(5)
trainDF.describe()
Looking at the overall statistics, we can see that house prices are very polarized:
This is pretty much expected.
SECOND, we'll explore the relationships between the 81 features of a house, beginning with numeric attributes.
# Get numeric attributes ONLY
numericTrainDF = trainDF._get_numeric_data().dropna()
# Also, drop categorical attributes that appear as numeric
numericTrainDF.drop(columns=["MSSubClass", "MoSold"], inplace=True)
Here we also remove MSSubClass (the building class) and MoSold (month sold), as they carry categorical meanings rather than quantitative ones and hence should not be treated as numeric.
# Compute the ABSOLUTE correlation coefficients
# Excluding Id and SalePrice
numericCorrel = numericTrainDF.drop(columns=["Id", "SalePrice"]).corr().abs()
# Get the lower-left half of the square
mask = np.zeros_like(numericCorrel)
mask[np.triu_indices_from(mask)] = True
# Plot the heatmap
plt.figure(figsize=(15, 12))
plt.title("Correlation Heatmap between Predictors")
sns.heatmap(numericCorrel, mask=mask, cmap="Reds", square=True, linewidths=1.)
plt.show()
Here we look at the relationships between numerical predictors only, excluding the meaningless Id and the dependent variable SalePrice.
There are several pairs of strongly correlated variables here:
Although there are many more pairs, for the sake of brevity we just list three of the strongest ones. Referring to the definitions of these predictors, we can see that the pairs are very reasonable; nothing here looks odd enough to warrant further investigation.
However, this finding of correlated predictors is very useful if we want to reduce the dimensionality later.
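As a quick illustration of the point above, the strongest pairs can be listed programmatically from the correlation matrix we just computed; the 0.7 threshold below is an arbitrary display choice, not part of our pipeline.
# List the most strongly correlated predictor pairs (threshold is illustrative)
lower_triangle = np.tril(np.ones(numericCorrel.shape, dtype=bool), k=-1)
strong_pairs = (numericCorrel.where(lower_triangle)   # keep each pair only once
                .stack()                              # drop the masked upper half
                .sort_values(ascending=False))
print(strong_pairs[strong_pairs > 0.7])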
Now, we'll explore the relationships between these predictors and SalePrice.
# Function to plot the horizontal heatmap (a set of variables against one)
def horizontalHeatMap(data, title=""):
plt.figure(figsize=(25, .5))
ax = sns.heatmap(data.T, fmt=".2f", annot=True, annot_kws={"weight":"bold"},
linewidths=1., cbar=False, cmap="Reds")
plt.title(title)
ax.title.set_fontsize(20)
plt.show()
# Drop the meaningless column "Id", compute correlations
# Extract the "SalePrice" series, then sort descendingly
numericCorrel = numericTrainDF.drop(columns=["Id"]).corr()
sortedCorrel = numericCorrel.loc[:, ["SalePrice"]] \
.sort_values(by="SalePrice", ascending=False)
# Plot the heatmap
horizontalHeatMap(sortedCorrel, "Correlations with SalePrice")
First, we can see that LowQualFinSF (low quality finished square feet (all floors)) doesn't correlate with SalePrice, while OverallQual (overall material and finish quality) strongly correlates with SalePrice. This makes total sense.
Second, GrLivArea (above grade (ground) living area square feet), GarageCars (size of garage in car capacity) and TotalBsmtSF (total square feet of basement area) are also strongly correlated with SalePrice, which also make sense.
The finding of these correlated predictors will be very helpful if we want to reduce the dimensionality later.
THIRD, we're going to explore the relationship between non-numeric attributes and SalePrice.
We use box plots in order to get a sense of how sale prices are distributed within each individual category.
# Remove the meaningless "Id" attribute
dfWithoutId = trainDF.drop(['Id'], axis=1)
# Outline the Plot
fig, axes = plt.subplots(15, 3, figsize=(30, 150))
# Indexed and titled each subplot
subplot_idx = 0
predictors = list(dfWithoutId.columns)
predictors.sort()
# Plotting
for pred in predictors:
if dfWithoutId[pred].dtype == 'object' and pred != 'SalePrice':
# Get position and plot the subplot
row, col = divmod(subplot_idx, 3)
axes[row, col] = sns.boxplot(data=dfWithoutId, x=pred, y='SalePrice',
ax=axes[row, col])
# Labeling each subplot
axes[row, col].set_title('SalePrice against {}'.format(pred), fontsize=25)
axes[row, col].yaxis.label.set_visible(False)
axes[row, col].yaxis.set_tick_params(labelsize=20)
axes[row, col].xaxis.label.set_visible(False)
axes[row, col].xaxis.set_tick_params(labelsize=20)
# Rotate the xticks
xticks = axes[row, col].get_xticklabels()
axes[row, col].set_xticklabels(xticks, rotation=60)
# Move to next subplot
subplot_idx += 1
# Beautify the plot
fig.delaxes(axes[14, 1])
fig.delaxes(axes[14, 2])
plt.tight_layout()
There are several strong predictors worth noting here:
- The Alley predictor leads to a higher SalePrice when Paved and a lower one when Grvl.
- The BsmtQual predictor (basement quality) increases SalePrice as the quality increases; in fact, Ex (excellent) leads to the highest SalePrice.
- The Neighborhood predictor has small-margin box plots for most values, with the rest being medium-margin. The medium-margin box plots are quite distinguishable in SalePrice from the others (rich areas), which makes Neighborhood quite a good predictor for SalePrice.
All of these make total sense, especially in the case of Neighborhood, since the mere location of a house very often greatly determines its sale value. There are many other reasonable relationships between non-numeric predictors and SalePrice; here we only list the three most interesting ones.
Although nothing looks strange enough to trigger further investigation, these findings allow us to reduce the dimensionality (if necessary) in later steps.
FINALLY, we investigate the dependence between SalePrice and its numerical predictors. Then, we'll proceed to investigate the joint distribution between SalePrice and the predictors it's most dependent on. The goal is to get a sense of the distribution underlying our data, which is necessary if we want to employ Gaussian Processes later.
For this purpose, we employ Mutual Information, which measures general dependence between two random variables, instead of Correlation, which only measures linear dependence between them. The higher the mutual information between SalePrice and one of its predictors, the better we can determine a sale price using the values of that predictor.
from sklearn.feature_selection import mutual_info_regression
# Get Numeric Predictors
numericPredictors = numericTrainDF.drop(columns=["Id", "SalePrice"])
# Get Mutual Information between SalePrice and each predictor
# Then turn it into a DataFrame, and sort descendingly
numericMI = mutual_info_regression(numericPredictors, numericTrainDF['SalePrice'])
numericMIDF = pd.DataFrame(data=numericMI, index=numericPredictors.columns, columns=['MI'])
numericMIDF.sort_values(by='MI', inplace=True, ascending=False)
# Plot the heatmap
horizontalHeatMap(numericMIDF, "Mutual Information with SalePrice")
Here, OverallQual and GrLivArea provide the most valuable information in determining the SalePrice of a house. This makes sense because, in practice, these two are vital factors house-buyers take into account before spending their money.
# Function to plot the joint-distribution
def plot_salesprice_jointpdf_with(predictor):
sns.jointplot(x=numericTrainDF['SalePrice'], y=numericTrainDF[predictor],
kind='kde', color='red')
plt.show()
for pred in ['OverallQual', 'GrLivArea']:
plot_salesprice_jointpdf_with(pred)
REMARK. Do these justify our future use of Gaussian Processes? We would, although reluctantly, say yes.
Recall that a Gaussian Process finds (in a Bayesian way) a distribution over all possible functions f, where each f is represented by a vector of values. What the Gaussian Process assumes is the normality of that distribution.
In other words, the distribution of all possible f, given a specific value of the input vector, is assumed to be normal.
In practice, the input vector is very large, so people only take the most "informing" variables, here OverallQual and GrLivArea, and verify the normality of each against the dependent variable, SalePrice.
Although SalePrice looks a little bit skewed, it is somewhat normal; therefore, its joint distribution with GrLivArea, which is normally distributed, does satisfy this assumption.
However, the joint distribution with OverallQual is not normal. This is because OverallQual is more categorical than continuous, which leads to the issue of having multiple values in the "middle". But if we "bin" the data so that these different middle values all represent medium overall quality, then we get something close to a normal distribution.
In conclusion, this situation is very similar to the linearity check we do in traditional linear regression, where the relationship is only loosely linear and our models still work very well. The joint distributions we're seeing now are normal to some extent, which keeps the results we get from the Gaussian Process reliable.
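To back up the "somewhat normal, a little bit skewed" observation with a number, a quick check of the skewness of SalePrice (and of its log, a transform we do not actually apply in this notebook) can be done as follows; this is only an illustrative sanity check.
from scipy.stats import skew

# Skewness near 0 indicates an approximately symmetric, normal-looking distribution
print("Skewness of SalePrice:      %.2f" % skew(numericTrainDF['SalePrice']))
print("Skewness of log(SalePrice): %.2f" % skew(np.log1p(numericTrainDF['SalePrice'])))
print("Skewness of GrLivArea:      %.2f" % skew(numericTrainDF['GrLivArea']))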
FIRST, we will handle missing values in both the training set and the test set, since many models do not work when such values are present. Importantly, missing values also affect the performance of learning algorithms.
We define a function check_missing to summarize the missing values of each column in these datasets, and a function fill_missing to handle them.
# Function to summary the missing values on a given dataset
def check_missing(df):
""" Summary missing values on a given dataset
Parameters:
df (DataFrame): a dataset to check
Return:
summary_df (DataFrame): the summary dataframe of missing values
"""
    missing_df = pd.DataFrame(df.isnull().sum())\
        .reset_index().rename({'index':'column', 0:'count'}, axis=1)
    summary_df = missing_df[missing_df['count'] > 0]
    summary_df['percentage'] = summary_df['count'].divide(df.shape[0]).round(decimals=3)
    return summary_df
# Function to fill missing values
def fill_missing(list_df, cols_vals):
"""Fill missing values on a given dataset
Parameters:
list_df (list of DataFrame): A list of dataframes to fill
        cols_vals (list of Tuple): A list of tuples, each containing a column name (or list of column names) and the value used to fill it
Return:
list_df (list of DataFrame): A list of dataframes which are filled missing values.
"""
for df in list_df:
for tup in cols_vals:
df.update(df[tup[0]].fillna(tup[1]))
return list_df
NEXT, we load the data and look at its summary of missing values to decide how to handle them.
# Load training set and display problems
train_data = pd.read_csv('challenge_data/train.csv')
print("\n\nMissing values on train data:")
display(check_missing(train_data))
# ==========================================
# TODO: Move this part to the bottom section
# Load test-data
test_data = pd.read_csv('challenge_data/test.csv')
# Based on the description of the dataset, these features are grouped together by type.
# Categorical features whose values are not missing at random
# (a missing value means the house simply lacks that feature)
valNone_cat = ['Alley', 'MasVnrType',\
'BsmtQual','BsmtCond', 'BsmtExposure','BsmtFinType1','BsmtFinType2',\
'FireplaceQu',\
'GarageArea','GarageType','GarageFinish','GarageQual','GarageCond',
'PoolQC', 'Fence', 'Electrical']
# Numerical features whose values are not missing at random (missing means 0)
valNone_0 = ['MasVnrArea','LotFrontage']
# Fill in missing values
[train_data_nomissing] = fill_missing(
[train_data],
[(valNone_cat, 'Have-None'), (valNone_0, 0),
('GarageYrBlt', train_data['GarageYrBlt'].mode()[0])]
)
# ==========================================
# TODO: Move this part to the bottom section
# Filling missing values for test-data
[test_data_nomissing] = fill_missing(
[test_data],
[(valNone_cat, 'Have-None'), (valNone_0, 0),
('GarageYrBlt', train_data['GarageYrBlt'].mode()[0])]
)
COMMENT.
We handle the missing values depending on the case.
How do we know the reason a value is missing? We can find out in several ways, such as carefully reading the description of the dataset or, when no detailed description is available, making inferences from the summary of missing values and the meaning of the variables. As can be seen from the summary, several features have exactly the same number of missing values, and these features belong to groups such as basement and garage. That is definitely no coincidence.
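A quick, illustrative check of that last claim for the garage group (the column names are from the dataset; the check itself is not part of our pipeline):
# The garage-related columns should be missing on exactly the same rows,
# i.e. the values are missing because the house has no garage at all
garage_cols = ['GarageType', 'GarageFinish', 'GarageQual', 'GarageCond']
print(train_data[garage_cols].isnull().sum())
# For rows where GarageType is missing, the other garage columns should be missing too
print(train_data[train_data['GarageType'].isnull()][garage_cols].notnull().sum())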
# Function to get features and target
def get_X_y(df, X_excluded=[], y_included=[]):
"""Get features (X) and target (y) of a given dataframe
Parameters:
df (DataFrame): A dataframe
        X_excluded (list of str): Column names which do not belong to X
        y_included (list of str): Column names of the targets.
Return:
X, y (DataFrame)
"""
return df.drop(X_excluded+y_included, axis=1), df[y_included]
# Function to get numerical variables
def get_numeric_dataframe(list_df):
"""Get numerical variables of a given list dataframe.
Parameters:
list_df (list of DataFrame): A list of dataframes
Return:
List of numerical variables respectively list of given dataframes.
"""
return [df.select_dtypes(exclude=['object']) for df in list_df]
# Function to get categorical variables
def get_categorical_dataframe(list_df):
"""Get categorical variables of a given list dataframe.
Parameters:
list_df (list of DataFrame): A list of dataframes
Return:
List of categorical variables respectively list of given dataframes.
"""
return [df.select_dtypes(include=['object']) for df in list_df]
# Get X (features) and y (target)
X_train, y_train = get_X_y(train_data_nomissing, ['Id','MiscFeature'],['SalePrice'])
# Split X (features) into numerical & categorical
[X_train_num] = get_numeric_dataframe([X_train])
[X_train_cat] = get_categorical_dataframe([X_train])
# ==========================================
# TODO: Move this part to the bottom section
# Test-set: Split X (features) into numerical & categorical
X_test, _ = get_X_y(test_data_nomissing, ['Id', 'MiscFeature'])
[X_test_num] = get_numeric_dataframe([X_test])
[X_test_cat] = get_categorical_dataframe([X_test])
COMMENT. Here we simply split the dataset into two parts, vertically: the numerical predictors on one side and the categorical predictors on the other (with the target SalePrice extracted separately).
from collections import defaultdict
# APPROACH 1: Convert categorical variables to numerical variables
# using LabelEncoder
d = defaultdict(LabelEncoder)
# ==============================================
# TODO: Separate the process into train and test
# TODO: Move test to the bottom-section
X_cat = pd.concat([X_train_cat, X_test_cat], ignore_index=True)
fit = X_cat.apply(lambda x: d[x.name].fit_transform(x))
X_train_cat_encoded = X_train_cat.apply(lambda x: d[x.name].transform(x))
X_test_cat_encoded = X_test_cat.apply(lambda x: d[x.name].transform(x))
# APPROACH 2: Convert categorical variables to numerical variables
# using DUMMY VARIABLES
X_train_cat_dummied = pd.get_dummies(X_train_cat, columns=X_train_cat.columns,
drop_first=True)
REMARK 1.
Because several models work only on numerical data, we need to encode categorical variables as numbers. There are two common ways: encode labels with values between 0 and n_classes-1, or encode labels as dummy variables. In this project, we use both.
We use LabelEncoder for models such as XGBoost and Random Forests, since in some of our experiments LabelEncoder yields better modelling performance than the dummy-variable conversion.
We use dummy variables for the Gaussian Process for several reasons, one of them being implicit meaning: categorical variables usually do not imply any ordering or numerical difference, but if we convert them to plain numbers, the Gaussian Process will wrongly take these implicit orderings and differences into account, leading to poor models.
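A toy illustration of the difference (the column and values below are made up, not taken from the dataset):
# Label-encoding imposes an artificial ordering (A=0 < B=1 < C=2),
# while dummy variables only record presence/absence of each category
toy = pd.DataFrame({'Neighborhood_example': ['A', 'C', 'B', 'A']})
print(LabelEncoder().fit_transform(toy['Neighborhood_example']))  # -> [0 2 1 0]
print(pd.get_dummies(toy, drop_first=True))                       # two binary columns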
REMARK 2.
Since we approach the problem of numerically encoding categorical data in two different ways, from now on we separate the pre-processing into two distinct pipelines:
Pipeline 1: used for ElasticNet and XGBoost:
Pipeline 2: used for Gaussian Process:
REMARK 3. Why Standardizing?
First, when features have different scales, it is necessary to re-scale the data so that no feature outweighs the others just by having a larger scale.
We chose Standardization instead of Normalization for several reasons:
Another reason: a MinMaxScaler fitted on the train dataset would be applied to the test dataset before predicting, and this would very likely produce values less than 0 or greater than 1 (whenever some test values are greater than the train maximum, or less than the train minimum). This is undesirable.
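A small illustration of that out-of-range issue, using made-up numbers rather than our data:
from sklearn.preprocessing import MinMaxScaler

train_toy = np.array([[1.0], [2.0], [3.0]])
test_toy = np.array([[5.0]])  # larger than anything seen during training
print(MinMaxScaler().fit(train_toy).transform(test_toy))    # 2.0 -> outside [0, 1]
print(StandardScaler().fit(train_toy).transform(test_toy))  # just a large z-score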
# Merge categorical variables and numerical variables
X_train_merged = pd.concat([X_train_num, X_train_cat_encoded], axis=1)
# FULL STANDARDIZATION:
full_standardizer_X = StandardScaler().fit(X_train_merged)
X_train_fullscaled = pd.DataFrame(full_standardizer_X.transform(X_train_merged),
columns=X_train_merged.columns)
# =============================================
# TODO: Moving to the bottom (for the test-set)
# Note: the output columns should come from the train set (X_train_merged.columns),
# as every transformation on the test set must reuse the train-fitted transformer
# Splitting into predictors and targets (test-set)
X_test_merge = pd.concat([X_test_num, X_test_cat_encoded], axis=1)
# FULL STANDARDIZATION (test-set):
X_test_fullscaled = pd.DataFrame(full_standardizer_X.transform(X_test_merge),
columns=X_test_merge.columns)
In a dataset, each feature affects the target in a different way. Some features are really helpful, while others may be useless, consuming resources or even harming the model. Here we use RFECV, i.e. recursive feature elimination with cross-validated selection of the best number of features, to see how the number of features used in a model affects its performance.
FIRST, since our metric is the RMSE between the logarithm of the predicted and the logarithm of the observed values, we create a helper function that computes this metric. It serves as the scorer for our cross-validation in RFECV.
# The RMSE between the log of the predicted and the log of the observed
def rmse_log(actual, pred):
"""Return the RMSE between the logarithm of predicted and logarithm of observed
Parameters:
actual (Series-like): The observed values
pred (Series-like): The predicted values
Return:
The RMSE between the logarithm of predicted and logarithm of observed
"""
return sqrt(mean_squared_error(np.log(actual), np.log(pred)))
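A quick sanity check of the metric on made-up values (not part of the pipeline):
# Predictions within roughly 5-10% of the observed values give a small RMSE-log
print(rmse_log([100000, 200000], [110000, 190000]))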
SECOND, we use RandomForestRegressor as an estimator and RFECV as the feature selection technique to select the optimal number of features used in models.
# Create an estimator
estimator = RandomForestRegressor(random_state=42)
# Feature selection using recursive feature elimination and cross-validated selection
selector = RFECV(estimator, step=1, cv=5, scoring=make_scorer(rmse_log, greater_is_better=False))
selector = selector.fit(X_train_fullscaled, y_train)
print("Optimal number of features : %d" % selector.n_features_)
# Plot number of features VS. cross-validation scores
plt.figure(figsize=(8,5))
plt.plot(range(1, len(selector.grid_scores_) + 1), -selector.grid_scores_)
plt.xlabel("Number of features selected")
plt.ylabel("Cross validation score")
plt.show()
# Select the optimal features
selected_features = [feature for (boolean, feature) in zip(selector.support_, X_train_fullscaled.columns) if boolean]
X_train_rmfeatures = X_train_fullscaled[selected_features]
# =====================================
# TODO: Move this to the bottom section
X_test_rmfeatures = X_test_fullscaled[selected_features]
COMMENT
Since the dataset is not big, using all of its features for modelling would not be a problem. Still, we would like to observe how the performance of the model degrades as features are removed, so that we can keep the best subset; that is why we use RFECV here.
We use RandomForestRegressor as the estimator because it randomly selects observations and features when building its trees, which is the kind of estimator we want driving the recursive feature elimination.
The plot shows that it is better to use more than 10 features in the model; beyond 10 features, the performance only fluctuates within a small range. Because each model learns in its own way, the optimal number of features for this model may not be optimal for others. However, it gives us an intuition for model selection and hyper-parameter tuning.
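For a closer look at which of the selected features matter most, the random forest refitted inside RFECV exposes impurity-based importances; this is just an inspection step, not something the rest of the notebook depends on.
# Rank the selected features by the importances of the final estimator inside RFECV
importances = pd.Series(selector.estimator_.feature_importances_, index=selected_features)
print(importances.sort_values(ascending=False).head(10))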
# PARTIAL (numerical) STANDARDISATION
numerical_standardizer_X = StandardScaler().fit(X_train_num)
X_train_numscaled = pd.DataFrame(numerical_standardizer_X.transform(X_train_num),
columns=X_train_num.columns)
# Remove outliers: data points that are more than 3.5 standard deviations from the mean
X_train_rmOutliers = X_train_numscaled[X_train_numscaled\
.apply(lambda x: np.abs(x - x.mean()) / x.std() < 3.5)\
.all(axis=1)]
X_train_rmOutliers.shape
train_merged = pd.concat([X_train_rmOutliers, X_train_cat_dummied, y_train], join='inner', axis=1)
[X_train_dummied, y_train_dummied] = get_X_y(train_merged, y_included=['SalePrice'])
COMMENT. Here we did several things: we standardized the numerical features only (the categorical ones are dummy-encoded), removed the rows containing values more than 3.5 standard deviations from the mean, and then merged the remaining rows with the dummied categorical variables and the target using an inner join, so that the outlier rows are dropped consistently.
FIRST REMARK for this section.
Here we build two data-preparation pipelines using only the train dataset, without looking at the test dataset. This is recommended because, in practice, the test set is not available until prediction time.
For this, we will:
We tried several popular models for regression tasks, such as RandomForestRegressor, ElasticNetCV and XGBRegressor, which are robust to the presence of outliers.
We also evaluate on two different datasets:
# Helper function to get the cross-validation results on a well-formed dataframe.
def get_cv_result(cv, model):
"""Get the cross-validation results on a well-formed datafram
Parameters:
cv (Dictionary): The return of cross-validation
model: The model which is used for cross-validation
Returns:
avg (DataFrame): The result dataframe of cross-validation
"""
avg = {}
for k,vals in cv.items():
avg[k] = np.around(np.mean(vals), 4)
avg['model'] = model
avg['test_score'] = - avg['test_score']
avg['train_score'] = - avg['train_score']
return avg
# This function runs cross-validation to evaluate the performance of models
def model_selection(models, X, y, cv=5, scoring=make_scorer(rmse_log, greater_is_better=False)):
"""Evaluate models using cross-validation
Parameters:
models: A list of models
X (DataFrame): Feature set of the given data
y (DataFrame): Target set of the given data
scoring: The scoring function used to evaluate models
Returns:
DataFrame of cross-validation of models
"""
cv_results = []
for model in models:
cv_result = cross_validate(model, X, y, cv=cv, scoring=scoring, return_train_score=True)
cv_results.append(get_cv_result(cv_result, model))
return pd.DataFrame(cv_results)[['model', 'fit_time', 'score_time', 'test_score', 'train_score']].sort_values(by=['test_score'])
# Models
random_forests = RandomForestRegressor()
elasticNet = ElasticNetCV(random_state=0)
xgb = XGBRegressor(seed=42)
models = [random_forests, elasticNet, xgb]
# Models using training set with 78-preprocessed-features
print("\nCross-validation on training set with 78-preprocessed features")
models_78f = model_selection(models, X=X_train_fullscaled, y=y_train)
display(models_78f)
# Models using training set with 24-selected features
print("\n\nCross-validation on training set with 24-selected features")
models_24f = model_selection(models, X=X_train_rmfeatures, y=y_train)
display(models_24f)
COMMENT.
Performance:
Between the models:
The outputs show that XGBRegressor outperforms RandomForestRegressor and ElasticNetCV on both datasets: its cross-validated test_score is ~0.13, while the others are ~0.15 and ~0.24. Therefore, we will use the best model, XGBRegressor, for hyper-parameter tuning in the next part.
Between sub-set and full-set:
Using fewer features than the full set, but more than 10 (24 out of 78), the performance of XGBRegressor only slightly decreases, while the performance of ElasticNetCV remains almost unchanged. This can be explained by the graph in section 2.f (Feature selection).
Only RandomForestRegressor obtains a better result on the selected-feature set than on the full set. That is because these features were chosen by the RandomForestRegressor estimator itself.
Model Complexity & Interpretability
# Scorer used for tuning and evaluation: lower RMSE-log is better,
# so greater_is_better=False makes scikit-learn negate it internally
scoring = make_scorer(rmse_log, greater_is_better=False)
xgb_tunned = XGBRegressor(colsample_bytree=0.4,
gamma=0,
learning_rate=0.07,
max_depth=3,
min_child_weight=1.5,
n_estimators=1000,
reg_alpha=0.75,
reg_lambda=0.45,
subsample=0.6,
seed=42)
# Evaluate the tuned model on the training set with 78 preprocessed features
tunned_model = model_selection([xgb_tunned], X=X_train_fullscaled, y=y_train)
display(tunned_model)
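The hyper-parameter values above came out of our tuning experiments; a minimal sketch of how such a search could be run with RandomizedSearchCV is shown below. The parameter ranges and n_iter are illustrative assumptions, not the exact search we performed.
from scipy.stats import randint, uniform

# Illustrative search space around the kind of values used above
xgb_param_distribution = {
    'learning_rate': uniform(0.01, 0.2),
    'max_depth': randint(2, 6),
    'n_estimators': randint(200, 1500),
    'min_child_weight': uniform(1.0, 3.0),
    'subsample': uniform(0.5, 0.5),
    'colsample_bytree': uniform(0.3, 0.5),
    'reg_alpha': uniform(0.0, 1.0),
    'reg_lambda': uniform(0.0, 1.0),
}
xgb_search = RandomizedSearchCV(XGBRegressor(seed=42), xgb_param_distribution,
                                scoring=make_scorer(rmse_log, greater_is_better=False),
                                cv=5, n_iter=30, random_state=42, verbose=1)
xgb_search.fit(X_train_fullscaled, y_train.values.ravel())
print("Best RMSE-log: {:.4f}".format(-xgb_search.best_score_))
print("Best parameters:", xgb_search.best_params_)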
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import mutual_info_regression
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern
from sklearn.pipeline import Pipeline
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import reciprocal
# Using mutual_info_regression for feature-selection
mutual_info_selector = SelectKBest(score_func=mutual_info_regression)
# Choosing Matern-kernel with nu=0.5 for least smoothness
# and low computational-cost
matern_kernel = 1.0**2 * Matern(nu=0.5)
gaussian_regressor = GaussianProcessRegressor(kernel=matern_kernel, random_state=307)
# The combined Pipeline
gaussian_pipeline = Pipeline([
("feature_selector", mutual_info_selector),
("regressor", gaussian_regressor)
])
# Distributions of the hyper-parameters
features_nb = X_train_dummied.shape[1]
params_distribution = {
'feature_selector__k': range(features_nb//2, features_nb*3//4),
'regressor__alpha': reciprocal(1e-6, 0.1)
}
COMMENT. Here we did several things:
First, instead of doing feature selection separately, we treated it as part of a Pipeline that needs to be tuned, for the following reasons:
Second, why mutual_info_regression? Why that strange Matern kernel?
REMARK 1. On choosing the mutual_info_regression score function, and the (50%, 75%) range for k in SelectKBest().
First, our prior is that while f_regression only captures linear dependency, mutual_info_regression can capture any kind of dependency. Together with the fact that our data doesn't look that linear, we only tried f_regression a few times.
Second, after seeing that f_regression did give worse results, most of our time was spent picking the right range for k with mutual_info_regression. The 50% to 75% range is just a heuristic reached after a lot of trial and error, in which our prior was:
We did try, a few times, both the lower range and the higher range to see whether the results confirmed our prior. After that, we focused on picking the best middle range, and (50%, 75%) is what came out (within our limited time).
REMARK 2. On choosing the Matern kernel, and the reciprocal distribution for alpha in GaussianProcessRegressor.
First, the choice for reciprocal-distribution is quite reasonable:
Second, the strange Matern kernel we used is based on two priors:
from sklearn.metrics import make_scorer
gaussian_random_cv = RandomizedSearchCV(gaussian_pipeline, params_distribution,
scoring=make_scorer(rmse_log, greater_is_better=False),
cv=4, n_iter=60, verbose=2)
COMMENT. Here we run a Randomized Search with 60 rounds of 4-fold CV to select the best model:
Why RandomizedSearch? Why 60 turns?
We also pass in our custom-built metric rmse_log. Note that this is a less-is-better metric, so scikit-learn will output a negative score; taking its absolute value gives the actual error.
gaussian_random_cv.fit(X_train_dummied, y_train_dummied)
# Negate the score because scikit-learn makes
# less-is-better metrics negative so that
# it can consistently maximise
print("RMSE-Log Error: {}".format(-gaussian_random_cv.best_score_))
# Output the tuned hyper-parameters
print("Tuned number-of-features: {}".format(gaussian_random_cv\
.best_params_['feature_selector__k']))
print("Tuned gaussian-regressor-alpha: {}".format(gaussian_random_cv\
.best_params_['regressor__alpha']))
COMMENT & REMARKS.
First, the result of 0.11 is quite good for a single estimator on this task. We believe that, given more time, this performance could be improved by:
The optimal hyper-parameters of:
Model Complexity & Interpretability
Handling different data types:
Does the model return uncertainty along with prediction?:
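On that last point, the fitted GaussianProcessRegressor can indeed return a predictive standard deviation alongside the mean; a small sketch using the tuned pipeline, applied here to the training features purely as an illustration:
# Pull the fitted steps out of the best pipeline found by the randomized search
best_pipe = gaussian_random_cv.best_estimator_
X_selected = best_pipe.named_steps['feature_selector'].transform(X_train_dummied)
# return_std=True gives the predictive standard deviation for each point
mean_pred, std_pred = best_pipe.named_steps['regressor'].predict(X_selected, return_std=True)
print(mean_pred[:3], std_pred[:3])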
# Test-set preprocessing (moved down here, see the TODOs in the sections above)
# A. Handle Missing Data
test_data = pd.read_csv('challenge_data/test.csv')
[test_data_nomissing] = fill_missing([test_data],
[(valNone_cat, 'Have-None'), (valNone_0, 0),
('GarageYrBlt', train_data['GarageYrBlt'].mode()[0])])
# B. Split into Numerical & Category
X_test, _ = get_X_y(test_data_nomissing, ['Id', 'MiscFeature'])
[X_test_num] = get_numeric_dataframe([X_test])
[X_test_cat] = get_categorical_dataframe([X_test])
# C. Scale numerical features
# Note: columns=X_train_num.columns, because every transformation on the test set
# must reuse the transformer fitted on the train set
X_test_scaled = pd.DataFrame(numerical_standardizer_X.transform(X_test_num), columns=X_train_num.columns)
# E. Categorical Encoding (dummy variables)
# E1. Add dummy variables
X_test_cat_dummied = pd.get_dummies(X_test_cat, columns=X_test_cat.columns,
drop_first=True)
# E2. Remove variables in test, not train
redundant_cols = set(X_test_cat_dummied.columns) - set(X_train_cat_dummied.columns)
X_test_cat_dummied.drop(columns=list(redundant_cols), inplace=True)
# E3. Add variables in train, not test, default at 0
missing_cols = set(X_train_cat_dummied.columns) - set(X_test_cat_dummied.columns)
for mis_col in list(missing_cols):
    X_test_cat_dummied[mis_col] = 0
# E4. Align the column order with the train dummies
X_test_cat_dummied = X_test_cat_dummied[X_train_cat_dummied.columns]
# F. Merge numerical & dummies together
X_test_ = pd.concat([X_test_scaled, X_test_cat_dummied], axis=1)
COMMENT. Here we reproduce on the test set the same preprocessing applied to the training set: missing-value handling, the numerical/categorical split, scaling with the train-fitted scaler, and dummy encoding aligned to the train columns (dropping dummies that only appear in test, and adding as all-zero the dummies that only appear in train). We moved all the test preprocessing down to this section so that everything above is built from the training data alone, without ever looking at the test set.
# Make predictions with the best Gaussian Process pipeline
# (the target was never scaled in this pipeline, so no inverse transform is needed)
predictions = gaussian_random_cv.best_estimator_.predict(X_test_)
# Building results dataframe and output
results = pd.DataFrame(data=predictions, columns=['SalePrice'])
results['Id'] = test_data['Id']
results = results[['Id', 'SalePrice']]
results.to_csv('submission.csv', index=False)
COMMENT. Here we predict SalePrice for the test set with the best pipeline found by the randomized search, attach the corresponding Id column, and write the result to submission.csv in the required (Id, SalePrice) format.
a. Running time of models:
The running time of XGBoost on this dataset is ~4.58 seconds. During model selection with cross-validation, its running time (fit_time and score_time) is longer than that of RandomForestRegressor and ElasticNetCV, but shorter than that of the Gaussian Process.
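For reference, a simple way to measure such a wall-clock training time (the 4.58 s figure above is from our machine; the exact number will vary):
# Time a single fit of the tuned XGBoost model
start = time.time()
xgb_tunned.fit(X_train_fullscaled, y_train.values.ravel())
print("XGBoost training time: %.2f seconds" % (time.time() - start))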
b. Scalability
XGBoost is scalable in distributed as well as memory-limited settings. This scalability is due to several algorithmic optimizations:
+ Split finding algorithms: approximate algorithm.
+ Column block for parallel learning.
+ Weighted quantile sketch for approximate tree learning.
+ Sparsity-aware algorithm
+ Cache-aware access
+ Out-of-core computation.
+ Regularized Learning Objective.
c. Parallelise:
XGBoost implements a gradient boosted trees algorithm. A gradient boosted trees model trains many decision or regression trees in sequence, where only one tree is added to the model at a time and every new tree depends on the previous ones. This nature limits the degree of parallelism, since we cannot build multiple trees simultaneously. Therefore, parallel computation is introduced at a lower level, i.e. inside the tree-building process at each step.
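In the scikit-learn wrapper this per-tree parallelism is exposed as a thread count; a minimal sketch (the n_jobs setting is an assumption for illustration, not what we used above):
# Use all available cores when building each individual tree
xgb_parallel = XGBRegressor(seed=42, n_jobs=-1)
xgb_parallel.fit(X_train_fullscaled, y_train.values.ravel())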
d. Training the model with smaller subsets of the data
As mentioned in section 3, when using fewer features (24 out of 78), the performance of XGBRegressor only slightly decreases (by ~0.0045).