# House Pricing Prediction House Pricing

# Introduction

During this challenge, we are going to predict the final price of each house. We are given 2 data sets: train and test. Each data set contain many features that we will explore, try to find eventual correlations beween them and select the most useful ones to predict the house price. Finally, we will test and compare a few models trained on this data in order to select the model with the best performance.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

import warnings
warnings.filterwarnings('ignore')

%matplotlib inline


# Data preparation

First, we import the training and testing data. We will be using pandas library to manipulate and display data.
# Import the training and testing data to explore them
print("Training samples:", len(df_train))
print("Testing samples:", len(df_test))

Training samples: 1200
Testing samples: 260


IdMSSubClassMSZoningLotFrontageLotAreaStreetAlleyLotShapeLandContourUtilities...PoolAreaPoolQCFenceMiscFeatureMiscValMoSoldYrSoldSaleTypeSaleConditionSalePrice
0160RL65.08450PaveNaNRegLvlAllPub...0NaNNaNNaN022008WDNormal208500
1220RL80.09600PaveNaNRegLvlAllPub...0NaNNaNNaN052007WDNormal181500
2360RL68.011250PaveNaNIR1LvlAllPub...0NaNNaNNaN092008WDNormal223500
3470RL60.09550PaveNaNIR1LvlAllPub...0NaNNaNNaN022006WDAbnorml140000
4560RL84.014260PaveNaNIR1LvlAllPub...0NaNNaNNaN0122008WDNormal250000

5 rows × 81 columns

This overview gave us an idea about the data structure we will be using as well as the different fields of the data set. We notice that there are so many features in this data set. So in the first part we will choose the most useful features to keep.

We start by dropping the 'Id' column which is totally useless for the prediction.
# Drop Id and extract target
df_train_target = df_train['SalePrice']
df_test_target = df_test['SalePrice']
df_train = df_train.drop(['Id','SalePrice'], axis=1)
df_test = df_test.drop(['Id','SalePrice'], axis=1)


## Numrical Fields

A quick discription of the numerical fields is quiet useful to have an idea about our data (e.g. the data range, min/max, mean of each column, etc).
# Descriptive summary of numeric fields
df_train.describe()


count1200.000000990.0000001200.0000001200.0000001200.0000001200.0000001200.0000001194.0000001200.0000001200.000000...1200.0000001200.0000001200.0000001200.0000001200.0000001200.0000001200.0000001200.0000001200.0000001200.000000
mean57.07500070.08686910559.4116676.1050005.5683331971.3508331984.987500103.962312444.88666745.260000...472.60416795.13666746.01666722.1783333.65333314.9808331.90916740.4533336.3116672007.810833
std42.68201223.70202910619.1355491.3834391.12013830.04840820.527221183.534953439.987844158.931453...212.722444124.03412965.67762961.50732329.99109954.76805733.148327482.3234442.6731041.319027
min20.00000021.0000001300.0000001.0000001.0000001875.0000001950.0000000.0000000.0000000.000000...0.0000000.0000000.0000000.0000000.0000000.0000000.0000000.0000001.0000002006.000000
25%20.00000059.0000007560.0000005.0000005.0000001954.0000001967.0000000.0000000.0000000.000000...334.5000000.0000000.0000000.0000000.0000000.0000000.0000000.0000005.0000002007.000000
50%50.00000070.0000009434.5000006.0000005.0000001973.0000001994.0000000.000000385.5000000.000000...478.0000000.00000024.0000000.0000000.0000000.0000000.0000000.0000006.0000002008.000000
75%70.00000080.00000011616.0000007.0000006.0000002000.0000002004.000000166.750000712.2500000.000000...576.000000168.00000068.0000000.0000000.0000000.0000000.0000000.0000008.0000002009.000000
max190.000000313.000000215245.00000010.0000009.0000002010.0000002010.0000001600.0000002260.0000001474.000000...1390.000000857.000000523.000000552.000000508.000000410.000000648.00000015500.00000012.0000002010.000000

8 rows × 36 columns

The range of values of each column varies widely. So normalizing the data is an important step in order to avoid any scaling problem (e.g distances between points won't be dominated by features with high values) and also to speed up the training process (e.g. gradient descent will converge faster).
c_ = df_train.describe().columns
df_train[c_] = (df_train[c_] - df_train[c_].mean()) / df_train[c_].std()
df_test[c_] = (df_test[c_] - df_test[c_].mean()) / df_test[c_].std()


### Correlation Matrix

As we can see we have so many features describing the data samples. Probably not all of these feature are importatnt to predict our target which is the selling price ('SalePrice').

So now, we are going to keep only columns which are relevent to this problem. So let's have a quick look to the correlation between different features using the correlation matrix.
#correlation matrix
corrmat = np.abs(pd.concat([df_train, df_train_target], axis=1).corr())
with sns.axes_style("white"):
f, ax = plt.subplots(figsize=(16, 12)) This heatmap gave us a great overview of our data as well as the relationship between different features.

We notice that there are many darkblue-colored squares: there are obvious correlations such as between the number of rooms at each level and the surface of that level (ground or higher floors), between surfaces of different levels, and also between surface and capacity of garages.

These columns give almost the same information, so we can get rit of extra ones.

We can also look to features that are directly correlated with the 'SalePrice': we get 'GrLivArea', 'TotalBsmtSF', and 'OverallQual'.
# SalePrice correlation matrix
n = 10
cols = corrmat.nlargest(n, 'SalePrice')['SalePrice'].index # List of highly correlated colums with 'SalePrice'
cm = np.corrcoef(pd.concat([df_train, df_train_target], axis=1)[cols].values.T)
with sns.axes_style("white"):
f, ax = plt.subplots(figsize=(12, 9))
ax = sns.heatmap(cm,
cbar=True,
annot=True,
square=True,
fmt='.2f',
annot_kws={'size': 10},
yticklabels=cols.values,
xticklabels=cols.values,
cmap="YlGnBu")
plt.show() To sum up:
• 'GrLivArea', 'TotalBsmtSF' and 'OverallQual' are highly correlated with 'SalePrice'.
• '1stFlrSF' and 'TotalBsmtSF': the first floor square feet and the total square feet of basement area are obviously highly correlated. We can keep only 'TotalBsmtSF'.
• 'TotRmsAbvGrd' and 'GrLivArea': total rooms above grade and above grade living area square feet. 'GrLivArea' to be kept.
• 'YearBuilt' and 'GarageYrBlt': it's probably the same year of building the house and the garage. 'GarageYrBlt' to be dropped.
• 'GarageArea' and 'GarageCars': the surface of the garage and it's capacity are highly related (obvious).
• ### Scatter Plot

sns.set()
cols = ['SalePrice', 'OverallQual', 'GrLivArea', 'TotalBsmtSF', '1stFlrSF', 'TotRmsAbvGrd', 'YearBuilt', 'GarageYrBlt', 'GarageCars', 'GarageArea']
sns.pairplot(pd.concat([df_train, df_train_target], axis=1)[cols], size = 2.5)
plt.show(); As previously discussed, the sale price is linearly correlated with the material rates, ground and basement surface.

we can see an exponential relation between sale price and build year especially during the recent year.

Garage build year and build year are related but not always equal.

Garage capacity is a result of its area.

# Numerical columns to keep
numerical_cols = ['GrLivArea',
'TotalBsmtSF',
'OverallQual',
'TotalBsmtSF',
'GrLivArea',
'YearBuilt',
'GarageArea',
'FullBath',
'MSSubClass',

df_train_num = df_train[numerical_cols]
df_test_num = df_test[numerical_cols]
print("Training data:", df_train_num.shape)
print("Testing data:", df_test_num.shape)

Training data: (1200, 10)
Testing data: (260, 10)

df_num = pd.concat([df_train_num, df_test_num], ignore_index=True)


## Non-Numerical/Categorical Fields

In this part, we will treat non-numerical feature.

The first step is to find eventual correlations between these fields and if there are interesting relations with ‘SalePrice’ target.

Then, we will convert them to numerical fields using the One-Hot-Encoding and the Label Encoding methods. This step is very important in order to be able to train our model later.

Label Encoding will be used for fields with comparable values, e.g. ‘GarageQual’ whose values can be converted to 5, 4, 3, 2, 1 and 0 for Ex (Excellent), Gd (Good), TA (Typical/Average), Fa (Fair), Po (Poor) and NA (No Garage) respectively.

One-Hot-Encoding will be used for other categorical fields whose values are not comparable, e.g. ‘MSZoning’ for zone classification.

# Print fields' types
df_train.dtypes.unique()

array([dtype('float64'), dtype('O')], dtype=object)

# Extracting non-nemeric fields
df_train_non_num = df_train.select_dtypes(include=['O'])


MSZoningStreetAlleyLotShapeLandContourUtilitiesLotConfigLandSlopeNeighborhoodCondition1...GarageTypeGarageFinishGarageQualGarageCondPavedDrivePoolQCFenceMiscFeatureSaleTypeSaleCondition
0RLPaveNaNRegLvlAllPubInsideGtlCollgCrNorm...AttchdRFnTATAYNaNNaNNaNWDNormal
1RLPaveNaNRegLvlAllPubFR2GtlVeenkerFeedr...AttchdRFnTATAYNaNNaNNaNWDNormal
2RLPaveNaNIR1LvlAllPubInsideGtlCollgCrNorm...AttchdRFnTATAYNaNNaNNaNWDNormal
3RLPaveNaNIR1LvlAllPubCornerGtlCrawforNorm...DetchdUnfTATAYNaNNaNNaNWDAbnorml
4RLPaveNaNIR1LvlAllPubFR2GtlNoRidgeNorm...AttchdRFnTATAYNaNNaNNaNWDNormal

5 rows × 43 columns

df_train_non_num.nunique()

MSZoning          5
Street            2
Alley             2
LotShape          4
LandContour       4
Utilities         2
LotConfig         5
LandSlope         3
Neighborhood     25
Condition1        9
Condition2        7
BldgType          5
HouseStyle        8
RoofStyle         5
RoofMatl          6
Exterior1st      14
Exterior2nd      15
MasVnrType        4
ExterQual         4
ExterCond         5
Foundation        6
BsmtQual          4
BsmtCond          4
BsmtExposure      4
BsmtFinType1      6
BsmtFinType2      6
Heating           4
HeatingQC         5
CentralAir        2
Electrical        5
KitchenQual       4
Functional        7
FireplaceQu       5
GarageType        6
GarageFinish      3
GarageQual        5
GarageCond        5
PavedDrive        3
PoolQC            3
Fence             4
MiscFeature       3
SaleType          9
SaleCondition     6
dtype: int64

non_num_corr = pd.concat([df_train_non_num.apply(lambda x: x.factorize()), df_train_target], axis=1).corr()
with sns.axes_style("white"):
f, ax = plt.subplots(figsize=(16, 12)) # columns to convert to numerical values
label_encoding_cols = ['Utilities',
'LandSlope',
'ExterQual',
'ExterCond',
'BsmtQual',
'BsmtCond',
'BsmtExposure',
'BsmtFinType1',
'BsmtFinType2',
'HeatingQC',
'CentralAir',
'KitchenQual',
'Functional',
'FireplaceQu',
'GarageType',
'GarageFinish',
'GarageQual',
'GarageCond',
'PavedDrive',
'PoolQC',
'Fence']
one_hot_cols = ['MSZoning',
'Street',
'Alley',
'LotShape',
'LandContour',
'LotConfig',
'Neighborhood',
'Condition1',
'Condition2',
'BldgType',
'HouseStyle',
'RoofStyle',
'RoofMatl',
'Exterior1st',
'Exterior2nd',
'MasVnrType',
'Foundation',
'Heating',
'Electrical',
'MiscFeature',
'SaleType',
'SaleCondition']

all_data = pd.concat([df_train, df_test], ignore_index=True)
all_data.shape

(1460, 79)

# Convert categorical variable into indicator variables
df_cat_one_hot = pd.get_dummies(all_data[one_hot_cols], dummy_na=True, columns=one_hot_cols, drop_first=True)
df_cat_one_hot.shape

(1460, 162)

from sklearn.preprocessing import LabelEncoder, OneHotEncoder

df_label_encoding = pd.DataFrame(columns=label_encoding_cols)
for col in label_encoding_cols:
labelencoder = LabelEncoder()
df_label_encoding[col] = labelencoder.fit_transform(all_data[col].astype(str))
# print(labelencoder.fit_transform(df_train[col].astype(str)))


UtilitiesLandSlopeExterQualExterCondBsmtQualBsmtCondBsmtExposureBsmtFinType1BsmtFinType2HeatingQC...KitchenQualFunctionalFireplaceQuGarageTypeGarageFinishGarageQualGarageCondPavedDrivePoolQCFence
00024233250...2651144234
10034231050...3641144234
20024232250...2641144234
30034313052...2625244234
40024230250...2641144234

5 rows × 21 columns

## Missing Data

# Missing Data
total = df_train.isnull().sum().sort_values(ascending=False)
percent = (df_train.isnull().sum()/df_train.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])


TotalPercent
PoolQC11960.996667
MiscFeature11530.960833
Alley11250.937500
Fence9730.810833
FireplaceQu5640.470000
LotFrontage2100.175000
GarageType670.055833
GarageFinish670.055833
GarageQual670.055833
GarageCond670.055833
GarageYrBlt670.055833
BsmtExposure330.027500
BsmtFinType2330.027500
BsmtCond320.026667
BsmtFinType1320.026667
BsmtQual320.026667
MasVnrArea60.005000
MasVnrType60.005000
RoofStyle00.000000
RoofMatl00.000000
After reading the full description of each column available in the description file, we make sure that 'NaN' values are not missing data, but usually refer to a missing feature in the house, e.g. a 'NaN' value in 'PoolQC' colums means "no pool in the house". This is the case for the 5 columns with the highest missing data percent (PoolQC, MiscFeature, Alley, Fence, FireplaceQu).

So we will keep the missing values for these columns because they give us information about the house features.

df_train['LotFrontage'].unique()

array([-2.14617436e-01,  4.18239777e-01, -8.80459934e-02, -4.25569840e-01,
5.87001701e-01,  6.29192182e-01,  2.07287373e-01,             nan,
-8.05284168e-01, -8.47474649e-01, -3.66503167e-03,  8.82335067e-01,
8.07159301e-02, -1.72426955e-01,  1.30423988e+00, -5.52141283e-01,
-1.10061753e+00,  1.68395420e+00,  1.17766843e+00, -9.74046092e-01,
1.59957324e+00,  1.76833517e+00,  1.65096892e-01,  1.89490661e+00,
-3.83379360e-01, -9.31855611e-01, -1.56471282e+00, -7.63093687e-01,
1.26204939e+00, -1.94442715e+00,  7.97954105e-01, -2.98998398e-01,
2.49477854e-01,  4.60430258e-01,  1.05109699e+00, -4.58555126e-02,
-2.07099859e+00, -1.60690331e+00,  3.33858815e-01,  2.14804949e+00,
2.19023997e+00, -1.26937946e+00,  1.47300180e+00,  1.22906411e-01,
2.91668334e-01, -2.56807917e-01,  1.00890651e+00, -1.52252234e+00,
8.40144586e-01, -6.36522245e-01,  7.55763624e-01,  5.02620739e-01,
3.85254492e-02,  2.10585901e+00,  1.55738276e+00,  9.24525548e-01,
2.69652574e+00, -3.41188879e-01,  6.71382662e-01,  2.99185911e+00,
1.13547795e+00, -6.78712726e-01, -1.22718898e+00,  3.76049296e-01,
4.38414498e+00,  1.21985891e+00, -1.30236474e-01,  5.44811220e-01,
-1.14280802e+00,  1.38862084e+00,  9.66716029e-01, -1.69128427e+00,
2.48557334e+00,  2.94966863e+00, -1.48033186e+00, -1.39595090e+00,
2.02147805e+00,  7.13573143e-01,  1.93709709e+00,  3.37157344e+00,
1.72614468e+00, -8.89665130e-01,  1.09328747e+00, -4.67760321e-01,
-1.43814138e+00, -5.94331764e-01,  1.34643036e+00, -5.09950802e-01,
-1.35376042e+00,  1.64176372e+00,  2.52776382e+00, -7.20903207e-01,
2.82309719e+00, -1.05842705e+00,  1.51519228e+00,  1.43081132e+00,
-1.18499850e+00, -1.31156994e+00,  3.11843055e+00,  1.85271613e+00,
2.44338286e+00,  3.32938296e+00,  1.02486218e+01,  4.13100209e+00,
4.72166883e+00,  2.86528767e+00,  3.79347825e+00])

Regarding the 'LotFrontage' column, 'NaN' values are indeed missing data. Since missing data percent for this feature is quiet hight (about 17%), we can choose to drop this column or replace 'NaN' values with 0's.

The missing data for Garage features won’t affect our work (about 5% of missing data). But we can get rid of Garage and Basement features and only keep the surface because they are not relevent when buying a house.

Finally, we can consider that the masonry veneer features are not essential and have no direct effect on the sale price (‘MasVnrArea’ is correlated with ‘GrLivArea’ and ‘TotRmsAbvGrd’ which are considered as main features).

We can now decide what features to keep and to delete:
• Bsmt features to delete: BsmtQual, BsmtCond, BsmtExposure, BsmtFinType1, BsmtFinSF1, BsmtFinType2, BsmtFinSF2, BsmtUnfSF, BsmtFullBath, BsmtHalfBath.
• Bsmt features to keep: TotalBsmtSF
• Garage features to delete: GarageType, GarageYrBlt, GarageFinish, GarageCars, GarageQual, GarageCond
• Garage features to keep: GarageArea
• Masonry veneer features to delete: MasVnrArea, MasVnrType
• Masonry veneer features to keep: None
• # fianl df after selecting necessary features
df_final = pd.concat([df_num, df_cat_one_hot, df_label_encoding], axis=1)
df_final.shape

(1460, 193)

# Extract training and testing data
df_train = df_final.iloc[:1200]
df_test = df_final.iloc[1200:]

from scipy import stats
from sklearn import preprocessing

X_train, y_train = df_train.values, df_train_target.values
X_test, y_test = df_test.values, df_test_target.values


## Data Labels

Let's explore the target feature by calculating it's skewness and visualizing the distribution of data.
print("Skewness: %f" % df_train_target.skew())
sns.distplot(df_train_target, fit=stats.norm)

Skewness: 1.967215

<matplotlib.axes._subplots.AxesSubplot at 0x7f9d6e95af40> Skewed data can reduce the performance of our predictive model. As a first step, we are going to apply the log transformation to the target field.
print("Skewness: %f" % np.log(df_train_target).skew())
sns.distplot(np.log(df_train_target), fit=stats.norm)

Skewness: 0.132714

<matplotlib.axes._subplots.AxesSubplot at 0x7f9d6e82a700> The skew ceofficient went from 1.96 to 0.13 by applying the log transformation to the target field. As we can see in the last visualization, the data is now much closer to a normal distribution.

When making predictions, we have to reverse the log transformation (i.e. apply exp transformation) in order to get the ‘SalePrice’. Or we can choose to transform the target label of the testing data.

Note: Some data preprocessing ideas were inspired from the House Pricing Kaggle Competition (https://www.kaggle.com/pmarcelino/comprehensive-data-exploration-with-python).
#Applying log transformation
y_train = np.log(y_train)
y_test = np.log(y_test)


# Prediction Model

from sklearn.linear_model import ElasticNet, Lasso,  BayesianRidge, LassoLarsIC
from sklearn.kernel_ridge import KernelRidge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.base import BaseEstimator, TransformerMixin, RegressorMixin, clone
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.metrics import mean_squared_error
import xgboost as xgb
import lightgbm as lgb
import numpy as np
import sklearn.datasets
import sklearn.metrics
from sklearn.model_selection import train_test_split
import tensorflow
import tensorflow.keras as keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
import optuna

We use cross validation error to compare our models.
# Define validation function
n_folds = 5

def rmsle_cv(model):
kf = KFold(n_folds, shuffle=True, random_state=42).get_n_splits(X_train)
rmse = np.sqrt(-cross_val_score(model, X_train, y_train, scoring="neg_mean_squared_error", cv = kf))
score = np.mean(rmse)
return(score)

def rmsle(y, y_pred):
return np.sqrt(mean_squared_error(y, y_pred))

Train different simple models for later use.
ENet = make_pipeline(RobustScaler(), ElasticNet(alpha=0.0005, l1_ratio=.9, random_state=3))
score = rmsle_cv(ENet)
print("Enet score is :", score)

Enet score is : 0.13682678960721764

KRR = KernelRidge(alpha=0.6, kernel='polynomial', degree=2, coef0=2.5)

score = rmsle_cv(KRR)
print("KRR score is :", score)

KRR score is : 0.13723828381390502

lasso = make_pipeline(RobustScaler(), Lasso(alpha =0.0005, random_state=1))
score = rmsle_cv(lasso)
print("Lasso score:", score)

Lasso score: 0.13675368241324085


## LightGBM + Optuna For Hyperparameters Optimization

Optuna is a library for bayesian hyper parameter optimisation.

It seeks the directions (parameters) that have the biggest impact on the loss of our model and follows this direction for choosing the hyperparameter for the next round.

Just give it the hyperparameters you want to update, how you want to sample from those (uniform / log uniform..) and an evaluation metric (score) and it will work on minimizing it for you.

dtrain = lgb.Dataset(X_train,label=y_train)
dtest = lgb.Dataset(X_test,label=y_test)

def objective(trial):

param = {
#"objective": "regression",
## "verbosity": -1,
#"device":"gpu",
"boostingtype":"goss",
"learning_rate":trial.suggest_loguniform("learning_rate",0.01,0.1),
'subsample': trial.suggest_uniform("subsample",0.5,1),
'feature_fraction':trial.suggest_uniform("subsample",0.5,1),
"num_leaves": int(trial.suggest_discrete_uniform("num_leaves", 2, 10,1)),
"n_estimators":int(trial.suggest_loguniform("n_estimators",300,1500)),
"min_data_in_leaf": int(trial.suggest_loguniform("min_data_in_leaf",4,40)),

}

# Add a callback for pruning.
pruning_callback = optuna.integration.LightGBMPruningCallback(trial, "root_mean_squared_error")

model = lgb.LGBMRegressor(**param)
pipe = make_pipeline(RobustScaler(),model)

score = rmsle_cv(pipe)
print("Score is", score)
return score

if __name__ == "__main__":
study = optuna.create_study(
pruner=optuna.pruners.MedianPruner(n_warmup_steps=200), direction="minimize")

study.optimize(objective, n_trials=10)

print("Number of finished trials: {}".format(len(study.trials)))

print("Best trial:")
best_trial = study.best_trial

print("  Value: {}".format(best_trial.value))

print("  Params: ")
for key, value in best_trial.params.items():
print("    {}: {}".format(key, value))


Score is 0.13600313032409703

[I 2020-05-13 00:02:26,245] Finished trial#0 with value: 0.13600313032409703 with parameters: {'learning_rate': 0.01923119462002535, 'subsample': 0.6626468518675733, 'num_leaves': 9.0, 'n_estimators': 647.6352032192165, 'min_data_in_leaf': 21.033077119746185}. Best is trial#0 with value: 0.13600313032409703.

Score is 0.13640568127865232

[I 2020-05-13 00:02:27,995] Finished trial#1 with value: 0.13640568127865232 with parameters: {'learning_rate': 0.024613463875544094, 'subsample': 0.8507784648943293, 'num_leaves': 10.0, 'n_estimators': 493.03789482908246, 'min_data_in_leaf': 14.362849260435436}. Best is trial#0 with value: 0.13600313032409703.

Score is 0.1349189391273471

[I 2020-05-13 00:02:28,664] Finished trial#2 with value: 0.1349189391273471 with parameters: {'learning_rate': 0.059348839250226805, 'subsample': 0.9945366417050793, 'num_leaves': 3.0, 'n_estimators': 390.90546012907805, 'min_data_in_leaf': 9.898292162804019}. Best is trial#2 with value: 0.1349189391273471.

Score is 0.13871747406835083

[I 2020-05-13 00:02:30,442] Finished trial#3 with value: 0.13871747406835083 with parameters: {'learning_rate': 0.069324825051667, 'subsample': 0.7523892885875515, 'num_leaves': 7.0, 'n_estimators': 1014.0907893209081, 'min_data_in_leaf': 8.00434817044987}. Best is trial#2 with value: 0.1349189391273471.

Score is 0.138256553664015

[I 2020-05-13 00:02:31,248] Finished trial#4 with value: 0.138256553664015 with parameters: {'learning_rate': 0.018382734588964663, 'subsample': 0.7498495471593678, 'num_leaves': 4.0, 'n_estimators': 414.69291940907084, 'min_data_in_leaf': 6.168345054282901}. Best is trial#2 with value: 0.1349189391273471.

Score is 0.14428494968450611

[I 2020-05-13 00:02:31,859] Finished trial#5 with value: 0.14428494968450611 with parameters: {'learning_rate': 0.056910426348052066, 'subsample': 0.8949690215207301, 'num_leaves': 2.0, 'n_estimators': 455.8345176759971, 'min_data_in_leaf': 38.51505322114788}. Best is trial#2 with value: 0.1349189391273471.

Score is 0.13541023518338582

[I 2020-05-13 00:02:33,110] Finished trial#6 with value: 0.13541023518338582 with parameters: {'learning_rate': 0.010969369772362703, 'subsample': 0.6904482246527368, 'num_leaves': 4.0, 'n_estimators': 960.8874925988346, 'min_data_in_leaf': 19.22499177034413}. Best is trial#2 with value: 0.1349189391273471.

Score is 0.13210578425376807

[I 2020-05-13 00:02:34,644] Finished trial#7 with value: 0.13210578425376807 with parameters: {'learning_rate': 0.02184369730127071, 'subsample': 0.5128444945002342, 'num_leaves': 5.0, 'n_estimators': 1187.1548228495471, 'min_data_in_leaf': 15.979845843774275}. Best is trial#7 with value: 0.13210578425376807.

Score is 0.1682655819291983

[I 2020-05-13 00:02:35,218] Finished trial#8 with value: 0.1682655819291983 with parameters: {'learning_rate': 0.017935162873548946, 'subsample': 0.5229929455018336, 'num_leaves': 2.0, 'n_estimators': 357.96774663786687, 'min_data_in_leaf': 13.255524740231005}. Best is trial#7 with value: 0.13210578425376807.

Score is 0.1362679938439728

[I 2020-05-13 00:02:36,435] Finished trial#9 with value: 0.1362679938439728 with parameters: {'learning_rate': 0.016582307774637164, 'subsample': 0.8659955203836165, 'num_leaves': 3.0, 'n_estimators': 1017.6045021394999, 'min_data_in_leaf': 5.488615281157206}. Best is trial#7 with value: 0.13210578425376807.

Number of finished trials: 10
Best trial:
Value: 0.13210578425376807
Params:
learning_rate: 0.02184369730127071
subsample: 0.5128444945002342
num_leaves: 5.0
n_estimators: 1187.1548228495471
min_data_in_leaf: 15.979845843774275

params = {}
for key, value in best_trial.params.items():
if value > 1:
params[key] = int(value)
else :
params[key] = value

## Little trick to have better accuracy : Divide the learning rate and double the number of learners by same constant (doesn't work here :'( )

params["learning_rate"] *= 0.5
params["n_estimators"] *= 2

GBM = make_pipeline(RobustScaler(), lgb.LGBMRegressor(**params))
rmsle_cv(GBM)


0.13391835487283882


## Stacking Models For Predictions

In this approach, we add a meta-model on averaged base models and use the out-of-folds predictions of these base models to train our meta-model.

The procedure, for the training part, may be described as follows:

• Split the total training set into two disjoint sets (here train and .holdout)
• Train several base models on the first part (train)
• Test these base models on the second part (holdout)

Use the predictions from 3) (called out-of-folds predictions) as the inputs, and the correct responses (target variable) as the outputs to train a higher level learner called meta-model.

The first three steps are done iteratively . If we take for example a 5-fold stacking , we first split the training data into 5 folds. Then we will do 5 iterations. In each iteration, we train every base model on 4 folds and predict on the remaining fold (holdout fold).

So, we will be sure, after 5 iterations , that the entire data is used to get out-of-folds predictions that we will then use as new feature to train our meta-model in the step 4.

For the prediction part , We average the predictions of all base models on the test data and used them as meta-features on which, the final prediction is done with the meta-model.

class MetaModel(BaseEstimator, RegressorMixin, TransformerMixin):
def __init__(self, base_models, meta_model, n_folds=5):
self.base_models = base_models
self.meta_model = meta_model
self.n_folds = n_folds

# We again fit the data on clones of the original models
def fit(self, X, y):
self.base_models_ = [list() for x in self.base_models]
self.meta_model_ = clone(self.meta_model)
kfold = KFold(n_splits=self.n_folds, shuffle=True, random_state=156)

# Train cloned base models then create out-of-fold predictions
# that are needed to train the cloned meta-model
out_of_fold_predictions = np.zeros((X.shape, len(self.base_models)))
for i, model in enumerate(self.base_models):
for train_index, holdout_index in kfold.split(X, y):
instance = clone(model)
self.base_models_[i].append(instance)
instance.fit(X[train_index], y[train_index])
y_pred = instance.predict(X[holdout_index])
out_of_fold_predictions[holdout_index, i] = y_pred

# Now train the cloned  meta-model using the out-of-fold predictions as new feature
self.meta_model_.fit(out_of_fold_predictions, y)
return self

#Do the predictions of all base models on the test data and use the averaged predictions as
#meta-features for the final prediction which is done by the meta-model
def predict(self, X):
meta_features = np.column_stack([
np.column_stack([model.predict(X) for model in base_models]).mean(axis=1)
for base_models in self.base_models_])
return self.meta_model_.predict(meta_features)

meta = MetaModel(base_models = (ENet, GBM, KRR), meta_model = lasso)

score = rmsle_cv(meta)
print("Score is ", score)

Score is  0.13132544460806347

meta.fit(X_train, y_train)
meta.base_models_
pred = meta.predict(X_test)
rmsle(pred, y_test)

0.15162988008351325


## Deep Learning Model

While implementing this modele we use some experimental features from Tensorflow Addons, tensorflow's library for the latest research results.

For this model we used the NovoGrad Optimizer which usesgradient descent method with layer-wise gradient normalization and decoupled weight decay adapted from this paper.

We wrap the optimizer in the lookahead wrapper, the algorithm chooses a search direction by looking ahead at the sequence of K fast weights generated by another optimizer.

We use an exponentially decreasing cyclical learning rate scheduler . Instead of monotonically decreasing the learning rate, this method lets the learning rate cyclically vary between reasonable boundary values.

We train the model using leaky ReLU activation and minimal dropout at the beginning of the net. Batchnormalisation to make loss function smoother. We also use early stopping with a patience of 200 epochs as we suppose that the model generalizes better at earlier stages of training as it didn’t overfit training set yet.

All these addons allow us to train a decent DNN with minimal hyperparameter tuning especially for learning rate.

import tensorflow_addons as tfa
from tensorflow.keras.layers import Dropout , BatchNormalization
from tensorflow.nn import leaky_relu
from tensorflow.keras import backend as K

model = Sequential()
n_cols = X_train.shape
print("N_cols=",n_cols)

def rmse(y_true, y_pred):
return K.sqrt(K.mean(K.square(y_pred - y_true)))
def LR(x):
return leaky_relu(x, alpha=0.1)

# Learning rate :

lr_schedule = tfa.optimizers.ExponentialCyclicalLearningRate(
initial_learning_rate=1e-4,
maximal_learning_rate=1e-2,
step_size=2000,
scale_mode="cycle",
gamma=0.96,
name="MyCyclicScheduler")

# Optimizer

##Early stopping
es = tensorflow.keras.callbacks.EarlyStopping(monitor='val_loss',restore_best_weights=True, patience=200)

model.compile(optimizer=opt, loss=rmse)

# train model
history = model.fit(X_train, y_train, verbose=0, batch_size=40, validation_split=0.2, epochs=1000, callbacks=[es])

# list all data in history
print(history.history.keys())

# summarize history for loss
plt.ylim([0,0.2])
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()

N_cols= 193
WARNING:tensorflow:From /home/mokhles/.local/lib/python3.8/site-packages/tensorflow/python/ops/resource_variable_ops.py:1813: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
dict_keys(['loss', 'val_loss']) y_pred = model.predict(X_test)
rmsle(y_pred, y_test)

0.13674783660080644


# Conclusion

For model choice we finally choose the DNN as it performs better on our Test dataset.

We conclude that a good DNN can beat a fine tuned MetaModel on this task.

DNN can deal better with categorical data with low cardinality than any other simple model.

In contrast to LGBM it can map complex interactions between features. ##### Mokhles Bouzaien
###### Master of Science in Engineering Student

A self-motivated graduate student in Engineering at IMT Atlantique.