House Pricing Prediction

House Pricing

Table of contents

  1. Introduction
  2. Data Preparation
    1. Numerical Fields
    2. Non-Numerical Fields
    3. Missing Data
    4. Data Labels
  3. Prediction Model
    1. LightGBM + Optuna For Hyperparameters Optimization
    2. Stacking Models For Predictions
    3. Deep Learning Model
  4. Conclusion

Introduction

During this challenge, we are going to predict the final price of each house. We are given 2 data sets: train and test. Each data set contain many features that we will explore, try to find eventual correlations beween them and select the most useful ones to predict the house price. Finally, we will test and compare a few models trained on this data in order to select the model with the best performance.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

Data preparation

First, we import the training and testing data. We will be using pandas library to manipulate and display data.
# Import the training and testing data to explore them
df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')
print("Training samples:", len(df_train))
print("Testing samples:", len(df_test))
df_train.head()
Training samples: 1200
Testing samples: 260

IdMSSubClassMSZoningLotFrontageLotAreaStreetAlleyLotShapeLandContourUtilities...PoolAreaPoolQCFenceMiscFeatureMiscValMoSoldYrSoldSaleTypeSaleConditionSalePrice
0160RL65.08450PaveNaNRegLvlAllPub...0NaNNaNNaN022008WDNormal208500
1220RL80.09600PaveNaNRegLvlAllPub...0NaNNaNNaN052007WDNormal181500
2360RL68.011250PaveNaNIR1LvlAllPub...0NaNNaNNaN092008WDNormal223500
3470RL60.09550PaveNaNIR1LvlAllPub...0NaNNaNNaN022006WDAbnorml140000
4560RL84.014260PaveNaNIR1LvlAllPub...0NaNNaNNaN0122008WDNormal250000

5 rows × 81 columns

This overview gave us an idea about the data structure we will be using as well as the different fields of the data set. We notice that there are so many features in this data set. So in the first part we will choose the most useful features to keep.

We start by dropping the 'Id' column which is totally useless for the prediction.
# Drop Id and extract target
df_train_target = df_train['SalePrice']
df_test_target = df_test['SalePrice']
df_train = df_train.drop(['Id','SalePrice'], axis=1)
df_test = df_test.drop(['Id','SalePrice'], axis=1)

Numrical Fields

A quick discription of the numerical fields is quiet useful to have an idea about our data (e.g. the data range, min/max, mean of each column, etc).
# Descriptive summary of numeric fields
df_train.describe()

MSSubClassLotFrontageLotAreaOverallQualOverallCondYearBuiltYearRemodAddMasVnrAreaBsmtFinSF1BsmtFinSF2...GarageAreaWoodDeckSFOpenPorchSFEnclosedPorch3SsnPorchScreenPorchPoolAreaMiscValMoSoldYrSold
count1200.000000990.0000001200.0000001200.0000001200.0000001200.0000001200.0000001194.0000001200.0000001200.000000...1200.0000001200.0000001200.0000001200.0000001200.0000001200.0000001200.0000001200.0000001200.0000001200.000000
mean57.07500070.08686910559.4116676.1050005.5683331971.3508331984.987500103.962312444.88666745.260000...472.60416795.13666746.01666722.1783333.65333314.9808331.90916740.4533336.3116672007.810833
std42.68201223.70202910619.1355491.3834391.12013830.04840820.527221183.534953439.987844158.931453...212.722444124.03412965.67762961.50732329.99109954.76805733.148327482.3234442.6731041.319027
min20.00000021.0000001300.0000001.0000001.0000001875.0000001950.0000000.0000000.0000000.000000...0.0000000.0000000.0000000.0000000.0000000.0000000.0000000.0000001.0000002006.000000
25%20.00000059.0000007560.0000005.0000005.0000001954.0000001967.0000000.0000000.0000000.000000...334.5000000.0000000.0000000.0000000.0000000.0000000.0000000.0000005.0000002007.000000
50%50.00000070.0000009434.5000006.0000005.0000001973.0000001994.0000000.000000385.5000000.000000...478.0000000.00000024.0000000.0000000.0000000.0000000.0000000.0000006.0000002008.000000
75%70.00000080.00000011616.0000007.0000006.0000002000.0000002004.000000166.750000712.2500000.000000...576.000000168.00000068.0000000.0000000.0000000.0000000.0000000.0000008.0000002009.000000
max190.000000313.000000215245.00000010.0000009.0000002010.0000002010.0000001600.0000002260.0000001474.000000...1390.000000857.000000523.000000552.000000508.000000410.000000648.00000015500.00000012.0000002010.000000

8 rows × 36 columns

The range of values of each column varies widely. So normalizing the data is an important step in order to avoid any scaling problem (e.g distances between points won't be dominated by features with high values) and also to speed up the training process (e.g. gradient descent will converge faster).
c_ = df_train.describe().columns
df_train[c_] = (df_train[c_] - df_train[c_].mean()) / df_train[c_].std()
df_test[c_] = (df_test[c_] - df_test[c_].mean()) / df_test[c_].std()

Correlation Matrix

As we can see we have so many features describing the data samples. Probably not all of these feature are importatnt to predict our target which is the selling price ('SalePrice').

So now, we are going to keep only columns which are relevent to this problem. So let's have a quick look to the correlation between different features using the correlation matrix.
#correlation matrix
corrmat = np.abs(pd.concat([df_train, df_train_target], axis=1).corr())
mask = np.zeros_like(corrmat)
mask[np.triu_indices_from(mask)] = True
with sns.axes_style("white"):
    f, ax = plt.subplots(figsize=(16, 12))
    ax = sns.heatmap(corrmat, mask=mask, square=True, cmap="YlGnBu")

png

This heatmap gave us a great overview of our data as well as the relationship between different features.

We notice that there are many darkblue-colored squares: there are obvious correlations such as between the number of rooms at each level and the surface of that level (ground or higher floors), between surfaces of different levels, and also between surface and capacity of garages.

These columns give almost the same information, so we can get rit of extra ones.

We can also look to features that are directly correlated with the 'SalePrice': we get 'GrLivArea', 'TotalBsmtSF', and 'OverallQual'.
# SalePrice correlation matrix
n = 10
cols = corrmat.nlargest(n, 'SalePrice')['SalePrice'].index # List of highly correlated colums with 'SalePrice'
cm = np.corrcoef(pd.concat([df_train, df_train_target], axis=1)[cols].values.T)
mask = np.zeros_like(cm)
mask[np.triu_indices_from(cm)] = True
with sns.axes_style("white"):
    f, ax = plt.subplots(figsize=(12, 9))
    ax = sns.heatmap(cm, 
                     mask=mask, 
                     cbar=True, 
                     annot=True, 
                     square=True, 
                     fmt='.2f', 
                     annot_kws={'size': 10}, 
                     yticklabels=cols.values, 
                     xticklabels=cols.values,
                     cmap="YlGnBu")
    plt.show()

png

To sum up:
  • 'GrLivArea', 'TotalBsmtSF' and 'OverallQual' are highly correlated with 'SalePrice'.
  • '1stFlrSF' and 'TotalBsmtSF': the first floor square feet and the total square feet of basement area are obviously highly correlated. We can keep only 'TotalBsmtSF'.
  • 'TotRmsAbvGrd' and 'GrLivArea': total rooms above grade and above grade living area square feet. 'GrLivArea' to be kept.
  • 'YearBuilt' and 'GarageYrBlt': it's probably the same year of building the house and the garage. 'GarageYrBlt' to be dropped.
  • 'GarageArea' and 'GarageCars': the surface of the garage and it's capacity are highly related (obvious).
  • Scatter Plot

    sns.set()
    cols = ['SalePrice', 'OverallQual', 'GrLivArea', 'TotalBsmtSF', '1stFlrSF', 'TotRmsAbvGrd', 'YearBuilt', 'GarageYrBlt', 'GarageCars', 'GarageArea']
    sns.pairplot(pd.concat([df_train, df_train_target], axis=1)[cols], size = 2.5)
    plt.show();
    

    png

    As previously discussed, the sale price is linearly correlated with the material rates, ground and basement surface.

    we can see an exponential relation between sale price and build year especially during the recent year.

    Garage build year and build year are related but not always equal.

    Garage capacity is a result of its area.

    # Numerical columns to keep
    numerical_cols = ['GrLivArea',
                      'TotalBsmtSF',
                      'OverallQual',
                      'TotalBsmtSF',
                      'GrLivArea',
                      'YearBuilt',
                      'GarageArea',
                      'FullBath',
                      'MSSubClass',
                      'YearRemodAdd']
    
    df_train_num = df_train[numerical_cols]
    df_test_num = df_test[numerical_cols]
    print("Training data:", df_train_num.shape)
    print("Testing data:", df_test_num.shape)
    
    Training data: (1200, 10)
    Testing data: (260, 10)
    
    df_num = pd.concat([df_train_num, df_test_num], ignore_index=True)
    

    Non-Numerical/Categorical Fields

    In this part, we will treat non-numerical feature.

    The first step is to find eventual correlations between these fields and if there are interesting relations with ‘SalePrice’ target.

    Then, we will convert them to numerical fields using the One-Hot-Encoding and the Label Encoding methods. This step is very important in order to be able to train our model later.

    Label Encoding will be used for fields with comparable values, e.g. ‘GarageQual’ whose values can be converted to 5, 4, 3, 2, 1 and 0 for Ex (Excellent), Gd (Good), TA (Typical/Average), Fa (Fair), Po (Poor) and NA (No Garage) respectively.

    One-Hot-Encoding will be used for other categorical fields whose values are not comparable, e.g. ‘MSZoning’ for zone classification.

    # Print fields' types
    df_train.dtypes.unique()
    
    array([dtype('float64'), dtype('O')], dtype=object)
    
    # Extracting non-nemeric fields
    df_train_non_num = df_train.select_dtypes(include=['O'])
    df_train_non_num.head()
    

    MSZoningStreetAlleyLotShapeLandContourUtilitiesLotConfigLandSlopeNeighborhoodCondition1...GarageTypeGarageFinishGarageQualGarageCondPavedDrivePoolQCFenceMiscFeatureSaleTypeSaleCondition
    0RLPaveNaNRegLvlAllPubInsideGtlCollgCrNorm...AttchdRFnTATAYNaNNaNNaNWDNormal
    1RLPaveNaNRegLvlAllPubFR2GtlVeenkerFeedr...AttchdRFnTATAYNaNNaNNaNWDNormal
    2RLPaveNaNIR1LvlAllPubInsideGtlCollgCrNorm...AttchdRFnTATAYNaNNaNNaNWDNormal
    3RLPaveNaNIR1LvlAllPubCornerGtlCrawforNorm...DetchdUnfTATAYNaNNaNNaNWDAbnorml
    4RLPaveNaNIR1LvlAllPubFR2GtlNoRidgeNorm...AttchdRFnTATAYNaNNaNNaNWDNormal

    5 rows × 43 columns

    df_train_non_num.nunique()
    
    MSZoning          5
    Street            2
    Alley             2
    LotShape          4
    LandContour       4
    Utilities         2
    LotConfig         5
    LandSlope         3
    Neighborhood     25
    Condition1        9
    Condition2        7
    BldgType          5
    HouseStyle        8
    RoofStyle         5
    RoofMatl          6
    Exterior1st      14
    Exterior2nd      15
    MasVnrType        4
    ExterQual         4
    ExterCond         5
    Foundation        6
    BsmtQual          4
    BsmtCond          4
    BsmtExposure      4
    BsmtFinType1      6
    BsmtFinType2      6
    Heating           4
    HeatingQC         5
    CentralAir        2
    Electrical        5
    KitchenQual       4
    Functional        7
    FireplaceQu       5
    GarageType        6
    GarageFinish      3
    GarageQual        5
    GarageCond        5
    PavedDrive        3
    PoolQC            3
    Fence             4
    MiscFeature       3
    SaleType          9
    SaleCondition     6
    dtype: int64
    
    non_num_corr = pd.concat([df_train_non_num.apply(lambda x: x.factorize()[0]), df_train_target], axis=1).corr()
    mask = np.zeros_like(non_num_corr)
    mask[np.triu_indices_from(mask)] = True
    with sns.axes_style("white"):
        f, ax = plt.subplots(figsize=(16, 12))
        ax = sns.heatmap(non_num_corr, mask=mask, square=True, cmap="YlGnBu")
    

    png

    # columns to convert to numerical values
    label_encoding_cols = ['Utilities',
                    'LandSlope',
                    'ExterQual',
                    'ExterCond',
                    'BsmtQual',
                    'BsmtCond',
                    'BsmtExposure',
                    'BsmtFinType1',
                    'BsmtFinType2',
                    'HeatingQC',
                    'CentralAir',
                    'KitchenQual',
                    'Functional',
                    'FireplaceQu',
                    'GarageType',
                    'GarageFinish',
                    'GarageQual',
                    'GarageCond',
                    'PavedDrive',
                    'PoolQC',
                    'Fence']
    one_hot_cols = ['MSZoning',
                           'Street',
                           'Alley',
                           'LotShape',
                           'LandContour',
                           'LotConfig',
                           'Neighborhood',
                           'Condition1',
                           'Condition2',
                           'BldgType',
                           'HouseStyle',
                           'RoofStyle',
                           'RoofMatl',
                           'Exterior1st',
                           'Exterior2nd',
                           'MasVnrType',
                           'Foundation',
                           'Heating',
                           'Electrical',
                           'MiscFeature',
                           'SaleType',
                           'SaleCondition']
    
    all_data = pd.concat([df_train, df_test], ignore_index=True)
    all_data.shape
    
    (1460, 79)
    
    # Convert categorical variable into indicator variables
    df_cat_one_hot = pd.get_dummies(all_data[one_hot_cols], dummy_na=True, columns=one_hot_cols, drop_first=True)
    df_cat_one_hot.shape
    
    (1460, 162)
    
    from sklearn.preprocessing import LabelEncoder, OneHotEncoder
    
    df_label_encoding = pd.DataFrame(columns=label_encoding_cols)
    for col in label_encoding_cols:
      labelencoder = LabelEncoder()
      df_label_encoding[col] = labelencoder.fit_transform(all_data[col].astype(str))
      # print(labelencoder.fit_transform(df_train[col].astype(str)))
    df_label_encoding.head()
    

    UtilitiesLandSlopeExterQualExterCondBsmtQualBsmtCondBsmtExposureBsmtFinType1BsmtFinType2HeatingQC...KitchenQualFunctionalFireplaceQuGarageTypeGarageFinishGarageQualGarageCondPavedDrivePoolQCFence
    00024233250...2651144234
    10034231050...3641144234
    20024232250...2641144234
    30034313052...2625244234
    40024230250...2641144234

    5 rows × 21 columns

    Missing Data

    # Missing Data
    total = df_train.isnull().sum().sort_values(ascending=False)
    percent = (df_train.isnull().sum()/df_train.isnull().count()).sort_values(ascending=False)
    missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
    missing_data.head(20)
    

    TotalPercent
    PoolQC11960.996667
    MiscFeature11530.960833
    Alley11250.937500
    Fence9730.810833
    FireplaceQu5640.470000
    LotFrontage2100.175000
    GarageType670.055833
    GarageFinish670.055833
    GarageQual670.055833
    GarageCond670.055833
    GarageYrBlt670.055833
    BsmtExposure330.027500
    BsmtFinType2330.027500
    BsmtCond320.026667
    BsmtFinType1320.026667
    BsmtQual320.026667
    MasVnrArea60.005000
    MasVnrType60.005000
    RoofStyle00.000000
    RoofMatl00.000000
    After reading the full description of each column available in the description file, we make sure that 'NaN' values are not missing data, but usually refer to a missing feature in the house, e.g. a 'NaN' value in 'PoolQC' colums means "no pool in the house". This is the case for the 5 columns with the highest missing data percent (PoolQC, MiscFeature, Alley, Fence, FireplaceQu).

    So we will keep the missing values for these columns because they give us information about the house features.

    df_train['LotFrontage'].unique()
    
    array([-2.14617436e-01,  4.18239777e-01, -8.80459934e-02, -4.25569840e-01,
            5.87001701e-01,  6.29192182e-01,  2.07287373e-01,             nan,
           -8.05284168e-01, -8.47474649e-01, -3.66503167e-03,  8.82335067e-01,
            8.07159301e-02, -1.72426955e-01,  1.30423988e+00, -5.52141283e-01,
           -1.10061753e+00,  1.68395420e+00,  1.17766843e+00, -9.74046092e-01,
            1.59957324e+00,  1.76833517e+00,  1.65096892e-01,  1.89490661e+00,
           -3.83379360e-01, -9.31855611e-01, -1.56471282e+00, -7.63093687e-01,
            1.26204939e+00, -1.94442715e+00,  7.97954105e-01, -2.98998398e-01,
            2.49477854e-01,  4.60430258e-01,  1.05109699e+00, -4.58555126e-02,
           -2.07099859e+00, -1.60690331e+00,  3.33858815e-01,  2.14804949e+00,
            2.19023997e+00, -1.26937946e+00,  1.47300180e+00,  1.22906411e-01,
            2.91668334e-01, -2.56807917e-01,  1.00890651e+00, -1.52252234e+00,
            8.40144586e-01, -6.36522245e-01,  7.55763624e-01,  5.02620739e-01,
            3.85254492e-02,  2.10585901e+00,  1.55738276e+00,  9.24525548e-01,
            2.69652574e+00, -3.41188879e-01,  6.71382662e-01,  2.99185911e+00,
            1.13547795e+00, -6.78712726e-01, -1.22718898e+00,  3.76049296e-01,
            4.38414498e+00,  1.21985891e+00, -1.30236474e-01,  5.44811220e-01,
           -1.14280802e+00,  1.38862084e+00,  9.66716029e-01, -1.69128427e+00,
            2.48557334e+00,  2.94966863e+00, -1.48033186e+00, -1.39595090e+00,
            2.02147805e+00,  7.13573143e-01,  1.93709709e+00,  3.37157344e+00,
            1.72614468e+00, -8.89665130e-01,  1.09328747e+00, -4.67760321e-01,
           -1.43814138e+00, -5.94331764e-01,  1.34643036e+00, -5.09950802e-01,
           -1.35376042e+00,  1.64176372e+00,  2.52776382e+00, -7.20903207e-01,
            2.82309719e+00, -1.05842705e+00,  1.51519228e+00,  1.43081132e+00,
           -1.18499850e+00, -1.31156994e+00,  3.11843055e+00,  1.85271613e+00,
            2.44338286e+00,  3.32938296e+00,  1.02486218e+01,  4.13100209e+00,
            4.72166883e+00,  2.86528767e+00,  3.79347825e+00])
    
    Regarding the 'LotFrontage' column, 'NaN' values are indeed missing data. Since missing data percent for this feature is quiet hight (about 17%), we can choose to drop this column or replace 'NaN' values with 0's.

    The missing data for Garage features won’t affect our work (about 5% of missing data). But we can get rid of Garage and Basement features and only keep the surface because they are not relevent when buying a house.

    Finally, we can consider that the masonry veneer features are not essential and have no direct effect on the sale price (‘MasVnrArea’ is correlated with ‘GrLivArea’ and ‘TotRmsAbvGrd’ which are considered as main features).

    We can now decide what features to keep and to delete:
  • Bsmt features to delete: BsmtQual, BsmtCond, BsmtExposure, BsmtFinType1, BsmtFinSF1, BsmtFinType2, BsmtFinSF2, BsmtUnfSF, BsmtFullBath, BsmtHalfBath.
  • Bsmt features to keep: TotalBsmtSF
  • Garage features to delete: GarageType, GarageYrBlt, GarageFinish, GarageCars, GarageQual, GarageCond
  • Garage features to keep: GarageArea
  • Masonry veneer features to delete: MasVnrArea, MasVnrType
  • Masonry veneer features to keep: None
  • # fianl df after selecting necessary features
    df_final = pd.concat([df_num, df_cat_one_hot, df_label_encoding], axis=1)
    df_final.shape
    
    (1460, 193)
    
    # Extract training and testing data
    df_train = df_final.iloc[:1200]
    df_test = df_final.iloc[1200:]
    
    from scipy import stats
    from sklearn import preprocessing
    
    X_train, y_train = df_train.values, df_train_target.values
    X_test, y_test = df_test.values, df_test_target.values
    

    Data Labels

    Let's explore the target feature by calculating it's skewness and visualizing the distribution of data.
    print("Skewness: %f" % df_train_target.skew())
    sns.distplot(df_train_target, fit=stats.norm)
    
    Skewness: 1.967215
    
    
    
    
    
    <matplotlib.axes._subplots.AxesSubplot at 0x7f9d6e95af40>
    

    png

    Skewed data can reduce the performance of our predictive model. As a first step, we are going to apply the log transformation to the target field.
    print("Skewness: %f" % np.log(df_train_target).skew())
    sns.distplot(np.log(df_train_target), fit=stats.norm)
    
    Skewness: 0.132714
    
    
    
    
    
    <matplotlib.axes._subplots.AxesSubplot at 0x7f9d6e82a700>
    

    png

    The skew ceofficient went from 1.96 to 0.13 by applying the log transformation to the target field. As we can see in the last visualization, the data is now much closer to a normal distribution.

    When making predictions, we have to reverse the log transformation (i.e. apply exp transformation) in order to get the ‘SalePrice’. Or we can choose to transform the target label of the testing data.

    Note: Some data preprocessing ideas were inspired from the House Pricing Kaggle Competition (https://www.kaggle.com/pmarcelino/comprehensive-data-exploration-with-python).
    #Applying log transformation
    y_train = np.log(y_train)
    y_test = np.log(y_test)
    

    Prediction Model

    from sklearn.linear_model import ElasticNet, Lasso,  BayesianRidge, LassoLarsIC
    from sklearn.ensemble import RandomForestRegressor,  GradientBoostingRegressor
    from sklearn.kernel_ridge import KernelRidge
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import RobustScaler
    from sklearn.base import BaseEstimator, TransformerMixin, RegressorMixin, clone
    from sklearn.model_selection import KFold, cross_val_score, train_test_split
    from sklearn.metrics import mean_squared_error
    import xgboost as xgb
    import lightgbm as lgb
    import numpy as np
    import sklearn.datasets
    import sklearn.metrics
    from sklearn.model_selection import train_test_split
    import tensorflow 
    import tensorflow.keras as keras
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense
    import optuna
    
    We use cross validation error to compare our models.
    # Define validation function
    n_folds = 5
    
    def rmsle_cv(model):
        kf = KFold(n_folds, shuffle=True, random_state=42).get_n_splits(X_train)
        rmse = np.sqrt(-cross_val_score(model, X_train, y_train, scoring="neg_mean_squared_error", cv = kf))
        score = np.mean(rmse)
        return(score)
      
    def rmsle(y, y_pred):
        return np.sqrt(mean_squared_error(y, y_pred))
    
    Train different simple models for later use.
    ENet = make_pipeline(RobustScaler(), ElasticNet(alpha=0.0005, l1_ratio=.9, random_state=3))
    score = rmsle_cv(ENet)
    print("Enet score is :", score)
    
    Enet score is : 0.13682678960721764
    
    KRR = KernelRidge(alpha=0.6, kernel='polynomial', degree=2, coef0=2.5)
    
    score = rmsle_cv(KRR)
    print("KRR score is :", score)
    
    KRR score is : 0.13723828381390502
    
    lasso = make_pipeline(RobustScaler(), Lasso(alpha =0.0005, random_state=1))
    score = rmsle_cv(lasso)
    print("Lasso score:", score)
    
    Lasso score: 0.13675368241324085
    

    LightGBM + Optuna For Hyperparameters Optimization

    Optuna is a library for bayesian hyper parameter optimisation.

    It seeks the directions (parameters) that have the biggest impact on the loss of our model and follows this direction for choosing the hyperparameter for the next round.

    Just give it the hyperparameters you want to update, how you want to sample from those (uniform / log uniform..) and an evaluation metric (score) and it will work on minimizing it for you.

    dtrain = lgb.Dataset(X_train,label=y_train)
    dtest = lgb.Dataset(X_test,label=y_test)
    
    
    def objective(trial):
     
    
        
        param = {
            #"objective": "regression",
            ## "verbosity": -1,
            #"device":"gpu",
            "boostingtype":"goss",
            "learning_rate":trial.suggest_loguniform("learning_rate",0.01,0.1),
            'subsample': trial.suggest_uniform("subsample",0.5,1),
            'feature_fraction':trial.suggest_uniform("subsample",0.5,1),
            "num_leaves": int(trial.suggest_discrete_uniform("num_leaves", 2, 10,1)),
            "n_estimators":int(trial.suggest_loguniform("n_estimators",300,1500)),
            "min_data_in_leaf": int(trial.suggest_loguniform("min_data_in_leaf",4,40)),
          
        }
    
        # Add a callback for pruning.
        pruning_callback = optuna.integration.LightGBMPruningCallback(trial, "root_mean_squared_error")
    
        model = lgb.LGBMRegressor(**param)
        pipe = make_pipeline(RobustScaler(),model)
        
        score = rmsle_cv(pipe)
        print("Score is", score)
        return score
      
    
    
    if __name__ == "__main__":
        study = optuna.create_study(
            pruner=optuna.pruners.MedianPruner(n_warmup_steps=200), direction="minimize")
        
        study.optimize(objective, n_trials=10)
    
        print("Number of finished trials: {}".format(len(study.trials)))
    
        print("Best trial:")
        best_trial = study.best_trial
    
        print("  Value: {}".format(best_trial.value))
    
        print("  Params: ")
        for key, value in best_trial.params.items():
            print("    {}: {}".format(key, value))
        
    
    Score is 0.13600313032409703
    
    
    [I 2020-05-13 00:02:26,245] Finished trial#0 with value: 0.13600313032409703 with parameters: {'learning_rate': 0.01923119462002535, 'subsample': 0.6626468518675733, 'num_leaves': 9.0, 'n_estimators': 647.6352032192165, 'min_data_in_leaf': 21.033077119746185}. Best is trial#0 with value: 0.13600313032409703.
    
    
    Score is 0.13640568127865232
    
    
    [I 2020-05-13 00:02:27,995] Finished trial#1 with value: 0.13640568127865232 with parameters: {'learning_rate': 0.024613463875544094, 'subsample': 0.8507784648943293, 'num_leaves': 10.0, 'n_estimators': 493.03789482908246, 'min_data_in_leaf': 14.362849260435436}. Best is trial#0 with value: 0.13600313032409703.
    
    
    Score is 0.1349189391273471
    
    
    [I 2020-05-13 00:02:28,664] Finished trial#2 with value: 0.1349189391273471 with parameters: {'learning_rate': 0.059348839250226805, 'subsample': 0.9945366417050793, 'num_leaves': 3.0, 'n_estimators': 390.90546012907805, 'min_data_in_leaf': 9.898292162804019}. Best is trial#2 with value: 0.1349189391273471.
    
    
    Score is 0.13871747406835083
    
    
    [I 2020-05-13 00:02:30,442] Finished trial#3 with value: 0.13871747406835083 with parameters: {'learning_rate': 0.069324825051667, 'subsample': 0.7523892885875515, 'num_leaves': 7.0, 'n_estimators': 1014.0907893209081, 'min_data_in_leaf': 8.00434817044987}. Best is trial#2 with value: 0.1349189391273471.
    
    
    Score is 0.138256553664015
    
    
    [I 2020-05-13 00:02:31,248] Finished trial#4 with value: 0.138256553664015 with parameters: {'learning_rate': 0.018382734588964663, 'subsample': 0.7498495471593678, 'num_leaves': 4.0, 'n_estimators': 414.69291940907084, 'min_data_in_leaf': 6.168345054282901}. Best is trial#2 with value: 0.1349189391273471.
    
    
    Score is 0.14428494968450611
    
    
    [I 2020-05-13 00:02:31,859] Finished trial#5 with value: 0.14428494968450611 with parameters: {'learning_rate': 0.056910426348052066, 'subsample': 0.8949690215207301, 'num_leaves': 2.0, 'n_estimators': 455.8345176759971, 'min_data_in_leaf': 38.51505322114788}. Best is trial#2 with value: 0.1349189391273471.
    
    
    Score is 0.13541023518338582
    
    
    [I 2020-05-13 00:02:33,110] Finished trial#6 with value: 0.13541023518338582 with parameters: {'learning_rate': 0.010969369772362703, 'subsample': 0.6904482246527368, 'num_leaves': 4.0, 'n_estimators': 960.8874925988346, 'min_data_in_leaf': 19.22499177034413}. Best is trial#2 with value: 0.1349189391273471.
    
    
    Score is 0.13210578425376807
    
    
    [I 2020-05-13 00:02:34,644] Finished trial#7 with value: 0.13210578425376807 with parameters: {'learning_rate': 0.02184369730127071, 'subsample': 0.5128444945002342, 'num_leaves': 5.0, 'n_estimators': 1187.1548228495471, 'min_data_in_leaf': 15.979845843774275}. Best is trial#7 with value: 0.13210578425376807.
    
    
    Score is 0.1682655819291983
    
    
    [I 2020-05-13 00:02:35,218] Finished trial#8 with value: 0.1682655819291983 with parameters: {'learning_rate': 0.017935162873548946, 'subsample': 0.5229929455018336, 'num_leaves': 2.0, 'n_estimators': 357.96774663786687, 'min_data_in_leaf': 13.255524740231005}. Best is trial#7 with value: 0.13210578425376807.
    
    
    Score is 0.1362679938439728
    
    
    [I 2020-05-13 00:02:36,435] Finished trial#9 with value: 0.1362679938439728 with parameters: {'learning_rate': 0.016582307774637164, 'subsample': 0.8659955203836165, 'num_leaves': 3.0, 'n_estimators': 1017.6045021394999, 'min_data_in_leaf': 5.488615281157206}. Best is trial#7 with value: 0.13210578425376807.
    
    
    Number of finished trials: 10
    Best trial:
      Value: 0.13210578425376807
      Params: 
        learning_rate: 0.02184369730127071
        subsample: 0.5128444945002342
        num_leaves: 5.0
        n_estimators: 1187.1548228495471
        min_data_in_leaf: 15.979845843774275
    
    params = {}
    for key, value in best_trial.params.items():
      if value > 1:
        params[key] = int(value)
      else :
        params[key] = value
    
    ## Little trick to have better accuracy : Divide the learning rate and double the number of learners by same constant (doesn't work here :'( )
    
    params["learning_rate"] *= 0.5
    params["n_estimators"] *= 2
    
    GBM = make_pipeline(RobustScaler(), lgb.LGBMRegressor(**params))
    rmsle_cv(GBM)
    
    
    0.13391835487283882
    

    Stacking Models For Predictions

    In this approach, we add a meta-model on averaged base models and use the out-of-folds predictions of these base models to train our meta-model.

    The procedure, for the training part, may be described as follows:

    • Split the total training set into two disjoint sets (here train and .holdout)
    • Train several base models on the first part (train)
    • Test these base models on the second part (holdout)

    Use the predictions from 3) (called out-of-folds predictions) as the inputs, and the correct responses (target variable) as the outputs to train a higher level learner called meta-model.

    The first three steps are done iteratively . If we take for example a 5-fold stacking , we first split the training data into 5 folds. Then we will do 5 iterations. In each iteration, we train every base model on 4 folds and predict on the remaining fold (holdout fold).

    So, we will be sure, after 5 iterations , that the entire data is used to get out-of-folds predictions that we will then use as new feature to train our meta-model in the step 4.

    For the prediction part , We average the predictions of all base models on the test data and used them as meta-features on which, the final prediction is done with the meta-model.

    class MetaModel(BaseEstimator, RegressorMixin, TransformerMixin):
        def __init__(self, base_models, meta_model, n_folds=5):
            self.base_models = base_models
            self.meta_model = meta_model
            self.n_folds = n_folds
       
        # We again fit the data on clones of the original models
        def fit(self, X, y):
            self.base_models_ = [list() for x in self.base_models]
            self.meta_model_ = clone(self.meta_model)
            kfold = KFold(n_splits=self.n_folds, shuffle=True, random_state=156)
            
            # Train cloned base models then create out-of-fold predictions
            # that are needed to train the cloned meta-model
            out_of_fold_predictions = np.zeros((X.shape[0], len(self.base_models)))
            for i, model in enumerate(self.base_models):
                for train_index, holdout_index in kfold.split(X, y):
                    instance = clone(model)
                    self.base_models_[i].append(instance)
                    instance.fit(X[train_index], y[train_index])
                    y_pred = instance.predict(X[holdout_index])
                    out_of_fold_predictions[holdout_index, i] = y_pred
                    
            # Now train the cloned  meta-model using the out-of-fold predictions as new feature
            self.meta_model_.fit(out_of_fold_predictions, y)
            return self
       
     
            #Do the predictions of all base models on the test data and use the averaged predictions as 
        #meta-features for the final prediction which is done by the meta-model
        def predict(self, X):
            meta_features = np.column_stack([
                np.column_stack([model.predict(X) for model in base_models]).mean(axis=1)
                for base_models in self.base_models_])
            return self.meta_model_.predict(meta_features)
    
    meta = MetaModel(base_models = (ENet, GBM, KRR), meta_model = lasso)
    
    score = rmsle_cv(meta)
    print("Score is ", score)
    
    Score is  0.13132544460806347
    
    meta.fit(X_train, y_train)
    meta.base_models_
    pred = meta.predict(X_test)
    rmsle(pred, y_test)
    
    0.15162988008351325
    

    Deep Learning Model

    While implementing this modele we use some experimental features from Tensorflow Addons, tensorflow's library for the latest research results.

    For this model we used the NovoGrad Optimizer which usesgradient descent method with layer-wise gradient normalization and decoupled weight decay adapted from this paper.

    We wrap the optimizer in the lookahead wrapper, the algorithm chooses a search direction by looking ahead at the sequence of K fast weights generated by another optimizer.

    We use an exponentially decreasing cyclical learning rate scheduler . Instead of monotonically decreasing the learning rate, this method lets the learning rate cyclically vary between reasonable boundary values.

    We train the model using leaky ReLU activation and minimal dropout at the beginning of the net. Batchnormalisation to make loss function smoother. We also use early stopping with a patience of 200 epochs as we suppose that the model generalizes better at earlier stages of training as it didn’t overfit training set yet.

    All these addons allow us to train a decent DNN with minimal hyperparameter tuning especially for learning rate.

    import tensorflow_addons as tfa
    from tensorflow.keras.layers import Dropout , BatchNormalization 
    from tensorflow.nn import leaky_relu
    from tensorflow.keras import backend as K
    from tensorflow_addons.optimizers.novograd import NovoGrad
    
    model = Sequential()
    n_cols = X_train.shape[1]
    print("N_cols=",n_cols)
    
    def rmse(y_true, y_pred):
            return K.sqrt(K.mean(K.square(y_pred - y_true))) 
    def LR(x):
        return leaky_relu(x, alpha=0.1)
    
    model.add(Dense(600, activation=LR, input_shape= (n_cols,)))
    model.add(BatchNormalization())
    model.add(Dropout(0.15))
    model.add(Dense(400, activation=LR))
    model.add(BatchNormalization())
    model.add(Dropout(0.1))
    model.add(Dense(300, activation=LR))
    model.add(BatchNormalization())
    model.add(Dropout(0.1))
    model.add(Dense(300, activation=LR))
    model.add(BatchNormalization())
    model.add(Dense(200, activation=LR))
    model.add(BatchNormalization())
    model.add(Dense(150, activation=LR))
    model.add(BatchNormalization())
    model.add(Dense(1))
    
    # Learning rate :
    
    lr_schedule = tfa.optimizers.ExponentialCyclicalLearningRate(
                initial_learning_rate=1e-4,
                maximal_learning_rate=1e-2,
                step_size=2000,
                scale_mode="cycle",
                gamma=0.96,
                name="MyCyclicScheduler")
    
    # Optimizer
    
    opt = NovoGrad(learning_rate=lr_schedule)
    opt = tfa.optimizers.Lookahead(opt) ##WRAP OPTIMIZER WITH LookAhead layer 
    ##Early stopping
    es = tensorflow.keras.callbacks.EarlyStopping(monitor='val_loss',restore_best_weights=True, patience=200)
    
    model.compile(optimizer=opt, loss=rmse)
    
    # train model
    history = model.fit(X_train, y_train, verbose=0, batch_size=40, validation_split=0.2, epochs=1000, callbacks=[es])
    
    # list all data in history
    print(history.history.keys())
    
    # summarize history for loss
    plt.ylim([0,0.2])
    plt.plot(history.history['loss'])
    plt.plot(history.history['val_loss'])
    plt.title('model loss')
    plt.ylabel('loss')
    plt.xlabel('epoch')
    plt.legend(['train', 'test'], loc='upper left')
    plt.show()
    
    N_cols= 193
    WARNING:tensorflow:From /home/mokhles/.local/lib/python3.8/site-packages/tensorflow/python/ops/resource_variable_ops.py:1813: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
    Instructions for updating:
    If using Keras pass *_constraint arguments to layers.
    dict_keys(['loss', 'val_loss'])
    

    png

    y_pred = model.predict(X_test)
    rmsle(y_pred, y_test)
    
    0.13674783660080644
    

    Conclusion

    For model choice we finally choose the DNN as it performs better on our Test dataset.

    We conclude that a good DNN can beat a fine tuned MetaModel on this task.

    DNN can deal better with categorical data with low cardinality than any other simple model.

    In contrast to LGBM it can map complex interactions between features.

    Mokhles Bouzaien
    Mokhles Bouzaien
    Master of Science in Engineering Student

    A self-motivated graduate student in Engineering at IMT Atlantique.

    Next
    Previous