Building feature engineering pipelines

Darryl Buswell

So you've exhausted all of your sources of data, but your model performance is still lagging. You could of course start tweaking parameters, but why not build a solid feature engineering pipeline?

Data scientists are always looking for ways to improve model performance. Of course, getting your hands on more data, trying different model types, and tweaking model parameters are all good options to get that better model fit. But what about feature engineering? And better yet, what about building a solid feature engineering pipeline? Just like most things with data science, there is a right way and a wrong way to approach feature engineering. So let's walk through an example and call out some common gotchas.

What is this all about?

So, what is feature engineering and what are pipelines? Well, when a data scientist approaches a problem where they need to predict something, like property prices, they will often use a technique called supervised machine learning. There is a whole range of model types which fall under the supervised ML umbrella, but they all have one thing in common: they take input data, which we call features, and learn to predict an output, or target.

Now, when data scientists approach these problems, they will try a whole range of things to get the best possible model performance. Obtaining additional data is an obvious first step, but they may also try a set of different ML model types, or vary the parameters of those models. Beyond these measures, however, there is another commonly used technique for model development called feature engineering. This technique involves taking the raw input data and manipulating it to produce derived representations, or abstractions, of that data. These abstractions can be fed into the model, even alongside the raw input data, to help improve model performance.

It may seem like an odd concept, as we aren't obtaining additional data; we are simply creating additional features from the data we already have. But in many cases, these additional features give the model more flexibility to find a better fit and therefore better performance. This does raise some potential issues, however. The most common comes from an incorrect application of feature engineering, which has the potential to cause data leakage. Without overcomplicating things, this problem occurs when information from the data set aside to test model performance leaks into the process used to train the model. It's a big no-no in the data science domain, and can lead to models which provide little to no real-world value.
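
To make that concrete, here is a minimal sketch of the two patterns, using a standard scaler purely as a stand-in for any preprocessing step and a random placeholder array in place of real features. In the leaky version, the scaler learns its statistics from rows that will later end up in the test set.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.rand(100, 5)  # placeholder feature matrix, just for illustration

# Leaky: the scaler learns its statistics from every row,
# including rows that will later form the test set.
X_scaled = StandardScaler().fit_transform(X)
X_tr_leaky, X_te_leaky = train_test_split(X_scaled, test_size=0.3)

# Leak-free: split first, learn the statistics from the training rows only,
# then apply those same statistics to the test rows.
X_tr, X_te = train_test_split(X, test_size=0.3)
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)

Doing this by hand for every preprocessing step quickly gets error-prone, which is where pipelines come in.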

So, to avoid the issue of data leakage, many seasoned data scientists will instead build feature engineering pipelines. These pipelines include all of the intended feature engineering steps, but the steps are packaged in a way that they can be fitted to the data used to train the model, and then applied, unchanged, to the data used to test model performance. So, let's walk through how to build some solid feature engineering pipelines to improve the fit of a neural network model.

Framing up the problem

We are going to use a simple pre-canned set of data, known as the Boston house price dataset. This dataset covers a series of measures related to house prices within the Boston area, and has been used extensively throughout the literature to benchmark data science algorithms. There are measures of the number of rooms, the crime rate, the property tax rate, and much more, and we will be using these features to predict the median value of properties.

The first thing we need to do is import the data, separate our features from the prices we are looking to predict, and then make a split of the data we are going to use to train the model versus test the model performance.


import pandas as pd

df_raw = pd.read_csv('DATA/boston.txt')
df_raw.head()

     CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  PTRATIO       B  LSTAT  MDEV
0  0.00632  18.0   2.31   0.0  0.538  6.575  65.2  4.0900  1.0  296.0     15.3  396.90   4.98  24.0
1  0.02731   0.0   7.07   0.0  0.469  6.421  78.9  4.9671  2.0  242.0     17.8  396.90   9.14  21.6
2  0.02729   0.0   7.07   0.0  0.469  7.185  61.1  4.9671  2.0  242.0     17.8  392.83   4.03  34.7
3  0.03237   0.0   2.18   0.0  0.458  6.998  45.8  6.0622  3.0  222.0     18.7  394.63   2.94  33.4
4  0.06905   0.0   2.18   0.0  0.458  7.147  54.2  6.0622  3.0  222.0     18.7  396.90   5.33  36.2

from sklearn.model_selection import train_test_split

X_all = df_raw.drop('MDEV', axis=1)
y_true_all = df_raw[['MDEV']].values.ravel()

X_train, X_test, y_true_train, y_true_test = train_test_split(X_all, y_true_all, test_size=0.3)

Fit a neural network regressor

To kick things off, let's fit a neural network regressor and check the model performance without any feature engineering. This is just one of many machine learning model types, but neural networks tend to respond well to feature engineering, so it will give us a useful baseline.


from sklearn.neural_network import MLPRegressor

clf = MLPRegressor()

clf.fit(X_train, y_true_train)

y_pred_train = clf.predict(X_train)
y_pred_test = clf.predict(X_test)

This gives us an r^2 score of 0.4862 over the training data and 0.501 over the test data. Not a great start, but it definitely gives us something to work with.
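
For reference, the r^2 scores quoted throughout this post can be reproduced with sklearn's r2_score; your exact numbers will differ a little, since both the train/test split and the network's weight initialization are random.

from sklearn.metrics import r2_score

print('train r^2:', r2_score(y_true_train, y_pred_train))
print('test r^2:', r2_score(y_true_test, y_pred_test))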

Add standard feature scaling

Let's add some standard feature scaling as our first engineering step. This scaling will center and rescale each of our features based on its mean and variance. Most importantly, because we are building the scaling step into a pipeline, we won't be introducing any data leakage: the mean and variance are calculated from our training set of data only, never from our test set.


from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('mlp', MLPRegressor())
])

pipe.fit(X_train, y_true_train)

y_pred_train = pipe.predict(X_train)
y_pred_test = pipe.predict(X_test)

This gives us an r^2 score of 0.7027 over the training data and 0.5214 over the test data. A pretty modest improvement on the test side, but an improvement nonetheless.
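
As a quick sanity check on the leakage point, you can pull the fitted scaler out of the pipeline and confirm that its statistics were learned from the training rows rather than the full dataset.

import numpy as np

scaler = pipe.named_steps['scaler']

# The learned means should match the training data...
print(np.allclose(scaler.mean_, X_train.mean(axis=0)))  # True
# ...and will generally not match the full dataset.
print(np.allclose(scaler.mean_, X_all.mean(axis=0)))    # almost always False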

Add PCA components

Next up, we're going to add an extra pre-processing step and include a set of principal components. This technique involves stripping the raw data down to a set of components which best explain the variance of the data. There is a fair bit to principal component analysis (PCA), so it is worth reading up on it separately if you haven't come across it before. In short, PCA is a great way to 'summarize' the data into what is most important, and collapse the features which don't add any real explanatory power.
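
If you want to see what 'summarizing the variance' means in practice, a quick standalone check is to fit PCA on the scaled training features and look at the proportion of variance each component explains. This is separate from the pipeline we build below; it just shows why a handful of components can stand in for most of the feature set.

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Fit on the training features only, scaled so no single column dominates.
pca = PCA().fit(StandardScaler().fit_transform(X_train))

# Proportion of variance explained by each component, and the running total.
print(pca.explained_variance_ratio_.round(3))
print(pca.explained_variance_ratio_.cumsum().round(3))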

Now, in our case, we're going to include our PCA components alongside our standardized features, merging the two to create a wider set of features. With a dataset this small, there is little downside to simply including the full set. And fortunately, sklearn has a handy FeatureUnion class which makes it simple to merge feature sets within the same pipeline.


from sklearn.pipeline import FeatureUnion
from sklearn.decomposition import PCA

pipe = Pipeline(steps=[
    ('feat', FeatureUnion(transformer_list=[
        ('scaler', StandardScaler()),
        ('pca', PCA()),
    ])),
    ('mlpr', MLPRegressor())
])

pipe.fit(X_train, y_true_train)

y_pred_train = pipe.predict(X_train)
y_pred_test = pipe.predict(X_test)

This gives us an r^2 score of 0.7728 over the training data and 0.6918 over the test data. A good improvement over our previous pipeline, which included only the scaled features.

Add encoded clusters

Last up, we are going to add one additional feature engineering step in the form of encoded clusters. This technique involves separating our data records into a set of distinct groups based on the similarity of their features. Once the groups are formed, we create a zero or one label for each cluster, indicating whether or not a given record falls within that cluster. So there are two steps involved here: one is to calculate the clusters, and the other is to assign our data records to those clusters and encode the assignments.
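
In isolation, those two steps look something like the sketch below: KMeans groups the training rows into clusters, and OneHotEncoder turns each cluster label into a set of zero/one indicator columns. The six clusters match what we use in the pipeline further down, but the number itself is just a choice.

from sklearn.cluster import KMeans
from sklearn.preprocessing import OneHotEncoder

# Step one: group the training records into six clusters.
labels = KMeans(n_clusters=6).fit_predict(X_train)

# Step two: encode each cluster label as a row of 0/1 indicator columns.
encoder = OneHotEncoder(categories='auto')
cluster_features = encoder.fit_transform(labels.reshape(-1, 1))

print(cluster_features.toarray()[:5])  # one column per cluster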

Unfortunately, sklearn isn't set up to chain these two steps inside a pipeline without some modification, because KMeans hands back its cluster labels through predict rather than transform (its transform method returns distances to the cluster centers instead). The code below gets us there by wrapping the predict call so that it behaves like a transform step.


import sklearn.base
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.cluster import KMeans

class transform_predict(sklearn.base.BaseEstimator, sklearn.base.TransformerMixin):
    """Wrap an estimator so its predict output can be used as a transform step."""

    def __init__(self, clf: sklearn.base.BaseEstimator):
        self.clf = clf

    def fit(self, *args, **kwargs):
        # Fit the wrapped estimator (KMeans in our case) as normal.
        self.clf.fit(*args, **kwargs)

        return self

    def transform(self, X: np.ndarray, **transform_params):
        # Use the estimator's predictions as the transformed features,
        # reshaped to a single column so OneHotEncoder will accept them.
        pred = self.clf.predict(X)

        return pred.reshape(-1, 1) if len(pred.shape) == 1 else pred

pipe = Pipeline(steps=[
    ('feat', FeatureUnion(transformer_list=[
        ('onehot', Pipeline(steps=[
            ('kmeans', transform_predict(KMeans(n_clusters=6))),
            ('onehot', OneHotEncoder(categories='auto'))
        ])),
        ('scaler', StandardScaler()),
        ('pca', PCA())
    ])),
    ('mlpr', MLPRegressor())
])

pipe.fit(X_train, y_true_train)

y_pred_train = pipe.predict(X_train)
y_pred_test = pipe.predict(X_test)

This gives us an r^2 score of 0.8512 over the training data and 0.7428 over the test data. Another good step up in performance, and definitely a big jump from our original measures of 0.4862 and 0.501.

So there you have it. We have managed to take a good step forward in model performance using nothing other than feature engineering via pipelines. This really is only scratching the surface, however. There's an entire range of preprocessing techniques available as part of sklearn's library, and the pipeline wrapper makes experimenting with them straightforward and painless.
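
As one example of that experimentation, the whole pipeline behaves like a single estimator, so you can cross-validate it or grid-search over any step's parameters and every fold will re-fit the preprocessing on its own training portion only. The parameter values below are arbitrary, just to show the naming convention.

from sklearn.model_selection import GridSearchCV

# Parameters are addressed as <step>__<substep>__<parameter>.
param_grid = {
    'feat__pca__n_components': [3, 5, 8],
    'mlpr__hidden_layer_sizes': [(50,), (100,)],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring='r2')
search.fit(X_train, y_true_train)

print(search.best_params_)
print(search.best_score_)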

If you have any questions on the content, or prediction applications you would like help with, feel free to reach out to our Datakick team.




