Data Preprocessing for Machine Learning in Auger

Generating a usable dataset for prediction and classification problems is usually the most time-consuming part of large data science problems. Most machine learning algorithms work only with well-structured data, generally tables with only numerical values. But most real-world data contains many exceptions to this requirement:

  • missing values (e.g. values that were not observed for some reason)

  • categorical data represented by strings (e.g. day of the week)

  • categories represented by numbers (which could be interpreted as numeric values)

  • dates and times

Although some algorithms (e.g. CatBoost) can handle some of these data issues out of the box, Auger provides a unified preprocessing framework for all machine learning algorithms. This allows us to jointly optimize several ML algorithms and lets the analyst choose the best models from a leaderboard.

Feature extraction is a crucial part of most data science problems. Auger uses core preprocessors as well as several custom feature extractors for advanced users. Moreover, we tried to automate as many steps as possible. This makes it possible to build a full pipeline without any human intervention.

Some AutoML software (such as TPOT) tries to optimize data preprocessing together with the model’s hyperparameters. However, finding the right sequence of preprocessing steps this way can take an enormous amount of time. Instead, Auger applies data science best practices to process the data and spends most of its time tuning ML models rather than massaging the data.

Preprocessing Work

Auger provides the following core preprocessors, which are applied to all datasets.

Missing values

If fewer than 5% of samples have missing values, it is suggested to drop those samples. If a feature contains more than 95% missing values, the feature should be eliminated. However, if a categorical feature contains missing values, they can be used as is, i.e. as a separate category.

Otherwise, missing values should be substituted with the mean value of the feature (Auger uses Imputer). There are more advanced techniques (e.g. regression on missing values), but they usually lead to overfitting. We do not support these in Auger today.
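
The rules above can be sketched roughly as follows. This is only an illustration, not Auger's actual code: the function name handle_missing, the order in which the rules are applied, and the "missing" category label are assumptions, and scikit-learn's SimpleImputer stands in for the Imputer mentioned above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer


def handle_missing(df, row_frac=0.05, col_frac=0.95):
    """Hypothetical helper applying the missing-value rules in one pass."""
    # Eliminate features that are missing in more than 95% of samples.
    df = df.loc[:, df.isna().mean() <= col_frac].copy()

    # Categorical (non-numeric) features: treat missing as its own category.
    for col in df.columns:
        if not pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].fillna("missing")

    # If fewer than 5% of samples are still incomplete, drop those samples.
    incomplete = df.isna().any(axis=1)
    if incomplete.mean() < row_frac:
        return df[~incomplete].reset_index(drop=True)

    # Otherwise substitute the mean value of each remaining numeric feature.
    cols_with_nan = [c for c in df.columns if df[c].isna().any()]
    if cols_with_nan:
        imputer = SimpleImputer(strategy="mean")
        df[cols_with_nan] = imputer.fit_transform(df[cols_with_nan])
    return df
```

In a real pipeline the imputer would be fitted on the training split only and then reused on new data, so that the substituted means do not leak information from the test set.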

Categorical features

Auger uses OneHotEncoder to encode categorical features. OneHotEncoder can produce too many features. To tackle this issue, one can merge rare categories into one category (e.g. encode only the top fifty categories and merge the others into a single category). We also apply Principal Component Analysis (PCA) over the encoded data.
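
As a rough illustration of merging rare categories before one-hot encoding, consider the sketch below. The helper merge_rare, the "__other__" label and the toy data are assumptions made for this example; only OneHotEncoder itself comes from the description above, and how Auger configures it is not shown here.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder


def merge_rare(series, top_k=50, other="__other__"):
    """Keep the top_k most frequent categories, merge the rest into one."""
    top = series.value_counts().index[:top_k]
    return series.where(series.isin(top), other)


colors = pd.Series(["red", "red", "blue", "blue", "green", "teal"])

# With top_k=2, "green" and "teal" collapse into a single category.
merged = merge_rare(colors, top_k=2)

# One column per remaining category; the default output is a sparse matrix.
encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(merged.to_frame(name="color")).toarray()
```

Setting handle_unknown="ignore" is a design choice for this sketch: categories that appear only at prediction time are encoded as all zeros instead of raising an error.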

Sparse features

A feature is called sparse if it contains the same value (usually zero) for most observations. Auger reduces the number of features in datasets with many sparse features using dimensionality reduction (e.g. PCA). The number of components can be tuned as a part of the pipeline. As an additional benefit, the components produced by PCA are uncorrelated, so highly correlated features are effectively merged.
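
A minimal sketch of this reduction with scikit-learn's PCA is shown below. The synthetic data, the choice of 5 components and the variable names are assumptions for illustration; in practice the number of components would be one of the tuned pipeline parameters.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic sparse data: 200 samples, 20 features, about 5% non-zero entries.
rng = np.random.default_rng(0)
X_sparse = (rng.random((200, 20)) < 0.05).astype(float)

# Project onto a small number of principal components.
pca = PCA(n_components=5, random_state=0)
X_reduced = pca.fit_transform(X_sparse)

# Share of the total variance kept by each retained component.
explained = pca.explained_variance_ratio_
```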

Date and time features

One can extract the following features from the date/time column:

  • Absolute time (numerical)

  • Day of year (numerical, cyclic)

  • Day of week (categorical)

  • Month day (numerical, cyclic)

  • Month of year (categorical)

  • Hour of day (categorical)

  • Minute of hour (numerical, cyclic)

There is no need to manually choose which features to extract. Auger’s feature elimination step will automatically remove redundant features.
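
The features listed above could be extracted with pandas along these lines. The function name, the column names and the choice of seconds since the Unix epoch for absolute time are assumptions for this sketch rather than Auger's actual output format.

```python
import pandas as pd


def extract_datetime_features(ts):
    """Derive the date/time features listed above from a datetime Series."""
    epoch = pd.Timestamp("1970-01-01")
    return pd.DataFrame({
        # Absolute time as whole seconds since the Unix epoch (assumed unit).
        "absolute_time": (ts - epoch) // pd.Timedelta(seconds=1),
        "day_of_year": ts.dt.dayofyear,   # 1..366
        "day_of_week": ts.dt.dayofweek,   # Monday=0 .. Sunday=6
        "month_day": ts.dt.day,           # 1..31
        "month_of_year": ts.dt.month,     # 1..12
        "hour_of_day": ts.dt.hour,        # 0..23
        "minute_of_hour": ts.dt.minute,   # 0..59
    })


timestamps = pd.to_datetime(pd.Series(["2021-03-01 13:45:00"]))
features = extract_datetime_features(timestamps)
```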

Feature scaling

For k-nearest neighbors, support vector machines, logistic regression (and any other algorithm that uses a metric), it is important to scale features before training. This preprocessor scales features to the [0, 1] range (using MinMaxScaler). However, decision-tree-based methods (such as random forest) are not affected by scaling, so this technique is not applied when training them.
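
For illustration, scaling with MinMaxScaler might look like the sketch below. The toy arrays and variable names are assumptions; the point is that the scaler learns each feature's minimum and maximum from the training data and reuses them for new data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two features on very different scales.
X_train = np.array([[1.0, 100.0],
                    [2.0, 300.0],
                    [3.0, 500.0]])
X_test = np.array([[2.5, 200.0]])

scaler = MinMaxScaler()  # default feature_range is (0, 1)
X_train_scaled = scaler.fit_transform(X_train)  # learns per-feature min/max
X_test_scaled = scaler.transform(X_test)        # reuses the training min/max
```

Values in new data that fall outside the training range will land outside [0, 1], which is usually acceptable for the metric-based models mentioned above.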

Feature elimination

Auger removes features with variance less than 5% (using VarianceThreshold). It also removes features that are highly correlated with another feature, i.e. with an absolute correlation of more than 95%.
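
Both elimination rules can be sketched as follows. This is a sketch under assumptions: "variance less than 5%" is read here as a threshold of 0.05 on features already scaled to [0, 1], the later feature of each highly correlated pair is the one dropped, and the function name eliminate_features is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold


def eliminate_features(df, min_variance=0.05, max_corr=0.95):
    """Drop low-variance features, then one of each highly correlated pair."""
    # Step 1: keep only features whose variance exceeds the threshold.
    selector = VarianceThreshold(threshold=min_variance)
    selector.fit(df)
    kept = df.loc[:, selector.get_support()]

    # Step 2: inspect the upper triangle of the absolute correlation matrix
    # so that each pair of features is considered exactly once.
    corr = kept.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > max_corr).any()]
    return kept.drop(columns=to_drop)
```

Note that VarianceThreshold raises an error if no feature passes the threshold, so a production version would need to handle that case.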

Auxiliary preprocessors

Auger also provides optional preprocessors that the user can enable on their dataset.

Cyclic features

A feature is called cyclic if its largest possible value is close to the smallest one.

For example, the 31st day of a month is very close to the 1st day of the next month.

Each cyclic feature is replaced with the following two features:

  • sin(2π * scale(X))

  • cos(2π * scale(X))
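
A small sketch of this encoding is given below. It assumes that scale(X) divides the value by the length of its cycle (24 for hours, 60 for minutes), which is one plausible reading of the formulas above; the function name encode_cyclic is hypothetical.

```python
import numpy as np


def encode_cyclic(values, period):
    """Map a cyclic feature onto the unit circle as a (sin, cos) pair."""
    # Assumed scaling: one full cycle of length `period` maps to 2*pi.
    angle = 2 * np.pi * np.asarray(values, dtype=float) / period
    return np.sin(angle), np.cos(angle)


hours = np.array([0, 6, 12, 23])
hour_sin, hour_cos = encode_cyclic(hours, period=24)
```

After this transformation, hour 23 and hour 0 are neighbors on the circle, whereas as raw numbers they are as far apart as possible.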

Interaction between features

Some features interact in multiplicative ways. For example, house prices depend on area = width * height, rather than on width or height alone. If the Auger user chooses interaction between features X, Y and Z, their combinations X*Y, Y*Z and X*Z will also be added to the dataset.
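
Pairwise interaction features of this kind can be generated as in the sketch below. The function name add_interactions and the "X*Y" column naming are assumptions for illustration.

```python
from itertools import combinations

import pandas as pd


def add_interactions(df, columns):
    """Append the product of every pair of the chosen features."""
    out = df.copy()
    for a, b in combinations(columns, 2):
        out[f"{a}*{b}"] = out[a] * out[b]
    return out


data = pd.DataFrame({"X": [1, 2], "Y": [3, 4], "Z": [5, 6]})
with_interactions = add_interactions(data, ["X", "Y", "Z"])
```

The number of added columns grows quadratically with the number of chosen features, which is one reason to make this preprocessor opt-in.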

Summary

Auger provides a powerful and complementary suite of feature creation and data preprocessing capabilities that will get your data science problem to results from Auger’s suite of machine learning algorithms much faster. Let us know what kinds of feature creation and modification problems you are having with your data, and we will work with you to tackle them.

Adam Blum