“Auto-What?” — A Taxonomy Of Automated Machine Learning
AutoML is one of the most active areas of innovation in applied machine learning. New products from the likes of Google and from AI-focused startups (such as our own Auger.AI) appear constantly, all promising to make machine learning accessible to the masses without the need for trained data scientists. At its core, AutoML involves some selection and configuration of machine learning algorithms. However, each product seems to have its own take on which parts of the machine learning process to automate and how to automate them.
Based on questions we have seen, we at Auger.AI believe the industry could use a taxonomy of capabilities of AutoML tools. These capabilities include the following: choosing algorithms, setting hyperparameters, controlling model search and training time, cross-validation, data preprocessing, and feature creation. While Gartner has yet to offer a Magic Quadrant for AutoML, perhaps this overview can help inform a future effort as the automated machine learning sector matures.
Most users who are new to machine learning benefit from an AutoML tool's guidance on which algorithms to use. That guidance is also valuable to experienced data scientists trying to wring a bit more predictive accuracy out of their models. For regression problems, the tool should include basic algorithms such as linear and polynomial regression, k-nearest neighbors and random forest. Deep learning neural networks nicely complement these traditional machine learning approaches, so a good AutoML package should offer them as well. It is also important that each of these algorithms exposes appropriate hyperparameters that the AutoML tool can set.
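As a minimal sketch of what algorithm selection can look like, the following compares three of the algorithm families named above by cross-validated score and picks the best. It assumes scikit-learn is installed and uses synthetic data from `make_regression`; the candidate list, the scoring metric and the dataset are illustrative assumptions, not a description of how any particular AutoML product works.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression data stands in for a real dataset (an assumption).
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)

# Candidate algorithms drawn from the families named in the text.
candidates = {
    "linear_regression": LinearRegression(),
    "k_nearest_neighbors": KNeighborsRegressor(n_neighbors=5),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
}

# Score each candidate by its mean R^2 over 5 cross-validation folds.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    for name, model in candidates.items()
}

# Select the candidate with the highest mean score.
best_name = max(scores, key=scores.get)
print(scores)
print("selected:", best_name)
```

A real tool would search a much larger candidate set and tune each candidate's hyperparameters as well, but the selection step reduces to the same comparison.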
Many AutoML packages, such as auto-sklearn, search only among traditional algorithms and still provide real value with that approach. Other AutoML offerings, including Google Cloud AutoML, search and configure neural network architectures, which makes them quite useful for image recognition applications. Both of these restricted searches still count as AutoML.
An emerging area for AutoML is ensemble generation, which automatically combines multiple algorithms to produce better results than any individual constituent algorithm alone. Some AutoML packages don’t offer automated ensemble generation, while a few now combine “leaderboard” winners to achieve the highest accuracy possible.
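One simple way to combine leaderboard winners is to average the predictions of the top-scoring models. The sketch below assumes a hypothetical leaderboard format (a list of dicts with `name`, `score` and `predict` keys) and toy stand-in models; real products may use weighting, stacking or other schemes.

```python
def top_k_ensemble(leaderboard, k=2):
    """Return a predictor that averages the k best-scoring models, plus their names."""
    winners = sorted(leaderboard, key=lambda entry: entry["score"], reverse=True)[:k]

    def predict(x):
        # Unweighted mean of the winners' individual predictions.
        return sum(entry["predict"](x) for entry in winners) / len(winners)

    return predict, [entry["name"] for entry in winners]


# Hypothetical leaderboard: the scores and models are made up for illustration.
leaderboard = [
    {"name": "random_forest", "score": 0.91, "predict": lambda x: 10.0},
    {"name": "linear", "score": 0.85, "predict": lambda x: 2 * x},
    {"name": "knn", "score": 0.88, "predict": lambda x: 12.0},
]

ensemble_predict, ensemble_members = top_k_ensemble(leaderboard, k=2)
print(ensemble_members, ensemble_predict(3.0))
```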
The primary capability for any AutoML product should be the ability to choose a suitable algorithm with which to build a model. However, most appear to be more focused on hyperparameter optimization than anything else. For some prediction applications random forest has become preeminent and often no other algorithm is necessary. From that perspective the focus on hyperparameter optimization makes sense.
Hyperparameters are simply the options of an algorithm that are set before training occurs. For example, when a random forest is used for classification, the hyperparameters include the number of trees in the forest and the number of features considered at each tree node. The number of features considered at each node is typically the square root of the overall number of features. But there is nothing magic about that value, and it is a parameter that lends itself readily to optimization.
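To make the two hyperparameters concrete, here is a short sketch using scikit-learn's `RandomForestClassifier` (assuming scikit-learn is installed). The dataset is synthetic and the values chosen are only illustrative starting points.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with 16 features, so the square-root rule gives 4 per split.
X, y = make_classification(
    n_samples=300, n_features=16, n_informative=6, random_state=0
)

forest = RandomForestClassifier(
    n_estimators=100,     # number of trees in the forest
    max_features="sqrt",  # features considered at each split: sqrt(16) = 4
    random_state=0,
)
forest.fit(X, y)

# Each fitted tree records the resolved number of features per split.
print(len(forest.estimators_), forest.estimators_[0].max_features_)
```

Both values are set before `fit` is called, which is what distinguishes them from the parameters learned during training.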
Virtually every AutoML solution available includes some level of hyperparameter optimization. There are open source hyperparameter optimization packages such as hyperopt and MOE. There are even standalone hosted hyperparameter optimization services such as SigOpt and Google Cloud ML Engine’s Hyperparameter Optimization Service. If you are a dedicated data scientist who is skeptical of AutoML black boxes, you owe it to yourself to at least try hyperparameter optimization, probably before trying full AutoML to pick an algorithm. This of course assumes you have strong opinions about which algorithm your model should be based on.
Controlling Model Search and Training
Model training across a large search space is extremely expensive in time, and in money as well when it runs on a cloud computing service like AWS rather than your own machine. All AutoML packages have some method of controlling overall algorithm/hyperparameter search time, in the form of a limit on the number of trials or a limit on elapsed time. In addition, individual training executions can be limited to specific amounts of time, which becomes more important as datasets grow larger. It’s also important to allow data scientists to restrict the algorithm and hyperparameter search based on their own judgment. Hard-won instincts about which algorithms to search among can drastically reduce execution time.
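The two kinds of overall budget can be sketched with a plain random search that stops when either the trial limit or the time limit is reached. Everything here is illustrative: `budgeted_random_search`, the search space and `toy_objective` are hypothetical names, and the objective is a stand-in for training a model and returning its validation score.

```python
import random
import time


def budgeted_random_search(objective, search_space, max_trials=50,
                           max_seconds=2.0, seed=0):
    """Sample random configurations until the trial or time budget runs out."""
    rng = random.Random(seed)
    deadline = time.monotonic() + max_seconds
    best_config, best_score, trials = None, float("-inf"), 0
    while trials < max_trials and time.monotonic() < deadline:
        # Draw one value per hyperparameter from the user-restricted space.
        config = {name: rng.choice(values) for name, values in search_space.items()}
        score = objective(config)
        trials += 1
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score, trials


# Hypothetical search space, restricted by the data scientist's own judgment.
search_space = {
    "n_estimators": [50, 100, 200, 400],
    "max_features": [2, 4, 8],
}


def toy_objective(config):
    # Stand-in for "train a model and return its validation score".
    return (-abs(config["n_estimators"] - 200) / 400
            - abs(config["max_features"] - 4) / 8)


best_config, best_score, trials_run = budgeted_random_search(
    toy_objective, search_space, max_trials=30
)
print(best_config, best_score, trials_run)
```

This sketch checks the clock only between trials, so it does not interrupt a single long training run. Limiting individual executions would need a separate mechanism, such as running each trial in a subprocess with a timeout.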
Any attempt to tune hyperparameters should be accompanied by a cross-validation step. Instead of simply dividing the data into a training set and a test set, the data is divided into “k folds” (where k is typically 5 or 10). k-1 folds are used to train, and the remaining fold is used for testing. This is repeated so that each fold is used exactly once as the test set, and the k results are averaged. k-fold validation isn’t an absolute requirement, but whenever hyperparameters are tuned, some form of more sophisticated cross-validation needs to be performed to avoid overfitting. Exhaustive cross-validation methods (such as Leave One Out) would typically be too expensive to perform during AutoML.
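The k-fold procedure described above can be sketched in a few lines of plain Python. The helper names (`k_fold_indices`, `cross_validate`) and the toy mean-predicting model are assumptions made for illustration; libraries such as scikit-learn provide production versions of the same idea.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle the sample indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(fit, score, X, y, k=5, seed=0):
    """Train on k-1 folds, test on the held-out fold, and average the k scores."""
    folds = k_fold_indices(len(X), k, seed)
    fold_scores = []
    for i, test_idx in enumerate(folds):
        # Every fold except fold i contributes to the training set.
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        model = fit([X[j] for j in train_idx], [y[j] for j in train_idx])
        fold_scores.append(
            score(model, [X[j] for j in test_idx], [y[j] for j in test_idx])
        )
    return sum(fold_scores) / k


def fit_mean(X_train, y_train):
    # Toy "model": always predict the mean of the training targets.
    return sum(y_train) / len(y_train)


def negative_mae(model, X_test, y_test):
    # Negated mean absolute error, so that higher is better.
    return -sum(abs(t - model) for t in y_test) / len(y_test)


X = [[float(i)] for i in range(20)]
y = [2.0 * i for i in range(20)]
mean_score = cross_validate(fit_mean, negative_mae, X, y, k=5)
print(mean_score)
```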
Even though most AutoML solutions insist on the data being available in some normalized form like a CSV or ARFF file, most models benefit from further preprocessing. This includes handling missing values, scaling feature values (typically to floating point values between 0 and 1) for algorithms that need it, handling cyclic features, and removing low-variance features or highly correlated features (features whose values correlate strongly with other features). Many products help in creating “flat CSVs” for features by either turning JSON data fields into separate columns, or joining an external CSV into a main CSV based on a common key. Data preprocessing is particularly important in AutoML because the dataset may not have been optimized for a particular algorithm. This is an example set of data preprocessing features for Auger.AI.
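Several of these preprocessing steps are small enough to sketch in plain Python. The function names, the mean-imputation strategy and the variance threshold below are illustrative assumptions, not a description of Auger.AI's implementation; removing correlated features is omitted for brevity.

```python
import math
from statistics import mean, pvariance


def impute_mean(column):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in column]


def min_max_scale(column):
    """Rescale values linearly to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


def encode_cyclic(value, period):
    """Map a cyclic value (such as hour of day) onto the unit circle."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


def drop_low_variance(columns, threshold=1e-8):
    """Keep only the columns whose population variance exceeds the threshold."""
    return {name: col for name, col in columns.items() if pvariance(col) > threshold}


# Made-up columns for illustration.
raw = {"age": [20, None, 40, 60], "constant_flag": [1, 1, 1, 1]}

age = min_max_scale(impute_mean(raw["age"]))
kept = drop_low_variance({"age": age, "constant_flag": raw["constant_flag"]})
print(age, list(kept), encode_cyclic(23, 24))
```

The sine/cosine encoding places hour 23 next to hour 0, which a raw integer encoding would not do.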
Examples of feature creation include deriving multiple features from date/time fields (including non-obvious ones like “holiday or not?”, “season” and “weekday or not?”). Another example is creating binary category features from a single category feature, as performed by tools like scikit-learn’s OneHotEncoder. As AutoML matures, it is likely that many of the products aiming at “full lifecycle automated machine learning” will add more feature creation capabilities, and the range of techniques in this category will expand.
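Both kinds of feature creation can be illustrated with the standard library alone. In this sketch the holiday calendar is a hypothetical two-entry set, the season mapping assumes northern-hemisphere meteorological seasons, and `one_hot` is a simplified stand-in for what an encoder such as OneHotEncoder produces.

```python
from datetime import date

# Hypothetical holiday calendar; a real one would come from a data source.
HOLIDAYS = {date(2024, 1, 1), date(2024, 12, 25)}


def season_of(month):
    """Northern-hemisphere meteorological season (an assumption)."""
    return ("winter", "spring", "summer", "autumn")[(month % 12) // 3]


def date_features(d):
    """Derive several features from a single date field."""
    return {
        "month": d.month,
        "is_weekday": d.weekday() < 5,  # Monday=0 ... Friday=4
        "is_holiday": d in HOLIDAYS,
        "season": season_of(d.month),
    }


def one_hot(values):
    """Turn one categorical feature into one binary feature per category."""
    categories = sorted(set(values))
    return [{f"is_{c}": int(v == c) for c in categories} for v in values]


print(date_features(date(2024, 12, 25)))
print(one_hot(["red", "blue", "red"]))
```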
As automated machine learning products mature, we hope this will spark a discussion of which features define an AutoML tool so end-users can accurately compare products based on a common definition.