Comparing Open Source AutoML Tools

Auger vs. H2O vs. TPOT On Sample Datasets

We often get asked how Auger compares to other AutoML tools. Luckily, in these days of open source tools, the comparison can be done in a way that other users can validate and reproduce.

First let’s describe the choice of datasets. It was important that they all be publicly available and commonly used. We also wanted a cross-section of regression and classification problems across a variety of industries. On the Auger.AI website we use all of these datasets as “demo datasets” to train on before the user supplies their AWS credentials. We acknowledge that there may be some unconscious selection bias here, as we have used each of these datasets for some time. That said, we can’t test against, say, every OpenML dataset on a reasonable budget. We welcome similar comparative tests from third parties (we would even donate some free compute time if you feel so inspired). If you notice that some of the datasets we offer in demo mode are missing here, it’s because they are time series datasets, which H2O and TPOT do not support. Note that H2O’s Driverless AI web-based service does support time series, but we are referring to the H2O open source offering, which does not.

With that, here are the datasets we used for this evaluation:

  • CoverType, classify forest cover types from cartographic variables

  • Bank Marketing, classify if a banking client will subscribe to a term deposit based on marketing data

  • Thyroid Disease, classify a patient’s thyroid status as normal (not hypothyroid), hyperfunction, or subnormal functioning

  • Credit Ratings, classify if a person is a good or bad credit risk

  • Mercedes-Benz, predict test bench time for a car based on its configuration

  • Bike Rentals, predict number of bike rentals for a given day

It is important that we give H2O and TPOT at least a level playing field. We allowed each tool one hour of compute time. This is an arbitrary cutoff, of course, but it’s a threshold commonly used for testing.
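
As a rough sketch of how a shared one-hour budget might be expressed for the two open source tools, the snippet below builds keyword arguments for H2O’s `H2OAutoML` (`max_runtime_secs`) and TPOT’s `TPOTClassifier` (`max_time_mins`, `n_jobs`). The exact settings used in this evaluation are not stated in this post, so the helper names, the use of `n_jobs` and `nthreads`, and the choice of a classifier are assumptions for illustration, and parameter names may differ between library versions.

```python
# Hypothetical configuration sketch; not the settings actually used in this post.
BUDGET_MINUTES = 60  # the one-hour cutoff described above
CORES = 8            # assumption: the eight-core machine used for H2O and TPOT


def h2o_automl_kwargs():
    """Keyword arguments for H2OAutoML, which takes its time limit in seconds."""
    return {"max_runtime_secs": BUDGET_MINUTES * 60}


def tpot_kwargs():
    """Keyword arguments for TPOT, which takes its time limit in minutes."""
    return {"max_time_mins": BUDGET_MINUTES, "n_jobs": CORES}


def build_estimators():
    """Instantiate both tools. Requires the `h2o` and `tpot` packages."""
    # Imports are local so the helpers above work without either library.
    import h2o
    from h2o.automl import H2OAutoML
    from tpot import TPOTClassifier  # TPOTRegressor for the regression datasets

    h2o.init(nthreads=CORES)  # start a local H2O cluster limited to CORES threads
    return H2OAutoML(**h2o_automl_kwargs()), TPOTClassifier(**tpot_kwargs())


if __name__ == "__main__":
    print(h2o_automl_kwargs())
    print(tpot_kwargs())
```

Keeping the budget in one constant makes it harder for the two tools to drift onto different limits when the cutoff is changed for longer runs.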

We actually chose to give H2O and TPOT more compute time and horsepower than Auger. Auger results were evaluated on two-node clusters with two CPUs per node, so Auger effectively had four CPU hours per dataset. H2O and TPOT were evaluated on an eight-core computer with a one-hour limit, for an effective eight CPU hours per dataset.
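
The CPU-hour figures above follow from multiplying nodes, CPUs per node, and wall-clock hours. A minimal sketch of that arithmetic, with a hypothetical helper name and treating each core of the eight-core machine as one CPU:

```python
def cpu_hours(nodes, cpus_per_node, wall_clock_hours):
    """Total CPU hours available: nodes x CPUs per node x hours."""
    return nodes * cpus_per_node * wall_clock_hours


# Auger: two nodes with two CPUs each, for one hour.
auger_budget = cpu_hours(nodes=2, cpus_per_node=2, wall_clock_hours=1)

# H2O and TPOT: a single eight-core machine, for one hour.
other_budget = cpu_hours(nodes=1, cpus_per_node=8, wall_clock_hours=1)

print(auger_budget)                  # 4
print(other_budget)                  # 8
print(other_budget / auger_budget)   # 2.0
```

By this measure H2O and TPOT each had twice the compute budget that Auger had per dataset.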

Below are the results for each dataset:

[Figure: compare-open-source-automl-tools.png]

Walking through the list: Auger’s result on the CoverType dataset is more than a percentage point higher (as measured in accuracy) than the next best result (from TPOT). It was also just a tiny bit higher than the best published metric.

On Bank Marketing, H2O and Auger finished in a dead heat at 0.86 accuracy (H2O was just a tiny bit higher), and all tools beat the best published result from the contest. Similarly, Thyroid Disease resulted in a three-way dead heat for accuracy, with all three tools virtually tying the highest published result. For Credit Ratings (a popular Kaggle contest), Auger achieved an accuracy virtually identical to the best published metric. Both the contest metric and Auger handily outperformed TPOT and H2O.

On the Mercedes-Benz dataset, Auger outperformed even the first-place contest entry and outdid TPOT and H2O by quite a large margin (as measured in r²). On Bike Rentals, Auger and TPOT both reached an r² of 1, and H2O was very close to 1 as well.
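
For readers less familiar with the two metrics quoted above, here is a minimal sketch of their standard definitions: accuracy is the fraction of predictions that match the true label, and r² is one minus the ratio of residual to total sum of squares. This post does not say how each tool computed its scores, so the function names and the toy data below are illustrative assumptions only.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the true labels."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("r squared is undefined when y_true is constant")
    return 1 - ss_res / ss_tot


# Toy data, not taken from any of the datasets in this post.
print(accuracy(["good", "bad", "good", "good"], ["good", "bad", "bad", "good"]))  # 0.75
print(r_squared([10, 20, 30], [10, 20, 30]))  # 1.0 (perfect predictions)
print(r_squared([10, 20, 30], [20, 20, 20]))  # 0.0 (no better than the mean)
```

An r² of 1, as reported for Bike Rentals, means the predictions match the target values exactly on the evaluated data.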

It’s important to mention that, because of the arbitrary one-hour cutoff, one of the other AutoML tools could very well pull ahead of Auger given enough time. In subsequent posts we will explore longer time horizon tests with additional tools. We welcome each of you trying these datasets with Auger, H2O and TPOT. Please let us know if you have suggestions for other datasets to try or other tools to evaluate.

Adam Blum