This site provides general information about the fast-growing field of automated machine learning (AutoML), along with links to our own tools: PennAI, an accessible artificial intelligence system, and the Tree-Based Pipeline Optimization Tool (TPOT), an AutoML algorithm and software package built on Python and the scikit-learn machine learning library. We also provide links to some other commonly used AutoML methods and software.
The goal of AutoML is to make machine learning more accessible by automatically generating a data analysis pipeline that can include data pre-processing, feature selection, and feature engineering methods, along with machine learning methods and parameter settings that are optimized for your data. Each of these steps can be time-consuming for the machine learning expert and daunting for the novice. By automating them, AutoML methods make this powerful technology more widely accessible to those hoping to make use of big data. A new book reviews several AutoML methods including TPOT.
Below is an example of a hypothetical machine learning pipeline that could be discovered using a method such as TPOT. Here, the data are first analyzed using a random forest (RF), and feature selection is performed using its importance scores. The selected features then undergo a polynomial transformation before being analyzed using k nearest neighbors (kNN). The predictions made by kNN are then treated as a new engineered feature and passed to a decision tree (DT). In parallel, the data are also engineered using principal components analysis (PCA). The principal components are then passed as new features to a support vector machine (SVM), whose output is passed as an engineered feature to the DT along with the kNN feature. The DT then makes a final prediction. Each of the methods in this pipeline is included in the scikit-learn library.
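To make the structure of this hypothetical pipeline concrete, here is a minimal hand-written sketch using scikit-learn only. It is not output from TPOT, and it is only an approximation of the description: the two branches are combined with scikit-learn's `StackingClassifier`, which trains the final decision tree on cross-validated branch predictions. The iris dataset, the variable names, and every hyperparameter value (number of trees, polynomial degree, number of neighbors, number of components) are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Example data (an assumption; any tabular classification dataset would do).
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y
)

# Branch 1: RF importance-based feature selection -> polynomial features -> kNN.
knn_branch = make_pipeline(
    SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=0)),
    PolynomialFeatures(degree=2),
    KNeighborsClassifier(n_neighbors=5),
)

# Branch 2: PCA -> SVM.
svm_branch = make_pipeline(PCA(n_components=2), SVC())

# The predictions of both branches become the only two input features of the
# final decision tree (passthrough is off by default).
pipeline = StackingClassifier(
    estimators=[("knn_branch", knn_branch), ("svm_branch", svm_branch)],
    final_estimator=DecisionTreeClassifier(random_state=0),
    stack_method="predict",
    cv=5,
)

pipeline.fit(X_train, y_train)
accuracy = pipeline.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.2f}")
```

Writing even this small pipeline by hand involves several choices (which selector, which transformation, which hyperparameters), and these are the kinds of decisions an AutoML method searches over automatically.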