Most of us love a good chocolate brownie. Chocolate, however, is not the only reason. We also love a good brownie for its texture, its convenient serving size, the presence of nuts (or other confections), and its “just right” density: neither too fluffy, like a chocolate cake, nor too heavy, like a dense fudge. We therefore love the chocolate brownie because it has several delightful, distinguishing, and delicious features.
Selecting good features in our data collection similarly delights us in many ways. The best features are explanatory and informative: they explain the characteristic behaviors of the objects represented within our data, and they inform our choice of models.
Good features are also empowering: they empower us to build the most accurate predictive models, to discover the most informative trends and patterns in our data, to maximize the interpretability of our models, and to choose the most descriptive parameters for data visualizations. Therefore, it is no surprise that feature mining is one aspect of data science that appeals to all data scientists. Feature mining includes: (1) feature generation (from combinations of existing attributes), (2) feature selection (for mining and knowledge discovery), and (3) feature extraction (for operational systems, decision support, and reuse in various analytics processes, dashboards, and pipelines).
Many machine learning algorithms require an input feature vector: the feature (parameter) vector of a data object, preferably chosen parsimoniously. That is, the feature vector should contain only those features that describe, characterize, (ideally) define, and uniquely identify each object. The set of feature vectors for all objects to be modeled becomes the fundamental input to the machine learning algorithm.
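To make the idea concrete, here is a minimal sketch of feature vectors as rows of an input matrix. The feature names and values are hypothetical (a made-up “brownie” dataset), chosen only to illustrate the structure a learning algorithm expects.

```python
# A hand-picked, parsimonious set of features (hypothetical example).
feature_names = ["density", "nut_fraction", "serving_size_g"]

# One feature vector per object; values are purely illustrative.
feature_vectors = [
    [0.95, 0.10, 40.0],   # brownie 1
    [0.80, 0.00, 35.0],   # brownie 2
    [1.10, 0.25, 50.0],   # brownie 3
]

# The full set of feature vectors is the fundamental model input:
# every row must supply a value for every chosen feature.
X = feature_vectors
assert all(len(row) == len(feature_names) for row in X)
```

In practice this matrix (often called `X`) is exactly what libraries pass to a model's fit routine, one row per object and one column per feature.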
The feature vector usually contains only a subset of all possible parameter values from the database (e.g., the actual height, width, or color of a brownie may matter less than its other characteristics and can thus be ignored). Ideally, for classification applications, the feature set should be the smallest subset of object parameters sufficient to predict class labels well; a parsimonious feature set helps avoid overfitting on the seen data and improves generalization to unseen data.
There are several objective means for selecting the optimal set of features: PCA (Principal Component Analysis), correlation analysis, Gini coefficient, mutual information, or information gain (used in Decision Tree calculations). There are also subjective means for selecting the feature set: good old-fashioned domain knowledge (i.e., the subject matter expert often knows which parameters are most useful, most meaningful, and most effective in classifying the objects that they know best).
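One of the objective criteria above, information gain, can be sketched in a few lines of plain Python: it measures how much splitting on a feature reduces the entropy of the class labels. The toy data below (a hypothetical `has_nuts` feature versus a `delicious` label) is invented purely for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Reduction in label entropy after splitting on a discrete feature."""
    n = len(labels)
    splits = {}
    for v, y in zip(feature_values, labels):
        splits.setdefault(v, []).append(y)
    # Weighted average entropy of the label within each feature-value group.
    remainder = sum(len(ys) / n * entropy(ys) for ys in splits.values())
    return entropy(labels) - remainder

# Hypothetical data: which brownies were judged "delicious" (1) or not (0)?
has_nuts  = [1, 1, 0, 0, 1, 0]
color     = ["dark", "dark", "light", "dark", "light", "light"]
delicious = [1, 1, 0, 0, 1, 0]

# has_nuts matches the label exactly, so it recovers the full 1.0 bit of
# label entropy; color is only weakly informative.
print(information_gain(has_nuts, delicious))   # → 1.0
print(information_gain(color, delicious))      # → ~0.082
```

Ranking candidate features by a score like this (or by mutual information, correlation, or Gini-based importance) is the essence of objective feature selection; the subject matter expert's domain knowledge then provides the subjective cross-check.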
So, the next time you start building models for a data science competition, fortify yourself with some good snacks (like chocolate brownies) and strengthen your models with some good feature selection.
—Written by Kirk Borne