3 Methods for Feature Creation and Data Transformation in Data Science

By January 20, 2015 Data Science No Comments

Building on Paul Yacci’s earlier post on the importance of feature selection in data science and data analysis, creating new features from your existing data set can play a large role in how well your model performs. There are multiple methods of feature creation and data transformation. Often, finding the right transformation of your data can reveal relationships that would be difficult to see otherwise, and may also make it easier for your model to separate classes. In the simple case of linear regression models, this can mean squaring terms, taking their logarithm, or applying another functional transformation that turns the underlying problem back into a linear relationship. For image data in particular, there are a number of useful transformations.
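As a minimal sketch of the logarithm case, the snippet below fits a straight line to some data before and after a log transformation. The data are synthetic and purely illustrative (an exponential curve with made-up constants 2.0 and 0.5), and the helper name `r_squared` is hypothetical, not something from the original post.

```python
import numpy as np

# Synthetic data (an assumption for illustration): y grows exponentially with x.
x = np.linspace(1.0, 10.0, 50)
y = 2.0 * np.exp(0.5 * x)

def r_squared(x, y):
    """R^2 of a straight-line least-squares fit of y on x."""
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    return 1.0 - residuals.var() / y.var()

# A straight line describes the raw values poorly...
r2_raw = r_squared(x, y)
# ...but log(y) = log(2.0) + 0.5 * x is exactly linear in x.
r2_log = r_squared(x, np.log(y))

# The line fitted in log space recovers the constants of the exponential.
slope, intercept = np.polyfit(x, np.log(y), 1)
print(r2_raw, r2_log, slope, np.exp(intercept))
```

On real, noisy data the improvement will be less clean, so it is worth comparing fits on held-out data rather than trusting a single transformation.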

1) Spatial morphology. The shape and structure of objects play a huge role in how both humans and models classify them. Are the objects of interest elliptical? How many segments does the animal have? Does it have a tail? How solid is the object in the image (i.e. is it hollow or filled)?

In the case of plankton data analysis, it can be useful to explore the ways that experts classify the images of plankton (see the Plankton Portal field guide and try hand-classifying a few images to get a feel for the data). Also, don’t forget to look out for strange creatures like the siphonophores that may defy simple morphological classification.
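Questions like "is it elliptical?" and "is it hollow or filled?" can be turned into numbers. The sketch below is one hedged way to do that with NumPy alone, assuming the object has already been segmented into a non-empty boolean mask. The function name `shape_features` and the test shapes are hypothetical, and "extent" (the fraction of the bounding box that is covered) is used here as a simple stand-in for solidity, which would normally need a convex hull.

```python
import numpy as np

def shape_features(mask):
    """Simple shape descriptors of a non-empty boolean mask (True = object pixel)."""
    rows, cols = np.nonzero(mask)
    area = rows.size
    # Extent: fraction of the bounding box covered by object pixels.
    bbox_area = (rows.max() - rows.min() + 1) * (cols.max() - cols.min() + 1)
    extent = area / bbox_area
    # Eigenvalues of the pixel-coordinate covariance are proportional to the
    # squared lengths of the principal axes (eigvalsh returns them ascending).
    minor, major = np.linalg.eigvalsh(np.cov(np.vstack([rows, cols])))
    eccentricity = np.sqrt(1.0 - minor / major)
    return {"area": area, "extent": extent, "eccentricity": eccentricity}

# Three synthetic shapes on a 101 x 101 grid (illustrative only).
yy, xx = np.mgrid[:101, :101]
r2 = (yy - 50) ** 2 + (xx - 50) ** 2
disk = r2 <= 40 ** 2
ring = disk & (r2 >= 30 ** 2)                                  # hollow disk
ellipse = ((yy - 50) / 10) ** 2 + ((xx - 50) / 40) ** 2 <= 1   # elongated

for name, mask in [("disk", disk), ("ring", ring), ("ellipse", ellipse)]:
    print(name, shape_features(mask))
```

The ring and the disk share the same outline but differ in extent, while the ellipse stands out through its eccentricity, which is the kind of separation these features are meant to give a classifier.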

2) Fourier and Wavelet Transformations. These transformations move the problem from the spatial domain into the frequency domain and, in the case of wavelet transformations, allow you to trade off between the two, revealing signatures that might not have been visible otherwise. Using either the power spectra or selected wavelet coefficients can allow you to capture some of the information about how much structure is present in the images (see the popular practical guide for wavelet analysis and this example for wavelets being applied to describe structure in overhead imagery).
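For the Fourier half of this idea, a rough sketch of a power-spectrum feature is shown below. It averages the 2D power spectrum over rings of equal spatial frequency, giving a short vector that summarizes how much structure an image has at each scale. The function name `radial_power_spectrum`, the choice of 16 bins, and the striped test images are all assumptions made for illustration; wavelet coefficients would need a separate library and are not covered here.

```python
import numpy as np

def radial_power_spectrum(image, n_bins=16):
    """Mean Fourier power in n_bins rings of spatial frequency (cycles/pixel)."""
    image = np.asarray(image, dtype=float)
    image = image - image.mean()              # drop the constant (DC) component
    power = np.abs(np.fft.fftshift(np.fft.fft2(image))) ** 2
    h, w = image.shape
    fy = np.fft.fftshift(np.fft.fftfreq(h))   # vertical frequencies
    fx = np.fft.fftshift(np.fft.fftfreq(w))   # horizontal frequencies
    radius = np.hypot(fy[:, None], fx[None, :])
    # Ring edges from 0 up to 0.5 cycles/pixel.
    edges = np.linspace(0.0, 0.5, n_bins + 1)
    ring = np.digitize(radius.ravel(), edges) - 1
    valid = ring < n_bins                      # ignore frequencies at or beyond 0.5
    sums = np.bincount(ring[valid], weights=power.ravel()[valid], minlength=n_bins)
    counts = np.bincount(ring[valid], minlength=n_bins)
    return sums / np.maximum(counts, 1)

# Two synthetic 64 x 64 images: coarse stripes and fine stripes.
x = np.arange(64)
coarse = np.tile(np.sin(2 * np.pi * 3 * x / 64), (64, 1))
fine = np.tile(np.sin(2 * np.pi * 21 * x / 64), (64, 1))

spec_coarse = radial_power_spectrum(coarse)
spec_fine = radial_power_spectrum(fine)
print(spec_coarse.argmax(), spec_fine.argmax())
```

The coarse stripes put their power in a lower-frequency ring than the fine stripes do, so the resulting vectors can be fed to a model as texture or scale features.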

3) Feature Space Descriptors. SIFT (Scale-Invariant Feature Transform) and more recent algorithms such as SURF (Speeded-Up Robust Features), RIFT, etc. describe features around points of interest in images by building vector representations of the differences of image intensity at multiple scales. Collections of these features can be clustered and quantized in a bag-of-visual-words approach to classify images. The feature descriptors alone can also be a good proxy for texture.
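The clustering and quantization step of a bag of visual words can be sketched as follows. This is a simplified illustration, not a full pipeline: random 128-dimensional vectors stand in for real SIFT descriptors (which would come from a feature extractor), the k-means routine is a bare-bones version written for this example, and the names `kmeans`, `bovw_histogram`, and the codebook size of 8 are all assumptions.

```python
import numpy as np

def kmeans(points, k, n_iter=20, seed=0):
    """Bare-bones k-means; returns the k cluster centers (the "visual words")."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members):                 # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    return centers

def bovw_histogram(descriptors, centers):
    """Quantize descriptors to their nearest word and return normalized counts."""
    dists = np.linalg.norm(descriptors[:, None, :] - centers[None, :, :], axis=2)
    words = dists.argmin(axis=1)
    hist = np.bincount(words, minlength=len(centers)).astype(float)
    return hist / hist.sum()

# Stand-in descriptors: three "images" with different numbers of keypoints.
rng = np.random.default_rng(1)
image_descriptors = [rng.normal(size=(n, 128)) for n in (40, 55, 30)]

# Build the codebook from all descriptors, then encode each image.
codebook = kmeans(np.vstack(image_descriptors), k=8)
features = np.array([bovw_histogram(d, codebook) for d in image_descriptors])
print(features.shape)
```

Each image ends up as a fixed-length histogram regardless of how many keypoints it had, which is what lets an ordinary classifier consume it.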

—Written by Aaron Sander