Let’s discuss the Advanced Techniques for Handling Skewed Datasets in Machine Learning


By Anurag Sinha, Co-Founder & Managing Director, Wissen Technology
While data fuels machine learning, not all data is equal. Skewed data can wreak havoc on machine learning models because many parametric models, such as Linear Regression or LDA, assume that the data is normally distributed. When this assumption fails, the models’ predictions become less accurate.


What Is Skewed Data?
In machine learning, the graph of a data set created with normal distribution is symmetrical and follows a bell-shaped curve. In a normal distribution, the probability distribution is such that the skewness stands at 0 and can have a standard deviation of
The mean, median, and mode are equal.

Skewness provides insight into the shape of a distribution. The data is positively skewed when the tail on the right side of the distribution is fatter or longer, and negatively skewed when the tail on the left side is fatter or longer. When the skewness is between -0.5 and +0.5, the data can be considered fairly symmetrical.
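As a minimal sketch of how the rule of thumb above could be checked in practice, the snippet below computes the moment-based (Fisher-Pearson) skewness coefficient with NumPy and labels the result. The function names, the sample data, and the use of 0.5 as a hard cut-off are assumptions made for illustration.

```python
import numpy as np

def sample_skewness(values):
    """Moment-based skewness: third central moment / (second central moment ** 1.5)."""
    x = np.asarray(values, dtype=float)
    centered = x - x.mean()
    m2 = np.mean(centered ** 2)  # second central moment (population variance)
    m3 = np.mean(centered ** 3)  # third central moment
    return m3 / m2 ** 1.5

def describe_skew(skewness, threshold=0.5):
    """Label a skewness value using the +/-0.5 rule of thumb (threshold is an assumption)."""
    if abs(skewness) <= threshold:
        return "fairly symmetrical"
    return "positively skewed" if skewness > 0 else "negatively skewed"

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
right_skewed = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)  # long right tail
symmetric = rng.normal(loc=0.0, scale=1.0, size=10_000)         # bell-shaped

print(describe_skew(sample_skewness(right_skewed)))
print(describe_skew(sample_skewness(symmetric)))
```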

The Impact of Skewed Data
Skewed data is data that produces an asymmetrical curve when plotted. A small amount of skewness can be managed, but too much can degrade statistical and machine learning models.

Excessive skewness weakens the model’s ability to describe typical cases because the model also has to account for rare cases at extreme values.

In skewed data, the tail region acts as an outlier for the statistical model and can hurt the model’s performance. This is especially true for regression-based models; tree-based models, by contrast, can handle such outliers. To keep machine learning outcomes accurate and the model’s capabilities unaffected, the skewed data should be transformed to approximate a normal distribution.

Skewed data can be transformed to a normal distribution using the following techniques:
● Log Transformation
Log transformation is a popular way to transform skewed data. It takes the natural log of each value in the data set and is a convenient way to bring highly skewed variables closer to a normal distribution.
Log transformation also reduces the variability of the data, including data with outlying observations. Because the logarithm is defined only for positive values, zeros and negative values must be handled before it is applied.
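A short sketch of the log transformation, assuming NumPy and SciPy are available. The synthetic "incomes" data is made up for illustration, and the `np.log1p` variant for data containing zeros is a common workaround added here as an assumption, not something the article prescribes.

```python
import numpy as np
from scipy.stats import skew

# Synthetic, strictly positive, right-skewed data (illustrative only).
rng = np.random.default_rng(42)
incomes = rng.lognormal(mean=10.0, sigma=0.8, size=5_000)

# Natural log of every value; valid only because all values are > 0.
log_incomes = np.log(incomes)

print("skewness before:", skew(incomes))
print("skewness after: ", skew(log_incomes))

# Assumed workaround for zeros: log1p computes log(1 + x), so 0 maps to 0.
counts_with_zeros = np.array([0.0, 1.0, 5.0, 20.0, 400.0])
log1p_counts = np.log1p(counts_with_zeros)
```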

● Square Root Transformation

Square root transformation has a moderate effect on distribution shape and can be used to reduce right skewness. It can also be applied to zero values and is commonly applied to count data. It works especially well when the values are small and when the distribution of the residuals or error terms of a regression model is not normal.

Square root transformation can also reduce the heteroscedasticity of the residuals in linear regression. However, take care with variables that contain negative values: the square root of a negative number is undefined in the real numbers, so the transformation produces missing values for them.
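The behaviour on zero and negative values can be seen in a small NumPy sketch. The arrays are made-up examples, and shifting the data so its minimum becomes zero is one possible workaround assumed here for illustration.

```python
import numpy as np

# Count-like data, including zero: the square root is defined for all of it.
counts = np.array([0, 1, 1, 2, 3, 4, 9, 16, 100], dtype=float)
sqrt_counts = np.sqrt(counts)

# A negative value has no real square root, so NumPy returns nan for it.
with_negative = np.array([4.0, -1.0, 9.0])
with np.errstate(invalid="ignore"):  # silence the RuntimeWarning for the demo
    sqrt_with_nan = np.sqrt(with_negative)

# Assumed workaround: shift the data so the minimum is 0 before transforming.
shifted_sqrt = np.sqrt(with_negative - with_negative.min())

print(sqrt_counts)
print(sqrt_with_nan)
print(shifted_sqrt)
```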

● Cube Root Transformation
Cube root transformation is a fairly strong transformation that has a substantial effect on the shape of the data distribution. It is stronger than the square root transformation, can be used to reduce right skewness, and, unlike the square root, also works on zero and negative values.
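A brief NumPy sketch of the cube root transformation on made-up values that include zero and negatives. `np.cbrt` is used because raising a negative float to the power 1/3 with `**` returns nan in NumPy.

```python
import numpy as np

# Illustrative values spanning negative, zero, and large positive numbers.
values = np.array([-27.0, -8.0, 0.0, 1.0, 8.0, 1000.0])

# np.cbrt keeps the sign of the input, so negatives and zero are handled.
cube_roots = np.cbrt(values)

# For comparison on the largest value: the cube root compresses more than the square root.
largest = values.max()
print("sqrt:", np.sqrt(largest), "cbrt:", np.cbrt(largest))
print(cube_roots)
```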

● Box-Cox Transformation
This transformation works well to linearize data that looks like an exponential curve. It uses exponents to transform a data set and move it toward a normal distribution.

The exponent, lambda (λ), varies from -5 to 5 and is at the core of the Box-Cox transformation. The method considers all values of λ and selects the optimal one, that is, the value that gives the best approximation of a normal distribution curve for the selected data. However, the transformation works only when all the data is strictly positive. Negative data must first be made positive by adding a constant value.
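The following is a sketch of Box-Cox using `scipy.stats.boxcox`, which estimates λ by maximum likelihood when no value is supplied. The data is synthetic, and `shift_to_positive` with its `margin` parameter is a hypothetical helper written for this example to show the "add a constant" step.

```python
import numpy as np
from scipy import stats

def shift_to_positive(values, margin=1.0):
    """Hypothetical helper: add a constant so the smallest value equals `margin` (> 0)."""
    x = np.asarray(values, dtype=float)
    return x - x.min() + margin

# Synthetic, strictly positive, right-skewed data (illustrative only).
rng = np.random.default_rng(7)
positive = rng.lognormal(mean=0.0, sigma=0.75, size=2_000)

# With lmbda left unset, SciPy returns the transformed data and the fitted lambda.
transformed, best_lambda = stats.boxcox(positive)

skew_before = float(stats.skew(positive))
skew_after = float(stats.skew(transformed))
print("lambda:", best_lambda, "skew before:", skew_before, "skew after:", skew_after)

# Data with zeros or negatives is shifted first; boxcox would reject it otherwise.
mixed_sign = np.array([-3.0, 0.0, 2.0, 10.0, 50.0])
shifted = shift_to_positive(mixed_sign)
shifted_transformed, shifted_lambda = stats.boxcox(shifted)
```

A fitted λ close to 0 corresponds to a transformation close to the natural log, which is what one would expect for log-normal-like data such as this synthetic sample.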

● Removing Outliers
It is also possible to handle skewness by identifying outliers and removing them. This means finding the values that lie at extreme distances from the mean. Once these outliers are identified and removed, the skewness is usually reduced.
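One simple way to flag values at extreme distances from the mean is a z-score filter, sketched below with NumPy. The function name, the cut-off of 3 standard deviations, and the sample data are assumptions for illustration; other rules, such as IQR-based fences, are equally common.

```python
import numpy as np

def remove_outliers_zscore(values, max_z=3.0):
    """Keep only values within `max_z` standard deviations of the mean."""
    x = np.asarray(values, dtype=float)
    z_scores = (x - x.mean()) / x.std()
    return x[np.abs(z_scores) <= max_z]

# Illustrative data: the values 1..50 plus one extreme observation.
data = np.concatenate([np.arange(1, 51, dtype=float), [1000.0]])
cleaned = remove_outliers_zscore(data)

print("before:", data.size, "values, max", data.max())
print("after: ", cleaned.size, "values, max", cleaned.max())
```

Because the mean and standard deviation are themselves pulled by outliers, this filter can miss extreme values in very small samples, so the threshold is worth checking against the data.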

● Power Transformer
When data is highly skewed and looks nearly impossible to correct, the numeric features fed to linear models need a transformation that makes them more normally distributed. High skew also distorts the mean, median, minimum, and maximum values, which are the core metrics of the distribution. A power transformer applies a parametric power transform, such as Box-Cox or Yeo-Johnson, to convert skewed features into an approximately normal distribution, which can help the model achieve higher accuracy.
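As a hedged sketch, the snippet below assumes the power transformer in question is scikit-learn's `PowerTransformer`, which fits a Yeo-Johnson or Box-Cox power transform per feature. The synthetic feature and the choice of `method="yeo-johnson"` are illustrative assumptions.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

# Synthetic right-skewed feature; scikit-learn expects shape (n_samples, n_features).
rng = np.random.default_rng(3)
X = rng.lognormal(mean=0.0, sigma=1.0, size=(1_000, 1))

# Yeo-Johnson accepts zero and negative values; "box-cox" needs strictly positive input.
# standardize=True rescales the output to zero mean and unit variance.
transformer = PowerTransformer(method="yeo-johnson", standardize=True)
X_transformed = transformer.fit_transform(X)

skew_before = float(skew(X[:, 0]))
skew_after = float(skew(X_transformed[:, 0]))
fitted_lambda = float(transformer.lambdas_[0])  # one fitted lambda per feature
print("lambda:", fitted_lambda, "skew before:", skew_before, "skew after:", skew_after)
```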

Summing Up

Transforming data from a skewed state toward a normal state enhances its quality, relevance, and usefulness, which is critical for machine learning models to capture patterns and make accurate predictions. Identifying data types, distributions, and potential issues such as missing values or outliers is important for selecting the most appropriate transformation technique. It is also key to avoid over-transforming or under-transforming the data, as either can introduce further noise.

Enterprises should also implement data transformation as a part of the machine learning pipeline so that transformations are consistent and reproducible. Taking these measures helps preserve the performance and interpretability of machine learning models.
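One way to make a transformation part of the pipeline is sketched below with scikit-learn's `Pipeline`, so the parameters fitted on the training data are reused unchanged at prediction time. The synthetic data, the step names, and the choice of Box-Cox followed by linear regression are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PowerTransformer

# Synthetic data: the target is linear in log(x), while x itself is right-skewed.
rng = np.random.default_rng(1)
X = rng.lognormal(mean=0.0, sigma=1.0, size=(500, 1))
y = 3.0 * np.log(X[:, 0]) + rng.normal(scale=0.1, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# The transformer is fitted on the training split only and then applied
# identically to any new data passed through the pipeline.
model = Pipeline([
    ("power", PowerTransformer(method="box-cox")),  # X is strictly positive here
    ("regressor", LinearRegression()),
])
model.fit(X_train, y_train)

r2 = model.score(X_test, y_test)  # R^2 on held-out data
print("held-out R^2:", r2)
```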

