# Augmenting categorical datasets with synthetic data for machine learning.

**Gaussian Mixture Models for label validation in augmented, synthetic categorical data**

Consider a hypothetical but common scenario. You need to build a classifier to assign a sample to a population group. You have a sizable training dataset of one million samples. It has been cleaned, prepared and labeled. The few continuous variables are already normalized, and categorical variables, representing the majority of features, are rolled out using a one-shot encoding scheme. Now building a simple neural network classifier, or a random forrest, or another preferred method from your quiver should be a breeze. But since you are a thorough data scientist, you do a quick descriptive analysis on the dataset and find a problem. Unbalanced base rates.

The population maps to ten classes but two of those represent 98% of the data. The remaining eight classes cover 2 meager percent, hardly enough for a comfortable convergence of your statistical model. What can you do?

One approach is to augment the data and synthesize new samples for the under-represented classes with an aim to balance the base rate. The approach is quite simple and commonly used in image classification. Each image is cropped, rotated, shifted multiple times to increase the number of available samples for the classifier to learn. There is a soft guarantee that minor perturbations to the image will not change the classification label. So the technique works very well. Why not use a similar approach for our dataset of continuous and categorical variables? Can’t we randomly, but minutely perturb the dataset and expand our under-represented classes?

We can not. In fact, we can run into a serious trouble. While augmenting images typically preserves the relationship between pixels and thus the label assignment, perturbing a dataset with the mostly binary categorical features can shift the perturbed sample into an entirely different class. For an example, think of a dataset of runners and swimmers, then consider randomly flipping a gender feature between 0 or 1. This perturbation may move a female swimmer into a male swimmer category during augmentation but will not change the label. And, in the end, a pair of useless swimming speedos may be recommended to someone looking for a one-piece suit.

I will describe an approach that addressed the label shift and worked for our projects in the past. The primary thrust of the idea is

Select all samples from the under-represented classes

Copy each class in a subset N times to achieve a reasonable representation

Randomly perturb features in each subset using the distribution’s mean and standard deviation as perturbation bounds, and normalize to 0 and 1 for categorical values

Validate that each new sample belongs to a proper class, or reassign to a new class using Gaussian Mixture Models and Mahalanobis Distance.

The practical mechanics of the implementation slightly diverges from the meta algorithm above, but retains the theoretical structure. We will discuss it next.

## Gaussian Mixture Models for a sample analysis.

For intuition, let’s think of Gaussian Mixture Models (GMMs) as a generalization of K-Means clustering, since most people worked with K-Means at some point. One difference is that GMMs take into account an ellipsoid shape of the multi-dimensional distribution and allow for a multi-component assignment of each sample, whereas K-Means works on a spherical distribution assumption and a single component assignment. I will not be discussing the theory behind GMMs, but will use their benefits to our cause.

Mahalanobis Distance (MD) is a multi-dimensional generalization of how many standard deviations away a sample is from the mean of a distribution. The measure is unit-less and scale invariant, and it respects the elliptical shape of the data, increasing as samples move away from the mean along each principle component axis. Here I would like to insert some LaTex code to illustrate the function but Medium does not support it :-(.

As a reminder, our goal is to synthesize new class samples by perturbing existing samples, in order to balance the base rate for a successful classifier training. Our problem is the high number of categorical features that make the perturbations unstable — a small change to categorical features may precipitate a large change to a label — a new sample may become a member of a different class. Thus, we must validate each new sample, and reassign a new label if the perturbation shifts the sample to a new class.

Here is how we do it for our N class example with N = 10. I will assume you had some experience with GMMs and Pandas. Note for the Python police: the code is written for clarity, not efficiency — yes, there are redundancies, and there is indeed a better way to do things.

1) Before any augmentations let’s train 10 separate single component Gaussian Mixture Models. We extract the subset of all samples from each class then successively train a single GMM per class and store μ and σ values in a dictionary. We will use scikit implementation of GMMs.

2) Let’s calculate the average Mahalanobis Distance (MD) for each distribution, we will use it later for membership checks.

3) Let’s use Mahalanobis Distance (MD) as a sanity check to validate that a subset of our known samples maps to its corresponding class represented by the newly trained GMMs. The idea here is to calculate Mahalanobis Distance from each sample to all GMMs. The minimum distance gives us the class to which we assign the sample. We use the correct and incorrect assignments to calculate accuracy. *Note: if the accuracy of the known sample mapping is low, then your classes are not easily separable and the approach will not work. We aim for a 90% accuracy as the lower bound, but the accuracy target should depend on the specifics of your problem.*

4) Extract all underrepresented classes into a separate subset, and capture distribution statistics per class. We will use means and standard deviation values as parameters to randomly generate new samples for each class using Gaussian priors. This is equivalent to perturbing an existing sample set. It is a practical deviation from the meta algorithm I presented earlier but accomplishes the same goal.

5) Iterate through new samples and measure MD to each GMM. The minimum MD will indicate the class assignment. If the new sample is the closest to the correct GMM then retain the class label. If the new sample is the closest to a different GMM, then first check that the sample’s MD is within the range of the average we calculated for that GMM in step 2. If yes, then re-assign the sample class; if not, discard the sample.

This process augments continuous and categorical data with a reliable label preservation. Conceptually, the process is similar to the image augmentation which is widespread in machine vision. However, many datasets are not amenable to this approach. You have to know your data and reasonably judge the probability of a successful augmentation.

This approach helped us to train classifiers with unbalanced datasets that generalized well to unseen data. I hope it comes useful in your work.