Balancing classes for multiclass classification in Python

Question

Good afternoon. I have a classification task. The training set has 50,000 rows, and Y has 60 labels. The data is imbalanced, though: one class has 35,000 rows, and the other 59 classes share the remaining 15,000, with some classes having only 30 rows. For example, with X (column_1, column_2, column_3) and Y:

column_1  column_2  column_3  Y
0.5       1         2         1
0.5       1.1       2         1
0.55      0.95      3         1
0.1       1         2         2
2         0.9       3         3

I need to add "noisy" data to remove the imbalance, roughly so that every class ends up with the same number of rows:

column_1  column_2  column_3  Y
0.5       1         2         1
0.5       1.1       2         1
0.55      0.95      3         1
0.1       1         2         2
0.15      0.99      2         2
0.05      1.01      2         2
2         0.9       3         3
1.95      0.95      3         3
2.05      0.85      3         3

This is only a toy example; my real data has many more values. Thanks.

What kind of data do you have - pictures, digitized / vectorized text, something else?
In reality, there are simply more X columns (about 400) and more Y labels (60).
Without understanding the nature of the data, it is difficult to give good advice.
For example, for images there are dedicated functions that perform "data augmentation".
SVM is the name of an algorithm, not a module or library... Please post the relevant part of your code (see "How to create a minimal, complete, and reproducible example").

passant · Answer 1 · 2018-06-13T08:03:39

If you really have a classification task, why do you need to balance the class sizes? Is balancing a goal in itself, or is it needed because SVM did not work for you?

On the other hand, there are classification methods (admittedly "two-class" ones, but still) that can classify even when you have no examples of the second class at all. Search for the phrase "One-class classification".

Finally, if you do decide to balance (which is very risky, since in your case you essentially change the distribution of the original sample), look into resampling and bootstrap methods.

As a starting point, I can suggest:

1. V.K. Shitikov, G.S. Rosenberg. Randomization and bootstrap: statistical analysis in biology and ecology using R.

2. S. Anatolyev. Basics of bootstrapping.

3. https://machinelearningmastery.com/a-gentle-introduction-to-the-bootstrap-method/
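
To make the resampling idea concrete, here is a minimal sketch of bootstrap-style oversampling, applied to the toy table from the question. It is not taken from the references above. The function name oversample_with_jitter, the noise_scale parameter and the choice of Gaussian noise on every feature are all assumptions made for illustration. It uses plain Python for readability; the same idea vectorizes with NumPy or pandas.

```python
import random
from collections import defaultdict


def oversample_with_jitter(X, y, noise_scale=0.05, seed=0):
    """Resample every smaller class up to the size of the largest class.

    Hypothetical helper: each synthetic row is an existing row of the same
    class, drawn with replacement (bootstrap style), with Gaussian noise
    added to every feature. noise_scale is an assumed, untuned value.
    """
    rng = random.Random(seed)

    # Group the original rows by class label.
    rows_by_label = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_label[label].append(row)

    # Every class is grown to the size of the largest one.
    target = max(len(rows) for rows in rows_by_label.values())

    # Original rows are kept unchanged; synthetic rows are appended.
    X_out, y_out = list(X), list(y)
    for label, rows in rows_by_label.items():
        for _ in range(target - len(rows)):
            base = rng.choice(rows)  # sampling with replacement
            X_out.append([v + rng.gauss(0.0, noise_scale) for v in base])
            y_out.append(label)
    return X_out, y_out


# Toy data from the question: three rows of class 1, one each of classes 2 and 3.
X = [[0.5, 1, 2], [0.5, 1.1, 2], [0.55, 0.95, 3], [0.1, 1, 2], [2, 0.9, 3]]
y = [1, 1, 1, 2, 3]
X_bal, y_bal = oversample_with_jitter(X, y)
```

After the call, each class has three rows. The noise is added to all columns, including ones that look discrete (such as column_3), so a real pipeline would probably need to skip or round those. Any oversampling of this kind should be applied to the training split only, so that synthetic copies of a row do not leak into the evaluation data.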
