Good afternoon. I have a classification problem. The training set has 50,000 rows, and Y has 60 labels. But the data is imbalanced: one class has 35,000 samples, and the other 59 classes share the remaining 15,000, some of them with only 30 samples. As an example, take X (column_1, column_2, column_3) and Y:

colum_1  colum_2  colum_2  Y
0.5      1        2        1
0.5      1.1      2        1
0.55     0.95     3        1
0.1      1        2        2
2        0.9      3        3

I need to add "noisy" data to remove the imbalance, so that, roughly speaking, every class ends up with the same number of samples:

colum_1  colum_2  colum_2  Y
0.5      1        2        1
0.5      1.1      2        1
0.55     0.95     3        1
0.1      1        2        2
0.15     0.99     2        2
0.05     1.01     2        2
2        0.9      3        3
1.95     0.95     3        3
2.05     0.85     3        3

This is only a toy example; my real data has many more values. Thanks.

  • What kind of data do you have: images, digitized / vectorized text, something else? What module / package is used for classification? - MaxU 2:21 pm
  • The sample data has the same form as what I gave. In reality there are simply more X columns (about 400) and more Y labels (60). The classification itself is then done with an SVM. - Rudolf Morkovskyi
  • Without understanding the nature of the data, it is difficult to give good advice. For example, for images there are special functions that do "data augmentation". Regarding the "noisy" data: what are the limits of the noise? Why does colum_2 not change? SVM is the name of an algorithm, not of a module / library... Please post the relevant part of the code (see "How to create a minimal, self-sufficient and reproducible example"). - MaxU

1 answer

If you really have a classification task, why do you need to balance the class sizes? Is balancing a goal in itself, or is it needed because the SVM methods did not work for you?

On the other hand, there are classification methods (admittedly "two-class" ones, but nonetheless) that can classify even when you have no examples of the second class at all. Search for the phrase "One-class classification".

Finally, if you do decide to balance the classes (which is very risky, since in your case it essentially changes the distribution of the original sample), look into resampling and bootstrap methods.
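As a rough illustration of what such resampling could look like on the toy data from the question, here is a minimal sketch in plain Python, not a definitive recipe. It draws rows of each smaller class with replacement (a bootstrap draw) until every class is as large as the biggest one, and adds small Gaussian noise to each drawn row. The function name oversample_with_jitter, the noise_scale parameter and the choice to scale the noise by each column's standard deviation are my own assumptions; the question does not say what the limits of the noise should be.

```python
import random
from collections import Counter, defaultdict


def oversample_with_jitter(X, y, noise_scale=0.05, seed=0):
    """Bootstrap-resample smaller classes up to the largest class size,
    adding Gaussian noise to every resampled row.

    noise_scale is a hypothetical knob: the noise standard deviation is
    noise_scale times the standard deviation of each column.
    """
    rng = random.Random(seed)

    # Group the original rows by label.
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    target = max(len(rows) for rows in by_class.values())

    # Per-column standard deviation over the whole sample.
    n, n_cols = len(X), len(X[0])
    means = [sum(r[j] for r in X) / n for j in range(n_cols)]
    stds = [
        (sum((r[j] - means[j]) ** 2 for r in X) / n) ** 0.5
        for j in range(n_cols)
    ]

    # Keep the original rows and append noisy bootstrap copies.
    X_out, y_out = list(X), list(y)
    for label, rows in by_class.items():
        for _ in range(target - len(rows)):
            base = rng.choice(rows)  # draw with replacement
            noisy = [v + rng.gauss(0.0, noise_scale * s)
                     for v, s in zip(base, stds)]
            X_out.append(noisy)
            y_out.append(label)
    return X_out, y_out


# Toy data from the question.
X = [[0.5, 1, 2], [0.5, 1.1, 2], [0.55, 0.95, 3], [0.1, 1, 2], [2, 0.9, 3]]
y = [1, 1, 1, 2, 3]

X_bal, y_bal = oversample_with_jitter(X, y)
print(Counter(y_bal))  # every label now has as many rows as the largest class
```

On the toy data each of the three labels ends up with 3 rows. Note that this sketch also perturbs the third column, which stays fixed within a class in the question's example. If the balanced set is used to train the SVM, apply the resampling to the training part only, so that the test data keeps the original distribution.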

To get started, I can suggest:

1. V.K. Shitikov, G.S. Rosenberg. Randomization and bootstrap: statistical analysis in biology and ecology using R.

2. S. Anatolyev. Basics of bootstrapping.

3. https://machinelearningmastery.com/a-gentle-introduction-to-the-bootstrap-method/