I need to build a multiclass classifier (5 classes) on a highly imbalanced sample.

    > table(d$class)

        0   0.3   0.5   0.7     1
    12385   736   733    25  1869

If you simply run randomForest on it, you get nothing useful. So the sample must somehow be balanced, but I don't know how to do this. All the balancing packages I have found assume binary classification only. Packages I looked at:

  • unbalanced
  • ROSE

I have read this article.

I thought randomForest might offer a way to set class costs or sampling, but I could not find one either. How can this problem be solved?

  • Have you tried building individual models on the "one against all" principle? You could first learn to separate, for example, the rarest class (combining the remaining four into one). - Ogurtsov
  • The topic is covered in the book dmkpress.com/catalog/computer/programming/978-5-97060-273-7, chapter 3. You can google the English original ( cambridge.org/gb/academic/subjects/computer-science/… ). - Ogurtsov
  • @Ogurtsov Is it essential to single out the rarest class first? I would do the opposite: first learn to detect the most frequent class ("0"), then simply drop it from the training set. That is, the follow-up training would be "one against 3" rather than "one against 4". Or should that not be done at all? - Yury Arrow
  • Try it and compare the different approaches (without sample data nobody can really verify anything). Tomorrow I will try to post an excerpt from the book I am referring to; that should add clarity. - Ogurtsov
  • I have this book. But it is not clear how to predict the positive and negative cases with a random forest... - Yury Arrow
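The relabeling discussed in the comments can be sketched in a few lines of base R. This is not the asker's code, just an illustration on a toy vector with the same five labels; both variants from the thread are shown.

```r
# Toy class vector with the question's five labels (not the real data).
cls <- factor(c("0", "0", "0", "0.3", "0.5", "0.7", "1", "0", "1"))

# Variant 1 (Ogurtsov): rarest class against the other four combined.
bin_rare <- factor(ifelse(cls == "0.7", "rare", "rest"))

# Variant 2 (Yury Arrow): first learn to detect the dominant "0",
# then drop it and continue "one against 3" on what remains.
bin_zero  <- factor(ifelse(cls == "0", "zero", "other"))
remaining <- droplevels(cls[cls != "0"])

print(table(bin_rare))
print(table(bin_zero))
print(levels(remaining))  # the four classes left for the next stage
```

Each binary target would then be fed to its own randomForest model.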

2 answers

The caret package has two functions, upSample() and downSample(), that solve this problem. Class balancing is necessary regardless of the classification method used.
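For intuition, the effect of downSample() can be reproduced in base R: every class is sampled down to the size of the rarest one. The sketch below uses made-up toy data and is not caret's actual implementation.

```r
# Base-R sketch of what caret::downSample() does: draw min-class-size
# rows from each class. Toy data frame, three deliberately unequal classes.
set.seed(1)
d <- data.frame(
  x     = rnorm(30),
  class = factor(rep(c("a", "b", "c"), times = c(20, 7, 3)))
)

min_size <- min(table(d$class))          # size of the rarest class
balanced <- do.call(rbind, lapply(split(d, d$class), function(part) {
  part[sample(nrow(part), min_size), , drop = FALSE]
}))

print(table(balanced$class))  # every class now has min_size rows
```

upSample() does the mirror image: it resamples every class (with replacement) up to the size of the most frequent one.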

Update

To answer more broadly: class balancing is just one of the many important steps to complete before you start training a model. Briefly, these include selection and evaluation of input variables, splitting into training and test sets (preferably stratified), class balancing (on the training set only), preprocessing (normalization, standardization, etc.), shuffling the training set, and so on. This 80% of the work determines the quality of the modeling result.
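One of the steps listed above, the stratified train/test split, can be sketched in base R (caret's createDataPartition() does the same thing more conveniently). Toy data; the 70% split fraction is an arbitrary choice for illustration.

```r
# Stratified split: take ~70% of *each class* for training, so the
# class proportions are preserved in both parts. Toy 80/20 data.
set.seed(2)
d <- data.frame(
  x     = rnorm(100),
  class = factor(rep(c("0", "1"), times = c(80, 20)))
)

train_idx <- unlist(lapply(split(seq_len(nrow(d)), d$class), function(idx) {
  sample(idx, floor(0.7 * length(idx)))
}))

train <- d[train_idx, ]
test  <- d[-train_idx, ]
print(table(train$class))  # 56 and 14: the 80/20 ratio is preserved
```

Balancing would then be applied to `train` only, never to `test`.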

A full answer would require an article of considerable length. The need for class balancing is confirmed by numerous experiments (not only mine) with many models. Just compare the classification results on identical sets with and without balancing; you will see for yourself.

    This function solves the problem.

        my.strata <- function(v) {
          tmp <- as.vector(table(v))
          num_classes <- length(tmp)
          min_size <- tmp[order(tmp, decreasing = FALSE)[1]]  # size of the smallest class
          rep(min_size, num_classes)
        }

        randomForest(..., sampsize = my.strata( ), ...)
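Applied to a class vector with the same counts as in the question, my.strata returns a vector of five 25s, i.e. it asks randomForest to draw 25 cases (the size of the rarest class) from each class. The demonstration below rebuilds such a vector from the question's table; the function body is the one from this answer.

```r
# The answer's helper: one sample size (the minority-class size) per class.
my.strata <- function(v) {
  tmp <- as.vector(table(v))
  num_classes <- length(tmp)
  min_size <- tmp[order(tmp, decreasing = FALSE)[1]]
  rep(min_size, num_classes)
}

# Factor with the class counts from the question: 12385/736/733/25/1869.
cls <- factor(rep(c("0", "0.3", "0.5", "0.7", "1"),
                  times = c(12385, 736, 733, 25, 1869)))

print(my.strata(cls))  # 25 25 25 25 25
```

In randomForest classification, passing a sampsize vector of length nlevels(cls) makes the per-tree bootstrap stratified, which is what balances the training of each tree.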