I have been given input data with a large number of missing values. I cannot simply drop the rows with gaps, because very little data would remain. The data types are numeric and categorical. I work with Python and Pandas.

How can I fill in the gaps in the data? What fill strategies exist, and when should each one be used?

My final goal is to evaluate the correlation of the variables (numeric and categorical) with a binary target variable.

    2 answers

    There are not many meaningful options for filling in missing data: random filling, filling with the mean (or median), recovering the values according to the distribution law of the data, and filling with the output of a predictive model (in various modifications).
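A minimal sketch of the simpler strategies from this list in Pandas. The DataFrame, its column names (`age`, `city`), and the variable names are made up for illustration; model-based filling is omitted here because it depends on the chosen model.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0, np.nan],
    "city": ["A", "B", None, "A", "A"],
})

# Mean and median filling for a numeric column.
age_mean = df["age"].fillna(df["age"].mean())
age_median = df["age"].fillna(df["age"].median())

# Most frequent value (mode) for a categorical column.
city_mode = df["city"].fillna(df["city"].mode().iloc[0])

# Random filling: draw replacements from the observed values,
# which roughly follows the empirical distribution of the column.
rng = np.random.default_rng(0)
observed = df["age"].dropna().to_numpy()
missing = df["age"].isna()
age_random = df["age"].copy()
age_random.loc[missing] = rng.choice(observed, size=int(missing.sum()))
```

Mean and median filling keep the centre of the column but shrink its variance, while random draws from the observed values keep the spread at the cost of adding noise.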

    The trouble is that whichever method you use, the accuracy of the solution to your main task will decrease significantly and, most importantly, it is not always possible to estimate by how much. With a really large amount of missing data, it may be more rational to build a less accurate model from the data that remains after deleting the missing values than a theoretically more accurate one built on "recovered" values.

    Here are some links that describe the options, some with Python code. They may be useful.

    https://gallery.azure.ai/Experiment/Methods-for-handling-missing-values-1
    https://towardsdatascience.com/the-art-of-cleaning-your-data-b713dbd49726
    https://towardsdatascience.com/the-tale-of-missing-values-in-python-c96beb0e8a9d
    https://towardsdatascience.com/working-with-missing-data-in-machine-learning-9c0a430df4ce
    https://www.analyticsvidhya.com/blog/2016/01/guide-data-exploration/

    • Good day! Thank you very much for your answer! Could you please add the most important points from the linked publications to your answer? Links tend to go dead. - Nicolas Chabanovsky ♦
    • I think I have already pointed out the most important things: first, a list of the basic, fundamental approaches, and second, the reasoning about the accuracy that is lost when missing values are restored. Everything else is specific details of specific methods. One could, of course, translate these articles, but I do not want to "plagiarize". One could also write one's own article, but a quality article on this topic will definitely not fit this format and definitely cannot be written "right now". - passant
    • Thanks for the clarifications! I agree, the topic is quite voluminous. - Nicolas Chabanovsky ♦

    General remark

    When choosing an approach to filling in empty values, you need to consider what will be done with the data after filling. If you plan to compute a correlation, be sure to check what idea the chosen correlation measure is based on.

    Specifically for this question

    For numeric features, it is probably worth filling the gaps with the mean, because the correlation between a binary and a real-valued variable is then computed through the difference of expected values ( E[X1|X2=1] - E[X1|X2=0] ), so it is important to keep the expected value unchanged.
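A small sketch of this idea, under the assumption that the numeric column is called `x` and the binary target `target` (both names and the data are hypothetical). It fills the gaps with the mean, computes the difference of the conditional means, and also computes the point-biserial correlation with `scipy.stats.pointbiserialr`, which is one common choice for a binary/real pair and not necessarily the measure the author had in mind.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical data: one numeric feature with gaps and a binary target.
df = pd.DataFrame({
    "x": [1.0, 2.0, np.nan, 4.0, 5.0, np.nan, 7.0, 8.0],
    "target": [0, 0, 0, 0, 1, 1, 1, 1],
})

# Fill the gaps with the mean of the observed values.
x_filled = df["x"].fillna(df["x"].mean())

# The overall mean of the column is unchanged by mean filling.
overall_mean_before = df["x"].mean()
overall_mean_after = x_filled.mean()

# Difference of conditional means: E[X | target=1] - E[X | target=0].
diff = x_filled[df["target"] == 1].mean() - x_filled[df["target"] == 0].mean()

# Point-biserial correlation between the binary target and the filled feature.
r, p_value = stats.pointbiserialr(df["target"], x_filled)
```

Note that only the overall mean is guaranteed to stay the same; the per-group means can still move toward each other after filling, so it is worth comparing `diff` with the value obtained on the rows without gaps.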

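    For categorical features, the correlation with a binary feature can be calculated using Cramér's V coefficient:

        import numpy as np
        from scipy import stats

        def cramers_v(confusion_matrix):
            # confusion_matrix: contingency table as a 2-D NumPy array of counts
            chi2 = stats.chi2_contingency(confusion_matrix)[0]  # chi-squared statistic
            n = confusion_matrix.sum()  # total number of observations
            return np.sqrt(chi2 / (n * (min(confusion_matrix.shape) - 1)))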

    That is, the input is the contingency table. If there is enough data, I would exclude all rows with missing values. The second approach is to introduce a new category for each feature, for example "no_value", but in that case it will also appear in the contingency table.
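Both approaches can be sketched with `pd.crosstab`. The data and the column names (`color`, `target`) are invented for illustration, and the Cramér's V formula is repeated inline so that the snippet runs on its own.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical data: a categorical feature with gaps and a binary target.
df = pd.DataFrame({
    "color": ["red", "blue", None, "red", "blue", None, "red", "green"],
    "target": [1, 0, 1, 1, 0, 0, 1, 0],
})

# Approach 1: drop the rows where the feature is missing.
complete = df.dropna(subset=["color"])
table_dropped = pd.crosstab(complete["color"], complete["target"])

# Approach 2: treat missing values as their own category.
color = df["color"].fillna("no_value")
table_filled = pd.crosstab(color, df["target"])

# Cramér's V on the table that includes the "no_value" row.
counts = table_filled.to_numpy()
chi2 = stats.chi2_contingency(counts)[0]
n = counts.sum()
v = np.sqrt(chi2 / (n * (min(counts.shape) - 1)))
```

With the second approach the extra `no_value` row takes part in the chi-squared statistic, so the coefficient also reflects whether missingness itself is related to the target.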

    • 1. If we are talking about finding the correlation of two variables, one real and one binary, then the expected value of the real variable can be preserved both by filling with the mean and by deleting rows. In terms of effort, dropping rows is easier than computing the mean. For multidimensional problems, filling with means knowingly worsens the result. - passant
    • 2. For the correlation between binary and categorical features, the rank-biserial correlation coefficient is usually used. And when it is used, filling with the mean becomes unambiguously incorrect. - passant
    • 3. Cramér's coefficient, being a modification of Pearson's contingency coefficient, as well as Tschuprow's coefficient and the Romanovsky criterion, are used to detect the association of features measured on nominal scales. For other cases, other, more statistically powerful criteria are used. - passant
    • @passant Thank you! Please add these comments to an answer. The information is very useful! - Nicolas Chabanovsky ♦