I receive input data with 40 categorical features. The data contains null values. The number of categories per feature is not known in advance, and the categories are strings. The task: compute the correlation of each feature with the binary target variable using CramΓ©r's V coefficient, which takes a contingency table as input. I compute it as follows:

# ΠŸΠΎΠ΄ΡΡ‡ΠΈΡ‚Π°Π½Π½Ρ‹Π΅ значСния коррСляции ΠΏΡ€ΠΈΠ·Π½Π°ΠΊΠΎΠ² categorical_corrs = list() for column in data.columns: # Для ΠΊΠ°ΠΆΠ΄ΠΎΠ³ΠΎ ΠΏΡ€ΠΈΠ·Π½Π°ΠΊΠ° ΠΏΠΎΠ»ΡƒΡ‡Π°ΡŽ список ΡƒΠ½ΠΈΠΊΠ°Π»ΡŒΠ½Ρ‹Ρ… Π·Π½Π°Ρ‡Π΅Π½ΠΈΠΉ, # Π·Π° Π²Ρ‹Ρ‡Π΅Ρ‚ΠΎΠΌ ΠΏΡ€ΠΎΠΏΡƒΡ‰Π΅Π½Π½Ρ‹Ρ… ячССк categories = data[column].dropna().unique() confusion_matrix = [[], []] for category in categories: # Для ΠΊΠ°ΠΆΠ΄ΠΎΠΉ ΠΊΠ°Ρ‚Π΅Π³ΠΎΡ€ΠΈΠΈ считаСм количСство Ρ€Π΅Π°Π»ΠΈΠ·Π°Ρ†ΠΈΠΉ для Π·Π½Π°Ρ‡Π΅Π½ΠΈΠΉ 0 ΠΈ 1 confusion_matrix[0].append( len(data.loc[(labels[0] == 0) & (data[column] == category), column]) ) confusion_matrix[1].append( len(data.loc[(labels[0] == 1) & (data[column] == category), column]) ) result = cramers_stat(np.array(confusion_matrix)) # ΠŸΡ€ΠΎΠ²Π΅Ρ€ΠΊΠ° Π½Π° ΠΈΡΠΊΠ»ΡŽΡ‡ΠΈΡ‚Π΅Π»ΡŒΠ½Ρ‹Π΅ случаи if result == -1: print column, categories, confusion_matrix categorical_corrs.append(result) 

Each feature has 40,000 entries (including missing values). The code above takes quite a long time to run. Is there a more efficient way to compute the contingency table?

P.S. The data can be downloaded from here (the "small" dataset).

  • Can you provide a reproducible data sample? - MaxU
  • @MaxU I'm working with the "small" dataset from KDD Cup 2009. Or do you need a subsample of a few elements? - Nicolas Chabanovsky ♦
  • What is labels[0] equal to? - MaxU
  • @MaxU I downloaded them from here. - Nicolas Chabanovsky ♦
  • But there are only two values there, -1 and 1, while your code uses 0 and 1? - MaxU

2 answers

Try using the function for computing CramΓ©r's V from this answer:

    import numpy as np
    import scipy.stats as ss

    def cramers_corrected_stat(confusion_matrix):
        """Calculate Cramer's V statistic for categorical-categorical association.
        Uses the correction from Bergsma and Wicher,
        Journal of the Korean Statistical Society 42 (2013): 323-328.
        """
        chi2 = ss.chi2_contingency(confusion_matrix)[0]
        n = confusion_matrix.sum().sum()
        phi2 = chi2 / n
        r, k = confusion_matrix.shape
        phi2corr = max(0, phi2 - ((k - 1) * (r - 1)) / (n - 1))
        rcorr = r - ((r - 1) ** 2) / (n - 1)
        kcorr = k - ((k - 1) ** 2) / (n - 1)
        return np.sqrt(phi2corr / min((kcorr - 1), (rcorr - 1)))
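For reference, the uncorrected statistic is V = √((χ²/n) / min(k−1, r−1)), where χ² is the chi-square statistic of the r Γ— k contingency table and n is the total number of observations; the function above additionally applies the Bergsma–Wicher bias correction to φ² = χ²/n and to r and k before taking the square root.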

To compute the confusion_matrix, you can use the pd.crosstab() function.

Example:

    import pandas as pd

    try:
        from pathlib import Path
    except ImportError:
        from pathlib2 import Path  # Python 2 backport

    WORK_DIR = Path(r'D:\data\927487')

    # features (tab-separated) and the binary target labels
    train = pd.read_csv(WORK_DIR / 'orange_small_train.data', sep='\t')
    labels = pd.read_csv(WORK_DIR / 'orange_small_train_appetency.labels',
                         header=None, squeeze=True, dtype='int8')

    In [51]: confusion_mx = pd.crosstab(labels, train['Var1'])

    In [52]: confusion_mx
    Out[52]:
    Var1  0.0  8.0  16.0  24.0  32.0  40.0  48.0  56.0  64.0  72.0  80.0  120.0  128.0  152.0  360.0  392.0  536.0  680.0
    0
    -1    371  134    80    46    21     9     6     5     1     3     1      0      2      1      1      1      1      1
     1      9    4     1     0     2     1     0     0     0     0     0      1      0      0      0      0      0      0

    In [53]: cramers_corrected_stat(confusion_mx)
    Out[53]: 0.20395161570145692
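To address the per-feature loop from the question, the two pieces can be combined. A minimal sketch, assuming train, labels and cramers_corrected_stat are defined as above:

    # Sketch: compute the corrected Cramer's V for every feature via crosstab
    categorical_corrs = {}
    for column in train.columns:
        # crosstab drops NaN cells automatically, mirroring dropna() in the question
        confusion_mx = pd.crosstab(labels, train[column])
        if confusion_mx.shape[1] < 2:
            # a single category (or an empty column): the statistic is undefined
            continue
        categorical_corrs[column] = cramers_corrected_stat(confusion_mx)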

P.S. Assessing correlation between categorical variables is a difficult process and often requires a good understanding of the domain/business data.

    The Python library pandas has a crosstab function that builds exactly the contingency tables you need. Try it; most likely its implementation is faster than yours, since the heavy lifting is done in compiled code.
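    For instance, a single crosstab call builds the whole 2 Γ— n_categories table at once, replacing the inner loop over categories. A sketch, assuming data, labels and cramers_stat from the question:

        import pandas as pd

        # Sketch: one crosstab call per feature instead of the manual inner loop
        for column in data.columns:
            # rows are the target values, columns are the categories; NaNs are dropped
            confusion_matrix = pd.crosstab(labels[0], data[column])
            result = cramers_stat(confusion_matrix.values)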