I receive input data with 40 categorical features. The data contains null values. The number of categories per feature is not known in advance, and the categories are strings. The task: compute the correlation of each feature with the binary target variable using CramΓ©r's V coefficient, which takes a contingency table as input. I compute it as follows:

# ΠŸΠΎΠ΄ΡΡ‡ΠΈΡ‚Π°Π½Π½Ρ‹Π΅ значСния коррСляции ΠΏΡ€ΠΈΠ·Π½Π°ΠΊΠΎΠ² categorical_corrs = list() for column in data.columns: # Для ΠΊΠ°ΠΆΠ΄ΠΎΠ³ΠΎ ΠΏΡ€ΠΈΠ·Π½Π°ΠΊΠ° ΠΏΠΎΠ»ΡƒΡ‡Π°ΡŽ список ΡƒΠ½ΠΈΠΊΠ°Π»ΡŒΠ½Ρ‹Ρ… Π·Π½Π°Ρ‡Π΅Π½ΠΈΠΉ, # Π·Π° Π²Ρ‹Ρ‡Π΅Ρ‚ΠΎΠΌ ΠΏΡ€ΠΎΠΏΡƒΡ‰Π΅Π½Π½Ρ‹Ρ… ячССк categories = data[column].dropna().unique() confusion_matrix = [[], []] for category in categories: # Для ΠΊΠ°ΠΆΠ΄ΠΎΠΉ ΠΊΠ°Ρ‚Π΅Π³ΠΎΡ€ΠΈΠΈ считаСм количСство Ρ€Π΅Π°Π»ΠΈΠ·Π°Ρ†ΠΈΠΉ для Π·Π½Π°Ρ‡Π΅Π½ΠΈΠΉ 0 ΠΈ 1 confusion_matrix[0].append( len(data.loc[(labels[0] == 0) & (data[column] == category), column]) ) confusion_matrix[1].append( len(data.loc[(labels[0] == 1) & (data[column] == category), column]) ) result = cramers_stat(np.array(confusion_matrix)) # ΠŸΡ€ΠΎΠ²Π΅Ρ€ΠΊΠ° Π½Π° ΠΈΡΠΊΠ»ΡŽΡ‡ΠΈΡ‚Π΅Π»ΡŒΠ½Ρ‹Π΅ случаи if result == -1: print column, categories, confusion_matrix categorical_corrs.append(result) 

Each feature has 40,000 entries (including missing values). The code above takes quite a long time to run. Is there a more efficient way to compute the contingency table?

P.S. The data can be downloaded from here (the "small" dataset).

  • Can you provide a reproducible data sample? - MaxU
  • @MaxU I'm working with the "small" dataset from KDD Cup 2009. Or do you need a subsample of a few elements? - Nicolas Chabanovsky ♦
  • What is labels[0] equal to? - MaxU
  • @MaxU I downloaded them from here. - Nicolas Chabanovsky ♦
  • But there are only two values there, -1 and 1, while your code uses 0 and 1? - MaxU

2 answers

Try using the function for computing CramΓ©r's V from this answer:

    import numpy as np
    import scipy.stats as ss

    def cramers_corrected_stat(confusion_matrix):
        """Calculate Cramer's V statistic for categorical-categorical association.
        Uses the correction from Bergsma and Wicher,
        Journal of the Korean Statistical Society 42 (2013): 323-328.
        """
        chi2 = ss.chi2_contingency(confusion_matrix)[0]
        n = confusion_matrix.sum().sum()
        phi2 = chi2 / n
        r, k = confusion_matrix.shape
        phi2corr = max(0, phi2 - ((k - 1) * (r - 1)) / (n - 1))
        rcorr = r - ((r - 1) ** 2) / (n - 1)
        kcorr = k - ((k - 1) ** 2) / (n - 1)
        return np.sqrt(phi2corr / min((kcorr - 1), (rcorr - 1)))
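For reference, the uncorrected statistic is V = √((χ²/n) / min(k−1, r−1)), where χ² is the chi-square statistic of the r Γ— k contingency table and n is the total number of observations; the function above additionally applies the Bergsma–Wicher bias correction to φ² = χ²/n and to r and k before taking the square root.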

To compute the confusion_matrix, you can use the pd.crosstab() function.

Example:

    import pandas as pd

    try:
        from pathlib import Path
    except ImportError:
        from pathlib2 import Path  # Python 2 backport

    WORK_DIR = Path(r'D:\data\927487')

    # features (tab-separated) and the binary target labels
    train = pd.read_csv(WORK_DIR / 'orange_small_train.data', sep='\t')
    labels = pd.read_csv(WORK_DIR / 'orange_small_train_appetency.labels',
                         header=None, squeeze=True, dtype='int8')

    In [51]: confusion_mx = pd.crosstab(labels, train['Var1'])

    In [52]: confusion_mx
    Out[52]:
    Var1  0.0  8.0  16.0  24.0  32.0  40.0  48.0  56.0  64.0  72.0  80.0  120.0  128.0  152.0  360.0  392.0  536.0  680.0
    0
    -1    371  134    80    46    21     9     6     5     1     3     1      0      2      1      1      1      1      1
     1      9    4     1     0     2     1     0     0     0     0     0      1      0      0      0      0      0      0

    In [53]: cramers_corrected_stat(confusion_mx)
    Out[53]: 0.20395161570145692
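To address the per-feature loop from the question, the two pieces can be combined. A minimal sketch, assuming train, labels and cramers_corrected_stat are defined as above:

    # Sketch: compute the corrected Cramer's V for every feature via crosstab
    categorical_corrs = {}
    for column in train.columns:
        # crosstab drops NaN cells automatically, mirroring dropna() in the question
        confusion_mx = pd.crosstab(labels, train[column])
        if confusion_mx.shape[1] < 2:
            # a single category (or an empty column): the statistic is undefined
            continue
        categorical_corrs[column] = cramers_corrected_stat(confusion_mx)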

P.S. Assessing correlation between categorical variables is a difficult process and often requires a good understanding of the domain/business data.

    The Python library pandas has a crosstab function that builds exactly the contingency tables you need. Try it; most likely its implementation is faster than yours, since the heavy lifting is done in compiled code.
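    For instance, a single crosstab call builds the whole 2 Γ— n_categories table at once, replacing the inner loop over categories. A sketch, assuming data, labels and cramers_stat from the question:

        import pandas as pd

        # Sketch: one crosstab call per feature instead of the manual inner loop
        for column in data.columns:
            # rows are the target values, columns are the categories; NaNs are dropped
            confusion_matrix = pd.crosstab(labels[0], data[column])
            result = cramers_stat(confusion_matrix.values)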