I receive data with 40 categorical features as input. The data contains null values. The number of categories in each feature is not known in advance, and the categories are strings. The task is to compute the correlation of each feature with the binary target variable using Cramér's V coefficient, which takes a contingency table as input. I compute it as follows:
```python
import numpy as np

# Computed correlation values for the features
categorical_corrs = list()
for column in data.columns:
    # For each feature, get the list of unique values,
    # excluding missing cells
    categories = data[column].dropna().unique()
    confusion_matrix = [[], []]
    for category in categories:
        # For each category, count the occurrences for target values 0 and 1
        confusion_matrix[0].append(
            len(data.loc[(labels[0] == 0) & (data[column] == category), column])
        )
        confusion_matrix[1].append(
            len(data.loc[(labels[0] == 1) & (data[column] == category), column])
        )
    result = cramers_stat(np.array(confusion_matrix))
    # Check for exceptional cases
    if result == -1:
        print(column, categories, confusion_matrix)
    categorical_corrs.append(result)
```

Each feature has 40,000 entries (including missing values). Running the code above takes quite a long time. Is there a more efficient way to compute the contingency table?
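For reference, `cramers_stat` computes Cramér's V from the chi-squared statistic and returns -1 as a sentinel for degenerate tables. A minimal sketch of what I mean, using `scipy.stats.chi2_contingency` (the exact guard conditions here are illustrative):

```python
import numpy as np
from scipy.stats import chi2_contingency

def cramers_stat(confusion_matrix):
    """Cramér's V for a contingency table (sketch)."""
    n = confusion_matrix.sum()
    # Degenerate tables (empty, fewer than 2 rows or columns, or with an
    # all-zero row/column) cannot be tested; signal them with -1
    if (n == 0 or min(confusion_matrix.shape) < 2
            or (confusion_matrix.sum(axis=0) == 0).any()
            or (confusion_matrix.sum(axis=1) == 0).any()):
        return -1
    chi2 = chi2_contingency(confusion_matrix)[0]
    return np.sqrt(chi2 / (n * (min(confusion_matrix.shape) - 1)))
```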
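I suspect the per-category `.loc` scans are the bottleneck: the whole frame is filtered once for every category of every feature. If I understand `pd.crosstab` correctly, the same 2 × n_categories table can be built in a single vectorized call; a sketch, assuming `labels[0]` is index-aligned with `data` (NaN keys are excluded by default, matching the `dropna()` above):

```python
import pandas as pd

# One vectorized call instead of one full-frame scan per category:
# rows are the target values, columns are the feature's categories
table = pd.crosstab(labels[0], data[column])
result = cramers_stat(table.values)
```

Would this be the right direction, or is there a faster way still?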
P.S. The data can be downloaded from here (the "small" dataset).
Comments:
What is `labels[0]` equal to? – MaxU
The labels are -1 and 1, and in your code they are 0 and 1? – MaxU