Problem with the correlation coefficient: pd.corr()

Question

I'm trying to compute the correlation coefficient for each pair of columns in a DataFrame.

    data = pd.read_csv('data.txt', sep=" ", index_col="Id")
    print(data)

         505    506   507  \
    Id
    0    NaN    NaN   NaN
    1   37.2  107.0  69.0
    2    NaN  130.0  72.0

    for i in range(0, 3):
        for j in range(0, 3):
            if(j > i):
                a = data[data.columns[i:i+1]]
                b = data[data.columns[j:j+1]]
                r = a.corr(b)

This raises an error:

    ValueError                                Traceback (most recent call last)
    <ipython-input-31-d85ca44785ee> in <module>()
          5             b = data[data.columns[j:j+1]]
          6
    ----> 7             r = a.corr(b)

    ~\Anaconda3\envs\ML\lib\site-packages\pandas\core\frame.py in corr(self, method, min_periods)
       5487         mat = numeric_df.values
       5488
    -> 5489         if method == 'pearson':
       5490             correl = libalgos.nancorr(_ensure_float64(mat), minp=min_periods)
       5491         elif method == 'spearman':

    ~\Anaconda3\envs\ML\lib\site-packages\pandas\core\generic.py in __nonzero__(self)
       1119         raise ValueError("The truth value of a {0} is ambiguous. "
       1120                          "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
    -> 1121                          .format(self.__class__.__name__))
       1122
       1123     __bool__ = __nonzero__

    ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

I assume the problem is in the line a = data[data.columns[i:i+1]] or r = a.corr(b), for example:

    a = df[df.columns[0:1]]
    b = df[df.columns[1:2]]
    r = a.corr(b)

Answer 1 · 2018-06-28T08:29:57

    In [62]: data
    Out[62]:
         505    506   507
    Id
    0    NaN    NaN   NaN
    1   37.2  107.0  69.0
    2    NaN  130.0  72.0

    In [63]: data.corr()
    Out[63]:
         505  506  507
    505  NaN  NaN  NaN
    506  NaN  1.0  1.0
    507  NaN  1.0  1.0

Here is an example with more realistic data:

    In [85]: df = pd.DataFrame(np.random.rand(10,5), columns=list('abcde'))

    In [86]: df.iloc[::3, [1,4]] = np.nan

    In [87]: df.iloc[1::4, [2,3]] = np.nan

    In [88]: df
    Out[88]:
              a         b         c         d         e
    0  0.292516       NaN  0.488364  0.235351       NaN
    1  0.150342  0.497728       NaN       NaN  0.498478
    2  0.936061  0.533680  0.488616  0.069263  0.306257
    3  0.728724       NaN  0.841414  0.026519       NaN
    4  0.970898  0.531654  0.508176  0.890823  0.608585
    5  0.748113  0.662562       NaN       NaN  0.877368
    6  0.900048       NaN  0.781662  0.799514       NaN
    7  0.067932  0.074228  0.678235  0.476592  0.453969
    8  0.426238  0.986512  0.865430  0.139393  0.352072
    9  0.440932       NaN       NaN       NaN       NaN

    In [89]: df.corr()
    Out[89]:
              a         b         c         d         e
    a  1.000000  0.353552 -0.130399  0.265960  0.213295
    b  0.353552  1.000000  0.431961 -0.367241  0.004186
    c -0.130399  0.431961  1.000000 -0.167799 -0.304030
    d  0.265960 -0.367241 -0.167799  1.000000  0.997583
    e  0.213295  0.004186 -0.304030  0.997583  1.000000
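As the traceback in the question shows, DataFrame.corr has the signature corr(self, method, min_periods), so in a.corr(b) the DataFrame b is taken as method and then compared with 'pearson', which is what raises the ValueError. If you still want the coefficients one pair at a time rather than the whole matrix, one possible approach is to select each column as a Series and call Series.corr, which does accept the other Series as its first argument. The following is only a sketch: the frame is rebuilt by hand from the values printed in the question (the string column labels are an assumption), and itertools.combinations replaces the nested loops.

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Hand-built copy of the question's data; the string column labels are assumed.
data = pd.DataFrame(
    {
        "505": [np.nan, 37.2, np.nan],
        "506": [np.nan, 107.0, 130.0],
        "507": [np.nan, 69.0, 72.0],
    },
    index=pd.Index([0, 1, 2], name="Id"),
)

pairwise = {}
for left, right in combinations(data.columns, 2):
    # data[left] is a Series (not a one-column DataFrame),
    # so Series.corr(other) is the method being called here.
    pairwise[(left, right)] = data[left].corr(data[right])

for pair, r in pairwise.items():
    print(pair, r)
```

Pairs that involve column 505 come out as NaN because that column has only one non-missing value, which matches the NaN entries in data.corr() above.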
