I have a dataset of 374 rows × 31 columns. The first column is the date, and the remaining columns are the stock prices of 30 companies. I need to apply principal component analysis (PCA). For this, I wrote the following code:

    import numpy as np
    import pandas as pd
    from sklearn.decomposition import PCA

    Location1 = r'C:\Users\...\close_prices.csv'
    df = pd.read_csv(Location1)

    X = df.drop('date', axis=1)
    pca = PCA(n_components=10)
    pca.fit(X)
    print(pca.explained_variance_ratio_)
    # the first component explains the largest share of the variation in the features
    # (the prices of the 30 companies)

    # now I apply the transformation to the original data
    X1 = pca.transform(X)
    X1.shape    # (374, 10)

    # I need to take the first component => I take (374, 1)
    X11 = X1[:, 0]
    X11.shape   # (374,)
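To see how much of the total variance the leading components capture, the fitted pca object can also be inspected with a cumulative sum; this is just a minimal sketch reusing the objects defined above:

    # minimal sketch: cumulative share of variance explained by the first k components
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    print(cumulative)   # the first value shows how dominant the first principal component is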

The error occurs when I try to calculate the Pearson correlation coefficient:

    df2 = pd.read_csv('djia_index.csv')
    X2 = df2.drop('date', axis=1)
    X2.shape   # (374, 1)

    from numpy import corrcoef
    corr1 = corrcoef(X2, X11)

    ValueError: all the input array dimensions except for the concatenation axis must match exactly

Why do the dimensions not match, and how can I fix this?

    3 answers

    It's easier and faster to show an example than to describe it in words:

        In [62]: A = np.arange(25).reshape(5,5)

        In [63]: A
        Out[63]:
        array([[ 0,  1,  2,  3,  4],
               [ 5,  6,  7,  8,  9],
               [10, 11, 12, 13, 14],
               [15, 16, 17, 18, 19],
               [20, 21, 22, 23, 24]])

        In [64]: A[:, 0]
        Out[64]: array([ 0,  5, 10, 15, 20])

        In [65]: A[:, 0].shape
        Out[65]: (5,)

        In [66]: A[:, [0]]
        Out[66]:
        array([[ 0],
               [ 5],
               [10],
               [15],
               [20]])

        In [67]: A[:, [0]].shape
        Out[67]: (5, 1)

    In my opinion, the problem with np.corrcoef() arises when it calculates the covariance matrix via np.cov(): for a matrix consisting of a single column, np.cov() always seems to produce a matrix filled entirely with nan:

        In [149]: x = np.random.randint(0, 10, (5,1))

        In [150]: x
        Out[150]:
        array([[4],
               [7],
               [3],
               [0],
               [0]])

        In [151]: np.cov(x)
        c:\envs\py35\lib\site-packages\numpy\lib\function_base.py:2487: RuntimeWarning: Degrees of freedom <= 0 for slice
          warnings.warn("Degrees of freedom <= 0 for slice", RuntimeWarning)
        Out[151]:
        array([[ nan,  nan,  nan,  nan,  nan],
               [ nan,  nan,  nan,  nan,  nan],
               [ nan,  nan,  nan,  nan,  nan],
               [ nan,  nan,  nan,  nan,  nan],
               [ nan,  nan,  nan,  nan,  nan]])

        In [152]: x = np.random.randint(0, 10, (5,2))

        In [153]: x
        Out[153]:
        array([[7, 0],
               [0, 8],
               [4, 2],
               [1, 5],
               [7, 1]])

        In [154]: np.cov(x)
        Out[154]:
        array([[ 24.5, -28. ,   7. , -14. ,  21. ],
               [-28. ,  32. ,  -8. ,  16. , -24. ],
               [  7. ,  -8. ,   2. ,  -4. ,   6. ],
               [-14. ,  16. ,  -4. ,   8. , -12. ],
               [ 21. , -24. ,   6. , -12. ,  18. ]])
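    A likely explanation for this behaviour (my own reading of the NumPy defaults, not stated in the answer above): np.cov() uses rowvar=True by default, i.e. it treats every row as a separate variable, so a (5, 1) array looks like 5 variables with only one observation each and the degrees of freedom drop to zero. A minimal sketch:

        # sketch: np.cov treats rows as variables by default (rowvar=True)
        import numpy as np

        x = np.array([[4], [7], [3], [0], [0]])   # shape (5, 1): 5 "variables", 1 observation each
        print(np.cov(x).shape)                    # (5, 5), all nan (no degrees of freedom left)

        # telling NumPy that the columns are the variables gives an ordinary variance instead
        print(np.cov(x, rowvar=False))            # 8.7 - the sample variance of the single column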
    • So it turns out I need to combine these two single-column matrices (X2 and X11) into one matrix with two columns (see the sketch after these comments)? - user21
    • Like array([[7, 0], [0, 8], [4, 2], [1, 5], [7, 1]])? - user21
    • @user214410, you can try... I haven't worked with the Pearson correlation coefficient before, so I don't quite understand what the result should be... - MaxU
    • I will try and write back with the result. Thanks. - user21
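    One possible way to do what the comments discuss (combining the two single-column matrices and then correlating their columns) is sketched below. The names X2 and X11 are taken from the question, and corrcoef's rowvar=False flag is used so that columns, not rows, are treated as variables:

        # sketch: stack the two (374, 1) columns side by side and correlate the columns
        import numpy as np

        combined = np.column_stack((np.asarray(X2), X11))   # shape (374, 2)
        corr = np.corrcoef(combined, rowvar=False)           # 2x2 correlation matrix
        print(corr[0, 1])                                    # Pearson coefficient between X2 and X11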

    The error noted above was the result of a mismatch between the dimensions of X2 (374, 1) and X11 (374,). When I added extra brackets to the command X11 = X1[:,0], I got the required dimension, which exactly matches the dimension of X2 (374, 1):

        X11 = X1[:, [0]]
        X11.shape   # (374, 1)

    Then I tried again to calculate the Pearson correlation coefficient:

        corr1 = corrcoef(X2, X11)
        C:\Users\...\Anaconda3\lib\site-packages\numpy\lib\function_base.py:2079: RuntimeWarning: Degrees of freedom <= 0 for slice
          warnings.warn("Degrees of freedom <= 0 for slice", RuntimeWarning)

    But the message above appeared. What does it mean? Why does adding the extra brackets make the missing column dimension appear? Moreover, the whole matrix of correlation coefficients consists of nan!

        corr1
        Out[48]:
        array([[ nan,  nan,  nan, ...,  nan,  nan,  nan],
               [ nan,  nan,  nan, ...,  nan,  nan,  nan],
               [ nan,  nan,  nan, ...,  nan,  nan,  nan],
               ...,
               [ nan,  nan,  nan, ...,  nan,  nan,  nan],
               [ nan,  nan,  nan, ...,  nan,  nan,  nan],
               [ nan,  nan,  nan, ...,  nan,  nan,  nan]])
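    The size of this nan matrix can be reproduced on a small example. The explanation below is my own reading of how np.corrcoef handles its two arguments, not something stated in the answers: with the default rowvar=True, the two (N, 1) inputs are stacked into a (2N, 1) array, every row becomes a "variable" with a single observation, and the result is a 2N × 2N matrix of nan.

        # sketch: why two (N, 1) inputs produce a 2N x 2N matrix of nan
        import numpy as np

        a = np.arange(5).reshape(5, 1)        # stands in for X2,  shape (5, 1)
        b = np.arange(5, 10).reshape(5, 1)    # stands in for X11, shape (5, 1)

        # with the default rowvar=True the arrays are stacked into shape (10, 1):
        # 10 "variables" with one observation each, hence a 10 x 10 matrix of nan
        print(np.corrcoef(a, b).shape)        # (10, 10), also emits the warning seen above

        # treating the columns as the variables gives the expected 2 x 2 matrix
        print(np.corrcoef(a, b, rowvar=False))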

    The nan values throughout the matrix appear, it seems, because of the structure of the data in the variable X11: it contains negative values. And the Pearson correlation coefficient is calculated by the formula

        r = Σ(x_i − x̄)(y_i − ȳ) / √( Σ(x_i − x̄)² · Σ(y_i − ȳ)² )

    But where did the negative values come from at all?
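    For reference, the formula can be checked directly against NumPy. This is a small sketch of my own (the data below is made up) showing how the coefficient is computed from the deviations about the means:

        # sketch: Pearson correlation computed directly from the formula vs. np.corrcoef
        import numpy as np

        x = np.array([1.0, -2.0, 3.0, -4.0, 5.0])   # made-up data, negative values included
        y = np.array([2.0, -1.0, 4.0, -3.0, 6.0])

        dx, dy = x - x.mean(), y - y.mean()
        r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

        print(r_manual)                  # same value as...
        print(np.corrcoef(x, y)[0, 1])   # ...the off-diagonal entry of np.corrcoef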

    • I have not tried combining the two vectors into one matrix (correct me if I am wrong about the term), because I do not know how to call the correlation function after that. But I experimented and came to the conclusion that the extra square brackets inside the vector interfere (I do not know what they mean). X = [0,0,1,1,0], Y = [1,1,0,1,1] gives the result array([[1., -0.61237244], [-0.61237244, 1.]]), while Y = [[1], [1], [0], [1], [1]], X = [[0], [0], [1], [1], [0]] gives nan: array([[nan, nan, nan, nan, nan, ...]]) - user21

    I do not know how to explain how I got the result. If someone knows the reason for, or the meaning of, a data construct like this:

     np.array([[],[],...,[]]) 

    Please write about this in the comments.
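    For what it's worth, here is my own reading of that construct (not from the answer itself): a list of one-element lists creates a two-dimensional column array rather than a 1-D vector, and with np.corrcoef's default rowvar=True every row of that column is then treated as a separate variable, which is what produces the nan matrix in the comment above.

        # sketch: a list of one-element lists is a 2-D column array, not a 1-D vector
        import numpy as np

        v1 = np.array([0, 0, 1, 1, 0])               # shape (5,)  -> one variable, five observations
        v2 = np.array([[0], [0], [1], [1], [0]])     # shape (5, 1) -> five "rows", one observation each

        print(v1.shape, v2.shape)                    # (5,) (5, 1)
        print(np.corrcoef(v1, [1, 1, 0, 1, 1]))      # ordinary 2x2 correlation matrix
        print(np.corrcoef(v2, [[1], [1], [0], [1], [1]]).shape)   # (10, 10), all nan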

    The solution was this: I simply transposed both vectors (or rather, column matrices):

        X11 = X11.T
        X2 = X2.T
        corr1 = corrcoef(X11, X2)
        corr1
        array([[ 1.        ,  0.90965222],
               [ 0.90965222,  1.        ]])
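    For comparison, an equivalent approach (my own addition, not part of the original answer): the transpose works because np.corrcoef treats rows as variables by default, so passing the column matrices unchanged but with rowvar=False, or flattening them to 1-D first, should give the same 2 × 2 matrix.

        # sketch of two equivalent alternatives to transposing
        import numpy as np

        # assuming X2 is the (374, 1) DataFrame and X11 the (374, 1) component, as in the question
        a = np.asarray(X2).ravel()    # flatten to shape (374,)
        b = np.asarray(X11).ravel()

        print(np.corrcoef(a, b))                      # 2x2 matrix, same off-diagonal value
        print(np.corrcoef(X2, X11, rowvar=False))     # columns treated as variables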