I have a correlation matrix and need to compute the row-wise maximum while ignoring certain columns. To do this, I appended a column of labels to the matrix and use it as a filter for those columns.

The task is to compute the maximum of each row, excluding the entries whose "CLASS" value is 1. Since the matrix is square, the "CLASS" column can also be read as a row, that is, as one label per column.

Below is my implementation. It works correctly, but it is very slow on large tables. Please help me find a fast, vectorized way to do this in pandas. With this approach I do not have enough memory to work with large DataFrames.

    import pandas as pd

    df = pd.read_csv('https://st.storeland.ru/9/2418/212/demo10.csv', sep=';', index_col=0)

    def noise_porog(Series_cor):
        noise_list = list()
        priznaki = list(df['CLASS'])
        CLASS = 1
        # Skip the last element, which is the "CLASS" value itself.
        for idx, crl in enumerate(Series_cor[:-1]):
            if priznaki[idx] != CLASS:
                noise_list.append(crl)  # list of noise values
        return pd.Series(max(noise_list))

    df['max_01'] = df.apply(noise_porog, axis=1)

Here is a screenshot for clarity. It highlights the first iteration, where I select the required columns, take their maximum, and write it to the new column "max_01".

Here is a larger matrix:

    df = pd.read_csv('https://st.storeland.ru/6/2418/067/demo.csv', sep=',', index_col=-1)
    df = df.loc[:, ~df.columns.str.contains('unnamed', case=False)].T
    df

For convenience, I used the values from "CLASS" as the column names.

    1 answer

    Try this:

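        import pandas as pd

        df = pd.read_csv('https://st.storeland.ru/9/2418/212/demo10.csv', sep=';', index_col=0)
        # True for every row (and, by symmetry, every column) whose CLASS is not 1
        mask = (df['CLASS'] != 1).to_numpy()
        # Keep only the unmasked data columns and take the row-wise maximum
        df['max_01'] = df.loc[:, df.columns.drop('CLASS')[mask]].max(axis=1)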

    Result:

        In [63]: df
        Out[63]:
                   2         3         4         5         6         7         8         9        10        11  CLASS    max_01
        2   0.000000  0.107562  0.202508  0.082104  0.099218  0.166363 -0.138255 -0.030342  0.040025  0.236721      0  0.236721
        3   0.107562  0.000000  0.069416  0.213758  0.167404  0.137428 -0.048976  0.056551  0.039009  0.270039      1  0.270039
        4   0.202508  0.069416  0.000000  0.056688  0.302428  0.090878  0.032381  0.120947  0.414783  0.117498     -1  0.414783
        5   0.082104  0.213758  0.056688  0.000000  0.247694  0.171819 -0.028765  0.157801  0.184200  0.465918      0  0.465918
        6   0.099218  0.167404  0.302428  0.247694  0.000000 -0.096407 -0.184963  0.198542  0.222838  0.190360     -1  0.302428
        7   0.166363  0.137428  0.090878  0.171819 -0.096407  0.000000  0.056020  0.144441  0.105880  0.119886     -1  0.171819
        8  -0.138255 -0.048976  0.032381 -0.028765 -0.184963  0.056020  0.000000 -0.051127  0.027271 -0.050593      0  0.056020
        9  -0.030342  0.056551  0.120947  0.157801  0.198542  0.144441 -0.051127  0.000000  0.212784 -0.019487      1  0.212784
        10  0.040025  0.039009  0.414783  0.184200  0.222838  0.105880  0.027271  0.212784  0.000000  0.146514     -1  0.414783
        11  0.236721  0.270039  0.117498  0.465918  0.190360  0.119886 -0.050593 -0.019487  0.146514  0.000000     -1  0.465918

    P.S. It is better not to add CLASS as a DataFrame column. Instead, store it separately as a numpy.ndarray or a pandas.Series.
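The suggestion above, keeping the labels apart from the matrix, could look roughly like the sketch below. The 4x4 matrix, the `labels` array, and the variable names are made up for illustration and are not taken from the demo files. The second variant relies on the `where` and `initial` arguments of `np.max`, which should avoid building a copy of the selected columns; that may help if memory is tight, though this has not been measured on the asker's data.

```python
import numpy as np
import pandas as pd

# Hypothetical 4x4 symmetric matrix standing in for the correlation table.
values = np.array([
    [0.0, 0.1, 0.5, 0.3],
    [0.1, 0.0, 0.2, 0.9],
    [0.5, 0.2, 0.0, 0.4],
    [0.3, 0.9, 0.4, 0.0],
])
corr = pd.DataFrame(values, index=list("abcd"), columns=list("abcd"))

# Labels stored separately from the matrix: one label per column (and per row).
labels = np.array([0, 1, -1, 1])

# Boolean mask over columns: True where the column takes part in the maximum.
keep = labels != 1

# Variant 1: boolean indexing copies the kept columns, then reduces row-wise.
row_max = corr.to_numpy()[:, keep].max(axis=1)

# Variant 2: masked reduction without materialising the column subset.
# `keep` has shape (4,) and broadcasts across the rows of the (4, 4) array.
row_max_masked = np.max(corr.to_numpy(), axis=1, where=keep, initial=-np.inf)

result = pd.Series(row_max, index=corr.index, name="max_01")
print(result)
```

Both variants leave `corr` free of any label column, so nothing has to be dropped before taking the maximum.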

    • Thanks! As I understand it, your solution just finds the maximum, but I need to learn how to exclude some columns from the calculation based on a condition. For example, how do I calculate the maximum over the columns whose index is divisible by 3 without a remainder (columns 3, 6, and 9 meet the condition)? - Mavar
    • @Mavar, by index, do you mean the column's ordinal position or its name? - MaxU
    • The column name. My matrix has several tens of thousands of rows, and I need to compute the maxima based on some condition. - Mavar February
    • Sorry for the confusing explanation. The "CLASS" column that you drop is the mask. I apply it to the matrix and, based on its values, choose which rows to include in the maximum and which to skip. Try replacing the column names with the values from the CLASS column. - Mavar February
    • @Mavar, is there anything else in the answer? - MaxU February
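The two conditions raised in the comments can both be expressed as a boolean mask over the columns. The sketch below is an illustration, not the answerer's code: it uses a made-up 10x10 table with columns named '2' to '11', and it assumes the column names are strings that parse as integers. The second part assumes the CLASS values have replaced the column names, as the asker proposes.

```python
import numpy as np
import pandas as pd

# Hypothetical 10x10 table with columns named '2'..'11', like the demo file.
names = [str(n) for n in range(2, 12)]
df = pd.DataFrame(np.arange(100).reshape(10, 10), index=range(2, 12), columns=names)

# Condition on the column names: keep names divisible by 3 ('3', '6', '9').
by_name = np.asarray(df.columns.astype(int) % 3 == 0)
max_div3 = df.loc[:, by_name].max(axis=1)

# Condition on labels used as column names. Duplicate names are not a problem
# here because the selection is a positional boolean mask, not a label lookup.
labelled = df.copy()
labelled.columns = [0, 1, -1, 0, -1, -1, 0, 1, -1, -1]  # CLASS values from the demo table
by_label = np.asarray(labelled.columns != 1)
max_not_1 = labelled.loc[:, by_label].max(axis=1)

print(max_div3.head())
print(max_not_1.head())
```

Any other rule works the same way, as long as it produces one boolean per column.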