I'm facing a problem: there is an array of numbers, for example

[603, 21, 25, 23, 2, 57, 19, 148, 160, 182, 501, 60, 26, 21, 25] 

I need an algorithm that somehow selects numbers for a new array so that all outliers are removed, and then computes the average of the remaining numbers (the formula can be in any form, but preferably Python or Excel).

My attempts:

  1. First I added a condition: if the next number deviates from the current base by more than 10%, skip it, otherwise add it to the new array. The problem with the example above is that the first element, 603, becomes the base, and no other number in the array satisfies the condition, so the resulting "average" of the array is simply 603.

  2. To get out of this situation I made it more complicated and built "categories": two loops with a condition that checks whether a number fits the current category; if not, a new category is created. The result looked something like [[603], [21, 23, 19, 21], [25, 26, 25], [57, 60], [148, 160, 182], [501]]; I then sorted the categories by length and picked the longest one (a rough sketch of this idea is shown below).

In the end neither option worked. Maybe the standard deviation or the geometric mean could be used somehow, but I have no ideas yet.
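For reference, here is a rough sketch of what attempt 2 (the "categories" grouping) looked like. The 10% tolerance and the helper name group_by_tolerance are my own illustration; the exact groups depend on how a number is compared with a category:

 def group_by_tolerance(numbers, tolerance=0.10):
     """Group numbers into 'categories': a number joins a group if it is
     within `tolerance` (relative) of that group's first element."""
     groups = []
     for x in numbers:
         for g in groups:
             if abs(x - g[0]) <= tolerance * g[0]:
                 g.append(x)
                 break
         else:
             groups.append([x])
     return groups

 data = [603, 21, 25, 23, 2, 57, 19, 148, 160, 182, 501, 60, 26, 21, 25]
 groups = group_by_tolerance(data)
 longest = max(groups, key=len)          # pick the biggest "category"
 print(longest, sum(longest) / len(longest))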

    Show me an example of a good result - MBo
  • Well, it seems like a standard approach: compute the mean, compute the confidence interval, find the maximum deviation; if it exceeds the confidence interval, discard the most deviating value and repeat. P.S. One thing I don't understand: at the beginning it is just an array, but then positional dependence appears in method 2... - Akina
  • @MBo a good example would be: for the array I showed above the average should be about 22, i.e. the numbers that are much greater or much smaller than ~20 should be discarded, otherwise the arithmetic mean is badly skewed - 100ROZH

3 answers

A solution using the Scikit-Learn module - similar methods are often used in machine learning and data-processing tasks. The input data set must be two-dimensional, so a simple list has to be reshaped into a table with one column:

 import numpy as np
 from sklearn.covariance import EmpiricalCovariance, MinCovDet

 a = np.array([603, 21, 25, 23, 2, 57, 19, 148, 160, 182, 501, 60, 26, 21, 25])

 # reshape 1D array to 2D matrix
 X = a.reshape(-1, 1)

The result is a table with 15 rows and one column:

 In [70]: X.shape
 Out[70]: (15, 1)

Consider the Minimum Covariance Determinant (MCD):

 robust_cov = MinCovDet().fit(X) 

Find anomalies using MinCovDet().mahalanobis():

 In [73]: a[robust_cov.mahalanobis(X) > 1]
 Out[73]: array([603, 2, 57, 148, 160, 182, 501, 60])

"good" data:

 In [74]: a[robust_cov.mahalanobis(X) <= 1]
 Out[74]: array([21, 25, 23, 19, 26, 21, 25])
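Since the question asks for the average of the remaining numbers, you can then simply take the mean of this filtered array (my addition, not part of the original session):

 a[robust_cov.mahalanobis(X) <= 1].mean()  # ≈ 22.86, close to the expected ~22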

UPDATE: with strongly correlated data, this method may produce the following error:

 In [267]: robust_cov = MinCovDet().fit(X)
 C:\Users\Max\Anaconda3_5.0\envs\py36\lib\site-packages\sklearn\covariance\robust_covariance.py:677: RuntimeWarning: invalid value encountered in true_divide
   self.dist_ /= correction
 C:\Users\Max\Anaconda3_5.0\envs\py36\lib\site-packages\sklearn\covariance\robust_covariance.py:716: RuntimeWarning: invalid value encountered in less
   mask = self.dist_ < chi2(n_features).isf(0.025)
 C:\Users\Max\Anaconda3_5.0\envs\py36\lib\site-packages\sklearn\covariance\robust_covariance.py:720: RuntimeWarning: Mean of empty slice.
   location_reweighted = data[mask].mean(0)
 C:\Users\Max\Anaconda3_5.0\envs\py36\lib\site-packages\numpy\core\_methods.py:73: RuntimeWarning: invalid value encountered in true_divide
   ret, rcount, out=ret, casting='unsafe', subok=False)
 C:\Users\Max\Anaconda3_5.0\envs\py36\lib\site-packages\numpy\lib\function_base.py:1128: RuntimeWarning: Mean of empty slice.
   avg = a.mean(axis)
 C:\Users\Max\Anaconda3_5.0\envs\py36\lib\site-packages\sklearn\covariance\empirical_covariance_.py:81: RuntimeWarning: Degrees of freedom <= 0 for slice
   covariance = np.cov(XT, bias=1)
 C:\Users\Max\Anaconda3_5.0\envs\py36\lib\site-packages\numpy\lib\function_base.py:3109: RuntimeWarning: divide by zero encountered in double_scalars
   c *= 1. / np.float64(fact)
 C:\Users\Max\Anaconda3_5.0\envs\py36\lib\site-packages\numpy\lib\function_base.py:3109: RuntimeWarning: invalid value encountered in multiply
   c *= 1. / np.float64(fact)
 ... skipped ...
 ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

In this case, you must explicitly specify support_fraction

Example:

 In [285]: robust_cov = MinCovDet(support_fraction=1).fit(X)

 In [286]: a = np.array('16 2 8 3 2 3 3 3 4 12 3 3 3 3 4 3 3 3 3 3 4 3 3 3 3 3 3 3 3 3 6 2 3 3 2 2 3 3 3 4 3 4 3 3 4 3 4 3 3 3 2 2 3 3 3 2'.split()).astype(int)

 In [287]: X = a.reshape(-1, 1)

 In [288]: robust_cov = MinCovDet(support_fraction=1).fit(X)

 In [289]: a[robust_cov.mahalanobis(X) <= 1]
 Out[289]: array([3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3])

 In [290]: a[robust_cov.mahalanobis(X) > 1]
 Out[290]: array([16, 2, 8, 2, 4, 12, 4, 4, 6, 2, 2, 2, 4, 4, 4, 4, 2, 2, 2])
  • But could you clarify what mahalanobis(X) is? I installed the library, everything works up to that point, but it complains that there is no such method at all; if you know, it would be great - 100ROZH
  • @100ROZH, what version of scikit-learn do you have? - MaxU
  • I figured it out: the editor complains, but the code does seem to run, strange. Now, though, reshape breaks everything: if there is a two-digit number next to the normal ones, it comes out like [[3] [16]], and the error is ValueError: Input contains NaN, infinity or a value too large for dtype('float64'). So it turns out I need to get rid of the spaces and then it seems to work, right? - 100ROZH
  • @100ROZH, .reshape(-1, 1) is needed only if the input data is a list / vector. Can you give an example of the data on which the code breaks? - MaxU
  • [16 2 8 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 3 3 3 2] is this a normal display by the way? this is after np.array () it became so, initially a normal list with the same data - 100ROZH

You can use this solution:

 import numpy as np

 def reject_outliers(data, m=2.):
     # keep only values whose deviation from the median is less than
     # m times the median absolute deviation (MAD)
     data = np.array(data)
     d = np.abs(data - np.median(data))
     mdev = np.median(d)
     s = d / mdev if mdev else 0.
     return data[s < m]

Example:

 In [20]: l = [603, 21, 25, 23, 2, 57, 19, 148, 160, 182, 501, 60, 26, 21, 25]

 In [21]: reject_outliers(l)
 Out[21]: array([21, 25, 23, 2, 57, 19, 60, 26, 21, 25])

 In [22]: reject_outliers(l).mean()
 Out[22]: 27.9

or like this (for m = 1):

 In [25]: reject_outliers(l, 1)
 Out[25]: array([21, 25, 23, 19, 26, 21, 25])

 In [26]: reject_outliers(l, 1).mean()
 Out[26]: 22.857142857142858

Graphic representation:

 import pandas as pd
 import matplotlib.pyplot as plt
 import matplotlib

 matplotlib.style.use('ggplot')

 pd.Series(l).plot.kde(grid=True)

(image: KDE plot of the original list l)

 pd.Series(reject_outliers(l)).plot.kde(grid=True) 

(image: KDE plot of reject_outliers(l))

 pd.Series(reject_outliers(l, 1)).plot.kde(grid=True) 

(image: KDE plot of reject_outliers(l, 1))

P.S. I also advise you to look at the anomaly (outlier) detection methods available in the Scikit-Learn module.
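For instance, a minimal sketch with sklearn.ensemble.IsolationForest (my own illustration, not part of this answer; the contamination share is just a guess for this data set and would need tuning):

 import numpy as np
 from sklearn.ensemble import IsolationForest

 a = np.array([603, 21, 25, 23, 2, 57, 19, 148, 160, 182, 501, 60, 26, 21, 25])
 X = a.reshape(-1, 1)                          # sklearn expects 2D input

 # contamination = assumed share of outliers in the data
 iso = IsolationForest(contamination=0.4, random_state=0).fit(X)
 labels = iso.predict(X)                       # 1 = inlier, -1 = outlier

 good = a[labels == 1]
 print(good, good.mean())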

It's not surprising that the "head-on" methods you tried didn't work.

The point is that computing the mean, then the standard deviation, then the confidence intervals only works correctly when your sample is normally distributed. Build a histogram of your array and you will see that the distribution is far from normal. Very roughly it could be considered exponential, but even that would still have to be shown.
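To illustrate (a small sketch of my own, not from the original answer): the naive "discard everything farther than two standard deviations from the mean" rule, which implicitly assumes normality, helps very little here, because the outliers themselves inflate both the mean and the standard deviation:

 import numpy as np

 a = np.array([603, 21, 25, 23, 2, 57, 19, 148, 160, 182, 501, 60, 26, 21, 25])

 mu, sigma = a.mean(), a.std()
 kept = a[np.abs(a - mu) <= 2 * sigma]   # naive filter assuming normality

 print(mu)                               # ≈ 124.9, already pulled up by 603 and 501
 print(kept.mean())                      # ≈ 59.2, still nowhere near the expected ~22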

In any case, I recommend not reinventing the wheel but getting acquainted with the theory. The field is called "anomaly detection" (sometimes "outlier detection"). In data science, similar tasks are also solved as part of cleaning the source data. The methods and tools for such problems are worked out in considerable detail for many real cases.

Note that the methods and approaches differ depending on the applied task - this is extremely important, because there is no single "canonized" method here, and there cannot be one.

There is more than enough literature. If you have questions - ask.

P.S. Since the wheel continues to be reinvented anyway, I will try to make the inventors' lives easier by supplementing my earlier answer. Here is a good, uncomplicated article on how to do this "by the book": http://mycroftbs.ru/grabbs/. Note, though, that its very first lines say it describes only one of a number of methods; if you wish, you can look at the others as well.
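The article describes Grubbs' test; for completeness, here is a rough sketch of an iterative two-sided Grubbs filter. This is my own illustration rather than code from the article, so treat the details (the alpha level, the stopping rule) as assumptions:

 import numpy as np
 from scipy import stats

 def grubbs_filter(values, alpha=0.05):
     """Iteratively drop the single most extreme value while the
     two-sided Grubbs statistic exceeds its critical value."""
     x = np.asarray(values, dtype=float)
     while x.size > 2:
         n = x.size
         mean, std = x.mean(), x.std(ddof=1)
         if std == 0:
             break
         idx = np.argmax(np.abs(x - mean))            # most extreme point
         g = abs(x[idx] - mean) / std                 # Grubbs statistic
         t = stats.t.ppf(1 - alpha / (2 * n), n - 2)  # t critical value
         g_crit = (n - 1) / np.sqrt(n) * np.sqrt(t ** 2 / (n - 2 + t ** 2))
         if g > g_crit:
             x = np.delete(x, idx)                    # remove the outlier and repeat
         else:
             break
     return x

 data = [603, 21, 25, 23, 2, 57, 19, 148, 160, 182, 501, 60, 26, 21, 25]
 cleaned = grubbs_filter(data)
 print(cleaned, cleaned.mean())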

  • +1 for "there is no single 'canonized' method" - jfs