A solution using the scikit-learn module. Similar methods are often used in machine learning and data processing tasks. The input data set must be two-dimensional, so a simple list has to be converted to a table with one column.
import numpy as np
from sklearn.covariance import EmpiricalCovariance, MinCovDet

a = np.array([603, 21, 25, 23, 2, 57, 19, 148, 160, 182, 501, 60, 26, 21, 25])

# reshape 1D array to 2D matrix
X = a.reshape(-1, 1)
The result is a table with 15 rows and one column:
In [70]: X.shape
Out[70]: (15, 1)
Fit a Minimum Covariance Determinant (MCD) estimator:
robust_cov = MinCovDet().fit(X)
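To see why a robust estimator is used here, the sketch below compares `MinCovDet` with the plain `EmpiricalCovariance` that is imported above but otherwise unused. The variable names `empirical` and `robust` and the fixed `random_state=0` are assumptions made for this illustration only.

```python
import numpy as np
from sklearn.covariance import EmpiricalCovariance, MinCovDet

a = np.array([603, 21, 25, 23, 2, 57, 19, 148, 160, 182, 501, 60, 26, 21, 25])
X = a.reshape(-1, 1)

# Classical estimate: every point, including extreme values, contributes equally.
empirical = EmpiricalCovariance().fit(X)

# Robust estimate: based on the subset of points with the smallest spread.
# random_state is fixed only to make the illustration repeatable (assumption).
robust = MinCovDet(random_state=0).fit(X)

print("empirical location:", empirical.location_[0])
print("robust location:   ", robust.location_[0])
print("empirical variance:", empirical.covariance_[0, 0])
print("robust variance:   ", robust.covariance_[0, 0])
```

The large values (603, 501) pull the classical mean and variance upward, while the robust estimates stay close to the bulk of the data, which is what makes the distance-based filtering meaningful.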
Find anomalies using MinCovDet().mahalanobis(), which returns the squared Mahalanobis distance of each observation:
In [73]: a[robust_cov.mahalanobis(X) > 1]
Out[73]: array([603, 2, 57, 148, 160, 182, 501, 60])
The "good" data:
In [74]: a[robust_cov.mahalanobis(X) <= 1]
Out[74]: array([21, 25, 23, 19, 26, 21, 25])
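The cutoff of 1 used above is a fairly strict choice. One possible alternative, shown here only as a sketch, is to take the cutoff from a chi-square quantile, which assumes the "good" values are roughly Gaussian. The 0.975 quantile, the names `dist` and `cutoff`, the fixed `random_state=0` and the extra dependency on SciPy are all assumptions of this example, not part of the original solution.

```python
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet

a = np.array([603, 21, 25, 23, 2, 57, 19, 148, 160, 182, 501, 60, 26, 21, 25])
X = a.reshape(-1, 1)

robust_cov = MinCovDet(random_state=0).fit(X)

# mahalanobis() returns squared distances; for Gaussian data with one feature
# they approximately follow a chi-square distribution with 1 degree of freedom.
dist = robust_cov.mahalanobis(X)
cutoff = chi2.ppf(0.975, df=X.shape[1])  # assumed quantile, tune as needed

outliers = a[dist > cutoff]
inliers = a[dist <= cutoff]
print("cutoff:", cutoff)
print("outliers:", outliers)
print("inliers:", inliers)
```

A higher quantile flags fewer points as anomalies and a lower one flags more, so the value should be chosen to match how aggressive the filtering needs to be.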
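UPDATE: with strongly correlated data (for example, when most of the values are identical, as in the example further down), this method may fail with the following warnings and error: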
In [267]: robust_cov = MinCovDet().fit(X)
C:\Users\Max\Anaconda3_5.0\envs\py36\lib\site-packages\sklearn\covariance\robust_covariance.py:677: RuntimeWarning: invalid value encountered in true_divide
  self.dist_ /= correction
C:\Users\Max\Anaconda3_5.0\envs\py36\lib\site-packages\sklearn\covariance\robust_covariance.py:716: RuntimeWarning: invalid value encountered in less
  mask = self.dist_ < chi2(n_features).isf(0.025)
C:\Users\Max\Anaconda3_5.0\envs\py36\lib\site-packages\sklearn\covariance\robust_covariance.py:720: RuntimeWarning: Mean of empty slice.
  location_reweighted = data[mask].mean(0)
C:\Users\Max\Anaconda3_5.0\envs\py36\lib\site-packages\numpy\core\_methods.py:73: RuntimeWarning: invalid value encountered in true_divide
  ret, rcount, out=ret, casting='unsafe', subok=False)
C:\Users\Max\Anaconda3_5.0\envs\py36\lib\site-packages\numpy\lib\function_base.py:1128: RuntimeWarning: Mean of empty slice.
  avg = a.mean(axis)
C:\Users\Max\Anaconda3_5.0\envs\py36\lib\site-packages\sklearn\covariance\empirical_covariance_.py:81: RuntimeWarning: Degrees of freedom <= 0 for slice
  covariance = np.cov(XT, bias=1)
C:\Users\Max\Anaconda3_5.0\envs\py36\lib\site-packages\numpy\lib\function_base.py:3109: RuntimeWarning: divide by zero encountered in double_scalars
  c *= 1. / np.float64(fact)
C:\Users\Max\Anaconda3_5.0\envs\py36\lib\site-packages\numpy\lib\function_base.py:3109: RuntimeWarning: invalid value encountered in multiply
  c *= 1. / np.float64(fact)
... skipped ...
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
In this case, you must explicitly specify the support_fraction parameter.
Example:
In [285]: robust_cov = MinCovDet(support_fraction=1).fit(X)

In [286]: a = np.array('16 2 8 3 2 3 3 3 4 12 3 3 3 3 4 3 3 3 3 3 4 3 3 3 3 3 3 3 3 3 6 2 3 3 2 2 3 3 3 4 3 4 3 3 4 3 4 3 3 3 2 2 3 3 3 2'.split()).astype(int)

In [287]: X = a.reshape(-1, 1)

In [288]: robust_cov = MinCovDet(support_fraction=1).fit(X)

In [289]: a[robust_cov.mahalanobis(X) <= 1]
Out[289]: array([3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3])

In [290]: a[robust_cov.mahalanobis(X) > 1]
Out[290]: array([16, 2, 8, 2, 4, 12, 4, 4, 6, 2, 2, 2, 4, 4, 4, 4, 2, 2, 2])
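If it is not known in advance whether the data will trigger the error, the two variants can be combined. The sketch below tries the default fit first and retries with `support_fraction=1.0` only when the fit fails or yields a degenerate covariance. The helper name `fit_mcd_with_fallback`, the degeneracy check and the fixed `random_state` are hypothetical choices for this example, not part of scikit-learn.

```python
import numpy as np
from sklearn.covariance import MinCovDet


def fit_mcd_with_fallback(X, random_state=0):
    """Hypothetical helper: default MCD fit, retried with support_fraction=1.0."""
    try:
        est = MinCovDet(random_state=random_state).fit(X)
        # Treat a non-finite or all-zero covariance as a failed fit (assumption).
        degenerate = (not np.all(np.isfinite(est.covariance_))
                      or np.allclose(est.covariance_, 0))
    except ValueError:
        # Raised when the selected subset of points has zero variance.
        degenerate = True
    if degenerate:
        # Use all observations as the support instead of roughly half of them.
        est = MinCovDet(support_fraction=1.0, random_state=random_state).fit(X)
    return est


a = np.array('16 2 8 3 2 3 3 3 4 12 3 3 3 3 4 3 3 3 3 3 4 3 3 3 3 3 3 3 '
             '3 3 6 2 3 3 2 2 3 3 3 4 3 4 3 3 4 3 4 3 3 3 2 2 3 3 3 2'.split()).astype(int)
X = a.reshape(-1, 1)

robust_cov = fit_mcd_with_fallback(X)
dist = robust_cov.mahalanobis(X)
print("good:", a[dist <= 1])
print("anomalies:", a[dist > 1])
```

Note that with `support_fraction=1.0` the initial estimate uses every observation, so it is less resistant to outliers than the default. That is the reason the sketch falls back to it only when the default fit is unusable.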