I cannot understand how to use the indices from sklearn KFold

Question

I cannot get the results of a KFold split.

 kf = KFold(n_splits=2, shuffle=True, random_state=1)
 train1, test1, train2, test2 = kf.split(X_data)

This gives an error message:

 ValueError: need more than 2 values to unpack

Can anyone who knows explain this? Thanks.

What is the shape of X_data (print(X_data.shape))?
When I print them in a loop, everything works, but how do I get them as arrays?
 for train_index, test_index in kf.split(X_data):
     print("TRAIN:", train_index, "TEST:", test_index)

Answer 1 · 2017-08-17T12:32:18

kf.split(X_data) yields n_splits tuples, one per fold, where each tuple consists of the train and test index vectors. Strictly speaking it returns a generator, so wrap it in list(...) if you need an actual list of tuples.

That is, for n_splits=2, list(kf.split(X_data)) gives:

 [(train0, test0), (train1, test1)]

for n_splits=3:

 [(train0, test0), (train1, test1), (train2, test2)]

where each train* / test* is a vector (a 1D NumPy array) of row indices.
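The structure described above can be checked directly. The following is a minimal sketch, assuming scikit-learn and NumPy are installed; the variable names `folds` and `unpack_error` are made up for illustration. It also shows why unpacking into four flat names, as in the question, fails.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(24).reshape(-1, 3)  # 8 samples, 3 features
kf = KFold(n_splits=2, shuffle=True, random_state=1)

# split() is lazy: it yields one (train, test) pair per fold.
# list(...) materializes all pairs at once.
folds = list(kf.split(X))

print(len(folds))  # one entry per fold, i.e. n_splits
for train_idx, test_idx in folds:
    # each element is a 1D integer array of row positions
    print(train_idx.shape, test_idx.shape)

# Unpacking into four flat names fails: there are only two top-level
# items (the two tuples), not four.
unpack_error = None
try:
    train1, test1, train2, test2 = kf.split(X)
except ValueError as exc:
    unpack_error = exc
    print("ValueError:", exc)
```

The exact wording of the ValueError depends on the Python version, but the cause is the same: the number of names on the left must match the number of folds, with each fold unpacked as a pair.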

Here is an example:

Setup:

 In [287]: X = np.arange(24).reshape(-1,3)

 In [288]: X
 Out[288]:
 array([[ 0,  1,  2],
        [ 3,  4,  5],
        [ 6,  7,  8],
        [ 9, 10, 11],
        [12, 13, 14],
        [15, 16, 17],
        [18, 19, 20],
        [21, 22, 23]])

 In [289]: kf = KFold(n_splits=2, shuffle=True, random_state=1)

Naive option, which works only for n_splits=2:

 In [302]: (train1, test1), (train2, test2) = kf.split(X)

 In [303]: X[train1]
 Out[303]:
 array([[ 0,  1,  2],
        [ 9, 10, 11],
        [12, 13, 14],
        [15, 16, 17]])

 In [304]: X[test1]
 Out[304]:
 array([[ 3,  4,  5],
        [ 6,  7,  8],
        [18, 19, 20],
        [21, 22, 23]])

but a more flexible approach is better:

 In [290]: train, test = zip(*kf.split(X))

 In [291]: train
 Out[291]: (array([0, 3, 4, 5]), array([1, 2, 6, 7]))

 In [292]: test
 Out[292]: (array([1, 2, 6, 7]), array([0, 3, 4, 5]))

 In [293]: X[train[0]]
 Out[293]:
 array([[ 0,  1,  2],
        [ 9, 10, 11],
        [12, 13, 14],
        [15, 16, 17]])

 In [294]: X[test[0]]
 Out[294]:
 array([[ 3,  4,  5],
        [ 6,  7,  8],
        [18, 19, 20],
        [21, 22, 23]])

 In [295]: X[train[1]]
 Out[295]:
 array([[ 3,  4,  5],
        [ 6,  7,  8],
        [18, 19, 20],
        [21, 22, 23]])

 In [296]: X[test[1]]
 Out[296]:
 array([[ 0,  1,  2],
        [ 9, 10, 11],
        [12, 13, 14],
        [15, 16, 17]])

And now the same for n_splits=3 :

 In [297]: kf = KFold(n_splits=3, shuffle=True, random_state=1)

 In [298]: train, test = zip(*kf.split(X))

 In [299]: train
 Out[299]:
 (array([0, 3, 4, 5, 6]),
  array([1, 2, 3, 5, 7]),
  array([0, 1, 2, 4, 6, 7]))

 In [300]: test
 Out[300]: (array([1, 2, 7]), array([0, 4, 6]), array([3, 5]))
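If the goal is to end up with the data rows for each fold rather than the index vectors, a plain loop works for any n_splits. This is only a sketch of one way to do it, assuming X is a NumPy array as in the example above; the list names `X_train_folds` and `X_test_folds` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(24).reshape(-1, 3)  # 8 samples, 3 features
kf = KFold(n_splits=3, shuffle=True, random_state=1)

# Collect the actual data rows for every fold; nothing here depends
# on the value of n_splits.
X_train_folds = []
X_test_folds = []
for train_idx, test_idx in kf.split(X):
    X_train_folds.append(X[train_idx])  # integer-array indexing selects rows
    X_test_folds.append(X[test_idx])

for i, (X_tr, X_te) in enumerate(zip(X_train_folds, X_test_folds)):
    print(f"fold {i}: train {X_tr.shape}, test {X_te.shape}")
```

Each sample lands in exactly one test fold, so the test parts together cover the whole dataset.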

When I substitute them into the dataset, X_data[train1] gives an error: KeyError: '[0 1 3 5 6 7 8 9 11 12 13 14 16 18 20 21 22 24\n 25 26 28 29 30 37 41 43 45 47 49 50 51 52 57 60 61 63\n 64 68 70 71 72 73 75 76 79 83 84 86 87 89 92 101 103\n 107 108 109 114 115] not in index'
@AlekseyZakharenkov, how did you get train1? It looks very much like a string representation...
 (train1, test1), (train2, test2) = kf.split(X_data)
 print train1
 [0 1 3 5 6 7 8 9 11 12 13 14 16 18 20 21 22 24 25 26 28 29 30 37 41 43 45 47 49 50 51 52 57 60 61 63 64 68 70 71 72 73 75 76 79 83 84 86 87 89 92 101 103 106 107 108 109 114 115]
 X_data[train1]
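The "not in index" KeyError above is what pandas typically raises when an integer array is passed to a DataFrame with square brackets, because the integers are then looked up as column labels. The thread never states the type of X_data, so the following is a sketch under the assumption that it is a pandas DataFrame; the small frame built here is a hypothetical stand-in. KFold yields row positions, so `.iloc` is the matching accessor.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Hypothetical stand-in for X_data: a DataFrame with named columns.
X_data = pd.DataFrame(np.arange(24).reshape(-1, 3), columns=["a", "b", "c"])

kf = KFold(n_splits=2, shuffle=True, random_state=1)
(train1, test1), (train2, test2) = kf.split(X_data)

# X_data[train1] would treat the integers as column labels and raise KeyError.
# .iloc selects rows by position, which is what KFold's indices are.
X_train1 = X_data.iloc[train1]
X_test1 = X_data.iloc[test1]

print(X_train1.shape, X_test1.shape)
```

If X_data is a plain NumPy array instead, X_data[train1] already selects rows and no change is needed.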
