I have data without a description (categorical and continuous features) and a target variable (class: 0 or 1). I need to select the features that best describe the target variable.

As a solution, the following seems interesting (a rough sketch in code is given right after the list):

  1. Compute the metric on the full dataset.
  2. Eliminate one feature at a time, computing the metric without it.
  3. Discard all features without which the metric increased.
  4. Go back to step 1. Repeat until the number of features stops decreasing.
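
A rough sketch of this loop, assuming an already-encoded pandas DataFrame X, a binary target y, and cross-validated accuracy as the metric (all of these are illustrative assumptions, not part of the original data):

    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    def metric(X, y):
        # Placeholder metric: mean cross-validated accuracy of a logistic regression.
        return cross_val_score(LogisticRegression(max_iter=1000), X, y).mean()

    def backward_elimination(X, y):
        features = list(X.columns)
        while True:
            baseline = metric(X[features], y)                        # step 1
            # step 2: score the data with each feature removed in turn
            scores = {f: metric(X[[c for c in features if c != f]], y)
                      for f in features}
            # step 3: drop every feature whose removal did not hurt the metric
            to_drop = [f for f in features if scores[f] >= baseline]
            if not to_drop or len(to_drop) == len(features):
                return features                                      # step 4: stop
            features = [f for f in features if f not in to_drop]     # otherwise repeat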

At first glance it already seems doubtful that we compute the metric with only one feature absent at a time, yet then exclude several features at once. It also seems wrong to exclude a feature immediately after scoring it, since then the next feature would no longer be evaluated on the full dataset.

How can such an algorithm be built correctly, so that the mutual influence of the features on the target variable is taken into account?

  • Can you give an example of a training data set? - MaxU
  • @MaxU All the same data as before. - Nicolas Chabanovsky ♦
  • What is your goal? Reducing the number of features? - MaxU
  • Maybe you should use something like RFECV scikit-learn.org/stable/modules/generated/… - Viktorov
  • @MaxU (I apologize for the delay in answering, I moved.) The assumption is that not all of the data is useful for the model (for example, the categorical features may include encoded full names). Because of the large number of features, the total feature space is quite large. I want to reduce it by choosing the most useful features. - Nicolas Chabanovsky ♦

3 Answers

Generally speaking, the problem of selecting features is not as easy as it seems. Incidentally, in the English-language literature this problem is called "feature selection".

Your solution, being one of the simplest and most intuitive, has a right to exist (it is nicely described here: http://www.machinelearning.ru/wiki/images/2/21/PZAD2017_09_featureselection.pdf ). If you want to select features in a deeper or more mathematically rigorous way, you first need to understand why. There are several possible reasons: for example, to simplify the model or, rather, to reduce training time, to lower the risk of overfitting, to improve accuracy (since the presence of uninformative features degrades it), and so on. Next, you need to understand how many features you have to start with: a dozen (for example, in medical diagnostics) or ten thousand (as in image classification); text analysis is a special case in its own right. Naturally, you also need to understand whether you have a classification or a clustering problem, etc. Each case has its own methods and approaches, and a single universal solution, especially one detached from the semantics of the problem, does not exist.

An overview of current feature selection methods can be found here:

https://www.sciencedirect.com/user/login?returnURL=/getaccess/pii/S0045790613003066

or here:

http://www.mathnet.ru/links/3a35d35d65879c35692c5cfa56af64af/ia429.pdf

Here is a description of the problem together with a good "roadmap" for the researcher:

https://machinelearningmastery.com/an-introduction-to-feature-selection/

Vorontsov has a separate lecture on the topic:

https://www.youtube.com/watch?v=n4qKbFd25Sk

The material also seems to be well presented here:

https://ru.coursera.org/lecture/unsupervised-learning/odnomiernyi-otbor-priznakov-t0xdz

A good overview and description of the tools is here:

https://www.sciencedirect.com/user/login?returnURL=/getaccess/pii/S0045790613003066

Finally, I have left aside principal component analysis (PCA), which can also be used: in mathematical terms, to reduce the dimensionality of the feature space; in plain language, as a particular way of cutting down the number of features.
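
A minimal sketch of dimensionality reduction with PCA in scikit-learn (the random data and the 95% variance threshold are purely illustrative; note that PCA constructs new components rather than selecting original features):

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    # Illustrative stand-in for an already-encoded numeric feature matrix.
    X_numeric = np.random.rand(100, 20)

    # PCA is scale-sensitive, so standardize first.
    X_scaled = StandardScaler().fit_transform(X_numeric)

    # Keep enough components to explain ~95% of the variance.
    pca = PCA(n_components=0.95)
    X_reduced = pca.fit_transform(X_scaled)

    print(X_reduced.shape)                 # fewer columns than the original
    print(pca.explained_variance_ratio_)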

  • Thank you very much for your answer! I looked through the presentation on feature selection. It is very interesting, although an example implementation of everything it talks about would help me a lot. - Nicolas Chabanovsky ♦

Your approach seems quite workable to me, provided you rule out multicollinearity among the features and plan to train and test the model on each set of columns as the metric. As noted in some of the links from @passant's answer, the "successful" feature sets may differ between classification algorithms.

That is why many methods from sklearn.feature_selection expect an estimator (in your case, an instance of the corresponding classifier) as an input parameter.

This is where sklearn.feature_selection.SelectFromModel can help: as the estimator it accepts any algorithm that exposes a feature_importances_ or coef_ attribute after training.
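
A minimal sketch of SelectFromModel with a tree-based estimator (the synthetic dataset and the "mean" threshold are illustrative assumptions):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import ExtraTreesClassifier
    from sklearn.feature_selection import SelectFromModel

    # Illustrative synthetic binary-classification dataset.
    X, y = make_classification(n_samples=500, n_features=30, n_informative=8,
                               random_state=0)

    # Any estimator that exposes feature_importances_ (or coef_) after fitting will do.
    forest = ExtraTreesClassifier(n_estimators=250, random_state=0).fit(X, y)

    # Keep only the features whose importance exceeds the mean importance.
    selector = SelectFromModel(forest, threshold="mean", prefit=True)
    X_selected = selector.transform(X)

    print(X.shape, "->", X_selected.shape)
    print(selector.get_support(indices=True))   # indices of the kept features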

    I tried several ways of selecting features. As it turned out, my “naive” approach gave the best result on my data, because:

    • Tree-based methods provide feature importances, but not an optimal feature set.
    • Recursive elimination methods take as input the “algorithm” that will be used to build the model. With categorical data and logistic regression, we end up passing in a lot of “dummy” features.

    I was looking for a method that would find the set of features giving the best result. As it turned out, this set differs from the N most "important" features.


    Below is sample code for some of the approaches I tried:

    Simple brute-force elimination

    prepare_data_to_process converts the categorical data, applies RobustScaler to the real-valued features, etc. It returns a pd.DataFrame and the names of the features used, split by type.
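
    prepare_data_to_process itself is not shown here; a minimal hypothetical sketch of what such a helper might look like (one-hot encoding for the categorical columns and RobustScaler for the numeric ones are assumptions for illustration, matching the three-value return used below):

        import pandas as pd
        from sklearn.preprocessing import RobustScaler

        def prepare_data_to_process(data_categorical, data_numerical,
                                    cat_columns, num_columns):
            # Hypothetical: one-hot encode the chosen categorical columns
            # and scale the chosen numeric columns.
            cat_part = pd.get_dummies(data_categorical[list(cat_columns)].astype(str))
            num_part = data_numerical[list(num_columns)]
            if len(num_part.columns):
                num_part = pd.DataFrame(RobustScaler().fit_transform(num_part),
                                        columns=num_part.columns,
                                        index=num_part.index)
            X = pd.concat([num_part, cat_part], axis=1)
            return X, list(num_part.columns), list(cat_part.columns)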

        def test_feature_set(data_categorical, data_numerical, y):
            # Score one candidate feature set with cross-validated logistic regression.
            X, numeric_cols, categorical_cols = prepare_data_to_process(
                data_categorical, data_numerical,
                data_categorical.columns, data_numerical.columns
            )
            if X.shape[0] == 0 or X.shape[1] == 0:
                return [-1.]
            return cross_val_score(LogisticRegression(), X, y)


        def leave_only_best(data_categorical, data_numerical, cat_columns, num_columns, labels):
            # Step 1: the metric on the current full feature set.
            baseline = test_feature_set(
                data_categorical[cat_columns], data_numerical[num_columns], labels
            )
            current_feature_num = len(cat_columns) + len(num_columns)
            print("\r\n\r\nBaseline:", np.mean(baseline), "feature num:", current_feature_num)

            # Step 2: remove one feature at a time and re-score.
            num_result_set = list()
            cat_result_set = list()
            for index in range(len(num_columns)):
                columns = num_columns[:index] + num_columns[index+1:]
                scores = test_feature_set(
                    data_categorical[cat_columns], data_numerical[columns], labels
                )
                num_result_set.append((num_columns[index], scores))
                print("Without %s:" % num_columns[index], np.mean(scores))
            for index in range(len(cat_columns)):
                columns = cat_columns[:index] + cat_columns[index+1:]
                scores = test_feature_set(
                    data_categorical[columns], data_numerical[num_columns], labels
                )
                cat_result_set.append((cat_columns[index], scores))
                print("Without %s:" % cat_columns[index], np.mean(scores))

            # Step 3: keep only the features whose removal made the metric worse.
            new_num_columns = list()
            new_cat_columns = list()
            for column, scores in num_result_set:
                if np.mean(baseline) >= np.mean(scores):
                    new_num_columns.append(column)
            for column, scores in cat_result_set:
                if np.mean(baseline) >= np.mean(scores):
                    new_cat_columns.append(column)

            # Step 4: recurse while the feature set keeps shrinking and the metric does not drop.
            new_feature_num = len(new_num_columns) + len(new_cat_columns)
            new_baseline = test_feature_set(
                data_categorical[new_cat_columns], data_numerical[new_num_columns], labels
            )
            if np.mean(new_baseline) < np.mean(baseline):
                return num_columns, cat_columns
            if new_feature_num < current_feature_num:
                return leave_only_best(
                    data_categorical, data_numerical, new_cat_columns, new_num_columns, labels
                )
            return new_num_columns, new_cat_columns


        def find_optimal_feature_set(scoring):
            data_categorical, data_numerical, y = load_and_split()
            return leave_only_best(
                data_categorical, data_numerical,
                data_categorical.columns.tolist(), data_numerical.columns.tolist(),
                y
            )


        num_columns, cat_columns = find_optimal_feature_set(scoring)
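
    For comparison, recent scikit-learn versions (0.24+) ship SequentialFeatureSelector, which performs essentially the same backward elimination; a minimal sketch, assuming an already-encoded feature matrix X and target y:

        from sklearn.feature_selection import SequentialFeatureSelector
        from sklearn.linear_model import LogisticRegression

        # Assumed: X is an already-encoded feature matrix, y is the binary target.
        selector = SequentialFeatureSelector(
            LogisticRegression(max_iter=1000),
            n_features_to_select=10,      # illustrative target size
            direction='backward',
            scoring='accuracy',
            cv=5,
        )
        selector.fit(X, y)
        print(selector.get_support())     # boolean mask of the kept features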

    Finding the features that contribute the most, using ExtraTreesClassifier.

    The code is borrowed from the scikit-learn documentation.

        def most_important_features():
            data_categorical, data_numerical, y = load_and_split()
            encoder = LabelEncoder()
            data_categorical = data_categorical.apply(lambda x: encoder.fit_transform(x)).astype('str')
            data_categorical = fill_missed_categorical(data_categorical).astype('category')
            X = data_numerical.join(data_categorical)

            forest = ExtraTreesClassifier(n_estimators=250, random_state=0)
            forest.fit(X, y)

            importances = forest.feature_importances_
            std = np.std([tree.feature_importances_ for tree in forest.estimators_], axis=0)
            indices = np.argsort(importances)[::-1]

            print("Feature ranking:")
            for f in range(X.shape[1]):
                print("%d. feature %s (%f)" % (f + 1, X.columns[indices[f]], importances[indices[f]]))

            plt.figure(figsize=(16, 9))
            plt.title("Feature importances")
            plt.bar(range(X.shape[1]), importances[indices], color="r", yerr=std[indices], align="center")
            plt.xticks(range(X.shape[1]), [f for f in range(1, X.shape[1]+1)])
            plt.xlim([-1, X.shape[1]])
            plt.show()
            return indices, importances, X.columns


        indices, importances, columns = most_important_features()

    You should get something like this:

    [Figure: feature importances bar chart]

    Your plot will show the feature names with their numerical "contribution". From the plot you can pick out "ranges" of importance, then add features to the model in a loop and see how the algorithm's running time and the final result change.

    RFECV - recursive feature elimination with cross-validation

    It lets you rank the features by their importance (in groups!).

        def test_rfecv():
            data_categorical, data_numerical, y = load_and_split()
            X, dummy_train, scaler, numeric_cols, categorical_cols = prepare_data_to_process(
                data_categorical, data_numerical, [], data_numerical.columns
            )
            print("result data:", X.shape)
            selector = RFECV(LogisticRegression(), cv=get_cv(), scoring=scoring)
            selected = selector.fit(X, y)
            return selector, selected, X.columns


        selector, selected, select_columns = test_rfecv()

        for item in zip(select_columns, selected.ranking_):
            print(item)
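
    The fitted selector also exposes a boolean support_ mask and n_features_, so the selected column names can be read off directly:

        # ranking_ == 1 corresponds to support_ == True
        best_columns = [col for col, keep in zip(select_columns, selected.support_) if keep]
        print("selected", selected.n_features_, "features:", best_columns)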