I tried several ways to select features. As it turned out, my “naive” approach gave the best result on my data, because:
- Tree-based methods rank features by importance, but do not pick the optimal subset.
- Recursive feature elimination methods take as input the “estimator” that will be used to build the model. With categorical data and logistic regression, we end up passing in a lot of “dummy” features (see the toy example right after this list).
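As a toy illustration of the last point (the column and values here are made up, not from my dataset):

import pandas as pd

# One categorical feature explodes into one dummy column per distinct value
# after one-hot encoding, so RFE ends up ranking dummies instead of the
# original feature.
toy = pd.DataFrame({'city': ['Moscow', 'Kazan', 'Omsk', 'Kazan']})
print(pd.get_dummies(toy))
# -> columns city_Kazan, city_Moscow, city_Omsk instead of a single 'city'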
I was looking for a method that would find the set of features giving the best result. As it turned out, this set differs from the top N most "important" features.
Below is sample code for some of the approaches I tried:
Simple brute force
prepare_data_to_process converts the categorical data, applies RobustScaler to the real-valued features, and so on. It returns a pd.DataFrame and the names of the features used, split by type.
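The helper itself is not shown in this post; a minimal sketch of what it might look like, assuming pd.get_dummies for the categorical part and RobustScaler for the real-valued part, and matching the three-value return used in the next snippet (the RFECV snippet further below unpacks two extra objects, the fitted dummy encoder and scaler, from what looks like another version of the same helper):

import pandas as pd
from sklearn.preprocessing import RobustScaler

# Assumption-based sketch, not the original helper.
def prepare_data_to_process(data_categorical, data_numerical, cat_columns, num_columns):
    # One-hot encode the requested categorical columns.
    cat_part = pd.get_dummies(data_categorical[list(cat_columns)].astype('str'))
    # Scale the requested real-valued columns.
    num_part = data_numerical[list(num_columns)]
    if num_part.shape[1] > 0:
        num_part = pd.DataFrame(RobustScaler().fit_transform(num_part),
                                columns=num_part.columns, index=num_part.index)
    X = pd.concat([num_part, cat_part], axis=1)
    return X, list(num_part.columns), list(cat_part.columns)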
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def test_feature_set(data_categorical, data_numerical, y):
    # Prepare the given subset of features and score it with logistic regression.
    X, numeric_cols, categorical_cols = prepare_data_to_process(
        data_categorical, data_numerical,
        data_categorical.columns, data_numerical.columns
    )
    if X.shape[0] == 0 or X.shape[1] == 0:
        return [-1.]
    return cross_val_score(LogisticRegression(), X, y)


def leave_only_best(data_categorical, data_numerical, cat_columns, num_columns, labels):
    # Score the current feature set as the baseline.
    baseline = test_feature_set(
        data_categorical[cat_columns], data_numerical[num_columns], labels
    )
    current_feature_num = len(cat_columns) + len(num_columns)
    print("\r\n\r\nBaseline:", np.mean(baseline), "feature num:", current_feature_num)

    num_result_set = list()
    cat_result_set = list()

    # Drop each numerical feature in turn and record the score without it.
    for index in range(len(num_columns)):
        columns = num_columns[:index] + num_columns[index+1:]
        scores = test_feature_set(
            data_categorical[cat_columns], data_numerical[columns], labels
        )
        num_result_set.append((num_columns[index], scores))
        print("Without %s:" % num_columns[index], np.mean(scores))

    # Do the same for each categorical feature.
    for index in range(len(cat_columns)):
        columns = cat_columns[:index] + cat_columns[index+1:]
        scores = test_feature_set(
            data_categorical[columns], data_numerical[num_columns], labels
        )
        cat_result_set.append((cat_columns[index], scores))
        print("Without %s:" % cat_columns[index], np.mean(scores))

    # Keep only the features whose removal did not improve on the baseline.
    new_num_columns = list()
    new_cat_columns = list()
    for column, scores in num_result_set:
        if np.mean(baseline) >= np.mean(scores):
            new_num_columns.append(column)
    for column, scores in cat_result_set:
        if np.mean(baseline) >= np.mean(scores):
            new_cat_columns.append(column)

    new_feature_num = len(new_num_columns) + len(new_cat_columns)
    new_baseline = test_feature_set(
        data_categorical[new_cat_columns], data_numerical[new_num_columns], labels
    )
    # If the reduced set is worse, stay with the current one.
    if np.mean(new_baseline) < np.mean(baseline):
        return num_columns, cat_columns
    # If some features were dropped, try to shrink the set further.
    if new_feature_num < current_feature_num:
        return leave_only_best(
            data_categorical, data_numerical, new_cat_columns, new_num_columns, labels
        )
    return new_num_columns, new_cat_columns


def find_optimal_feature_set(scoring):
    # NB: scoring is accepted here, but cross_val_score above uses the estimator's default scorer.
    data_categorical, data_numerical, y = load_and_split()
    return leave_only_best(
        data_categorical, data_numerical,
        data_categorical.columns.tolist(), data_numerical.columns.tolist(), y
    )


num_columns, cat_columns = find_optimal_feature_set(scoring)
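This snippet and the RFECV one below also refer to scoring and get_cv(), which I don't show; hypothetical stand-ins could look like this (the metric and fold count are my guesses, not the original values):

from sklearn.model_selection import StratifiedKFold

# Guessed values, not the originals: any sklearn scoring string and CV splitter will do.
scoring = 'roc_auc'

def get_cv():
    return StratifiedKFold(n_splits=5, shuffle=True, random_state=0)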
Finding the features that contribute the most with ExtraTreesClassifier
The code is borrowed from the scikit-learn documentation.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.preprocessing import LabelEncoder


def most_important_features():
    data_categorical, data_numerical, y = load_and_split()

    # Encode categorical values as integers so the forest can consume them.
    encoder = LabelEncoder()
    data_categorical = data_categorical.apply(lambda x: encoder.fit_transform(x)).astype('str')
    data_categorical = fill_missed_categorical(data_categorical).astype('category')
    X = data_numerical.join(data_categorical)

    # Fit a forest of randomized trees and collect per-feature importances.
    forest = ExtraTreesClassifier(n_estimators=250, random_state=0)
    forest.fit(X, y)
    importances = forest.feature_importances_
    std = np.std([tree.feature_importances_ for tree in forest.estimators_], axis=0)
    indices = np.argsort(importances)[::-1]

    print("Feature ranking:")
    for f in range(X.shape[1]):
        print("%d. feature %s (%f)" % (f + 1, X.columns[indices[f]], importances[indices[f]]))

    # Plot the importances with the inter-tree standard deviation as error bars.
    plt.figure(figsize=(16, 9))
    plt.title("Feature importances")
    plt.bar(range(X.shape[1]), importances[indices], color="r", yerr=std[indices], align="center")
    plt.xticks(range(X.shape[1]), [f for f in range(1, X.shape[1] + 1)])
    plt.xlim([-1, X.shape[1]])
    plt.show()

    return indices, importances, X.columns


indices, importances, columns = most_important_features()
You should get something like this:

On your plot, the bars will be labeled with feature names and their numerical “contribution”. Using the chart, you can pick “ranges” of importance, then add features in a loop and see how the algorithm's running time and the final result change.
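For example, you can grow the feature set in order of decreasing importance and cross-validate each prefix of the ranking. A rough sketch (score_top_k_features and the step size are mine; X and y are assumed to be prepared the same way as inside most_important_features):

import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical helper: take the top-k features by importance and measure the
# cross-validated score as k grows.
def score_top_k_features(X, y, indices, step=5):
    results = []
    for k in range(step, X.shape[1] + 1, step):
        top_columns = X.columns[indices[:k]]
        scores = cross_val_score(
            ExtraTreesClassifier(n_estimators=250, random_state=0), X[top_columns], y
        )
        results.append((k, np.mean(scores)))
        print("top %d features: %.4f" % (k, np.mean(scores)))
    return results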
RFECV - Recursive Feature Elimination with Cross-Validation
It lets you rank the features by importance (in groups!).
from sklearn.feature_selection import RFECV


def test_rfecv():
    data_categorical, data_numerical, y = load_and_split()
    # Only the numerical features are prepared here; this variant of the helper
    # also returns the fitted dummy encoder and scaler.
    X, dummy_train, scaler, numeric_cols, categorical_cols = prepare_data_to_process(
        data_categorical, data_numerical, [], data_numerical.columns
    )
    print("result data:", X.shape)

    # Recursively drop features, keeping the subset with the best CV score.
    selector = RFECV(LogisticRegression(), cv=get_cv(), scoring=scoring)
    selected = selector.fit(X, y)
    return selector, selected, X.columns


selector, selected, select_columns = test_rfecv()

for item in zip(select_columns, selected.ranking_):
    print(item)
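Besides the ranking, the fitted selector tells you directly which columns survived (selected is the same object as selector, since fit returns self):

# ranking_ == 1 corresponds to support_ == True, i.e. the columns RFECV kept.
best_columns = select_columns[selected.support_]
print("RFECV kept %d of %d features:" % (selected.n_features_, len(select_columns)))
print(list(best_columns))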