Good day. I have been doing regression tasks with scikit-learn and xgboost. However, my forecasting task is slightly different: can you give an example of what form the data should take, and sample Python code, for working with xgboost? For regression tasks I collected data as rows of the following kind:

1;34;234;234;123;2;321;2;123213;24534;3;278 

where the input vector is the first n-1 parameters and the predicted value is the last column. As I understand it, in forecasting tasks the row itself is a single parameter that changes over time, while in regression problems there are several different parameters from which we predict one. The question is: how do forecasting and regression tasks differ at the programming level? I.e., the input data is

    x0        x1        x2        x3        x4        x5        y
    0.392689  0.117810  0.242750  0.931792  0.972802  0.898693  0.429941
    0.569055  0.622889  0.762683  0.095271  0.101407  0.510155  0.542256
    0.939509  0.993534  0.772005  0.164555  0.800897  0.591883  0.190720
    0.284297  0.292773  0.290652  0.045383  0.564894  0.347683  0.014610

where we predict the value of y. And then there is data of the following form:

    x1  0.284297  0.292773  0.290652  0.045383  0.564894  0.347683  0.014610  0.961696
    x2  0.939509  0.993534  0.772005  0.164555  0.800897  0.591883  0.190720  0.040162

where it is necessary to predict the further values of x1, x2, ... What is the difference in terms of code?
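At the code level, the usual way to bridge the two tasks is to convert the time series into an ordinary regression table using lag features (a sliding window). A minimal sketch, assuming a hypothetical window size of 3:

```python
import numpy as np
import pandas as pd

# hypothetical single series: one "parameter" changing over time
series = pd.Series([0.284297, 0.292773, 0.290652, 0.045383,
                    0.564894, 0.347683, 0.014610, 0.961696])

# build lag features: each row holds `window` consecutive values,
# and the target is the value that follows them
window = 3
X = np.array([series.values[i:i + window]
              for i in range(len(series) - window)])
y = series.values[window:]

print(X.shape)  # (5, 3) -- 5 training rows, 3 lag features each
print(y.shape)  # (5,)
```

Each resulting row looks exactly like a regression sample: the window values are the features and the next observation is the target, so the same regressor code applies afterwards.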

    import xgboost as xgb
    import pandas as pd
    from sklearn.metrics import mean_absolute_error, mean_squared_error, median_absolute_error

    df = pd.read_csv('file1.csv', sep=';', header=None)
    X_train = df.drop(7, axis=1)
    Y_train = df[7]

    test_data = pd.read_csv('file2.csv', sep=';', header=None)
    X_test = test_data.drop(7, axis=1)
    Y_test = test_data[7]

    xgb_model = xgb.XGBRegressor(max_depth=3)  # max_depth must be passed as a keyword, not a bare name
    cl = xgb_model.fit(X_train, Y_train)
    predictions = cl.predict(X_test)
    actuals = Y_test

    print(mean_absolute_error(actuals, predictions))
    print(mean_squared_error(actuals, predictions))
    print(median_absolute_error(actuals, predictions))

Roughly speaking, will this code predict the future value of x1 for the second data set just as well as it predicts y for the first?

  • Do you want to predict several output parameters at once? - MaxU
  • Yes. It turns out that for regression problems it is enough to predict the last parameter, while for forecasting problems, given a row of n parameters, we predict, say, the last n/3 of them. - KordDEM
  • Maybe this will make it clearer. If our data is: x0 x1 x2 x3 x4 x5 x6 y 0.284297 0.292773 0.290652 0.045383 0.564894 0.347683 0.014610 0.961696 and we predict y. And if the data has the following form: x1 0.284297 0.292773 0.290652 0.045383 0.564894 0.347683 0.014610 0.961696 and it is necessary to predict the further values of x0, what is the difference in terms of code? - KordDEM
  • I don't understand how your columns (x1, x2) became rows, or did you simply present them that way for convenience? P.S. Your code will predict y from the parameter set X; it won't predict x. - MaxU
  • We transposed the first file, and now each line is a separate parameter that we need to predict. - KordDEM

1 Answer

Answer to the question before it was edited

How do I split the data into training and test samples?

For such tasks the Pandas and NumPy modules are ideal: they let you work with whole matrices and vectors without loops (vectorized solutions), which is orders of magnitude faster than processing in loops.

Here is a small example:

    import pandas as pd
    import numpy as np

    # generate random DataFrame (shape: 10, 8)
    In [13]: df = pd.DataFrame(np.random.rand(10,8)).add_prefix('x')

    In [14]: df.columns = df.columns[:-1].tolist() + ['y']

    In [15]: df
    Out[15]:
             x0        x1        x2        x3        x4        x5        x6         y
    0  0.392689  0.117810  0.242750  0.931792  0.972802  0.898693  0.429941  0.619093
    1  0.569055  0.622889  0.762683  0.095271  0.101407  0.510155  0.542256  0.848998
    2  0.939509  0.993534  0.772005  0.164555  0.800897  0.591883  0.190720  0.040162
    3  0.284297  0.292773  0.290652  0.045383  0.564894  0.347683  0.014610  0.961696
    4  0.065868  0.974128  0.749756  0.778895  0.872915  0.585320  0.851837  0.408333
    5  0.818768  0.343451  0.985583  0.860080  0.876103  0.554149  0.132387  0.506820
    6  0.713177  0.567278  0.587488  0.459199  0.082245  0.677964  0.229960  0.265138
    7  0.751670  0.902665  0.353395  0.975563  0.823437  0.742916  0.760047  0.567249
    8  0.106809  0.068440  0.075260  0.435980  0.412090  0.226181  0.909518  0.714608
    9  0.281475  0.641496  0.695424  0.993351  0.958840  0.457999  0.203841  0.007968

Now we can slice the data however is convenient for us.

For example, if we need all the columns except the last:

    In [16]: df.iloc[:, :-1]
    Out[16]:
             x0        x1        x2        x3        x4        x5        x6
    0  0.392689  0.117810  0.242750  0.931792  0.972802  0.898693  0.429941
    1  0.569055  0.622889  0.762683  0.095271  0.101407  0.510155  0.542256
    2  0.939509  0.993534  0.772005  0.164555  0.800897  0.591883  0.190720
    3  0.284297  0.292773  0.290652  0.045383  0.564894  0.347683  0.014610
    4  0.065868  0.974128  0.749756  0.778895  0.872915  0.585320  0.851837
    5  0.818768  0.343451  0.985583  0.860080  0.876103  0.554149  0.132387
    6  0.713177  0.567278  0.587488  0.459199  0.082245  0.677964  0.229960
    7  0.751670  0.902665  0.353395  0.975563  0.823437  0.742916  0.760047
    8  0.106809  0.068440  0.075260  0.435980  0.412090  0.226181  0.909518
    9  0.281475  0.641496  0.695424  0.993351  0.958840  0.457999  0.203841

or all the columns whose names start with x:

    In [17]: df.filter(regex='^x\d+')
    Out[17]:
             x0        x1        x2        x3        x4        x5        x6
    0  0.392689  0.117810  0.242750  0.931792  0.972802  0.898693  0.429941
    1  0.569055  0.622889  0.762683  0.095271  0.101407  0.510155  0.542256
    2  0.939509  0.993534  0.772005  0.164555  0.800897  0.591883  0.190720
    3  0.284297  0.292773  0.290652  0.045383  0.564894  0.347683  0.014610
    4  0.065868  0.974128  0.749756  0.778895  0.872915  0.585320  0.851837
    5  0.818768  0.343451  0.985583  0.860080  0.876103  0.554149  0.132387
    6  0.713177  0.567278  0.587488  0.459199  0.082245  0.677964  0.229960
    7  0.751670  0.902665  0.353395  0.975563  0.823437  0.742916  0.760047
    8  0.106809  0.068440  0.075260  0.435980  0.412090  0.226181  0.909518
    9  0.281475  0.641496  0.695424  0.993351  0.958840  0.457999  0.203841

Now we divide the data set into training and test samples, shuffling the data first:

    In [19]: df_train, df_test = np.split(df.sample(frac=1), [6])

    In [20]: df_train
    Out[20]:
             x0        x1        x2        x3        x4        x5        x6         y
    3  0.284297  0.292773  0.290652  0.045383  0.564894  0.347683  0.014610  0.961696
    2  0.939509  0.993534  0.772005  0.164555  0.800897  0.591883  0.190720  0.040162
    9  0.281475  0.641496  0.695424  0.993351  0.958840  0.457999  0.203841  0.007968
    1  0.569055  0.622889  0.762683  0.095271  0.101407  0.510155  0.542256  0.848998
    8  0.106809  0.068440  0.075260  0.435980  0.412090  0.226181  0.909518  0.714608
    7  0.751670  0.902665  0.353395  0.975563  0.823437  0.742916  0.760047  0.567249

    In [21]: df_test
    Out[21]:
             x0        x1        x2        x3        x4        x5        x6         y
    4  0.065868  0.974128  0.749756  0.778895  0.872915  0.585320  0.851837  0.408333
    0  0.392689  0.117810  0.242750  0.931792  0.972802  0.898693  0.429941  0.619093
    6  0.713177  0.567278  0.587488  0.459199  0.082245  0.677964  0.229960  0.265138
    5  0.818768  0.343451  0.985583  0.860080  0.876103  0.554149  0.132387  0.506820
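The same shuffled split can also be obtained with scikit-learn's train_test_split; a sketch with the same 6/4 proportions (the random_state value here is an arbitrary choice):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# same kind of synthetic frame: 10 rows, 7 features + target
df = pd.DataFrame(np.random.rand(10, 8)).add_prefix('x')
df.columns = df.columns[:-1].tolist() + ['y']

# shuffles and splits in one call; test_size=0.4 gives 6 train / 4 test rows
df_train, df_test = train_test_split(df, test_size=0.4, random_state=0)

print(df_train.shape)  # (6, 8)
print(df_test.shape)   # (4, 8)
```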

P.S. Practically all machine learning libraries I know of accept pandas DataFrames, pandas Series and NumPy arrays as input.

I.e., something like:

    clf = xgb.XGBClassifier(...)
    clf.fit(df_train.filter(regex='^x'), df_train.loc[:, 'y'])
  • The question was not how to work with matrices in Python or which modules to use for that; I can handle all of this with other modules just as well. The question is how the forecasting task differs from regression tasks, in which several parameters are used for the prediction, while in forecasting there is one parameter, but it changes over time. - KordDEM
  • @KordDEM, regression is also forecasting: the prediction of non-discrete (continuous) values. What do you mean by "while in forecasting there is one parameter, but a changing one"? - MaxU
  • I have clarified it. In short, even in the example you presented, you only predict the last column; the only difference is that here a row is the change of one parameter over time, whereas in a regression task each column is a separate parameter. I hope I formulated it correctly. P.S. Why shuffle the rows when the records are completely independent? - KordDEM
  • A row (except the last element) is a set of parameters with which we want to predict the last value (the target). This is true both for regression tasks and for classification problems. - MaxU
  • I.e., the difference lies only in how the data is collected? And classification and regression tasks, by the formulation you gave, do not differ? Just curious to get to the bottom of it. - KordDEM