I want to build a regression model with multiple variables. My data has n = 23 variables and m = 13,000 training examples. Here is a plot of my training data (apartment area vs. price): [image]

The plot shows all 13,000 training examples. As you can see, the data is quite noisy. My question is: which regression algorithm is more suitable and justified in my case? That is, does it make sense to use simple linear regression, or is it better to use some non-linear regression algorithm?

For clarity, here are some examples. An abstract example of linear regression: [image]

And an abstract example of non-linear regression: [image]

Here are some hypothetical regression lines for my data: [image]

As far as I understand, plain linear regression on my data will give a large total error (cost), since the data is noisy and scattered. On the other hand, there is no clear non-linear dependence either (for example, a sinusoidal one). Which regression algorithm is more reasonable to use in my case (apartment prices) to get more accurate price predictions? And why is that algorithm (linear or non-linear) the more reasonable choice?

Addition:
Here is my plot of the linear dependence of price on all 23 parameters: [image]

I do not know what a non-linear dependence would look like in this case, or whether it would be more reasonable than a linear one.

  • You are using too few parameters; in that case even increasing the number of observations will not help. Add more parameters (offhand: whether the apartment has been renovated, distance from the center, district, etc.) and the prediction will be more adequate - BOPOH
  • As I wrote in the question, I have 23 parameters; in the question I called them "variables". They include the ones you listed. At the moment, my price depends linearly on all 23 parameters. What I am trying to work out is whether it is more logical to use linear or some non-linear regression. - Erba Aitbayev
  • You are trying to judge the model's adequacy from the plot, but with 3 parameters the plot would have to be 3-dimensional, not 2-dimensional, and with 23 parameters a plot tells you nothing at all. Look at your error cost: if it decreases, your model is getting closer to the data. But by only reducing the error you risk ending up with an inadequate model that merely copies your input data. On another, real data set its error may be larger than that of a rough estimate. If the topic interests you, look at the lectures; there is enough theory on the topic there (although the practical part is very weak, which is why I did not like the course) - BOPOH
  • "apartment area vs price": see habrahabr.ru/post/148782 - Stack
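
The point made in the comments, that training error can keep falling while error on new data does not, can be checked with a hold-out split. Below is a minimal Python sketch under loud assumptions: the data is synthetic (made up, not the real apartment set), there is a single predictor (area, in hypothetical units of 100 m²), and polynomial degree stands in for "non-linear". The helper names `solve`, `fit_poly`, `mse` and `make_data` are hypothetical, and only the standard library is used.

```python
import random

def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def fit_poly(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations (X^T X) beta = X^T y."""
    k = degree + 1
    rows = [[x ** p for p in range(k)] for x in xs]
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    return solve(XtX, Xty)

def mse(coefs, xs, ys):
    """Mean squared error of the polynomial with coefficients `coefs`."""
    preds = [sum(c * x ** p for p, c in enumerate(coefs)) for x in xs]
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)

def make_data(n, rng):
    """Synthetic stand-in data: price is linear in area plus heavy Gaussian noise."""
    xs = [rng.uniform(0.2, 1.5) for _ in range(n)]
    ys = [20.0 + 100.0 * x + rng.gauss(0.0, 30.0) for x in xs]
    return xs, ys

rng = random.Random(0)
train_x, train_y = make_data(200, rng)
valid_x, valid_y = make_data(200, rng)  # data the model never sees while fitting

results = {}
for degree in (1, 2, 5):
    coefs = fit_poly(train_x, train_y, degree)
    results[degree] = (mse(coefs, train_x, train_y), mse(coefs, valid_x, valid_y))
    print(f"degree {degree}: train MSE {results[degree][0]:.1f}, "
          f"validation MSE {results[degree][1]:.1f}")
```

Training error cannot get worse as the degree grows, because each model contains the previous one as a special case. Only the validation column says whether the extra flexibility helps, and the same check applies unchanged with 23 predictors.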

1 answer

Information criteria are commonly used to compare statistical models, for example the Akaike information criterion (AIC). If you are working in R, look at the stepAIC function (from the MASS package): it lets you simplify a linear model by dropping predictors one at a time, starting with the ones that matter least to the model.
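
To show the idea behind AIC-based backward elimination, here is a rough Python sketch. It is not stepAIC itself, only an illustration of the same principle. It assumes a Gaussian linear model fitted by ordinary least squares, for which AIC can be computed up to an additive constant as n·ln(RSS/n) + 2k, where k is the number of fitted coefficients. The data is synthetic and the helper names `solve`, `rss`, `aic` and `backward_select` are hypothetical.

```python
import math
import random

def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def rss(X, y, cols):
    """Residual sum of squares of an OLS fit (with intercept) on the columns `cols`."""
    rows = [[1.0] + [r[c] for c in cols] for r in X]
    k = len(cols) + 1
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(XtX, Xty)
    return sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, yi in zip(rows, y))

def aic(X, y, cols):
    """AIC of the Gaussian linear model, up to an additive constant."""
    n = len(y)
    return n * math.log(rss(X, y, cols) / n) + 2 * (len(cols) + 1)

def backward_select(X, y):
    """Drop one predictor at a time for as long as doing so lowers AIC."""
    cols = list(range(len(X[0])))
    best = aic(X, y, cols)
    while cols:
        cand_aic, drop = min((aic(X, y, [c for c in cols if c != d]), d)
                             for d in cols)
        if cand_aic >= best:
            break  # no single removal improves the criterion
        cols.remove(drop)
        best = cand_aic
    return cols, best

# Synthetic data: y depends on columns 0 and 1; column 2 is pure noise.
random.seed(0)
X = [[random.gauss(0.0, 1.0) for _ in range(3)] for _ in range(300)]
y = [3.0 * r[0] - 2.0 * r[1] + random.gauss(0.0, 1.0) for r in X]

full_aic = aic(X, y, [0, 1, 2])
selected, final_aic = backward_select(X, y)
print("kept predictors:", selected)
print(f"AIC full model: {full_aic:.2f}, AIC selected model: {final_aic:.2f}")
```

The same criterion can compare a linear model with one that adds non-linear terms (squares, interactions): the extra terms are worth keeping only if they lower AIC despite the 2k penalty.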