Which regression algorithm to choose for noisy (scattered) data?

Question

I want to build a regression model with multiple variables. My data has n = 23 variables and m = 13,000 training examples. Here is a plot of my training data (apartment area vs. price):

The plot shows all 13,000 training examples. As you can see, the data is quite noisy. My question is: which regression algorithm is more suitable and justified in my case? That is, is it reasonable to use simple linear regression, or is it better to use some non-linear regression algorithm?

For clarity, I will give examples. Here is an abstract example of linear regression:

As well as an abstract example of non-linear regression:

Here are some examples with hypothetical regression lines for my data:

As far as I understand, plain linear regression on my data will produce a large total error (cost), since the data is noisy and scattered. On the other hand, there is no clear non-linear dependence either (for example, sinusoidal). Which regression algorithm is more reasonable to use in my case (apartment prices) to get more accurate price predictions? And why is that algorithm (linear or non-linear) more reasonable?

Addition:
Here is my plot of the linear dependence of price on all 23 parameters:

I do not know what a non-linear dependence would look like in this case, or whether it would be more reasonable than a linear one.

You are using too few parameters; in that case even increasing the number of observations will not help.
Add more parameters (offhand: renovation status, distance from the center, district, etc.) and the prediction will be more adequate.
At the moment, my price depends linearly on all 23 parameters.
And I am wondering which would be more logical to use: linear or some non-linear regression.
You are trying to judge the model's adequacy from the plot, but for 3 parameters the plot would have to be 3-dimensional, not 2-dimensional, and for 23 parameters nothing can be said from a plot at all.
Look at your error (cost): if it decreases, your model is fitting the data more closely.
But by only reducing the error, you risk getting an inadequate model that merely copies your input data.
On another real data set, the error may turn out larger than with a rough estimate.
If the topic interests you, look at the lectures; there is enough theory on the topic (although the practical part is very weak, which is why I did not like the course).
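The warning about the error on another data set can be checked with a hold-out split. The following is only a minimal sketch in plain Python: the data is synthetic (a made-up area vs. price relation with heavy noise), and the helper names fit_line and mse are hypothetical, not taken from the question.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mse(xs, ys, slope, intercept):
    """Mean squared error of the fitted line on the given points."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Synthetic stand-in for the real data (assumption): price grows
# linearly with area, plus large Gaussian noise.
rng = random.Random(0)
areas = [rng.uniform(20, 120) for _ in range(1000)]
prices = [50 + 2.0 * a + rng.gauss(0, 40) for a in areas]

# The points were generated in random order, so a plain slice is a
# random split here; real data should be shuffled first.
split = 800
train_x, test_x = areas[:split], areas[split:]
train_y, test_y = prices[:split], prices[split:]

slope, intercept = fit_line(train_x, train_y)
train_mse = mse(train_x, train_y, slope, intercept)
test_mse = mse(test_x, test_y, slope, intercept)

print(f"slope={slope:.2f} intercept={intercept:.1f}")
print(f"train MSE={train_mse:.0f} test MSE={test_mse:.0f}")
```

If the test error is much larger than the training error, the model is copying the training data rather than capturing the dependence. If both are similar but large, the remaining error is mostly noise that no choice of curve will remove.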
Cross-posted on Stats.SE, Stack Overflow, SO.RU, and DataScience.SE: stats.stackexchange.com/q/188291/2921, stackoverflow.com/q/34474767/781723, ru.stackoverflow.com/q/486133, datascience.stackexchange.com/q/9529/8560.

Accepted Answer · 2016-01-26T09:32:16

Information criteria are commonly used to compare statistical models, for example the Akaike information criterion (AIC). If you work in R, look at the stepAIC function (from the MASS package): it lets you simplify a linear model by dropping predictors one by one, starting with those least important to the model.
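To illustrate the idea of dropping predictors by AIC, here is a rough sketch in plain Python rather than R. It is not stepAIC itself, only a simplified backward elimination under stated assumptions: the data is synthetic (two informative predictors and two pure-noise ones), the helper names are hypothetical, and AIC is computed for a Gaussian linear model up to an additive constant as n * ln(RSS / n) + 2k, with k the number of fitted coefficients including the intercept.

```python
import math
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def rss(X, y, cols):
    """Residual sum of squares of an OLS fit (with intercept) on the chosen columns."""
    rows = [[1.0] + [row[c] for c in cols] for row in X]
    k = len(cols) + 1
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(XtX, Xty)
    return sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, yi in zip(rows, y))

def aic(X, y, cols):
    """AIC of a Gaussian linear model, up to an additive constant."""
    n = len(y)
    k = len(cols) + 1  # slopes plus intercept
    return n * math.log(rss(X, y, cols) / n) + 2 * k

def backward_aic(X, y):
    """Repeatedly drop the predictor whose removal lowers AIC the most."""
    cols = list(range(len(X[0])))
    best = aic(X, y, cols)
    while cols:
        cand_aic, drop = min((aic(X, y, [c for c in cols if c != d]), d)
                             for d in cols)
        if cand_aic >= best:
            break  # no single removal improves the criterion
        best = cand_aic
        cols.remove(drop)
    return cols, best

# Synthetic data (assumption): columns 0 and 1 matter, 2 and 3 are noise.
rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(200)]
y = [3.0 * r[0] - 2.0 * r[1] + rng.gauss(0, 1) for r in X]

full_aic = aic(X, y, [0, 1, 2, 3])
kept, final_aic = backward_aic(X, y)
print("kept predictors:", kept)
print(f"AIC full={full_aic:.1f} reduced={final_aic:.1f}")
```

The informative columns survive because removing either one raises the residual error far more than the penalty of 2 per coefficient saves. A noise column is dropped only when its contribution to the fit is smaller than that penalty, so AIC can occasionally keep one.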
