We plan to use R to fit regression models on 12 million rows drawn from a database of approximately 150 million records. We have no serious experience with the system yet, but the first results looked very promising. We run RStudio Server on CentOS. For the regressions we use lmList from the nlme package. So far we have tested on 300 thousand records (control sample) and ~10 million records (training sample); the process takes about 30 minutes and looks like this:

  • loading the data from an Oracle database server

  • fitting the regression models (about 4-6 of them)

  • saving the results back to the Oracle database server

If the number of records is much larger, the calculation may take about 20 hours (assuming the time scales linearly). The calculation will run every night.
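For context, a minimal sketch of the kind of per-group fit described above, using lmList from nlme; the data frame, column names, and grouping factor here are placeholders, not our actual schema:

```r
library(nlme)

## Placeholder data frame standing in for the rows pulled from Oracle:
## a response, a predictor, and a factor defining the separate regressions.
df <- data.frame(
  y     = rnorm(1000),
  x     = rnorm(1000),
  group = factor(sample(letters[1:10], 1000, replace = TRUE))
)

## Fit a separate linear model y ~ x within each level of `group`.
fits <- lmList(y ~ x | group, data = df)

## Per-group coefficients, ready to be written back to the database.
coef(fits)
```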

Accordingly, the questions:

  • what solutions exist in R to speed up the calculations? (The hardware is already powerful; we are interested in settings that allow maximum parallelization.)

  • what technical solutions exist in R to ensure reliability and fault tolerance? That is, if the system crashes on one of the servers partway through a 20-hour calculation, it should automatically fail over to another server and continue from the point of failure (so that already completed calculations are not lost).

1 answer

The question is very general, so the answer will be too.

For high-performance computing there is a CRAN task view: https://cran.r-project.org/web/views/HighPerformanceComputing.html. Pay attention to biglm and speedglm. Parallel computing and GPU computing are covered there as well.
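For illustration, a minimal sketch of both (the formula and column names are made up): biglm keeps only the cross-products rather than the data, and speedlm is a faster drop-in replacement for lm:

```r
library(biglm)
library(speedglm)

## Placeholder data frame with the model variables.
df <- data.frame(y = rnorm(1e5), x1 = rnorm(1e5), x2 = rnorm(1e5))

## biglm stores only the cross-product matrix, so the fitted
## object stays small regardless of the number of rows.
fit_big <- biglm(y ~ x1 + x2, data = df)
summary(fit_big)

## speedlm: same interface as lm, faster fitting.
fit_fast <- speedlm(y ~ x1 + x2, data = df)
coef(fit_fast)
```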

But first it is important to understand where the time goes in the read-from-database → build-model → write-to-database sequence. Perhaps the reading and writing take most of the time? Try, as an experiment, dumping the data to a text file and reading/writing it with fread/fwrite from the data.table package. Also, save only what you need from each linear model: the base lm function stores both the input data and the residuals in the model object.
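A sketch of that kind of measurement (the file names and the formula are placeholders): time each stage with system.time, use fread/fwrite for the file I/O, and keep only the coefficients instead of the full lm object:

```r
library(data.table)

## Time each stage separately to find the bottleneck.
t_read  <- system.time(dt <- fread("export.csv"))   # placeholder file name
t_model <- system.time(fit <- lm(y ~ x, data = dt))
t_write <- system.time(
  fwrite(data.table(term = names(coef(fit)), estimate = coef(fit)),
         "coefs.csv")
)
print(rbind(read = t_read, model = t_model, write = t_write))

## A full lm object drags along the data, residuals, and fitted values;
## keep only what you actually need.
slim <- list(coef = coef(fit), sigma = summary(fit)$sigma)
print(object.size(fit))
print(object.size(slim))
```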

If you need to build several models, you can simply do this on different machines or in several copies of R running on the same multi-core server.
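For example, with the base parallel package (a sketch; the formulas and the df data frame are placeholders), each of the 4-6 models can be fitted by its own worker process:

```r
library(parallel)

## Placeholder: one formula per model to be built.
formulas <- list(y ~ x1, y ~ x2, y ~ x1 + x2, y ~ x1 * x2)

## `df` is assumed to hold the loaded data.
## mclapply forks one worker per model (Unix-like systems, e.g. CentOS).
fits <- mclapply(formulas,
                 function(f) lm(f, data = df),
                 mc.cores = min(length(formulas), detectCores()))
```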

The issue of fault tolerance, it seems to me, needs to be solved outside of R: deploy a system like Hadoop or Spark and, if needed, use R as an interface to them. Although for the volumes mentioned, that is rather like shooting sparrows with a cannon.
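A lighter-weight alternative inside R is simple checkpointing: persist each finished piece of work to shared storage and skip it on restart. A sketch, where `groups` and `fit_group()` are hypothetical stand-ins for your units of work:

```r
## Each finished model is saved to disk, so a restarted run
## (on the same or another server with shared storage) resumes
## from where the previous one stopped.
dir.create("checkpoints", showWarnings = FALSE)

for (g in groups) {                        # `groups`: hypothetical work units
  out <- file.path("checkpoints", paste0(g, ".rds"))
  if (file.exists(out)) next               # already done in a previous run
  fit <- fit_group(g)                      # hypothetical per-group fit
  tmp <- paste0(out, ".tmp")
  saveRDS(fit, tmp)
  file.rename(tmp, out)  # rename last: a crash never leaves a half-written checkpoint
}
```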

P.S. If the repeated calculations are done on the old data plus a portion of new data, you can use the method from https://stats.stackexchange.com/questions/11872/updating-linear-regression-efficiently-when-adding-observations-and-or-predictor
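In R, biglm supports this directly: a fitted biglm object can be updated with an additional chunk of rows. A sketch, assuming old_chunk and new_chunk are data frames containing the model columns:

```r
library(biglm)

## Initial fit on the data already processed.
fit <- biglm(y ~ x1 + x2, data = old_chunk)

## Nightly: fold in only the new portion of records, without
## refitting on the full history.
fit <- update(fit, new_chunk)
summary(fit)
```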