We plan to use R to fit regression models on 12 million rows out of a database of approximately 150 million records. We have no serious experience with the system yet, but the first results looked very promising. We installed the server version of RStudio on CentOS. To fit the regressions we use lmList from the nlme package. So far we have tested on 300 thousand records (control sample) and ~10 million records (training sample); the process takes about 30 minutes and looks like this:
1. load the data from the Oracle database server
2. fit the regression models (several, roughly 4-6, models)
3. save the results back to the Oracle database server
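A minimal sketch of the middle step, assuming the Oracle load/save is handled separately (e.g. via ROracle/DBI): the simulated data frame and its column names below are purely illustrative stand-ins for the real records.

```r
library(nlme)

# Hypothetical stand-in for data already pulled from Oracle;
# the column names (group, x, y) are illustrative, not the real schema.
set.seed(1)
dat <- data.frame(
  group = rep(c("A", "B", "C"), each = 100),
  x     = rnorm(300)
)
dat$y <- 2 * dat$x + rnorm(300) + as.numeric(factor(dat$group))

# One linear model per group, as lmList does in the nightly job
fits <- lmList(y ~ x | group, data = dat)

# Flatten the per-group coefficients into a data frame that could be
# written back to the Oracle server
results <- data.frame(group = rownames(coef(fits)), coef(fits))
print(results)
```

The shape of `results` (one row per group, one column per coefficient) maps naturally onto a database table for the save step.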
If the number of records is much larger, the calculation may take about 20 hours (the scaling is linear). The calculation will run every night.
Accordingly, our questions:
1. What solutions exist in R to speed up the calculations? The hardware is already powerful; we are interested in settings that allow the calculations to be parallelized as much as possible.
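Since lmList fits an independent model per group, one common approach is to split the data by group and fit the groups on separate cores with the base parallel package. This is a sketch under that assumption; the data and core count are illustrative.

```r
library(parallel)

# Illustrative stand-in data: 8 groups of 50 rows each
set.seed(1)
dat <- data.frame(
  group = rep(sprintf("g%02d", 1:8), each = 50),
  x     = rnorm(400)
)
dat$y <- 3 * dat$x + rnorm(400)

# Split by group and fit each group's model in a separate worker.
# In production, mc.cores would be set to the number of available cores.
# (mclapply forks, so it works on Linux such as CentOS, not on Windows.)
chunks <- split(dat, dat$group)
fits <- mclapply(chunks, function(d) lm(y ~ x, data = d), mc.cores = 2)

# Collect one slope per group from the workers
slopes <- sapply(fits, function(f) coef(f)[["x"]])
```

Because each group's fit is independent, this parallelizes with no coordination between workers; the results can then be combined exactly as in the serial version.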
2. What technical solutions exist in R to make the system reliable and resilient, i.e. if the system crashes on one of the servers in the middle of a 20-hour calculation, it automatically fails over to another server and continues from the point of failure (so that already completed calculations are not lost)?
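One simple pattern for not losing completed work is chunk-level checkpointing: persist each chunk's result to disk as soon as it is computed, and on restart skip chunks whose checkpoint file already exists. For cross-server failover the checkpoint directory would live on shared storage (e.g. NFS) visible to both servers. This is a minimal sketch; the paths, data, and fit function are illustrative.

```r
# Checkpoint directory; in production this would be on shared storage
ckpt_dir <- file.path(tempdir(), "reg_checkpoints")
dir.create(ckpt_dir, showWarnings = FALSE)

# Illustrative per-chunk work: fit one regression and return coefficients
fit_chunk <- function(d) coef(lm(y ~ x, data = d))

set.seed(1)
dat <- data.frame(group = rep(c("a", "b", "c"), each = 30), x = rnorm(90))
dat$y <- dat$x + rnorm(90)

for (g in unique(dat$group)) {
  f <- file.path(ckpt_dir, paste0(g, ".rds"))
  if (file.exists(f)) next            # chunk finished before a crash: skip
  res <- fit_chunk(dat[dat$group == g, ])
  saveRDS(res, f)                     # persist immediately
}

# After a restart the same loop re-runs, but only missing chunks are
# recomputed; final results are re-assembled from the checkpoint files.
all_res <- lapply(list.files(ckpt_dir, full.names = TRUE), readRDS)
```

With this pattern a crashed 20-hour run loses at most one chunk of work, and the second server only needs access to the same checkpoint directory to resume.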