Question on scikit-learn: what is the maximum amount of training data this library can handle? If I understand correctly, the data to be processed is first loaded entirely into RAM. Can this library handle around 10 GB of data (at once or in parts), assuming the machine has considerably more RAM than that?
1 answer
It is difficult to give a definitive answer to such a general question.
Practically all the methods and functions in sklearn that I know of work only with data held in memory. Moreover, many methods create additional copies of the data (or parts of it) in memory.
But some methods (for example, CountVectorizer, TfidfVectorizer) return compressed sparse matrices, in which the zero values (usually 90+% of all entries) take up almost no space, and this saves a lot of memory.
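To illustrate the point about sparse output, here is a minimal sketch (with made-up toy documents) showing that CountVectorizer returns a scipy.sparse matrix rather than a dense array, so only the non-zero counts are actually stored:

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus; a real corpus would have a huge, mostly-zero term matrix.
docs = [
    "the cat sat on the mat",
    "the dog ate my homework",
]

vec = CountVectorizer()
X = vec.fit_transform(docs)

# X is a compressed sparse matrix: only non-zero counts occupy memory.
print(issparse(X))   # True
print(X.shape)       # (2 documents, vocabulary size)
print(X.nnz)         # number of stored non-zero entries
```

Calling `X.toarray()` would materialize the dense version, which is exactly the step that can blow up memory on large corpora, so it is best avoided.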
In general, whether you run out of memory on the same data depends on what exactly you do with it and how you do it.
- It is a pity that I do not yet have systematic knowledge of this. From reading the scikit-learn articles, I understood that some classifiers support partial (incremental) learning via partial_fit. The library's website has an article devoted to processing data volumes that do not fit in RAM, along with an example of using partial_fit. I am still working through it. - GlassedMichail
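The out-of-core approach the comment mentions can be sketched as follows: an estimator that supports partial_fit (here SGDClassifier) is fed the data chunk by chunk, so no chunk larger than a few hundred rows ever needs to be in memory at once. The chunk generator below is a made-up stand-in for reading a large file in pieces (e.g. with pandas.read_csv(..., chunksize=...)):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

def iter_chunks(n_chunks=5, chunk_size=200, n_features=20, seed=0):
    """Hypothetical stand-in for streaming chunks from a 10 GB file."""
    rng = np.random.RandomState(seed)
    for _ in range(n_chunks):
        X = rng.randn(chunk_size, n_features)
        # Synthetic linearly separable target, just for the demo.
        y = (X[:, 0] + X[:, 1] > 0).astype(int)
        yield X, y

clf = SGDClassifier(random_state=0)
classes = np.array([0, 1])  # partial_fit needs the full class list up front

for X_chunk, y_chunk in iter_chunks():
    clf.partial_fit(X_chunk, y_chunk, classes=classes)
```

Note that `classes` must be passed on the first call, because the model cannot know in advance which labels later chunks will contain.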