How to work correctly with a large number of records with SQLAlchemy?
I have two tables. The first holds 5 million records of the form: question_id, view_count, counted. The second table contains the sum of view_count for each unique question_id. Once a record from the first table has been added to the total in the second, its counted flag is set to true.
Now it looks like this:
    def update_most_viewed():
        query = QuestionViewHistory.query.filter_by(counted=False).distinct()
        question_count = query.count()
        frame_size = 1000
        counter = 0
        while counter <= question_count:
            all_questions = query.offset(counter*frame_size).limit(frame_size).all()
            counter = counter + frame_size
            for question in all_questions:
                most_viewed_question = MostViewedQuestion.query.filter_by(question_id=question.question_id).first()
                if most_viewed_question is None:
                    most_viewed_question = MostViewedQuestion(question.question_id, question.view_count)
                    db.session.add(most_viewed_question)
                else:
                    most_viewed_question.view_count += question.view_count
                question.counted = True
            db.session.commit()

I call the function from the console. Initialization:

    app = Flask(__name__)
    db = SQLAlchemy(app)

The problem is that with each pass, the time increases exponentially: after the fifth pass, everything hangs. If I run the program again, exactly the same thing happens.
As far as I understand, the problem is that with every commit call, SQLAlchemy updates all attributes of all objects in the session, but unfortunately I did not find a way to fix it.
Update
The model classes that appear in the query:

    class MostViewedQuestion(db.Model):
        __tablename__ = 'most_viewed_question'

        id = db.Column(db.Integer, primary_key=True)
        question_id = db.Column(db.Integer)
        view_count = db.Column(db.Integer)
        is_associated = db.Column(db.Boolean)
        can_be_associated = db.Column(db.Boolean)
        title = db.Column(db.String(500))
        body = db.Column(db.String(30000))
        tags = db.Column(db.String(500))
        last_update_date = db.Column(db.DateTime)

        def __init__(self, question_id, view_count, is_associated=False):
            self.question_id = question_id
            self.view_count = view_count
            self.is_associated = is_associated
            self.can_be_associated = True
            self.last_update_date = datetime.datetime.now()

        def __repr__(self):
            return '<MostViewedQuestion %s>' % str(self.id)


    class QuestionViewHistory(db.Model):
        __tablename__ = 'question_view_history'

        id = db.Column(db.Integer, primary_key=True)
        question_id = db.Column(db.Integer)
        view_count = db.Column(db.Integer)
        view_date = db.Column(db.DateTime)
        counted = db.Column(db.Boolean)

        def __init__(self, question_id, view_count, view_date):
            self.question_id = question_id
            self.view_count = view_count
            self.view_date = view_date
            self.counted = False

        def __repr__(self):
            return '<QuestionViewHistory %s>' % str(self.id)

The code for the entire project is available on GitHub. All models are in the models.py file, and the update_most_viewed function is in the database.py file. The cvs_data_ru folder contains data for tests.
Comments:

[...] bulk_... methods. For example, bulk_save_objects. With this method, saving will look something like this: db.session.bulk_save_objects([most_viewed_q1, most_viewed_q2, most_viewed_q3 ... most_viewed_qn]) + commit. - m9_psy

[...] commit). Everything runs very quickly. - Nicolas Chabanovsky ♦

[...] id, question_id, view_count. In the second, I record the changes for both tables and send them to the database. That is, the ORM effectively has nothing to track when the data is sent. I am still running tests. If all is well, I will post the answer. - Nicolas Chabanovsky ♦
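The two-step idea from the last comment could look roughly like the sketch below. It is not the author's actual code: the function name plan_view_count_changes, the shape of its inputs, and the assumption that the first step reads only plain (id, question_id, view_count) tuples are all hypothetical. The sketch covers only the in-memory step that turns uncounted history rows into lists of plain dictionaries, so that no ORM objects need to be tracked in the session.

```python
from collections import defaultdict


def plan_view_count_changes(history_rows, existing):
    """Build plain-dict change sets for both tables (hypothetical helper).

    history_rows: iterable of (id, question_id, view_count) tuples taken
        from uncounted question_view_history rows.
    existing: dict mapping question_id -> (most_viewed_question.id,
        current view_count) for rows already in most_viewed_question.
    """
    totals = defaultdict(int)   # question_id -> views to add
    counted_ids = []            # history row ids to mark as counted

    for row_id, question_id, view_count in history_rows:
        totals[question_id] += view_count
        counted_ids.append(row_id)

    inserts, updates = [], []
    for question_id, added in totals.items():
        if question_id in existing:
            most_viewed_id, current = existing[question_id]
            # Updates are keyed by primary key, so "id" must be present.
            updates.append({"id": most_viewed_id, "view_count": current + added})
        else:
            # Assumption: other columns (is_associated, last_update_date, ...)
            # would have to be filled in here too, because bulk inserts
            # do not call the model's __init__.
            inserts.append({"question_id": question_id, "view_count": added})

    history_updates = [{"id": row_id, "counted": True} for row_id in counted_ids]
    return inserts, updates, history_updates
```

Assuming the db session from the question, the resulting lists could then be passed to db.session.bulk_insert_mappings(MostViewedQuestion, inserts), db.session.bulk_update_mappings(MostViewedQuestion, updates) and db.session.bulk_update_mappings(QuestionViewHistory, history_updates), followed by a single db.session.commit(). Whether this matches what was eventually posted as the answer is not known from the thread.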