
"Deleting" objects in Django



Sooner or later, developers face the task of deleting obsolete data. And the more complex the service, the more nuances have to be taken into account. In this article I will describe how we implemented "deletion" in a database with a hundred foreign-key relations.

Background


To monitor the performance of most Mail.ru Group and VKontakte projects, a proprietary service called Monitoring is used. Since its start at the end of 2012, the project has grown over six years into a huge system that has acquired a lot of functionality. Monitoring regularly checks the availability of servers and the correctness of their responses to requests, and collects statistics on memory usage, processor load, and so on. When the parameters of a monitored server go beyond the allowed values, those responsible for the server receive notifications in the system and via SMS.

All checks and incidents are logged so that the dynamics of server characteristics can be tracked, and as a result the database has grown to hundreds of millions of records. Periodically new servers appear and old ones fall out of use. Information about unused servers must be removed from the Monitoring system in order to: a) not clutter the interface with unnecessary information, and b) free up the unique identifiers.

Deletion


It is not for nothing that the word "deletion" appears in quotes in the title of this article. An object can be removed from the system in several ways: physically, by deleting the record together with everything that depends on it, or logically, by marking the record as deleted and hiding it from the interface.


Iteration # 1


Initially we used the first approach: we simply executed object.delete() and the object was deleted along with all its dependencies. Over time we had to abandon this, because one object could be linked to millions of others, and the cascading deletion locked the tables for a long time. Since the service performs thousands of checks every second and logs them, locking the tables led to a serious slowdown of the service, which was unacceptable for us.

Iteration # 2


To avoid long locks, we decided to delete the data in chunks. This allows fresh monitoring data to be recorded in the intervals between deletions. The list of all objects that will be deleted in cascade can be obtained with the same method the admin panel uses when you confirm a deletion:

```python
from django.contrib.admin.utils import NestedObjects  # admin.util in Django < 1.7
from django.db import DEFAULT_DB_ALIAS

collector = NestedObjects(using=DEFAULT_DB_ALIAS)
collector.collect([obj])
objects_to_delete = collector.nested()
# recursively delete the collected objects in chunks
```
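The chunked deletion itself is not shown above, so here is a minimal pure-Python sketch of the idea (no Django; all names here are illustrative): matching rows are removed a small batch at a time, so any lock is held only briefly and new monitoring data can be written in between batches.

```python
def delete_in_chunks(table, predicate, chunk_size=1000):
    """Delete rows matching predicate a chunk at a time.

    Each loop iteration stands in for a short transaction,
    so the table is never locked for long.
    """
    deleted = 0
    while True:
        chunk = [row for row in table if predicate(row)][:chunk_size]
        if not chunk:
            return deleted
        for row in chunk:
            table.remove(row)
        deleted += len(chunk)

# rows with odd ids belong to the host being deleted
table = [{'id': i, 'host_id': 1 if i % 2 else 2} for i in range(10)]
removed = delete_in_chunks(table, lambda r: r['host_id'] == 1, chunk_size=2)
# removed == 5, and only rows of host 2 remain
```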

The situation improved: the load was spread over time and new data was recorded faster. But we immediately ran into the next pitfall: the list of objects to delete is built at the very start of the deletion, so if new dependent objects are added during the "portioned" deletion, the parent element cannot be deleted.

We immediately rejected two workarounds: re-collecting the dependency data whenever the recursive deletion failed, and forbidding the creation of dependent records while a deletion is in progress, because a) the first could loop forever, and b) the second would require finding every place in the code where dependent objects are created.

Iteration # 3


We then turned to the second kind of deletion, where the data is marked and hidden from the interface. At first we had rejected this approach, because finding all the queries and adding a filter on the absence of a deleted parent looked like at least a week of work. Moreover, there was a high chance of missing some code, which would lead to unpredictable consequences.

Then we decided to override the query manager with decorators. The code says it better than a hundred words:

```python
import copy

# DECORATOR_DEL_HOST_ATTRIBUTE is a string constant defined elsewhere in the project

def exclude_objects_for_deleted_hosts(*fields):
    """
    Decorator that adds .exclude({field__}is_deleted=True)
    to model_class.objects.get_queryset.

    :param fields: fields for the exclude condition
    """
    def wrapper(model_class):
        def apply_filters(qs):
            for field in filter_fields:
                qs = qs.exclude(**{
                    '{}is_deleted'.format('{}__'.format(field) if field else ''): True,
                })
            return qs

        model_class.all_objects = copy.deepcopy(model_class.objects)
        filter_fields = set(fields)

        get_queryset = model_class.objects.get_queryset
        model_class.objects.get_queryset = lambda: apply_filters(get_queryset())

        # remember on the model which fields the decorator was applied with
        setattr(model_class, DECORATOR_DEL_HOST_ATTRIBUTE, filter_fields)

        return model_class
    return wrapper
```

The exclude_objects_for_deleted_hosts(*fields) decorator automatically adds an exclude filter on the listed fields to every query, removing exactly those records that should not be shown in the interface.
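To make the mechanics clearer, here is a tiny Django-free model of the same trick (FakeManager and the row dicts are made up for illustration): the manager's get_queryset is captured and replaced with a wrapper that filters out soft-deleted rows, just as the decorator above patches model_class.objects.get_queryset.

```python
class FakeManager:
    """Stand-in for a Django manager over an in-memory 'table'."""
    def __init__(self, rows):
        self.rows = rows

    def get_queryset(self):
        return list(self.rows)

def exclude_deleted(manager):
    # capture the original method, then patch it with a filtering wrapper
    original = manager.get_queryset
    manager.get_queryset = lambda: [
        row for row in original() if not row.get('is_deleted')
    ]
    return manager

manager = exclude_deleted(FakeManager([
    {'id': 1, 'hostname': 'cloud.spb.s'},
    {'id': 2, 'hostname': 'cloud.msk.s', 'is_deleted': True},
]))
visible = manager.get_queryset()  # only the non-deleted row
```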

Now it is enough to add the decorator to every model that is affected by the deletion in any way:

```python
@exclude_objects_for_deleted_hosts('host')
class Alias(models.Model):
    host = models.ForeignKey(to=Host, verbose_name='Host', related_name='alias')
```

Now, to "remove" a Host object, it is enough to change its is_deleted attribute:

```python
host.is_deleted = True
host.save()  # after this, the host and all related objects become inaccessible
```

All queries will automatically exclude records that reference deleted objects:

```python
# the model is decorated with
# @exclude_objects_for_deleted_hosts('checker__monhost', 'alias__host')
CheckerToAlias.objects.filter(
    alias__hostname__in=['cloud.spb.s', 'cloud.msk.s'],
).values('id')
```

This produces the following SQL query:

```sql
SELECT `monitoring_checkertoalias`.`id`
FROM `monitoring_checkertoalias`
INNER JOIN `monitoring_checker`
    ON (`monitoring_checkertoalias`.`checker_id` = `monitoring_checker`.`id`)
INNER JOIN `Hosts`
    ON (`monitoring_checker`.`monhost_id` = `Hosts`.`id`)
INNER JOIN `dcmap_alias`
    ON (`monitoring_checkertoalias`.`alias_id` = `dcmap_alias`.`id`)
INNER JOIN `Hosts` T5
    ON (`dcmap_alias`.`host_id` = `T5`.`id`)
WHERE (
    NOT (`Hosts`.`is_deleted` = TRUE)   -- check one, for monitoring_checker
    AND NOT (`T5`.`is_deleted` = TRUE)  -- check two, for dcmap_alias
    AND `dcmap_alias`.`name` IN ('dir1.server.p', 'dir2.server.p')
);
```

As you can see, additional JOINs and `is_deleted` = TRUE checks were added to the query for the fields specified in the decorator.

A little about the numbers


Naturally, the additional JOINs and conditions increase query execution time. Our investigation showed that the degree of "complication" depends on the database structure, the number of records, and the presence of indexes.

Specifically, in our case each level of dependency costs the query about 30%. This is the maximum penalty, which we pay on the largest table with millions of records; on smaller tables it shrinks to a few percent. Fortunately, the necessary indexes were in place, and most critical queries already contained the required JOINs, so we did not notice a big difference in performance.
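If we assume the roughly 30% per-level penalty compounds multiplicatively (the per-level figure is from our measurements above; the compounding model is an assumption for illustration), the expected slowdown can be estimated like this:

```python
def estimated_slowdown(levels, penalty=0.30):
    """Multiplicative estimate: each dependency level multiplies
    query time by (1 + penalty)."""
    return (1 + penalty) ** levels

one = estimated_slowdown(1)  # 1.3x on a single dependency level
two = estimated_slowdown(2)  # ~1.69x when joining through two levels,
                             # as in the SQL query above
```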

Unique Identifiers


Before deleting data, you may need to free up identifiers that will be used again in the future; otherwise creating a new object may fail with a uniqueness error. Even though deleted objects are no longer visible to the Django application, they are still in the database. So for deleted objects we append a uuid to the identifier:

```python
host.hostname = '{}_{}'.format(host.hostname, uuid.uuid4())
host.is_deleted = True
host.save()
```

Maintenance


For every new model or dependency, the decorator has to be updated if the model needs one. To simplify the search for dependent models, we wrote a "smart" test:

```python
def test_deleted_host_decorator_for_models(self):
    def recursive_host_finder(model, cache, path, filters):
        # cache for skipping already-visited models
        cache.add(model)
        # process all related models
        for field in (f for f in model._meta.fields
                      if isinstance(f, ForeignKey)):
            if field.related_model == Host:
                filters.add(path + field.name)
            elif field.related_model not in cache:
                recursive_host_finder(field.related_model, cache.copy(),
                                      path + field.name + '__', filters)

    # check all models
    for current_model in apps.get_models():
        model_filters = getattr(current_model,
                                DECORATOR_DEL_HOST_ATTRIBUTE, set())
        found_filters = set()
        if current_model == Host:
            found_filters.add('')
        else:
            recursive_host_finder(current_model, set(), '', found_filters)
        if found_filters or model_filters:
            try:
                self.assertSetEqual(model_filters, found_filters)
            except AssertionError as err:
                err.args = (
                    '{}\n !!! Fix decorator "exclude_objects_for_deleted_hosts" '
                    'for model {}'.format(err.args[0], current_model),
                )
                raise err
```

The test recursively checks every model for dependencies on the model being deleted, then verifies that the decorator has been applied to that model with the required fields. If something is missing, the test politely tells you where to add the decorator.

Epilogue


So, with the help of a decorator, we managed to implement the "deletion" of data with a large number of dependencies at little cost. Every query automatically receives the required exclude filter. The extra conditions slow down data retrieval; the degree of "complication" depends on the database structure, the number of records, and the presence of indexes. The test above tells you which models need decorators and will keep them consistent in the future.

Source: https://habr.com/ru/post/438280/