There is a hierarchical data structure that is periodically updated based on uploads from an external source. Unloading can be complete or on one specific node.

Each node has a boolean attribute 'hidden', which comes in unloading. Also, the site has some content that may be empty. Also, each node has a Boolean attribute 'empty', which needs to be recalculated at the time of synchronization.

The empty attribute must be set to true if the node meets all the conditions:

  • Node has empty content.
  • Node has no non-empty and non-hidden descendants.

In the unloading, the descendants are guaranteed to follow their parents, so earlier the definition of empty was carried out as follows:

The program runs sequentially through the upload, each node adds to the database or updates, if it is already there. Looks in unloading, whether the content is empty. It makes a selection in its database of the immediate descendants of this node, which are not empty or hidden. Based on this, sets the value to empty.

After that, the program consistently goes up the chain of parents of this node and does the same procedure for them. This is necessary because the information about the descendants will be in the not yet worked out part of the upload, and if their list has been updated or their values ​​are empty and hidden, then the empty value of the parent may be invalid and it should be recalculated.

I know that the algorithm does not look optimal, but I did not write it.

Now the task has come: to make the node so that it can be like an alias (link) to another node. This is necessary to implement a parallel hierarchy.

Accordingly, in the calculation of empty it is necessary to take into account that the node must be non-empty if it refers to a non-empty node. The problem is that the alias may be present in the unloading both earlier and later than the node to which it refers.

How is it best to implement such a structure taking into account the requirements described?

UPD: PosgreSQL is used as a database via sqlalchemy. Accordingly, all synchronization work is performed by a Python script.

The node table has fields: id, parent_id, link_id, hidden, empty, and several other fields that are not involved in this task.

parent_id and link_id are id from the same table.

There is also a table with content, from which for each node a number of elements are taken.

  • 2
    At the time of recalculating the status of a node, find all references to it and also update them. From the unload ignore links if they are already in the database in the same form as they came or create, if they are not there yet. And it can first load all statuses, and then make a single recount of all parents in the hierarchy. for this, it is possible to add the changed sign to the nodes so as not to recalculate the entire tree. - Mike
  • one
    And if you indicated which particular database and what its structure (including links) you could come up with a specific solution for it. Although my idea remains such that nothing needs to be calculated at the time of synchronization of a particular element. First, we load everything, setting changed if the non-inherited empty value for the node has changed, and then we run through the tree from bottom to top and update the statuses, starting with the entries from the changed, in parallel (or upon completion, if we keep within one request) resetting this status - Mike
  • @Mike added structure information. - Xander
  • " and some more fields " - I think it is possible to consider for clarity the presence of one content field on the basis of which to calculate empty. By the way, when loading can it turns out that there is no node and at the same time it will be removed and all links to it? And if so, how do you track this fact? - Mike
  • one
    And yes, how do you like it better, recount part of branches on the basis of some changed or the whole tree at once? If usually only a small honor changes, then changed of course is convenient. but it must be stored somewhere. And it could be done in the same table and changed with the same update that changes the content or in a separate table to keep a list of changed records id, then besides the main update you will need to insert another in this table, but after work you can clear it and use the sign only when updating will not hang in the main table - Mike

1 answer 1

Add to the table with the tree the changed field of type int . On the column, we build partial index: create index tree_change on tree(changed) where changed is not null . When loading data from the outside, we set changed=1 for all records directly in which the content or hidden has changed in such a way that the empty field should change. The hierarchy of records at this stage is not considered at all. All new entries also set changed=1 .

Upon completion of loading external data, execute the following query

 with recursive Q(id, parent_id, link_id, changed) as( select id, parent_id, link_id, changed from tree where changed=1 union all select t.id, t.parent_id, t.link_id, Q.changed+1 from Q, tree t where t.id=Q.parent_id or t.link_id=Q.id ) UPDATE tree AS t SET changed = n.changed FROM ( select id, max(changed) changed from Q where changed>1 group by id ) n WHERE t.id = n.id; 

It exposes all parent entries in the hierarchy (as well as entries for which link_id indicates changing) changed by 1 more than the children. In fact, this is the number of transitions in the tree from the changed node to the given parent. If the same node has a different distance from several changed child elements, the maximum of them is taken.

Thus, changed will be set for all nodes that depend (on the hierarchy) on the changed.

After that, in the loop, while update has changed at least one record, we execute the following query:

 update tree t set empty=(case when content is not null then 0 else (select coalesce(min(t1.empty::int),1) from tree t1 where t1.parent_id=t.id or t1.id=t.link_id) end)::boolean, changed=NULL where changed=$N 

Where $N sequence number of the iteration of the cycle from 1 to the maximum changed received after the first request (there is no sense in directly receiving the maximum, the fact of changing the records indicates that there was still such a number).

Each request in the cycle calculates empty at the node of the tree based on the state of the node itself (first of all) and all its immediate children (if the node itself is not filled with content). It is not necessary to go deeper in the hierarchy, since the parents will be updated only after all the children have been counted (because the children have the lesser changed ).

I did not take into account the values ​​of the hidden field, add to your liking in the logic of calculating the status in the second query.