There is a table of users and a reference table of user properties (tags). They are related many-to-many, i.e. there is a third, intermediate table with the fields user_id and tag_id. How can I organize a search for similar* users?

* Similar users are those whose sets of tags match. Ideally, I would like to pass a search parameter that sets the maximum allowed difference in the composition / number of tags.

By and large, anything goes; it does not even have to be Postgres. But the solution must be usable from NodeJS.

  • Describe in more detail what a "discrepancy in the composition of tags" is. The number of matching / differing tags is more or less clear, but "composition"... Also, do you want to find users similar to one specific user, or all similar pairs? - Mike
  • The conditions can be clarified, but the more conditions, the less freedom. And the problem is not that trivial - MiF

1 answer

In general, a query that computes the similarity of users looks something like this:

select *
from (
    select a.user_id a_user_id, b.user_id b_user_id,
           a.tcnt a_cnt, b.tcnt b_cnt, count(1) same_cnt
    from (select user_id, tag_id, count(1) over(partition by user_id) as tcnt
          from user_tags
          where user_id=NNN) A,
         (select user_id, tag_id, count(1) over(partition by user_id) as tcnt
          from user_tags) B
    where A.tag_id=B.tag_id and A.user_id!=B.user_id
    group by A.user_id, B.user_id, A.tcnt, B.tcnt
) T
where a_cnt=b_cnt and same_cnt=a_cnt

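This example returns only users whose tag sets are completely identical. To find similar rather than identical users, change the where clause at the very end, keeping in mind that a_cnt is the number of tags of user A, b_cnt is the number of tags of user B, and same_cnt is the number of tags they share. For example, the number of tags present for one user but not the other is a_cnt + b_cnt - 2*same_cnt, so a condition such as a_cnt + b_cnt - 2*same_cnt <= N keeps users that differ in at most N tags. If you are looking for users similar to one particular user, it is best to add where user_id=NNN inside subquery A.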
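The effect of a looser final condition can also be sketched outside the database. Below is a minimal Node.js sketch, not the answer's own code: it takes rows shaped like the user_tags table and computes the same three counters in memory. The names findSimilarUsers and maxDiff are hypothetical, and measuring the difference as a_cnt + b_cnt - 2*same_cnt (tags that one user has and the other does not) is only one assumed reading of "maximum difference".

```javascript
// Hypothetical helper: an in-memory analogue of the SQL query.
// rows: array of { user_id, tag_id }, shaped like the user_tags table.
function findSimilarUsers(rows, userId, maxDiff) {
  // Build a map user_id -> Set of tag_id.
  const tagsByUser = new Map();
  for (const { user_id, tag_id } of rows) {
    if (!tagsByUser.has(user_id)) tagsByUser.set(user_id, new Set());
    tagsByUser.get(user_id).add(tag_id);
  }

  const aTags = tagsByUser.get(userId) || new Set();
  const result = [];
  for (const [otherId, bTags] of tagsByUser) {
    if (otherId === userId) continue;

    // same_cnt: number of tags the two users share.
    let sameCnt = 0;
    for (const tag of aTags) if (bTags.has(tag)) sameCnt++;

    // Like the SQL join on tag_id, skip users with no tag in common.
    if (sameCnt === 0) continue;

    // Tags that only one of the two users has.
    const diff = aTags.size + bTags.size - 2 * sameCnt;
    if (diff <= maxDiff) {
      result.push({
        user_id: otherId,
        a_cnt: aTags.size,
        b_cnt: bTags.size,
        same_cnt: sameCnt,
        diff,
      });
    }
  }
  // Most similar users first.
  return result.sort((x, y) => x.diff - y.diff);
}

// Made-up sample data for illustration only.
const rows = [
  { user_id: 1, tag_id: 10 }, { user_id: 1, tag_id: 20 }, { user_id: 1, tag_id: 30 },
  { user_id: 2, tag_id: 10 }, { user_id: 2, tag_id: 20 }, { user_id: 2, tag_id: 30 },
  { user_id: 3, tag_id: 10 }, { user_id: 3, tag_id: 20 },
  { user_id: 4, tag_id: 10 }, { user_id: 4, tag_id: 40 }, { user_id: 4, tag_id: 50 },
  { user_id: 5, tag_id: 60 },
];

// With maxDiff = 0 only user 2 (identical tag set) is returned.
console.log(findSimilarUsers(rows, 1, 0));
// With maxDiff = 1 user 3 (one missing tag) is returned as well.
console.log(findSimilarUsers(rows, 1, 1));
```

With maxDiff set to 0 this matches the a_cnt=b_cnt and same_cnt=a_cnt condition of the query. For 100k users the same filter would more likely live in the SQL where clause than in application memory.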

If you really do want to compare all users with all users, the query can be simplified to speed it up:

with Q as (
    select user_id, tag_id, count(1) over(partition by user_id) as tcnt
    from user_tags
)
select a.user_id a_user_id, b.user_id b_user_id,
       a.tcnt a_cnt, b.tcnt b_cnt, count(1) same_cnt
from Q a, Q b
where a.tag_id=b.tag_id and b.user_id!=a.user_id
group by a.user_id, b.user_id, a.tcnt, b.tcnt
having a.tcnt=b.tcnt and a.tcnt=count(1)

  • No, I need to find all users similar to one specific user - MiF
  • Is it just me, or is this a rather heavy query? Won't it hang the database with 100k users and at least 200 properties? - MiF
  • @MiF I added a user filter to the first version of the query. Of course the query is heavy. The join on the tag should work well with an index. But I think the main performance problem is counting the number of tags for all users. That can be solved by storing the already computed tag count for each user, for example in the users table, and keeping that field up to date automatically with triggers - Mike
  • @MiF A further optimization I see is to store the computed similarity itself and recalculate it only for new users and for users whose tags change - Mike
  • @MiF If we were looking for exact equality, we could compute some checksum of each user's tags and then look up matches by it. - Mike
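The checksum idea for exact equality can be sketched in Node.js as follows. This is only an illustration under stated assumptions: the helper names tagChecksum and groupByChecksum are hypothetical, SHA-256 over the sorted, comma-joined tag ids is just one possible choice of checksum, and tag ids are assumed to be integers. Storing and indexing the checksum next to the user is not shown.

```javascript
const crypto = require("crypto");

// Hypothetical helper: order-independent checksum of a user's tag ids.
function tagChecksum(tagIds) {
  // Deduplicate and sort numerically so the same set always yields the same string.
  const canonical = [...new Set(tagIds)].sort((a, b) => a - b).join(",");
  return crypto.createHash("sha256").update(canonical).digest("hex");
}

// Hypothetical helper: group user ids by the checksum of their tag sets.
// tagsByUser: Map of user_id -> array of tag_id.
function groupByChecksum(tagsByUser) {
  const groups = new Map();
  for (const [userId, tagIds] of tagsByUser) {
    const key = tagChecksum(tagIds);
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(userId);
  }
  return groups;
}

// Made-up sample data for illustration only.
const tagsByUser = new Map([
  [1, [30, 10, 20]],
  [2, [10, 20, 30]],
  [3, [10, 20]],
]);

const groups = groupByChecksum(tagsByUser);
// Users 1 and 2 have the same tag set, so they end up in the same group.
console.log(groups.get(tagChecksum([10, 20, 30])));
```

Sorting before hashing is what makes the checksum independent of the order in which tags were attached. This only helps with exact matches; a checksum says nothing about how many tags two users share.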