Good night gentlemen. There was a need to solve the problem of uniformity of sampling data from the database. As a more understandable example: You need to select 120 records with a weight of 10 to 12 kg, and the material should be, for example, plastic or metal. And it seems that there should just be something like:

select obj.* from objects obj where obj.weight >= 10 and obj.weight <= 12 and (obj.material = 'пластик' or obj.material = 'металл' ) limit 120; 

But the problem is that such a request is likely to give me 120 records with a weight of 10 kg and material - metal, which is not suitable in my situation.

We need exactly a uniform sample: 20 plastic 10 kg each, 20 plastic 11 kg each, 20 plastic 12 kg each, 20 metal 10 kg each, and further by analogy. Total in the amount of exactly 120 required.

Were thought of a cycle to go through all the conditions, build queries for each of the above parts and get a number of resulting performance problems and sample adjustment in case of a lack of records in one of the groups of conditions.

Maybe I reinvent the wheel and there is a ready-made solution to my problem, or is there a more optimal solution direction? I work with PostgreSQL.

    3 answers 3

    Slightly modified request Small :

     SELECT * FROM( SELECT ROW_NUMER()OVER(ORDER BY order_in_group)AS total_order, T.* FROM ( SELECT obj.*, ROW_NUMBER() OVER (PARTITION BY material, weight) AS order_in_group, FROM objects obj WHERE obj.weight >= 10 and obj.weight <= 12 and (obj.material = 'пластик' or obj.material = 'металл' ) )T )T WHERE total_order <= 120; 

    For each unique set of values ​​for material, weight we numbered the lines ( order_in_group ). then we number with sorting by the received number ( total_order ) - and take the first 120 records. Thus, we do not know how many values ​​of materials and weight, we will issue equally from each group of values ​​(or all, if there are not enough values ​​in some group).

    Those. solve the problem: lack of records for one of the groups of conditions.

    If you want to see different data each time as a result of the request. You can slightly change the sort order_in_group : ROW_NUMBER() OVER (PARTITION BY material, weight ORDER BY random())

    ps: the second numbering is not required, this is me for clarity + I do not like to use a non-standard limit . But you can write a little easier:

     SELECT T.* FROM ( SELECT obj.*, ROW_NUMBER() OVER (PARTITION BY material, weight) AS order_in_group, FROM objects obj WHERE obj.weight >= 10 and obj.weight <= 12 and (obj.material = 'пластик' or obj.material = 'металл' ) )T ORDER BY order_in_group LIMIT 120; 
    • I ran across a boundless expanse of such a decision, but apparently something went wrong with me then. At this time, a little corrected request and it turned out exactly what you need. Thanks for the help! PS For different data there is more complicated logic, and there is no random access to it. It is necessary to issue data no more than once in a certain period of time, but this is another story. - Eugene
     SELECT /**/ FROM ( SELECT ROW_NUMBER() OVER (PARTITION BY material, weight ORDER BY /*по какому критерию отсортировать записи в группе и выбрать 10 из всех подходящих*/) AS lim, /**/ from objects obj where obj.weight >= 10 and obj.weight <= 12 and (obj.material = 'пластик' or obj.material = 'металл' ) ) x WHERE x.lim <= 10; 

    10 records of each material and weight block in the table will be returned. The general limit is optional.

    • everything seems fine, but 10 lines from each group? This does not solve the problem: adjusting the sample in case of a shortage in the number of records for one of the groups of conditions. - pegoopik

    Such a crutch will suit you?

     select obj.* from objects obj where obj.weight >= 10 and obj.weight <= 12 and (obj.material = 'пластик' or obj.material = 'металл' ) order by random() limit 120; 

    The idea is simple - turn all table entries into mincemeat by randomization.

    • Anyway, there is no guarantee that this stuffing will be evenly distributed - Eugene