I have a list of products, each with a number of characteristics. Some characteristics are common to all products, for example weight and size. Others are unique to certain products, for example color (in addition to the first two). Across different products there can be up to hundreds of distinct unique characteristics, and the task is to organize storage so that the value of any particular characteristic can be used in a WHERE clause.

I plan to put each possible characteristic in its own column, which raises the question: how much does database performance degrade as the number of columns grows, if there are up to a hundred columns? Is there perhaps a better way to store and process data in this situation?

  • There is no more optimal approach. There is a less optimal but more versatile one. With your approach the characteristics will be hard-wired into the code, and adding a new characteristic will mean modifying the code. - Mike
  • Something similar was discussed here, though there were more types of goods, each with its own list of characteristics: ru.stackoverflow.com/questions/466357/… - Mike
  • Thanks for the reply and the link, Mike. I was afraid that a hundred columns would cause performance problems. If this is the better solution in terms of performance, I will stick with it. Adding every characteristic that may appear in a WHERE clause to the table structure is not a problem for me. - 118_64
  • I completely forgot that you will need to build indexes for the search... In general, all my considerations are in the answer. - Mike

1 answer

There are two ways to store a large number of parameters. The first is the one you suggested: one column per parameter. The second is a separate table in which each record of the main table has many rows, each holding the id of the main record, the id of the parameter type, and the value of that parameter.
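To make the two layouts concrete, here is a minimal sketch using Python's built-in sqlite3 module. The question does not say which DBMS is in use, so SQLite is only an assumption for illustration, and all table and column names (product_wide, product, param_type, product_param) are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Layout 1: one column per characteristic (hypothetical names).
    CREATE TABLE product_wide (
        id     INTEGER PRIMARY KEY,
        name   TEXT NOT NULL,
        weight REAL,
        size   REAL,
        color  TEXT
    );

    -- Layout 2: one row per (product, parameter type) pair.
    CREATE TABLE product (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE param_type (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE
    );
    CREATE TABLE product_param (
        product_id    INTEGER NOT NULL REFERENCES product(id),
        param_type_id INTEGER NOT NULL REFERENCES param_type(id),
        value         TEXT NOT NULL,
        PRIMARY KEY (product_id, param_type_id)
    );
""")

# Store the same product in both layouts.
conn.execute("INSERT INTO product_wide VALUES (1, 'mug', 0.3, 10.0, 'red')")
conn.execute("INSERT INTO product VALUES (1, 'mug')")
conn.executemany(
    "INSERT INTO param_type VALUES (?, ?)",
    [(1, "weight"), (2, "size"), (3, "color")],
)
conn.executemany(
    "INSERT INTO product_param VALUES (?, ?, ?)",
    [(1, 1, "0.3"), (1, 2, "10.0"), (1, 3, "red")],
)

# Layout 1 returns all parameters as a single row.
wide_row = conn.execute(
    "SELECT weight, size, color FROM product_wide WHERE id = 1"
).fetchone()

# Layout 2 returns one row per parameter.
param_rows = conn.execute("""
    SELECT t.name, p.value
      FROM product_param AS p
      JOIN param_type AS t ON t.id = p.param_type_id
     WHERE p.product_id = 1
     ORDER BY t.id
""").fetchall()
```

Note that in the second layout every value lands in one column, so here they are all stored as text. Whether that is acceptable depends on how you need to compare the values.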

Both approaches have their pros and cons. When each parameter is stored as a separate row and there is a single column for the parameter value, you can build an index on that column and find the ids of the matching records by parameter value almost instantly. On the other hand, when you just need to show all the parameters of a particular product, things slow down because each parameter has to be fetched as a separate row. With a hundred parameters that becomes quite expensive.
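As a rough illustration of the search side of the row-per-parameter layout, the sketch below builds one composite index and asks SQLite for its query plan. SQLite, the index name idx_param_value, and the parameter type id 3 for "color" are all assumptions made for the example. Other DBMSs have their own EXPLAIN output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product_param (
        product_id    INTEGER NOT NULL,
        param_type_id INTEGER NOT NULL,
        value         TEXT NOT NULL,
        PRIMARY KEY (product_id, param_type_id)
    );
    -- A single index serves searches on every parameter type.
    CREATE INDEX idx_param_value ON product_param (param_type_id, value);
""")

COLOR = 3  # hypothetical id of the "color" parameter type
rows = [(pid, COLOR, "red" if pid % 10 == 0 else "blue") for pid in range(1, 101)]
conn.executemany("INSERT INTO product_param VALUES (?, ?, ?)", rows)

search_sql = (
    "SELECT product_id FROM product_param "
    "WHERE param_type_id = ? AND value = ?"
)

# Ids of all products whose color is red.
red_ids = sorted(r[0] for r in conn.execute(search_sql, (COLOR, "red")))

# The last column of EXPLAIN QUERY PLAN is the human-readable description.
plan = [r[-1] for r in conn.execute("EXPLAIN QUERY PLAN " + search_sql, (COLOR, "red"))]
```

Printing plan should show that the lookup goes through idx_param_value rather than scanning the table. The cost mentioned above shows up on the other side: displaying one product means reading as many rows as it has parameters.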

With your approach, fetching a product by id is nearly instantaneous: all parameters sit in one record, so a single disk operation gives us everything we need to know about the product. That is ideal. BUT indexing 100 columns is impractical, because each additional index takes up a lot of disk space and slows down inserts: every index has to be partially rebuilt whenever a record is inserted. And since the columns with parameter values are left without indexes, any search on them is forced to read the entire table from disk. In addition, with this approach adding a new parameter means adding a new column, listing it explicitly in queries, and possibly mentioning it in several places in the code.
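The difference between a lookup by id and a search on an unindexed column can be seen in the query plan. The sketch below again assumes SQLite purely for illustration. The table product_wide, the index idx_wide_color, and the helper plan() are hypothetical names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE product_wide (
        id     INTEGER PRIMARY KEY,
        weight REAL,
        size   REAL,
        color  TEXT
    )
""")
conn.executemany(
    "INSERT INTO product_wide VALUES (?, ?, ?, ?)",
    [(i, 1.0, 2.0, "red" if i % 10 == 0 else "blue") for i in range(1, 101)],
)

def plan(sql):
    """Return the description column of EXPLAIN QUERY PLAN for a query."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

# Lookup by primary key: a direct search, no scan needed.
by_id = plan("SELECT * FROM product_wide WHERE id = 42")

# Search on a column without an index: the whole table is scanned.
by_color_before = plan("SELECT * FROM product_wide WHERE color = 'red'")

# Indexing the column fixes this query, at the cost of one more index
# to store and to maintain on every insert.
conn.execute("CREATE INDEX idx_wide_color ON product_wide (color)")
by_color_after = plan("SELECT * FROM product_wide WHERE color = 'red'")
```

With a hundred parameter columns you would need an index like idx_wide_color for each column you want to search on, which is exactly the overhead described above.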

There is a third approach: store attribute values both in columns of the main table and as rows in a separate table. The separate table is used to search by parameter value, and when you display a particular product the data is taken directly from the main record. Yes, this approach introduces data redundancy: whenever a parameter changes, it must be changed in two places. In principle, this can be done reliably with triggers. Changes will take a little longer than when working with a single table.
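A minimal sketch of keeping the two copies in sync with triggers might look like this. It assumes SQLite (trigger syntax differs between DBMSs) and covers only a single attribute, color. The trigger names, the table names, and the helper find_by_color() are invented for the example, and deleting products is not handled.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Main table: read from here when displaying a product.
    CREATE TABLE product (
        id    INTEGER PRIMARY KEY,
        name  TEXT NOT NULL,
        color TEXT
    );
    -- Search table: query this one when filtering by parameter value.
    CREATE TABLE product_param (
        product_id INTEGER NOT NULL REFERENCES product(id),
        param_name TEXT NOT NULL,
        value      TEXT NOT NULL,
        PRIMARY KEY (product_id, param_name)
    );
    CREATE INDEX idx_param_value ON product_param (param_name, value);

    -- Copy the value into the search table when a product is inserted.
    CREATE TRIGGER product_color_ai AFTER INSERT ON product
    WHEN NEW.color IS NOT NULL
    BEGIN
        INSERT INTO product_param VALUES (NEW.id, 'color', NEW.color);
    END;

    -- Replace the copy when the value changes in the main table.
    CREATE TRIGGER product_color_au AFTER UPDATE OF color ON product
    BEGIN
        DELETE FROM product_param
         WHERE product_id = NEW.id AND param_name = 'color';
        INSERT INTO product_param
        SELECT NEW.id, 'color', NEW.color WHERE NEW.color IS NOT NULL;
    END;
""")

def find_by_color(color):
    """Search through the indexed search table, not the main table."""
    sql = (
        "SELECT product_id FROM product_param "
        "WHERE param_name = 'color' AND value = ?"
    )
    return sorted(r[0] for r in conn.execute(sql, (color,)))

conn.execute("INSERT INTO product VALUES (1, 'mug', 'red')")
red_after_insert = find_by_color("red")

conn.execute("UPDATE product SET color = 'blue' WHERE id = 1")
red_after_update = find_by_color("red")
blue_after_update = find_by_color("blue")
```

The application only ever writes to product, and the triggers do the second write. That is where the slightly longer change time comes from.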

In general, the truth is somewhere in between, and you need to find a balance. For example, you can keep some attributes in both places and keep some minor parameters only in the search table, not in the main one.