There are many databases with the same structure, and I need to merge them into one. The problem is that during the merge, IDs (which are unique only within a single database) collide in identical tables. And they cannot simply be regenerated on insert, because other tables reference them... and there are many such references; they have already worn me out. The merging itself is not the problem (a plain UNION would do), but the IDs make it very difficult. Maybe a simple solution for this has already been invented and I just don't know it?

  • You can write a script, in Python say, that generates new IDs and produces the INSERTs with the new IDs correctly substituted - andreymal
  • @andreymal As I said, the problem is the references. Generating new IDs is not enough; the references have to be updated too, and that is not so easy. - PECHAIR
  • Well, the same script can regenerate the references too :) (see the sketch after these comments) - andreymal
  • I think if there is nothing unusual in the schema (like cyclic data dependencies) and there are few tables (≤10, say), you can load the data into temporary tables/databases and transfer it with one or two plain SQL queries... put together an MCVE with data and I can think about the specifics... - Fat-Zer
  • @Fat-Zer ≤10? Seriously? :D No, and there are tens of thousands of rows in each table. I didn't understand the rest... - PECHAIR
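
For illustration, here is a minimal SQL sketch of the "regenerate the IDs and fix the references" idea from the comments above (MySQL syntax; main_db, add_db and the users/orders tables are assumptions, not the actual schema):

 -- Build a mapping from old IDs to collision-free new ones.
 CREATE TABLE id_map (old_id INT PRIMARY KEY, new_id INT NOT NULL);

 SET @off := (SELECT COALESCE(MAX(id), 0) FROM main_db.users);
 INSERT INTO id_map (old_id, new_id)
 SELECT id, id + @off FROM add_db.users;

 -- Copy the rows with remapped primary keys.
 INSERT INTO main_db.users (id, email, created_at)
 SELECT m.new_id, u.email, u.created_at
 FROM add_db.users u JOIN id_map m ON m.old_id = u.id;

 -- Copy referencing rows, remapping the foreign key through the same table.
 SET @off2 := (SELECT COALESCE(MAX(id), 0) FROM main_db.orders);
 INSERT INTO main_db.orders (id, user_id, total)
 SELECT o.id + @off2, m.new_id, o.total
 FROM add_db.orders o JOIN id_map m ON m.old_id = o.user_id;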

3 answers

We had a similar case. Granted, we used Oracle.

So, in order to preserve uniqueness across the source databases, we added an extra field, roughly speaking a database identifier (SOURCEID), and made the primary key compound (ID plus SOURCEID).

P.S.: You can also use an "offset" instead, to avoid adding a separate field.

For example, with ID = 1 and SOURCEID = 1, the future ID becomes 100001; and so on.
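
A hedged sketch of both variants (MySQL syntax; the table and column names are made up, and real foreign keys would have to be dropped and re-created around these steps):

 -- Variant 1: compound primary key (ID, SOURCEID).
 ALTER TABLE users ADD COLUMN source_id INT NOT NULL DEFAULT 1;
 ALTER TABLE users DROP PRIMARY KEY, ADD PRIMARY KEY (id, source_id);

 -- Variant 2 ("offset"): encode the source in the ID itself, assuming
 -- no source database ever holds more than 100000 rows per table:
 -- ID = 1, SOURCEID = 1 -> 100001; ID = 1, SOURCEID = 2 -> 200001.
 UPDATE users SET id = source_id * 100000 + id;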

  • Yes, I already tried that too. But in the end I have to get rid of this field, and that is not so easy... I.e. it can be used during the transfer, but why should it remain in the project? - PECHAIR
  • Will the merge be one-time? You can first merge everything into one database, and then recompute the IDs taking SOURCEID into account. For example, ID = 1, SOURCEID = 1; future ID = 100001. And then actually drop the SOURCEID field. - Chubatiy
  • Well, maybe not exactly one-time... The project together with its database was simply copied over there once a year. And now, until the current project ends (it doesn't run the whole year, and it's already coming to an end), the latest database cannot be copied. Then it will all be poured over in one go. - PECHAIR
  • In addition, another problem is that registered users can be duplicated (by email), and only the unique ones must be kept (choosing by creation date); the rest are not needed (see the dedup sketch after these comments). - PECHAIR
  • Hmm, in short, there is no simple solution. And I already knew all of this. Okay, I will keep suffering. - PECHAIR
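
Since the duplicate-users issue came up in these comments, here is a rough sketch of keeping, per email, only the earliest-created row (MySQL; users(id, email, created_at) is an assumed schema, and any references to the deleted rows would have to be repointed to the surviving one first):

 -- Delete every user for whom an earlier duplicate with the same email
 -- exists (ties on created_at are broken by the smaller id).
 DELETE u
 FROM users u
 JOIN users keep
   ON keep.email = u.email
  AND (keep.created_at < u.created_at
       OR (keep.created_at = u.created_at AND keep.id < u.id));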

As a possible solution:

We analyze the schema, write down all the references on paper, identify the "bushes" (cases where the same value is shared by more than two fields) and group them. For each bush we pick a delta value such that adding it to the IDs removes any chance of duplication, that is, delta > MAX(main_db.table.field) - MIN(add_db.table.field) for each table, after which the largest delta is taken for the whole bush of references.
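
Concretely, the per-table delta can be computed like this, and for a bush spanning several tables the per-table deltas are combined with GREATEST (MySQL; tableX/tableY are placeholders):

 -- delta must be strictly greater than MAX(main) - MIN(add), so +1 is enough.
 SET @dx := (SELECT MAX(id) FROM main_db.tableX) - (SELECT MIN(id) FROM add_db.tableX) + 1;
 SET @dy := (SELECT MAX(id) FROM main_db.tableY) - (SELECT MIN(id) FROM add_db.tableY) + 1;
 -- One shift for the whole bush of linked fields:
 SET @delta1 := GREATEST(@dx, @dy);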

Then the data from the additional database's tables is dumped to plain text with the IDs corrected (both the unique ones and the referencing ones) by a query of the form:

 SELECT id + @delta1, field1, field2, ..., ref_field + @delta2, ... INTO OUTFILE 'backup_tableX_filename' export_parameters FROM add_db.tableX; 

And the load back is performed without any correction:

 LOAD DATA INFILE 'backup_tableX_filename' import_parameters INTO TABLE main_db.tableX; 
  • a good idea, except that the OP writes in the comments to the neighbouring answer that there are conflicts on other UNIQUE constraints, which, as I understand it, either have to be ignored or processed as an upsert... any ideas what to do with those? - Fat-Zer
  • "there are conflicts on other UNIQUE constraints" - as I understand it, the problem there is not UNIQUE as such but duplication. In that case merging two duplicates is a separate operation, and not a very complicated one. But if there really is a UNIQUE constraint, then it's a real pain... and the path becomes much longer (about three times as long). We dump the data in the described way into a second table, then from the first and second build a third, correspondence, table with extra fields (Id1-Id2-min_Id) and no uniqueness restrictions, after which we reassign the references and finally return either the constraints or the data to the main table. - Akina
  • and if you think about it, you can simply drop the UNIQUE constraints, update everything, remove the duplicates, and then add them back (see the sketch after these comments)... - Fat-Zer
  • Well, if the database is not a live production one but a working session copy, then yes, that works too. - Akina
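
A rough sketch of that drop/restore route (MySQL; the index name uq_users_email and the columns are assumptions):

 -- 1. Temporarily drop the UNIQUE index so the merged data can land.
 ALTER TABLE main_db.users DROP INDEX uq_users_email;

 -- 2. Load and remap the data as described above, then remove the
 --    duplicates (e.g. with a self-join DELETE keeping one row per email).

 -- 3. Restore the constraint.
 ALTER TABLE main_db.users ADD UNIQUE INDEX uq_users_email (email);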

A simple solution just won't exist here. When your system was designed, apparently nobody imagined this scenario of multiple copies that need merging, otherwise some kind of safety net would have been laid down in advance, for example GUIDs as keys or ID segmentation.

Gradually migrate all the databases from int to GUID keys - merging will become easier. This option pays off especially if the merge is not a one-time event.
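
A very rough sketch of one step of such a migration (MySQL; users/orders are illustrative, and every referencing column has to be rewritten the same way, so this only shows the shape of the idea):

 -- Give every row a GUID next to its old integer key.
 ALTER TABLE users ADD COLUMN guid CHAR(36) NOT NULL DEFAULT '';
 UPDATE users SET guid = UUID();

 -- Propagate it to a referencing table through the old integer key...
 ALTER TABLE orders ADD COLUMN user_guid CHAR(36);
 UPDATE orders o JOIN users u ON u.id = o.user_id SET o.user_guid = u.guid;
 -- ...after which the integer columns can be dropped and guid made the key.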

Also, judging by the comments, it seems to me you may soon need a record-merging tool. I consider such things a must-have in information systems in general. For example, in my system combining duplicate clients is a matter of a few clicks (most of the time goes into specifying which client will be the "primary" one, the one the data flows into), together with their contacts (also with duplicates!), orders, and so on.

In this regard you may find the articles on Habr about dadata.ru interesting - the guys write sensible technical material on deduplication of business data. For example, a search immediately turned up: "DaData.ru finds and destroys the same people".

You see, here is the thing: stop framing your task as "it must be done with SQL queries only" - it is not a given that that will even work.

  • What are the benefits of a GUID over a regular numeric ID? I don't understand that at all. Besides, working with GUIDs will probably be slower (int vs string => a number always beats a string on speed). And I already know the design is garbage - the structure here is terrible. I'm normalizing it as I go: some tables and columns are redundant or empty, names are in transliteration, there are broken references (pointing at nonexistent records) and similar "delights" at every step. A hack job :D - PECHAIR