If for example there are 200 countries and 5000 cities, and that in the table of the form:

id | country | town | message 

do not use the text of the country and city (repeated), and instead use the numbers (id) of countries and cities from another table, for example:

 id | country //table country id | town //table town 

Will this reduce the size of the table if it is estimated that there will be 500 million records, is this correct and are there alternatives?

    3 answers 3

    I would say that it is just necessary to do so. And it's not even the size of the base, although it also matters. Imagine that you are mistaken in the name of the city. If this name is repeated in the database more than 1 time, you will need to change it in all records in which it is found, therefore in relational databases always try to adhere to the principle that all data is stored in a single copy, and in all other places just a reference by ID to the required object. Look in Google for the request "Normal form", there will be a link to the wiki, but a dry theory that is difficult to understand will be like a wiki, I recommend to follow the other links given by Google.

    As for your case, I approximately see the structure of the base as such (I don’t know what a message is, but suppose it really depends on the country and the city)

    I write everything in the oracle notation, the types of data are selected according to the database with which the country table is working:

     create table Countrys ( c_id number not null primaty key, -- ID страны name varchar(200) -- название страны ... -- возможно еще какие то данные о стране ); 

    City table:

     create table Towns ( t_id number not null primaty key, -- ID города c_id number not null -- ID страны в которой расположен город, надо же знать -- это Москва - столица РФ или одноименная деревня -- в США name varchar(200) -- Название города -- Возможно еще какие то данные о городе ); 

    Well, your mysterious messages:

     create table Messages( msg_id number not null primary key, t_id number not null, -- Ссылка только на город, если будет нужно узнать страну - мы ее найдем по городу message varchar(200) ); 

    And finally, a little SQL. With the insertion in general, everything is simple, insert always goes only to one table, separately fill out the directory of countries, then the cities and already your messages

     insert into Countrys(c_id, name) values(1, 'Россия'); insert into Towns(t_id, c_id, name) select 100, c_id, 'Москва' from Countrys as C where C.name='Россия'; -- Но работа по имени очень не приветствуется, когда вы будете так вставлять -- записи в messages как вы по имени зададите Москва, если в мире не 1 десяток -- городов с таким названием 

    Again, it depends on the base where the id is taken from. in MySQL, you can simply say when declaring a column that an id machine would be assigned. Look for it yourself, I do not remember by heart. And in absolutely other mechanisms.

    Well, a sample of one message so that both the city and the country would get

     select C.name as country_name, T.name as town_name, M.message from Messages as M, Towns as T, Countys as C where M.msg_id=1 and T.t_id=M.t_id and C.c_id=T.t_id 

    True, the query notation with the JOIN operator is usually used, but it is cumbersome, I do not like it.

    And read about foreign key, these are links showing the database linking the field id of the country in the table of cities with the primary key id in the table of countries. They need to create. At first it seems that they complicate your life by not allowing you to insert an entry into the table of cities with a non-existing country ID. But it is precisely by this that they save you from mistakes in the code, often leading to dire consequences. Imagine you insert 10k messages with the city of New Vasyuki, and then accidentally delete the city from the directory. You still have message messages, but you don’t know what city they belong to and what to do with them further (not to mention the fact that the one I’ve quoted above just doesn’t show these lines). So, the foreign key will not allow you to delete a city that has at least one link, and if you really decided to delete it, you will not forget about those messages that refer to it and make a decision, for example, change their city to Old-Vasyuki

    • And you can add or change how to perform the insert in the Towns if I know that the city is "Russia", how to insert in the Countrys, then it is clear, the question of how to insert in the Towns or in the Messages! You are welcome! - bsbak
    • I will write now. but often this is solved at the level of the program that works with the database. For example, the user selects a country in the drop-down list and you already know her id in the program immediately. You are in the user's browser (or where it is planned) for sure, besides the name, you transmitted hidden fields with id, because the names can never be trusted :) - Mike
    • @ user3737786 Yes, one more FOR work only by ID. Imagine that you are trying to put a message in a message using the name of a city, but at that moment another user noticed that the name was not capitalized and decided to correct it. And you can not insert a record, the city will not find it. And id is not going anywhere - Mike

    Up to the moment “there will be approximately 500 million records” I would say that it is necessary to split the data into several tables, but after this clarification everything is not so obvious. Up to the point that you should already think about the nature of data storage and, perhaps, abandon MySQL.

    About normalization already written in other answers, so I will write briefly. Its meaning is that you keep countries, cities and messages separate from each other. These are different entities and, in an amicable way, you don’t need to write the same thing to the database many times (however, there are exceptions to the rule and about them below). Data organization is described in detail in Mike’s response .

    BUT! this is true if the amount of data is within the "visible" limits. When we talk about a very large number of data, we must take into account the nature of working with data : is this data written more or read more? If they are read, then evenly or is there such data that is needed more than others (for example, records for the last month are jerked from the database more often than others)?

    With large amounts of data it makes sense to think about partitions . And if the data is very often jerked by a package ( country + city + message ), then I assume that storing them in one table would be a more optimal solution than permanently merging the three tables, especially with such volumes. This is the denormalization of the database .

    Another option: store data that is needed more often, separately from data that is rarely accessed.

    As an option, you can think about storing data from small tables on the client side (for example, in the browser's local storage) and merge the data already when rendering pages.

    Options for making the sea. Do not look for the right thing, look for the best for your particular situation.

    • Thanks, good to know! - bsbak

    Yes, it will reduce. That's the way to do it. This is called normalization.