I have a file with 300,000 lines. I need to process each line and add a value to the database. I use the standard PHP functions:

 $read = fopen('file.csv', 'r');
 while (($str = fgetcsv($read, 8000, ',')) !== FALSE):
     // ... parse the file
 endwhile;

The file itself is processed very quickly, about 0.6 - 0.9 seconds.

But as soon as I add a check on a line or a value, the processing time grows noticeably. That is understandable: running a check on each of 300k lines takes time.

But I also need to write all of these lines to the database, so I combine several queries into one and send them in a single request:

 $sql_add = "UPDATE `table` SET `prof` = '1,4,6,8' WHERE `id` IN (2,3,4,1,1,2233,3321,1,3,2... /* on average up to 1000 more IDs are appended here */)";
 $query = Db::query($sql_add);
 $query->closeCursor(); // do not wait for a response from the query

Thanks to these combined queries the total number of queries goes down. Even so, 100k records still produce about 1,250 queries, and my script hangs for a long time, right up to the timeout.

How are such large files usually parsed, and how can the data be added to the database quickly, if most of the time is spent on sending queries and checking the data?

  • Generate an SQL file with the necessary queries on the server, then run the import of the generated file through mysql. - ArchDemon
  • Two questions: you have duplicates in the IN list (the value 1 appears three times); is that intentional? And could the query be rewritten so that INSERT INTO ... ON DUPLICATE KEY UPDATE can be used? - Alexus
  • The duplicates are only in this test example; in general there are none. - user190134
  • @user190134 So what about INSERT INTO ... ON DUPLICATE KEY UPDATE? - Alexus

4 answers

The problem is well known, and server timeout settings are not always available for changing. To handle long-running jobs, many plugins (backup plugins, for example) break their work into chunks that re-launch themselves to continue, or build a queue of chunks that are then executed one by one.

I did this on WordPress to import 14 thousand items. I do not know what your site runs on, but the general principle is the same: launch the chunks from a cron-driven queue.
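For illustration, a minimal sketch of that chunking approach in plain PHP, assuming one CSV record per physical line; the offset file, chunk size and the process_line() helper are all made-up names for the example, not part of the original question:

 // Each run handles at most $chunkSize lines, remembers where it stopped,
 // and a cron job re-launches the script until the whole file is done.
 $offsetFile = __DIR__ . '/import.offset';
 $chunkSize  = 10000;

 $offset = file_exists($offsetFile) ? (int) file_get_contents($offsetFile) : 0;

 $read = fopen('file.csv', 'r');
 for ($i = 0; $i < $offset && fgets($read) !== false; $i++); // skip lines done in earlier runs

 $processed = 0;
 while ($processed < $chunkSize && ($row = fgetcsv($read, 8000, ',')) !== false) {
     process_line($row); // hypothetical helper: validate the row and write it to the DB
     $processed++;
 }
 fclose($read);

 if ($processed < $chunkSize) {
     unlink($offsetFile); // finished: stop re-scheduling
 } else {
     file_put_contents($offsetFile, $offset + $processed); // continue on the next run
 }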

    I recommend importing the file directly into the database via LOAD DATA INFILE:

    Create a table matching the file:

     CREATE TABLE `CSVImport` (id INT);
     ALTER TABLE CSVImport ADD COLUMN Title VARCHAR(256);
     ALTER TABLE CSVImport ADD COLUMN Company VARCHAR(256);
     ALTER TABLE CSVImport ADD COLUMN NumTickets VARCHAR(256);

    Load the file:

     LOAD DATA INFILE '/home/paul/clientdata.csv'
     INTO TABLE CSVImport
     FIELDS TERMINATED BY ',';

    Then insert into the table you actually need via INSERT INTO ... SELECT ... FROM CSVImport.
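    A sketch of that last step, assuming a hypothetical your_table whose columns line up with CSVImport (adjust the mapping to your schema), run through the Db::query wrapper used in the question; per the comments above, ON DUPLICATE KEY UPDATE can be appended if existing rows must be updated rather than inserted:

     // Hypothetical target table and column mapping; illustrative only.
     $sql = "INSERT INTO your_table (title, company, num_tickets)
             SELECT Title, Company, NumTickets FROM CSVImport";
     Db::query($sql);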

      I can suggest using prepared statements.

       $connection->autocommit(false); // Disable automatic commits; this effectively starts the transaction
       if (!($stmt = $connection->prepare("INSERT INTO orders (id_orders, id_handlings, id_gens, id_gens_groups, id_genotypes, id_staff, consider_cost) VALUES (DEFAULT, '$id_handlings', ?, ?, DEFAULT, DEFAULT, DEFAULT)"))) {
           die($connection->error);
       } // Prepare the statement
       if (!$stmt->bind_param('ii', $id_gens, $id_gens_groups)) {
           die($connection->error);
       } // Bind the variables - both integers => 'ii'
       $id_gens = 15; // get the value of one of the variables
       $id_gens_groups = 25;
       if (!$stmt->execute()) {
           die($connection->error);
       } // Execute the prepared statement
       if (!$connection->commit()) {
           die($connection->error);
       } // Commit the transaction

      This way the database does not have to parse each query from scratch every time: it compiles the statement once and then just substitutes the values. This should help.
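      To get that benefit across the whole file, the statement is usually prepared once and executed inside the CSV loop, all within one transaction. A minimal mysqli sketch; the connection details, the simplified INSERT and the column positions $row[0]/$row[1] are assumptions for the example:

       // Prepare once, execute per row, commit once at the end.
       $connection = new mysqli('localhost', 'user', 'password', 'db'); // placeholder credentials
       $connection->autocommit(false);

       $stmt = $connection->prepare("INSERT INTO orders (id_gens, id_gens_groups) VALUES (?, ?)");
       $stmt->bind_param('ii', $id_gens, $id_gens_groups); // bound by reference

       $read = fopen('file.csv', 'r');
       while (($row = fgetcsv($read, 8000, ',')) !== false) {
           $id_gens        = (int) $row[0]; // assumed column positions
           $id_gens_groups = (int) $row[1];
           if (!$stmt->execute()) {
               die($connection->error);
           }
       }
       fclose($read);

       $connection->commit(); // one commit for the whole import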

      In addition, the maximum script runtime can be increased, or the limit removed entirely:

       set_time_limit(0); // Set the maximum script execution time in seconds; zero means unlimited.

      One more option: dump the database and bring it up on a local machine. Do all the manipulations locally, make a copy, and then replace the database on the server.

        1. The input file can be processed in parallel (if you have several processors available); this can speed things up several times over. That is, several workers prepare the input chunks for LOAD DATA INFILE, as sketched below the list.
        2. Then proceed as @Daniel Protopopov wrote above.
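
        A rough sketch of item 1, assuming the CLI pcntl extension is available; the chunk paths, worker count and the prepare_chunk() preprocessing step are all hypothetical:

         // Split file.csv into one file per worker and preprocess the chunks in
         // parallel; each finished chunk can then be imported with LOAD DATA INFILE.
         $workers = 4;
         $lines   = file('file.csv');
         $chunks  = array_chunk($lines, max(1, (int) ceil(count($lines) / $workers)));

         $pids = [];
         foreach ($chunks as $i => $chunk) {
             $chunkFile = "/tmp/import_chunk_$i.csv";
             file_put_contents($chunkFile, implode('', $chunk));

             $pid = pcntl_fork();
             if ($pid === 0) {              // child process
                 prepare_chunk($chunkFile); // hypothetical filtering/normalization
                 exit(0);
             }
             $pids[] = $pid;
         }
         foreach ($pids as $pid) {
             pcntl_waitpid($pid, $status);  // wait for every worker to finish
         }
         // Then run LOAD DATA INFILE '/tmp/import_chunk_0.csv' ... for each chunk.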