The XML structure is pretty simple ...

But the file size is quite large ...

I want to read the file in several parallel threads ...

How do I read a HUGE XML file in several threads?

Which parser would you recommend?

UPD:

The file has no complicated structure ... there is no deep nesting ...

File slice:

<?xml version="1.0" encoding="utf-8"?>
<PackageBOMs Date="2016-01-01" Time="00:00:00" System_ID="EGP" Client="1000" DateAktualn="2016-09-13" Version="BOM_007">
  <BOM ST_LNR="00677090" ST_LAL="01" MA_TNR_EXT="СТУ1003094А" WERKS="" ST_LAN="1" BM_ENG="1.000" BM_EIN="ШТ" LA_BOR="GKB" ST_LST="10" ZT_EXT="">
    <pos POS_NR="9000" ID_NRK="0202-09060-00006" PO_STP="L" SO_RTF="07" MENGE="0.060" MEINS="КГ" RE_KRS="false" SAN_KO="false" SAN_FE="true" SANKA="true" RV_REL="false" ZM_CMP="false" Z_WEIGHT_DETAIL="0.012" UNIT="KG" ZSIZE_DIMENSIONS="24" ZWEIGHT_BLANK="0.033" UNIT_B="KG" ALPRF="00" EWAHR="0" BEIKZ="" SCHGT="false" LGORT="" ITSOB="" IDENT="00000001"/>
  </BOM>
  <BOM ST_LNR="00677091" ST_LAL="01" MA_TNR_EXT="КПК0114624" WERKS="" ST_LAN="1" BM_ENG="1.000" BM_EIN="ШТ" LA_BOR="GKB" ST_LST="10" ZT_EXT="">
    <pos POS_NR="9000" ID_NRK="0201-09080-00044" PO_STP="L" SO_RTF="07" MENGE="24.000" MEINS="КГ" RE_KRS="false" SAN_KO="false" SAN_FE="true" SANKA="true" RV_REL="false" ZM_CMP="false" Z_WEIGHT_DETAIL="9.850" UNIT="KG" ZSIZE_DIMENSIONS="163" ZWEIGHT_BLANK="19.400" UNIT_B="KG" ALPRF="00" EWAHR="0" BEIKZ="" SCHGT="false" LGORT="" ITSOB="" IDENT="00000001"/>
    <pos POS_NR="9000" ID_NRK="0201-09080-00044" PO_STP="L" SO_RTF="07" MENGE="24.000" MEINS="шт" RE_KRS="false" SAN_KO="false" SAN_FE="true" SANKA="true" RV_REL="false" ZM_CMP="false" Z_WEIGHT_DETAIL="9.850" UNIT="KG" ZSIZE_DIMENSIONS="163" ZWEIGHT_BLANK="19.400" UNIT_B="KG" ALPRF="00" EWAHR="0" BEIKZ="" SCHGT="false" LGORT="" ITSOB="" IDENT="00000001"/>
    <pos POS_NR="9000" ID_NRK="0201-09080-00044" PO_STP="L" SO_RTF="07" MENGE="24.000" MEINS="КГ" RE_KRS="false" SAN_KO="false" SAN_FE="true" SANKA="true" RV_REL="false" ZM_CMP="false" Z_WEIGHT_DETAIL="9.850" UNIT="KG" ZSIZE_DIMENSIONS="163" ZWEIGHT_BLANK="19.400" UNIT_B="KG" ALPRF="00" EWAHR="0" BEIKZ="" SCHGT="false" LGORT="" ITSOB="" IDENT="00000001"/>
    <pos POS_NR="9000" ID_NRK="0201-09080-00044" PO_STP="L" SO_RTF="07" MENGE="24.000" MEINS="КГ" RE_KRS="false" SAN_KO="false" SAN_FE="true" SANKA="true" RV_REL="false" ZM_CMP="false" Z_WEIGHT_DETAIL="9.850" UNIT="KG" ZSIZE_DIMENSIONS="163" ZWEIGHT_BLANK="19.400" UNIT_B="KG" ALPRF="00" EWAHR="0" BEIKZ="" SCHGT="false" LGORT="" ITSOB="" IDENT="00000001"/>
    <pos POS_NR="9000" ID_NRK="0201-09080-00044" PO_STP="L" SO_RTF="07" MENGE="24.000" MEINS="л" RE_KRS="false" SAN_KO="false" SAN_FE="true" SANKA="true" RV_REL="false" ZM_CMP="false" Z_WEIGHT_DETAIL="9.850" UNIT="KG" ZSIZE_DIMENSIONS="163" ZWEIGHT_BLANK="19.400" UNIT_B="KG" ALPRF="00" EWAHR="0" BEIKZ="" SCHGT="false" LGORT="" ITSOB="" IDENT="00000001"/>
    ...
  </BOM>
  ...
</PackageBOMs>

Unfortunately, this XML is not generated inside our company, but by the customers ...

We have no access to the databases on the customer's side (4 different types of databases), and cannot get it, for reasons of information security policy and so on (in short, a state organization) ...

The customer has neither the specialists nor the desire to rewrite their software, which collects data from tables in different databases and generates this XML.

  • 2
    Look towards SAX parsers. But I have no idea how a streaming format like XML can be read effectively in several threads. - VladD
    @VladD, thanks. I have already implemented it using SAX (users are satisfied) ... But I don't really like the speed ... I want to play around with optimizing the file-loading speed ... - Vitaly Vikhlyaev
    Is XMLReader too slow for reading? - Senior Pomidor
    I had a problem reading a large file (about 6 GB) and only StringTokenizer helped me. In single-threaded mode it takes a couple of seconds, judging by the logs. - Senior Pomidor
  • 1
    Taking all of the above into account, the only thing left is to use an SSD. Or, alternatively, store such large files on a logical disk formatted with large clusters: 32-64 KB, depending on your OS and file system. The larger the cluster size, the faster large files are read and written. - Alexander Petrov

5 answers

Files in XML format, like any other LL(n)-like grammar, cannot be parsed in multiple threads. The most you can do is hand the data over to another thread for processing as soon as it is received, so that one thread is always busy with parsing.

For example, if parsing the file takes half of the time and writing to the database takes the other half, then moving the database work to another thread will double the speed.

The java.util.concurrent.BlockingQueue class and the producer-consumer pattern may be useful here.
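
A minimal sketch of such a pipeline, assuming the <pos> records from the sample in the question; the queue payload and the database write are placeholders:

    import java.io.File;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.Attributes;
    import org.xml.sax.helpers.DefaultHandler;

    public class ParsePipeline {
        // Sentinel that tells the consumer there will be no more records.
        private static final String EOF = "__EOF__";

        public static void main(String[] args) throws Exception {
            BlockingQueue<String> queue = new ArrayBlockingQueue<>(10_000);

            // Consumer thread: stands in for the database writer.
            Thread writer = new Thread(() -> {
                try {
                    String record;
                    while (!(record = queue.take()).equals(EOF)) {
                        // writeToDatabase(record) would go here
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            writer.start();

            // Producer: the SAX parser pushes every <pos> element into the queue.
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new File(args[0]),
                    new DefaultHandler() {
                        @Override
                        public void startElement(String uri, String local,
                                                 String qName, Attributes a) {
                            if ("pos".equals(qName)) {
                                try {
                                    queue.put(a.getValue("ID_NRK") + ";" + a.getValue("MENGE"));
                                } catch (InterruptedException e) {
                                    Thread.currentThread().interrupt();
                                }
                            }
                        }
                    });

            queue.put(EOF); // parsing finished, let the consumer drain and stop
            writer.join();
        }
    }

Note that the queue's capacity bounds memory use: if the database writer falls behind, the parser blocks on put() instead of buffering the whole 25 GB.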

  • 1
    Not quite. This is true for an arbitrary XML file, but it may be false for a particular one. For example, if it contains nothing but entries like <link x="somelink" /> ... all of the same type and without nesting, then reading it in several threads is no problem at all. And I think this file is exactly like that. - Dmitry Polyanin
  • @DmitryPolyanin: Well, hmm. In the same way, inverting one specific cryptographic hash can be easy to solve, but in the general case ... - VladD
  • @Pavel Mayorov, thanks for the reply. The file is simple, the records are all of the same type ... I added a sample of the file to the question. - Vitaly Vikhlyaev

It seems to me that multithreading will not solve the problem ...

You are reading the file from a single HDD => by running N threads, you will simply hit the limit of the disk's performance.

Alternatively, you can read continuously into memory in one thread, and process the data in several threads.

It seems this pattern is called producer/consumer.

And why does the XML file grow to such a size? Perhaps you could abandon XML and put the data for these entities into a database?

In my opinion, this file size is not normal ...

  • Thanks for the answer. I added an explanation to the question of why the file ends up this size (25 GB is not the limit ...). I completely agree with you that files of this size are not normal, but we have to cope and work with what we have. - Vitaly Vikhlyaev

In principle, a file with such a structure is easy to "parse in parallel."

Determine its size and divide it by the number of handler threads. Each thread then knows the position in the file at which it must start processing and where to stop.

All threads (except the first) seek to their starting point in the file and then read it line by line. After reading a line with </BOM> (i.e. the closing tag of a record that belongs to the previous thread's part), the thread begins line-by-line processing of its own part.

It finishes when it reads a line with </BOM> at a position past the end of its part.

Clearly, the algorithm for finding the start of processing for the first thread and the end of processing for the last one differs somewhat (hopefully in an obvious way) from the one described above.

How exactly "slip" self-allocated fragments into the Java parsing library, sorry, I don’t know. Personally, I (probably) would parse them myself, since the structure is trivial.

Will such parallel processing give a performance boost? Not obviously (since everything is read from the same disk), but quite possibly (imho, thanks to parallelizing the database update requests; however, this depends heavily on the particular database).

    A pull parser can be an alternative to a SAX parser: http://www.scala-lang.org/api/2.7.4/scala/xml/pull/XMLEventReader.html (it is from Scala, but it can be used from Java as well)

    However, it is unlikely that a pull parser alone will give a really big speed gain.
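
    In the Java standard library the same pull model is available as StAX; a minimal sketch, assuming the <pos> records from the question (the processing is a placeholder):

        import java.io.FileInputStream;
        import javax.xml.stream.XMLInputFactory;
        import javax.xml.stream.XMLStreamConstants;
        import javax.xml.stream.XMLStreamReader;

        public class PullParseDemo {
            public static void main(String[] args) throws Exception {
                try (FileInputStream in = new FileInputStream(args[0])) {
                    XMLStreamReader r = XMLInputFactory.newInstance().createXMLStreamReader(in);
                    // Unlike SAX, the caller drives the loop and pulls events on demand.
                    while (r.hasNext()) {
                        if (r.next() == XMLStreamConstants.START_ELEMENT
                                && "pos".equals(r.getLocalName())) {
                            String id = r.getAttributeValue(null, "ID_NRK");
                            // process the record here
                        }
                    }
                    r.close();
                }
            }
        }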

    In general, there is another possible idea: read the file not sequentially, but from several positions at once (for example, the first gigabyte in the first thread, the second in the second, and so on). https://docs.oracle.com/javase/7/docs/api/java/nio/channels/FileChannel.html can help you with this (the main thing is not to start reading in the middle of an XML tag).

    This is an unverified idea, I have never implemented it. But it may help, because the biggest bottleneck is usually the disk read.
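
    A minimal sketch of such positional reading, assuming four threads and leaving out the tag-boundary alignment, which is exactly the tricky part mentioned above. FileChannel.read(buffer, position) does not touch the channel's shared position, so all threads can safely share one channel:

        import java.nio.ByteBuffer;
        import java.nio.channels.FileChannel;
        import java.nio.file.Paths;
        import java.nio.file.StandardOpenOption;

        public class PositionalRead {
            public static void main(String[] args) throws Exception {
                final int nThreads = 4;
                try (FileChannel ch = FileChannel.open(Paths.get(args[0]),
                                                       StandardOpenOption.READ)) {
                    final long chunk = ch.size() / nThreads; // remainder ignored in this sketch
                    Thread[] workers = new Thread[nThreads];
                    for (int i = 0; i < nThreads; i++) {
                        final long from = i * chunk;
                        workers[i] = new Thread(() -> {
                            ByteBuffer buf = ByteBuffer.allocate(8 * 1024 * 1024);
                            long pos = from;
                            try {
                                int n;
                                while (pos < from + chunk && (n = ch.read(buf, pos)) != -1) {
                                    pos += n;
                                    buf.flip();
                                    // hand the buffer to a parser here
                                    buf.clear();
                                }
                            } catch (java.io.IOException e) {
                                throw new RuntimeException(e);
                            }
                        });
                        workers[i].start();
                    }
                    for (Thread t : workers) t.join();
                }
            }
        }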

    • 1
      Well, suppose you started reading at a tag nested 25 levels deep. What do you do? How do you even determine which tags you are inside? - VladD
    • 1
      Without knowing the structure of the XML file, there is no way. But in some cases (for example, when the root tag contains lots of small unrelated tags) it makes sense to "jump ahead" to the nearest closing tag and start reading from the next one. - Nartallax
    • 1
      @VladD, well, it can be implemented: if reading did not start at an opening tag, we mark that object as incomplete. Each thread reads into its own collection, and since we know the positions at which reading starts, we can number those collections (i.e. the groups of objects). After the file has been read, we walk through the collections, find the incomplete objects and merge them (see the sketch after these comments). - hardsky
    • But of course the advice to use FileChannel is in the spirit of "use a programming language, Luke!" - hardsky
    • 1
      @hardsky: Okay, but then we end up holding 25 GB in memory. We will run out of memory, I'm afraid. I suspect the data has to be processed immediately as it is received. - VladD
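
    A minimal illustration of the merging idea from the comments above; all types and names here are hypothetical:

        import java.util.ArrayList;
        import java.util.List;

        // Result of one numbered chunk: the possibly incomplete fragment at
        // each end is kept separately from the complete records.
        class ChunkResult {
            String head = "";                          // text before the first complete record
            List<String> records = new ArrayList<>();  // complete records of this chunk
            String tail = "";                          // text after the last complete record
        }

        class ChunkMerge {
            // Chunks are numbered by their start position, so the tail of chunk i
            // and the head of chunk i+1 together form one complete record.
            static List<String> merge(List<ChunkResult> chunks) {
                List<String> all = new ArrayList<>();
                for (int i = 0; i < chunks.size(); i++) {
                    all.addAll(chunks.get(i).records);
                    if (i + 1 < chunks.size()) {
                        all.add(chunks.get(i).tail + chunks.get(i + 1).head);
                    }
                }
                return all;
            }
        }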

    There is the RandomAccessFile class. It can read a file starting from a specific byte.

     // "r" — the file is opened read-only
     RandomAccessFile raf = new RandomAccessFile("input.txt", "r");
     // the "cursor" is at character 0
     String text1 = raf.readLine();
     // move the "cursor" to character 100
     raf.seek(100);
     String text2 = raf.readLine();
     // move the "cursor" back to character 0
     raf.seek(0);
     String text3 = raf.readLine();
     // close the file
     raf.close();
    • And how will this help? - VladD
    • Well, the person wants to read a file in several threads. How they will then glue the data together or work with the pieces separately, I do not know, but I suggested a class with which you can read a piece of the file. The file can be divided among threads, each starting from its own byte. And I agree with what was written here: everything will come down to disk performance. - Adeptius