I have a 400 MB XML file that needs to be rendered in readable form.

Unfortunately, all the console utilities I have found load the whole file into memory and cannot handle mine.

Can anyone suggest an example script or a console utility for such cases?

I tried xmllint; it crashes within a few seconds:

xmllint --format --shell in.xml > tmp.xml 

Thanks!

  • Split it into parts (by a repeating tag), process the parts with xmllint, and merge them back into one large file. You could try PHP, as at stackoverflow.com/questions/1167062/…; indeed, in any language with XML support you can find lazy-loading options. - strangeqargo
  • You may not need the --shell argument; check. - strangeqargo
  • Use a SAX parser. Or, if C# is an option, use XmlReader. - VladD
  • Try adding --stream to xmllint, as suggested in the comment to this answer. - yeputons
  • Well, PHP also has an XMLReader, by the way; I used it to parse large files at one time... but perhaps I misunderstood: do you just need to parse the file, or do you really need to format the text into readable form? That is, is someone really going to read half a gigabyte of XML with their own eyes? - AlexandrX
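Since the question is only about pretty-printing, the streaming (SAX) approach suggested in the comments keeps memory bounded regardless of file size: the parser hands you events one at a time, and you re-emit them with indentation without ever building a tree. Here is a minimal sketch in Python (the same idea works with PHP's XMLReader/XMLWriter or C#'s XmlReader); the `pretty_print` helper and the sample document are illustrative, not from the original post:

```python
import io
import sys
import xml.sax
from xml.sax.saxutils import escape, quoteattr

class PrettyPrinter(xml.sax.ContentHandler):
    """SAX handler that re-emits the document with indentation.

    Memory use stays bounded because only the current nesting depth
    and the text of the current element are kept in memory.
    """
    def __init__(self, out, indent="  "):
        super().__init__()
        self.out = out
        self.indent = indent
        self.depth = 0
        self.text = []

    def _flush_text(self):
        # Emit any accumulated (non-whitespace) character data.
        text = "".join(self.text).strip()
        self.text = []
        if text:
            self.out.write(self.indent * self.depth + escape(text) + "\n")

    def startElement(self, name, attrs):
        self._flush_text()
        parts = [name] + ["%s=%s" % (k, quoteattr(v)) for k, v in attrs.items()]
        self.out.write(self.indent * self.depth + "<" + " ".join(parts) + ">\n")
        self.depth += 1

    def endElement(self, name):
        self._flush_text()
        self.depth -= 1
        self.out.write(self.indent * self.depth + "</%s>\n" % name)

    def characters(self, content):
        self.text.append(content)

def pretty_print(src, dst):
    # src and dst are file-like objects; the parser streams from src.
    parser = xml.sax.make_parser()
    parser.setContentHandler(PrettyPrinter(dst))
    parser.parse(src)

if __name__ == "__main__":
    # Small in-memory demo; for a 400 MB file, pass open file handles instead.
    src = io.StringIO("<root><a x='1'>hi</a><b/></root>")
    pretty_print(src, sys.stdout)
```

For the real file you would call `pretty_print(open("in.xml", "rb"), open("tmp.xml", "w"))`; since nothing is buffered beyond the current element's text, a 400 MB input is no different from a 4 KB one.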

1 answer

  1. How to bring a large XML file into readable form? If I understand correctly, the first step is to read the data from the file, the second is to build the output, and the third is to write out the result.
  2. "They preload the file into memory and cannot process my file..." If I understood correctly, the main problem is that the file is too large and there is not enough memory to read it all at once.

If I have misunderstood you, please add clarifying details and I will correct the answer. If I understood correctly, I suggest using the PHPOffice/PhpSpreadsheet library. It was created as the successor to the PHPExcel library, which was very slow but supported XML among other formats. Why this library? It can read large files in chunks. Below is the library author's example code for reading a file a few rows at a time.

    <?php

    error_reporting(E_ALL);
    set_time_limit(0);
    date_default_timezone_set('Europe/London');

    ?>
    <html>
    <head>
        <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
        <title>PHPExcel Reader Example #11</title>
    </head>
    <body>
    <h1>PHPExcel Reader Example #11</h1>
    <h2>Reading a Workbook in "Chunks" Using a Configurable Read Filter (Version 1)</h2>
    <?php

    /** Include path **/
    set_include_path(get_include_path() . PATH_SEPARATOR . '../../../Classes/');

    /** \PhpOffice\PhpSpreadsheet\IOFactory */
    include 'PHPExcel/IOFactory.php';

    $inputFileType = 'Xls';
    // $inputFileType = 'Xlsx';
    // $inputFileType = 'Xml';
    // $inputFileType = 'Ods';
    // $inputFileType = 'Gnumeric';
    $inputFileName = './sampleData/example2.xls';

    /** Define a Read Filter class implementing \PhpOffice\PhpSpreadsheet\Reader\IReadFilter */
    class chunkReadFilter implements \PhpOffice\PhpSpreadsheet\Reader\IReadFilter
    {
        private $_startRow = 0;
        private $_endRow = 0;

        /**
         * The start row and chunk size that we want to read are passed into the constructor.
         *
         * @param mixed $startRow
         * @param mixed $chunkSize
         */
        public function __construct($startRow, $chunkSize)
        {
            $this->_startRow = $startRow;
            $this->_endRow = $startRow + $chunkSize;
        }

        public function readCell($column, $row, $worksheetName = '')
        {
            // Only read the heading row, and the rows that were configured in the constructor
            if (($row == 1) || ($row >= $this->_startRow && $row < $this->_endRow)) {
                return true;
            }
            return false;
        }
    }

    echo 'Loading file ', pathinfo($inputFileName, PATHINFO_BASENAME),
         ' using IOFactory with a defined reader type of ', $inputFileType, '<br />';

    /* Create a new Reader of the type defined in $inputFileType */
    $reader = \PhpOffice\PhpSpreadsheet\IOFactory::createReader($inputFileType);

    echo '<hr />';

    /* Define how many rows we want for each "chunk" */
    $chunkSize = 20;

    /* Loop to read our worksheet in "chunk size" blocks */
    for ($startRow = 2; $startRow <= 240; $startRow += $chunkSize) {
        echo 'Loading WorkSheet using configurable filter for headings row 1 and for rows ',
             $startRow, ' to ', ($startRow + $chunkSize - 1), '<br />';
        /* Create a new instance of our Read Filter, passing in the limits on which rows we want to read */
        $chunkFilter = new chunkReadFilter($startRow, $chunkSize);
        /* Tell the Reader that we want to use the Read Filter that we've just instantiated */
        $reader->setReadFilter($chunkFilter);
        /* Load only the rows that match our filter from $inputFileName into a Spreadsheet object */
        $spreadsheet = $reader->load($inputFileName);
        // Do some processing here
        $sheetData = $spreadsheet->getActiveSheet()->toArray(null, true, true, true);
        var_dump($sheetData);
        echo '<br /><br />';
    }

    ?>
    </body>
    </html>

$chunkSize = 20; — I think this is too small. If you have 1,000,000 rows, you could easily take 25,000 rows at a time; that is roughly 25-30 seconds of processing. In essence, what happens here is: we implement the read-filter interface, create a check method, specify in $chunkSize the number of rows to read at a time, and in the loop set the starting and ending rows. We get the result as the array $sheetData = $spreadsheet->getActiveSheet()->toArray(null, true, true, true);

Added a minute later: as you can see, the file format is hard-coded, but you can detect it automatically:

    $inputFileType = \PhpOffice\PhpSpreadsheet\IOFactory::identify($arr['FileName']);
    $reader = \PhpOffice\PhpSpreadsheet\IOFactory::createReader($inputFileType);
    $reader->setReadDataOnly(true);

Added 2 minutes later:

To get the library running quickly, I downloaded a folder named PhpSpreadsheet with its contents, put it into a PhpOffice folder, and wrote an __autoload function:

    //ini_set('include_path', '/var/www/main_lib');
    //error_reporting(E_ALL);
    function __autoload($class_name)
    {
        try {
            require_once(str_replace('\\', '/', $class_name) . '.php');
        } catch (Exception $e) {
            echo 'err/failed to load class ' . $class_name
               . ', or include_path="/var/www/main_lib" is not set in php.ini';
        }
    }

As you may have guessed, my library lives in the /var/www/main_lib folder.