There are a lot of Excel files (more than 10,000); opening each one and copying the information by hand would take a very long time.

The question is how to extract text, for example from cell K6 on the first sheet, from many files at once and write that data somewhere, even just a txt file.

  • Take, for example, VB or Python and write a program of about 10 lines - Alexander Chernin
  • @AlexanderChernin and can it be done with some basic built-in means? - iKey
  • It seems to me that would only be Visual Basic - Alexander Chernin
  • But if there are a lot of files with many sheets, then Python is better - Alexander Chernin
  • It is possible in VBA too. Yes, the files will be opened, but the user does not have to see it. Specify the task and the language - vikttur

1 answer

This documentation allowed me to build a set of utilities for both parsing and generating xls / xlsx. In short, an xls file consists of a wrapper and data. An xlsx file is an "almost" zip archive (i.e., slightly different from a regular zip) containing a set of xml files (the document itself + a dictionary + some other data). You need to understand clearly what you are parsing, because a file that claims to be Excel is sometimes really an xml or html document that MS Office converts into a workbook when opening it.

The algorithm is as follows:
1. Read the MSAT from the header. Either read the entire SAT chain up front, or read it piece by piece as sectors are needed.
2. Read (taking MSAT + SAT + SSAT into account) the "root" of the document. It is easy to find by the signature 09 08 at an address that is a multiple of 512. In the root of the document, read the dictionary and the addresses of the sheets (signature 85 00).
3. Skip through the document until you reach the sheet data. A sheet always starts with the signature 09 08.
4. Read the sheet and link the sheet data with the dictionary (if necessary).
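As a rough illustration of step 1 only, here is a minimal sketch that reads the 512-byte compound document header and the MSAT entries stored in it. The field offsets follow the compound document layout as I understand it; the names `parse_header` and `sector_offset` are made up for this example, and following the MSAT beyond the header (large files) is deliberately left out.

```python
import struct

# Magic bytes at the start of every compound document (the xls "wrapper")
CFB_SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")


def parse_header(header):
    """Decode the allocation-table fields of a 512-byte header (hypothetical helper)."""
    if len(header) < 512 or header[:8] != CFB_SIGNATURE:
        raise ValueError("not a compound document header")
    # Sector sizes are stored as powers of two (9 means 512 bytes)
    sector_shift, short_shift = struct.unpack_from("<HH", header, 30)
    sat_sectors, dir_start = struct.unpack_from("<Ii", header, 44)
    (min_stream_size, ssat_start, ssat_sectors,
     msat_start, msat_sectors) = struct.unpack_from("<IiIiI", header, 56)
    # The header holds the first 109 MSAT entries; negative IDs mean "unused"
    msat = [s for s in struct.unpack_from("<109i", header, 76) if s >= 0]
    return {
        "sector_size": 1 << sector_shift,
        "short_sector_size": 1 << short_shift,
        "sat_sectors": sat_sectors,
        "dir_start": dir_start,
        "min_stream_size": min_stream_size,
        "ssat_start": ssat_start,
        "ssat_sectors": ssat_sectors,
        "msat_start": msat_start,    # negative when the MSAT fits in the header
        "msat_sectors": msat_sectors,
        "msat": msat,                # sector IDs of the SAT sectors
    }


def sector_offset(sec_id, sector_size):
    """Byte offset of a sector; sector 0 starts right after the header sector."""
    return (sec_id + 1) * sector_size
```

With the usual 512-byte sectors, `sector_offset(n, 512)` is the same as `512 + n * 512`, which is why the addresses mentioned in step 2 are multiples of 512.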

If the document is smaller than 66 kilobytes, the data is practically not fragmented and the file is very easy to parse. I recommend parsing a small file first, then one of up to 6 MB, then one over 10 MB, because handling the MSAT has its own peculiarities as files grow.
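One possible explanation for these size thresholds, assuming 512-byte sectors and 4-byte sector IDs (an assumption on my part, not something stated above), is how much of the file a single SAT sector and the header part of the MSAT can describe:

```python
SECTOR_SIZE = 512                        # assumed sector size for xls
HEADER_SIZE = 512                        # the header occupies the first 512 bytes
IDS_PER_SAT_SECTOR = SECTOR_SIZE // 4    # 4-byte sector IDs per SAT sector
MSAT_IDS_IN_HEADER = 109                 # MSAT entries stored in the header itself

# Largest file that a single SAT sector can describe
one_sat_sector_limit = HEADER_SIZE + IDS_PER_SAT_SECTOR * SECTOR_SIZE

# Largest file whose MSAT still fits entirely in the header
header_msat_limit = (
    HEADER_SIZE + MSAT_IDS_IN_HEADER * IDS_PER_SAT_SECTOR * SECTOR_SIZE
)

print(one_sat_sector_limit)   # 66048 bytes, about 66 KB
print(header_msat_limit)      # 7143936 bytes, about 6.8 MiB
```

Under these assumptions, files past the second limit need extra MSAT sectors outside the header, which would be the case worth testing with a file over 10 MB.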

For xlsx links

The algorithm is:
1. Open the file, find the sheet, find the dictionary.
2. Read the dictionary and unpack it.
3. Read the sheet, unpack it, and link the sheet data with the dictionary. Only numeric data is stored in the sheet itself, without going through the dictionary.

In your case the sheet does not have to be unpacked to the end. It is enough to find the cell you need and stop decompressing. The cell data will contain a reference into the dictionary; likewise, read the dictionary only up to the element with that index.
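The xlsx steps with the early stop can be sketched with the Python standard library alone. This is a minimal sketch under several assumptions: the first sheet is stored as `xl/worksheets/sheet1.xml` and the dictionary (shared strings) as `xl/sharedStrings.xml`, which is the usual layout but not guaranteed, and every cell carries an `r` attribute with its address. The function names are made up for this example.

```python
import zipfile
import xml.etree.ElementTree as ET
from pathlib import Path

NS = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"


def read_shared_string(zf, index, path="xl/sharedStrings.xml"):
    """Return dictionary entry number `index`, stopping once it is reached."""
    with zf.open(path) as f:
        position = -1
        for _, elem in ET.iterparse(f, events=("end",)):
            if elem.tag == NS + "si":
                position += 1
                if position == index:
                    # One entry may be split into several <t> text runs
                    return "".join(t.text or "" for t in elem.iter(NS + "t"))
                elem.clear()  # drop entries we do not need
    return None


def read_cell(xlsx_path, ref="K6", sheet="xl/worksheets/sheet1.xml"):
    """Read one cell as text, parsing the sheet only up to that cell."""
    with zipfile.ZipFile(xlsx_path) as zf:
        with zf.open(sheet) as f:
            for _, elem in ET.iterparse(f, events=("end",)):
                if elem.tag != NS + "c":
                    continue
                if elem.get("r") == ref:
                    break  # found the cell: stop reading the sheet here
                elem.clear()
            else:
                return None  # the sheet has no such cell
        cell_type = elem.get("t")
        if cell_type == "inlineStr":
            return "".join(t.text or "" for t in elem.iter(NS + "t"))
        value = elem.findtext(NS + "v")
        if value is not None and cell_type == "s":
            # Strings are stored as an index into the dictionary
            return read_shared_string(zf, int(value))
        return value  # numbers and other values come back as raw text


def dump_cells(folder, out_path, ref="K6"):
    """Write 'file name<TAB>value' for every .xlsx in a folder."""
    with open(out_path, "w", encoding="utf-8") as out:
        for path in sorted(Path(folder).glob("*.xlsx")):
            try:
                value = read_cell(path, ref)
            except (zipfile.BadZipFile, KeyError, ET.ParseError) as exc:
                value = f"<error: {exc}>"
            out.write(f"{path.name}\t{value}\n")
```

Something like `dump_cells("reports", "k6.txt")` (both names are placeholders) would then collect the K6 values into one text file. Files that are not valid zip archives or lack the assumed sheet path are recorded as errors rather than stopping the run.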

P.S. Libraries, as a rule, are not tuned for "narrow" tasks. In 99.99% of cases the cell you need will not be stored at the same position in every file. My opinion is that reading the value of a single cell through a library will take more CPU time and PC resources (because, as a rule, the library parses the entire document) than a highly specialized parser written for the task.