How to pull data from a specific cell across multiple Excel files

Question

There are a lot of Excel files (more than 10,000). Opening each file and copying the information out of it by hand would take a lot of time.

The question is how to extract text, for example from cell K6 of the first sheet, from many files at once and write this data somewhere, even just to a txt file.

For example, take VB or Python and write a program of about 10 lines.

Answer 1 · 2018-12-04T09:16:17

The documentation below let me build a set of utilities for both parsing and generating xls / xlsx files. In short, an xls file consists of a wrapper (container) and the data inside it. An xlsx file is an "almost" zip archive (i.e., slightly different from a regular zip) holding a set of XML files (the document itself, a dictionary, and some other data). You need to understand clearly what you are parsing, because a file presented as Excel can actually be an XML or HTML document that MS Office is able to open and convert.

xls format signature: D0 CF 11 E0 A1 B1 1A E1, or 09 08 for all older versions.
xls wrapper documentation https://msdn.microsoft.com/en-us/library/dd942138.aspx
xls wrapper documentation (another description) http://www.amwa.tv/downloads/specifications/aafcontainerspec-v1.0.1.pdf
xls data documentation https://www.openoffice.org/sc/excelfileformat.pdf
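
Since the first step is knowing what you are actually parsing, here is a minimal Python sketch that sorts files by their leading bytes. The two xls signatures come from the line above. The zip signature "PK\x03\x04", the check for a leading "<", and the names sniff_format / sniff_file are my own additions for illustration.

```python
# Signatures quoted above for xls.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # xls inside the wrapper
BIFF_BOF = bytes.fromhex("0908")                # bare xls data, older versions
# Assumption: an xlsx archive starts with the usual zip local-file header.
ZIP_MAGIC = b"PK\x03\x04"
UTF8_BOM = b"\xef\xbb\xbf"


def sniff_format(head: bytes) -> str:
    """Classify a file by its first bytes (hypothetical helper)."""
    if head.startswith(OLE2_MAGIC):
        return "xls-compound"
    if head.startswith(ZIP_MAGIC):
        return "xlsx-zip"
    if head.startswith(BIFF_BOF):
        return "xls-bare"
    text = head[len(UTF8_BOM):] if head.startswith(UTF8_BOM) else head
    if text.lstrip().startswith(b"<"):
        return "xml-or-html"  # markup that MS Office can open as a sheet
    return "unknown"


def sniff_file(path) -> str:
    """Read only the first 64 bytes of the file and classify them."""
    with open(path, "rb") as f:
        return sniff_format(f.read(64))
```

Running sniff_file over the whole folder first tells you which parser each file needs before any real work starts.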

The algorithm is as follows:
1. Read the MSAT from the header. Either read the entire SAT chain up front, or read it lazily, as the sectors are needed.
2. Read (taking the MSAT + SAT + SSAT into account) the "root" of the document. It is easy to find by the signature 09 08 at an address that is a multiple of 512. In the root of the document, read the dictionary and the addresses of the sheets (signature 85 00).
3. Skip through the document until you reach the sheet data. A sheet always starts with the signature 09 08.
4. Read the sheet and link the sheet data with the dictionary (if necessary).
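
Step 1 can be sketched in Python roughly like this. It is only an illustration of reading the header fields, not a full parser: the field offsets follow the wrapper documentation linked above as I understand it, and the names parse_cfb_header and sector_offset are hypothetical.

```python
import struct

CFB_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")
FREE_SECID = 0xFFFFFFFF  # marks an unused slot in the MSAT


def parse_cfb_header(header: bytes) -> dict:
    """Pull the allocation-table fields out of the 512-byte file header."""
    if len(header) < 512 or header[:8] != CFB_MAGIC:
        raise ValueError("not a compound-file header")
    # Sector sizes are stored as powers of two at offsets 30 and 32.
    sector_shift, short_shift = struct.unpack_from("<HH", header, 30)
    # Eight 32-bit little-endian fields starting at offset 44.
    (sat_count, dir_start, _unused, stream_cutoff,
     ssat_start, ssat_count,
     msat_ext_start, msat_ext_count) = struct.unpack_from("<8I", header, 44)
    # The first 109 MSAT entries live in the header itself, from offset 76.
    msat_head = struct.unpack_from("<109I", header, 76)
    return {
        "sector_size": 1 << sector_shift,
        "short_sector_size": 1 << short_shift,
        "sat_sector_count": sat_count,
        "directory_start": dir_start,
        "stream_cutoff": stream_cutoff,
        "ssat_start": ssat_start,
        "ssat_sector_count": ssat_count,
        "msat_ext_start": msat_ext_start,
        "msat_ext_count": msat_ext_count,
        "msat": [s for s in msat_head if s != FREE_SECID],
    }


def sector_offset(sec_id: int, sector_size: int) -> int:
    """File offset of a sector; sector 0 starts right after the header sector."""
    return (sec_id + 1) * sector_size
```

Each SecID in "msat" points at one SAT sector. Reading those sectors (lazily or all at once) gives the chain you follow to reach the directory and then the workbook stream.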

If the document is smaller than 66 kilobytes, the data is practically unfragmented and the file is very easy to parse. I recommend parsing a small file first, then one of up to 6 MB, then one over 10 MB, because the MSAT has its peculiarities.
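
One likely reason for those size steps (my reading, not something stated in the answer itself): the header only has room for 109 MSAT entries, and with 512-byte sectors those cover a bit under 7 MB. Larger files continue the MSAT in extra sectors, which is a separate code path worth testing. A small Python calculation, with the function name being hypothetical:

```python
HEADER_MSAT_ENTRIES = 109  # MSAT slots that fit in the file header
SECID_SIZE = 4             # each SecID is a 32-bit integer


def max_bytes_covered_by_header_msat(sector_size: int = 512) -> int:
    """Data size addressable before the MSAT has to continue outside the header."""
    secids_per_sat_sector = sector_size // SECID_SIZE  # 128 for 512-byte sectors
    return HEADER_MSAT_ENTRIES * secids_per_sat_sector * sector_size


# 109 * 128 * 512 = 7,143,424 bytes, about 6.8 MiB
limit = max_bytes_covered_by_header_msat()
```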

Links for xlsx

MS-XLSX from Microsoft http://msdn.microsoft.com/en-us/library/dd922181.aspx
ECMA http://www.ecma-international.org/publications/standards/Ecma-376.htm
http://secure.wikimedia.org/wikipedia/en/wiki/Microsoft_Office_XML_formats
http://secure.wikimedia.org/wikipedia/en/wiki/Office_Open_XML
SO-en http://stackoverflow.com/questions/4886027/looking-for-a-clear-description-of-excels-xlsx-xml-format
Parsing xlsx in C# http://www.stackoverflow.com/a/600965/17974
php export habr http://habr.com/post/236107
For simplicity, if you have not found a good library for xlsx (libraries are not always convenient, and they parse everything), I recommend two things: a) a zip unpacker and b) an XML library. Inside the archive there are sheets, for example xl/worksheets/sheet1.xml, and there is a "dictionary", xl/sharedStrings.xml, in which all string data is stored.

The algorithm is:
1. Open the file, find the sheet, find the dictionary.
2. Read the dictionary and unpack it.
3. Read the sheet, unpack it, and link the sheet data with the dictionary. Without the dictionary, only numeric data can be read.
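
The three steps above might look like this in Python using only the standard zipfile and xml.etree modules. This is a sketch under assumptions: it takes xl/worksheets/sheet1.xml to be the first sheet (the real order is defined in the workbook part and may differ), and read_cell / load_shared_strings are names I made up.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the sheet and dictionary XML, in ElementTree's {uri}tag form.
MAIN = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"


def load_shared_strings(zf: zipfile.ZipFile) -> list:
    """Step 2: read the dictionary; a file with no strings may not have one."""
    try:
        data = zf.read("xl/sharedStrings.xml")
    except KeyError:
        return []
    root = ET.fromstring(data)
    # One <si> per string; rich text is split into several <t> runs.
    return ["".join(t.text or "" for t in si.iter(MAIN + "t"))
            for si in root.findall(MAIN + "si")]


def read_cell(path, ref="K6", sheet="xl/worksheets/sheet1.xml"):
    """Steps 1 and 3: open the archive, find the cell, resolve it via the dictionary."""
    with zipfile.ZipFile(path) as zf:
        strings = load_shared_strings(zf)
        root = ET.fromstring(zf.read(sheet))
    for cell in root.iter(MAIN + "c"):
        if cell.get("r") != ref:
            continue
        cell_type = cell.get("t")
        if cell_type == "inlineStr":  # string stored in the cell itself
            return "".join(t.text or "" for t in cell.iter(MAIN + "t"))
        value = cell.find(MAIN + "v")
        if value is None or value.text is None:
            return None
        if cell_type == "s":          # value is an index into the dictionary
            return strings[int(value.text)]
        return value.text             # number, boolean, etc. as raw text
    return None
```

Usage is just read_cell("book.xlsx", "K6"); it returns None when the cell is empty or absent.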

In your case the sheet does not have to be unpacked to the end. It is enough to find the data you need and stop the decompression. The data will contain a reference to the dictionary. Do the same there: unpack the dictionary only up to the element with your index.
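
Here is a rough Python sketch of that early stop, plus the loop over a folder that the question asks for. It streams the sheet with iterparse and returns as soon as the cell is seen, then streams the dictionary only up to the needed index. Same assumptions as before: sheet1.xml is taken to be the first sheet, inline strings are not handled, and all function names are hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET
from pathlib import Path

MAIN = "{http://schemas.openxmlformats.org/spreadsheetml/2006/main}"


def find_cell(zf, ref, sheet="xl/worksheets/sheet1.xml"):
    """Return (type, raw value) of the cell; stop decompressing once it is found."""
    with zf.open(sheet) as stream:
        for _event, elem in ET.iterparse(stream):  # fires when an element ends
            if elem.tag == MAIN + "c" and elem.get("r") == ref:
                value = elem.find(MAIN + "v")
                return elem.get("t"), (value.text if value is not None else None)
            if elem.tag == MAIN + "row":
                elem.clear()                       # drop rows already passed
    return None, None


def shared_string(zf, index):
    """Unpack the dictionary only up to the element with the given index."""
    with zf.open("xl/sharedStrings.xml") as stream:
        seen = 0
        for _event, elem in ET.iterparse(stream):
            if elem.tag == MAIN + "si":
                if seen == index:
                    return "".join(t.text or "" for t in elem.iter(MAIN + "t"))
                seen += 1
                elem.clear()
    raise IndexError(index)


def read_cell_fast(path, ref="K6"):
    with zipfile.ZipFile(path) as zf:
        cell_type, raw = find_cell(zf, ref)
        if cell_type == "s" and raw is not None:
            return shared_string(zf, int(raw))
        return raw


def dump_cells(folder, out_file, ref="K6"):
    """Write one 'file name<TAB>value' line per xlsx file in the folder."""
    with open(out_file, "w", encoding="utf-8") as out:
        for path in sorted(Path(folder).glob("*.xlsx")):
            try:
                value = read_cell_fast(path, ref)
            except (zipfile.BadZipFile, KeyError, IndexError,
                    ValueError, ET.ParseError) as exc:
                value = "ERROR: %s" % exc          # keep going on broken files
            out.write("%s\t%s\n" % (path.name, "" if value is None else value))
```

Calling dump_cells("reports", "k6.txt") would then produce one line per file. Whether this beats a general-purpose library on your 10,000 files is worth measuring on a sample rather than assuming.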

PS. Libraries, as a rule, are not tailored to "narrow" tasks. In 99.99% of cases the cells you need will not be stored at the same place in every file. My opinion is that reading the value of a single cell through a library will take more CPU time and machine resources (because, as a rule, a library parses the entire document) than a highly specialized parser would.
