📜 ⬆️ ⬇️

Excel metadata files

Not so long ago, I faced the task of analyzing the number of Excel files created in a particular month of the year. I will not go into details, I will say briefly: to evaluate the work of employees of one firm. There were about a thousand files available and the essence of the task was to make a clear graph in the form of a histogram for the head of the company, in which month of the year and how many files were made (these were invoices).
I thought it would be nice to use file metadata for this purpose, i.e. run the available files on a cycle, extract the file creation dates and load it all as a list into Matplotlib to create a histogram.


I have never worked with metadata before this time, although there was a need from time to time, but it was managed by other means, so there was no experience. And in the next morning, having made sandwiches and hovering a liter cup of coffee, he began to implement his idea.


"Started" of course strongly said. First of all, it is natural to search the Internet for information about metadata, about working with them in Python, similar projects, etc. Almost immediately I came across the oletools module for Python, it is on both GitHub and PyPi, there is some good documentation. I studied her for a couple of hours. Installed oletools. Everything worked, the metadata was correctly extracted. But. In the terminal. I needed a script with a loop. Wherever I didn’t search on the Internet how to make friends oletools and IDLE, I reached 15 (!) Pages in a search engine. Information zero.


I am sure, without a doubt, that this is a wonderful module, probably very powerful, but it didn’t come to me at all. In the next attempt (by the evening) to force the script to somehow extract the file metadata, another exception was raised. And here I noticed that oletools loads the olefile module for work. I typed in a search engine and what was my surprise when I discovered that this was exactly what I had been looking for all day. Very clear documentation. I read it in half an hour and immediately, as I read, I sketched a script.


Сначала установка olefile в системе: pip3 install --user olefile 

And here is the script itself that extracts metadata from an Excel file (I only needed the value of meta.last_saved_time):


 import olefile File_Ole='File.xls' assert olefile.isOleFile(File_Ole) # Проверка корректности OLE файла ole = olefile.OleFileIO(File_Ole) meta = ole.get_metadata() # Извлечение метаданных print('Дата создания файла: '+str(meta.create_time)) # Вывод даты создания файла print('Дата последнего сохранения: '+str(meta.last_saved_time))# Вывод даты сохранения файла meta.dump() # Вывод всех метаданных на экран ole.close() # Закрытие файла 

I will not give here all the code of my script, my task in this post was to tell about simple and fast extraction of metadata from an Excel file using the olefile module. By the way, this module extracts metadata from a variety of file formats, more information can be found in the documentation .


It is also worth noting for the sake of fairness that olefile extracts metadata only from Excel 1997-2003 files, i.e. with the .xls extension. For .xlsx files, I wrote a converter with the pyexcel module.


Well that's all. Thank you for your attention, I hope my post will help someone save time, nerves and, most importantly, keep a good mood.



Source: https://habr.com/ru/post/439358/