Tell me how to open and read data from office files in python3, such as odt, doc, docx and rtf. At least odt.

I am aware that odt and docx are essentially archives, so in theory you can unpack them and look at the content.xml file (if I'm not mistaken), but maybe there are more modern or convenient ways.
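For reference, the unpacking idea can be sketched with the standard library alone. This is a minimal sketch, not a full ODF parser: the helper name odt_paragraphs is hypothetical, it assumes a simple text document, and it does not expand special elements such as tabs or repeated spaces. Note that content.xml is the odt layout; docx archives keep their text in a different file.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of text elements in the OpenDocument format
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def odt_paragraphs(path):
    """Return the text of every paragraph and heading in an .odt file.

    Hypothetical helper for illustration; assumes a simple document.
    """
    # An .odt file is a zip archive; the document body lives in content.xml
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("content.xml"))
    paragraphs = []
    for element in root.iter():
        if element.tag in (f"{{{TEXT_NS}}}p", f"{{{TEXT_NS}}}h"):
            # itertext() also collects text nested in spans and links
            paragraphs.append("".join(element.itertext()))
    return paragraphs
```

Calling odt_paragraphs('text.odt') would then give a list of strings, one per paragraph or heading.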

All I have found so far is how to create ods spreadsheets.

I found the modules uno and pyoo, but everywhere the descriptions cover how to create spreadsheets; I did not find how to get data out of office documents.

The task is to go through all the existing files in a directory (including subdirectories), find or analyze the necessary ones, and write the result to a separate file.
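The directory traversal part of this task can be sketched with pathlib. This is only an illustration: the names OFFICE_SUFFIXES, find_office_files and write_report are hypothetical, and the "analysis" here is reduced to listing the matching paths in a report file.

```python
from pathlib import Path

# Extensions taken from the question; adjust as needed
OFFICE_SUFFIXES = {".odt", ".doc", ".docx", ".rtf"}


def find_office_files(root):
    """Return all office documents under root, including subdirectories."""
    return sorted(
        path for path in Path(root).rglob("*")
        if path.is_file() and path.suffix.lower() in OFFICE_SUFFIXES
    )


def write_report(root, report_path):
    """Write one found file path per line to report_path; return the count."""
    files = find_office_files(root)
    with open(report_path, "w", encoding="utf-8") as report:
        for path in files:
            report.write(f"{path}\n")
    return len(files)
```

In a real script the loop in write_report would call whatever function reads and analyzes each document instead of just writing its path.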

Right now this is partially implemented in bash; I want to rewrite everything in python3.

Tell me or show me where to look.

    2 answers

    Well, offhand a couple of libraries:

    1. https://pypi.python.org/pypi/ezodf
    2. https://github.com/eea/odfpy

    Of course, working through the OpenOffice services is the more correct "way of the samurai", but for that you need at least a "headless" OpenOffice, and one may not be available. In addition, it seems to me that the OpenOffice services will disappoint in terms of performance when processing a large number of files, although you do get the full functionality.

    By the way, keep in mind that when using OpenOffice you will have to follow the Java API documentation and adapt it to Python.

    • I will try it now, thank you... for the time being I need to read the file and pull a selection from it... and then we'll see - Sober
    • I was able to create files and make changes... but I can't read odt files... they seem to have body, content and text, but in response I get either None or <ezodf.body.TextBody object at 0x7fbabf2db208>... the example on GitHub only shows how to create and modify documents, not how to read them... apparently not a task in demand) - Sober
    • Strange. This line, pi = sheet[1, 1].value, is exactly a read from a cell. There are clear examples of reading in the documentation. Maybe you could expand your question on the "find or analyze the necessary ones" part, or write a new one? Then it would be clearer how much functionality you need, and I could at least write an example. - tutankhamun
    • I can get data from a spreadsheet, but not from a plain document) pi = sheet[1, 1].value worked with the spreadsheet... in theory doc = opendoc('file.odt') followed by doc.body.text or doc.content.text should help... but they return None, although the file does contain text. The entire text of the document needs to go into a list, word by word, that is what is needed, in short - Sober
    • I did not find anything about doc.body.text anywhere, but if you iterate over doc.body with a for loop you can find elements with the content you need - tutankhamun

    I will post this as an answer so that nobody has to dig through the comments, if tutankhamun does not mind; if he does, he can add it to his answer and I will delete mine.

    So, the problem was solved with the ezodf module (its documentation is sparse). Be careful when installing if you have both Python 2 and Python 3; for Python 3 I installed it with python3 setup.py install.

    A small code sample for clarity.

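        import re

        import ezodf

        odt = ezodf.opendoc('/home/user/python/text.odt')
        list = []  # collected words (note: this name shadows the built-in list)
        # Run a for loop over everything found in the document body
        for i in odt.body:
            if i.text is None:
                print('no')
            else:
                # lowercase the text and add every word to the list
                list.extend(re.findall(r"[\w']+", i.text.lower()))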

    To explain: I used i.text instead of i.plaintext() in order to catch the several elements whose value is None (apparently service data that I did not dig into). plaintext() simply adds empty elements to the list, and at the time it seemed to me that going through text would be faster, though I may rethink that in the morning)

    And here, with list.extend(re.findall(r"[\w']+", i.text.lower())), I extend the existing list with the new items. The regular expression selects all the words (every word from the document goes into the list), the text is lowercased first, and that's it.
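    The word-splitting step can be tried on its own, without ezodf. In this small sketch the sample strings stand in for the text of document elements (with None for an element that has no text), and the names WORD_PATTERN and split_words are hypothetical.

```python
import re

# Same pattern as in the answer: runs of word characters and apostrophes
WORD_PATTERN = re.compile(r"[\w']+")


def split_words(text):
    """Lowercase the text and return its words as a list."""
    return WORD_PATTERN.findall(text.lower())


words = []
# Sample data standing in for element texts; None mimics an empty element
for paragraph in ["Don't Panic!", None, "It's OK, really."]:
    if paragraph is None:
        continue  # skip elements without text
    words.extend(split_words(paragraph))

print(words)  # ["don't", 'panic', "it's", 'ok', 'really']
```

    Because the apostrophe is part of the character class, contractions such as "don't" stay as one word, while other punctuation is dropped.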

    This is only a fragment, so it may not look great and plenty could be added, but at least it is now clear how to read documents.

    Thanks to tutankhamun for the tips.