Where can I get a docx parser?

Closed as off-topic by Xander, ߊߚߤߘ, Suvitruf, andreymal, Darth on 17 Oct '17 at 12:17.

  • Most likely, this question is off-topic for Stack Overflow in Russian, according to the rules described in the help center.
If the question can be reworded to fit the rules set out in the help center, edit it.

  • I don't know of a ready-made parser, but in principle you can write one yourself: a docx file is a zip archive with XML inside. If you only need the text from a document, you can get it easily. - Dejsving
  • That's a good idea: then there is no need to install new modules. - Alexey Zakharenkov
  • But how do I do that? Maybe point me to where I can read about it. - Alexey Zakharenkov
  • Look for a module that opens archives (most likely one already ships with the standard set of modules), and then parse the XML. Or am I misunderstanding the question? If you're interested, take any docx and open it with an archiver. - Dejsving
  • It all depends on one's programming level and ability to figure things out independently. The community was created so that people can get advice or help. You, too, may ask questions that seem obvious to someone else. - Alexey Zakharenkov
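The approach suggested in the comments above (open the docx as a zip archive, then parse the XML inside) can be sketched with the standard library alone. This is a minimal illustration, not a complete parser: the function name docx_to_text is made up for this example, and it assumes the body text lives in word/document.xml, which is where Word normally puts it. Headers, footers, footnotes and text boxes are not handled.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of WordprocessingML elements such as <w:p> and <w:t>
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(path):
    """Return the body text of a .docx file, one line per paragraph.

    Hypothetical helper for illustration only.
    """
    # A .docx file is a zip archive; the main body is word/document.xml
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    lines = []
    # Each <w:p> is a paragraph; its text is spread over <w:t> elements
    for paragraph in root.iter(W_NS + "p"):
        pieces = [node.text or "" for node in paragraph.iter(W_NS + "t")]
        lines.append("".join(pieces))
    return "\n".join(lines)
```

Calling docx_to_text("some_file.docx") returns the text as one string; paragraphs inside table cells are included too, because iter() walks the whole tree.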

2 answers

There is a great module for working with docx, python-docx. Install it with: pip install python-docx

    import re

    from docx import Document

    document = Document("Обеденное меню 777.docx")

    # Regex that matches runs of two or more consecutive spaces
    multi_space_pattern = re.compile(r'[ ]{2,}')

    for table in document.tables:
        for row in table.rows:
            # Each row is expected to have exactly three cells
            name, weight, price = [multi_space_pattern.sub(' ', i.text.strip()) for i in row.cells]

            # Section header: all cells hold the same text (merged cells),
            # or the weight or price is missing
            if name == weight == price or (not weight or not price):
                print()
                name = name.title()
                print(name)
                continue

            print('{} {} {}'.format(name, weight, price))

        # The tables in the menu are duplicated, so only the first one is read
        break

Console:

    Обеденное Меню

    Салаты
    Салат «Винегрет» 150 гр 45 руб.
    Салат с сёмгой (вареная сёмга, рис отв., св. огурец, яйцо, соус «тар-тар») 150 гр 60 руб.

    Супы
    Солянка 250 мл 60 руб.
    Суп грибной «Лесная поляна» 250 мл 55 руб.
    ...

You can download this script along with an example here: https://github.com/gil9red/SimplePyScripts/tree/6c64ecb4a6cea678892edd0a6db2bbc23d7e020e/read_docx

  • Thank you, very interesting. I will test it now and report back. - Alexey Zakharenkov

Run 'pip search docx parsing' in the console.

Try the modules python-opc, python-docx and docx2txt; all of them can be installed with pip.