I'm using xml.etree.ElementTree to parse a large XML file sequentially with iterparse. Over time, though, the application eats more and more memory. It seems that although iterparse processes the document sequentially, it still builds up the entire document tree. How can I avoid that?

Preferably using the standard library rather than third-party parsers.
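A common standard-library recipe for this is to clear each element after processing it and to detach processed children from the root, so the partially built tree never grows. A minimal sketch (the `<item>` tags and the in-memory sample document are hypothetical stand-ins for a large file):

```python
import xml.etree.ElementTree as ET
from io import BytesIO

# Hypothetical small document standing in for a multi-gigabyte file.
xml_data = b"<root><item>1</item><item>2</item><item>3</item></root>"

total = 0
# Request "start" events as well, so the first event hands us the root.
context = ET.iterparse(BytesIO(xml_data), events=("start", "end"))
_, root = next(context)  # first event is ("start", root)

for event, elem in context:
    if event == "end" and elem.tag == "item":
        total += int(elem.text)
        # Detach already-processed children from the root; otherwise the
        # root keeps references to them and memory still grows.
        # (Note: clear() also wipes the root's text and attributes, so
        # save those first if you need them.)
        root.clear()

print(total)
```

The key point is the `root.clear()` call inside the loop: `iterparse` appends every finished element to the root, so without it the whole tree is retained even though you only ever look at one element at a time.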

  • There is a great utility called jq; it is used for parsing JSON. There are analogues for XML — I'd suggest looking at yq. – hedgehogues
  • Look towards xml.parsers.expat, example: github.com/gil9red/SimplePyScripts/blob/… It's very low-level, but it lets you process giant XML files efficiently. – gil9red
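To illustrate the expat suggestion above: `xml.parsers.expat` is purely event-driven and never builds a tree, so memory stays flat regardless of input size. A minimal sketch (the `item` tag name is a hypothetical example):

```python
import xml.parsers.expat

# Accumulators for the event handlers; expat gives us callbacks only,
# so any state we want must live outside the parser.
counts = {"items": 0}
chunks = []

def start_element(name, attrs):
    if name == "item":
        chunks.clear()  # start collecting text for a new element

def char_data(data):
    chunks.append(data)  # may be called several times per element

def end_element(name):
    if name == "item":
        counts["items"] += 1
        text = "".join(chunks)
        # ...process `text` here, then let it go out of scope...

p = xml.parsers.expat.ParserCreate()
p.StartElementHandler = start_element
p.EndElementHandler = end_element
p.CharacterDataHandler = char_data

# Feed the document in chunks; only the current chunk is ever in memory.
p.Parse(b"<root><item>1</item>", False)
p.Parse(b"<item>2</item></root>", True)

print(counts["items"])
```

The trade-off is exactly what the comment says: you manage all state yourself (note the text buffer, since character data can arrive in several callbacks), but nothing is retained between elements.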

1 answer

There is a great utility called jq; it is used for parsing JSON. There are analogues for XML — I'd suggest looking at yq.

Take a look at my answer here; it describes working with jq.

Standard parsers (à la xml.etree) have drawbacks: as a rule they are slow and carry overhead, as you noticed with memory. In addition, they scatter parsing logic throughout your code, which makes it hard to read and maintain. jq and its analogues let you localize parsing in a separate script, and such a script reads more easily step by step.