I had to work with broken SVG files in which most of the errors fall into two types:

  1. A tag was never closed: <svg><g>foo</svg>

  2. A tag was closed twice: <svg><g>foo</g></g>bar</svg>

When faced with the first type of error, I used the feature built into lxml:

 import lxml.etree

 parser = lxml.etree.XMLParser(recover=True)
 svg = lxml.etree.XML(svgdata, parser=parser)

This produces <svg><g>foo</g></svg>, and everything is fine.

However, for the second type of error it also returns <svg><g>foo</g></svg>; that is, it loses bar (the part of the file after the extra closing tag). I would like to get <svg><g>foo</g>bar</svg> instead.

Are there any ready-made solutions that fix both types of errors, or do I have to reinvent the wheel?

    1 answer

    You can use BeautifulSoup:

    Code:

     from bs4 import BeautifulSoup

     badString = "<svg><g>foo</g></g>bar</svg>"
     print(BeautifulSoup(badString, 'lxml').html.body.next)

    Result:

     # <svg><g>foo</g>bar</svg> 
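
The same idea can be wrapped in a small helper that covers both kinds of breakage. This is only a sketch: `repair_svg` is a made-up name, it assumes `bs4` and `lxml` are installed, and it picks the first `<svg>` element via `soup.svg` instead of `.html.body.next`, so the `<html><body>` wrapper added by the parser is skipped. Keep in mind that HTML-oriented parsers generally lowercase tag and attribute names, so case-sensitive SVG names such as `viewBox` are worth checking afterwards.

```python
from bs4 import BeautifulSoup


def repair_svg(markup: str) -> str:
    """Return the first <svg> element of `markup`, re-serialized with balanced tags.

    Hypothetical helper for illustration; not part of bs4 or lxml.
    """
    soup = BeautifulSoup(markup, "lxml")  # lxml's lenient HTML parser
    svg = soup.svg  # first <svg> tag, or None if there is none
    if svg is None:
        raise ValueError("no <svg> element found")
    return str(svg)


if __name__ == "__main__":
    # One sample per error type from the question.
    for broken in ("<svg><g>foo</svg>", "<svg><g>foo</g></g>bar</svg>"):
        print(repair_svg(broken))
```

Going through `soup.svg` also means any junk before the `<svg>` element is ignored rather than returned.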
    • <html><body>, uh-oh :) - andreymal
    • Cutting those off is no problem, I think :). Corrected the answer. - Dmitry Erohin
    • BeautifulSoup(), especially if you specify html5lib, can recognize a fairly wide range of broken HTML/XML documents. - jfs
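
If adding a dependency is not an option, rolling your own repair for just these two error types does not take much code. Below is a rough sketch using only the standard library. `repair_tags` and `TAG_RE` are hypothetical names, and the regex is a deliberate simplification: it assumes there is no `>` inside attribute values and passes comments, CDATA, and processing instructions through untouched.

```python
import re

# Matches <name ...>, <name ... /> and </name>. Simplified on purpose.
TAG_RE = re.compile(r"<(/?)([A-Za-z_][\w:.-]*)([^<>]*?)(/?)>")


def repair_tags(markup: str) -> str:
    """Balance tags: close unclosed elements, drop unmatched closing tags."""
    out = []    # pieces of the repaired document
    stack = []  # names of currently open elements
    pos = 0
    for m in TAG_RE.finditer(markup):
        out.append(markup[pos:m.start()])  # text between tags is kept as-is
        pos = m.end()
        closing, name, _attrs, self_closing = m.groups()
        if not closing:
            out.append(m.group(0))  # start tag copied verbatim
            if not self_closing:
                stack.append(name)
        elif name in stack:
            # Close still-open children first (error type 1), then the tag itself.
            while stack:
                top = stack.pop()
                out.append(f"</{top}>")
                if top == name:
                    break
        # Otherwise the closing tag has no open partner (error type 2): drop it.
    out.append(markup[pos:])
    # Close whatever is still open at the end of the input.
    while stack:
        out.append(f"</{stack.pop()}>")
    return "".join(out)


if __name__ == "__main__":
    print(repair_tags("<svg><g>foo</svg>"))
    print(repair_tags("<svg><g>foo</g></g>bar</svg>"))
```

Unlike the HTML parsers, this keeps tag and attribute names exactly as written, at the cost of handling far fewer kinds of damage.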