There are many lines of the form:

a = '<httpSample t="51" lt="51" ts="1478854873129" s="true" lb="Enter SHI" rc="200" rm="OK" tn="Sorting 1-21" dt="text" by="1749"/>' 

The task: find a faster way to get the values of the attributes "t" and "lb" from strings like "a" than the options shown below (to clarify, the main criterion is time: it must be lower than in the options I present):

  1. Not robust (breaks if the number of attributes suddenly changes), but fast (while the number of attributes stays constant):

     def x():
         b = a.split('"')
         xxx, yyy = b[1], b[9]
  2. Robust, but takes 6 times longer than x():

     import xml.etree.cElementTree as ET

     def y():
         tree = ET.fromstring(a)
         xxx = tree.attrib['t']
         yyy = tree.attrib['lb']

You can use this to compare the timings:

 from timeit import timeit
 print timeit(x, number=3000000)
 print timeit(y, number=3000000)
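
For reference, a self-contained version of the same benchmark, assuming Python 3 (the print statements above are Python 2 syntax); the names x and y mirror the code above:

 from timeit import timeit
 import xml.etree.ElementTree as ET

 a = '<httpSample t="51" lt="51" ts="1478854873129" s="true" lb="Enter SHI" rc="200" rm="OK" tn="Sorting 1-21" dt="text" by="1749"/>'

 def x():
     # fast but positional: relies on "t" and "lb" being the 1st and 5th attributes
     b = a.split('"')
     return b[1], b[9]

 def y():
     # robust: a real XML parse, but noticeably slower
     attrs = ET.fromstring(a).attrib
     return attrs['t'], attrs['lb']

 if __name__ == '__main__':
     print(timeit(x, number=3000000))
     print(timeit(y, number=3000000))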

    2 answers

    For example, do fewer split operations. Split in two stages: first on 'lb="' into two parts, then split the remainder into two parts at the first quote. Splitting on 'lb="' or 't="' also solves the problem of the attribute count suddenly changing:

     def get_key(stri: str, key: str):
         return stri.split('%s="' % key, 1)[1].split('"', 1)[0]

     get_key(a, 'lb')

     def z():
         a.split('t="', 1)[1].split('"', 1)[0]
         a.split('lb="', 1)[1].split('"', 1)[0]

     >>> x - 2.4 s
     >>> z - 2.8 s

    It turned out a little slower here; with more attributes in the line, this approach would come out faster.
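
    One way to check that claim (it is not measured in the answer) is to time both approaches on a synthetic line with extra attributes; the ex0..ex19 attribute names below are made up for the test:

     from timeit import timeit

     # the original line padded with 20 made-up attributes, so the full split('"')
     # has to produce many more pieces than the two targeted splits
     extra = ' '.join('ex%d="%d"' % (i, i) for i in range(20))
     a = '<httpSample t="51" lt="51" %s lb="Enter SHI" rc="200"/>' % extra

     def x_full_split():
         b = a.split('"')     # splits at every quote in the line
         return b[1], b[-4]   # "lb" is the second-to-last attribute here

     def z_targeted():
         t = a.split('t="', 1)[1].split('"', 1)[0]
         lb = a.split('lb="', 1)[1].split('"', 1)[0]
         return t, lb

     print(timeit(x_full_split, number=1000000))
     print(timeit(z_targeted, number=1000000))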

    • Thank you for your option - it is definitely useful to me when the number of attributes changes - adopted). But for my current task I need to reduce the processing time. I checked your version on 1,000,000 different lines with the same number of attributes - it runs a little slower (as you said), but I put it in an except branch as a fallback, an alternative to your quick version - a huge plus to your karma) - Anton Babak

    You can use a process pool; on 10 million rows the speedup is almost 2x:

     import time
     import multiprocessing

     A = '<httpSample t="51" lt="51" ts="1478854873129" s="true" lb="Enter SHI" rc="200" rm="OK" tn="Sorting 1-21" dt="text" by="1749"/>'
     data = [A] * 10**7

     def x1(a):
         b = a.split('"')
         return b[1], b[9]

     def x2(a):
         # bound the split: 10 quotes are enough to isolate b[1] ("t") and b[9] ("lb")
         b = a.split('"', 10)
         return b[1], b[9]

     if __name__ == '__main__':
         # single process
         t = time.time()
         list(map(x1, data))
         print(time.time() - t, 'no pool')

         # in a pool
         pool = multiprocessing.Pool(processes=4)
         t = time.time()
         pool.map(x1, data)
         print(time.time() - t, 'pool')

     >>> x1 (24.128000020980835, 'no pool')
     >>> x1 (13.368000030517578, 'pool')
     >>> x2 (13.958999872207642, 'no pool')
     >>> x2 (11.858999967575073, 'pool')

    If we bound the split with a maxsplit as in x2(), the single-process version already approaches the pool time.

    • changed it to python2 - vadim vaduxa
    • Thank you very much, I will rewrite my script and then report back how much I managed to speed it up; there are also additional calculations that could be sped up. By the way, I'll add a comment about a construction that does not work in Python 2: with ... as on a pool. New in Python 3.3: Pool objects now support the context management protocol (see Context Manager Types); __enter__() returns the pool object, and __exit__() calls terminate(). - Anton Babak
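
    Following up on that comment, a minimal sketch (assuming Python 3.3+, where Pool supports the context-management protocol) that runs the bounded-split variant in a pool via with ... as; the chunksize value is an arbitrary illustration, not something benchmarked above:

     import multiprocessing

     A = '<httpSample t="51" lt="51" ts="1478854873129" s="true" lb="Enter SHI" rc="200" rm="OK" tn="Sorting 1-21" dt="text" by="1749"/>'
     data = [A] * 10**6   # smaller than the 10**7 above, just for a quick check

     def x2(a):
         # 10 quotes are enough to isolate b[1] ("t") and b[9] ("lb")
         b = a.split('"', 10)
         return b[1], b[9]

     if __name__ == '__main__':
         # Pool as a context manager: __exit__() calls terminate() on exit
         with multiprocessing.Pool(processes=4) as pool:
             results = pool.map(x2, data, chunksize=10000)
         print(results[0])   # ('51', 'Enter SHI')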