Suppose I have the following regular expression (pattern):

r'<[a-zA-Z-]+:[a-zA-Z-]+ [a-zA-Z]+="([a-zA-Z-]+:[a-zA-Z]+)"><([a-zA-Z-]+:[a-zA-Z_]+)>[a-zA-Z0-9]+</[a-zA-Z-]+:[a-zA-Z_]+>'

Is it possible to specify that a certain part of a regular expression should repeat some number of times (not known in advance), for example this part: <([a-zA-Z-]+:[a-zA-Z_]+)>[a-zA-Z0-9]+</[a-zA-Z-]+:[a-zA-Z_]+> ?

I tried using {,}, but as far as I understand, this operator (?) only works for repeating single characters. Any help would be appreciated.

  • Parsing XML/HTML with regular expressions is a bad idea :) - gil9red
  • @gil9red But do you have a suggestion for how to get the needed string faster? If so, please share (speed matters to me) - ddsds
  • For example, use an XML/HTML parser and extract the data with a CSS selector or an XPath query. In general, if you do want to extract it with a regex, it is better to add the data to the question along with what should be extracted from it; that raises the chance of getting an answer. Few people will bother to decipher your regex; it is easier to write a new one :D - gil9red
  • @gil9red, I just noticed: the XML is parsed with lxml; the regex is needed to extract data from the string (which comes as input) and then look for that data in the XML (my head is already boiling) - ddsds
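
A minimal sketch of the parser-based approach suggested in the comments. It uses Python's standard xml.etree.ElementTree rather than lxml so that it runs without extra packages. The sample document, the `ns` prefix and its URI are invented for illustration, since the question does not include the real data.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; prefix, URI and tag names are made up.
SAMPLE = (
    '<ns:root xmlns:ns="http://example.com/ns" type="ns:record">'
    '<ns:item_a>abc123</ns:item_a>'
    '<ns:item_b>def456</ns:item_b>'
    '</ns:root>'
)

# Maps the prefix used in search paths to the namespace URI.
NAMESPACES = {"ns": "http://example.com/ns"}


def child_texts(xml_text):
    """Return (local tag name, text) for every direct child of the root."""
    root = ET.fromstring(xml_text)
    result = []
    for child in root:
        # ElementTree stores tags as "{uri}local"; keep only the local part.
        local_name = child.tag.rsplit("}", 1)[-1]
        result.append((local_name, child.text))
    return result


def find_text(xml_text, path):
    """Return the text of the first element matching a namespaced path."""
    root = ET.fromstring(xml_text)
    node = root.find(path, NAMESPACES)
    return None if node is None else node.text


pairs = child_texts(SAMPLE)
item_b = find_text(SAMPLE, "ns:item_b")
```

Here the number of child elements does not need to be known in advance: the loop simply visits however many there are.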

2 answers

Found the answer: just use the | operator, which lets you combine two regular expressions in a single search. In my case it looks like this: r'<[a-zA-Z-]+:[a-zA-Z-]+ [a-zA-Z]+="([a-zA-Z-]+:[a-zA-Z]+)">|<([a-zA-Z-]+:[a-zA-Z_]+)>[a-zA-Z0-9]+</[a-zA-Z-]+:[a-zA-Z_]+>'. The search returns a list whose items each have 2 elements: the first corresponds to the first expression, the second to the second. For each match only the alternative that was actually found is filled in (i.e. the other element is empty).
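
A small sketch of what this looks like with Python's re.findall, assuming that is the search function being used. The input string is a made-up example, not the asker's real data.

```python
import re

# The pattern from the answer: two alternatives, one capturing group each.
PATTERN = re.compile(
    r'<[a-zA-Z-]+:[a-zA-Z-]+ [a-zA-Z]+="([a-zA-Z-]+:[a-zA-Z]+)">'
    r'|<([a-zA-Z-]+:[a-zA-Z_]+)>[a-zA-Z0-9]+</[a-zA-Z-]+:[a-zA-Z_]+>'
)

# Hypothetical input: one opening tag with an attribute, then two elements.
SAMPLE = (
    '<ns:root type="ns:record">'
    '<ns:item_a>abc123</ns:item_a>'
    '<ns:item_b>def456</ns:item_b>'
)

# findall returns one tuple per match, with one slot per capturing group.
# The group of the alternative that did not match is an empty string.
matches = PATTERN.findall(SAMPLE)

# Collapse each tuple to whichever group actually matched.
flat = [first or second for first, second in matches]
```

Because findall keeps scanning until the input is exhausted, the second alternative is picked up as many times as it occurs.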

    that a certain part of a regular expression must be repeated a number of times (unknown in advance)

    In principle, that is what + and * are for:

     <[a-zA-Z-]+:[a-zA-Z-]+(\s+[a-zA-Z]+="([a-zA-Z-]+:[a-zA-Z]+)")*>
                           ^________________________________________^^

    Now the tag can have any number of attributes, although their content is still very restricted. Ideally, attribute values should be matched not with "([a-zA-Z-]+:[a-zA-Z]+)" but with "[^"]*" .
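
A quick check of this idea in Python, as a sketch: the group with `*` is combined with the looser `"[^"]*"` value pattern. The sample tags are invented for illustration.

```python
import re

# Tag name followed by zero or more attributes of the form name="value".
TAG = re.compile(r'<[a-zA-Z-]+:[a-zA-Z-]+(\s+[a-zA-Z]+="([^"]*)")*>')

# Hypothetical tags with zero, one and two attributes, plus a malformed one.
samples = [
    '<ns:root>',
    '<ns:root type="ns:record">',
    '<ns:root type="ns:record" lang="en">',
    '<ns:root type=ns>',  # unquoted value, should not match
]

# fullmatch succeeds only if the whole string fits the pattern.
matched = [TAG.fullmatch(s) is not None for s in samples]
```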

    • This option does not work: + and * repeat the sequence of elements, but the captured groups are not repeated (determined by trial and error) - ddsds
    • @ddsds, in C# you can get all of a group's captures: ideone.com/pcwUYk . I don't know about Python. - Qwertiy
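
On the Python side, the standard re module keeps only the last iteration of a repeated capturing group, which matches what ddsds observed. One possible workaround, sketched below, is a two-step match: capture the whole attribute section as one group, then run findall over it. The names TAG, ATTR and attributes are illustrative, and the sample tag is made up.

```python
import re

# Hypothetical input tag with two attributes.
SAMPLE = '<ns:root type="ns:record" lang="en">'

# A capturing group inside a repetition: only the last iteration is kept.
REPEATED = re.compile(r'<[a-zA-Z-]+:[a-zA-Z-]+(?:\s+[a-zA-Z]+="([^"]*)")*>')
last_only = REPEATED.fullmatch(SAMPLE).group(1)

# Step 1: capture the entire attribute section as a single group.
TAG = re.compile(r'<[a-zA-Z-]+:[a-zA-Z-]+((?:\s+[a-zA-Z]+="[^"]*")*)>')
# Step 2: pull out each name/value pair from that section.
ATTR = re.compile(r'([a-zA-Z]+)="([^"]*)"')


def attributes(tag_text):
    """Return all (name, value) pairs of a tag, or [] if it does not match."""
    m = TAG.fullmatch(tag_text)
    if m is None:
        return []
    return ATTR.findall(m.group(1))


all_pairs = attributes(SAMPLE)
```

The third-party regex package is another option, since its match objects provide a captures() method that returns every iteration of a group, but that is an extra dependency and is not used here.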