Why does the same regular expression work fine on sites like pythex.org but always return an empty list when I run it in my program?

The string I'm trying to parse:

info = u'\nOutage start time:\r\n 3/23/2017 5:11:12 AM\n\nEstimated restoration time:\n\n\n\n\n\nEstimated customers impacted:\r\n1\n\nReason:\r\n An object has made contact with power lines in your area. SRP crews are working to restore power as quickly as possible.\n\nImpacted area:\r\nS SCHNEPF RD to N QUAIL RUN LN and E JUDD RD to W MAGMA RD\n\n' 

Regular expression: (?<=start time:)(.*?)(?=Estimated)

The result on the site: 3/23/2017 5:11:12 AM

Result in the interpreter (Python 2.7):

    >>> re.findall(r'(?<=start time:)(.*?)(?=Estimated)', info, re.UNICODE)
    []
    >>> re.findall(ur'(?<=start time:)(.*?)(?=Estimated)', info, re.UNICODE)
    []

2 answers

The re.DOTALL flag is missing: the input spans several lines, and without that flag the dot does not match a newline character.

    >>> re.findall(ur'(?<=start time:)(.*?)(?=Estimated)', info, re.UNICODE|re.DOTALL)
    [u'\r\n 3/23/2017 5:11:12 AM\n\n']

Alternatively, you can match the whitespace explicitly in the regex (if we only care about dates that contain no line break):

    >>> re.findall(ur'(?<=start time:)\s*(.*?)\s*(?=Estimated)', info, re.UNICODE)
    [u'3/23/2017 5:11:12 AM']
  • re.DOTALL is not needed here, since the required information is on one line. - Wiktor Stribiżew
  • @WiktorStribiżew that is incorrect. Look at the result of the first example: you can clearly see that it starts with "\r\n". In Python (and many other languages), "\n" is a newline character. That is, the "required information" (the date) is on the second line at the earliest. - jfs
  • Nobody disputes that the information is on the second line. re.DOTALL changes the behavior of the dot, not which lines a match can be found on. Since this difference is given as the main cause of the problem, this answer is incorrect. In addition, the lookbehind and lookahead are unnecessary here, since re.findall returns the contents of the capturing groups if any are defined in the expression (in this case, they are). - Wiktor Stribiżew
  • @WiktorStribiżew at least read the title of the question: re.DOTALL can explain why the same regular expression works in one case and not in the other. Beyond that, there are many ways to write a working regex (a couple are given in the answer). - jfs
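
The disagreement in the comments above comes down to what the dot is allowed to match. Here is a minimal sketch comparing the three variants side by side. It is written so that it should also run on Python 3, hence plain r'' patterns instead of ur''. The shortened `sample` string is an assumption standing in for the full `info` value from the question.

```python
import re

# Shortened stand-in for the question's `info` string (an assumption, for brevity).
sample = u'\nOutage start time:\r\n 3/23/2017 5:11:12 AM\n\nEstimated restoration time:\n'

pattern = r'(?<=start time:)(.*?)(?=Estimated)'

# Without re.DOTALL, "." matches any character except "\n", so ".*?" cannot
# get from "start time:" across the line breaks to "Estimated".
without_dotall = re.findall(pattern, sample)

# With re.DOTALL, "." also matches "\n", so the lazy ".*?" can span the breaks.
with_dotall = re.findall(pattern, sample, re.DOTALL)

# Consuming the surrounding whitespace with \s* works without the flag:
# \s matches "\r" and "\n", and the date itself sits on a single line.
with_whitespace = re.findall(r'(?<=start time:)\s*(.*?)\s*(?=Estimated)', sample)

print(without_dotall)
print(with_dotall)
print(with_whitespace)
```

On this sample the first call finds nothing, the second returns the date with the surrounding line breaks still attached, and the third returns the bare date.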

If you work with Unicode strings, you need to convert all strings to the required form (either by declaring string literals with the u prefix, or by using .decode / .encode).

Working code using your string:

    # -*- coding: utf-8 -*-
    import re

    info = u'\nOutage start time:\r\n 3/23/2017 5:11:12 AM\n\nEstimated restoration time:\n\n\n\n\n\nEstimated customers impacted:\r\n1\n\nReason:\r\n An object has made contact with power lines in your area. SRP crews are working to restore power as quickly as possible.\n\nImpacted area:\r\nS SCHNEPF RD to N QUAIL RUN LN and E JUDD RD to W MAGMA RD\n\n'

    print([x.encode('utf8') for x in re.findall(ur'start time:\s*(.*?)\s*Estimated', info)])
    # => ['3/23/2017 5:11:12 AM']
  • But my string already has the u prefix, isn't that enough? - sky
  • No, not enough. By the way, the re.UNICODE modifier is not needed, as there are no predefined character classes \w, \d or even the word boundary \b in your regular expression (with re.UNICODE, these constructs start to recognize the corresponding Unicode characters). - Wiktor Stribiżew
  • Your option works, thanks. - sky
  • Please, if my answer helped you solve the problem, mark it as accepted (see "What should I do with the answers to my question?"). - Wiktor Stribiżew
  • One more question, please. I see you changed the regular expression, so was the original version wrong after all? - sky
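
The pattern in this answer drops the lookbehind and lookahead and relies on re.findall returning only the contents of a capturing group when the pattern defines one. A minimal sketch of that behavior follows. The shortened `sample` string and the non-capturing variant are assumptions added for illustration, and the code is written so that it should also run on Python 3.

```python
import re

# Shortened stand-in for the question's `info` string (an assumption, for brevity).
sample = u'Outage start time:\r\n 3/23/2017 5:11:12 AM\n\nEstimated restoration time:\n'

# One capturing group: findall returns only the text captured by the group,
# even though the overall match also covers "start time:" and "Estimated".
grouped = re.findall(r'start time:\s*(.*?)\s*Estimated', sample)

# Same pattern with a non-capturing group (?:...): findall now returns the
# whole match, including the literal text on both sides.
whole = re.findall(r'start time:\s*(?:.*?)\s*Estimated', sample)

print(grouped)
print(whole)
```

With the capturing group there is nothing to trim afterwards, so the lookarounds from the question's pattern are not needed to keep "start time:" and "Estimated" out of the result.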