Regular expressions are quite resource-intensive compared to ordinary string operations. Moreover, their complexity is hidden behind a compact syntax: it is almost impossible to estimate which of several possible variants will be faster. The problem is further complicated by the various dialects (POSIX, Perl) and implementations (NFA, DFA) of regular expressions.

Is there a way / tool to calculate the number of operations that a regular expression will require to find a match in a given string? I am primarily interested in PCRE regular expressions.

  • 2
    I think you should start by reading this: perldoc.perl.org/re.html#'debug'-mode - KoVadim
  • 8
    regex101.com -> regex debugger. - Wiktor Stribiżew
  • Wouldn't simply measuring the time work? - Qwertiy
  • @Qwertiy, time is directly proportional to the number of steps taken. The number of steps, in turn, can be obtained from the debugger (@Wiktor Stribiżew, for example, has already given a link to it). - ߊߚߤߘ
  • 1
    @WiktorStribiżew I have seen more than once on SO pages how two regular expressions are compared and preference is given to the one that takes fewer steps, which is logical and correct. I would like to understand how this is done in practice. An answer in any form is fine: a website, a utility, an algorithm, whatever you use in practice to solve this problem. - cheops

4 Answers

The only reliable way to evaluate the performance of several regular expressions is to actually measure the search time, and necessarily on different text samples, because different samples take different amounts of time to process.

The step count that regex101 shows is meaningful (in my subjective opinion) only when you are building recursive regular expressions; there it lets you more or less judge the efficiency of the recursion.
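For instance, here is a minimal sketch of the kind of recursive pattern meant (the pattern and sample string are illustrative, not taken from the linked examples); PCRE's (?R) construct recurses into the whole pattern:

    // Hypothetical recursive PCRE pattern: matches one balanced group
    // of parentheses by recursing into the whole pattern with (?R).
    $re = '/\((?:[^()]++|(?R))*\)/';

    preg_match( $re, '(a(b)(c(d)))', $m );
    echo $m[0]; // prints "(a(b)(c(d)))", the whole balanced group

Pasting a pattern like this into regex101 shows the step count growing with nesting depth, which is exactly the kind of comparison where the counter helps.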

Compare three regular expressions:

  1. [az] - 33 steps https://regex101.com/r/vC5lG7/1
  2. ^[^az]*[az] - 4 steps https://regex101.com/r/vC5lG7/2
  3. ^[^az]*?[az] - 4 steps https://regex101.com/r/vC5lG7/3

Let's do a simple measurement of the execution time of these expressions:

    function test( $re, $text, $iters ) {
        // Time $iters runs of preg_match() with the given pattern and text.
        $t1 = microtime( true );
        for ( $i = 0; $i < $iters; $i++ ) {
            $res = preg_match( $re, $text );
        }
        $t2 = microtime( true );
        echo ( $t2 - $t1 ) . "<br/>";
    }

    $re1 = "/[az]/";
    $re2 = "/^[^az]*[az]/";
    $re3 = "/^[^az]*?[az]/";
    $text = " a";
    $iters = 50000000;

    test( $re1, $text, $iters );
    test( $re2, $text, $iters );
    test( $re3, $text, $iters );

Result:

    9.5385589599609
    9.1712720394135
    9.6358540058136

The first regular expression is only 4% slower than the second, even though the step counts were 33 vs. 4.
The first regular expression is 1% faster than the third, even though the step counts were 33 vs. 4.

Draw your own conclusions.


PS

necessarily on different text samples, because different samples take different amounts of time to process

Let me explain briefly.

Example
Take a large piece of text where the match is at the very end. In such a text, a greedy quantifier will find it faster than a lazy one; but if the match is at the beginning of a large text, then the lazy quantifier will find it faster. A sketch follows.
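Here is a minimal sketch of this effect, reusing the test() helper from above (patterns and sample text are illustrative):

    // Hypothetical demonstration: the position of the match decides
    // whether the greedy or the lazy quantifier is faster.
    $big     = str_repeat( "x", 100000 );
    $atEnd   = $big . "a";   // match at the very end of a large text
    $atStart = "a" . $big;   // match at the very beginning
    $iters   = 100;

    test( "/.*a/s",  $atEnd,   $iters );  // greedy: fast, almost no backtracking
    test( "/.*?a/s", $atEnd,   $iters );  // lazy: slow, advances char by char
    test( "/.*a/s",  $atStart, $iters );  // greedy: slow, backtracks the whole text
    test( "/.*?a/s", $atStart, $iters );  // lazy: fast, match found immediately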

    Is there a way / tool to calculate the number of operations that a regular expression will require to find a match in a given string?

    Visit regex101.com.

    Enter \d{4} in the regular expression window, and see:

    [Screenshot: regex101 showing the match and the step count for \d{4}]

    The calculation took 13 steps, i.e. the number of calls to the callout function of a PCRE library for a pattern compiled with the PCRE_AUTO_CALLOUT option.

    Here is what Lucas Trzesniewski writes:

    PCRE supports the PCRE_AUTO_CALLOUT flag. I'm 99.99% sure that's how regex101's debugger works: each callout invocation registers as a step in the debugger.

    To see a detailed picture of what happens when this regular expression is matched against a string, click regex debugger:

    [Screenshot: the regex101 regex debugger view]

    Regular Expression Performance and Step Count

    This number does not directly indicate how fast a particular regular-expression library will process the expression. You can only compare several similar patterns on the same sample text and draw conclusions about their relative behavior.

    For real performance testing of a regular expression, use a simple method (a sketch follows the list):

    1) Prepare a loop (100-200 thousand iterations), the input strings, and the pattern.
    2) Record the time before the matching starts and after it ends.
    3) Calculate the average time spent on each iteration.
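    A minimal sketch of this method in PHP (the pattern, text, and iteration count are illustrative):

        // Sketch: measure the average matching time per iteration.
        $re    = '/\d{4}/';       // pattern under test
        $text  = 'abc 2017 def';  // sample input line
        $iters = 200000;          // 100-200 thousand iterations

        $t1 = microtime( true );
        for ( $i = 0; $i < $iters; $i++ ) {
            preg_match( $re, $text );
        }
        $t2 = microtime( true );

        // average time per iteration, in microseconds
        echo ( ( $t2 - $t1 ) / $iters ) * 1e6 . " us/iter\n";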

      It is better, it seems to me, to remember that regular expressions are a utility: they exist not to search quickly, but to search with compact, universal patterns, without wasting time on developing custom analysis algorithms.

      If it turns out that the speed of a regex is critical, then at that point you should switch to analyzing the text with a hand-written algorithm: its performance is easier to control, and you gain speed.

      If you need to parse specialized formats such as HTML, remember that there are far more convenient dedicated libraries for parsing them, which make the parser both more readable and faster.

      • 6
        And did you know that for regular languages, regular expressions are faster than a hand-written algorithm? - ReinRaus
      • @ReinRaus and how is that? - gecube
      • It makes sense: a regex search runs as compiled code, which is usually somewhat faster than interpreted code. But why fire a cannon at sparrows, why hammer nails with a screwdriver? Give me a practical example where a regex must run on a large text and actually turns out to be slow. - Goncharov Alexander
      • 1) It would be interesting to see statistics on the use of interpreted versus compiled languages. I suspect most code is written in interpreted languages (PHP, JS, Python), which means it is wrong to recommend against regular expressions on the grounds that they are slow. - ReinRaus
      • 1
        Besides very large texts, there is another facet of the issue: small texts, and it is small texts that programs process most often. Here regular expressions significantly outpace "pure" hand-written implementations in speed. For example, text.replace( /[^a-z0-9а-я]+/i, "" ) in JS beats a clean hand-written implementation. Sanitizing a string this way is a very common task. And compare the difference in the amount of code and how easy it is to understand. - ReinRaus
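      Here is a hedged PHP transposition of the comparison in the last comment (the task and both implementations are illustrative; the original example was in JS, and its Cyrillic range is omitted to keep the byte-wise loop simple):

          // Strip all characters except ASCII letters and digits,
          // once with a regex and once with a hand-written loop.
          function stripWithRegex( $s ) {
              return preg_replace( '/[^a-z0-9]+/i', '', $s );
          }

          function stripByHand( $s ) {
              $out = '';
              for ( $i = 0, $n = strlen( $s ); $i < $n; $i++ ) {
                  if ( ctype_alnum( $s[$i] ) ) {
                      $out .= $s[$i];
                  }
              }
              return $out;
          }

          echo stripWithRegex( ' a b-c 1.2,3 ' ); // abc123
          echo stripByHand( ' a b-c 1.2,3 ' );    // abc123

      Benchmarking both with a loop, as in the answers above, shows which side wins on your data; either way, the regex version is clearly shorter.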

      Regular expressions are quite resource-intensive compared to ordinary string operations.

      And how resource-intensive are they compared to disk operations, or database queries, or pushing data through sockets...?

      But let's imagine that:

      1. We have a spherical horse in a vacuum
      2. The questioner has not yet learned the word "profiler"

      How can one evaluate the "efficiency" of one approach or another? Obviously, by starting from the definition of that very efficiency. Classically it is measured either in memory or in speed.

      OK, how do you measure speed? Surprise: record the time before and after, and calculate the difference! It is so banal that I will not give examples, so as not to insult anyone.

      How do you estimate memory costs? That is already more complicated: it all depends on the execution environment. Hence, "the question needs clarification" :-)


      • 2
        If I understood the question correctly, it is not quite about the speed you describe in the answer-comments (since this is not really an answer), but about "Is there a way / tool to calculate the number of operations that a regular expression will require to find a match in a given string?" - edem
      • 1
        @edem Well, then it turns out to be a somewhat "academic" question :) The number of operations is determined in advance by the regexp itself, taking the input data into account. But what that means for a particular implementation, and for practical efficiency, is an entirely different question :-) - PinkTux