Which approach is more correct when writing a regular expression that should catch all entries in the NGINX access.log whose Referer field contains bs.serving-sys.com?

An example of an entry with this field:

174.109.109.115 www.auctiondirectusa.com - [01/Oct/2015:09:38:24 -0500] "GET /used-cars-raleigh-nc?utm_source=TU+MEDIA&utm_medium=MOBILE&utm_campaign=GEO+FENCING HTTP/1.1" 200 48476 "www.bs.serving-sys.com/BurstingPipe/adServer.bs" "Mozilla/5.0 (iPhone; CPU iPhone OS 9_0_1 like Mac OS X) AppleWebKit/601.1.46 (KHTML, like Gecko) Mobile/13A404" 1.285

 ^<HOST>.*"(GET|HEAD|POST).*HTTP.*".*bs.serving-sys.com.*".*$ 

This is the regular expression I use. It is not strict: if the referrer string appeared in the User-Agent field instead, the expression would still match, although that is very unlikely.
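The caveat about the User-Agent field can be checked quickly. Below is a minimal sketch in Python, assuming `<HOST>` expands to a simple IPv4 group; the `HOST` pattern and the log line are made up for illustration and are not taken from a real log.

```python
import re

# Hypothetical stand-in for <HOST>: a plain named IPv4 group (assumption).
HOST = r"(?P<host>\d{1,3}(?:\.\d{1,3}){3})"

# The loose expression from the question, with <HOST> substituted.
LOOSE = re.compile(
    "^" + HOST + r'.*"(GET|HEAD|POST).*HTTP.*".*bs.serving-sys.com.*".*$'
)

# Made-up line: the referrer is "-", the target string sits only in the User-Agent.
ua_only_line = (
    '10.0.0.1 example.com - [01/Oct/2015:09:38:24 -0500] '
    '"GET / HTTP/1.1" 200 512 "-" '
    '"TestAgent/1.0 (+http://bs.serving-sys.com/bot)" 0.010'
)

loose_hit = LOOSE.search(ua_only_line)
# The loose expression matches even though the referrer field is empty.
print(loose_hit is not None, loose_hit.group("host") if loose_hit else None)
```

So the false positive is real, even if rare in practice.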

I would like to know which approach is correct: should the expression be written strictly, so that the date and the site name are checked as well? I think that would be an unnecessary check. A line might fail it even though the referrer is the one we are looking for, and then we would get no match only because the site name, the date, or some other additional check failed.

PS: for <HOST> a group has already been defined that parses the IP address.

    1 answer

    1. If the double quotes are the same in all lines, replace .* with [^"]* (any number of characters other than a double quote).
    2. Add a check that only whitespace or digits appear between the request line and the referrer: [\s\d]+

       /^<HOST>.*"(?=GET|HEAD|POST)[^"]*HTTP[^"]*"[\s\d]+"[^"]*bs\.serving-sys\.com[^"]*".*$/ 

    Example: https://regex101.com/r/wG8rB9/1 (I added <HOST> at the beginning of the test lines so that the regular expression would not have to be changed)
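The same comparison can be reproduced locally. This is a rough sketch in Python rather than the regex101 test itself: `<HOST>` is replaced with a hypothetical IPv4 group, the first log line is a shortened version of the one in the question, and the second line is invented to put the target string only in the User-Agent.

```python
import re

# Hypothetical stand-in for <HOST>: a plain named IPv4 group (assumption).
HOST = r"(?P<host>\d{1,3}(?:\.\d{1,3}){3})"

# The stricter expression from the answer, without the /.../ delimiters.
STRICT = re.compile(
    "^" + HOST
    + r'.*"(?=GET|HEAD|POST)[^"]*HTTP[^"]*"[\s\d]+'
    + r'"[^"]*bs\.serving-sys\.com[^"]*".*$'
)

# Shortened version of the sample line: target string in the referrer field.
referrer_line = (
    '174.109.109.115 www.auctiondirectusa.com - [01/Oct/2015:09:38:24 -0500] '
    '"GET /used-cars-raleigh-nc?utm_source=TU+MEDIA HTTP/1.1" 200 48476 '
    '"www.bs.serving-sys.com/BurstingPipe/adServer.bs" '
    '"Mozilla/5.0 (iPhone; CPU iPhone OS 9_0_1 like Mac OS X)" 1.285'
)

# Made-up line: referrer is "-", target string only in the User-Agent.
ua_only_line = (
    '10.0.0.1 example.com - [01/Oct/2015:09:38:24 -0500] '
    '"GET / HTTP/1.1" 200 512 "-" '
    '"TestAgent/1.0 (+http://bs.serving-sys.com/bot)" 0.010'
)

strict_referrer = STRICT.search(referrer_line)  # expected to match
strict_ua_only = STRICT.search(ua_only_line)    # expected not to match
print(strict_referrer is not None, strict_ua_only is not None)
```

Because `[\s\d]+` only allows the status code and byte count between the request and the next quoted field, the target string has to sit in the field directly after them, which in this log format is the referrer.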

    • Yes, the quotes are the same everywhere. I also thought about [^"]*. But look: HTTP is always present inside the quoted request. So with .*HTTP we skip over everything and use HTTP as the stopping point? - Mihail Politaev
    • Why the (?=) group? Did you mean the non-capturing group (?:), so that no memory is spent on capturing? - Mihail Politaev
    • 1. Yes, you can use (?:) instead of (?=). 2. Before HTTP you can also use .*, and the check .*"(?=GET|HEAD|POST)[^"]* can be removed altogether. After HTTP, every [^"]* must stay. - Visman
    • Does the expression ^<HOST>.*"(GET|HEAD|POST) mean that no quotation mark will be matched other than the one that precedes (GET|HEAD|POST)? After all, this group comes right after the quotation mark. Or does the regex engine first consume everything up to the last quote and then, on seeing (GET|HEAD|POST), go back and match only the quote that precedes (GET|HEAD|POST)? If so, those are unnecessary operations, and as I understand it we should write ^<HOST>[^"]*"(GET|HEAD|POST). - Mihail Politaev
    • @MishaPolitaev, with .*"(GET|HEAD|POST) we catch the first occurrence of a quote followed by one of the three words or, if there is none, we run to the end of the line. With [^"]*" we reach the first quote, and if it is not followed by one of the three words, that is the end of it. - Visman
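The difference between the two prefixes can be observed directly. The sketch below uses Python's `re` module, which is a backtracking engine: there, a greedy `.*` first runs to the end of the line and then backs off, so it settles on the last quote that is followed by a method name, while `[^"]*"` can only ever stop at the first quote. The names `GREEDY` and `FIRST_QUOTE` and both log lines are made up for illustration, and `<HOST>` is left out to keep the example short.

```python
import re

GREEDY = re.compile(r'^.*"(GET|HEAD|POST)')
FIRST_QUOTE = re.compile(r'^[^"]*"(GET|HEAD|POST)')

# Made-up line whose User-Agent also starts with GET right after a quote.
line = (
    '10.0.0.1 - [01/Oct/2015:09:38:24 -0500] '
    '"GET / HTTP/1.1" 200 5 "-" "GETbot/1.0" 0.010'
)
request_quote = line.index('"')      # quote that opens the request
agent_quote = line.index('"GETbot')  # quote that opens the User-Agent

# Position of the quote each pattern anchored on (one char before group 1).
greedy_pos = GREEDY.match(line).start(1) - 1
first_pos = FIRST_QUOTE.match(line).start(1) - 1
print(greedy_pos == agent_quote, first_pos == request_quote)

# Made-up line where the first quoted field is not a GET/HEAD/POST request.
put_line = (
    '10.0.0.1 - [01/Oct/2015:09:38:24 -0500] '
    '"PUT /x HTTP/1.1" 200 5 "-" "POSTman/1.0" 0.010'
)
greedy_put = GREEDY.match(put_line)       # finds "POST inside the User-Agent
first_put = FIRST_QUOTE.match(put_line)   # fails at the first quote
print(greedy_put is not None, first_put is not None)
```

In both cases the `[^"]*"` prefix ties the method check to the request field, which is the stricter behaviour the question asks about.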