sregex_iterator does not match the string

Question

I have a string of HTML code that needs to be parsed using regular expressions. I need to store in a std::vector all the URLs on the page that appear in href="". My C++ code using std::regex does not work.

    #include <regex>
    #include <iostream>
    #include <string>

    using std::string;
    using std::regex;
    using std::cout;
    using std::endl;
    using std::sregex_iterator;
    using std::smatch;

    int main() {
        string subject("<head><title>Search engines</title></head><body><a href=\"https://yandex.ru\">Yandex</a><a href=\"https://google.com\"></a></body>");
        try {
            regex re("<\\s*A\\s+[^>]*href\\s*=\\s*\"([^\"]*)\"");
            sregex_iterator next(subject.begin(), subject.end(), re);
            sregex_iterator end;
            if (next == end)
                cout << "Oops" << endl;
            while (next != end) {
                smatch match = *next;
                cout << match.str() << endl;
                next++;
            }
        } catch (std::regex_error& e) {
            ; // Syntax error in the regular expression
        }
        return 0;
    }

Only the Python version works.

    #!/usr/bin/python3
    import re

    html = '<head><title>Search engines</title></head><body><a href="https://yandex.ru">Yandex</a><a href="https:/google.com"></a></body>'
    title = re.findall(r'<title>(.*?)</title>', html)[0]
    links = [x[1] for x in re.findall(r'<a\s+(?:[^>]*?\s+)?href=(["\'])(.*?)\1', html)]
    print(title)
    print(links)

I suppose one could spend a week leafing through Jeffrey Friedl's book on regular expressions and the regex library reference and eventually get the desired result, but Stack Overflow is not meant for advice like "read Friedl and don't ask to be spoon-fed." Apart from this seemingly useful question, there is no working answer on Stack Overflow.

Parsing HTML with regular expressions will not work, because HTML is not a regular language.
@VTT Please suggest code using std, Boost, or something else.

Wiktor Stribiżew · Accepted Answer · 2019-01-06T16:36:20

You can fix the code by passing the std::regex_constants::icase flag, and by using sregex_token_iterator with 1 as the fourth argument (to get the value of capturing group #1). In Python, re.findall returns only the captured substrings when the pattern contains capturing groups; there is no such method in C++.

Example of working C++ code:

    #include <iostream>
    #include <string>
    #include <vector>
    #include <regex>

    using namespace std;

    int main() {
        regex re("<\\s*A\\s+(?:[^>]*?\\s+)?href\\s*=\\s*\"([^\"]*)\"", std::regex_constants::icase);
        string subject("<head><title>Search engines</title></head><body><a href=\"https://yandex.ru\">Yandex</a><a href=\"https://google.com\"></a></body>");
        vector<string> result(sregex_token_iterator(subject.begin(), subject.end(), re, 1), sregex_token_iterator());
        for (auto& s : result)
            cout << s << endl;
        return 0;
    }
    // Output:
    // https://yandex.ru
    // https://google.com

Well, I mean, how was I supposed to find this page: ideone.com/eSaf6K?
