I have a string of HTML that needs to be parsed with regular expressions. I need to collect into a std::vector all the URLs on the page that appear in href="". My C++ regex code does not work:

    #include <regex>
    #include <iostream>
    #include <string>

    using std::string;
    using std::regex;
    using std::cout;
    using std::endl;
    using std::sregex_iterator;
    using std::smatch;

    int main() {
        string subject("<head><title>Search engines</title></head><body><a href=\"https://yandex.ru\">Yandex</a><a href=\"https://google.com\"></a></body>");
        try {
            regex re("<\\s*A\\s+[^>]*href\\s*=\\s*\"([^\"]*)\"");
            sregex_iterator next(subject.begin(), subject.end(), re);
            sregex_iterator end;
            if (next == end)
                cout << "Oops" << endl;
            while (next != end) {
                smatch match = *next;
                cout << match.str() << endl;
                next++;
            }
        } catch (std::regex_error& e) {
            // Syntax error in the regular expression
        }
        return 0;
    }

Only the Python version works:

    #!/usr/bin/python3
    import re

    html = '<head><title>Search engines</title></head><body><a href="https://yandex.ru">Yandex</a><a href="https:/google.com"></a></body>'
    title = re.findall(r'<title>(.*?)</title>', html)[0]
    links = [x[1] for x in re.findall(r'<a\s+(?:[^>]*?\s+)?href=(["\'])(.*?)\1', html)]
    print(title)
    print(links)

I suppose I could spend a week leafing through Jeffrey Friedl's book on regular expressions and the regex library reference and eventually get the desired result, but Stack Overflow is not meant for advice like "go read Friedl and don't ask to be spoon-fed." Besides, this seemingly common question has no working answer anywhere on Stack Overflow.

1 answer

You can fix your code by passing the std::regex_constants::icase flag (the pattern contains an uppercase A, while the tags in the string are lowercase) and by using sregex_token_iterator with 1 as the fourth argument (to get the value of capturing group 1). In Python, re.findall returns only the captured substrings when the pattern contains capturing groups; there is no such method in C++.

Example of working C++ code:

    #include <iostream>
    #include <string>
    #include <vector>
    #include <regex>

    using namespace std;

    int main() {
        regex re("<\\s*A\\s+(?:[^>]*?\\s+)?href\\s*=\\s*\"([^\"]*)\"", std::regex_constants::icase);
        string subject("<head><title>Search engines</title></head><body><a href=\"https://yandex.ru\">Yandex</a><a href=\"https://google.com\"></a></body>");
        vector<string> result(sregex_token_iterator(subject.begin(), subject.end(), re, 1),
                              sregex_token_iterator());
        for (auto& s : result)
            cout << s << endl;
        return 0;
    }
    // => https://yandex.ru, https://google.com
  • Did you just google it? - Anton
  • Googled what? ... - Wiktor Stribiżew
  • Well, I mean, how was I supposed to find this page ideone.com/eSaf6K ? - Anton
  • Or is that site in your bookmarks? - Anton
  • Ah, never mind, I get it. That's your code. - Anton