Recursive HTML parser using the htmlcxx library

Question

Good day! Please help me figure out what is wrong with my recursion; it does not work correctly. I am using a recursive function to look for a tag that has a given value for a given attribute. However, the function always reports that there is no such attribute, although I know for sure that there is one. From this I conclude that the recursive traversal is wrong. Here is the code:

    int depthSearch(tree<htmlcxx::HTML::Node> const & dom,
                    tree<htmlcxx::HTML::Node>::iterator &returnIt,
                    tree<htmlcxx::HTML::Node>::iterator &currentIterator,
                    string tagName, string attributeName, string attrContent)
    {
        int errorCode = 0;
        // return fixed-depth iterator to the first node at a given depth for given iterator
        tree<htmlcxx::HTML::Node>::iterator it = dom.begin_fixed(currentIterator, 0);
        tree<htmlcxx::HTML::Node> currDom;
        for (unsigned i = 0; i < dom.number_of_children(it); i++)
        {
            // return child node of current node
            currDom = dom.child(it, i);
            // Return leaf iterator to the first leaf of the subtree at the given node.
            currentIterator = currDom.head;
            if (currDom.number_of_children(it) > 0)
            {
                errorCode = depthSearch(currDom, returnIt, currentIterator, tagName, attributeName, attrContent);
                if (errorCode == 0)
                {
                    returnIt = it;
                    return 0;
                }
            }
            else
            {
                errorCode = GoToTagWithAttr(currDom.begin_fixed(currentIterator, 0), tagName, attributeName, attrContent);
                if (errorCode == 0)
                {
                    returnIt = it;
                    return 0;
                }
            }
        }
        return 1;
    }

The essence of the algorithm: we walk through the children of each node until we reach the deepest one, then check whether it has the required attributes. If it does not, we step back up the recursion and check again. Once the desired node with the desired attribute is found, we exit immediately and remember a pointer to the found tag.

Link to the library documentation: htmlcxx documentation.

An article about using iterators: htmlcxx basic description and basic techniques of use.

And here are a few links showing how people use the library:

htmlcxx C++ crawling HTML

htmlcxx API usage

htmlcxx - HTML and CSS APIs for C++

Is it intentional that you look for an element with the given attributes only after reaching the "leaves" of the tree?
Because if a node has any branches, the call to GoToTagWithAttr() for that node is never executed.
@margosh, no, I need to traverse all the branches, because the desired tag sits quite deep and cannot be reached using first-level indexes.
I meant that the current algorithm only checks the elements of the tree that have no descendants; elements that do have descendants are never checked for a match.
In addition, it seems to me that when the attributes are found in one of those descendants, you overwrite the iterator pointing to it, because on returning from the function in the "parent" you assign the parent's iterator to the parameter.

margosh · Answer 1 · 2015-05-19T10:07:08

If you need to check every node of the tree for a match, regardless of whether the node has descendants, then the function should also check for suitable attributes after the for loop:

    ...
    for { ... }
    errorCode = GoToTagWithAttr(currDom.begin_fixed(currentIterator, 0), tagName, attributeName, attrContent);
    if (errorCode == 0)
    {
        returnIt = it;
        return 0;
    }
    return 1;

In addition, when a node has descendants and the required attribute was found in one of them, the line "returnIt = it;" is superfluous if you want to pass the iterator to exactly that descendant back up to the caller:

    ...
    if (currDom.number_of_children(it) > 0)
    {
        errorCode = depthSearch(currDom, returnIt, currentIterator, tagName, attributeName, attrContent);
        if (errorCode == 0)
        {
            return 0;
        }
    }
    else { ... }

Perhaps this is the problem. Unfortunately, I am not able to build your code to test my hypotheses.
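The two suggestions above can be illustrated without htmlcxx at all. The following is only a minimal sketch, not the asker's code: a plain struct replaces tree<htmlcxx::HTML::Node>, and `Node`, `matches` (standing in for GoToTagWithAttr) and `sampleTree` are hypothetical names invented for this illustration. It shows a search that tests every node after visiting its children and passes a hit from a descendant up unchanged.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for a DOM node; this is NOT htmlcxx's HTML::Node.
struct Node {
    std::string tag;
    std::map<std::string, std::string> attrs;
    std::vector<Node> children;
};

// Hypothetical predicate playing the role of GoToTagWithAttr.
bool matches(const Node& n, const std::string& tag,
             const std::string& attr, const std::string& value) {
    if (n.tag != tag) return false;
    auto it = n.attrs.find(attr);
    return it != n.attrs.end() && it->second == value;
}

// Post-order search: descend into every child first, then test the node
// itself, whether or not it has children. Returns nullptr if nothing matches.
const Node* depthSearch(const Node& node, const std::string& tag,
                        const std::string& attr, const std::string& value) {
    assert(!tag.empty());
    for (const Node& child : node.children) {
        // A hit found deeper in the tree is passed up unchanged; the
        // caller does not replace it with its own node.
        if (const Node* found = depthSearch(child, tag, attr, value)) {
            return found;
        }
    }
    return matches(node, tag, attr, value) ? &node : nullptr;
}

// Small sample tree: html > body > [div.menu, div.content > p]
Node sampleTree() {
    Node content{"div", {{"class", "content"}}, {Node{"p", {}, {}}}};
    Node menu{"div", {{"class", "menu"}}, {}};
    Node body{"body", {}, {menu, content}};
    return Node{"html", {}, {body}};
}
```

Calling `depthSearch(root, "div", "class", "content")` on the sample tree returns a pointer to the div that has a child, which is exactly the kind of node the leaf-only check never examines.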

Your assumption was wrong. Below I posted code that more or less works. - neo

ruzzz · Answer 2 · 2017-01-24T21:36:14

This may come in handy for someone. I made a slightly modified fork of the original htmlcxx: https://github.com/Ruzzz/htmlcxx Be sure to look at /test/test.cpp

Example:

    std::string html(...YOUR HTML WITH LINKS...);
    ParserDom parser;
    Tree domTree = parser.parseTree(html);
    Tree::pre_order_iterator it = domTree.begin();
    std::vector<std::string> links;
    std::string href;
    std::for_each(it, domTree.end(), [&links, &href](Node &node) {
        if (node.isTag() && (node.tagName() == "a") && (node.parseAttributes() > 0) && (node.attribute("href", href)))
            links.push_back(href);
    });
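For readers without the fork at hand, the idea behind this example (visit every node in pre-order and collect the href of each a tag) can be sketched with the standard library only. `Node`, `collectLinks` and `samplePage` are hypothetical names for this illustration; none of them belongs to htmlcxx or to the fork, and the explicit stack is simply one way to get a pre-order walk without recursion.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for a DOM node; this is NOT htmlcxx's HTML::Node.
struct Node {
    std::string tag;
    std::map<std::string, std::string> attrs;
    std::vector<Node> children;
};

// Visits every node in pre-order (parent before children, children left to
// right) and collects the href value of each "a" tag that has one.
std::vector<std::string> collectLinks(const Node& root) {
    std::vector<std::string> links;
    std::vector<const Node*> pending{&root};
    while (!pending.empty()) {
        const Node* cur = pending.back();
        pending.pop_back();
        assert(cur != nullptr);  // only addresses of real nodes are pushed
        if (cur->tag == "a") {
            auto href = cur->attrs.find("href");
            if (href != cur->attrs.end()) links.push_back(href->second);
        }
        // Push children in reverse so the leftmost child is popped first.
        for (auto it = cur->children.rbegin(); it != cur->children.rend(); ++it) {
            pending.push_back(&*it);
        }
    }
    return links;
}

// Sample page: html > body > [a(/one), div > a(/two), a without href]
Node samplePage() {
    Node first{"a", {{"href", "/one"}}, {}};
    Node second{"a", {{"href", "/two"}}, {}};
    Node anchorOnly{"a", {{"name", "top"}}, {}};
    Node div{"div", {}, {second}};
    Node body{"body", {}, {first, div, anchorOnly}};
    return Node{"html", {}, {body}};
}
```

Because the walk is flat, a link is collected no matter how deeply its tag is nested, and no per-level child indexes are needed.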

neo · Answer 3 · 2015-05-25T15:52:17

This code seems to work, but only on an example with shallow nesting. For some reason it does not work on real HTML pages!

    void walk_tree(tree<HTML::Node> const & dom, std::string tagName, std::string attrName, std::string attrContent)
    {
        tree<HTML::Node>::iterator it = dom.begin();
        if (strcasecmp(it->tagName().c_str(), tagName.c_str()) == 0)
        {
            it->parseAttributes();
            if (strcasecmp(it->attribute(attrName).second.c_str(), attrContent.c_str()) == 0)
            {
                std::cout << "I FOUNDED THIS TAG = DIV WITH ATTR = CLASS" << std::endl;
                ++it;
                if ((!it->isTag()) && (!it->isComment()))
                {
                    std::cout << std::endl;
                    std::cout << it->text() << std::endl;
                    std::cout << std::endl;
                }
                //goto Exit;
            }
        }
        for (unsigned i = 0; i < dom.number_of_children(it); i++)
        {
            walk_tree(dom.child(it, i), tagName, attrName, attrContent);
        }
        /*Exit: return;*/
    }

I am waiting for your comments and suggestions about what is wrong with this library!
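One difference between small test documents and real pages that may be worth ruling out (this is a guess; nothing in the thread confirms it is the cause): walk_tree compares the whole attribute value with attrContent, so an element written as class="post main" would not match a search for "main". Real pages often put several space-separated names in one class attribute. The helper below, with the hypothetical name `hasToken`, is a standard-library sketch of matching a single token inside such a value.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Returns true if `token` is one of the whitespace-separated tokens in
// `attrValue`. The comparison is case-sensitive, unlike strcasecmp.
bool hasToken(const std::string& attrValue, const std::string& token) {
    assert(!token.empty());
    std::istringstream in(attrValue);
    std::string word;
    while (in >> word) {
        if (word == token) return true;
    }
    return false;
}
```

Under that assumption, the inner comparison in walk_tree could be replaced by `hasToken(it->attribute(attrName).second, attrContent)`. If the real pages use single class names only, this changes nothing and the cause lies elsewhere.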
