How to parse an HTML page into objects in Qt?

Question

I need to extract the content that sits between HTML tags (for example, inside <div id="content">).

I have already downloaded the page source. What tools in Qt, plain C++, or a third-party library can build a tree from the HTML so that the contents of the tags are easy to pull out?


Nicolas Chabanovsky ♦ · Answer 1 · 2016-05-31T07:58:04

You can take a different approach: do all the work on the page in JavaScript, and only process the result in C++. Let me explain.

Suppose you have a QWebEngineView:

 QWebEngineView *view;

Connect a slot to its loadFinished signal, so that it is called when the page has finished loading:

 connect(view, SIGNAL(loadFinished(bool)), SLOT(finishLoading(bool)));

Load the target page:

 view->load(QUrl(requestUrl));

Once the page has loaded, inject jQuery to make the JavaScript side easier to write. Here JQuery is a helper that returns the jQuery source as a string read from a resource file. After injecting jQuery, run your own JavaScript code and pass a result handler. The finishLoading function might look like this:

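 void MyClass::finishLoading(bool)
 {
     // Inject jQuery first, then run the page-processing script.
     view->page()->runJavaScript(JQuery::Instance()->Code());
     // invoke() wraps the member function as the result callback.
     view->page()->runJavaScript(YOUR_JS_CODE, invoke(this, &MyClass::onResultCallback));
 }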

Here YOUR_JS_CODE is a QString containing the page-processing code, for example:

 const QString YOUR_JS_CODE = " \
     function doSomething() { \
         var results = $('.content'); \
         if (results.length > 0) \
             return results.html(); \
         return ''; \
     }; \
     doSomething();";

Note that the injected function has to be called explicitly (the last line).

Finally, you need to process the result.

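 void MyClass::onResultCallback(const QVariant &returnValue)
 {
     // QVariant::toString() has no "ok" out-parameter, so check convertibility first.
     if (!returnValue.isValid() || !returnValue.canConvert<QString>()) {
         // Error handling.
         return;
     }
     const QString result = returnValue.toString();
     // Logic for working with the result.
 }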

For reference, I recommend looking at the fancybrowser example, or at querychecker, which is based on it.

Answer 2 · 2016-05-30T13:55:54

If you go with QtWebKit, use the findAllElements and findFirstElement methods.

Add QT += webkit to the .pro file. For Qt 5 you also need to add QT += webkitwidgets.

And include:

 #include <QWebPage>
 #include <QWebFrame>
 #include <QUrl>
 #include <QWebElementCollection>
 #include <QDebug>

In Qt 5 the includes are:

 #include <QtWebKitWidgets/QWebPage>
 #include <QtWebKitWidgets/QWebFrame>

Code:

 QWebPage page;
 page.mainFrame()->setHtml("<a><b/><div id=\"content\">!!!</div><div id=\"content\">@@@</div></a>");
 qDebug() << page.mainFrame()->toHtml();
 foreach (QWebElement el, page.mainFrame()->findAllElements("#content").toList()) {
     qDebug() << el.toInnerXml();
 }

Console output:

 "<html><head></head><body><a><b><div id="content">!!!</div><div id="content">@@@</div></b></a></body></html>"
 "!!!"
 "@@@"

QtWebKit uses CSS selectors for searching; they play a role similar to XPath queries.

The trouble is that QtWebKit can no longer be used in the latest versions of Qt.
I would like to know why this got a downvote, for a working example?
@gil9red - I apologize, fixed. I thought I was in a different thread; that downvote came from another person, on a different answer.
@gil9red, the author probably means Qt 5.6, where QtWebKit really can no longer be included.

Vyacheslav Savchenko · Answer 3 · 2016-05-31T00:19:23

I can suggest parsing with Qt XML. I do not know whether it will work on HTML; in theory it should, as long as the markup is well-formed XML. Below is an excerpt from Qt Assistant, from the description of the QDomDocument class. If you have specific questions, please write in the comments. To parse HTML of arbitrary nesting depth, you need to use recursive calls. And do not forget to add QT += xml to the .pro file.

 QDomDocument doc("mydocument");
 QFile file("mydocument.xml");
 if (!file.open(QIODevice::ReadOnly))
     return;
 if (!doc.setContent(&file)) {
     file.close();
     return;
 }
 file.close();
 // print out the element names of all elements that are direct children
 // of the outermost element.
 QDomElement docElem = doc.documentElement();
 QDomNode n = docElem.firstChild();
 while(!n.isNull()) {
     QDomElement e = n.toElement(); // try to convert the node to an element.
     if(!e.isNull()) {
         cout << qPrintable(e.tagName()) << endl; // the node really is an element.
     }
     n = n.nextSibling();
 }
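The recursive descent mentioned in this answer can be sketched without Qt. This is only an illustration of the recursion, not QDomDocument code: Node and collectTags are hypothetical names, with Node standing in for an element that has a tag name and child elements (roughly what QDomElement::tagName() and the firstChild()/nextSibling() loop give you).

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a DOM element: a tag name plus child elements.
struct Node {
    std::string tag;
    std::vector<Node> children;
};

// Appends the tag name of `node` and of every descendant to `out`,
// indented by two spaces per nesting level (depth-first, document order).
void collectTags(const Node &node, std::size_t depth, std::vector<std::string> &out)
{
    out.push_back(std::string(depth * 2, ' ') + node.tag);
    for (const Node &child : node.children)
        collectTags(child, depth + 1, out); // recurse one level deeper
}
```

With a real QDomDocument the same shape would apply: handle the current element, then call the function again for each child element.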

Duracell · Answer 4 · 2016-05-30T12:59:18

Personally, I would parse it with plain std::string, like this:

1) Read the entire response body into a std::string variable.

2) Then, using the find method, locate the required tags and put them into a container of strings, so that the hierarchical tag tree can be reconstructed later.

3) Then, using the erase method, cut out the contents of the desired tag, if necessary.

4) Finally, write a simple algorithm for determining how the tags are nested.

Regular expressions could also be used for parsing, but it is better not to.
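A minimal sketch of the steps above, assuming the page is already in a std::string. The function name extractInnerHtml and its parameters are made up for illustration, it uses substr rather than erase to take the contents, and it only tracks nesting of one tag name with a depth counter. It also assumes lower-case tag names and an opening tag spelled exactly as given, which real-world HTML does not guarantee.

```cpp
#include <cstddef>
#include <string>

// Finds the element that starts with `openTag` (for example
// "<div id=\"content\">") and copies its inner HTML into `inner`.
// `tagName` (for example "div") is used to count nested elements of the same
// kind, so the matching closing tag is found rather than the first one.
// Returns false if the element or its closing tag is missing.
// Assumption: no other tag name begins with `tagName`.
bool extractInnerHtml(const std::string &html,
                      const std::string &openTag,
                      const std::string &tagName,
                      std::string &inner)
{
    const std::string nestedOpen = "<" + tagName;
    const std::string close = "</" + tagName + ">";

    const std::size_t start = html.find(openTag);
    if (start == std::string::npos)
        return false;

    const std::size_t begin = start + openTag.size();
    std::size_t pos = begin;
    int depth = 1; // we are inside the element that was just opened

    while (true) {
        const std::size_t nextOpen = html.find(nestedOpen, pos);
        const std::size_t nextClose = html.find(close, pos);
        if (nextClose == std::string::npos)
            return false; // unbalanced markup

        if (nextOpen != std::string::npos && nextOpen < nextClose) {
            ++depth; // a nested element of the same kind starts here
            pos = nextOpen + nestedOpen.size();
        } else {
            --depth;
            if (depth == 0) {
                inner = html.substr(begin, nextClose - begin);
                return true;
            }
            pos = nextClose + close.size();
        }
    }
}
```

For the input <body><div id="content">a<div>b</div>c</div><div>d</div></body>, calling it with the opening tag <div id="content"> and the tag name div yields a<div>b</div>c. Anything less regular than that (attributes in a different order, upper-case tags, comments, scripts) is where a real parser becomes the safer choice.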
HTML/XML should be parsed with a proper parser; on top of that, there are XPath queries and CSS selectors.
@gil9red - the question was: "what tools in Qt, C++, or a library can build a tree from HTML" - so I answered with plain C++ tools.
