I need to extract all URLs from the href attributes of <a> tags on an HTML page. I tried using regular expressions:

 Uri uri = new Uri("http://google.com/search?q=test");
 Regex reHref = new Regex(@"<a[^>]+href=""([^""]+)""[^>]+>");
 string html = new WebClient().DownloadString(uri);
 foreach (Match match in reHref.Matches(html))
     Console.WriteLine(match.Groups[1].ToString());

But there are many potential problems:

  • How to filter only specific links, for example, by CSS class?
  • What happens if the quotes in the attribute are different?
  • What happens if there are spaces around the equal sign?
  • What happens if a piece of the page is commented out?
  • What happens if you get a piece of JavaScript?
  • And so on.

The regular expression quickly becomes monstrous and unreadable, and more and more problem cases keep turning up.

What to do?

  • I would replace the .net tag with faq. - VladD
  • @VladD If the answer is detailed, it does not become a FAQ. In my understanding, a FAQ is when a question is on the verge of breaking the rules because of how broad it is, and the answer is a compilation of dozens of answers from different users (i.e. a wiki). Here there is a simple question and a simple answer. If another library comes up, it is better added as a separate answer, say. And it would be a pity for me to lose the .NET tag - after all, .NET is not limited to C# and VB. - Athari
  • For my taste, faq is precisely for answers to frequently asked questions. Questions like "how do I find X on HTML page Y using regex Z" pop up regularly. - VladD
  • I just have to put this here: stackoverflow.com/questions/1732348/… - Alex
  • The same question on Software Recommendations. - Vadim Ovchinnikov

6 answers

Regular expressions are designed for relatively simple texts described by regular languages. Regular expressions have grown much more complicated since their introduction, especially in Perl, whose implementation has inspired other languages and libraries, but they are still poorly suited (and are unlikely ever to be) for processing complex languages such as HTML. The difficulty of processing HTML also lies in the very intricate rules for handling invalid code, inherited from the early days of the Internet, when there were no standards at all and every browser vendor had its own unique quirks.

So, in general, regular expressions are not the best candidate for handling HTML. It is usually wiser to use specialized HTML parsers.

CsQuery

License: MIT

One of the modern HTML parsers for .NET. It is based on the validator.nu parser for Java, which in turn is a port of the parser from the Gecko engine (Firefox). This guarantees that the parser processes code the same way modern browsers do.

The API is inspired by jQuery and uses the CSS selector language to select elements. Method names are copied almost one-to-one, so for programmers familiar with jQuery it is easy to pick up.
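
For illustration, a small sketch of the jQuery-flavoured chaining (my addition; it assumes the indexer, Attr and Each members present in CsQuery builds of that period):

 CQ cq = CQ.Create(html);
 // The selector indexer behaves like $(...); Attr("href") returns the href of the first match,
 // just as jQuery's attr() does.
 string firstHref = cq["a"].Attr("href");
 // Each() mirrors jQuery's each():
 cq["a"].Each(el => Console.WriteLine(el.GetAttribute("href")));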

Performance is high: orders of magnitude faster than HtmlAgilityPack + Fizzler on complex queries.

 CQ cq = CQ.Create(html);
 foreach (IDomObject obj in cq.Find("a"))
     Console.WriteLine(obj.GetAttribute("href"));

If a more complex query is required, the code barely gets more complicated:

 CQ cq = CQ.Create(html);
 foreach (IDomObject obj in cq.Find("h3.r a"))
     Console.WriteLine(obj.GetAttribute("href"));

HtmlAgilityPack

License: Ms-PL

The oldest, and therefore the most popular, parser for .NET. However, age does not imply quality; for example, for five years (!!!) the critical bug "Incorrect parsing of HTML4 optional end tags" has remained unfixed, which leads to incorrect processing of tags whose closing tags are optional. There are oddities in the API as well; for example, if nothing is found, it returns null rather than an empty collection.

XPath is used to select elements, not CSS selectors. For simple queries the code is more or less readable:

 HtmlDocument hap = new HtmlDocument();
 hap.LoadHtml(html);
 HtmlNodeCollection nodes = hap.DocumentNode.SelectNodes("//a");
 if (nodes != null)
     foreach (HtmlNode node in nodes)
         Console.WriteLine(node.GetAttributeValue("href", null));

However, if complex queries are needed, XPath is not well suited to simulating CSS selectors:

 HtmlDocument hap = new HtmlDocument();
 hap.LoadHtml(html);
 HtmlNodeCollection nodes = hap.DocumentNode.SelectNodes(
     "//h3[contains(concat(' ', @class, ' '), ' r ')]/a");
 if (nodes != null)
     foreach (HtmlNode node in nodes)
         Console.WriteLine(node.GetAttributeValue("href", null));

Fizzler

License: LGPL

An add-on for HtmlAgilityPack that allows using CSS selectors.

 HtmlDocument hap = new HtmlDocument();
 hap.LoadHtml(html);
 foreach (HtmlNode node in hap.DocumentNode.QuerySelectorAll("h3.r a"))
     Console.WriteLine(node.GetAttributeValue("href", null));

AngleSharp

License: BSD (3-clause)

A new player on the parser field. Unlike CsQuery, it is written from scratch by hand in C#. It also includes parsers for other languages.

The API is based on the official JavaScript HTML DOM specification. Some places have oddities that are unusual for .NET developers (for example, accessing an invalid index in a collection returns null instead of throwing an exception; there is a separate Url class; the namespaces are very granular, so even basic use of the library requires three using directives, etc.), but overall nothing critical.
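
For illustration, the using directives the snippets below rely on (namespaces as in AngleSharp 0.9.x; earlier and later releases shuffle them around, so treat this as an assumption):

 using AngleSharp.Dom;            // IElement
 using AngleSharp.Dom.Html;       // IHtmlDocument
 using AngleSharp.Parser.Html;    // HtmlParser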

Another oddity: the library pulls in the Microsoft BCL Portability Pack. So when you add AngleSharp via NuGet, do not be surprised to find three extra packages installed: Microsoft.Bcl, Microsoft.Bcl.Build, Microsoft.Bcl.Async.

HTML processing is simple:

 IHtmlDocument angle = new HtmlParser(html).Parse();
 foreach (IElement element in angle.QuerySelectorAll("a"))
     Console.WriteLine(element.GetAttribute("href"));

The code barely changes if more complex logic is needed:

 IHtmlDocument angle = new HtmlParser(html).Parse();
 foreach (IElement element in angle.QuerySelectorAll("h3.r a"))
     Console.WriteLine(element.GetAttribute("href"));

Regex

Scary, terrible regular expressions. Using them is undesirable, but sometimes it becomes necessary, since parsers that build a DOM are noticeably hungrier than Regex: they consume more CPU time and memory.

If it does come to regular expressions, you need to understand that you cannot build a universal and absolutely reliable solution on them. However, if you want to parse one specific site, this problem may not be so critical.

For God's sake, do not turn regular expressions into an unreadable mess. You do not write C# code on one line with single-letter variable names, so there is no need to mangle regular expressions either. The regular expression engine in .NET is powerful enough to write readable patterns.

For example, here is a slightly modified version of the link-extraction code from the question:

 Regex reHref = new Regex(@"(?inx)
     <a \s [^>]*
         href \s* = \s*
             (?<q> ['""] )
                 (?<url> [^""]+ )
             \k<q>
     [^>]* >");
 foreach (Match match in reHref.Matches(html))
     Console.WriteLine(match.Groups["url"].ToString());
  • I would also add Selenium, primarily as a DOM builder for scripted pages: seleniumhq.org/docs/05_selenium_rc.jsp#c scraping.pro/... To get at the generated DOM, instead of PageSource you can evaluate a script (stackoverflow.com/questions/26584215/...): var pageSource = (string) driver.ExecuteScript("return document.body.outerHTML"); - Serginio
  • @Serginio If you use outerHTML, then Selenium is just a third-party tool for obtaining the HTML, and the parsing is still done by the parsers discussed here. Now, if you use it directly, then yes, you get a DOM as from a parser, two in one. Does Selenium also only support XPath and CSS queries? - Athari
  • Here is another link: vcskicks.com/selenium-jquery.php - Serginio
  • @Serginio Saying that JS support in AngleSharp is unfinished is probably not entirely fair, because it is hardly going to mutate into a full-fledged browser. :) I do not understand why the author bothered bolting JavaScript onto the parser, since you cannot do anything useful with it. If you need to emulate a browser, you clearly would not pick a stripped-down proof-of-concept. // And I repeat: an HTML parser is a thing that parses one very specific language. A browser emulator is a much more powerful and higher-level thing; text analysis without a DOM is weaker and lower-level. - Athari
  • Well, I have not figured all of that out yet. I am a 1C developer, nickname infostart.ru/profile/82159/public - Serginio

Use the CefSharp library for such tasks.

Why should this approach be used?

  • The development process is much simpler: instead of writing XPath, conditions and/or loops in C#, you simply develop everything you need in the browser console (preferably a Chromium-based one), and once the small core class is written (shown below), you just paste in the JavaScript code you need.
  • Reliability. You are not trying to parse HTML yourself and reinvent the wheel, which is almost always a very bad idea. The project is based on Chromium, so you do not have to put your trust in some new or unfamiliar product, and it is actively kept in sync with new Chromium versions.

For the JavaScript calls, jQuery is used for simplicity and demonstration, on the assumption that it is also present on the target site. But it can just as well be pure JavaScript, or another library, provided that library is used on the site.
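
For example, a minimal sketch of the same link extraction without jQuery, using only standard DOM APIs (it assumes the wrapper class defined later in this answer):

 // Plain-DOM equivalent of the jQuery snippet: collect absolute hrefs of all links.
 string[] urls = await wrapper.EvaluateJavascript<string[]>(
     "Array.from(document.querySelectorAll('a[href]'), a => a.href)");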

If you scroll down, you will notice that, apart from writing a small layer of code and initialization, the solution takes one or two lines:

 string[] urls = await wrapper.GetResultAfterPageLoad("https://yandex.ru",
     async () => await wrapper.EvaluateJavascript<string[]>(
         "$('a[href]').map((index, element) => $(element).prop('href')).toArray()"));

What is it?

It is a managed wrapper over CEF (Chromium Embedded Framework). That is, you get the power of Chromium, controlled programmatically.

Why choose CEF / CefSharp?

  • You do not have to bother with parsing pages yourself (a difficult and thankless task that I strongly recommend against).
  • You can work with the already loaded page (after scripts have run).
  • It is possible to execute arbitrary JavaScript with the latest language features.
  • It makes it possible to invoke AJAX from JavaScript and then, in the success callback, raise events in the C# code with the AJAX result. Reviewed in detail, with an example, here. (A rough sketch follows this list.)
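
Below is a minimal sketch of that last point, not a complete solution: the names AjaxBridge, OnResult and boundAjaxBridge are mine, and RegisterJsObject is the binding mechanism available in CefSharp builds of roughly this vintage (it was replaced in later versions):

 // A C# object exposed to the page; a JS success callback can hand the AJAX result back to .NET.
 public class AjaxBridge
 {
     public event Action<string> ResultReceived;

     // Called from JavaScript as: boundAjaxBridge.onResult(JSON.stringify(data));
     // (CefSharp camel-cases bound method names by default.)
     public void OnResult(string json) => ResultReceived?.Invoke(json);
 }

 // Registration must happen before the page that uses it is loaded:
 _browser.RegisterJsObject("boundAjaxBridge", new AjaxBridge());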

CefSharp Varieties

  • CefSharp.WinForms
  • CefSharp.Wpf
  • CefSharp.OffScreen

The first two are used if you need to give users a browser control. They are conceptually similar to the WebBrowser control in Windows Forms, except that WebBrowser wraps IE rather than Chromium, as in our case.

Therefore, we will use the CefSharp.OffScreen (offscreen) version.

Writing the code

Suppose we have a console application, though that part is up to you.

Install version 57 of the CefSharp.OffScreen NuGet package:
Install-Package CefSharp.OffScreen -Version 57.0.0

The thing is that all JavaScript arrays map to List<object> in C#; the result of a JavaScript call is wrapped in an object, which may actually hold a List<object>, string, bool or int depending on the result. To make the results strongly typed, create a small ConvertHelper:

 public static class ConvertHelper
 {
     public static T[] GetArrayFromObjectList<T>(object obj)
     {
         return ((IEnumerable<object>)obj)
             .Cast<T>()
             .ToArray();
     }

     public static List<T> GetListFromObjectList<T>(object obj)
     {
         return ((IEnumerable<object>)obj)
             .Cast<T>()
             .ToList();
     }

     public static T ToTypedVariable<T>(object obj)
     {
         if (obj == null)
         {
             dynamic dynamicResult = null;
             return dynamicResult;
         }

         Type type = typeof(T);

         if (type.IsArray)
         {
             dynamic dynamicResult = typeof(ConvertHelper).GetMethod(nameof(GetArrayFromObjectList))
                 .MakeGenericMethod(type.GetElementType())
                 .Invoke(null, new[] { obj });
             return dynamicResult;
         }

         if (type.IsGenericType && type.GetGenericTypeDefinition() == typeof(List<>))
         {
             dynamic dynamicResult = typeof(ConvertHelper).GetMethod(nameof(GetListFromObjectList))
                 .MakeGenericMethod(type.GetGenericArguments().Single())
                 .Invoke(null, new[] { obj });
             return dynamicResult;
         }

         return (T)obj;
     }
 }

To handle JavaScript errors, create a JavascriptException class.

 public class JavascriptException : Exception
 {
     public JavascriptException(string message)
         : base(message)
     {
     }
 }

You may have your own way of handling errors.

Create a CefSharpWrapper class:

 public sealed class CefSharpWrapper
 {
     private ChromiumWebBrowser _browser;

     public void InitializeBrowser()
     {
         Cef.EnableHighDPISupport();

         // Perform dependency check to make sure all relevant resources are in our output directory.
         Cef.Initialize(new CefSettings(), performDependencyCheck: false, browserProcessHandler: null);

         _browser = new ChromiumWebBrowser();

         // Wait till the browser is initialised.
         AutoResetEvent waitHandle = new AutoResetEvent(false);

         EventHandler onBrowserInitialized = null;
         onBrowserInitialized = (sender, e) =>
         {
             _browser.BrowserInitialized -= onBrowserInitialized;
             waitHandle.Set();
         };

         _browser.BrowserInitialized += onBrowserInitialized;
         waitHandle.WaitOne();
     }

     public void ShutdownBrowser()
     {
         // Clean up Chromium objects. You need to call this in your application,
         // otherwise you will get a crash when closing.
         Cef.Shutdown();
     }

     public Task<T> GetResultAfterPageLoad<T>(string pageUrl, Func<Task<T>> onLoadCallback)
     {
         TaskCompletionSource<T> tcs = new TaskCompletionSource<T>();
         EventHandler<LoadingStateChangedEventArgs> onPageLoaded = null;
         T t = default(T);

         // An event that is fired when the first page is finished loading.
         // This returns to us from another thread.
         onPageLoaded = async (sender, e) =>
         {
             // Check to see if loading is complete - this event is called twice,
             // once when loading starts and a second time when it's finished
             // (rather than an iframe within the main frame).
             if (!e.IsLoading)
             {
                 // Remove the load event handler, because we only want one snapshot of the initial page.
                 _browser.LoadingStateChanged -= onPageLoaded;
                 t = await onLoadCallback();
                 tcs.SetResult(t);
             }
         };
         _browser.LoadingStateChanged += onPageLoaded;

         _browser.Load(pageUrl);
         return tcs.Task;
     }

     public async Task EvaluateJavascript(string script)
     {
         JavascriptResponse javascriptResponse = await _browser.GetMainFrame().EvaluateScriptAsync(script);
         if (!javascriptResponse.Success)
         {
             throw new JavascriptException(javascriptResponse.Message);
         }
     }

     public async Task<T> EvaluateJavascript<T>(string script)
     {
         JavascriptResponse javascriptResponse = await _browser.GetMainFrame().EvaluateScriptAsync(script);
         if (javascriptResponse.Success)
         {
             object scriptResult = javascriptResponse.Result;
             return ConvertHelper.ToTypedVariable<T>(scriptResult);
         }
         throw new JavascriptException(javascriptResponse.Message);
     }
 }

Next we call our CefSharpWrapper class from the Main method.

 public class Program
 {
     private static void Main()
     {
         MainAsync().Wait();
     }

     private static async Task MainAsync()
     {
         CefSharpWrapper wrapper = new CefSharpWrapper();
         wrapper.InitializeBrowser();

         string[] urls = await wrapper.GetResultAfterPageLoad("https://yandex.ru",
             async () => await wrapper.EvaluateJavascript<string[]>(
                 "$('a[href]').map((index, element) => $(element).prop('href')).toArray()"));

         wrapper.ShutdownBrowser();
     }
 }

Also, this library has a quirk: an empty JavaScript array comes back as null. So it may be worth adding the appropriate handling to ConvertHelper (depending on your code and needs), or writing something like this in the calling code:

 if (urls == null) urls = new string[0];

Also, set x64 or x86 as the target platform. Any CPU is supported, but it requires additional code.

  • You should also mention the drawbacks of this approach: running a full browser engine is 100 times slower than DOM parsing and 1000 times slower than regex parsing. :) And the binaries get noticeably fatter. It only makes sense for fully dynamic sites whose internals you cannot be bothered to dig into, and for other hardcore cases. - Athari
  • Why would jQuery be present on an arbitrary page? - Qwertiy
  • @VadimOvchinnikov: Well, instead of C# libraries, each of which offers its own syntax, you use JS libraries, each of which also offers its own syntax. You have jQuery, for example, not pure JS. - VladD
  • @VadimOvchinnikov: Honestly, I do not see much difference. You can likewise use C# libraries if you have them. Besides, I, for example, do not know JS but do know C# - why know and use two languages when one is enough? - VladD
  • @EgorVB.net In what way is this code incomplete? I have already given everything I could. - Vadim Ovchinnikov

If the performance requirements are not very high, you can use the Internet Explorer COM object (add a reference to the Microsoft HTML Object Library):

 public static List<string> ParseLinks(string html)
 {
     List<string> res = new List<string>();
     mshtml.HTMLDocument doc = null;
     mshtml.IHTMLDocument2 d2 = null;
     mshtml.IHTMLDocument3 d = null;
     try
     {
         doc = new mshtml.HTMLDocument(); // initialize IE
         d2 = (mshtml.IHTMLDocument2)doc;
         d2.write(html);
         d = (mshtml.IHTMLDocument3)doc;
         var coll = d.getElementsByTagName("a"); // get the collection of elements by tag name
         object val;
         foreach (mshtml.IHTMLElement el in coll) // extract the href attribute from every element
         {
             val = el.getAttribute("href");
             if (val == null)
                 continue;
             res.Add(val.ToString());
         }
     }
     finally
     {
         // release COM resources
         if (doc != null) Marshal.ReleaseComObject(doc);
         if (d2 != null) Marshal.ReleaseComObject(d2);
         if (d != null) Marshal.ReleaseComObject(d);
     }
     return res;
 }

I'll add my two cents: if you don't want to mess around with the mshtml COM objects, you can create a WebBrowser object from Windows.Forms. And if you don't need all the scripts to run, then as I understand it the page can be downloaded not by the browser itself but by something simpler, like WebClient.DownloadString(), and the resulting page text can then be loaded into the WebBrowser for parsing:

 var itemPageText = _webClient.DownloadString(url);
 using (var pageHtml = new WebBrowser())
 {
     pageHtml.DocumentText = itemPageText;
     var elem = pageHtml.Document.GetElementById("imainImgHldr");
 }

and so on; the main thing is that methods like GetElementById() are somewhat more digestible wrappers than mshtml.

F#


Searching a page for all links to books on F#:

 let fsys = "https://www.google.com/search?tbm=bks&q=F%23"
 let doc2 = HtmlDocument.Load(fsys)
 let books =
     doc2.CssSelect("div.g h3.r a")
     |> List.map (fun a -> a.InnerText().Trim(), a.AttributeValue("href"))
     |> List.filter (fun (title, href) -> title.Contains("F#"))

F# Data
F# Data HTML Parser
F# Data HTML CSS selectors

I do just fine with XElement. Try it :)

 var htmlDom = XElement.Parse("[HTML code]");

As suggested in the comments, this will only work if the page in question is a valid XHTML document.
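
For completeness, a sketch (my addition, needing System.Xml.Linq and System.Linq) of actually pulling the hrefs out with LINQ to XML; it again presupposes a well-formed XHTML string, here called xhtml:

 var htmlDom = XElement.Parse(xhtml); // xhtml must be well-formed XHTML
 var hrefs = htmlDom
     .Descendants()
     .Where(e => e.Name.LocalName == "a")        // compare local names to ignore the XHTML namespace
     .Select(a => (string)a.Attribute("href"))
     .Where(href => href != null);
 foreach (var href in hrefs)
     Console.WriteLine(href);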

  • No, you try it: XElement.Parse("<html><body><ul class=foo><li><input type=checkbox checked>Hello, world!<li>Second line"); - Pavel Mayorov
  • In HTML it is allowed not to close tags, not to put quotes around an attribute value, and not even to specify the value at all if it is boolean. - Pavel Mayorov
  • Where have you seen fully valid pages on the Internet? The author needs to parse real pages, not spherical ones in a vacuum... - Pavel Mayorov
  • And what is the point of using a third-party service when you can use a full-fledged HTML parser? - Pavel Mayorov
  • @iRumba You did not get the joke. Leaving many tags unclosed is allowed in HTML and will not cause a validation error. Now, if the page is XHTML, then yes, an XML parser will handle it, but such pages are few. - Athari