Regular expressions are designed to handle relatively simple text described by regular languages. They have grown considerably more powerful since their introduction, especially in Perl, whose implementation has inspired many other languages and libraries, but they are still poorly suited (and are unlikely ever to be suited) to processing complex languages such as HTML. Part of what makes HTML so hard to process is its very intricate rules for handling invalid code, inherited from the earliest days of the Internet, when there were no standards at all and every browser vendor had its own quirks.
So, in general, regular expressions are not the best tool for handling HTML; it is usually wiser to use a specialized HTML parser.
CsQuery
License: MIT
One of the modern HTML parsers for .NET. It is based on the validator.nu parser for Java, which is in turn a port of the parser from the Gecko engine (Firefox). This guarantees that the parser handles code the same way modern browsers do.
The API draws its inspiration from jQuery and uses the CSS selector language to select elements. Method names are copied almost one-to-one, so for programmers familiar with jQuery the learning curve is gentle.
It has high performance: on complex queries it is orders of magnitude faster than HtmlAgilityPack + Fizzler.
CQ cq = CQ.Create(html);
foreach (IDomObject obj in cq.Find("a"))
    Console.WriteLine(obj.GetAttribute("href"));
If a more complex query is required, the code hardly gets any more complicated:
CQ cq = CQ.Create(html);
foreach (IDomObject obj in cq.Find("h3.r a"))
    Console.WriteLine(obj.GetAttribute("href"));
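The performance claim above is easy to check for yourself. Here is a minimal benchmark sketch, assuming the CsQuery and HtmlAgilityPack + Fizzler packages are installed; the LoadTestPage helper, iteration count, and selector are placeholders, not part of either library:

using System;
using System.Diagnostics;
using System.Linq;
using CsQuery;
using Fizzler.Systems.HtmlAgilityPack;
using HtmlAgilityPack;

static class ParserBenchmark
{
    static void Main()
    {
        string html = LoadTestPage();   // hypothetical helper: supply any large page
        const int iterations = 100;

        // CsQuery: parse once, then run the complex selector repeatedly.
        Stopwatch sw = Stopwatch.StartNew();
        CQ cq = CQ.Create(html);
        for (int i = 0; i < iterations; i++)
            cq.Find("h3.r a").ToList();
        sw.Stop();
        Console.WriteLine("CsQuery:       {0} ms", sw.ElapsedMilliseconds);

        // HtmlAgilityPack + Fizzler: the same query via the QuerySelectorAll extension.
        sw = Stopwatch.StartNew();
        HtmlDocument hap = new HtmlDocument();
        hap.LoadHtml(html);
        for (int i = 0; i < iterations; i++)
            hap.DocumentNode.QuerySelectorAll("h3.r a").ToList();
        sw.Stop();
        Console.WriteLine("HAP + Fizzler: {0} ms", sw.ElapsedMilliseconds);
    }

    static string LoadTestPage()
    {
        // Hypothetical: read the page under test from disk.
        return System.IO.File.ReadAllText("page.html");
    }
}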
HtmlAgilityPack
License: Ms-PL
The oldest, and therefore the most popular, HTML parser for .NET. Age does not imply quality, however: for example, the critical bug "Incorrect parsing of HTML4 optional end tags" has been left open for five years (!!!), causing tags whose closing tags are optional to be handled incorrectly. The API has its oddities as well: for example, when nothing is found, it returns null rather than an empty collection.
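A minimal repro sketch of the optional-end-tag problem (the markup is a made-up example; per the HTML rules, each li below is implicitly closed by the next one, so both should come out as siblings, while HtmlAgilityPack versions affected by the bug nest the second li inside the first):

using System;
using HtmlAgilityPack;

static class OptionalEndTagDemo
{
    static void Main()
    {
        var hap = new HtmlDocument();
        hap.LoadHtml("<ul><li>One<li>Two</ul>");

        // Dump the tree; affected versions report the second li
        // as a child of the first rather than as its sibling.
        Print(hap.DocumentNode, 0);
    }

    static void Print(HtmlNode node, int indent)
    {
        Console.WriteLine("{0}{1}", new string(' ', indent), node.Name);
        foreach (HtmlNode child in node.ChildNodes)
            Print(child, indent + 2);
    }
}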
Elements are selected with XPath rather than CSS selectors. On simple queries the code is more or less readable:
HtmlDocument hap = new HtmlDocument(); hap.LoadHtml(html); HtmlNodeCollection nodes = hap.DocumentNode.SelectNodes("//a"); if (nodes != null) foreach (HtmlNode node in nodes) Console.WriteLine(node.GetAttributeValue("href", null));
However, if complex queries are needed, XPath is poorly suited to emulating CSS selectors:
HtmlDocument hap = new HtmlDocument();
hap.LoadHtml(html);
HtmlNodeCollection nodes = hap.DocumentNode.SelectNodes(
    "//h3[contains(concat(' ', @class, ' '), ' r ')]/a");
if (nodes != null)
    foreach (HtmlNode node in nodes)
        Console.WriteLine(node.GetAttributeValue("href", null));
Fizzler
License: LGPL
An add-on for HtmlAgilityPack that allows the use of CSS selectors.
HtmlDocument hap = new HtmlDocument();
hap.LoadHtml(html);
foreach (HtmlNode node in hap.DocumentNode.QuerySelectorAll("h3.r a"))
    Console.WriteLine(node.GetAttributeValue("href", null));
AngleSharp
License: BSD (3-clause)
A new player on the parser field. Unlike CsQuery, it is written from scratch, by hand, in C#. It also includes parsers for other languages.
The API is based on the official JavaScript specification of the HTML DOM. In places it has oddities that .NET developers will find unusual (for example, accessing a nonexistent index in a collection returns null instead of throwing an exception; there is a separate Url class; the namespaces are very granular, so even basic use of the library requires three using directives; and so on), but overall nothing critical.
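For reference, the three using directives in question. This is a sketch against AngleSharp versions contemporary with this comparison (roughly the 0.9.x line); the exact namespaces are an assumption, and later releases moved them around:

using AngleSharp.Dom;           // IElement
using AngleSharp.Dom.Html;      // IHtmlDocument
using AngleSharp.Parser.Html;   // HtmlParser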
Another oddity is that the library drags in the Microsoft BCL Portability Pack, so when you install AngleSharp via NuGet, do not be surprised to find three additional packages pulled in: Microsoft.Bcl, Microsoft.Bcl.Build and Microsoft.Bcl.Async.
HTML processing is simple:
IHtmlDocument angle = new HtmlParser(html).Parse(); foreach (IElement element in angle.QuerySelectorAll("a")) Console.WriteLine(element.GetAttribute("href"));
And it stays nearly as simple when more complex logic is needed:
IHtmlDocument angle = new HtmlParser(html).Parse(); foreach (IElement element in angle.QuerySelectorAll("h3.ra")) Console.WriteLine(element.GetAttribute("href"));
Now for the scary and terrible regular expressions. Using them is undesirable, but sometimes the need arises, since parsers that build a DOM are noticeably hungrier than Regex: they consume more CPU time and memory.
If it does come to regular expressions, you need to understand that you cannot build a universal and absolutely reliable solution on top of them. However, if you want to parse one specific site, this problem may not be so critical.
For God's sake, do not turn regular expressions into an unreadable mess. You do not write C# code on one line with single-letter variable names, so there is no need to mangle regular expressions either. The regular expression engine in .NET is powerful enough to let you write readable code.
For example, here is slightly modified code for extracting links from the question:
Regex reHref = new Regex(@"(?inx)
    <a \s [^>]*
        href \s* = \s*
            (?<q> ['""] )
                (?<url> [^""]+ )
            \k<q>
    [^>]* >");
foreach (Match match in reHref.Matches(html))
    Console.WriteLine(match.Groups["url"].ToString());
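To illustrate the reliability caveat above: the same regular expression happily "finds" links inside HTML comments, where any real parser correctly sees no element at all. A minimal sketch (the commented-out markup is a made-up example):

using System;
using System.Text.RegularExpressions;

static class RegexPitfallDemo
{
    static void Main()
    {
        // The link exists only inside an HTML comment, so a DOM parser
        // would never return it, but the regex matches it anyway.
        string html = "<body><!-- <a href='http://example.com/hidden'>dead</a> --></body>";

        Regex reHref = new Regex(@"(?inx)
            <a \s [^>]*
                href \s* = \s*
                    (?<q> ['""] )
                        (?<url> [^""]+ )
                    \k<q>
            [^>]* >");

        foreach (Match match in reHref.Matches(html))
            Console.WriteLine(match.Groups["url"].Value);  // prints the 'hidden' URL
    }
}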