Parsing relative URLs in HTML markup

Question

I need to extract all the relative paths from HTML markup. I put together the following regular expression:

@"(?:src|href)=""([^#](?!http[s]*[:])[^/]{2}(([a-z0-9-.]*/)*)([a-z0-9-.]*?[a-z0-9-]*!?.[az]{2,4})(?!#)\w*\W*)"""

In general, it works as it should when used in, for example, JavaScript. Bare anchors like #yakor are ignored correctly, but in C# there is a problem with links that carry an anchor, like index.html#yakor: they are not simply ignored.

I built the expression in this online tester, but that one is for JavaScript.

Try escaping # like this: \# or like this: \x23 - nick_n_a

Community spirit ♦ one · Accepted Answer · 2016-08-10T20:09:09

If the problem is only in parsing the link, as stated in the comments, then it is better not to reach for regular expressions again and to use the proper Uri class instead.

Example:

    var uri1 = new Uri("http://www.google.com/index.html#yakor", UriKind.RelativeOrAbsolute);
    var uri2 = new Uri("/index.html#yakor", UriKind.RelativeOrAbsolute);
    Console.WriteLine(uri1.IsAbsoluteUri); // true
    Console.WriteLine(uri2.IsAbsoluteUri); // false

And for parsing the HTML itself, it is better to follow the advice from here.

In general, Uri made it simpler to cut off absolute paths, but paths like //site.ru/img.jpg still have to be handled manually.
As I understand it, this is a kind of abbreviated alternative to http:// for src and href.
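The manual handling of //site.ru/img.jpg mentioned above could look roughly like this. It is only a minimal sketch, not the answerer's code: the class RelativeLinks and the methods IsRelativePath and StripFragment are hypothetical names made up for illustration, and the sketch assumes that bare anchors (#yakor) and protocol-relative links should both be excluded from the "relative paths" result.

```csharp
using System;

// Hypothetical helper class; the names are illustrative, not from the thread.
public static class RelativeLinks
{
    // Returns true when an href/src value is a relative path on the same site.
    public static bool IsRelativePath(string value)
    {
        if (string.IsNullOrWhiteSpace(value)) return false;
        value = value.Trim();

        // Bare anchors such as "#yakor" only point into the current page.
        if (value.StartsWith("#", StringComparison.Ordinal)) return false;

        // Protocol-relative links such as "//site.ru/img.jpg" name another host,
        // so they are rejected by hand before Uri gets involved.
        if (value.StartsWith("//", StringComparison.Ordinal)) return false;

        // Root-relative paths are accepted by hand as well, because a leading '/'
        // may be interpreted as an absolute file path by Uri on some platforms.
        if (value.StartsWith("/", StringComparison.Ordinal)) return true;

        // Anything Uri can parse as absolute (http:, https:, mailto:, ...) is not relative.
        return !Uri.TryCreate(value, UriKind.Absolute, out _);
    }

    // Drops the "#fragment" part, so "index.html#yakor" becomes "index.html".
    public static string StripFragment(string value)
    {
        int hash = value.IndexOf('#');
        return hash < 0 ? value : value.Substring(0, hash);
    }
}
```

With this sketch, IsRelativePath returns true for index.html#yakor and /index.html#yakor, and false for #yakor, //site.ru/img.jpg and http://www.google.com/index.html#yakor. The attribute values themselves would still come from an HTML parser rather than from a regular expression.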
