Tell me, please, working Regexp (.NET) for hyperlinks. Now I use the regex view:

@"(?<protocol>http(s)?)://(?<server>([A-Za-z0-9-]+\.)*(?<basedomain>[A-Za-z0-9-]+\.[A-Za-z0-9]+))+((:)?(?<port>[0-9]+)?(/?)(?<path>(?<dir>[A-Za-z0-9\._\-/]+)(/){0,1}[A-Za-z0-9.-/_]*)){0,1}" 

But parsit not all links, unfortunately, for example, does not accept such:

http://regexlib.com/%28A%28-umS_xFKaqoVVf9qVIJUf1Zy7GjFNovSUv_QSprOszdQi3qMJRTXbLp9XIzGZOdY9B8Xq3gtGPTkGYEe5C6Rg6XjA0fwU_JkMeaaE2ONmrRbxhFFfAt9Y-AEfyujh9NpzsN268y6Dh25xbgqyzTzjkY8AKB8_7uLPmDk2wgufsFSxx39e269HFBLoTs8wMhX0%29%29/DisplayPatterns.aspx

This link can be parsed by another regular program:

 "^(ht|f)tp(s?)\:\/\/[0-9a-zA-Z]([-.\w]*[0-9a-zA-Z])*(:(0-9)*)*(\/?)([a-zA-Z0-9\-\.\?\,\'\/\\\+&%\$#_]*)?$" 

But it breaks off when it encounters a text of the form:

''http://edition.cnn.com/\"javascript:CNN_handleOverlay('opt_out_cnn')\"/''.

  • and what's the record: <protocol>, <basedomain> ...?) - timka_s
  • These are named groups, it’s possible not to call them, but to call them by their ordinal number, but it’s more convenient for me. - Merlin
  • look at 8 useful regexp with a clear analysis , in the comments dispute the suitability of examples, but still - Specter
  • I will quote from the comments "for learning will roll, but for real use is not very," so it is, I have already tried them. In general, if there are no good regulars, I will go myself to compose then, maybe I will create something full-fledged. - Merlin
  • If it is easy, please provide a URL for which it is not working, for example, the URL of this question is resolved to: protocol => 'http' path => 'questions / 51778 /' server => 'hashcode.ru' basedomain => 'hashcode .ru 'dir =>' questions / 51778 / ' http://www.tyres.spb.ru/index.php?mid=31 => protocol =>' http 'path =>' index.php 'server => 'www.tyres.spb.ru' basedomain => 'spb.ru' dir => 'index.php' - chernomyrdin

2 answers 2

You are here -> Analysis of the URI structure

It was necessary to slash off screen:
^(([^:\/?#]+):)?(\/\/([^\/?#]*))?([^\?#]*)(\?([^#]*))?(#(.*))?

  • Is there a regular ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))? but it seems not for .Net - Merlin
  • Regulars - they (with a few exceptions) are common) No subtleties are used here) - timka_s
  • one
    It does not work in general. - Merlin
  • With escaped slashes, it also breaks off on the line of the form edition.cnn.com \\\ "javascript: CNN_handleOverlay ('no_cookie_cnn') \\\" / - Merlin
  • one
    You somehow do something wrong - everything finds it) - timka_s

So here it is, the fish of my dreams!

 (http|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,@?^=%&amp;:/~\+#]*[\w\-\@?^=%&amp;/~\+#])? 

Copes with many types of html links, for example, with such:

 http://homenet.beeline.ru/index.php?showforum=783 http://regexlib.com/%28A%28-umS_xFKaqoVVf9qVIJUf1Zy7GjFNovSUv_QSprOszdQi3qMJRTXbLp9XIzGZOdY9B8Xq3gtGPTkGYEe5C6Rg6XjA0fwU_JkMeaaE2ONmrRbxhFFfAt9Y-AEfyujh9NpzsN268y6Dh25xbgqyzTzjkY8AKB8_7uLPmDk2wgufsFSxx39e269HFBLoTs8wMhX0%29%29/DisplayPatterns.aspx http://edition.cnn.com/\\\"javascript:CNN_handleOverlay('no_cookie_cnn')\\\"/" 

If you suddenly find that this regular program does not cope with any text, please write here, we will continue to search or refine.