Suppose we have:

string a = <p style="text-align:center;"><strong>Primary Link: <a href="http://www.mediafire.com/download/90wqj6d0n86h7z1/YandereSimMay7th.rar">http://www.mediafire.com/download/90wqj6d0n86h7z1/YandereSimMay7th.rar</a></strong></p> 

and you need to trim it on the left and on the right, so that the result is:

 string a = http://www.mediafire.com/download/90wqj6d0n86h7z1/YandereSimMay7th.rar 

I need to delete the characters before "http" and after "rar", and also remove the duplicate copy of the URL. How can this be done? Please answer with an example, since I am new to C++.

  • If you are new to C++, use a language in which you are not new. And yes, for parsing HTML people usually use a ready-made parser rather than reinventing the wheel. - VladD
  • If you have been given an exhaustive answer, mark it as correct (the check mark next to the chosen answer). - Nicolas Chabanovsky

1 answer

It depends on what you need. If you want the first URL anywhere in the text, that is one approach; if you want the first URL inside an <a> tag, the approach is slightly different, and so on. You can, for example, use regular expressions.
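As an illustration of the regular-expression route, here is a minimal sketch using std::regex from the standard library. The function name first_href_regex is made up for this example, and the pattern assumes the attribute is written exactly as href="..." with double quotes, which real HTML does not guarantee.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the value of the first href="..." attribute,
// or an empty string if no such attribute is found.
// Assumes double-quoted attribute values with no spaces around '='.
std::string first_href_regex(const std::string& html) {
    // Capture group 1 holds everything between the quotes.
    static const std::regex href_re("href=\"([^\"]*)\"");
    std::smatch match;
    if (std::regex_search(html, match, href_re)) {
        return match[1].str();
    }
    return "";
}
```

Calling first_href_regex(a) on the string from the question should give just the mediafire URL, because the capture group stops at the closing quote.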

Since you are a beginner, let's simplify the task: find href="URL" and extract the URL from it.

Find the position of href=" in the string

 size_t pos = a.find("href=\""); 

and, if it is found, cut off everything to the left of it, also skipping the 6 characters of href=" itself

 if (pos != string::npos) a = a.substr(pos + 6); 

Then find the closing quote and cut off everything from it to the right.

 pos = a.find('"'); if (pos != string::npos) a = a.substr(0,pos); 

That's all.
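Put together, the steps above can be sketched as one function. The name extract_href is invented for this example. Unlike the snippets above, it does not modify the original string, and it returns an empty string when either href=" or the closing quote is missing; that behavior is a choice made for this sketch.

```cpp
#include <string>

// Hypothetical helper combining the steps: find href=", skip past it,
// then take everything up to the next double quote.
std::string extract_href(const std::string& html) {
    const std::string marker = "href=\"";   // 6 characters
    std::size_t start = html.find(marker);
    if (start == std::string::npos) {
        return "";                           // no href attribute at all
    }
    start += marker.size();                  // first character of the URL
    std::size_t end = html.find('"', start); // closing quote of the attribute
    if (end == std::string::npos) {
        return "";                           // unterminated attribute value
    }
    return html.substr(start, end - start);
}
```

With the string from the question, a = extract_href(a); should leave only the URL from the href attribute, so the second copy in the link text is dropped as well.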

For this particular case :) You understand that pulling a URL out of an <img> tag would have to be done a little differently, and you have already been advised to look at a parser. One more thing, from experience: HTML is not XML. It allows liberties even within the standard, and if even XML turns up malformed in practice, what can you expect from HTML... I'm exaggerating, but 80% of the code will deal with HTML errors and only 20% with the parsing itself :)