Hello! My task is to extract one table (from <table ... to </table>) from a large mess of HTML. This table differs from the rest in that its opening tag contains the following:

class="table01" 

Based on this, I wrote the following pattern:

 '/<table .* class="table01" .*>[\S\s]*<\/table>/Uix' 

In the end I get nothing. The same happens with this variant:

 $pregTable = '/<table .*? class="table01" .*?>[\S\s]*?<\/table>/ix'; 

Here is the code itself:

    $file = file_get_contents('test.html');
    $pregTable = '/<table .* class="table01" .*>[\S\s]*<\/table>/Uix';
    $arrTable = array();
    preg_match_all($pregTable, $file, $arrTable, PREG_SET_ORDER);
    print_r($arrTable);

I have tried a lot of different options and struggled with this all day, but nothing works. I get either the text from the beginning of the desired table to the closing tag of the last table (if I use neither ? nor the U modifier), or nothing at all (if I use them). What am I doing wrong here?

7 answers

The easiest and most effective way in this case is to parse HTML using DOM and get the table via XPath:

    $text = <<<EOS
    <body>
    <table class="table01">
    <tr><th>First table</th></tr>
    <tr><td><table><tr><th>Inner <table><tr><th></th></tr></table> table</th></tr></table></td></tr>
    <tr><td><table><tr><th>Second inner table</th></tr></table></td></tr>
    </table>
    <table>
    <tr><td>Second outer table</td></tr>
    </table>
    </body>
    EOS;

    $dom = new DOMDocument();
    $dom->loadHTML($text);
    $xpath = new DOMXPath($dom);
    $nodes = $xpath->evaluate('//table[@class="table01"]');
    var_dump($dom->saveXML($nodes->item(0)));

However, if you wish, you can solve this problem with a regular expression. Nested tables are then handled with a recursive pattern:

    $class = 'table01';

    // Any character that does not begin a <table> or </table> tag
    $any = "(?: [^<] | <(?!/?table\b) )";

    // An opening and a closing <table> tag with any number of $any characters between them,
    // or a recursive call to subpattern #2 (subpattern #1 is the quote, see below)
    $inner = "(<table[^>]*> (?> $any | (?2) )+? </table>)";

    // The same as $inner, but with an extra attribute on the <table> tag.
    // The 's' modifier is not needed here because the '.' metacharacter is not used,
    // and the 'u' modifier is not needed because we operate only on ASCII characters.
    $pattern = "~<table\b[^>]*\bclass=(\"|')?$class\\1[^>]*> (?> $any | $inner )+ </table>~xi";

    preg_match($pattern, $text, $m);
    var_dump($m);
  • Finished the regex. Yesterday at one o'clock in the morning I rushed it a bit :) - Ilya Pirogov

How can I put this... there are two options:

the first:

 preg_match('~<table.*?>(.*?)</table>~is', $content, $m ); 

and second:

 preg_match('~<table.*?>(.*)</table>~is', $content, $m ); 

The only difference is "?"

The first will not work if there is another nested table inside the table.

And the second option will not work when there are 2 tables on the page in parallel.
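The difference between the two variants can be seen on a small input. This is only an illustrative sketch: the `$content` string is a made-up sample with one nested table and one sibling table, not the asker's file.

```php
<?php
// Hypothetical sample: a table with a nested table, then a sibling table.
$content = '<table id="a"><tr><td><table id="inner"><tr><td>x</td></tr></table></td></tr></table>'
         . '<p>text</p>'
         . '<table id="b"><tr><td>y</td></tr></table>';

// Lazy variant: stops at the FIRST </table>, which closes the nested table,
// so the outer table is cut short.
preg_match('~<table.*?>(.*?)</table>~is', $content, $lazy);

// Greedy variant: runs to the LAST </table> in the input,
// so the sibling table is swallowed as well.
preg_match('~<table.*?>(.*)</table>~is', $content, $greedy);

echo $lazy[0], "\n";
echo $greedy[0], "\n";
```

Here the lazy match ends after the inner table's closing tag, while the greedy match covers the whole string.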

The first rule of parsing says: before you look at the output, look at the input (this is my own rule, I came up with it after a lot of suffering :))

So check what you are actually trying to parse. Most likely you are getting a different page from the one you expect, or empty text altogether (for example, because outgoing requests are prohibited).

  • Why two options? Your explanation is clear, but I wrote that there is only one such table, and it differs from the others only by class="table01". It has no nested tables. The pattern suggested by Pavel Vladimirov, "|<table.*class=\"table01\".*</table>|sU", should work... But for some reason it does not! - Roman St

Since we started talking about DOM parsers:

http://simplehtmldom.sourceforge.net/manual.htm

    $html = file_get_html('test.htm');
    $ret = $html->find('table[class=table01]');

    This should work with the "s" and "U" modifiers:

     preg_match("|<table.*class=\"table01\".*</table>|sU",$text,$match); 
    • I understand perfectly well what your pattern means and how it works, but oddly enough it still returns an empty array. - Roman St
    • Post a sample of the HTML text or a link to the file you are parsing. Most likely there is an error in the HTML. - Pavel Vladimirov
    • I'll post it later... The thing is, if you search for the tags individually (for example, "|<table.*class=\"table01\">|" or "|</table>|"), everything is fine. The HTML document itself is very large. Maybe there is some limit on the size? I'm at a loss. - Roman St
    • This is how I usually check big regexes: I cut the regex, say, in half. If it returns what it should, I add another part, and so on. That way you can find the place where it misbehaves. As for the large document, I can't say anything about size limits. I did not find such a limitation in the official documentation. - Pavel Vladimirov
    • The regex is not big; the HTML is. I tried your method of gradually growing the regex before I posted the question here. The error occurs exactly at the moment I start controlling the greediness of the quantifiers, with a question mark or with the U modifier. - Roman St

    Frankly speaking, this is not a trivial task for a simple regex. Quantifiers come in two kinds: lazy and greedy. A lazy expression can stop at the end of a table nested inside the desired one, giving you only the first piece of the desired table, which is clearly not enough. A greedy one can grab the desired table plus another half a page of code, which is also unacceptable. Two ways out come to mind:
    1. Use a PHP parser, pass the code through it and find the desired table using the class methods.
    2. Write your own function that collects all opening and closing tags and remembers their positions. Then walk through the remembered positions and, knowing where the table begins and ends, extract everything in between.
    These are the only ideas so far.
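The second idea above could be sketched roughly as follows. This is an assumption-laden illustration, not the answerer's code: the function name `extractTableByClass` is hypothetical, the class is matched exactly (no multi-class attributes), and the HTML is assumed to have balanced table tags after the wanted opening tag.

```php
<?php
/**
 * Hypothetical helper: returns the first <table> whose opening tag carries
 * exactly class="$class", nested tables included, or null if not found.
 */
function extractTableByClass(string $html, string $class): ?string
{
    // Collect every opening and closing <table> tag with its byte offset.
    if (!preg_match_all('~<table\b[^>]*>|</table\s*>~i', $html, $tags, PREG_OFFSET_CAPTURE)) {
        return null;
    }

    $needle = '~\bclass\s*=\s*(["\']?)' . preg_quote($class, '~') . '\1(?=[\s>])~i';
    $start  = null; // offset of the wanted opening tag, once seen
    $depth  = 0;    // nesting depth counted from the wanted table

    foreach ($tags[0] as [$tag, $offset]) {
        $isClosing = $tag[1] === '/';

        if ($start === null) {
            // Still looking for the opening tag that carries the class.
            if (!$isClosing && preg_match($needle, $tag)) {
                $start = $offset;
                $depth = 1;
            }
            continue;
        }

        $depth += $isClosing ? -1 : 1;
        if ($depth === 0) {
            // This closing tag balances the wanted opening tag.
            return substr($html, $start, $offset + strlen($tag) - $start);
        }
    }

    return null; // opening tag not found, or the table was never closed
}

// Made-up sample: a plain table, the wanted table with a nested one, another plain table.
$html = '<table><tr><td>first</td></tr></table>'
      . '<table class="table01"><tr><td><table><tr><td>inner</td></tr></table></td></tr></table>'
      . '<table><tr><td>last</td></tr></table>';

echo extractTableByClass($html, 'table01'), "\n";
```

Because only the tag positions are matched, each regex here is short and linear, so it should not run into the greedy/lazy dilemma described above.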

    • Why wouldn't it be solvable? Extracting from the "big text" everything from <table ... to ... </table> is, I think, exactly what regular expressions are for. - Roman St
    • If you are sure that there is only one such table on the page, then why not. But if you have something like <table> <table> </table> </table> <table> </table> on the page, you won't get off lightly. - ling
    • Yes, my HTML is just as tangled as what you wrote! But if you read the question carefully, you will see that the desired table has class="table01" and, I will add, it has no nested tables. The task would seem to be much easier! '/<table.*? class="table01"[\S\s]*?<\/table>/' means: we take everything starting at <table where, after an unknown number of characters (the ID and other non-unique table attributes), class="table01" occurs, and then we take all printable characters and whitespace (including line breaks) up to the first closing </table>. But neither this pattern nor any similar one works =((( - Roman St
    • Well, if so, then $pattern = '/<table[^>]+class="table01"[^>]+> (). - ling
    • I see the following possible reasons why the regex does not work for you: the regex is incorrect (checked, so that is not it); there is an error in the HTML, maybe </table> is written differently there (look at the source HTML more carefully); your HTML is truncated when it is read from the file (try writing it to another file or to the screen and check); the regex is processed with an error (see what the preg_last_error function returns). Nothing more can be said without debugging the specific case. - Pavel Vladimirov
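The last check in that list could look roughly like this. It is a minimal sketch: the wrapper name `pregMatchChecked` is hypothetical, and the sample HTML is made up. The point is only that `preg_match()` returning `false` is different from "no match", and `preg_last_error()` says why.

```php
<?php
/**
 * Hypothetical wrapper: runs preg_match() and throws if PCRE reports an
 * error, instead of letting the failure look like "nothing found".
 */
function pregMatchChecked(string $pattern, string $subject): array
{
    $result = preg_match($pattern, $subject, $matches);

    if ($result === false) {
        // Map the numeric code to the name of the corresponding constant.
        $names = [
            PREG_INTERNAL_ERROR        => 'PREG_INTERNAL_ERROR',
            PREG_BACKTRACK_LIMIT_ERROR => 'PREG_BACKTRACK_LIMIT_ERROR',
            PREG_RECURSION_LIMIT_ERROR => 'PREG_RECURSION_LIMIT_ERROR',
            PREG_BAD_UTF8_ERROR        => 'PREG_BAD_UTF8_ERROR',
            PREG_BAD_UTF8_OFFSET_ERROR => 'PREG_BAD_UTF8_OFFSET_ERROR',
        ];
        $code = preg_last_error();
        throw new RuntimeException('PCRE error: ' . ($names[$code] ?? "code $code"));
    }

    return $matches; // an empty array means a genuine "no match"
}

// Made-up sample input; the pattern is the one from the question.
$html = '<p>intro</p><table class="table01"><tr><td>x</td></tr></table>';
$m = pregMatchChecked('/<table .* class="table01" .*>[\S\s]*<\/table>/Uix', $html);
echo $m[0], "\n";
```

On a small input like this the pattern matches; the wrapper only becomes interesting when the same call fails on a large file.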

    Thanks to everyone for the help and advice! Although the question was not solved, I gained new knowledge. In the end I solved my problem in a rather radical way:

     function getStrBetween($string, $from, $to)
     {
         // Position of the opening marker; give up if it is missing
         $start = stripos($string, $from);
         if ($start === false) {
             return false;
         }
         // Cut off everything before the opening marker
         $prepared = substr($string, $start);
         // Position of the closing marker in the remaining text
         $end = stripos($prepared, $to);
         if ($end === false) {
             return false;
         }
         // Cut off everything after the closing marker
         return substr($prepared, 0, $end + strlen($to));
     }

    It gives the desired result, but the question of why the regex did not work here remains open for anyone who is interested.

    • So what did preg_last_error() return? - Ilya Pirogov
    • Just tried it! With the pattern '/<table .* class="table" [\S\s]* <\/table>/Uix', preg_last_error() prints 2 - Roman St
    • And this is what it means: 2 - PREG_BACKTRACK_LIMIT_ERROR - the backtrack limit was exhausted. And what does that mean? o_O - Roman St
    • php.net/manual/en/ ... In the very first comments they say that a limit of 100M (the default is 100k) works without any problems on a netbook. Try playing with this value. - Ilya Pirogov

    Attention, the correct answer!

     preg_last_error() 

    returned error code 2, which means

     2 - PREG_BACKTRACK_LIMIT_ERROR - the backtrack limit was exhausted

    I simply had to uncomment

     pcre.backtrack_limit=100000 

    in php.ini and increase this limit: my ill-fated file has as many as 146428 characters!
    Now everything is OK, and a simple expression:

     '/<table .* class="table01" [\S\s]* <\/table>/Uix' 

    works great! I hope the information will be useful. Thanks again to everyone for their assistance :)
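For anyone who cannot edit php.ini, the same setting can probably be tried at runtime with `ini_set()`. The sketch below rests on several assumptions: the `$html` string is a made-up stand-in for the large file, both limit values are arbitrary, and the JIT is switched off because the limit is accounted for differently when it is active.

```php
<?php
// Hypothetical stand-in for the large file: an opening tag with the class,
// followed by a long body and the closing tag.
$html = '<table id="t" class="table01">'
      . str_repeat("<tr><td>row</td></tr>\n", 10000)
      . '</table>';
$pattern = '/<table .* class="table01" .*>[\S\s]*<\/table>/Uix';

// Assumption: disable the PCRE JIT for this demo, if the build has one.
ini_set('pcre.jit', '0');

// A deliberately low limit to reproduce the failure on a long input.
ini_set('pcre.backtrack_limit', '1000');
$low      = preg_match($pattern, $html, $mLow);
$lowError = preg_last_error();

// Raise the limit (an arbitrary value) and retry the same pattern.
ini_set('pcre.backtrack_limit', '10000000');
$high = preg_match($pattern, $html, $mHigh);

var_dump($low, $lowError === PREG_BACKTRACK_LIMIT_ERROR, $high);
```

The lazy `[\S\s]*` has to step through the whole body one character at a time, which is why the number of backtracking steps grows with the size of the file rather than with the size of the pattern.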

    • And I will add: yesterday Viktor recommended these two links, realcode.ru/regexptester and gskinner.com/RegExr. They offer a web interface for developing patterns in real time. Here is another script: danechka.com/tester.php. It is at least not buggy, and you can add any modifiers yourself! Written by a friend of mine! In general, I recommend it :) - Roman St
    • I did recommend this function for debugging - Pavel Vladimirov
    • That's right! I knew about it, but did not attach any importance to it :) - Roman St