PHP regular expressions. Extract lines between tags.

Question

Hello! My task is to extract one table (from <table ..... to </table>) from a large mess of HTML. This table differs from the rest in that its opening tag contains the following attribute:

class="table01"

Based on this, I put together the following pattern:

 '/<table .* class="table01" .*>[\S\s]*<\/table>/Uix'

But in the end I get nothing. The same happens with this variant:

 $pregTable = '/<table .*? class="table01" .*?>[\S\s]*?<\/table>/ix';

Here is the code itself:

 $file = file_get_contents('test.html');
 $pregTable = '/<table .* class="table01" .*>[\S\s]*<\/table>/Uix';
 $arrTable = array();
 preg_match_all($pregTable, $file, $arrTable, PREG_SET_ORDER);
 print_r($arrTable);

I have tried many different variants and struggled all day, but nothing works. I get either the text from the start of the desired table to the closing tag of the last table (if I use neither ? nor the U modifier), or nothing at all (if I use them). What am I doing wrong here?

I recommend realcode.ru/regexptester and gskinner.com/RegExr to help.
I tested my patterns at those links and everything works there, but in practice nothing comes of it.

Answer 1 · 2011-05-24T20:39:09

The easiest and most effective way in this case is to parse HTML using DOM and get the table via XPath:

 $text = <<<EOS
 <body>
 <table class="table01">
   <tr><th>First table</th></tr>
   <tr><td><table><tr><th>Inner <table><tr><th></th></tr></table> table</th></tr></table></td></tr>
   <tr><td><table><tr><th>Second inner table</th></tr></table></td></tr>
 </table>
 <table>
   <tr><td>Second outer table</td></tr>
 </table>
 </body>
 EOS;

 $dom = new DOMDocument();
 $dom->loadHTML($text);
 $xpath = new DOMXPath($dom);
 // Select every <table> whose class attribute is exactly "table01"
 $nodes = $xpath->evaluate('//table[@class="table01"]');
 var_dump($dom->saveXML($nodes->item(0)));

However, if you wish, you can also solve this with a regular expression. The problem of nested tables is then handled with recursive patterns:

 $class = 'table01';

 // Any character that does not start a <table> or </table> tag
 $any = "(?: [^<] | <(?!/?table\b) )";

 // An opening and a closing <table> tag with any number of $any characters between them,
 // or a recursive call to subpattern #2 (subpattern #1 is the quote, see below)
 $inner = "(<table[^>]*> (?> $any | (?2) )+? </table>)";

 // Same as $inner, but with an additional attribute on the <table> tag.
 // The 's' modifier is not needed here, since we do not use the '.' metacharacter.
 // The 'u' modifier is not needed either, since we only operate on ASCII characters.
 $pattern = "~<table\b[^>]*\bclass=(\"|')?$class\\1[^>]*> (?> $any | $inner )+ </table>~xi";

 preg_match($pattern, $text, $m);
 var_dump($m);

Yesterday at one in the morning I was a bit hasty with it :)

Answer 2 · 2011-05-24T17:03:17

How can I explain this... there are two options:

the first:

 preg_match('~<table.*?>(.*?)</table>~is', $content, $m );

and second:

 preg_match('~<table.*?>(.*)</table>~is', $content, $m );

The only difference is "?"

The first will not work if there is another nested table inside the table.

And the second option will not work when there are 2 tables on the page in parallel.
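
To see the difference between the two options in practice, here is a minimal sketch. The sample HTML in $content is made up for illustration and is not taken from the question.

```php
<?php
// Hypothetical sample: a table with a nested table, followed by a sibling table.
$content = '<table id="a"><tr><td><table id="b"></table></td></tr></table>'
         . '<p>text</p><table id="c"></table>';

// Lazy: stops at the first </table>, which closes the nested table "b",
// so the outer table "a" is cut short.
preg_match('~<table.*?>(.*?)</table>~is', $content, $lazy);

// Greedy: runs to the last </table>, so the sibling table "c" is swallowed too.
preg_match('~<table.*?>(.*)</table>~is', $content, $greedy);

echo $lazy[0], "\n";
echo $greedy[0], "\n";
```

Neither match is exactly table "a", which is why nested or sibling tables need either a parser or a smarter pattern.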

The first rule of parsing says: before you look at the output, look at the input (this is my own rule, derived after much suffering :))

So check what you are actually trying to parse. Most likely you are getting a different page from the one you expect, or empty text altogether (for example, if outgoing requests are prohibited).

The explanation you gave is clear, but I wrote that there is only one such table, and it differs from the others only by class="table01".
The pattern suggested by Pavel Vladimirov, "|<table.*class=\"table01\".*</table>|sU", should work... But for some reason it doesn't!

Answer 3 · 2011-05-25T03:48:18

Since we started talking about DOM parsers:

http://simplehtmldom.sourceforge.net/manual.htm

 $html = file_get_html('test.htm');
 $ret = $html->find('table[class=table01]');

Answer 4 · 2011-05-24T14:53:35

This should work with the "s" and "U" modifiers:

 preg_match("|<table.*class=\"table01\".*</table>|sU",$text,$match);

Pavel Vladimirov


I understand perfectly well what your pattern means and how it works, but oddly enough it still returns an empty array. - Roman St
Post a sample of the HTML or a link to the file you are parsing. Most likely there is an error in the HTML. - Pavel Vladimirov
I'll post it later... The thing is, if I search for the tags individually (for example, "|<table.*class=\"table01\">|" or "|</table>|"), everything is fine. The HTML document itself is very large. Maybe there are some limits on size? I can't figure it out. - Roman St
I usually debug large regexes like this: I cut the regex, say, in half. If it returns what it should, I add another part, and so on. This way you can find the place where it stops working correctly. As for the large document, I can't say anything about size limits; I did not find any such limit in the official documentation. - Pavel Vladimirov
The regex isn't big. The HTML is. I already tried your method of gradually growing the regex before I posted the question here. The error occurs exactly at the moment when I start to control the greediness of quantifiers with a question mark or with the U modifier. - Roman St


ling

Answer 5 · 2011-05-24T18:52:17

Frankly, this is not a trivial task for a simple regex. Quantifiers come in two kinds: lazy and greedy. A lazy expression can pull out a table nested inside the desired one plus the first piece of the desired table, which is clearly not enough. A greedy one can pull out the desired table plus another half page of code, which is also unacceptable. I see two ways out:
1. use a PHP parser, pass the code through it, and find the desired table using the class's methods
2. write your own function that extracts all opening and closing tags and records their positions. Then walk through the recorded positions and, once you have the start and end of the table, pull out everything in between.
Those are my only ideas so far.
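
The second idea above (record the opening and closing tags, then pair them up) can be sketched roughly as follows. The function name extractTableByClass and the simple class="..." substring check are assumptions made for illustration; nothing like this appears in the thread.

```php
<?php
// Hypothetical helper: returns the first <table> whose opening tag contains
// class="$class", including any nested tables, or null if there is none.
function extractTableByClass(string $html, string $class): ?string
{
    // Collect every opening and closing <table> tag together with its byte offset.
    preg_match_all('~<(/?)table\b[^>]*>~i', $html, $tags, PREG_SET_ORDER | PREG_OFFSET_CAPTURE);

    $needle = 'class="' . $class . '"';
    $start = null; // offset where the wanted table begins
    $depth = 0;    // nesting depth inside the wanted table

    foreach ($tags as $tag) {
        [$text, $offset] = $tag[0];
        $isClosing = $tag[1][0] === '/';

        if ($start === null) {
            // Still looking for the opening tag of the wanted table.
            if (!$isClosing && stripos($text, $needle) !== false) {
                $start = $offset;
                $depth = 1;
            }
            continue;
        }

        // Inside the wanted table: track nesting of inner tables.
        $depth += $isClosing ? -1 : 1;
        if ($depth === 0) {
            // This is the matching closing tag: cut from start through its end.
            return substr($html, $start, $offset + strlen($text) - $start);
        }
    }

    return null; // no such table, or it is never closed
}
```

Because only tag positions are matched, the pattern never has to span the whole table, so the amount of backtracking stays small even on a large document.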

Extracting everything from <table ... to ... </table> out of a "big text" is, I think, more properly done with regular expressions.
If you are sure there is only one such table on the page, then why not.
But if you have something like this on the page: <table> <table> </table> </table> <table> </table>, then you won't get off lightly.
But if you read the question carefully, you will see that the desired table has class="table01" and, I will add, contains no nested tables.
We take everything starting at <table, followed, after an unknown number of characters (an ID and other non-unique table attributes), by class="table01", and then all printable characters and line breaks (whitespace) up to the first closing </table>.
Well, if so, then $pattern = '/<table[^>]+class="table01"[^>]+>().
I see these possible reasons why the regex does not work for you:
- the regex is wrong (you checked it, so that is not it)
- an error in the HTML: maybe it contains </ table> or something else is written there (look at the source HTML more carefully)
- your HTML gets cut off when it is read from the file (try writing it to another file or to the screen and check)
- the regex is processed with an error (see what the preg_last_error function returns)
Nothing more can be said without debugging the specific case.
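
The last check in that list can be sketched like this. The wrapper name matchOrExplain is made up for illustration; preg_match, preg_last_error and the PREG_* constants are standard PHP.

```php
<?php
// Hypothetical wrapper: runs preg_match and reports PCRE failures explicitly
// instead of silently returning an empty result.
function matchOrExplain(string $pattern, string $subject): array
{
    $result = preg_match($pattern, $subject, $m);

    if ($result === false) {
        // preg_match returns false on error; preg_last_error() gives the code.
        $names = [
            PREG_INTERNAL_ERROR        => 'PREG_INTERNAL_ERROR',
            PREG_BACKTRACK_LIMIT_ERROR => 'PREG_BACKTRACK_LIMIT_ERROR',
            PREG_RECURSION_LIMIT_ERROR => 'PREG_RECURSION_LIMIT_ERROR',
            PREG_BAD_UTF8_ERROR        => 'PREG_BAD_UTF8_ERROR',
        ];
        $code = preg_last_error();
        throw new RuntimeException('PCRE error: ' . ($names[$code] ?? $code));
    }

    // Empty array means "no match"; otherwise the usual match groups.
    return $m;
}
```

An empty array from this function then really means "no match", while an engine failure surfaces as an exception naming the error.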

Roman St

Answer 6 · 2011-05-25T13:35:40

Thanks to everyone for the help and advice! Although the question itself was not resolved, I learned something new. In the end I solved my problem in a rather radical way:

 function getStrBetween($string, $from, $to)
 {
     // Find where the opening marker starts (case-insensitive)
     $start = stripos($string, $from);
     if ($start === false) {
         return false;
     }
     // Cut off the excess at the beginning
     $prepared = substr($string, $start);
     // Find the first closing marker in what is left
     $end = stripos($prepared, $to);
     if ($end === false) {
         return false;
     }
     // Cut off the excess at the end, keeping the closing marker itself
     return substr($prepared, 0, $end + strlen($to));
 }

This gave me the desired result, but the question remains open for anyone interested in why the regex did not work here.

With the pattern '/<table .* class="table" [\S\s]* <\/table>/Uix', preg_last_error() prints 2.
And this is what it means: 2 - PREG_BACKTRACK_LIMIT_ERROR - the backtrack limit has been reached. And what does that mean?
php.net/manual/en/ ... The very first comments there say that a limit of 100m (the default is 100k) works without any problems on a netbook.

Roman St

Accepted Answer · 2011-05-25T14:31:48

Attention, the correct answer!

 preg_last_error()

returned error code 2, which means

 2 - PREG_BACKTRACK_LIMIT_ERROR - the backtrack limit was exhausted

I simply had to uncomment

 pcre.backtrack_limit=100000

in php.ini and increase this limit: my ill-fated file has as many as 146428 characters!
Now everything is OK, and a simple expression:

 '/<table .* class="table01" [\S\s]* <\/table>/Uix'

works great! I hope the information will be useful. Thanks again to everyone for their assistance :)
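
If editing php.ini is not an option, the same setting can be changed for a single script with ini_set. Below is a minimal sketch; the limit value and the generated HTML (a stand-in for the large test.html) are assumptions for illustration only.

```php
<?php
// Raise the limit for this script only. The value is an arbitrary assumption;
// pick one that comfortably exceeds what your input needs.
ini_set('pcre.backtrack_limit', '10000000');

// Hypothetical stand-in for the large test.html from the question.
$file = str_repeat("<p>filler</p>\n", 10000)
      . "<table border=\"0\" class=\"table01\" >\n<tr><td>data</td></tr>\n</table>\n"
      . str_repeat("<p>filler</p>\n", 10000);

$pregTable = '/<table .* class="table01" [\S\s]* <\/table>/Uix';
$found = preg_match($pregTable, $file, $m);

if ($found === false) {
    // The engine gave up; the code says why (2 is the backtrack limit).
    echo 'PCRE error code: ', preg_last_error(), "\n";
} elseif ($found === 1) {
    echo $m[0], "\n";
}
```

Checking for false separately from 0 matters here: 0 means the pattern simply did not match, while false means the engine stopped before it could decide.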

And I will add: yesterday Viktor recommended these two links, realcode.ru/regexptester and gskinner.com/RegExr. They provide a web interface for developing patterns in real time.
Here is another script: danechka.com/tester.php. At least it does not glitch, and you can add any modifiers yourself!
