Parsing a tab-delimited line while ignoring text in angle brackets

Question

Good day, everyone. I have lines like this:

108 уп Упаковка 0</nыЪ>

The delimiter is the tab character. I need to parse the line so that both the non-empty field values and the empty ones are matched; in this case that should give 5 matches. Whatever is inside the angle brackets must not be matched. This is what I came up with (or rather, failed to):

 [0-9а-яА-Яa-zA-Z]+(?!>)|\t

But I could not get it to skip the block in angle brackets. Please help.

Please post text that is as close as possible to what you will actually be working with.

Accepted Answer · 2014-01-31T08:12:22

Updated

Given a string:

 108[\t]уп[\t][\t][\t][\t]УпtакSЁовка[\t][\t]0</nыЪ>

The same string as a table, split on the tab character [\t]:

 +---+--+-+-+-+-----------+-+-+
 |108|уп|-|-|-|УпtакSЁовка|-|0|
 +---+--+-+-+-+-----------+-+-+

"-" is an empty value.

Desired value capture scheme:

 {108}[\t]{уп}[\t]{-}[\t]{-}[\t]{-}[\t]{УпtакSЁовка}[\t]{-}[\t]{0}</nыЪ>

{*} - what we want to capture
[\t] - tab character
"-" - empty value

A regular expression cannot capture the "emptiness" itself.
To work around this, we capture the tab character that precedes each empty value instead.
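For comparison, where a plain string split is available, the empty fields come out directly without capturing tabs. The snippet below is a minimal Python sketch, not the method this answer uses: it assumes the block in angle brackets always sits at the very end of the line, and the names `line` and `split_fields` are illustrative only.

```python
import re

# Sample line from the answer; "\t" stands for the [\t] tab markers.
line = "108\tуп\t\t\t\tУпtакSЁовка\t\t0</nыЪ>"


def split_fields(text):
    # Assumption: the angle-bracket block is the last thing on the line.
    cleaned = re.sub(r"<[^<>]*>$", "", text)
    # str.split keeps an empty string between two consecutive tabs.
    return cleaned.split("\t")


fields = split_fields(line)
print(fields)
```

This is only useful when the host language offers such a split; the rest of the answer stays with a single regular expression.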

The final capture scheme:

 {108}[\t]{уп}{[\t]}-{[\t]}-{[\t]}-[\t]{УпtакSЁовка}{[\t]}-[\t]{0}</nыЪ>

Regular expression:

 @[а-яА-ЯёЁ0-9a-zA-Z]+(?=\t|<)|\t(?=\t)@

Explanation:

 /*
  * [а-яА-ЯёЁ0-9a-zA-Z]+(?=\t|<) - a run of Russian letters, Latin letters and digits
  *                                 that is followed by a tab character or by "<"
  *
  * \t(?=\t) - a tab character immediately followed by another tab
  */

Result:

 /*
  * 1 : 108
  * 2 : уп
  * 3 : [\t]
  * 4 : [\t]
  * 5 : [\t]
  * 6 : УпtакSЁовка
  * 7 : [\t]
  * 8 : 0
  */
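The result can be checked with a short script. This is a minimal sketch in Python rather than in the asker's VBScript: the pattern is the one from this answer with the "@" delimiters removed, and the names `line`, `pattern`, `matches` and `fields` are illustrative. Turning each captured tab into an empty string is an assumed post-processing step that follows from the capture scheme above.

```python
import re

# Sample line from the answer; "\t" stands for the [\t] tab markers.
line = "108\tуп\t\t\t\tУпtакSЁовка\t\t0</nыЪ>"

# The answer's pattern without the "@" delimiters.
pattern = re.compile(r"[а-яА-ЯёЁ0-9a-zA-Z]+(?=\t|<)|\t(?=\t)")

# No capture groups, so findall returns the full text of each match.
matches = pattern.findall(line)

# A captured tab marks an empty field, so replace it with an empty string.
fields = ["" if m == "\t" else m for m in matches]

print(len(matches))
print(fields)
```

The text inside the angle brackets is never matched, because no run of letters inside it is followed by a tab or by "<".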

As it turns out, the VBScript implementation does NOT support lookbehind or the notation with two pluses in a row (++)!
By the way, the expression works as it should without the first lookbehind, but I have not yet managed to rewrite it without the two pluses ...
@sunny, I rewrote the regular expression to fit the VBScript rules.
I corrected it for your character set, and it is an exact match!

Answer 2 · 2014-01-31T06:54:43

 /(?<!\<\/)[\w\s]++(?!\>)/g

This matches everything except what is between "</" and ">", if I'm not mistaken =)

 /(?<!\<\/)[^\<\>]++(?!\>)/g

Or this one, which covers an even wider range of characters.
