Good day!

There is a problem for parsing the mailing address in one table and comparing the found components of the address in another table. Accordingly, it became necessary to isolate a certain entity (in my case, the apartment number) from the line in which there are other data.

For example, we have the following diverse set of lines that contains information about the apartment number and the room (the lines are manually filled by the operator):

FLAT varchar ------------ 12/1/3 --ΠΊΠ²Π°Ρ€Ρ‚ΠΈΡ€Π° 12, ΠΊΠΎΠΌΠ½Π°Ρ‚Ρ‹ 1 ΠΈ 3 134ΠΊ3 --ΠΊΠ²Π°Ρ€Ρ‚ΠΈΡ€Π° 134, ΠΊΠΎΠΌΠ½Π°Ρ‚Π° 3 12ΠΊ.4 --ΠΊΠ²Π°Ρ€Ρ‚ΠΈΡ€Π° 12, ΠΊΠΎΠΌΠ½Π°Ρ‚Π° 4 15k.2 --ΠΊΠ²Π°Ρ€Ρ‚ΠΈΡ€Π° 15, ΠΊΠΎΠΌΠ½Π°Ρ‚Π° 2 21К4 --ΠΊΠ²Π°Ρ€Ρ‚ΠΈΡ€Π° 21, ΠΊΠΎΠΌΠ½Π°Ρ‚Π° 4 27/Π° --ΠΊΠ²Π°Ρ€Ρ‚ΠΈΡ€Π° 27Π°, Π½ΠΎ Π½Π΅ ΠΊΠ²Π°Ρ€Ρ‚ΠΈΡ€Π° 27 ΠΊΠΎΠΌΠ½Π°Ρ‚Π° А 27/2 --Π° это Π²ΠΎΡ‚ ΠΊΠ°ΠΊ Ρ€Π°Π· 27-я ΠΊΠ²Π°Ρ€Ρ‚ΠΈΡ€Π°, ΠΈ ΠΊΠΎΠΌΠ½Π°Ρ‚Π° 2 24 --просто ΠΊΠ²Π°Ρ€Ρ‚ΠΈΡ€Π° 24, Π±Π΅Π· ΠΊΠΎΠΌΠ½Π°Ρ‚ 25-6 --ΠΊΠ²Π°Ρ€Ρ‚ΠΈΡ€Π° 25, ΠΊΠΎΠΌΠ½Π°Ρ‚Π° 6 23.1-3 --ΠΊΠ²Π°Ρ€Ρ‚ΠΈΡ€Π° 23, ΠΊΠΎΠΌΠ½Π°Ρ‚Ρ‹ 1 ΠΈ 3 90.ΠΊ.2 --ΠΊΠ²Π°Ρ€Ρ‚ΠΈΡ€Π° 90, ΠΊΠΎΠΌΠ½Π°Ρ‚Π° 2 71 ΠΊΠΎΠΌΠ½.5 --ΠΊΠ²Π°Ρ€Ρ‚ΠΈΡ€Π° 71, ΠΊΠΎΠΌΠ½Π°Ρ‚Π° 5 11Π² --ΠΊΠ²Π°Ρ€Ρ‚ΠΈΡ€Π° 11Π² 

Since the formatting of such lines does not correspond to any standard (and there is no possibility to influence this - the addresses from the state structure come as one), it was decided to consider the first number in the field as the apartment number, and all subsequent ones refer to room numbering.

In this case, it is assumed that apartment numbers and room numbers will be matched with entities in another table in semi-automatic mode (first select and look for what you can select and find, and give the rest to the user in manual mode).

Thus, the task is reduced to the fact that from such a set of lines (initially there are several tens of thousands of records) get in one column the first number, which will be the number of the apartment, and in the second field - residual information (data on the rooms).

Those. From the source table I would like to get something like this:

 FLAT varchar F_NO varchar ROOM_NO varchar ------------ ------------ --------------- 12/1/3 12 1/3 134к3 134 к3 12к.4 12 к.4 15 15 k.2 21К4 21 K4 27/а 27 А 27/2 27 2 24 24 (Null) 25-6 25 6 23.1-3 23 1-3 90.к.2 90 к.2 71 комн.5 71 комн.5 11в 11в (Null) 

Of course, it would not be bad to parse the ROOM_NO field further, pulling out more or less imputable room numbers from there, but so far it has stopped at least at the apartment numbers.

So, the difficulty is that as a separator can be used an extensive set of characters, from space and decimal point to slashes, hyphens and some letters.

Saw the implementation of analog SUBSTRING_INDEX from MySQL to English. SO, but there is a function (and there are several of its implementations) sharpened by the use of one delimiter, and if you call it several times for a line, sorting through an array of delimiters, something tells me that it will be executed very slowly ...

Now I solve the problem through 'REPLACE', approximately the same way - by applying to the line 'REPLACE' for each separator (to get an apartment). Like that:

 ... CASE WHEN (CHARINDEX('@',REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(_R2.Flat, '/','@'), '\','@'), 'ΠΊ','@'), 'К','@'), --русская К '.','@'), ',','@'), ' ','@'), 'k','@'), 'K','@') --английская K ) > 0) THEN LEFT(_R2.Flat,CHARINDEX('@',REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(_R2.Flat, '/','@'), '\','@'), 'ΠΊ','@'), 'К','@'), --русская К '.','@'), ',','@'), ' ','@'), 'k','@'), 'K','@') --английская K ) - 1) ELSE ISNULL(_R2.Flat,'0') END --CASE 

Of course, this option is also not very pleasant (bulky, not universal, does not take into account the room, not very fast ...).

And something tells me that I am far from being the first from the community who faced with parsing similar addresses and strings in principle, so I would appreciate any recommendations, tips, code and everything that would help solve the problem.

Thanks in advance for your attention and with the Coming! :)

    1 answer 1

    Here's how to break it up into different characters, where @c is the name of your field. In square brackets are listed the characters by which to break (minus and percentage - service, the rest without problems)

     select case when @c like '%[:,.;A-Zа-zА-Яа-я]%' then substring(@c,1, patindex('%[:,.;A-Zа-zА-Яа-я]%',@c)-1) else @c end , case when @c like '%[:,.;A-Zа-zА-Яа-я]%' then substring(@c,patindex('%[:,.;A-Zа-zА-Яа-я]%',@c),255) else null end 

    If you leave only the numbers, you can with the negative:

     select case when @c like '%[^0-9]%' then substring(@c,1, patindex('%[^0-9]%',@c)-1) else @c end , case when @c like '%[^0-9]%' then substring(@c,patindex('%[^0-9]%',@c),255) else null end 

    ps 255 is a habit from old pascal, probably 99 is enough. But if the string is longer than this number, then you need to put 8000 (longer than mssql does not allow the string to be made) or len(@c) .

    • 3
      Why so scary then. Maybe PATINDEX('%[^0-9]%',a) is easier - Mike
    • I did not know that there is a denial ... class - nick_n_a
    • About denial, LIKE patterns, I knew about patterns in PATINDEX separately, but this did not help much))) Thanks for the reply and comments to it! - BlackWitcher