Good day!
There is a problem for parsing the mailing address in one table and comparing the found components of the address in another table. Accordingly, it became necessary to isolate a certain entity (in my case, the apartment number) from the line in which there are other data.
For example, we have the following diverse set of lines that contains information about the apartment number and the room (the lines are manually filled by the operator):
FLAT varchar ------------ 12/1/3 --ΠΊΠ²Π°ΡΡΠΈΡΠ° 12, ΠΊΠΎΠΌΠ½Π°ΡΡ 1 ΠΈ 3 134ΠΊ3 --ΠΊΠ²Π°ΡΡΠΈΡΠ° 134, ΠΊΠΎΠΌΠ½Π°ΡΠ° 3 12ΠΊ.4 --ΠΊΠ²Π°ΡΡΠΈΡΠ° 12, ΠΊΠΎΠΌΠ½Π°ΡΠ° 4 15k.2 --ΠΊΠ²Π°ΡΡΠΈΡΠ° 15, ΠΊΠΎΠΌΠ½Π°ΡΠ° 2 21Π4 --ΠΊΠ²Π°ΡΡΠΈΡΠ° 21, ΠΊΠΎΠΌΠ½Π°ΡΠ° 4 27/Π° --ΠΊΠ²Π°ΡΡΠΈΡΠ° 27Π°, Π½ΠΎ Π½Π΅ ΠΊΠ²Π°ΡΡΠΈΡΠ° 27 ΠΊΠΎΠΌΠ½Π°ΡΠ° Π 27/2 --Π° ΡΡΠΎ Π²ΠΎΡ ΠΊΠ°ΠΊ ΡΠ°Π· 27-Ρ ΠΊΠ²Π°ΡΡΠΈΡΠ°, ΠΈ ΠΊΠΎΠΌΠ½Π°ΡΠ° 2 24 --ΠΏΡΠΎΡΡΠΎ ΠΊΠ²Π°ΡΡΠΈΡΠ° 24, Π±Π΅Π· ΠΊΠΎΠΌΠ½Π°Ρ 25-6 --ΠΊΠ²Π°ΡΡΠΈΡΠ° 25, ΠΊΠΎΠΌΠ½Π°ΡΠ° 6 23.1-3 --ΠΊΠ²Π°ΡΡΠΈΡΠ° 23, ΠΊΠΎΠΌΠ½Π°ΡΡ 1 ΠΈ 3 90.ΠΊ.2 --ΠΊΠ²Π°ΡΡΠΈΡΠ° 90, ΠΊΠΎΠΌΠ½Π°ΡΠ° 2 71 ΠΊΠΎΠΌΠ½.5 --ΠΊΠ²Π°ΡΡΠΈΡΠ° 71, ΠΊΠΎΠΌΠ½Π°ΡΠ° 5 11Π² --ΠΊΠ²Π°ΡΡΠΈΡΠ° 11Π² Since the formatting of such lines does not correspond to any standard (and there is no possibility to influence this - the addresses from the state structure come as one), it was decided to consider the first number in the field as the apartment number, and all subsequent ones refer to room numbering.
In this case, it is assumed that apartment numbers and room numbers will be matched with entities in another table in semi-automatic mode (first select and look for what you can select and find, and give the rest to the user in manual mode).
Thus, the task is reduced to the fact that from such a set of lines (initially there are several tens of thousands of records) get in one column the first number, which will be the number of the apartment, and in the second field - residual information (data on the rooms).
Those. From the source table I would like to get something like this:
FLAT varchar F_NO varchar ROOM_NO varchar ------------ ------------ --------------- 12/1/3 12 1/3 134ΠΊ3 134 ΠΊ3 12ΠΊ.4 12 ΠΊ.4 15 15 k.2 21Π4 21 K4 27/Π° 27 Π 27/2 27 2 24 24 (Null) 25-6 25 6 23.1-3 23 1-3 90.ΠΊ.2 90 ΠΊ.2 71 ΠΊΠΎΠΌΠ½.5 71 ΠΊΠΎΠΌΠ½.5 11Π² 11Π² (Null) Of course, it would not be bad to parse the ROOM_NO field further, pulling out more or less imputable room numbers from there, but so far it has stopped at least at the apartment numbers.
So, the difficulty is that as a separator can be used an extensive set of characters, from space and decimal point to slashes, hyphens and some letters.
Saw the implementation of analog SUBSTRING_INDEX from MySQL to English. SO, but there is a function (and there are several of its implementations) sharpened by the use of one delimiter, and if you call it several times for a line, sorting through an array of delimiters, something tells me that it will be executed very slowly ...
Now I solve the problem through 'REPLACE', approximately the same way - by applying to the line 'REPLACE' for each separator (to get an apartment). Like that:
... CASE WHEN (CHARINDEX('@',REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(_R2.Flat, '/','@'), '\','@'), 'ΠΊ','@'), 'Π','@'), --ΡΡΡΡΠΊΠ°Ρ Π '.','@'), ',','@'), ' ','@'), 'k','@'), 'K','@') --Π°Π½Π³Π»ΠΈΠΉΡΠΊΠ°Ρ K ) > 0) THEN LEFT(_R2.Flat,CHARINDEX('@',REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(_R2.Flat, '/','@'), '\','@'), 'ΠΊ','@'), 'Π','@'), --ΡΡΡΡΠΊΠ°Ρ Π '.','@'), ',','@'), ' ','@'), 'k','@'), 'K','@') --Π°Π½Π³Π»ΠΈΠΉΡΠΊΠ°Ρ K ) - 1) ELSE ISNULL(_R2.Flat,'0') END --CASE Of course, this option is also not very pleasant (bulky, not universal, does not take into account the room, not very fast ...).
And something tells me that I am far from being the first from the community who faced with parsing similar addresses and strings in principle, so I would appreciate any recommendations, tips, code and everything that would help solve the problem.
Thanks in advance for your attention and with the Coming! :)