Hello. How do I write a re.split expression that splits a string on the ; (semicolon) character, provided that the character is not inside a single word?

Example:

Название;"Дат;а";Человек; Грив;22,14;"Пе;тя" 

Should be divided into:

 Название, Дат;а, Человек, Грив, 22,14, Пе;тя 

I understand that the example is rather contrived (there are no words that contain the ; character), but I hope I have explained it clearly.

Thanks.

  • I have updated the answer, check it out - kitscribe

2 answers

In such cases, it is convenient to use the csv module:

    >>> import csv
    >>> lines = 'Название;"Дат;а";Человек;\nГрив;22,14;"Пе;тя"'.splitlines()
    >>> list(csv.reader(lines, delimiter=';'))
    [['Название', 'Дат;а', 'Человек', ''], ['Грив', '22,14', 'Пе;тя']]

You can also use csv for output:

    >>> rows = list(csv.reader(lines, delimiter=';'))  # a list, so rows can be reused below
    >>> import sys
    >>> csv.writer(sys.stdout).writerows(rows)
    Название,Дат;а,Человек,
    Грив,"22,14",Пе;тя

Note that the second row is quoted automatically to protect the comma inside the field. You can use another delimiter, for example a space:

    >>> csv.writer(sys.stdout, delimiter=' ').writerows(rows)
    Название Дат;а Человек 
    Грив 22,14 Пе;тя

or use a special escape character:

    >>> csv.writer(sys.stdout, escapechar='\\', quoting=csv.QUOTE_NONE).writerows(rows)
    Название,Дат;а,Человек,
    Грив,22\,14,Пе;тя

In this case, the backslash escapes the comma (the field separator) inside the field.
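As a rough round-trip check, data written with escapechar and csv.QUOTE_NONE should be readable again if the reader is given the same options. This is only a sketch: the io.StringIO buffers and the variable names are assumptions made for illustration and are not part of the answer.

```python
import csv
import io

row = ['Грив', '22,14', 'Пе;тя']

# Write one row without quoting; the comma inside '22,14' gets a backslash.
buffer = io.StringIO(newline='')
csv.writer(buffer, escapechar='\\', quoting=csv.QUOTE_NONE).writerow(row)
written = buffer.getvalue()

# Read it back with the same options so the backslash is undone.
reader = csv.reader(io.StringIO(written, newline=''),
                    escapechar='\\', quoting=csv.QUOTE_NONE)
restored = list(reader)

print(repr(written))
print(restored)
```

If the reader is created without escapechar, the backslash is likely to stay in the data and the field is split at the comma, so the writer and the reader need to agree on these settings.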

Despite its apparent simplicity, this format has subtle corner cases where it is easy to make mistakes if you try to parse it with your own code. Unless there is a special reason not to, it is better to stick with an existing format and an already tested parser.
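To make the point about subtle corner cases concrete, here is a small sketch that compares a naive str.split with csv.reader. The parse_line helper and the second sample string (with doubled quotes inside a quoted field) are hypothetical and chosen only for illustration.

```python
import csv


def parse_line(line, delimiter=';'):
    """Parse one delimited line with the csv module (hypothetical helper)."""
    return next(csv.reader([line], delimiter=delimiter))


sample = 'Название;"Дат;а";Человек'
# A doubled quote inside a quoted field stands for one literal quote.
tricky = 'a;"say ""hi"";b";c'

naive = sample.split(';')            # also cuts inside the quoted field
parsed = parse_line(sample)          # keeps the quoted field whole
parsed_tricky = parse_line(tricky)   # also understands doubled quotes

print(naive)
print(parsed)
print(parsed_tricky)
```

The naive split breaks "Дат;а" into two pieces, while csv.reader keeps it together and also decodes the doubled quotes, a case that a hand-written parser can easily miss.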

    In this case, it is better to use other modules, such as csv or json.

    But if you want to split a string on a character using regular expressions, you can use the negation sign ^ inside a character class. Examples:

    Split the string on spaces:

    Pattern: r'[^ ]+'

        import re

        result = re.findall(r'[^ ]+', 'Это тестовая строка, чтобы показать как можно разбивать строку по символам')
        print(result)
        # ['Это', 'тестовая', 'строка,', 'чтобы', 'показать', 'как', 'можно', 'разбивать', 'строку', 'по', 'символам']

    Split the string on commas:

        import re

        result = re.findall(r'[^,]+', 'Это тестовая строка, чтобы показать как можно разбивать строку по символам')
        print(result)
        # ['Это тестовая строка', ' чтобы показать как можно разбивать строку по символам']

    Now let's get to your string. You have an interesting task: semicolons inside quotes must be ignored. Even this can be handled by adding alternation ( | ) to the pattern, so that a quoted fragment is tried first:

        import re

        data = []
        lines = ['Название;"Дат;а";Человек;', 'Грив;22,14;"Пе;тя"']
        for line in lines:
            # a quoted fragment is tried first, then any run of non-semicolons
            result = re.findall(r'(".+?"|[^;]+)', line)
            data.extend(result)  # note the shape of the output when extend is used
        print(data)
        # ['Название', '"Дат;а"', 'Человек', 'Грив', '22,14', '"Пе;тя"']

    To get rid of the quotes, you can use the sub function from the same re module. A list comprehension builds a new list of cleaned values, so the list is not modified while it is being iterated over. Altogether, we get this code:

        import re

        data = []
        lines = ['Название;"Дат;а";Человек;', 'Грив;22,14;"Пе;тя"']
        for line in lines:
            result = re.findall(r'(".+?"|[^;]+)', line)
            # strip the surrounding quotes from quoted fragments
            result = [re.sub(r'"(.+)"', r'\1', x) for x in result]
            data.append(result)  # note the shape of the output when append is used
        print(data)
        # [['Название', 'Дат;а', 'Человек'], ['Грив', '22,14', 'Пе;тя']]

    Note that the examples add to the list with different methods, extend and append, which give different output: extend produces one flat list, while append produces a list of lists.



    More information about regular expressions can be found here and here.
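Since the question asks about re.split specifically, one possible pattern is sketched below. It is not taken from either answer: it splits on a semicolon only when an even number of double quotes follows it, and it assumes that quotes are always balanced and never escaped or doubled. The split_outside_quotes helper and the SEMICOLON_OUTSIDE_QUOTES name are hypothetical.

```python
import re

# Split on ';' only when an even number of double quotes follows it,
# i.e. when the semicolon is outside any quoted fragment.
# Assumes quotes are balanced and never escaped or doubled.
SEMICOLON_OUTSIDE_QUOTES = re.compile(r';(?=(?:[^"]*"[^"]*")*[^"]*$)')


def split_outside_quotes(text):
    """Hypothetical helper: split text on unquoted semicolons."""
    parts = SEMICOLON_OUTSIDE_QUOTES.split(text)
    # Trim surrounding whitespace first, then the enclosing quotes.
    return [part.strip().strip('"') for part in parts]


fields = split_outside_quotes('Название;"Дат;а";Человек; Грив;22,14;"Пе;тя"')
print(fields)
```

The lookahead rescans the rest of the string at every semicolon, so this approach may become slow on very long lines, and it does not handle doubled quotes. For real data the csv module is probably still the safer choice.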