Hello, please help with the following task. I have a string:

data = "sadsadsadsfffffffddd dddsfd dsd" 

How do I split it into separate characters? I understand that data.split() is involved, but it is not clear what to pass to split(). Thank you in advance.

    8 answers

    You are overcomplicating it ...

     list(str) 

    data is already a sequence (data[i] works), so there is nothing to call.

    For example, to print each character on a separate line:

     for char in text: print(char) 

    This can be shortened to print('\n'.join(text)). If you need a Python list, just use chars = list(text).

    If you are working with text, use Unicode. Unicode strings in Python are immutable character sequences ( Unicode code points ).

    User-visible letters (grapheme clusters) can consist of several characters. For example, a letter can be represented as a sequence of two characters: U+0435 U+0308 in Unicode, written u'\u0435\u0308' in Python:

     >>> print(u'\u0435\u0308')
     ё

    Each character can be represented in different encodings by one or several bytes. For example, the letter я (U+044F) is encoded as two bytes, 11010001 10001111, in the utf-8 encoding:

     >>> print(u'\u044f')
     я
     >>> u'\u044f'.encode('utf-8')
     b'\xd1\x8f'  # two bytes: 209, 143

    A byte string (the bytes type) is an immutable sequence of bytes in Python.

    The str type is bytes in Python 2; str is Unicode in Python 3.
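
    A small sketch of that difference in Python 3 (the variable names are illustrative and not from the answer): iterating over a str yields one-character strings, while iterating over bytes yields integers.

```python
text = '\u044f!'              # 'я!' as a str: two code points
raw = text.encode('utf-8')    # the same text as bytes: three bytes

chars = list(text)            # one-character strings
byte_values = list(raw)       # integers in range(256)

print(chars)
print(byte_values)
```

    So "splitting into characters" gives different results depending on whether you hold text or encoded bytes.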

    In addition, there is the concept of a code unit (8 bits in utf-8, 16 bits in utf-16). JavaScript strings can often be thought of as sequences of utf-16 code units (this may matter when porting functionality to Python). For example, the smiley 😂 (U+1F602) is represented as two code units, D83D DE02, in the utf-16 (BE) encoding:

     >>> print(u'\U0001F602')
     😂
     >>> u'\U0001F602'.encode('utf-16be')
     b'\xd8=\xde\x02'  # four bytes: 216, 61, 222, 2

    That is, if you have text represented as str in Python 3 (Unicode), then you can treat it as different sequences depending on the task:

     >>> import regex  # $ pip install regex
     >>> text = 'я 😂 \u0435\u0308'  # 6 code points
     >>> print(ascii(text))
     '\u044f \U0001f602 \u0435\u0308'
     >>> regex.findall(r'\X', text)  # 5 grapheme clusters
     ['я', ' ', '😂', ' ', 'ё']  # 5 user-perceived characters
     >>> utf16codeunits(text)  # 7 utf-16 code units
     (1103, 32, 55357, 56834, 32, 1077, 776)
     >>> text.encode('utf-16be')  # 14 bytes in utf-16
     b'\x04O\x00 \xd8=\xde\x02\x00 \x045\x03\x08'
     >>> text.encode('utf-8')  # 12 bytes in utf-8
     b'\xd1\x8f \xf0\x9f\x98\x82 \xd0\xb5\xcc\x88'

    where utf16codeunits() is a helper that returns the utf-16 code units of the text.
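
    The definition of utf16codeunits() is not shown in this answer. One possible sketch, assuming it only needs to return the code units as a tuple of integers (this implementation is a guess, not the author's code):

```python
import struct

def utf16codeunits(text):
    """Return the UTF-16 code units of text as a tuple of integers.

    Hypothetical reconstruction; the original helper is not shown.
    """
    encoded = text.encode('utf-16-be')   # big-endian, no byte order mark
    count = len(encoded) // 2            # each code unit is two bytes
    return struct.unpack('>%dH' % count, encoded)

units = utf16codeunits('\u044f \U0001f602 \u0435\u0308')
print(units)
```

    With the sample text from the session above, this yields the same seven code units.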

      First of all, why split anything at all? If you need to access each character separately, you can do it like this:

       data[i] 

      As for the question itself, .split works like this:

       arr = data.split('<separator character(s)>') 

      This produces the list arr. If you use a space " " as the separator, the list will contain three elements.

      • I would add that on 2.7 (and most likely on any version) data.split('') raises the exception ValueError: empty separator - timka_s
      • I have 3.2 and do not remember such problems ... maybe I have just never split on a space - Izengardjke
      • I wrote it without a space - timka_s
      • I understand how split() works. I want to read a file with file_in = open('in.txt', 'r').read(), so it is read as a single string, then split it into separate characters to get a list of characters, and then do a frequency analysis, i.e. data = [[a, 4], [s, 6], [e, 9]], something like this. If you know another approach or another way to accomplish the task, please write, I will be grateful :) - Rumato
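
      The behaviour discussed in this answer and its comments can be checked with a short sketch (Python 3; the variable names other than data are illustrative):

```python
data = "sadsadsadsfffffffddd dddsfd dsd"

by_space = data.split(' ')     # split on a single space
by_whitespace = data.split()   # no argument: split on runs of whitespace

try:
    data.split('')             # an empty separator is rejected
except ValueError as exc:
    error_message = str(exc)

chars = list(data)             # individual characters, spaces included

print(by_space)
print(error_message)
```

      So split() cannot produce single characters; list(data) or plain iteration does.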

      So this is your real question: how to do a frequency analysis of character occurrences.

      In pseudocode (which is actually JS), it looks like this:

       var str = "contents_of_your_file";
       var res = {};  // plain object used as a character -> count map
       for (var i = 0; i < str.length; i++) {
           var ch = str[i];
           if (!res[ch]) res[ch] = 1;
           else res[ch]++;
       }
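
      A rough Python equivalent of the same loop, as a sketch (the sample text stands in for the file contents and is not from the answer):

```python
text = "hello world"   # stand-in for the contents of the file

counts = {}
for ch in text:
    # dict.get returns 0 for a character that has not been seen yet
    counts[ch] = counts.get(ch, 0) + 1

print(counts)
```

      The string is traversed once, and each character updates its own counter.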

        Code for the analysis:

         text = 'hello world'
         unique_letters = set(text)
         analize = {}
         for letter in unique_letters:
             # one full pass over the text per distinct letter
             analize[letter] = text.count(letter)
         print(analize)
         # => {' ': 1, 'e': 1, 'd': 1, 'h': 1, 'l': 3, 'o': 2, 'r': 1, 'w': 1}
        • Can you imagine how it will work with a string of, say, a couple of hundred kilobytes? - Ilya Pirogov 5:58
        • I can. I tried it on a 50-megabyte text and it runs fast. I understand that it will eat memory; the whole thing can be arranged as a generator. - MyNameIss

         import collections

         stats = collections.defaultdict(lambda: 0)
         with open('some.txt', 'r') as fp:
             for line in fp:
                 for char in line:
                     stats[char] += 1

        Or, to avoid nested loops:

         import collections
         from itertools import chain

         stats = collections.defaultdict(lambda: 0)
         with open('some.txt', 'r') as fp:
             # chain.from_iterable flattens the lines into one stream of characters
             for char in chain.from_iterable(fp):
                 stats[char] += 1
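
        The same single-pass idea can also be sketched with collections.Counter from the standard library, returning pairs shaped like those in the question. The function name char_frequencies and the sample lines are assumptions made for illustration; a list of strings stands in for the open file:

```python
from collections import Counter
from itertools import chain

def char_frequencies(lines):
    """Count characters over an iterable of strings, e.g. an open file."""
    counts = Counter(chain.from_iterable(lines))
    # [[char, count], ...] sorted by character, as in the question
    return [[char, n] for char, n in sorted(counts.items())]

sample_lines = ["abca\n", "bb\n"]   # stands in for a file object
result = char_frequencies(sample_lines)
print(result)
```

        Passing an open file instead of sample_lines should behave the same way, since a file object iterates over its lines.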

          This works well on input of any size thanks to a generator expression: the counts are produced lazily, so no intermediate list is built in memory:

           text = 'hello world'
           indecies = set(text)
           values = (text.count(letter) for letter in indecies)
           analize = dict(zip(indecies, values))
          
          • I have just read everything that was written, thank you very much for your help! I also wrote my own variant: file_in = open('in.txt', 'r').read() exampleData = ",".join("[%s, '%s']" % (file_in.count(i), i) for i in sorted(set(file_in))). The result is, of course, a string. Is there no way in Python to simply cast this string to the list type? Doing list(exampleData) does not quite work. - Rumato
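
          One way to get a real list here, as a sketch, is to build the nested list directly instead of formatting a string first. list() applied to a string only splits it into single characters. The sample text below stands in for the file contents, and example_data is an illustrative name:

```python
text = "hello"   # stand-in for open('in.txt', 'r').read()

# Build [count, character] pairs directly; no string formatting involved
example_data = [[text.count(ch), ch] for ch in sorted(set(text))]

print(example_data)
```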

          The quotes contain the character that separates the characters:

           s=input()
           l=list(''.join(s))
           print(l)

           rgbycm
           ['r', 'g', 'b', 'y', 'c', 'm']

          To simply print the characters separated by a space without square brackets and quotes:

           s=input()
           l=' '.join(s)
           print(l)

           rgbycm
           r g b y c m
          • Note that ''.join(s) is equal to the original string (s). Why such a roundabout way? - insolor
          • "The quotes contain the character that separates the characters" is not true. join is intended to combine a sequence of strings (a list or any other iterable) into a single string with the specified separator. So you first take the string apart into separate characters, then assemble it back into a whole string (getting back what you had: s == ''.join(s)), and then split it again into a list of individual characters. In the first piece of code, the second line can simply be written as l=list(s). - insolor
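
          The point made in these comments can be checked with a short sketch (a fixed string replaces input() so the example runs without user interaction):

```python
s = "rgbycm"

assert ''.join(s) == s    # joining with an empty separator changes nothing

chars = list(s)           # list of individual characters
spaced = ' '.join(s)      # characters separated by spaces

print(chars)
print(spaced)
```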