data is already a sequence (data[i] works), so there is nothing to call.
For example, to print each character on a separate line:
for char in text:
    print(char)
This can be shortened to:
print('\n'.join(text))
If you need a Python list, use:
chars = list(text)
If you are working with text, use Unicode. Unicode strings in Python are immutable sequences of characters (Unicode code points).
User-visible letters (grapheme clusters) can consist of several characters. For example, a single letter can be represented as a sequence of two code points, U+0435 U+0308 in Unicode, which is u'\u0435\u0308' in Python:
>>> print(u'\u0435\u0308')
ё
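As a rough illustration of the gap between code points and user-perceived letters, the sketch below compares the two-code-point form with its precomposed equivalent using the standard unicodedata module. The variable names are made up for this example, and it assumes Python 3.

```python
import unicodedata

# CYRILLIC SMALL LETTER IE followed by COMBINING DIAERESIS
decomposed = '\u0435\u0308'
# NFC normalization composes the pair into a single code point when one exists
composed = unicodedata.normalize('NFC', decomposed)

print(len(decomposed))          # 2 code points
print(len(composed))            # 1 code point
print(composed == '\u0451')     # True: U+0451 CYRILLIC SMALL LETTER IO
print(decomposed == composed)   # False: the strings differ although they look alike
```

Normalizing both sides before comparing is one way to make visually identical strings compare equal, but it does not make len() count grapheme clusters in general.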
Each character can be represented in different encodings by one or several bytes. For example, the letter я (U+044F) is encoded as two bytes, 11010001 10001111, in utf-8:
>>> print(u'\u044f')
я
>>> u'\u044f'.encode('utf-8')
b'\xd1\x8f' # two bytes: 209, 143
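To see that the byte representation depends on the encoding, here is a small sketch that encodes the same letter with three codecs. The choice of codecs is only for illustration (cp1251 is a single-byte Cyrillic encoding), and the code assumes Python 3.

```python
letter = '\u044f'  # я

# Encode the same code point with several codecs (picked for illustration)
encoded = {name: letter.encode(name) for name in ('utf-8', 'utf-16be', 'cp1251')}

for name, raw in encoded.items():
    # list(raw) shows the individual byte values as integers
    print(name, list(raw), len(raw))
```

The same single code point takes two bytes in utf-8, two bytes in utf-16be, and one byte in cp1251.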
A byte string (the bytes type) is an immutable sequence of bytes in Python. The str type is bytes in Python 2 and Unicode in Python 3.
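A minimal sketch of the str/bytes distinction, assuming Python 3 (under Python 2 the types behave differently, as noted above). The variable names are arbitrary.

```python
text = '\u044f'               # str: a sequence of Unicode code points
raw = text.encode('utf-8')    # bytes: a sequence of integers in range(256)

print(type(text).__name__, len(text))   # str 1
print(type(raw).__name__, len(raw))     # bytes 2
print(raw[0])                           # 209: indexing bytes gives an int
print(raw.decode('utf-8') == text)      # True: decoding restores the text
```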
In addition, there is the concept of a code unit (8 bits in utf-8, 16 bits in utf-16). JavaScript strings can often be thought of as sequences of utf-16 code units (this may matter when porting functionality to Python). For example, the smiley 😂 (U+1F602) is represented as two code units, D83D DE02, in the utf-16 (BE) encoding:
>>> print(u'\U0001F602')
😂
>>> u'\U0001F602'.encode('utf-16be')
b'\xd8=\xde\x02' # four bytes: 216, 61, 222, 2
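The two code units come from the standard UTF-16 surrogate-pair arithmetic for code points above U+FFFF. A sketch of that calculation follows; surrogate_pair() is a hypothetical helper written for this example, not a standard library function.

```python
def surrogate_pair(code_point):
    """Return the (high, low) UTF-16 surrogates for a code point above U+FFFF."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError('code point is outside the supplementary planes')
    offset = code_point - 0x10000      # 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits go into the low surrogate
    return high, low

high, low = surrogate_pair(ord('\U0001F602'))
print('%04X %04X' % (high, low))  # D83D DE02
```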
That is, if you have text represented as str in Python 3 (Unicode), you can treat it as different sequences depending on the task:
>>> import regex # $ pip install regex
>>> text = 'я 😂 ё' # 6 code points
>>> print(ascii(text))
'\u044f \U0001f602 \u0435\u0308'
>>> regex.findall(r'\X', text) # 5 grapheme clusters
['я', ' ', '😂', ' ', 'ё'] # 5 user-perceived characters
>>> utf16codeunits(text) # 7 utf-16 code units
(1103, 32, 55357, 56834, 32, 1077, 776)
>>> text.encode('utf-16be') # 14 bytes in utf-16
b'\x04O\x00 \xd8=\xde\x02\x00 \x045\x03\x08'
>>> text.encode('utf-8') # 12 bytes in utf-8
b'\xd1\x8f \xf0\x9f\x98\x82 \xd0\xb5\xcc\x88'
where utf16codeunits() is a helper that returns the utf-16 code units of the text as a tuple of integers.
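The definition of utf16codeunits() is not included here. One possible sketch, which is an assumption and not necessarily the original implementation, encodes the text as big-endian utf-16 and unpacks it into 16-bit integers with the standard struct module:

```python
import struct

def utf16codeunits(text):
    """Return the UTF-16 code units of text as a tuple of integers."""
    encoded = text.encode('utf-16be')   # big-endian, no byte order mark
    count = len(encoded) // 2           # each code unit occupies two bytes
    return struct.unpack('>%dH' % count, encoded)

text = '\u044f \U0001f602 \u0435\u0308'
print(utf16codeunits(text))  # (1103, 32, 55357, 56834, 32, 1077, 776)
```

Using 'utf-16be' rather than plain 'utf-16' avoids the byte order mark that would otherwise appear as an extra code unit.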