A string (str) contains bytes written out as text: how do I extract the bytes from it?

Question

I have a string, use[1] = "строка \n string".encode('utf-8'), which I convert to bytes, put into a Pandas table, and then save the table as CSV. I encode it to bytes so that the string can be restored later, because it contains the \n newline character; if that character is left in as is, the table cannot be read back afterwards. Then I read the table with train_dataset = np.genfromtxt('data', usecols=use[1:4], delimiter=';',dtype=object,skip_header=1). I decode the data from bytes as utf-8, but what I get is a value of type str in which the bytes are spelled out as text.

    for x in range(train_dataset.shape[0]):
        train_dataset[x][0] = train_dataset[x][0].decode('utf-8')
        train_dataset[x][1] = train_dataset[x][1].decode('utf-8')
        train_dataset[x][2] = train_dataset[x][2].decode('utf-8')
    print(train_dataset)

Output:

    "b'\\xd1\\x81\\xd1\\x82\\xd1\\x80\\xd0\\xbe\\xd0\\xba\\xd0\\xb0 \\n \\xd1\\x81\\xd1\\x82\\xd1\\x80\\xd0\\xb8\\xd0\\xbd\\xd0\\xb3'"

The result is of type str. How do I convert it to type bytes?

Accepted Answer · 2019-04-10T12:00:19

There is no need to invent anything. Pandas handles line breaks inside fields on its own:

    In [22]: df = pd.DataFrame({
        ...:     'id': [1,2,3],
        ...:     'text': ['aaa', 'xxx\nyyy\nzzz', 'ccc'],
        ...:     'val': [10,20,30]
        ...: })

    In [23]: df
    Out[23]:
       id           text  val
    0   1            aaa   10
    1   2  xxx\nyyy\nzzz   20
    2   3            ccc   30

    In [24]: print(df.loc[1, 'text'])
    xxx
    yyy
    zzz

    In [25]: df.to_csv('c:/temp/1.csv', index=False)

    In [26]: pd.read_csv('c:/temp/1.csv')
    Out[26]:
       id           text  val
    0   1            aaa   10
    1   2  xxx\nyyy\nzzz   20
    2   3            ccc   30

The resulting CSV file. Note that the field containing line breaks is enclosed in double quotes; otherwise the file would not be valid CSV:

    id,text,val
    1,aaa,10
    2,"xxx
    yyy
    zzz",20
    3,ccc,30
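The quoting rule described above is not specific to Pandas. As a minimal sketch, assuming the same three rows as in the example, Python's standard csv module (swapped in here only to keep the illustration dependency-free) quotes the multi-line field in the same way and reads it back as a single value:

```python
import csv
import io

rows = [
    ["id", "text", "val"],
    [1, "aaa", 10],
    [2, "xxx\nyyy\nzzz", 20],
    [3, "ccc", 30],
]

# Write to an in-memory buffer; lineterminator="\n" keeps the output
# identical across platforms (the csv module's default is "\r\n").
buffer = io.StringIO()
csv.writer(buffer, lineterminator="\n").writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

# Read it back: the quoted field is returned as one value that still
# contains its newlines. Note that csv.reader returns every field as str.
parsed = list(csv.reader(io.StringIO(csv_text)))
print(parsed[2])
```

Because the field is quoted on write, no manual replacement of \n with another character is needed before saving.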

UPDATE: how to read the CSV file into a NumPy ndarray:

Use the DataFrame.values attribute:

    In [43]: pd.read_csv('c:/temp/1.csv').values
    Out[43]:
    array([[1, 'aaa', 10],
           [2, 'xxx\nyyy\nzzz', 20],
           [3, 'ccc', 30]], dtype=object)

Since version 0.24.0, Pandas also provides the DataFrame.to_numpy() method:

    In [44]: pd.read_csv('c:/temp/1.csv').to_numpy()
    Out[44]:
    array([[1, 'aaa', 10],
           [2, 'xxx\nyyy\nzzz', 20],
           [3, 'ccc', 30]], dtype=object)
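The question reads the file with np.genfromtxt using delimiter=';' and usecols. A rough equivalent with Pandas might look like the sketch below. The in-memory data and the column names 'text' and 'val' are assumptions made for illustration, since the asker's real file is not shown:

```python
import io

import pandas as pd

# Hypothetical ';'-separated data standing in for the asker's file.
csv_text = 'id;text;val\n1;aaa;10\n2;"xxx\nyyy\nzzz";20\n3;ccc;30\n'

# sep=';' matches the delimiter from the question; usecols selects
# columns by name (integer positions such as [1, 2] also work).
frame = pd.read_csv(io.StringIO(csv_text), sep=';', usecols=['text', 'val'])

# to_numpy() requires Pandas 0.24.0 or later; on older versions
# frame.values gives the same array.
train_dataset = frame.to_numpy()
print(train_dataset.shape)
print(repr(train_dataset[1, 0]))
```

The result is a plain NumPy array, and the multi-line field arrives as an ordinary str, so no encode/decode step is involved.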

You use a Pandas table to save and load, but I need to load the data from the csv file into a numpy array, not into a Pandas table.
I managed to do this by replacing \n with another character using the replace method, but the question remains unresolved: how do I get bytes (the bytes type) from a string (the str type) that contains bytes written out as text?
@Alexandr1234567890, that is exactly what I do at the end of the answer, in the last line of code.
The pd.read_csv(file) method returns a DataFrame, but a numpy array is needed.
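For the question left open in the comments (getting bytes back from a str that holds a bytes literal), one possible approach is ast.literal_eval from the standard library. This is a minimal sketch, assuming the stored value is exactly the repr of a bytes object, as in the output shown in the question; the variable names are illustrative:

```python
import ast

original = "строка \n string".encode('utf-8')

# Saving a bytes object as text stores its repr, which is the kind of
# value that ended up in the CSV cell: "b'\\xd1\\x81...'".
stored = repr(original)

# literal_eval parses Python literals only; it does not execute
# arbitrary code the way eval would.
recovered = ast.literal_eval(stored)
if not isinstance(recovered, bytes):
    raise ValueError("the cell did not contain a bytes literal")

text = recovered.decode('utf-8')
print(text)
```

If the cell content does not have the b'...' form, literal_eval either raises an exception or returns a different type, which is why the isinstance check is there.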
