In general, the problem is this: there is a data frame containing data of the form:

    Id  Sequence
    3   1,3,13...
    7   1,2,1,...
    8   1,2,4,...
    11  1,8,25...
    13  1,111,..

where Id is the sequence number and Sequence is the sequence itself. The task is, for example, to take the first sequence and arrange it in a column, and so on for all of them. The number of elements differs from sequence to sequence.

I do this:

    # import the required packages
    import sys
    import warnings
    import pandas as pd
    import numpy as np
    from sklearn.metrics import mean_absolute_error, mean_squared_error
    import statsmodels.formula.api as smf
    import statsmodels.tsa.api as smt
    import statsmodels.api as sm
    import scipy.stats as scs
    from scipy.optimize import minimize
    import matplotlib.pyplot as plt

    # read the data and show the first rows
    dftrain = pd.read_csv('../../data/IntegerSeqTrain.csv', sep=",", index_col=['Id'])
    dftrain.head(10)

It returns the following (the fact that Sequence appears above Id already confuses me):

        Sequence
    Id
    3   1,3,13...
    7   1,2,1,...
    8   1,2,4,...
    11  1,8,25...
    13  1,111,...

Next, the actual splitting itself:

    # for convenience, write the sequences into columns, splitting on commas first
    print(dftrain.shape[1])
    i = 0
    for dfitem in dftrain:
        j = 0
        for dfitem2 in dfitem:
            dftrain[j] = dftrain['Sequence'].str.split(',').str.get(j)
            j += 1
        i += 1

    # drop the now-redundant column
    dftrain = dftrain.drop('Sequence', 1)
    #pd.set_option('max_colwidth', 10)

    # show the result
    print(dftrain.head(10))

The output:

         0     1      2        3          4          5          6          7
    Id
    3    1     3     13       87       1053      28576    2141733  508147108
    7    1     2      1        5          5          1         11         16
    8    1     2      4        5          8         10         16         20
    11   1     8     25       83        274       2275     132224    1060067
    13   1   111  12211  1343211  147753211  162528...  178781...  196659...
    15   1     1      1        1          1          1          1          1
    16  840  1320   1680     2520       3192       3432       4920       5208
  1. Everything is written into rows, not into columns.
  2. The number of columns is greatly reduced (only 8-9, although there are actually many more).
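A likely cause of both symptoms, sketched on made-up miniature data: iterating over a DataFrame yields its column labels, not its rows, so the inner loop walks over the characters of the single column name 'Sequence' — exactly 8 of them, which matches the 8-9 columns observed above.

```python
import pandas as pd

# toy frame mimicking the question's data (these rows are made up)
df = pd.DataFrame({'Sequence': ['1,3,13', '1,2,1']}, index=[3, 7])

for dfitem in df:               # iterates over column labels, not rows
    print(dfitem)               # prints: Sequence
    for dfitem2 in dfitem:      # iterates over the 8 characters of 'Sequence'
        pass
    print(len(dfitem))          # prints: 8 -> why only 8 columns appeared
```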

How can the data be split nicely and presented as columns?) Thanks in advance. The data itself (train.csv): https://dropmefiles.com/osxrI

Source (IPYNB file): https://dropmefiles.com/cIR4f

    1 Answer

    Solution:

     train = pd.read_csv(r'C:\download\train.csv', sep=",", index_col=['Id'])
     r = train.Sequence.str.split(',', expand=True).T
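For reference, a minimal sketch of what split(',', expand=True).T does, on two made-up rows standing in for the real train.csv linked in the question:

```python
import pandas as pd

# two made-up sequences standing in for the real data
train = pd.DataFrame({'Sequence': ['1,3,13', '1,2,1,5']},
                     index=pd.Index([3, 7], name='Id'))

# expand=True puts every element of the split into its own column;
# .T then turns each sequence into a column of the result,
# padding shorter sequences with None
r = train.Sequence.str.split(',', expand=True).T
print(r)
```

Note that the resulting values are still strings, not integers.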

    Result:

     In [203]: r
     Out[203]:
     Id         3   7   8      11             13  15    16  ...  227681  227682            227683                227684  227686  227689  227690
     0          1   1   1       1              1   1   840  ...       7       1                 0                     0       0       2       5
     1          3   2   2       8            111   1  1320  ...       7       0                 0                    -1       1       3       7
     2         13   1   4      25          12211   1  1680  ...       3       1                 4                    -1       9       3     179
     3         87   5   5      83        1343211   1  2520  ...       2       0              1198                    -1      85       4     229
     4       1053   5   8     274      147753211   1  3192  ...       3       0           1829388                    -1     801       6     439
     5      28576   1  10    2275    16252853211   1  3432  ...       9       0       23796035743              10324303    7549       4     557
     6    2141733  11  16  132224  1787813853211   1  4920  ...       5       0  2142967506078650  -6586524273069171148   71145       5    6113
     ..       ...  ..  ..     ...            ...  ..   ...  ...     ...     ...               ...                   ...     ...     ...     ...
     341     None None None   None           None None  None ...    None    None              None                  None    None    None    None
     342     None None None   None           None None  None ...    None    None              None                  None    None    None    None
     343     None None None   None           None None  None ...    None    None              None                  None    None    None    None
     344     None None None   None           None None  None ...    None    None              None                  None    None    None    None
     345     None None None   None           None None  None ...    None    None              None                  None    None    None    None
     346     None None None   None           None None  None ...    None    None              None                  None    None    None    None
     347     None None None   None           None None  None ...    None    None              None                  None    None    None    None

     [348 rows x 113845 columns]

     In [204]: r.columns
     Out[204]:
     Int64Index([     3,      7,      8,     11,     13,     15,     16,     18,
                     20,     21,
                 ...
                 227677, 227679, 227680, 227681, 227682, 227683, 227684, 227686,
                 227689, 227690],
                dtype='int64', name='Id', length=113845)

     In [205]: r.shape
     Out[205]: (348, 113845)
    • It was stupid of me not to realize that a ready-made method had existed for a long time; I was trying to reinvent the wheel) Do you think data in this form can be fed into a method like this: def moving_average(series, n): return np.average(series[-n:]), called as moving_average(dataset.CURRENT_COLUMN, Interval)? - Sten Ford
    • @StenFord, no, you have the same problem as before: the numbers are too large and fall outside the range of np.int64, so all the columns are of object dtype (strings) and cannot be converted to integers. The only thing that comes to mind is to process this data the way text is processed: convert each unique number (string) into a corresponding ordinal number. - MaxU
    • Actually, I was thinking of trimming the data) Solving exactly this problem is not essential for me for now, I'm just learning. I'll try to trim the columns, for example keeping only values below 100,000 or even a billion. int64 handles those easily; another matter is that a sum of them can overflow int64, so smaller is better) - Sten Ford
    • @StenFord, it can be done. The maximum allowed integer is approximately 10**19. You can replace all strings longer than 19 characters with inf (infinity) and convert the data types of all columns to a numeric dtype. If you want, you can post a new SO question with a small example of input and output data, and I will try to answer it in the evening (when I have free time) - MaxU
    • Well, thank you) I, too, will only be home in the evening, around 7 Moscow time) - Sten Ford
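The moving_average from the first comment can run on a single column, but only after the string values produced by str.split are cast to a numeric dtype; a sketch with made-up numbers (CURRENT_COLUMN and Interval in the comment are placeholders):

```python
import numpy as np
import pandas as pd

def moving_average(series, n):
    # average of the last n observations
    return np.average(series[-n:])

# the values coming out of str.split are strings, so cast them first;
# this column is made up for illustration
col = pd.Series(['1', '2', '4', '5', '8']).astype(np.int64)
print(moving_average(col, 3))   # mean of the last three values: (4+5+8)/3
```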
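One way to sketch the text-style encoding MaxU describes is pandas.factorize, which maps each unique string to an ordinal code (the sample values here are made up):

```python
import pandas as pd

# made-up sample of sequence values, still strings after str.split
values = pd.Series(['1', '3', '13', '1', '2141733', '3'])

# factorize builds a vocabulary of the unique strings and replaces
# each value with its ordinal code, the way tokens are encoded in text
codes, uniques = pd.factorize(values)
print(codes)     # ordinal code per value
print(uniques)   # vocabulary: code -> original string
```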
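And a sketch of the overflow workaround from the last comment, on made-up values: int64 tops out at 9223372036854775807 (about 9.2 * 10**18), so any value whose string form is longer than 19 characters cannot fit. One caveat to the comment: replacing those values with inf means the result has to be float64 rather than int64, since inf is not representable as an integer.

```python
import numpy as np
import pandas as pd

# made-up values: the last one is 25 digits and cannot fit into int64
s = pd.Series(['1', '2142967506078650', '2379603574300000000000000'])

# keep strings of at most 19 characters, replace the rest with infinity
s = s.where(s.str.len() <= 19, np.inf)

# inf forces a float dtype; the surviving values convert cleanly
out = pd.to_numeric(s)
print(out)
```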