    import pandas as pd
    import numpy as np

    df = pd.DataFrame(np.random.randint(low=0, high=10, size=(1000000, 2)), columns=['a', 'b'])
    df[['a', 'b']] = df[['a', 'b']].astype(str)

    # stop one row early, because the body reads df['a'][i+1]
    for i in range(2, df.shape[0] - 1):
        df.loc[i, 'b'] = df['a'][i-2] + ' ' + df['a'][i-1] + ' ' + df['a'][i] + ' ' + df['a'][i+1]

This loop takes a very long time. I have a data frame with 1 million rows. I need to fill the string column "b" with elements from the string column "a", element by element. A run like this takes 30-60 minutes. How should I write such a loop in Python?
I often run into this problem in Jupyter. In my case, even a simple loop

    # the last row has no i+1 neighbour, so stop one row early
    for i in range(0, 1000000 - 1):
        df.loc[i, 'b'] = df['a'][i+1]

ran for a long time.
The problem I am solving:
I have 30 words. Sentences built from these words are stored in the df['a'] column, one word per cell. I want to find the distribution of sequences of these words (4 words per phrase), e.g. whether the sequence "mom" + "loves" + "strongly" + "waving" occurs much less often than "mom" + "strongly" + "loves" + "waving", and so on.
That is what the df['b'] column is created for.
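To make the goal concrete, here is a minimal sketch of counting every run of four consecutive words in plain Python, without a DataFrame. The word list is made-up sample data, and `words`, `phrases` and `counts` are illustrative names, not part of the original code.

```python
from collections import Counter

# Hypothetical sample data standing in for the df['a'] column
words = ["mom", "strongly", "loves", "waving",
         "mom", "strongly", "loves", "waving", "mom"]

# Zipping four staggered views of the list yields every run of
# four consecutive words as a tuple
phrases = zip(words, words[1:], words[2:], words[3:])

# Count how often each 4-word phrase occurs
counts = Counter(" ".join(p) for p in phrases)

for phrase, n in counts.most_common():
    print(n, phrase)
```

A list of N words yields N - 3 phrases, so the counts always sum to `len(words) - 3`.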
Answer
In Jupyter, loops over a DataFrame really are slow: [link](https://engineering.upside.com/a-beginners-guide-to-optimizing-pandas-code-for-speed-c09ef2c6a4d6). The article explains clearly how to work with them, but I was too lazy to dig into it. In the end, I did it like this:
    # shift is a method: shift(2) gives a[i-2], shift(-1) gives a[i+1]
    df['b'] = df.a.shift(2) + ' ' + df.a.shift(1) + ' ' + df.a + ' ' + df.a.shift(-1)

Thanks to all!
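For reference, a small self-contained sketch of the shift-based approach, followed by one possible way to get the phrase distribution with `value_counts`. The word list is made-up sample data, and `value_counts` is my assumption about the next step; the original post does not show it.

```python
import pandas as pd

# Hypothetical sample data: one word per cell, as in the question
df = pd.DataFrame({"a": ["mom", "strongly", "loves", "waving",
                         "mom", "strongly", "loves", "waving", "mom"]})

# shift(k) moves values down by k rows, so row i of shift(2) holds a[i-2]
# and row i of shift(-1) holds a[i+1]. Rows that lack one of the four
# neighbours (the first two and the last one) end up as NaN.
df["b"] = (
    df["a"].shift(2) + " " + df["a"].shift(1) + " "
    + df["a"] + " " + df["a"].shift(-1)
)

# Frequency of each 4-word phrase; value_counts ignores NaN by default
phrase_counts = df["b"].value_counts()
print(phrase_counts)
```

The whole column is computed with a handful of vectorized operations instead of one Python-level assignment per row, which is why it avoids the long run time of the loop.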
ngram_range parameter - MaxU