A follow-up to the question "read_table truncates sentences" ...

There is a text file of the form:

 date : 2018-03-08T12:56:30+00:00
 content : [A very long string,
 which might contain multiple line breaks]
 href : https://www.google.com/
 date : 2018-03-08T12:56:30+00:00
 content : [Another long string ...]
 href : https://www.google.com/
 ...

How can I parse it into a DataFrame with three columns: date, content, url?

    2 answers

    Code:

     import re
     from pathlib import Path
     import pandas as pd

     fn = r'C:\download\textout.txt'
     text = Path(fn).read_text(encoding='utf-8')
     pat = r'date\s*:\s*([^\r\n]*)[\r\n]*content\s*:\s\[([^\]]*)\][\r\n]*href\s*:\s*([^\r\n]*)'
     df = pd.DataFrame(re.findall(pat, text, flags=re.S|re.M), columns=['date','content','url'])
     df['date'] = pd.to_datetime(df['date'])

    An example of the resulting DataFrame:

     In [26]: df
     Out[26]:
                         date                                            content  \
     0    2018-03-08 12:56:30  The price of cryptocurrencies across all marke...
     1    2018-03-08 12:32:04  GMO Internet has released a monthly report on ...
     2    2018-03-08 11:35:00  Despite all its previous efforts to prevent th...
     3    2018-03-08 09:05:38  On Wednesday, March 7, the US regulator the ...
     4    2018-03-08 07:50:00  This week the popular wallet provider Bread (B...
     5    2018-03-08 04:00:43  This month a law firm called Polsinelli LLP pu...
     6    2018-03-08 01:30:27  Binance has found itself at the center of an u...
     ...                  ...                                                ...
     2043 2017-04-04 06:00:41  Kim Dotcom has recently tweeted a preview of t...
     2044 2017-04-04 03:00:13  On August 2, 2016, the leading bitcoin exchang...
     2045 2017-04-03 19:00:18  This past weekend on April 1-2 in Berlin, Germ...
     2046 2017-04-01 19:00:04  People often cast nasty judgment on Bitcoin. T...
     2047 2017-04-01 14:00:53  This past week on March 24 the San Francisco-b...
     2048 2017-03-31 12:38:51  This week the price of bitcoin has remained fa...
     2049 2017-03-31 06:00:31  What is the legality of Bitcoin in India? This...

                                                         url
     0     https://news.bitcoin.com/markets-update-crypto...
     1     https://news.bitcoin.com/japanese-conglomerate...
     2     https://news.bitcoin.com/chinese-internet-regu...
     3     https://news.bitcoin.com/sec-publishes-warning...
     4     https://news.bitcoin.com/wallet-provider-bread...
     5     https://news.bitcoin.com/research-paper-says-i...
     6     https://news.bitcoin.com/bots-blamed-for-binan...
     ...                                                 ...
     2043  https://news.bitcoin.com/kim-dotcom-bitcache-m...
     2044  https://news.bitcoin.com/bitfinex-bfx-tokens-r...
     2045  https://news.bitcoin.com/new-alliances-at-bitc...
     2046  https://news.bitcoin.com/bitcoin-used-buy-sex-...
     2047  https://news.bitcoin.com/fifty-developers-hack...
     2048  https://news.bitcoin.com/markets-update-bitcoi...
     2049  https://news.bitcoin.com/bitcoin-legality-in-i...

     [2050 rows x 3 columns]

     In [27]: df.dtypes
     Out[27]:
     date       datetime64[ns]
     content            object
     url                object
     dtype: object
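    One thing worth checking before running this on a real file: the group `[^\]]*` stops at the first `]`, so a record whose content itself contains a closing bracket will not match and is skipped without any error. A minimal sketch of that behaviour, using two made-up one-record strings (`ok` and `tricky` are hypothetical test inputs, not taken from the original file):

```python
import re

# Same pattern as in the answer above.
pat = r'date\s*:\s*([^\r\n]*)[\r\n]*content\s*:\s\[([^\]]*)\][\r\n]*href\s*:\s*([^\r\n]*)'

# Hypothetical record whose content has no inner brackets.
ok = (
    "date : 2018-03-08T12:56:30+00:00\n"
    "content : [Plain text,\nspread over two lines]\n"
    "href : https://www.google.com/\n"
)

# Hypothetical record whose content contains a ']' before the closing one.
tricky = (
    "date : 2018-03-08T12:56:30+00:00\n"
    "content : [Text with [1] a footnote]\n"
    "href : https://www.google.com/\n"
)

ok_rows = re.findall(pat, ok)          # one (date, content, url) tuple
tricky_rows = re.findall(pat, tricky)  # empty: the record is skipped

print(len(ok_rows), len(tricky_rows))
```

    If the number of rows in the DataFrame is lower than the number of `date :` lines in the file, this is a likely cause.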

      You can simplify the regular expression:

       >>> re.findall(r'(?ms)' + r'\s*'.join([  # allow whitespace between tokens
       ...     r'^date', ':', r'(\S+)',       # no whitespace in the timestamp
       ...     r'^content', ':', r'(.*?)',    # multiline non-greedy text until ^href
       ...     r'^href', ':', r'(\S+)']),     # no whitespace in the url
       ...     text)
       [('2018-03-08T12:56:30+00:00', '[A very long string,\nwhich might contain multiple line breaks]', 'https://www.google.com/'),
        ('2018-03-08T12:56:30+00:00', '[Another long string ...]', 'https://www.google.com/')]
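      Note that this variant keeps the surrounding square brackets in the content. A possible way to get from here to the three-column DataFrame is sketched below, assuming the file looks like the sample in the question. The `sample` string and the `record` name are illustrative, and the named groups are a substitution for the positional groups used above:

```python
import re
import pandas as pd

# Hypothetical in-memory stand-in for the text file from the question.
sample = (
    "date : 2018-03-08T12:56:30+00:00\n"
    "content : [A very long string,\n"
    "which might contain multiple line breaks]\n"
    "href : https://www.google.com/\n"
    "date : 2018-03-08T12:56:30+00:00\n"
    "content : [Another long string ...]\n"
    "href : https://www.google.com/\n"
)

# Same idea as the simplified pattern, with named groups so that
# groupdict() yields the column names directly.
record = re.compile(
    r'(?ms)'
    r'^date\s*:\s*(?P<date>\S+)\s*'
    r'^content\s*:\s*(?P<content>.*?)\s*'
    r'^href\s*:\s*(?P<url>\S+)'
)

df = pd.DataFrame([m.groupdict() for m in record.finditer(sample)])

# Remove the square brackets at both ends of the content.
# Caveat: this also strips any extra '[' or ']' at the very start or end.
df['content'] = df['content'].str.strip('[]')
df['date'] = pd.to_datetime(df['date'])

print(df)
```

      Because the content group is non-greedy and ends at the next line starting with `href`, inner `]` characters do not break the match here.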