How to parse a file that is not divided into lines

Question

The file has the following structure:

{date}\t{title}\t{message}\n

The problem is that {message} sometimes contains both tabs (\t) and line breaks (\n).

Example:

    30.01.2017	Первое сообщение	Текст первого сообщения, не содержащий табуляции или переноса строки
    30.01.2017	Второе сообщение	Текст второго сообщения с	табуляцией
    30.01.2017	Третье сообщение	Текст третьего сообщения
    с переносом строки
    и	табуляцией

How can I correctly split the file into records?

First, the question should say which programming language the code is to be written in.
In fact, this can be done in any number of languages (for example, it is quite doable in PHP, Java, and Python).
The basic idea: look at the beginning of each line for a token that matches the date format.
If it is not found, append the line to the current record; if it is found, close the current record and start parsing a new one.
If the message text can itself contain a date, and a line break happens to put that date at the beginning of a line, then nothing in the given description lets you tell whether the date starts a new record or is just part of the message.
An additional check for two tabs on the same line, the first immediately after the date, can help; so can checking whether the candidate date contradicts the logic of the surrounding data.
Ideally, you should discard this log and modify the logger so that it serializes the message correctly.
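The line-by-line approach described above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the date is always dd.mm.yyyy, the title never contains a tab, and the names `HEADER_RE` and `parse_records` are made up for this example. A message line that itself starts with a date, a tab, some text, and another tab will still be mistaken for a new record.

```python
import re

# Assumed header shape: dd.mm.yyyy, a tab, a tab-free title, another tab.
# re.S lets the last group keep the trailing newline of the line.
HEADER_RE = re.compile(r'(\d\d\.\d\d\.\d{4})\t([^\t\n]+)\t(.*)', re.S)


def parse_records(lines):
    """Group physical lines into (date, title, message) records."""
    records = []
    current = None  # [date, title, message] being accumulated
    for line in lines:
        match = HEADER_RE.match(line)
        if match:
            # The line starts a new record: close the one collected so far.
            if current is not None:
                records.append(tuple(current))
            current = list(match.groups())
        elif current is not None:
            # No header found: the line continues the current message.
            current[2] += line
        # Lines before the first header are ignored in this sketch.
    if current is not None:
        records.append(tuple(current))
    return records


# Hypothetical sample data in the format from the question.
sample = (
    "30.01.2017\tFirst\tplain text\n"
    "30.01.2017\tSecond\ttext with\ta tab\n"
    "30.01.2017\tThird\ttext with\n"
    "a line break\n"
    "and\ta tab\n"
)
records = parse_records(sample.splitlines(keepends=True))
for record in records:
    print(record)
```

Requiring both tabs on the header line (date, tab, title, tab) is the extra check mentioned above; it reduces false positives but cannot eliminate them.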

jfs · Accepted Answer · 2017-01-30T14:56:33

Split the file into parts according to the ^{date}\t{title}\t pattern:

    #!/usr/bin/env python3
    import re
    from pathlib import Path

    text = Path('messages.txt').read_text()
    date_re = r'\d\d\.\d\d\.\d{4}'
    title_re = r'[^\t]+'  # no tab in the title
    parts = re.split(f'^({date_re})\t({title_re})\t', text, flags=re.M)[1:]
    print(*parts, sep=' | ')

Output (abbreviated) for the example in the question:

 30.01.2017 | Первое сообщение | ... | 30.01.2017 | Второе сообщение | ... | 30.01.2017 | Третье сообщение | ...

Or, to be stricter, you can check that all dates are valid and put each message into its own object with named attributes:

    from collections import namedtuple
    from datetime import datetime

    Message = namedtuple('Message', 'date title message')
    triples = (parts[i:i+3] for i in range(0, len(parts), 3))
    messages = [Message(datetime.strptime(datestr, '%d.%m.%Y').date(), title, text)
                for datestr, title, text in triples]
    print(*messages, sep='\n')

Result

    Message(date=datetime.date(2017, 1, 30), title='Первое сообщение', message='Текст первого сообщения, не содержащий табуляции или переноса строки\n')
    Message(date=datetime.date(2017, 1, 30), title='Второе сообщение', message='Текст второго сообщения с\tтабуляцией\n')
    Message(date=datetime.date(2017, 1, 30), title='Третье сообщение', message='Текст третьего сообщения\nс переносом строки\nи\tтабуляцией\n')

Answer 2 · 2017-01-30T11:12:44

My example assumes that the title contains no tabs and that the date format is fixed.

If the sequence \n{date}\t{text}\t happens to appear inside a message, everything will break.

    // Prepend "\n" so that the first record also matches the pattern.
    $data = "\n" . file_get_contents('messages.txt');

    // Split on "\n{date}\t{title}\t" and keep the captured date and title.
    $lines = preg_split(
        "/\n(\d{2}\.\d{2}\.\d{4})\t([^\t]+)\t/isu",
        $data,
        -1,
        PREG_SPLIT_DELIM_CAPTURE
    );

    // $lines[0] is the text before the first record; triples follow it.
    $messagesArray = [];
    for ($i = 1; $i < count($lines); $i += 3) {
        $messagesArray[] = [
            'date' => $lines[$i],
            'title' => $lines[$i+1],
            'message' => $lines[$i+2],
        ];
    }
    print_r($messagesArray);

    /**
    Array
    (
        [0] => Array
            (
                [date] => 30.01.2017
                [title] => Первое сообщение
                [message] => Текст первого сообщения, не содержащий табуляции или переноса строки
            )

        [1] => Array
            (
                [date] => 30.01.2017
                [title] => Второе сообщение
                [message] => Текст второго сообщения с	табуляцией
            )

        [2] => Array
            (
                [date] => 30.01.2017
                [title] => Третье сообщение
                [message] => Текст третьего сообщения
    с переносом строки
    и	табуляцией
            )

    )
    */

In general, I would recommend writing each log record in a serialized form, for example as JSON. That makes the log much easier to parse.
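As a rough illustration of that recommendation, here is a minimal Python sketch that writes one JSON object per line and reads the log back. The helper names `dump_record` and `load_records` and the field layout are assumptions made for this example, not part of the original logger.

```python
import json


def dump_record(date, title, message):
    """Serialize one record as a single line of JSON."""
    # json.dumps escapes tabs and newlines inside strings as \t and \n,
    # so the serialized record never contains a raw line break.
    return json.dumps(
        {"date": date, "title": title, "message": message},
        ensure_ascii=False,
    )


def load_records(text):
    """Parse a log in which every non-empty line is one JSON object."""
    return [json.loads(line) for line in text.split("\n") if line]


# Hypothetical records; the second message imitates a record header.
log_text = "\n".join([
    dump_record("30.01.2017", "First", "plain text"),
    dump_record("30.01.2017", "Tricky", "line one\n11.01.2016\ttest\ttest"),
]) + "\n"

loaded = load_records(log_text)
print(len(loaded))
print(repr(loaded[1]["message"]))
```

Because the line break inside the second message is stored in escaped form, the physical newline unambiguously marks the end of a record, even when the message looks like a record header.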

For example, I created the following entry:

    30.01.2017	Второе сообщение	Текст второго сообщения с	табуляцией
    11.01.2016	test	test	test

As expected, I got this result:

    [1] => Array
        (
            [date] => 30.01.2017
            [title] => Второе сообщение
            [message] => Текст второго сообщения с	табуляцией
        )

    [2] => Array
        (
            [date] => 11.01.2016
            [title] => test
            [message] => test	test
        )

This problem cannot be solved without changing the log format.

So far, I am the only one who has proposed an option that actually works on the example given by the author.
