Splitting lines in a format with a fixed field width and optional values

Question

There is a large array of data.

An example of a string from the array:

20046 2005 27.0 44.3 9.0 15.9 3.6 9.2 9.2 37.5 18.3 18.6 24.4 26.0

Here the first two values are the weather station number and the year; the rest are air temperatures, starting from January. The values are separated by spaces, and the number of spaces varies from 1 to 3. Temperatures that the weather station did not record are replaced by spaces, so the following string can also occur in the array:

 20667 2014 5.5 2.4 7.9 8.1 42.7 10.1

A regular expression is needed that would split this string into an array of the form:

 ['20667','2014','5.5','2.4','7.9','8.1','','','42.7','','','10.1','','']

Under these conditions (no fixed field length), an unambiguous interpretation of the data is impossible.
For example, there is data for January, for December, and one more value roughly in the middle... how do you decide whether it belongs to June or July?
And why can't we just replace all the spaces with commas using a regular expression and then parse the result into an array?
@Yuri because the format is fixed-width ("monospace"), so the missing elements are also spaces; but you can use {} quantifiers
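The ambiguity raised in these comments is easy to reproduce. The sketch below builds a sparse row under an assumed layout (6 columns per month, values right-aligned, which is not necessarily the exact spacing of the original file) and shows what plain whitespace splitting returns:

```python
# Assumed layout: "20667 2014" followed by 12 fields of 6 columns each.
values = ["5.5", "2.4", "7.9", "8.1", "", "", "42.7", "", "", "10.1", "", ""]
row = "20667 2014" + "".join(f"{v:>6}" for v in values)

tokens = row.split()
print(tokens)
# ['20667', '2014', '5.5', '2.4', '7.9', '8.1', '42.7', '10.1']
print(len(tokens))  # 8 tokens instead of 14: the blank fields are gone
```

Once the blank fields have been dropped, nothing in the token list says which months `42.7` and `10.1` belong to, so the column positions have to be used instead.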

Accepted Answer · 2017-04-18T18:28:47

According to the description of your input, it seems that this is a fixed-width file.

In this case it is very convenient to use the pandas module:

    import pandas as pd

    cols = ['id', 'year'] + ['m{}'.format(i) for i in range(1, 13)]
    df = pd.read_fwf(r'D:\temp\.data\655212.txt', header=None, names=cols)
    print(df)

Result:

    In [136]: df
    Out[136]:
          id  year    m1    m2   m3    m4   m5   m6    m7    m8    m9   m10   m11   m12
    0  20046  2005  27.0  44.3  9.0  15.9  3.6  9.2   9.2  37.5  18.3  18.6  24.4  26.0
    1  20047  2005  26.5   NaN  7.5  17.3  NaN  NaN  10.2  39.9  19.7   NaN  20.4  20.0

You can also use the @jfs idea to name the columns by month names:

    import calendar

    cols = ['id', 'year'] + list(calendar.month_abbr)[1:]
    df = pd.read_fwf(r'D:\temp\.data\655212.txt', header=None, names=cols)

Result:

    In [139]: df
    Out[139]:
          id  year   Jan   Feb  Mar   Apr  May  Jun   Jul   Aug   Sep   Oct   Nov   Dec
    0  20046  2005  27.0  44.3  9.0  15.9  3.6  9.2   9.2  37.5  18.3  18.6  24.4  26.0
    1  20047  2005  26.5   NaN  7.5  17.3  NaN  NaN  10.2  39.9  19.7   NaN  20.4  20.0

Original file:

    20046 2005  27.0  44.3   9.0  15.9   3.6   9.2   9.2  37.5  18.3  18.6  24.4  26.0
    20047 2005  26.5         7.5  17.3              10.2  39.9  19.7        20.4  20.0

jfs · Answer 2 · 2017-04-19T11:52:46

Assuming that exactly 6 positions are allocated for each monthly temperature (fixed field widths), you can parse data from standard input or from files given on the command line using the standard fileinput module:

    #!/usr/bin/env python
    import fileinput

    width = 6
    for line in fileinput.input():
        station_id, year, s = line.split(None, 2)
        s = s.rstrip('\n').rjust(12 * width)  # pad with leading space
        temps = [s[i:i+width].strip() for i in range(0, len(s), width)]
        print(temps)

Example

    $ python parse-fixed-width-temps.py input.txt
    ['27.0', '44.3', '9.0', '15.9', '3.6', '9.2', '9.2', '37.5', '18.3', '18.6', '24.4', '26.0']
    ['5.5', '2.4', '7.9', '8.1', '', '', '42.7', '', '', '10.1', '', '']
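The same slicing idea can also be written against absolute column positions, which may be more forgiving when an editor has trimmed trailing blanks from a row. This is only a sketch: the layout constants (5-character id, one space, 4-character year, twelve 6-character fields) and the name `parse_row` are assumptions for illustration, not something stated in the question.

```python
from typing import List, Optional

# Assumed layout, not confirmed by the original data file.
ID_WIDTH, YEAR_WIDTH, TEMP_WIDTH, MONTHS = 5, 4, 6, 12

def parse_row(line: str) -> List[Optional[str]]:
    """Slice one record by column position; blank fields become None."""
    line = line.rstrip("\n")
    head = ID_WIDTH + 1 + YEAR_WIDTH  # "20667 2014" occupies 10 columns
    # Pad on the right so rows with trimmed trailing blanks still slice cleanly.
    body = line[head:].ljust(MONTHS * TEMP_WIDTH)
    temps = [body[i:i + TEMP_WIDTH].strip() or None
             for i in range(0, MONTHS * TEMP_WIDTH, TEMP_WIDTH)]
    return [line[:ID_WIDTH], line[ID_WIDTH + 1:head]] + temps

# Sample row with the trailing blank fields (Nov, Dec) trimmed away.
sample = ("20667 2014" + "   5.5   2.4   7.9   8.1" + " " * 12
          + "  42.7" + " " * 12 + "  10.1")
print(parse_row(sample))
# ['20667', '2014', '5.5', '2.4', '7.9', '8.1', None, None, '42.7', None, None, '10.1', None, None]
```

Returning `None` rather than an empty string is a design choice here; it makes a later conversion to `float` easier to guard.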

Answer 3 · 2017-04-18T14:17:07

If we assume that each line has a fixed length (82 characters), and 5 characters (XX.XX) are allocated for each month in a line, we find that the separator should be two space characters.

So you can replace each run of extra spaces with a missing-value marker plus space delimiters, and then try to split the line as follows:

    data = '20667 2014 5.5 2.4 7.9 8.1 42.7 10.1 '
    print([val.strip() for val in data.replace(' ', ' n/d ').split(' ')])
    # ['20667 2014', '5.5', '2.4', '7.9', '8.1', 'n/d', 'n/d', '42.7', 'n/d', 'n/d', '10.1', 'n/d', 'n/d']
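A self-contained variant of this replace-then-split idea is sketched below. It assumes a slightly different layout from the one described in this answer: each month occupies exactly 6 columns with the value right-aligned, so a run of 6 spaces is one empty field. The names `WIDTH`, `MISSING` and `split_with_markers` are made up for the example.

```python
WIDTH = 6        # assumed: 6 columns per month, value right-aligned
MISSING = "n/d"  # marker substituted for an empty field

def split_with_markers(line: str) -> list:
    # Pad to the full record length in case trailing blanks were trimmed
    # (10 columns for "id year" plus 12 month fields).
    line = line.rstrip("\n").ljust(10 + 12 * WIDTH)
    # str.replace scans left to right, so in a run of spaces the blank field
    # is consumed first and the next value's own padding is left over.
    marked = line.replace(" " * WIDTH, MISSING.rjust(WIDTH))
    return marked.split()

data = ("20667 2014" + "   5.5   2.4   7.9   8.1" + " " * 12
        + "  42.7" + " " * 12 + "  10.1" + " " * 12)
print(split_with_markers(data))
# ['20667', '2014', '5.5', '2.4', '7.9', '8.1', 'n/d', 'n/d', '42.7', 'n/d', 'n/d', '10.1', 'n/d', 'n/d']
```

The trick relies on right alignment: with left-aligned values the leftover padding would sit before the blank field, and the markers could land in the wrong column.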

Answer 4 · 2017-04-18T16:16:50

Here is a regular expression for your first line, to show how else you can work with regular expressions beyond the standard \d \w \s + * ?. Very clear and intuitive:

 (\d{5})[ ]{1,3}(\d{4})[ ]{1,3}([0-9.]{4})[ ]{1,3}(([0-9.]{4}))[ ]{1,3}([0-9.]{3})[ ]{1,3}([0-9.]{4})[ ]{1,3}([0-9.]{3})[ ]{1,3}([0-9.]{3})[ ]{1,3}([0-9.]{3})[ ]{1,3}([0-9.]{4})[ ]{1,3}([0-9.]{4})[ ]{1,3}([0-9.]{4})[ ]{1,3}([0-9.]{4})[ ]{1,3}([0-9.]{4})

A version that finds all the strings:

 (([0-9.]{1,5})([ ]{1,3})?)+?

Clarity, simplicity and intuitiveness usually don't go together with regular expressions :)
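If the fields really are fixed-width, a regular expression can also capture by position rather than by content, which keeps the blank months in place. A minimal sketch, assuming a 5-digit station id, one space, a 4-digit year and twelve 6-character fields (the layout and the names `RECORD` and `regex_split` are assumptions, not taken from the question):

```python
import re

# Assumed layout: id (5 digits), space, year (4 digits), 12 fields of 6 chars.
RECORD = re.compile(r"(\d{5}) (\d{4})" + r"(.{6})" * 12)

def regex_split(line: str) -> list:
    # Pad to 82 columns so rows with trimmed trailing blanks still match.
    match = RECORD.match(line.rstrip("\n").ljust(82))
    if match is None:
        raise ValueError("line does not match the assumed layout")
    return [group.strip() for group in match.groups()]

row = ("20667 2014" + "   5.5   2.4   7.9   8.1" + " " * 12
       + "  42.7" + " " * 12 + "  10.1")
print(regex_split(row))
# ['20667', '2014', '5.5', '2.4', '7.9', '8.1', '', '', '42.7', '', '', '10.1', '', '']
```

Under these assumptions the result has the 14-element shape requested in the question, with empty strings for the unrecorded months.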
