I have the following text file:

    Input data:
    N tn t1 t2 tk u1 u2
    21 5.000 20.000 50.000 65.000 100.000 95.000
    i Time Uvx Uvix
    1 5.000 0.000 0.000
    2 8.000 20.000 50.000
    3 11.000 40.000 50.000
    4 14.000 60.000 50.000
    5 17.000 80.000 50.000
    6 20.000 100.000 50.000
    7 23.000 99.500 50.000
    8 26.000 99.000 50.000
    9 29.000 98.500 50.000
    Длительность переднего фронта:
    Uvx Uvix
    9.000 0.000

(The last text header in the file, "Длительность переднего фронта", means "Leading-edge duration".)

The task is to read all numerical values from this file, skipping the lines that contain variable names and other text, as well as empty lines. How can I skip those lines and start reading data from an arbitrary line? The language is C.

  • What prevents you from simply skipping the first n lines? - VladD
  • @VladD, this may be a stupid question, but how do I skip those n lines? - enkelad
  • @Ivan Kushchev, if the size of each line in bytes is not known in advance, then only by reading sequentially. - avp
  • @avp, so just call fscanf() several times in a row? It seems to always start reading from the first line. And if, say, the line sizes are known, how is the problem solved then? - enkelad
  • At which step did the difficulty arise? Is it clear how to read a line from the input, assuming the maximum line length is fixed (e.g., fgets())? Is it clear how to skip non-numbers in a given string (e.g., strcspn())? Is it clear how to read several numbers separated by spaces (e.g., sscanf())? - jfs
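The steps jfs lists in the last comment can be combined roughly as follows. This is only a sketch under stated assumptions: `read_numeric_lines` and `MAX_LINE` are hypothetical names, every line is assumed to fit in `MAX_LINE` bytes, and `strtod()` is used here (instead of `strcspn()`/`sscanf()`) to walk through the numbers on a line.

```c
#include <stdio.h>
#include <stdlib.h>

#define MAX_LINE 256  /* assumed upper bound on the line length */

/* Collect every number found on lines that start with a digit
   (leading spaces and tabs are ignored).
   Returns how many values were stored in out (at most max). */
size_t read_numeric_lines(FILE *f, double *out, size_t max)
{
    char line[MAX_LINE];
    size_t count = 0;

    while (count < max && fgets(line, sizeof line, f) != NULL) {
        const char *p = line;
        while (*p == ' ' || *p == '\t')
            ++p;
        if (*p < '0' || *p > '9')
            continue;                 /* header, other text, or empty line */

        while (count < max) {
            char *end;
            double v = strtod(p, &end);
            if (end == p)
                break;                /* no more numbers on this line */
            out[count++] = v;
            p = end;
        }
    }
    return count;
}
```

The caller would pass an open `FILE *` and an array large enough for the expected number of values. Splitting the values back into the three blocks is left out of this sketch.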

2 answers

To avoid loading lines into memory that will be ignored anyway, and to support arbitrarily long input lines, you can read the file byte by byte using fgetc() .

To skip lines until you reach one that starts with a digit (a line of numeric data), ignoring any leading spaces, you can use an automaton with two states:

  1. skip characters until a newline ( \n ) is read;
  2. enter the digit-expecting state ( expect_digit=1 ): spaces and tabs are ignored; if a digit is read, push it back onto the stream and return from the function; otherwise (not a digit) go back to state 1. The function starts in this state, because the stream is assumed to be at the beginning of a line.

    #include <stdio.h>

    // read until a line that starts with a digit (ignoring leading hor. space)
    int skip_until_line_startswith_digit(FILE* file)
    {
      int expect_digit = 1; // expect a digit as the next character
      for (int c; (c = fgetc(file)) != EOF; ) {
        switch (c) {
        case '\n': // newline
          expect_digit = 1;
          break;
        case ' ':  // horizontal whitespace
        case '\t': // ignore
          break;
        case '0':  // digit
        case '1': case '2': case '3': case '4': case '5':
        case '6': case '7': case '8': case '9':
          if (expect_digit) {
            if (ungetc(c, file) == EOF)
              return -2; // error
            return 0; // found digit
          }
          // unexpected digit: ignore
          break;
        default: // everything else
          expect_digit = 0;
        }
      }
      return feof(file) ? -1 : -2; // not found or an error
    }

isdigit() may depend on the locale on Windows, so it is not used here.

Usage example

    // requires <errno.h>, <stdio.h>, <stdlib.h>
    FILE* file = stdin;
    if (skip_until_line_startswith_digit(file) < 0)
      exit(EXIT_FAILURE);

    // read 1st numeric block
    size_t N;
    double tn, t1, t2, tk, u1, u2;
    errno = 0;
    if (fscanf(file, "%zu %lf %lf %lf %lf %lf %lf",
               &N, &tn, &t1, &t2, &tk, &u1, &u2) != 7) {
      if (errno)
        perror("N tn t1 t2 tk u1 u2");
      exit(EXIT_FAILURE);
    }
    printf("%zu %f %f %f %f %f %f\n", N, tn, t1, t2, tk, u1, u2);

avp asked in the comments: could you say more about how isdigit depends on the locale (I think everyone would find it interesting)? In which cases does it fail to recognize digits, and why only on Windows?

The standard says (see, for example, the footnote in 7.11.1.1 of n1570 ) that isdigit() does not depend on the locale. On Windows, however, isdigit() may depend on the locale: for example, '\xb2' (cp1252 encoding) is recognized as a decimal digit in the Danish locale, which is incorrect.

Incidentally, the Unicode standard agrees with the C standard: superscripts such as U+00B2 are explicitly excluded from the decimal digits. See 4.6 Numerical Value (Unicode 6.2) .
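For comparison, a locale-independent digit test can also be written as a range check instead of a switch. This is a minimal sketch with a hypothetical helper name, `is_ascii_digit`. It relies on the C guarantee that the characters '0' through '9' have consecutive values.

```c
/* Locale-independent check for the decimal digits '0'..'9'.
   The argument is an int so that values returned by fgetc(),
   including EOF, can be passed directly. */
int is_ascii_digit(int c)
{
    return c >= '0' && c <= '9';
}
```

With this check, a byte such as 0xb2 is never treated as a digit, whatever locale is active.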

  • @jfs, good idea. It would be even better if skip returned the number of \n characters read; that would allow proper error diagnostics (a line number). Also, perror after fscanf will almost always print "Success". It may be worth allowing for negative numbers as well (by the way, scanf reads numbers that start with + just fine). And about the dependence of isdigit on locale: could you elaborate (I think everyone would find it interesting)? In which cases does it fail to recognize digits (and why only on Windows)? - avp
  • @avp: added a note about isdigit() . - jfs
  • Thanks, interesting. And then I started fantasizing about the characters 〇, 一, 二, 三 ... 九 (joke) - avp
  • @avp: even if you add support for Unicode digits (you would have to read more bytes), those characters are still not Unicode decimal digits :) - jfs
  • Of course they are not, and neither are the ones you cited from cp1252 (I had never suspected such oddities before). - avp
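avp's two suggestions (return the number of newlines read, and accept a leading sign) could be sketched as below. This is not jfs's function but a hypothetical variant, `skip_to_numeric_line`. It assumes that any line starting with '+' or '-' is numeric, which may be wrong for some files.

```c
#include <stdio.h>

/* Skip to the first line that starts with a digit, '+' or '-'
   (leading spaces and tabs are ignored). Returns the number of '\n'
   characters consumed, or -1 if no such line was found or a read
   error occurred. */
long skip_to_numeric_line(FILE *file)
{
    long newlines = 0;
    int at_line_start = 1;  /* nothing but blanks seen on this line so far */
    int c;

    while ((c = fgetc(file)) != EOF) {
        if (c == '\n') {
            ++newlines;
            at_line_start = 1;
        } else if (c == ' ' || c == '\t') {
            /* horizontal whitespace does not change the state */
        } else if (at_line_start &&
                   ((c >= '0' && c <= '9') || c == '+' || c == '-')) {
            if (ungetc(c, file) == EOF)
                return -1;
            return newlines;
        } else {
            at_line_start = 0;
        }
    }
    return -1;
}
```

Adding the returned count to a running total gives the line number for an error message.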

@Ivan Kushchev, fscanf will not work, because it does not read line by line. You can do it, for example, like this:

    int c, nl = 0, skip_lines = /* number of lines to skip */;

    while ((c = fgetc(f)) != EOF)
      if ((c == '\n') && (++nl == skip_lines))
        break;
    if (feof(f))
      fatal("Bad data");

If the size of the lines, or rather the offset of the line of interest in the file, is known (or can be calculated), see man 3 fseek .
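For the fixed-size case, a minimal sketch might look like this. It assumes every line occupies exactly `line_size` bytes including its terminator, and that the file was opened in binary mode ("rb"), since computed offsets are only portable for binary streams. `seek_to_line` is a hypothetical helper name.

```c
#include <stdio.h>

/* Position the stream at the start of line `index` (0-based), assuming
   every line occupies exactly `line_size` bytes, terminator included.
   Returns 0 on success and a nonzero value on failure, like fseek(). */
int seek_to_line(FILE *f, long index, long line_size)
{
    return fseek(f, index * line_size, SEEK_SET);
}
```

The sample file in the question has lines of different lengths, so this only applies if the offset can be computed some other way.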

Update

@VladD, there is no need to call free after each getline (and if you do call free, the pointer must be reset to NULL again).

It can be done like this (in one line):

    while (n-- && getline(&s, &len, file) > 0);
    if (n != -1)
      fatal(...);
    printf("last skipped line: %s", s);

It just uses slightly more resources (and in fact, when I wrote the answer, I had simply forgotten about getline).

-

@Ivan Kushchev, maybe in practice it is better for your task to simply skip lines until the data begins?

  while (getline(&s, &len, file) > 0 && *s != 'i'); 

and then read line by line and use sscanf(s, "%d %lf %lf %lf", ...) to fetch the data (the last three columns are real numbers, so %d would stop at the decimal point).
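Parsing one row of the "i Time Uvx Uvix" table from an already-read line might look like the sketch below. The `struct row` layout and the `parse_row` name are hypothetical. Because the sample rows contain values such as 5.000, the three real-valued columns are read with %lf.

```c
#include <stdio.h>

/* One row of the "i Time Uvx Uvix" table; the field names are illustrative. */
struct row {
    int    i;
    double time, uvx, uvix;
};

/* Parse one table row from a text line.
   Returns 1 if all four fields were converted, 0 otherwise. */
int parse_row(const char *line, struct row *r)
{
    return sscanf(line, "%d %lf %lf %lf",
                  &r->i, &r->time, &r->uvx, &r->uvix) == 4;
}
```

A return value of 0 can double as the end-of-block test, since header lines and the two-column final line do not yield four fields.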

Update

Yes, fair enough. The idea still holds, since there are 3 different data blocks anyway.

Just do while (getline(...) && ...); three times, each followed by a read (one of them in a loop).


And if the author does not have getline in libc, he can write his own version (good practice at the same time, especially since he may need a dynamic array of structures for the second data block).
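Such a home-made version could look roughly like this. It is a sketch, not POSIX getline: the name `read_line` and the `long` return type are choices made here. The caller is assumed to start with `*buf == NULL` and `*cap == 0`, or to pass a buffer returned by an earlier call.

```c
#include <stdio.h>
#include <stdlib.h>

/* Read one line, including the '\n' if present, into a heap buffer that
   grows as needed. Returns the number of characters read, or -1 if nothing
   could be read (EOF or error) or if memory allocation failed.
   The caller frees *buf once, after the last call. */
long read_line(char **buf, size_t *cap, FILE *f)
{
    size_t len = 0;
    int c;

    while ((c = fgetc(f)) != EOF) {
        if (len + 2 > *cap) {               /* room for c and the final '\0' */
            size_t ncap = *cap ? *cap * 2 : 64;
            char *p = realloc(*buf, ncap);
            if (p == NULL)
                return -1;
            *buf = p;
            *cap = ncap;
        }
        (*buf)[len++] = (char)c;
        if (c == '\n')
            break;
    }
    if (len == 0)
        return -1;
    (*buf)[len] = '\0';
    return (long)len;
}
```

The same grow-by-doubling pattern would also work for the dynamic array of structures mentioned above.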

  • Why not just char *s = NULL; size_t len = 0; while (n--) { getline(&s, &len, file); free(s); } (I hope getline is in the standard.) - VladD
  • @avp: Right! I misunderstood the documentation; now I have re-read it and got it. - VladD
  • getline() is not in standard C; it is in POSIX. @avp: the lines with 21 and `9.000` also need to be read, i.e. the *s != 'i' condition cannot be used. - jfs