For those who want to play detective: find a bug in the function from Midnight Commander

Find a mistake!

We invite you to try to find a bug in a very simple function from the GNU Midnight Commander project. What for? Just. It is fun and interesting. No, we lied. Once again we want to demonstrate an error that a person hardly finds in the process of code review, but easily finds the PVS-Studio static code analyzer.

Recently we were sent a letter asking where the analyzer issues a warning to the EatWhitespace function, the code of which is given below. In fact, the question is not so simple. Try it yourself to guess what is wrong with this code.

static int EatWhitespace (FILE * InFile) /* ----------------------------------------------------------------------- ** * Scan past whitespace (see ctype(3C)) and return the first non-whitespace * character, or newline, or EOF. * * Input: InFile - Input source. * * Output: The next non-whitespace character in the input stream. * * Notes: Because the config files use a line-oriented grammar, we * explicitly exclude the newline character from the list of * whitespace characters. * - Note that both EOF (-1) and the nul character ('\0') are * considered end-of-file markers. * * ----------------------------------------------------------------------- ** */ { int c; for (c = getc (InFile); isspace (c) && ('\n' != c); c = getc (InFile)) ; return (c); } /* EatWhitespace */

As you can see, the EatWhitespace function is quite small. Even a comment on a function takes up more space than the body of the function itself :). Now for some details.

Description of the getc function:

 int getc ( FILE * stream );

The function returns the character pointed to by the internal file position indicator of the specified stream. The indicator then moves to the next character. If the end of the file is reached at the time the stream is called, the function returns the EOF value and sets the end-of-file indicator for the stream. If a read error occurs, the function returns the value EOF and sets the error indicator for the stream (ferror).

Description of the isspace function:

 int isspace( int ch );

The function checks whether the character is whitespace by the classification of the current locale. In standard locale, the following characters are whitespace:

space (0x20, '');
page change (0x0c, '\ f');
line feed LF (0x0a, '\ n');
CR carriage return (0x0d, '\ r');
horizontal tab (0x09, '\ t');
vertical tab (0x0b, '\ v').

Return value Non-zero if the character is whitespace, zero otherwise.

The EatWhitespace function should skip all characters that are considered whitespace, except for the line feed '\ n'. Another reason to stop reading from a file can be to achieve the end of the file (EOF).

And now, knowing all this, try to find a mistake!

To prevent the reader from looking at the answer at once by chance, add a couple of waiting unicorns.

Figure 1. Time to look for a mistake. Unicorns will wait.

Figure 1. Time to look for a mistake. Unicorns will wait.

Don't you see the error anyway?

The fact is that we have deceived readers about isspace . Haha This is not a standard function at all, but a self-made macro. Yes, we - byaki and made you confused.

Figure 2. Unicorn gives readers a false idea of what isspace.

Figure 2. Unicorn gives readers a false idea of what isspace .

In fact, of course, it is not us or our unicorn that is to blame. Confusion was made by the authors of the GNU Midnight Commander project, deciding to create their own isspace implementation in the charset.h file:

 #ifdef isspace #undef isspace #endif .... #define isspace(c) ((c)==' ' || (c) == '\t')

By creating such a macro, some developers have confused other developers. The code is written on the assumption that isspace is a standard function that considers carriage return (0x0d, '\ r') as one of the whitespace.

The implemented macro considers only spaces and tabs as whitespace characters. Let's substitute the macro and see what happens.

 for (c = getc (InFile); ((c)==' ' || (c) == '\t') && ('\n' != c); c = getc (InFile))

The subexpression ('\ n'! = C) is redundant (redundant), since its result will always be true. The PVS-Studio analyzer warns about this, issuing a warning:

V560 A part of the conditional expression is always true: ('\ n'! = C). params.c 136.

For complete clarity, let's look at 3 scenarios:

Reached end of file. End of file (EOF) is not a space or tab. The subexpression ('\ n'! = C) is not evaluated due to short circuit evaluation . The cycle stops.
Read any character that is not a space or tab. The subexpression ('\ n'! = C) is not evaluated due to the short circuit evaluation. The cycle stops.
Read space character or horizontal tab. The subexpression ('\ n'! = C) is calculated, but its result will always be true.

In other words, the code considered is equivalent to this:

 for (c = getc (InFile); c==' ' || c == '\t'; c = getc (InFile))

We found out that the code does not work as intended. Let's now figure out what the consequences are.

The programmer, who wrote the isspace call in the body of the EatWhitespace function, expected that the standard function would be called. That is why he added the condition that the translation of the string LF ('\ n') should not be considered a space character.

Consequently, the programmer planned that in addition to the space and horizontal tab, such characters as page change and vertical tab would be omitted.

Notably, it was planned to skip the carriage return symbol CR (0x0d, '\ r'). This does not happen and the cycle will stop when it encounters this symbol. This will lead to unpleasant surprises if the string separator in the file is a CR + LF sequence used in some non-UNIX systems, such as Microsoft Windows.

For those who want to learn more about the historical reasons for using LF or CR + LF as line delimiters, we offer you the article on Wikipedia " Line feed ".

The EatWhitespace function is supposed to process files in the same way, where both LF and CR + LF are used as separators. For the case of CR + LF, this is not the case. In other words, if your file came from the world of Windows, then you are not lucky :).

Perhaps this is not a serious mistake, especially since the GNU Midnight Commander is common in UNIX-like operating systems, where the LF character (0x0a, '\ n') is used for line breaks. However, because of such trifles, there are various annoying problems of incompatibility of data prepared in Linux and Windows systems.

The described error is interesting because it is almost impossible to detect with a classic code review. Not all project developers can know about the subtleties of the macro, and forget them very easily. This is a good example when static code analysis complements code reviews and other error-checking techniques.

Redefining standard functions is a bad practice. By the way, recently in the article " Love static code analysis " a similar case was considered with the #define sprintf std :: printf macro.

A better solution would be to give the macro a unique name, for example, is_space_or_tab . Then confusion would be impossible.

Perhaps the reason for creating the macro was the slow work of the standard isspace function and the programmer created its faster version, sufficient to solve all the necessary tasks. But still this decision is wrong. It would be safer to define isspace so that non-compiled code is obtained. And the necessary functionality is implemented in a macro with a unique name.

Thanks for attention. We invite you to download and try the PVS-Studio analyzer to test your projects. Plus, we remind you that the Java language support has recently appeared in the analyzer.

If you want to share this article with an English-speaking audience, then please use the link to the translation: Andrey Karpov. Wanna Play a Detective? Find the Bug in a Function from Midnight Commander .

Source: https://habr.com/ru/post/439374/

For those who want to play detective: find a bug in the function from Midnight Commander

More articles: