Russian characters (strings) are not read from a file (C)

Question

The program should read lines of Russian characters from a file into an array, line by line, and display them on the screen. Compiler: gcc, OS: Linux. The text file is UTF-8 encoded. Lines of English characters are read, but lines of Russian characters are not. The program goes into an infinite loop because end of file is never reported.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <wchar.h>
    #include <locale.h>

    int main(int argc, char *argv[]){
        setlocale(LC_ALL,"Russian");
        FILE *fp;
        wchar_t arrayofwords[64][64];
        int i;
        i=0;
        if (argc!=2) {
            printf("Не указан исходный файл!\n");   /* "Source file not specified!" */
            exit(1);
        }
        if ((fp = fopen (argv[1],"r")) == NULL){
            printf("Ошибка при открытии файла!\n"); /* "Error opening file!" */
            exit(1);
        }
        while (!feof(fp)){
            if (fgetws(arrayofwords[i], 63, fp) != NULL){
                wprintf(L"%s\n", arrayofwords[i]);
                i++;
            }
        }
        fclose(fp);
    }

In your program FILE *fp is not set; I think you know about this.

    #include <stdio.h>

    int main() {
        FILE *fp;
        int c;
        fp = fopen("file.txt", "r");
        while (1) {
            c = fgetc(fp);
            if (feof(fp)) {
                break;
            }
            printf("%c", c);
        }
        fclose(fp);
        return (0);
    }

- Ryslandeveloper

avp · Answer 1 · 2016-08-14T21:01:11

The program loops because fgetws() returns NULL both at end of file and on any error (here errno = 84, EILSEQ: Invalid or incomplete multibyte or wide character). After the error the read position in the file does not advance, so end of file is never reached and feof() never becomes true.

To verify this, you can add the following (err() is declared in <err.h>, errno in <errno.h>):

  else if (ferror(fp)) err(2, "fgetws errno = %d", errno);

after

 if (fgetws(arrayofwords[i], 63, fp) != NULL){ ... }
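The same check can be folded into the read loop itself. The following is only a sketch: count_wide_lines is a hypothetical helper name, not part of the original program, and it counts the chunks returned by fgetws() (one per line as long as lines are shorter than the buffer).

```c
#include <errno.h>
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper: counts the chunks that fgetws() returns from fp.
 * Returns the count at end of file, or -1 if reading stopped on an
 * error (for example EILSEQ on an undecodable byte sequence). */
static long count_wide_lines(FILE *fp)
{
    wchar_t line[64];
    long n = 0;

    /* Loop on the result of fgetws(), not on feof(): after a conversion
     * error feof() stays false, so a feof()-driven loop never ends. */
    while (fgetws(line, 64, fp) != NULL)
        n++;

    if (ferror(fp))
        return -1;   /* stopped by an error; errno holds the cause */
    return n;        /* stopped by end of file */
}
```

A caller would treat -1 as a reason to print errno and stop, instead of retrying the read.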

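The conversion error occurs because the locale "Russian" is unknown to the system. In your case setlocale() returns NULL, and the corresponding error is errno = 2 (setlocale: No such file or directory). The program therefore keeps running in the default "C" locale, in which the multibyte sequences of the Russian text cannot be converted to wchar_t.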
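A failed setlocale() is easy to miss because the program simply carries on. As a minimal sketch (init_locale is a hypothetical helper, and the message text is an assumption), the return value can be checked like this:

```c
#include <locale.h>
#include <stdio.h>

/* Hypothetical helper: selects the named locale ("" means "take it from
 * the environment") and reports failure instead of silently continuing
 * in the previous locale. Returns 0 on success, -1 on failure. */
static int init_locale(const char *name)
{
    if (setlocale(LC_ALL, name) == NULL) {
        fprintf(stderr, "setlocale(\"%s\") failed: locale not available\n", name);
        return -1;
    }
    return 0;
}
```

Calling init_locale("") at the start of main() and exiting on -1 would have pointed at the real problem before any file was read.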

In general, for the program to work, it is enough to set a suitable locale:

    avp@avp-ubu1:hashcode$ locale
    LANG=en_US.UTF-8
    LANGUAGE=en
    LC_CTYPE="en_US.UTF-8"
    LC_NUMERIC=en_SG.UTF-8
    LC_TIME=en_SG.UTF-8
    LC_COLLATE="en_US.UTF-8"
    LC_MONETARY=en_SG.UTF-8
    LC_MESSAGES="en_US.UTF-8"
    LC_PAPER=en_SG.UTF-8
    LC_NAME=en_SG.UTF-8
    LC_ADDRESS=en_SG.UTF-8
    LC_TELEPHONE=en_SG.UTF-8
    LC_MEASUREMENT=en_SG.UTF-8
    LC_IDENTIFICATION=en_SG.UTF-8
    LC_ALL=
    avp@avp-ubu1:hashcode$

Then call setlocale(LC_ALL, ""), change the output format to L"%ls\n" (as explained in Roman Khimov's answer), and check the results of the functions you call so that errors are handled properly.

PS
Since locales are not always installed, and the functions that work with wchar_t sometimes react extremely badly to errors in the input data (that same Invalid or incomplete multibyte or wide character), and, moreover, output from the printf / wprintf functions can stop working, I prefer to work with UTF-8 directly. It is just a sequence of bytes (char *), so in most cases the standard printf, scanf, strcmp, etc. are enough, with no conversion to wchar_t. If conversion to wchar_t (UCS) and back, or some specific functions, are required, I use my own functions (from ucsutf.h, ucsutf.c), which do not depend on setlocale().
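To illustrate the byte-oriented approach, here is a minimal sketch that copies UTF-8 text line by line without any locale or wchar_t involvement. copy_utf8_lines is a hypothetical name, and the 256-byte buffer is an arbitrary assumption; this is not the ucsutf code mentioned above.

```c
#include <stdio.h>

/* Hypothetical helper: copies text from in to out as raw bytes.
 * UTF-8 is treated as an opaque char sequence, so no setlocale()
 * call and no multibyte conversion are needed.
 * Returns the number of chunks read by fgets(). */
static long copy_utf8_lines(FILE *in, FILE *out)
{
    char line[256];   /* size in bytes, not in characters */
    long n = 0;

    while (fgets(line, sizeof line, in) != NULL) {
        fputs(line, out);
        n++;
    }
    return n;
}
```

Note that a line longer than the buffer may be split in the middle of a multibyte character; that is harmless when the chunks are written back in order, as here, but matters if each chunk is processed separately.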

Roman Khimov · Answer 2 · 2016-08-14T18:35:22

Setting aside the following issues:

arrayofwords can easily overflow on a large file
lines longer than 64 characters (including the line break) will be treated as several lines
setlocale() would be better called as setlocale(LC_ALL,""), which respects the user (and their environment)
the line feed read from the file (and stored in the buffer) is followed by one more on output (perhaps that is intended, but still)
fgetws() can be given 64 rather than 63, since it reads one character fewer than the given size and appends the null character anyway
the feof() check can be removed; checking the fgetws() result is enough

It is hard for me to say why such a program would loop on a finite input file. However, it is easy to see why it prints garbage instead of the expected data (note that the problem is in the output, not in the reading; it reads everything correctly). Instead of
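Several of these points can be addressed in one read loop. The sketch below is an illustration under assumptions: read_lines, MAX_LINES and MAX_LEN are hypothetical names, and stripping the newline is one possible choice, not something the original program requires.

```c
#include <stdio.h>
#include <wchar.h>

#define MAX_LINES 64
#define MAX_LEN   64

/* Hypothetical helper: reads at most MAX_LINES chunks from fp into
 * lines[], strips the trailing newline from each one, and returns
 * the number of entries stored. */
static size_t read_lines(FILE *fp, wchar_t lines[MAX_LINES][MAX_LEN])
{
    size_t n = 0;

    /* The bound on n prevents writing past the array. The full buffer
     * size can be passed because fgetws() reads at most MAX_LEN - 1
     * characters and then appends the terminator. No feof() check is
     * needed: a NULL result ends the loop. */
    while (n < MAX_LINES && fgetws(lines[n], MAX_LEN, fp) != NULL) {
        lines[n][wcscspn(lines[n], L"\n")] = L'\0';
        n++;
    }
    return n;
}
```

Lines longer than the buffer are still split into several entries; handling those would need a growing buffer, which is left out here.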

 wprintf(L"%s\n", arrayofwords[i]);

it makes sense to write

 wprintf(L"%ls\n", arrayofwords[i]);

Without the l modifier, the argument is treated as const char * and is converted via mbrtowc() before output (exactly as described in the manual). In our case that conversion is not needed: with the l modifier the argument is treated as const wchar_t * and is output as is.
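The difference between the two conversions can be seen side by side with swprintf(), which formats into a wide buffer using the same rules as wprintf(). This is a small sketch; format_both is a hypothetical helper introduced only for illustration.

```c
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper: formats a wide string and a narrow string into
 * out, using the conversion that matches each argument type.
 * Returns the number of wide characters written, or a negative value
 * on error. */
static int format_both(wchar_t *out, size_t n,
                       const wchar_t *wide, const char *narrow)
{
    /* %ls expects const wchar_t * and copies it as is;
     * %s expects const char * and converts it from multibyte. */
    return swprintf(out, n, L"%ls|%s", wide, narrow);
}
```

Passing a wchar_t array to %s, as in the question, makes the function interpret wide characters as bytes of a multibyte string, which is where the garbage comes from.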
