The task is to find the number of lines in each file in the current directory, order the result and write to the file.
The problem is that I do not know how to identify all the text files in the folder.
I would suggest this somewhat perverse construction:
find . -maxdepth 1 -type f \
    -exec sh -c "file -bi '{}' | grep -q ^text/ && echo '{}'" \; \
    | xargs wc -l | head -n -1 | sort -gk 1 > line_counts.txt
What is going on here:
find . -maxdepth 1 -type f
- searches all regular files (-type f) in the current directory, without descending into subdirectories (-maxdepth 1)

-exec sh -c "…" \;
- for each file found, executes the command sh -c "…" (the "{}" is replaced with the file name). The point of this is that we cannot simply put a pipe inside find, it would not understand it, so we have to invoke a shell.

file -bi '{}'
- prints the MIME type of the file (-i), without printing the file name itself (-b, "brief"). This detection is not always accurate; see the notes below.

grep -q ^text/
- selects lines that start with "text/", but prints nothing (-q); only the exit code reports whether anything matched

&& echo '{}'
- if a match was found, the right-hand side of && runs and the file name is printed

xargs wc -l
- all incoming file names are passed as arguments to wc, which counts the lines (-l)

head -n -1
- cuts off the last line, which holds wc's grand total

sort -gk 1
- sorts numerically (-g) by the first field (-k 1)

Variations are possible. In particular, I think that matching only ^text/ is too restrictive (some files that are perfectly valid UTF-8 text have MIME types under application/*, while there is also application/octet-stream, which, generally speaking, need not be text at all), so something in the spirit of

file -b '{}' | grep -Fq ' text'

may work better. Also, if there are many files with long names, xargs will split them across several wc invocations, each of which prints its own totals line; in that case call wc once per file: "xargs -I '{}' wc -l '{}'".
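The variation above can be sketched end to end. This is a demo under assumptions, not the answer's exact command: it uses the looser "text" match on the `file -b` description, passes the name to sh as a positional argument instead of embedding '{}' in the command string (which sidesteps quoting problems), and relies on GNU head's negative -n count. All file names are invented for the demo.

```shell
# Demo in a throwaway directory (names here are illustrative).
dir=$(mktemp -d); cd "$dir" || exit 1
printf 'one\ntwo\n' > a.txt      # 2-line text file
printf 'x\ny\nz\n'  > b.txt      # 3-line text file
printf '\000\001\002' > blob.bin # binary junk, should be skipped

# Looser match: accept any file whose `file -b` description mentions "text".
# The name is passed as "$1" rather than spliced into the sh -c string.
find . -maxdepth 1 -type f \
    -exec sh -c 'file -b "$1" | grep -q text && printf "%s\n" "$1"' sh {} \; \
    | xargs wc -l \
    | head -n -1 \
    | sort -gk 1 > line_counts.txt   # head -n -1 (GNU) drops wc's totals line

cat line_counts.txt
```

With more than one matching file, wc emits a grand-total line, which is exactly what head -n -1 removes before sorting.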
Yes, I used mostly GNU utilities (GNU findutils, GNU coreutils, GNU grep), the exception being the BSD-derived file. Non-GNU systems may ship other implementations of these utilities, which may lack some of the options used here. In general, YMMV; when in doubt, consult the documentation.
All this, however, will break if some lover of the strange creates a file whose name contains a newline character (\n): the part of the pipe starting with xargs will then fall apart. To fix it, you would have to terminate each name with a NUL byte instead, say with && echo -e '\x00' (or something along those lines), and pass xargs the -0 (--null) option.
file * | grep text | awk '{ print $1 }' | tr -d : | sort -gk 1

— avp

Source: https://ru.stackoverflow.com/questions/163448/
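As quoted, that one-liner only produces the text-file names; to get the line counts the task asks for, one could feed them into wc -l, the same way as in the first answer. That extra step is my own addition, not part of the quoted answer, and the demo files are invented.

```shell
# Demo in a throwaway directory (names here are illustrative).
dir=$(mktemp -d); cd "$dir" || exit 1
printf 'a\nb\n' > one.txt    # 2-line text file
printf 'c\n'    > two.txt    # 1-line text file
printf '\000'   > junk.bin   # binary, should be skipped

# file prints "name: description"; awk takes the first field and
# tr strips the trailing colon, leaving bare file names for wc.
file * | grep text | awk '{ print $1 }' | tr -d : \
    | xargs wc -l | head -n -1 | sort -gk 1 > counts.txt

cat counts.txt
```

Note this variant inherits the same fragility as the first answer: file names with spaces or newlines will break the awk/xargs stages.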