Problem statement: there is a folder of files, and I need to go through them and determine which ones are text.

How should the check itself be done? How do I tell whether a file is text or not?

In the end, all the text files will be compared by content.

  • But if you add the “missing” detail that this is a learning assignment, everything falls into place, and you can simply take the trouble to type in a handful of file extensions (although the two that @shurik listed above will suffice). - jmu
  • When I say a file is text, I mean that it contains meaningful text rather than arbitrary characters. The file may have no extension at all, or it may have an arbitrary one. - Val
  • I.e., not just any text, but text that necessarily makes sense? You do understand that, in the general case, the problem stated this way has no solution. - cy6erGn0m
  • :~$ which file
    /usr/bin/file
    :~$ file /usr/bin/file
    /usr/bin/file: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.15, stripped
    :~$ file /var/log/boot.log
    /var/log/boot.log: ASCII Pascal program text, with CRLF, CR line terminators, with escape sequences
    - jmu
  • Rewrite file in Java? A good learning exercise (worth a term paper, perhaps?) - avp

2 answers

There is a whole bunch of ways to do this. Many of them are based on reading the first 512 bytes of the file. However, this is probably not the fastest approach: I am sure that on a really large file tree it will run for many hours.

  • Yes, Java is slow, but that slow (many hours)? Recently I had to run grep -R over a tree of about 100 megabytes. It took 15 minutes. - avp
  • And what is 100 megabytes? I would say that is rather slow for grep. Besides, a simple search is one thing, while determining the file type is more complicated. - cy6erGn0m
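A minimal Java sketch of the "first 512 bytes" idea mentioned above, not a definitive implementation. The class name `TextSniffer` and the method `looksLikeText` are hypothetical. The sketch assumes that "text" means ASCII or UTF-8 with no NUL bytes, so UTF-16 files and files in legacy single-byte encodings will probably be reported as non-text. It also cannot tell meaningful text from arbitrary printable characters.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: guesses "text or binary" from the first 512 bytes only.
public class TextSniffer {
    private static final int SAMPLE_SIZE = 512;

    public static boolean looksLikeText(Path file) throws IOException {
        byte[] sample;
        try (InputStream in = Files.newInputStream(file)) {
            sample = in.readNBytes(SAMPLE_SIZE); // requires Java 11 or later
        }

        // Assumption: a NUL byte does not occur in ASCII/UTF-8 text.
        for (byte b : sample) {
            if (b == 0) {
                return false;
            }
        }

        // Strict decoder: report invalid UTF-8 instead of silently replacing it.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);

        // If the sample filled the buffer, it may end in the middle of a
        // multi-byte character, so do not treat a dangling prefix as an error.
        boolean endOfInput = sample.length < SAMPLE_SIZE;
        CoderResult result = decoder.decode(
                ByteBuffer.wrap(sample), CharBuffer.allocate(SAMPLE_SIZE), endOfInput);

        // An empty file decodes without errors and is therefore reported as text.
        return !result.isError();
    }
}
```

Because only a fixed-size prefix is read, the cost per file does not depend on the file size; the time for a large tree is then dominated by opening the files.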

I would suggest checking against a list of known extensions, such as .txt, .log ...

Another method is to read the file and check whether it contains any non-text characters such as \x00, \x01, \x02 ...

  • Listing every type may not be realistic (different languages, for example), in which case this method will not work. - Val
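The second method from this answer (scanning for non-text characters) could be sketched in Java roughly as follows. The names `ControlByteScan`, `isTextByte` and `hasOnlyTextBytes` are hypothetical, and the set of control bytes that are still accepted (tab, LF, CR, form feed, ESC) is an assumption; ESC is allowed only because logs like the boot.log mentioned in the comments contain escape sequences. Bytes of 0x80 and above are accepted without checking the encoding.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: a file counts as text if it has no "suspicious" control bytes.
public class ControlByteScan {

    // Assumption: tab, LF, CR, form feed and ESC are acceptable in text;
    // every other byte below 0x20 (such as \x00, \x01, \x02) is not.
    static boolean isTextByte(int b) {
        return b >= 0x20 || b == '\t' || b == '\n' || b == '\r' || b == '\f' || b == 0x1B;
    }

    public static boolean hasOnlyTextBytes(Path file) throws IOException {
        try (InputStream in = new BufferedInputStream(Files.newInputStream(file))) {
            int b;
            // read() returns 0..255, or -1 at end of file.
            while ((b = in.read()) != -1) {
                if (!isTextByte(b)) {
                    return false; // stop at the first non-text byte
                }
            }
        }
        return true;
    }
}
```

Unlike an extension list, this check does not depend on the file name, but it reads a text file to the end, so it is slower on large files than sampling a fixed-size prefix.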