How do I implement this method in Java?
I need to check a file's encoding at the bit level.
Code:

    unc ::IsUTF8(unc *cpt)
    {
        if (!cpt)
            return 0;

        if ((*cpt & 0xF8) == 0xF0) { // start of 4-byte sequence
            if (((*(cpt + 1) & 0xC0) == 0x80)
                && ((*(cpt + 2) & 0xC0) == 0x80)
                && ((*(cpt + 3) & 0xC0) == 0x80))
                return 4;
        } else if ((*cpt & 0xF0) == 0xE0) { // start of 3-byte sequence
            if (((*(cpt + 1) & 0xC0) == 0x80)
                && ((*(cpt + 2) & 0xC0) == 0x80))
                return 3;
        } else if ((*cpt & 0xE0) == 0xC0) { // start of 2-byte sequence
            if ((*(cpt + 1) & 0xC0) == 0x80)
                return 2;
        }
        return 0;
    }

Question:

  • How do I translate this method into Java code?
  • @nazar_art, in fact the original UTF-8 scheme also allowed 5- and 6-byte sequences, and a single (unsigned) byte below 128 is valid UTF-8 as well. By the way, for multi-byte sequences the number of leading 1 bits in the first byte is exactly the size in bytes of that UTF-8 "character". - avp
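
The remark about leading 1 bits can be turned into a small helper. The following is only a sketch: the class name Utf8Lead and the method sequenceLength are made up for illustration, and it assumes the modern 4-byte upper limit rather than the 5- and 6-byte forms mentioned in the comment. It only counts bits and does not check the continuation bytes.

```java
final class Utf8Lead {
    private Utf8Lead() {}

    /**
     * Returns the length in bytes of the UTF-8 sequence that starts with
     * {@code first}, or 0 if this byte cannot start a sequence.
     * Hypothetical helper, not part of the original thread.
     */
    static int sequenceLength(byte first) {
        int b = first & 0xFF;            // Java bytes are signed; widen to 0..255
        if (b < 0x80) {
            return 1;                    // 0xxxxxxx: single-byte (ASCII) character
        }
        // Invert the 8 bits and move them to the top of the int, so the
        // leading 1 bits of the byte become leading 0 bits of the int.
        int ones = Integer.numberOfLeadingZeros((~b & 0xFF) << 24);
        // Exactly one leading 1 bit marks a continuation byte (10xxxxxx).
        // More than four is rejected here (assumption: 4-byte maximum).
        return (ones >= 2 && ones <= 4) ? ones : 0;
    }
}
```

For example, Utf8Lead.sequenceLength((byte) 0xE2) gives 3, because 0xE2 is 11100010 in binary and starts with three 1 bits.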

3 answers

I tried to make your task as easy as possible:

    // imports needed at the top of the file
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;

    // not sure about this value, substitute the one you need
    private static final int UTF8_HEADER_SIZE = 8;

    public static boolean isUTF8(String path) {
        return isUTF8(new File(path));
    }

    public static boolean isUTF8(File file) {
        // validate input
        if (null == file) {
            throw new IllegalArgumentException("input file can't be null");
        }
        if (file.isDirectory()) {
            throw new IllegalArgumentException("input file refers to a directory");
        }

        // read input file
        byte[] buffer = new byte[UTF8_HEADER_SIZE];
        try {
            readBytes(file, buffer);
        } catch (IOException e) {
            throw new IllegalArgumentException(
                    "Can't read input file, error = " + e.getLocalizedMessage());
        }

        // validate file header
        // TODO: your validation goes here
        // if (0xF0 == (buffer[0] & 0xF8)) {
        // }
        return false;
    }

    private static void readBytes(File input, byte[] buffer) throws IOException {
        if (null == buffer || 0 == buffer.length) {
            return;
        }
        // read data; try-with-resources closes the stream even if read() throws
        try (FileInputStream fis = new FileInputStream(input)) {
            fis.read(buffer);
        }
    }
  • @jmu Nicely laid out! Encapsulation++. I'll try it now. - nazar_art
  • @jmu I replaced // if (0xF0 == (buffer[0] & 0xF8)) { with if ((buffer[0] & 0xF8) == 0xF0) { if (((buffer[1] & 0xC0) == 0x80) && ((buffer[2] == 0x80) && ((buffer[3] == 0x80)))) return true; and the check fails. Why is that? What is the problem and how do I fix it? (Even though the file is 100% in the required format.) - nazar_art
  • EncodingsCheck: it goes wrong starting at if ((buffer[0] & 0xF8) == 0xF0) {... - nazar_art
  • @jmu I added the correct method as an answer below. There it is enough to read 4 bytes and everything works. - nazar_art
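
One likely reason the check in the comment above fails: buffer[2] == 0x80 and buffer[3] == 0x80 compare the raw byte without the & 0xC0 mask, and a Java byte is signed (-128..127), so it can never equal 0x80 (128). A small sketch of the difference follows; the class and method names are invented for illustration.

```java
final class SignedByteDemo {
    private SignedByteDemo() {}

    /** Compares the raw byte with 0x80: never true, since a byte is -128..127. */
    static boolean unmaskedCheck(byte b) {
        return b == 0x80;              // b is promoted to int; 0x80 is 128
    }

    /** Masks the top two bits first: true for any continuation byte 10xxxxxx. */
    static boolean isContinuation(byte b) {
        return (b & 0xC0) == 0x80;     // the mask discards the sign-extended bits
    }
}
```

With the mask, the bytes 0x80 through 0xBF all pass; without it, even (byte) 0x80 itself is stored as -128 and the comparison is false.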

Here is how to approach it.

  1. For a growable array in Java, the ArrayList<Byte> class is used (for a fixed-size buffer, a plain byte[] also works).
  2. The bit operations are the same, except that the unsigned right shift is written >>> (but you do not need it here).
  3. Instead of address arithmetic, use indexing.
  4. You cannot pass something like a pointer into the middle of an array; just pass the array and the starting index.

Signature:

    // inside a class
    public static int IsUTF8(ArrayList<Byte> cpt, int startIndex) {
        // ...

The rest is up to you :-)
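
For reference, here is one way the points above might come together. It is a sketch, not the answerer's code: it uses a plain byte[] instead of ArrayList<Byte> (a deliberate swap), the class name Utf8Sequence is made up, and it adds a bounds check that the C++ original does not have.

```java
final class Utf8Sequence {
    private Utf8Sequence() {}

    /**
     * Returns the length (2, 3 or 4) of the multi-byte UTF-8 sequence that
     * starts at index {@code start}, or 0 if there is none. Like the C++
     * original, it returns 0 for plain ASCII bytes.
     */
    static int isUTF8(byte[] cpt, int start) {
        if (cpt == null || start < 0 || start >= cpt.length) {
            return 0;
        }
        int lead = cpt[start];           // sign-extended, but the masks fix that
        int length;
        if ((lead & 0xF8) == 0xF0) {
            length = 4;                  // 11110xxx
        } else if ((lead & 0xF0) == 0xE0) {
            length = 3;                  // 1110xxxx
        } else if ((lead & 0xE0) == 0xC0) {
            length = 2;                  // 110xxxxx
        } else {
            return 0;
        }
        if (start + length > cpt.length) {
            return 0;                    // sequence is cut off by the array end
        }
        // Indexing replaces the pointer arithmetic *(cpt + i).
        for (int i = 1; i < length; i++) {
            if ((cpt[start + i] & 0xC0) != 0x80) {
                return 0;                // not a continuation byte 10xxxxxx
            }
        }
        return length;
    }
}
```

Calling Utf8Sequence.isUTF8(new byte[] {(byte) 0xE2, (byte) 0x82, (byte) 0xAC}, 0) returns 3, since those bytes encode the euro sign.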

  • I need to pass a file in and check it. I'm now trying this in the method public int byteSequence (File file) {if (file == null || (! File.isFile ())) return 0; if ((cpt.charAt (0) & 0xF8) == 0xF0) {// start of 4-byte sequence if (((cpt.charAt (1) & 0xC0) == 0x80) ... What do you think? - nazar_art
  • Uh ... and where do you read the file? I would split it into two procedures: reading the file (for example, < stackoverflow.com/a/8432074/276994> ) and IsUTF8. (Besides, if I understand correctly, you only need to read the beginning of the file.) - VladD
  • @VladD That was just a rough draft written in a hurry. - nazar_art
  • And where did you get a charAt function on an array? - VladD
  • public int byteSequence (File file) => here I took it from a file - nazar_art

After a long search, here is the result (the child of hard-won errors :)), which checks for UTF-8 encoding:

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;

    class EncodingsCheck implements Checker {

        @Override
        public boolean check(File currentFile) {
            return isUTF8(currentFile);
        }

        public static boolean isUTF8(File file) {
            // validate input
            if (null == file) {
                throw new IllegalArgumentException("input file can't be null");
            }
            if (file.isDirectory()) {
                throw new IllegalArgumentException(
                        "input file refers to a directory");
            }

            // read input file
            byte[] buffer;
            try {
                buffer = readUTFHeaderBytes(file);
            } catch (IOException e) {
                throw new IllegalArgumentException(
                        "Can't read input file, error = "
                                + e.getLocalizedMessage());
            }

            if (0 == (buffer[0] & 0x80)) {
                return true; // ASCII subset character, fast path
            } else if (0xF0 == (buffer[0] & 0xF8)) { // start of 4-byte sequence
                // compare the index, not the byte value, with the length
                if (3 >= buffer.length) {
                    return false;
                }
                if ((0x80 == (buffer[1] & 0xC0))
                        && (0x80 == (buffer[2] & 0xC0))
                        && (0x80 == (buffer[3] & 0xC0))) {
                    return true;
                }
            } else if (0xE0 == (buffer[0] & 0xF0)) { // start of 3-byte sequence
                if (2 >= buffer.length) {
                    return false;
                }
                if ((0x80 == (buffer[1] & 0xC0))
                        && (0x80 == (buffer[2] & 0xC0))) {
                    return true;
                }
            } else if (0xC0 == (buffer[0] & 0xE0)) { // start of 2-byte sequence
                if (1 >= buffer.length) {
                    return false;
                }
                if (0x80 == (buffer[1] & 0xC0)) {
                    return true;
                }
            }
            return false;
        }

        private static byte[] readUTFHeaderBytes(File input) throws IOException {
            // read data; try-with-resources closes the stream in every case
            try (FileInputStream fileInputStream = new FileInputStream(input)) {
                byte[] firstBytes = new byte[4];
                int count = fileInputStream.read(firstBytes);
                if (count < 4) {
                    throw new IOException("File is shorter than 4 bytes");
                }
                return firstBytes;
            }
        }
    }
  • And why are you checking buffer.length in isUTF8? With your algorithm it is a constant 4, in theory. - VladD
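
The method above only inspects the first four bytes. If the whole content needs to be validated, one alternative (a different technique from the bit checks in this thread) is to let the JDK's strict UTF-8 decoder do the work. This is a sketch; the class name Utf8Validator is made up, and reading the entire file into memory is an assumption that only suits small files.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class Utf8Validator {
    private Utf8Validator() {}

    /** Returns true if {@code data} decodes as UTF-8 without any error. */
    static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    // REPORT makes the decoder throw instead of substituting U+FFFD
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;                // malformed or truncated sequence found
        }
    }
}
```

A file could then be checked with Utf8Validator.isValidUtf8(java.nio.file.Files.readAllBytes(file.toPath())). Note that plain ASCII and many single-byte texts also pass, so a true result means "decodes as UTF-8", not "was definitely written as UTF-8".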