Implementing a method signature in Java?

Question

How can I implement this method in Java?
I need to check the file encoding at the bit level.
Code:

    unc ::IsUTF8(unc *cpt)
    {
        if (!cpt)
            return 0;

        if ((*cpt & 0xF8) == 0xF0) { // start of 4-byte sequence
            if (((*(cpt + 1) & 0xC0) == 0x80)
                    && ((*(cpt + 2) & 0xC0) == 0x80)
                    && ((*(cpt + 3) & 0xC0) == 0x80))
                return 4;
        } else if ((*cpt & 0xF0) == 0xE0) { // start of 3-byte sequence
            if (((*(cpt + 1) & 0xC0) == 0x80)
                    && ((*(cpt + 2) & 0xC0) == 0x80))
                return 3;
        } else if ((*cpt & 0xE0) == 0xC0) { // start of 2-byte sequence
            if ((*(cpt + 1) & 0xC0) == 0x80)
                return 2;
        }
        return 0;
    }

Question:

How can I translate this method into Java?

@nazar_art, the original UTF-8 design also allowed 5- and 6-byte sequences (the current standard, RFC 3629, stops at 4 bytes), and a single byte with an unsigned value below 128 is also valid UTF-8.
By the way, for multi-byte sequences the number of leading one bits in the first byte is exactly the size in bytes of that UTF-8 "character".
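
The leading-bits rule from this comment can be sketched in Java roughly as follows. This is only an illustration: the class and method names (Utf8LeadByte, sequenceLength) are made up for the example, and it assumes the RFC 3629 limit of 4 bytes. It looks only at the bit pattern of the first byte, so it does not reject overlong lead bytes such as 0xC0.

```java
// Hypothetical helper, not part of the code in the question.
class Utf8LeadByte {
    /**
     * Returns the length in bytes of the UTF-8 sequence announced by the
     * bit pattern of {@code first}, or 0 if the pattern is a continuation
     * byte or announces more than 4 bytes.
     */
    static int sequenceLength(byte first) {
        int b = first & 0xFF;            // Java bytes are signed; mask to 0..255
        if (b < 0x80) {
            return 1;                    // 0xxxxxxx: single-byte (ASCII) character
        }
        // Move the byte to the top of the int and invert it, so that counting
        // leading zeros counts the leading one bits of the original byte.
        int leadingOnes = Integer.numberOfLeadingZeros(~(b << 24));
        // 1 leading one bit = continuation byte; 5 or more = beyond RFC 3629.
        return (leadingOnes >= 2 && leadingOnes <= 4) ? leadingOnes : 0;
    }

    public static void main(String[] args) {
        System.out.println(sequenceLength((byte) 0x41)); // 1
        System.out.println(sequenceLength((byte) 0xC3)); // 2
        System.out.println(sequenceLength((byte) 0xE2)); // 3
        System.out.println(sequenceLength((byte) 0xF0)); // 4
        System.out.println(sequenceLength((byte) 0x80)); // 0
    }
}
```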

jmu · Accepted Answer · 2013-03-07T18:50:10

I tried to make your task as easy as possible:

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;

    // not sure about this value; substitute the one you need
    private static final int UTF8_HEADER_SIZE = 8;

    public static boolean isUTF8(String path) {
        return isUTF8(new File(path));
    }

    public static boolean isUTF8(File file) {
        // validate input
        if (null == file) {
            throw new IllegalArgumentException("input file can't be null");
        }
        if (file.isDirectory()) {
            throw new IllegalArgumentException("input file refers to a directory");
        }

        // read input file
        byte[] buffer = new byte[UTF8_HEADER_SIZE];
        try {
            readBytes(file, buffer);
        } catch (IOException e) {
            throw new IllegalArgumentException(
                    "Can't read input file, error = " + e.getLocalizedMessage());
        }

        // validate file header
        // TODO: your validation goes here
        // if (0xF0 == (buffer[0] & 0xF8)) {
        // }
        return false;
    }

    private static void readBytes(File input, byte[] buffer) throws IOException {
        if (null == buffer || 0 == buffer.length) {
            return;
        }
        // read data; close the stream even if read() throws
        FileInputStream fis = new FileInputStream(input);
        try {
            fis.read(buffer);
        } finally {
            fis.close();
        }
    }

@jmu, I replaced the commented-out // if (0xF0 == (buffer[0] & 0xF8)) { with: if ((buffer[0] & 0xF8) == 0xF0) { if (((buffer[1] & 0xC0) == 0x80) && ((buffer[2] == 0x80) && ((buffer[3] == 0x80)))) return true;
In EncodingsCheck, everything starting from if ((buffer[0] & 0xF8) == 0xF0) {... goes wrong.
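
One likely reason this replacement misbehaves: a Java byte is signed (-128..127), so a comparison such as buffer[2] == 0x80 can never be true, and the & 0xC0 mask from the C++ original is needed on every continuation byte. A small sketch of the difference (the class name SignedByteDemo is hypothetical):

```java
// Hypothetical demo class, not part of the answer's code.
class SignedByteDemo {
    /** True if b has the form 10xxxxxx, i.e. it is a UTF-8 continuation byte. */
    static boolean isContinuation(byte b) {
        return (b & 0xC0) == 0x80;
    }

    public static void main(String[] args) {
        byte b = (byte) 0x80;                    // stored as -128
        System.out.println(b == 0x80);           // false: -128 is not 128
        System.out.println((b & 0xFF) == 0x80);  // true: masked to 0..255 first
        System.out.println(isContinuation(b));   // true
    }
}
```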

Answer 2 · 2013-03-06T19:06:03

Here is what to keep in mind.

For a resizable array in Java you can use the ArrayList<Byte> class (a plain byte[] works as well).
Bit operations are the same, except that the unsigned right shift is written >>> (but you do not need it here).
Instead of pointer arithmetic, use indexing.
You cannot pass something like a pointer into the middle of an array, so just pass the array and the start index.

Signature:

    // inside a class
    public static int IsUTF8(ArrayList<Byte> cpt, int startIndex) {
        // ...

The rest is up to you :-)
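
For reference, one possible direct port of the C++ method along these lines. It is a sketch rather than the answerer's code: it takes a plain byte[] instead of ArrayList<Byte> (an assumption made for brevity), adds bounds checks that the pointer version does not have, and the class name Utf8Check is invented for the example. Like the C++ original, it returns 0 for a single-byte ASCII character.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical class; only the method logic follows the C++ code in the question.
class Utf8Check {
    /** True if index i is inside the array and cpt[i] has the form 10xxxxxx. */
    private static boolean isContinuation(byte[] cpt, int i) {
        return i < cpt.length && (cpt[i] & 0xC0) == 0x80;
    }

    /**
     * Mirrors the C++ IsUTF8: returns 4, 3 or 2 if a multi-byte UTF-8
     * sequence of that length starts at startIndex, otherwise 0.
     */
    static int isUTF8(byte[] cpt, int startIndex) {
        if (cpt == null || startIndex < 0 || startIndex >= cpt.length) {
            return 0;
        }
        int first = cpt[startIndex];
        if ((first & 0xF8) == 0xF0) {            // start of 4-byte sequence
            if (isContinuation(cpt, startIndex + 1)
                    && isContinuation(cpt, startIndex + 2)
                    && isContinuation(cpt, startIndex + 3)) {
                return 4;
            }
        } else if ((first & 0xF0) == 0xE0) {     // start of 3-byte sequence
            if (isContinuation(cpt, startIndex + 1)
                    && isContinuation(cpt, startIndex + 2)) {
                return 3;
            }
        } else if ((first & 0xE0) == 0xC0) {     // start of 2-byte sequence
            if (isContinuation(cpt, startIndex + 1)) {
                return 2;
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        byte[] eAcute = "\u00e9".getBytes(StandardCharsets.UTF_8); // C3 A9
        System.out.println(isUTF8(eAcute, 0)); // 2
    }
}
```

Passing the array together with a start index plays the role of the cpt pointer, so the same method can be called at any offset in a larger buffer.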

I am now trying this in a method: public int byteSequence(File file) { if (file == null || (!file.isFile())) return 0;
if ((cpt.charAt(0) & 0xF8) == 0xF0) { // start of 4-byte sequence if (((cpt.charAt(1) & 0xC0) == 0x80) ... What do you think?
I would split it into two procedures: reading the file (for example, <stackoverflow.com/a/8432074/276994>) and IsUTF8.
(If I understand correctly, you only need to read the beginning of the file.)
public int byteSequence(File file) => here I read it from the file.
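
The "reading a file" half of that split could look roughly like this. It is a minimal sketch, assuming only the first few bytes are needed; the names FileHead and readHead are hypothetical, and it is not the code from the linked answer.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper for reading the beginning of a file.
class FileHead {
    /**
     * Reads up to maxBytes from the start of the file. The returned array
     * is shorter than maxBytes if the file itself is shorter.
     */
    static byte[] readHead(Path file, int maxBytes) throws IOException {
        byte[] buffer = new byte[maxBytes];
        int total = 0;
        try (InputStream in = Files.newInputStream(file)) {
            // read() may return fewer bytes than requested, so loop until
            // the buffer is full or the end of the file is reached.
            while (total < maxBytes) {
                int n = in.read(buffer, total, maxBytes - total);
                if (n < 0) {
                    break;
                }
                total += n;
            }
        }
        return Arrays.copyOf(buffer, total);
    }
}
```

The resulting byte[] can then be handed to the encoding check, which keeps file I/O and bit-level validation separate.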

Answer 3 · 2013-03-10T21:54:46

After a long search, here is the result (the child of hard-won mistakes :)), which checks for UTF-8 encoding:

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;

    // Checker is an interface that is not shown in this thread
    class EncodingsCheck implements Checker {

        @Override
        public boolean check(File currentFile) {
            return isUTF8(currentFile);
        }

        public static boolean isUTF8(File file) {
            // validate input
            if (null == file) {
                throw new IllegalArgumentException("input file can't be null");
            }
            if (file.isDirectory()) {
                throw new IllegalArgumentException(
                        "input file refers to a directory");
            }

            // read input file
            byte[] buffer;
            try {
                buffer = readUTFHeaderBytes(file);
            } catch (IOException e) {
                throw new IllegalArgumentException(
                        "Can't read input file, error = " + e.getLocalizedMessage());
            }

            // The buffer always holds exactly 4 bytes here (readUTFHeaderBytes
            // throws otherwise), so no index bounds checks are needed below.
            if (0 == (buffer[0] & 0x80)) {
                return true; // ASCII subset character, fast path
            } else if (0xF0 == (buffer[0] & 0xF8)) { // start of 4-byte sequence
                if ((0x80 == (buffer[1] & 0xC0))
                        && (0x80 == (buffer[2] & 0xC0))
                        && (0x80 == (buffer[3] & 0xC0))) {
                    return true;
                }
            } else if (0xE0 == (buffer[0] & 0xF0)) { // start of 3-byte sequence
                if ((0x80 == (buffer[1] & 0xC0))
                        && (0x80 == (buffer[2] & 0xC0))) {
                    return true;
                }
            } else if (0xC0 == (buffer[0] & 0xE0)) { // start of 2-byte sequence
                if (0x80 == (buffer[1] & 0xC0)) {
                    return true;
                }
            }
            return false;
        }

        private static byte[] readUTFHeaderBytes(File input) throws IOException {
            // read data
            FileInputStream fileInputStream = new FileInputStream(input);
            try {
                byte[] firstBytes = new byte[4];
                int count = fileInputStream.read(firstBytes);
                if (count < 4) {
                    throw new IOException("File is shorter than 4 bytes");
                }
                return firstBytes;
            } finally {
                fileInputStream.close();
            }
        }
    }

Given your algorithm, this should in theory be a constant 4.
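
Note that the check above only looks at the first character of the file. If the goal is to validate the whole content, a different technique is to let the JDK's strict UTF-8 decoder do the work. The sketch below is an alternative, not the method used in this answer; the names StrictUtf8 and isValidUtf8 are hypothetical, and it assumes the data fits in memory as a byte[].

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: validates a whole byte array instead of one character.
class StrictUtf8 {
    /** Returns true if data decodes as UTF-8 without any malformed input. */
    static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    // report errors instead of silently replacing bad bytes
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

Keep in mind that pure ASCII data also passes this check, and that a successful decode only shows the bytes are consistent with UTF-8, not that the file was actually written in that encoding.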
