I have a huge file in EBCDIC format that needs to be converted to ASCII. The files start at 250 MB. If the whole file is read, converted, and split into an array of strings, it takes a lot of time and, most importantly, a great deal of memory. I need a mechanism for fast reading and conversion that also uses less memory.

Here is the old algorithm, which I do not recommend using:

    // Fields and helpers such as ebcdicCharset, INITIAL_BUFFER_SIZE, fixedLength, NEL, LF,
    // close() and replaceNonPrintableCharacterByWhitespace() are defined elsewhere in the class.
    private static final char[] NON_PRINTABLE_EBCDIC_CHARS = new char[] {
        /*0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x11, 0x12, 0x13, 0x14, 0x21, 0x22, 0x23, 0x24 */
        /*, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0E, 0x0F,
        0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1A, 0x1B, 0x1C, 0x1D, 0x1E, 0x1F,
        0x20, 0x7F,
        0x80, 0x81, 0x82, 0x83, 0x84, 0x85, 0x86, 0x87, 0x88, 0x89, 0x8A, 0x8B, 0x8C, 0x8D, 0x8E, 0x8F,
        0x90, 0x91, 0x92, 0x93, 0x94, 0x95, 0x96, 0x97, 0x98, 0x99, 0x9A, 0x9B, 0x9C, 0x9D, 0x9E, 0xA0*/
    };

    public String convert(String input) throws IOException {
        StringWriter writer = new StringWriter();
        Reader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(new File(input)), ebcdicCharset));
        // The whole file is loaded into an int[] before any conversion starts.
        int[] ebcdicInput = loadContent(reader);
        close(reader);
        convert(ebcdicInput, writer);
        return writer.toString();
    }

    private int[] loadContent(Reader reader) throws IOException {
        int[] buffer = new int[INITIAL_BUFFER_SIZE];
        int bufferIndex = 0;
        int bufferSize = buffer.length;
        int character;
        // Reads one character at a time and grows the buffer by a fixed increment.
        while ((character = reader.read()) != -1) {
            if (bufferIndex == bufferSize) {
                buffer = resizeArray(buffer, bufferSize + INITIAL_BUFFER_SIZE);
                bufferSize = buffer.length;
            }
            buffer[bufferIndex++] = character;
        }
        return resizeArray(buffer, bufferIndex);
    }

    final int[] resizeArray(int[] orignalArray, int newSize) {
        int[] resizedArray = new int[newSize];
        for (int i = 0; i < newSize && i < orignalArray.length; i++) {
            resizedArray[i] = orignalArray[i];
        }
        return resizedArray;
    }

    private void convert(int[] ebcdicInput, Writer convertedOutputWriter) throws IOException {
        int convertedChar;
        for (int index = 0; index < ebcdicInput.length; index++) {
            int character = ebcdicInput[index];
            // In fixed-length mode, start a new line after every fixedLength characters.
            if (fixedLength != -1 && index > 0 && index % fixedLength == 0) {
                convertedOutputWriter.append((char) LF);
            }
            if (fixedLength == -1 && character == NEL) {
                convertedChar = LF;
            } else {
                convertedChar = replaceNonPrintableCharacterByWhitespace(character);
            }
            // Write the converted character, not the raw input character.
            convertedOutputWriter.append((char) convertedChar);
        }
    }
  • Why split the file into lines at all? Just read it in blocks that are multiples of 4 KB: read a block, process it, write it to disk, then move on to the next block. You could also use a Java alternative to mmap (Google says it is MappedByteBuffer) - Mike
  • I read in blocks, tried different block sizes, and read in different ways. Reading and converting takes 18 hours )) - Senior Pomidor
  • I cannot even imagine what you would have to do for 250 MB to take 18 hours. To start, measure how long it takes to read and write in your chosen way without any conversion. The volume is trivial; it should be read and written in a couple of seconds - Mike
  • If I understand correctly, characters are converted one-to-one, so a 256-entry table can be used for transcoding. There may be characters that do not map so easily, and you will have to agree on replacements for them. The whole loop is then: read a block, run over it, write the block. Since reading and writing the file takes longer than replacing bytes in memory, the whole algorithm should take at most about one and a half times as long as copying the file, so a minute or two at most for 250 MB. - KoVadim
  • @KoVadim I thought so too: BUFFER_SIZE = 2048, an array listing the characters to replace with spaces, and for each byte I first checked whether it needed replacing and then converted it. Reading takes 30 seconds and conversion 18 hours. So it goes. There are 1,600 characters per line and 220,000 lines. I was also very surprised that it takes so long. - Senior Pomidor
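
The read-a-block, translate, write-a-block loop suggested in the comments can be sketched roughly as follows, assuming the conversion really is a one-to-one byte mapping. The class name `BlockTranscoder`, the method `transcode`, and the 64 KB block size are illustrative choices, not taken from the thread; the 256-entry table is supplied by the caller.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical helper: streams bytes through a 256-entry lookup table.
final class BlockTranscoder {
    // Assumed block size; any multiple of 4 KB should behave similarly.
    private static final int BLOCK_SIZE = 64 * 1024;

    /** Copies in to out, replacing each byte b with table[b & 0xFF]. Returns bytes copied. */
    static long transcode(InputStream in, OutputStream out, byte[] table) throws IOException {
        if (table.length != 256) {
            throw new IllegalArgumentException("table must have 256 entries");
        }
        byte[] block = new byte[BLOCK_SIZE];
        long total = 0;
        int n;
        while ((n = in.read(block)) != -1) {
            // Translate the block in place: one array lookup per byte.
            for (int i = 0; i < n; i++) {
                block[i] = table[block[i] & 0xFF];
            }
            out.write(block, 0, n);
            total += n;
        }
        return total;
    }
}
```

With a `FileInputStream` and a `BufferedOutputStream` over a `FileOutputStream`, memory use stays at one block regardless of file size. Characters that must become spaces can be written into the table once, up front, so the inner loop needs no per-byte search.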

1 answer

Here is how I implemented it. P.S. I will wait a couple of days for other answers. This approach brought the whole procedure down to 18 minutes.

    // IOUtils is presumably org.apache.commons.io.IOUtils (Apache Commons IO).
    StringTokenizer tok = null;
    try (InputStream in = new FileInputStream(fileName)) {
        // Pass the result straight to StringTokenizer so no separate object is kept in memory.
        tok = new StringTokenizer(ebcdicToAscii(IOUtils.toByteArray(in)), "\r\n");
    } catch (IOException e) {
        error("Cannot read file " + fileName);
    }
    while (tok != null && tok.hasMoreTokens()) {
        String line = tok.nextToken();
        // do your processing here
    }

    public static String ebcdicToAscii(byte[] e) {
        try {
            return new String(ebcdicToAsciiBytes(e, 0, e.length), "ISO8859_1");
        } catch (UnsupportedEncodingException var2) {
            return var2.toString();
        }
    }

    // Translates len bytes starting at offset through the lookup table.
    public static byte[] ebcdicToAsciiBytes(byte[] e, int offset, int len) {
        byte[] a = new byte[len];
        for (int i = 0; i < len; ++i) {
            a[i] = EBCDIC2ASCII[e[offset + i] & 255];
        }
        return a;
    }

    // 256-entry table: index is the EBCDIC byte, value is the ISO 8859-1 byte.
    public static final byte[] EBCDIC2ASCII = new byte[] {
        (byte)0x0, (byte)0x1, (byte)0x2, (byte)0x3, (byte)0x9C, (byte)0x9, (byte)0x86, (byte)0x7F, (byte)0x97, (byte)0x8D, (byte)0x8E, (byte)0xB, (byte)0xC, (byte)0xD, (byte)0xE, (byte)0xF,
        (byte)0x10, (byte)0x11, (byte)0x12, (byte)0x13, (byte)0x9D, (byte)0xA, (byte)0x8, (byte)0x87, (byte)0x18, (byte)0x19, (byte)0x92, (byte)0x8F, (byte)0x1C, (byte)0x1D, (byte)0x1E, (byte)0x1F,
        (byte)0x80, (byte)0x81, (byte)0x82, (byte)0x83, (byte)0x84, (byte)0x85, (byte)0x17, (byte)0x1B, (byte)0x88, (byte)0x89, (byte)0x8A, (byte)0x8B, (byte)0x8C, (byte)0x5, (byte)0x6, (byte)0x7,
        (byte)0x90, (byte)0x91, (byte)0x16, (byte)0x93, (byte)0x94, (byte)0x95, (byte)0x96, (byte)0x4, (byte)0x98, (byte)0x99, (byte)0x9A, (byte)0x9B, (byte)0x14, (byte)0x15, (byte)0x9E, (byte)0x1A,
        (byte)0x20, (byte)0xA0, (byte)0xE2, (byte)0xE4, (byte)0xE0, (byte)0xE1, (byte)0xE3, (byte)0xE5, (byte)0xE7, (byte)0xF1, (byte)0xA2, (byte)0x2E, (byte)0x3C, (byte)0x28, (byte)0x2B, (byte)0x7C,
        (byte)0x26, (byte)0xE9, (byte)0xEA, (byte)0xEB, (byte)0xE8, (byte)0xED, (byte)0xEE, (byte)0xEF, (byte)0xEC, (byte)0xDF, (byte)0x21, (byte)0x24, (byte)0x2A, (byte)0x29, (byte)0x3B, (byte)0x5E,
        (byte)0x2D, (byte)0x2F, (byte)0xC2, (byte)0xC4, (byte)0xC0, (byte)0xC1, (byte)0xC3, (byte)0xC5, (byte)0xC7, (byte)0xD1, (byte)0xA6, (byte)0x2C, (byte)0x25, (byte)0x5F, (byte)0x3E, (byte)0x3F,
        (byte)0xF8, (byte)0xC9, (byte)0xCA, (byte)0xCB, (byte)0xC8, (byte)0xCD, (byte)0xCE, (byte)0xCF, (byte)0xCC, (byte)0x60, (byte)0x3A, (byte)0x23, (byte)0x40, (byte)0x27, (byte)0x3D, (byte)0x22,
        (byte)0xD8, (byte)0x61, (byte)0x62, (byte)0x63, (byte)0x64, (byte)0x65, (byte)0x66, (byte)0x67, (byte)0x68, (byte)0x69, (byte)0xAB, (byte)0xBB, (byte)0xF0, (byte)0xFD, (byte)0xFE, (byte)0xB1,
        (byte)0xB0, (byte)0x6A, (byte)0x6B, (byte)0x6C, (byte)0x6D, (byte)0x6E, (byte)0x6F, (byte)0x70, (byte)0x71, (byte)0x72, (byte)0xAA, (byte)0xBA, (byte)0xE6, (byte)0xB8, (byte)0xC6, (byte)0xA4,
        (byte)0xB5, (byte)0x7E, (byte)0x73, (byte)0x74, (byte)0x75, (byte)0x76, (byte)0x77, (byte)0x78, (byte)0x79, (byte)0x7A, (byte)0xA1, (byte)0xBF, (byte)0xD0, (byte)0x5B, (byte)0xDE, (byte)0xAE,
        (byte)0xAC, (byte)0xA3, (byte)0xA5, (byte)0xB7, (byte)0xA9, (byte)0xA7, (byte)0xB6, (byte)0xBC, (byte)0xBD, (byte)0xBE, (byte)0xDD, (byte)0xA8, (byte)0xAF, (byte)0x5D, (byte)0xB4, (byte)0xD7,
        (byte)0x7B, (byte)0x41, (byte)0x42, (byte)0x43, (byte)0x44, (byte)0x45, (byte)0x46, (byte)0x47, (byte)0x48, (byte)0x49, (byte)0xAD, (byte)0xF4, (byte)0xF6, (byte)0xF2, (byte)0xF3, (byte)0xF5,
        (byte)0x7D, (byte)0x4A, (byte)0x4B, (byte)0x4C, (byte)0x4D, (byte)0x4E, (byte)0x4F, (byte)0x50, (byte)0x51, (byte)0x52, (byte)0xB9, (byte)0xFB, (byte)0xFC, (byte)0xF9, (byte)0xFA, (byte)0xFF,
        (byte)0x5C, (byte)0xF7, (byte)0x53, (byte)0x54, (byte)0x55, (byte)0x56, (byte)0x57, (byte)0x58, (byte)0x59, (byte)0x5A, (byte)0xB2, (byte)0xD4, (byte)0xD6, (byte)0xD2, (byte)0xD3, (byte)0xD5,
        (byte)0x30, (byte)0x31, (byte)0x32, (byte)0x33, (byte)0x34, (byte)0x35, (byte)0x36, (byte)0x37, (byte)0x38, (byte)0x39, (byte)0xB3, (byte)0xDB, (byte)0xDC, (byte)0xD9, (byte)0xDA, (byte)0x9F
    };
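
If the records are fixed-length, as the 1,600-character lines mentioned in the comments suggest, the same table lookup can probably be applied one record at a time, so the whole file never has to sit in memory. This is only a rough sketch under that assumption: `FixedRecordReader`, `forEachRecord`, and the `handler` callback are hypothetical names, and the table is passed in by the caller (for example `EBCDIC2ASCII` from the code above).

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.function.Consumer;

// Hypothetical helper: converts and delivers one fixed-length record at a time.
final class FixedRecordReader {

    /** Reads records of recordLength bytes, translates them, and hands each to handler. */
    static int forEachRecord(InputStream in, int recordLength, byte[] table,
                             Consumer<String> handler) throws IOException {
        if (recordLength <= 0 || table.length != 256) {
            throw new IllegalArgumentException("need a positive record length and a 256-entry table");
        }
        byte[] record = new byte[recordLength];
        int count = 0;
        int filled;
        while ((filled = fill(in, record)) > 0) {
            for (int i = 0; i < filled; i++) {
                record[i] = table[record[i] & 0xFF];
            }
            // The last record may be shorter if the file size is not a multiple of recordLength.
            handler.accept(new String(record, 0, filled, StandardCharsets.ISO_8859_1));
            count++;
        }
        return count;
    }

    // Reads until the buffer is full or the stream ends; returns the number of bytes read.
    private static int fill(InputStream in, byte[] buf) throws IOException {
        int off = 0;
        while (off < buf.length) {
            int n = in.read(buf, off, buf.length - off);
            if (n == -1) {
                break;
            }
            off += n;
        }
        return off;
    }
}
```

Wrapping the file stream in a `BufferedInputStream` keeps the number of system calls low. Only one record and one `String` are live at any moment, which is the main difference from reading the whole file with `IOUtils.toByteArray`.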