Hello. I have an urgent task, and Googling hasn't really gotten me anywhere. I have a text file that I load into a HashMap via a BufferedReader using this method:

    try {
        AssetManager assetManager = context.getAssets();
        InputStreamReader istream = new InputStreamReader(assetManager.open("dictionary.txt"));
        BufferedReader in = new BufferedReader(istream);
        String word;
        while ((word = in.readLine()) != null) {
            dictionary.put(word, 0);
        }
        in.close();
    } catch (FileNotFoundException e) {
        // FileNotFoundException
    } catch (IOException e) {
        // IOException
    }

Tell me how to serialize this, so that the HashMap doesn't have to be rebuilt from the text file every time. And then, how do I pull strings back out of it? The file is a dictionary of English words, and my program compares strings against it and corrects them when needed; that is, it's a spell checker. Thanks.
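For the serialization part of the question, here is a minimal sketch of what this could look like: build the HashMap once, write it out with ObjectOutputStream, and on later starts read the binary form back instead of re-parsing dictionary.txt. The class and file names here are illustrative, not from the question.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.HashMap;

// Sketch: cache the parsed dictionary in serialized form.
public class DictCache {

    // Write the map out once, after the first parse of dictionary.txt.
    public static void save(HashMap<String, Integer> dict, File cache) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(
                new BufferedOutputStream(new FileOutputStream(cache)))) {
            out.writeObject(dict);
        }
    }

    // On later starts, read the binary form back instead of re-parsing.
    @SuppressWarnings("unchecked")
    public static HashMap<String, Integer> load(File cache)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                new BufferedInputStream(new FileInputStream(cache)))) {
            return (HashMap<String, Integer>) in.readObject();
        }
    }
}
```

Note that deserializing a large HashMap still reconstructs every entry, so the speedup over re-parsing the text file is not guaranteed to be large.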

  • And what if all of this is driven into a database, so that at startup you don't load the words into memory at all? A query against the database will probably be slower than a lookup in an in-memory dictionary, but not by enough for the user to notice. Otherwise, you will have to design your own class structure and your own file format to guarantee a fast start. - KoVadim
  • @bboybboy From the question (especially the part about the urgent requirement) I didn't understand what you want to do, or why you put the words into a HashMap when the value is always 0. What do you want to serialize, and why? - a_gura
  • I have a dictionary of English words in a *.txt file, one word per line, which I load into a HashMap. The program is a spell checker. When a string is passed to the checking method, it is looked up in the HashMap. If it is absent, the word is corrected by generating all plausible candidate variants, each of which is checked in turn as it is built. But the check takes a long time as it is, and if I put everything into an SQLite database and do the checking with queries, it would take even longer. So I think serialization is my way out of this situation. - bboybboy
  • @bboybboy You didn't understand the question. Why do you use a HashMap if your key is a word and the value is always 0? Are you going to change the value? If not, a HashSet is enough. Secondly, it is still not clear what you want to serialize. The ready-made HashMap? - a_gura

1 answer

I will offer my version of how I would optimize this. It may seem wild, but when you need speed, sometimes you have to sacrifice beauty. (The numbers below are real: I found an English word list on Google with 349,900 words, which is enough for my purposes.)

First you need to prepare the file. (I assume the file consists of single words.) The very first step is to sort the word list. Then look at the longest word; most likely it is no longer than 16 characters. Now pad each line with spaces up to that length (in practice you pad to 15, or even 14, since there are also line-break characters; in any case, check in a hex editor that each word starts at an offset that is a multiple of 16).
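A sketch of this preparation step (class and method names are mine, not from the answer): sort the word list, then pad each word with spaces to a fixed record width of 16 characters, counting the trailing '\n' as the 16th, so that word number i always starts at offset i * 16 in the file.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch: turn a word list into sorted, fixed-width 16-char records.
public class DictPrep {
    public static final int RECORD = 16;          // total record width
    public static final int PAYLOAD = RECORD - 1; // word + space padding, before '\n'

    public static String pad(List<String> words) {
        List<String> sorted = new ArrayList<>(words);
        Collections.sort(sorted);
        StringBuilder sb = new StringBuilder(sorted.size() * RECORD);
        for (String w : sorted) {
            if (w.length() > PAYLOAD)
                throw new IllegalArgumentException("word too long: " + w);
            sb.append(w);
            for (int i = w.length(); i < PAYLOAD; i++) sb.append(' ');
            sb.append('\n');
        }
        return sb.toString();
    }

    public static void writeTo(List<String> words, Path out) throws IOException {
        Files.write(out, pad(words).getBytes(StandardCharsets.US_ASCII));
    }
}
```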

Why 16? Simply because this number is a power of two, which has a positive impact on performance. Maybe 32 or 64 would suit you instead, but 64 is already a lot.

The file will of course become larger (maybe even twice as large).

Next, loading. In the code we create a char array of the required size (from the file size, or number of words * 16) and load the dictionary into it in one step. This is much faster than inserting one word at a time. (Inserting words into a dictionary one by one will most likely cost logarithmic time per word, while this method is almost linear overall.)
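The one-step load might look like this (a sketch; on Android the Reader would wrap AssetManager.open, and the total character count comes from the known file size):

```java
import java.io.IOException;
import java.io.Reader;

// Sketch: read the whole padded dictionary into a single char[] in one
// pass, instead of readLine()-ing and inserting word by word.
public class DictLoad {
    public static char[] loadAll(Reader r, int totalChars) throws IOException {
        char[] buf = new char[totalChars];
        int off = 0;
        // Reader.read may return fewer chars than requested, so loop.
        while (off < totalChars) {
            int n = r.read(buf, off, totalChars - off);
            if (n < 0) break; // end of stream
            off += n;
        }
        return buf;
    }
}
```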

Now we need to solve two problems: finding a word by its index, and finding a word in the dictionary. The first is solved simply: if we know the index, we multiply it by 16 and that is the offset into the char array. Copy the 16 characters and strip the trailing spaces.
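The index-to-word step as code (a sketch under the 16-char-record assumption above):

```java
// Sketch: recover word number i from the big char array. Multiply the
// index by the record width, then strip trailing padding and newline.
public class DictRecord {
    public static final int RECORD = 16;

    public static String wordAt(char[] data, int index) {
        int start = index * RECORD;
        int end = start + RECORD;
        while (end > start && (data[end - 1] == ' ' || data[end - 1] == '\n'))
            end--;
        return new String(data, start, end - start);
    }
}
```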

Finding a word in the dictionary is also not difficult. At first glance it may seem that you have to scan the entire list (which is slow), but I did not insist on sorting for nothing: on a sorted list you can use binary search. And since this is a dictionary, you can use an even more cunning trick: an index. Such a file can be prepared in advance; run through the entire word list and write out the line numbers at which the first two letters change. For English this gives at most 676 (26 * 26) lines of the form

 aa 0 ab 82 ac 1216 ad 2904 ... zz 349895 

(In fact there will be only 635 lines; 41 combinations do not occur.) The most popular combinations, in increasing order, are ca, re, ma, co, un. For Russian it would be a little more: 33 * 33 = 1089, although I can't recall any words with combinations like ъъ; still, you shouldn't exclude them from the index. The search becomes much faster, since we immediately narrow it down to a small region. (I found that the largest section is un, with 11,780 words; 115 combinations have more than 1,000 words.)
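A sketch of the lookup side (names are mine): build the two-letter index mapping each prefix to its first record number, then run a binary search over the sorted fixed-width records inside that narrow range. The comparison works directly on the char array, without allocating a String per probe, and assumes every word fits in one record.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: two-letter prefix index plus binary search over fixed-width records.
public class DictSearch {
    public static final int RECORD = 16;

    // First record number for every two-letter prefix that occurs.
    public static Map<String, Integer> buildIndex(char[] data) {
        Map<String, Integer> index = new HashMap<>();
        int records = data.length / RECORD;
        for (int i = 0; i < records; i++) {
            index.putIfAbsent(new String(data, i * RECORD, 2), i);
        }
        return index;
    }

    // Binary search for word within records [lo, hi).
    public static boolean contains(char[] data, int lo, int hi, String word) {
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            int cmp = compareAt(data, mid, word);
            if (cmp == 0) return true;
            if (cmp < 0) lo = mid + 1;
            else hi = mid;
        }
        return false;
    }

    // Compare record rec against word; assumes word.length() <= RECORD - 1.
    private static int compareAt(char[] data, int rec, String word) {
        int base = rec * RECORD;
        for (int i = 0; i < word.length(); i++) {
            int d = data[base + i] - word.charAt(i);
            if (d != 0) return d;
        }
        char next = data[base + word.length()];
        return (next == ' ' || next == '\n') ? 0 : 1; // padding means exact match
    }
}
```

In a real lookup you would take the record range for the word's first two letters from the index and pass it as [lo, hi) to contains().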

Next, optimizations. If you look at the statistics, quite a lot of words are shorter than 8 characters: about 40 percent. So you can split the words into two groups: those that fit into 8 characters and those that fit into 16 (in principle there could also be a group of 32, but I have not found such words :)). The index will then also need to store the group number.
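The split itself is a one-pass partition; a minimal sketch (the width constants follow the answer's 8/16 scheme, the rest is my illustration):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch: partition words into an 8-char-record group (7 chars + '\n')
// and a 16-char-record group, so short words waste less padding.
public class DictGroups {
    public static final int[] WIDTHS = {8, 16};

    public static int groupFor(String word) {
        return word.length() <= WIDTHS[0] - 1 ? 0 : 1;
    }

    public static List<List<String>> split(List<String> words) {
        List<List<String>> groups =
                Arrays.asList(new ArrayList<>(), new ArrayList<>());
        for (String w : words) groups.get(groupFor(w)).add(w);
        return groups;
    }
}
```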

How to implement this in code I will leave as an exercise.

  • I would rather suggest using a compressed prefix tree. True, the words would not be addressable by index, but the original problem has no such requirement. Lookup would have logarithmic complexity, and the data would be stored quite compactly. - a_gura
  • Its complexity may be logarithmic, but that does not mean the lookup will actually be faster. Besides, my implementation specifically sets itself the goal of a fast load. - KoVadim