There are very many lines of a given length that are in the file.

Before you start searching in a file, you can do anything with it (for example, sort it), you can also use additional memory.

After all manipulations with the file in it, you need to find (or not find, depending on the availability) the line as soon as possible.

What are the search options? :)

  • What does "given length" mean? Are all the lines in the file exactly the same length? - AnT
  • one
    Build a hash for each line in the file and you're done. - AnT

3 answers 3

Use ready-made algorithms, such as Aho - Korasik a.

In short, here's an idea. Let there be lines abc, abd dce. It can be seen that the first two lines have the same beginning. Therefore, it makes sense to look only for either a or d. And only when one of these characters finds, check b or c, respectively. If the number of lines is more than the number of letters, then the output is already obvious.

Second idea. If there are many lines, then there is one more “trick”. If the lines are longer than 4 bytes (characters), then it considers for them just the sum of the first 4 bytes (or recall the math and invent your own hash). And then just run through the file and count the sum of the next 4 bytes. since the sum of 4 bytes will not be more than 1024, then check that the current 4 byte sequence is in the list very quickly. The sum itself can always be calculated according to the principle of adding the current byte, subtracting the byte from 4 (or 5, depending on how you count) positions back. In principle, you can play a long or function (well, xor can come up) and get a minimum of false positives. Of course, when the beginning coincided, you need to check the entire line.

  • The Aho-Korasik algorithm is designed to simultaneously search for several lines and requires the preprocessing of the required lines. In the statement of the problem, there is not a word that the required lines will arrive in groups. And preprocessing is only for browsing lines. Therefore, from which side Aho-Korasik is here - I don’t see something point-blank. - AnT
  • You can expand the condition. - KoVadim

Sorry, not enough reputation for comment, so I will answer. If you can do "anything" before the search, then what prevents to make a list of hashes of lines of the file? Then the search will be to calculate the hash of the search string and compare it with the list.

You can sort the rows. Then the search will be on this algorithm:

  1. N = 1, M = 1

  2. Compare the character N of the search line with the character N of the line M. If N = N + 1 are equal, repeat step 2, otherwise step 3.

  3. M = M + 1 (go to the next line), compare the characters 1..N of the required line and the line M. If we are equal, repeat step 2, otherwise there is no required line (we interrupt the search).

Add: So as not to compare the initial characters in each line, after sorting, you can attach to each line a number equal to the length of the same number of characters from the beginning of the line. Then in step 3 we check that N-1 <= K (the length of the same segment with the previous line). If N-1> K, then there is no search string.

Example:

0-abvgd (there is no identical section with the previous one)

3-abway (first three characters match)

0-bbbav (no matches)

0-way (no match)

4-dyet (4 first characters match)

Search: "abvgi" In the second step we get to 5 characters. at the third step, N-1> 3, so there is no search string.

  • one
    and how it will help? Manipulations will be carried out over lines, but not over hashes? - SkiF
  • one
    Not carefully read the question. Lines of the same length are given. I understand correctly, all the lines of the file and the desired line of the same length? Those. we are not looking for a substring, but the entire string? - dr. FIN
  • @ dr-fin yes, right. Looking for a string. - gistrec
  • @ dr-fin Nuuu, somehow did so :) - gistrec

Hash to a hash table, in which, in addition to the hash, to store the position of the beginning of the line in the file. Maybe some other digest like sha256 :)

Then you just calculate the hash. In case of collisions, use this digest, and get the position in the file in the end.