Acceleration of work with strings

Question

There is an array (1-10 elements) consisting of long (up to 10,000 characters) immutable strings. There is also one (the same long) line that is actively changing. Strings from an array are very actively compared to a variable string.

Added inline to functions, added register to the variable line, the program still runs slowly. The array of strings is implemented as string *, the other string is just string.

Tell me please, how can I speed up the execution of the program? Will string.erase work faster with string.substr? Or is it generally better to use classic char strings instead of string strings?

Thanks in advance!

UPD: added some code

the min-str string does not change (just stores the value for sub-str) (the '_' is not inserted into the message)

sub_str = min_str.substr(j, i);

After the sub-str gets the value, the function is checked.

 if (is_inside(sub_str, str, amount))

And this is, in fact, the function itself, which checks whether the substring is included in all the strings of a given array.

 inline bool is_inside(string sub_str, string *str, int n) { int m = 0; for (int i = 0; i < n; i++) { if ( str[i].find(sub_str) != -1 ) { m++ ;} } if (m == n) { return true; } else { return false; } }

I just realized that the program should work faster if I pass the values to the function by reference (or pointer). Any more suggestions?

if inside the inline function there is a cycle, then it automatically becomes normal
Show the code, how you manipulate the string, maybe you can add it in a cycle character-by-character, which will be very slow.
But you can obviously immediately allocate a little space - it will be faster.
Maybe you need to rope or even a coherent list of characters?
Do not try to win in speed with small tips to the compiler, compilers today are not smarter than you and me.
It is necessary to optimize the algorithm and data structures.

Accepted Answer · 2013-02-20T03:19:36

Remarks on the code:
1. Pass the string by reference.
2. Pre-increment is faster post-increment
3. Search not all entries, but the first
4. It is correct to compare with std :: string :: npos, and not with -1

Only these micro-optimizations will not help much, because a naive and inefficient algorithm is optimized with complexity O (k * N * N) where k is the number of array elements, N is the maximum string length.
Replace std :: string :: find with fast ILC or Boyer-Moore and get O (k * N)
Aho-Korasik is not suitable, because solves the inverse problem.

Ps. since only the required string changes, you can think of a specialized algorithm based on the prefix function.

@Dith: <bother mode on> about 2. I would still notice that (1) for POD types, preincrement and postincrement are the same, and (2) a good optimizer (gcc, MSVC, Clang, Intel Compiler, Comeau) will notice that the result is not needed, and will produce the same code. <bore mode off>
@Dith: but +1 for micro-optimizations and analysis of the complexity of the algorithm.
Thank you, I also thought that Aho-Korasik solves a slightly different problem.
And about 4 points - it was necessary, gritting my teeth, to write -1, because npos in g ++ did not compile (in MSVS it was compiled)
@VladD The moment is controversial, so as not to breed holivars, I refer to the most authoritative source: Sutter / Alexandrescu - C ++ Programming Standards, advice 9 and 28.
@Dith, I don’t know what the authority on pre / post increments meant, but as you would expect, the author’s function code is that with i ++, m ++, that c ++ i, ++ m translated g ++ -S is the same.

Community spirit ♦ one · Answer 2 · 2013-02-19T17:39:36

For starters, you can simplify the comparison code to this:

 bool is_inside(const string& sub_str, string str[], int n) { for (int i = 0; i < n; i++) { if ( str[i].find(sub_str) == -1 ) return false; } return true; }

Because if you already found a mismatch, then you can not check.

Then, calculate the substring - unnecessary copying data. Pass an array of characters (by reference, of course), and two indices:

 char* min_str_data = min_str.c_str(); // вычисляется один раз if (is_inside(min_str_data, j, i, str, amount)) // ... bool is_inside( const char* whole_str, int start_idx, int len, string str[], int n) { for (int i = 0; i < n; i++) { if (str[i].find(whole_str, j, i) == -1) return false; } return true; }

So it should be a little sooner.

In addition, it is no longer necessary to write inline for a long time: the meaning of this keyword is not exactly what you think.

The opposite method will actually work much faster. What about strings and charms?
In if (str [i] .find (whole_str, j, i) == -1), -1 must be replaced by string :: npos.
Replacing string with char [] will most likely not result in a win.
Strictly speaking, it, of course, depends on the implementation of string, so I would stupidly implement and measure speed.
npos Taran: TS wrote that he had problems with npos , a -1 works.
My replacement is not for acceleration, but in order to avoid unnecessary copying of data: substr creates a copy of the data, but we do not need it.
And without substr one can't do: std::string::find does not have an overload with std::string , initial index and length.

VadimTukaev VadimTukaev 1,973 eight 14 · Answer 3 · 2014-04-21T09:19:35

Calculate hash for strings. Obviously, for static strings, this must be done once. I recommend the FNV-1a algorithm, even its 32-bit version is very good and very rarely gives collisions. For reinsurance, you can take the 64-bit version, but in general there is a 1024-bit.

 unsigned long long fnv64_hash(const char* s) { unsigned long long h = 14695981039346656037LL; while( *s ) { h ^= (unsigned long long)*s++; h *= 1099511628211LL; } return h; }

VadimTukaev

1,973 eight 14

And how to attach it to the question of the author? Calculate fnv for all possible substrings in an array of immutable strings? - If we are really serious about speeding up the search in the described task, then we probably need to make an index of pairs of characters (in the worst case, 64K pieces per line) addressing the positions of such pairs in the string for each unchanging string (in which substrings are searched for). And instead of .find substrings for the entire line, look at the strncmp of the required substring only at these points. If the distribution of characters in the lines is more or less uniform, then for lines of 10,000 characters (IMHO) everything will just fly ... - avp
I quote the top-starter: There is an array (1-10 elements) consisting of long (up to 10,000 characters) immutable strings. There is also one (the same long) line that is actively changing. Strings from an array are very actively compared to a variable string. Change the string, calculate its hash. If the hash coincided with one of the hashes for an unchanged string, then the changeable string matches it (99.9999999%, but for complete certainty you can do strcmp). Maybe I misunderstood something? Maybe I'm here is one such fool who literally understands what he sees, while others think out, as if playing charades - VadimTukaev
one
@VadimTukaev, answering old questions (from February of the 13th year), it is advisable to read more and already made answers and comments. Usually there is a lot of information that the author initially did not display in the first lines of his question. For example, inline bool is_inside (string sub_str, string * str, int n) {int m = 0; for (int i = 0; i <n; i ++) {if (str [i] .find (sub_str)! = -1) {m ++;} ...., etc., around which, in fact, was the main part of the discussion here. - avp
These are the author's problems, aren't they? I gave the correct answer to the question that he asked. If he has another question, let him ask another question. hashcode.ru/questions/18428/… - VadimTukaev

|

Acceleration of work with strings

3 answers 3

More articles: