At first glance the task does not seem difficult: count the number of unique absolute values in a source container. The first thing that came to mind was to dump everything, as absolute values, into a set and look at its size:

    #include <cstdlib>   // abs
    #include <iostream>
    #include <set>
    #include <vector>

    size_t getUniqueByModuleItemCount(const std::vector<int>& input)
    {
        std::set<unsigned> unique;
        for (int item : input) {
            unique.insert(abs(item));
        }
        return unique.size();
    }

    int main()
    {
        std::vector<int> srcVec = {1, -4, 3, 2, -12, 0, 4, 17, -2, -2, -3, 13, 7};
        size_t uniqueItemCount = getUniqueByModuleItemCount(srcVec);
        std::cout << "Unique items count by absolute value is " << uniqueItemCount << std::endl;
    }

This works, but the complexity of this algorithm is obviously O(N*lnN): a pass over all elements of the source container with an insert into the tree, each insert costing lnN. Hence the question: are there any ideas on how to reduce the complexity of this algorithm to O(N)?

  • use unordered_set? - pavel (see the sketch after these comments)
  • Everything ingenious is simple. Thanks, it never occurred to me - Malov Vladimir
  • I think that even though it is still O(N*lnN), in terms of actual running time, sorting a vector + unique will be faster... - Harry
  • @Harry thanks, I will try all the different options this evening on different data sets, including the original one - Malov Vladimir
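
To make the unordered_set suggestion from the comments concrete, here is a minimal sketch (the function name is hypothetical, not from the question). Insertion into a hash set is amortized O(1) on average, so the whole pass is O(N) on average, although the worst case is still worse than that:

    #include <cstdlib>        // abs
    #include <unordered_set>
    #include <vector>

    // Hypothetical name; same idea as the original function, but with a hash set.
    size_t getUniqueByModuleItemCountHash(const std::vector<int>& input)
    {
        std::unordered_set<unsigned> unique;
        unique.reserve(input.size());   // avoid rehashing during the inserts
        for (int item : input) {
            unique.insert(abs(item));
        }
        return unique.size();
    }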

1 answer

I did a little experiment here ...

I randomly created a vector of N numbers with positive and negative values and counted the unique absolute values using different containers:

    #include <algorithm>      // sort, unique
    #include <cstdlib>        // abs
    #include <set>
    #include <unordered_set>
    #include <vector>

    size_t usingSet(const std::vector<int>& input)
    {
        std::set<unsigned> unique;
        for (int item : input) {
            unique.insert(abs(item));
        }
        return unique.size();
    }

    size_t usingHash(const std::vector<int>& input)
    {
        std::unordered_set<unsigned> unique;
        unique.reserve(input.size());
        for (int item : input) {
            unique.insert(abs(item));
        }
        return unique.size();
    }

    size_t usingVec(const std::vector<int>& input)
    {
        std::vector<unsigned> uniq;
        uniq.reserve(input.size());
        for (int item : input) {
            uniq.push_back(abs(item));
        }
        std::sort(uniq.begin(), uniq.end());
        return std::unique(uniq.begin(), uniq.end()) - uniq.begin();
    }

Each run counts the unique elements 1000 times. Times are in microseconds; approximate, of course, but quite suitable for comparison. I did not round them; it is easy enough to convert to seconds or milliseconds in your head.
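
The actual harness is only linked in the comments below (it uses muTimer on top of QueryPerformanceCounter); as a rough, hypothetical sketch of the same idea using std::chrono and made-up data generation, it could look like this:

    // Hypothetical re-creation of the measurement loop, not the author's code
    // (that is only available via the link in the comments). std::chrono is used
    // here instead of muTimer / QueryPerformanceCounter, and the data generation
    // below is a guess.
    #include <chrono>
    #include <cstdlib>
    #include <iostream>
    #include <vector>

    size_t usingSet(const std::vector<int>& input);   // defined above
    // likewise usingHash / usingVec

    int main()
    {
        const int N = 100000;          // vary N as in the tables below
        const int repetitions = 1000;

        std::vector<int> src;
        src.reserve(N);
        for (int i = 0; i < N; ++i)
            src.push_back(rand() * (i % 2 ? 1 : -1));   // positive and negative values

        volatile size_t sink = 0;      // keeps the calls from being optimized away
        auto start = std::chrono::steady_clock::now();
        for (int r = 0; r < repetitions; ++r)
            sink = sink + usingSet(src);                // likewise usingHash / usingVec
        auto stop = std::chrono::steady_clock::now();

        std::cout << std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count()
                  << " us total for " << repetitions << " runs\n";
    }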

                  Set          Hash         Vector
  -----------------------------------------------------------
  N = 10:         790          1256         253
  N = 100:        8304         8243         1192
  N = 1000:       130070       78354        27573
  N = 10000:      1543452      757059       498093
  N = 100000:     14545110     4232742      6182957
  N = 1000000:    127603400    27039847     61045621

So the vector starts losing to the hash somewhere between 10 and 100 thousand elements. At small sizes the hash even loses to set, only drawing level at around 100 elements. Beyond that, set hopelessly loses to everyone - as I understand it, not only because of O(N*ln N), but also because of the large number of memory allocations (it has no reserve() member, unlike the hash and the vector).

I did not come up with a simple way to measure memory consumption, but it is clear that there the vector beats everyone hands down :)

In short, the saddest option is set; it makes no sense to use it at all. For small sizes - up to tens of thousands - it is better to take a vector, and beyond that look at what matters more: speed or memory.

All this is with Visual C++ 2015, 32-bit. With Visual C++ 2010, 32-bit, the advantages of the vector are even more pronounced :) and the hash loses across the board:

                  Set          Hash         Vector
  -----------------------------------------------------------
  N = 10:         813          1459         223
  N = 100:        8688         10795        1448
  N = 1000:       135905       102716       29086
  N = 10000:      1623786      1068878      519483
  N = 100000:     18334735     8822607      6316194
  N = 1000000:    179540082    83667899     61899536

Anyone who wants to repeat this for other compilers is welcome...

By the way, the vector version can be optimized further. Since we only need the number of unique elements, we can skip unique() with its shuffling of elements in memory and simply walk the sorted data, counting the elements that differ from their neighbour.
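
As a rough sketch of that tweak (the function name is mine, not from the answer), counting distinct values directly in the sorted data instead of calling unique():

    #include <algorithm>      // sort
    #include <cstdlib>        // abs
    #include <vector>

    // Hypothetical variant: count distinct values in the sorted data directly,
    // without std::unique() moving elements around.
    size_t usingVecCountOnly(const std::vector<int>& input)
    {
        std::vector<unsigned> v;
        v.reserve(input.size());
        for (int item : input)
            v.push_back(abs(item));
        std::sort(v.begin(), v.end());

        size_t count = 0;
        for (size_t i = 0; i < v.size(); ++i)
            if (i == 0 || v[i] != v[i - 1])   // element differs from its left neighbour
                ++count;
        return count;
    }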

Update. I assumed that the number of duplicates was small; I was corrected in the comments. I still could not bring myself to make the number of unique elements tiny, but I filled the array like this:

 for(int i = 0; i < N; ++i) src.push_back((rand()%(N/5))*(i%2 ? 1 : -1)); 

That is, there are plenty of duplicates, but the number of unique elements still grows as O(N).

Results for 2015 (I did not wait for the million-element run to finish):

                  Set          Hash         Vector
  -----------------------------------------------------------
  N = 10:         279          760          229
  N = 100:        2693         2785         1185
  N = 1000:       53087        24940        26121
  N = 10000:      806965       273205       472123
  N = 100000:     12350458     3440321      6006227

The boundary has shifted down somewhat, but the principle remains: set is out of the race, although it still beats the hash on arrays of up to about 100 elements; and the hash takes the lead starting at around 1,000 elements rather than 10,000.

In the limiting case, when all N elements are equal to one, the results are almost paradoxical:

                  Set          Hash         Vector
  -----------------------------------------------------------
  N = 10:         200          895          219
  N = 100:        888          1840         493
  N = 1000:       7676         10020        2778
  N = 10000:      73524        102685       25881
  N = 100000:     733058       1167238      259100
  N = 1000000:    7396809      11645557     3561249

Vector, and only vector! The hash in this case loses even to set.

That's it, I have no more time for experiments; whoever wants to can continue :)

Update 2. It was not for nothing that Guntheroth wrote in his "Optimized C++" that unordered_set, although it does have an effect, gives nowhere near the effect you would expect after reading how much it is praised :) A quote from the book: the hash table std::unordered_map is faster than std::map, but not by the order of magnitude its reputation suggests.

  • Of course, no measurement is even needed. For any tree, the memory cost is at least 2x that of an array (and usually 4x for dynamic trees). For hash tables the memory cost depends on the load factor (I don't remember the default, 30% or so). And you can check it roughly through the task manager or by logging calls to new and new[] - pavel (a sketch of this follows after these comments)
  • But something is clearly off with the 2010 numbers... Increasing N by a factor of 10 cannot increase the running time by less than a factor of 10... Looks like WoW64 issues... Your opinion would be interesting. By the way, how did you generate the source array? - pavel
  • @pavel Here's the code: vpaste.net/jKv2n - muTimer just uses QueryPerformanceCounter and QueryPerformanceFrequency. For 2010 I had to rewrite the loops with iterators, like for(auto item = input.begin(), end = input.end(); item != end; ++item) { unique.insert(abs(*item)); }, and remove the reserve() call from unordered_set. - Harry
  • @pavel As for why it is so - well, maybe less memory reallocation? I am not Guntheroth (the author of "Optimized C++"), I have no such experience and will not try to explain :) Like the Chukchi in the joke: what I see is what I sing about... - Harry
  • I mean the dynamic memory allocation for every set node. And the deallocation too. And all of these are, alas, genuinely expensive operations... Plus there may (or may not) be the effect of worse cache locality. - Harry
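
Regarding the "log the calls to new and new[]" idea from the comments above: a simplified, hypothetical sketch of such instrumentation (not from the thread) could replace the global allocation functions and accumulate a byte counter:

    // Simplified illustration of counting allocations by replacing the global
    // operator new / new[]. Counts requested bytes only; not thread-safe and
    // not production-grade instrumentation.
    #include <cstddef>
    #include <cstdio>
    #include <cstdlib>
    #include <new>

    static size_t g_allocatedBytes  = 0;
    static size_t g_allocationCount = 0;

    void* operator new(std::size_t size)
    {
        if (size == 0) size = 1;
        g_allocatedBytes += size;
        ++g_allocationCount;
        if (void* p = std::malloc(size)) return p;
        throw std::bad_alloc();
    }

    void* operator new[](std::size_t size)
    {
        return operator new(size);   // route array allocations through the same counter
    }

    void operator delete(void* p) noexcept   { std::free(p); }
    void operator delete[](void* p) noexcept { std::free(p); }

    // After running, e.g., usingSet() vs usingVec(), print and compare:
    //   std::printf("%zu bytes in %zu allocations\n", g_allocatedBytes, g_allocationCount);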