On C # there is a surprisingly fast dictionary (Dictionary), I would like to know, is there such a performance only in C ++? I tried unordered_map , hash_map , map , but the performance is several times lower than that of the Sisharpov dictionary ... PS: The pair is <unsigned int, unsigned int>

  • > but the performance is several times lower than that of the Sisharpovsky Dictionary. It seems to me that you are confusing something. - falstaf
  • 17
    Code and results of measurements in the studio. Otherwise, stuffing. - Dith
  • one
    @Dith: since the default allocation in C ++ is slower than in .NET (heap lock and tacoe), the results of the TS do not seem to me so suspicious. But yes, I would like numbers. - VladD
  • 2
    @VladD typical artificial test that does not prove anything. In the example for sisharp memory is not actually allocated. It is worth saying thanks for the links, they will be useful here. - Dith
  • 2
    @Dith, could you provide an example of a more correct code? - VDIGIT

4 answers 4

In fact, language comparison is ungrateful. There will always be tests in which one of the languages ​​will win compared to another, and there will always be people who believe that this test is not relevant and such code will never occur in reality.

However, I would not say that the results of the TS are very unexpected: in .NET, memory allocation usually takes place faster than in native languages ​​without a custom allocator. And small tests usually load the allocator much more than, say, function call mechanisms.

The reason for such a difference in the performance of the allocator is that it is impossible to move C ++ objects in memory, which means that the usual memory allocation algorithm (which, as is well known, maintains a list of free blocks and looks for a suitable allocation) runs rather slowly, and, worse, requires a global memory lock (which further worsens the situation in a multithreaded scenario). In addition, objects in C ++ tend to be released as quickly as possible, which leads to an additional load on the release of memory, which also requires a global lock.

In the .NET environment, everything happens differently. Objects always stand out at the top of the heap memory, which means that the selection is no slower than InterlockedIncrement . .NET does not need to maintain a list of free blocks because heap memory is compacted during garbage collection: objects move, filling in "holes".

In addition, the news that C ++ code may well be slower than similar C # code is not new. Here, for example, is a wonderful series of articles about simple applications from masters of native programming, and Jeff Atwood's resume :

To get around the performance of the C # version, Raymond had to write his own I / O procedures, rewrite the string class, use a custom allocator, as well as his own code page display procedure.

This is confirmed by the benchmark, which is shown below: native containers out of the box essentially lose to dotnet, (some) self-written containers win.

Now the most interesting: measurement.

C #:

 using System; using System.Collections.Generic; using System.Diagnostics; namespace Sharp { class Program { static void Main(string[] args) { var dict = new Dictionary<int, int>(); int seed = 1; var timer = new Stopwatch(); timer.Start(); for (int i = 0; i < 10000000; i++) { seed = 1664525 * seed + 1013904223; dict.Add(seed, i); } timer.Stop(); Console.WriteLine( "elapsed time = {0} ms, dictionary entries count = {1}", timer.ElapsedMilliseconds, dict.Count); } } } 

C ++:

 #include "stdafx.h" #include <ctime> #include <map> #include <iostream> using namespace std; int main(int argc, char* argv[]) { map<int, int> dict; int seed = 1; auto begin = clock(); for (int i = 0; i < 10000000; i++) { seed = 1664525 * seed + 1013904223; dict.insert(make_pair(seed, i)); } auto end = clock(); double elapsedMs = double(end - begin) * 1000.0 / CLOCKS_PER_SEC; cout << "elapsed time = " << elapsedMs << " ms, dictionary entries count = " << dict.size() << endl; return 0; } 

Measurement results (release mode, 5 launches in a row without a debugger):

C #

elapsed time = 1138 ms, dictionary entries count = 10,000,000
elapsed time = 1127 ms, dictionary entries count = 10000000
elapsed time = 1133 ms, dictionary entries count = 10000000
elapsed time = 1134 ms, dictionary entries count = 10,000,000
elapsed time = 1129 ms, dictionary entries count = 10,000,000

C ++

elapsed time = 8377 ms, dictionary entries count = 10,000,000
elapsed time = 8408 ms, dictionary entries count = 10000000
elapsed time = 8377 ms, dictionary entries count = 10,000,000
elapsed time = 8377 ms, dictionary entries count = 10,000,000
elapsed time = 8361 ms, dictionary entries count = 10000000

Average time: C # = 1132 ms, C ++ = 8379 ms.

I do not claim that my tests are perfect. In addition, they are relevant only on my computer. If someone offers the best measurement technique, I will gladly apply it too. However, in my conditions, the System.Collections.Generic.Dictionary to add elements significantly exceeds the performance of std::map .

Notice that Dictionary uses hash tables, while std::map in my implementation uses a red-black tree as the carrier data structure. Hash tables are usually faster on their own, so the allocation speed is not the only reason for a better dictionary speed.

Replacing in C ++ make_pair(seed, i) with pair<int, int>(seed, i) according to @igumnov advice did not lead to a big difference: 8361/8392/8361/8408/8361/8345.

Replacing std::map with std::unordered_map in C ++ on the advice of @ Kitty led to significant acceleration: 2230/2230/2230/2230/2246 (average 2233). However, C ++ is still almost twice as slow.

Replaced in C ++ std::unordered_map with uthash on the advice of @ igumnov . The result is slightly worse than std::unordered_map : 2963/2932/2948/2948/2932. Code:

 void testUhash() { struct myint { int key; int value; UT_hash_handle hh; }; struct myint* dict = NULL; int seed = 1; auto begin = clock(); for (int i = 0; i < 10000000; i++) { seed = 1664525 * seed + 1013904223; struct myint* ps = (struct myint*)malloc(sizeof(struct myint)); ps->key = seed; ps->value = i; HASH_ADD_INT(dict, key, ps); } auto end = clock(); double elapsedMs = double(end - begin) * 1000.0 / CLOCKS_PER_SEC; cout << "elapsed time = " << elapsedMs << " ms, dictionary entries count = " << HASH_COUNT(dict) << endl; } 

Added capacity = 10000000 in C ++ and for fair comparison in C # too. Changes:

C ++:

 unordered_map<int, int> dict(10000000); 

C #:

 var dict = new Dictionary<int, int>(10000000); 

Indeed, it became sooner:

  • C ++: 1826/1856/1857/1841/1825, average 1841
  • C #: 790/786/801/790/791, mean 792

Still C # more than double in front.

On the advice of @KoVadim, I removed the seed calculation (capacity left), now the working cycle is as follows:

C ++:

 for (int i = 0; i < 10000000; i++) { //seed = 1664525 * seed + 1013904223; dict.insert(pair<int, int>(i, i)); } 

C #:

 for (int i = 0; i < 10000000; i++) { //seed = 1664525 * seed + 1013904223; dict.Add(i, i); } 


  • C ++: 1498/1514/1498/1498/1498, average 1501
  • C #: 129/129/135/133/132, average 132

According to the advice, @igumnov added khash to the benchmark. Code:

 KHASH_MAP_INIT_INT(32, int) void testKhash() { int seed = 1; khiter_t iter; khash_t(32)* dict = kh_init(32); int dummy; auto begin = clock(); for (int i = 0; i < 10000000; i++) { seed = 1664525 * seed + 1013904223; iter = kh_put(32, dict, seed, &dummy); kh_value(dict, iter) = i; } auto end = clock(); double elapsedMs = double(end - begin) * 1000.0 / CLOCKS_PER_SEC; cout << "elapsed time = " << elapsedMs << " ms, dictionary entries count = " << kh_size(dict) << endl; } 

Result: 577/577/608/577/577, average 583, massive wines for native code. Let me remind you that the best result of a standard .NET container is 792 ms.

Who will offer a custom container under .NET?

I tried implementation for .NET FastDictionary ( MapReduce.NET project). It turned out a little slower than the built-in Dictionary : 853/865/842/841/842, average 849.

I tried the speed of pure allocation to test the @Dith hypothesis: 10 million times the constructor of the empty class is launched. Code:

C #:

 static class Allocation { class Foo { } static public void Test() { const int size = 10000000; var timer = new Stopwatch(); timer.Start(); for (int i = 0; i < size; i++) { new Foo(); } timer.Stop(); Console.WriteLine("elapsed time = {0} ms", timer.ElapsedMilliseconds); } } 

C ++:

 void testAlloc() { const int size = 10000000; LARGE_INTEGER li; if (!QueryPerformanceFrequency(&li)) exit(1); double freq = double(li.QuadPart)/1000.0; QueryPerformanceCounter(&li); auto begin = li.QuadPart; for (int i = 0; i < size; i++) new Foo(); QueryPerformanceCounter(&li); auto end = li.QuadPart; double elapsedMs = double(end - begin) / freq; cout << "elapsed time = " << elapsedMs << " ms" << endl; } 


  • C #: 58/54/28/55/55 (average 50)
  • C ++: 407.719 / 400.693 / 401.674 / 401.926 / 399.976 (average 402.4)
  • one
    And if so try to replace? dict.insert (make_pair (seed, i)); dict.insert (std :: pair <int, int> (seed, i)); - igumnov
  • one
    @VladD, good, visual testbench! +1 Since you started the experiments, and you have already explained the reason with words, for clarity, you could also test the С++ with the allocator, which will allocate 10 million in advance. - mega
  • one
    And I also know one non-standard hash table. Maybe it will be faster. troydhanson.github.com/uthash More tests good and different. - igumnov
  • 2
    Note still std::unordered_map - this will probably be more honest. - Costantino Rupert
  • 3
    Thanks friends. I did not expect such an extended response ... - VDIGIT

Continuing a very interesting discussion:

as @VladD said, the dotnetovsky Dictionary is based on a HashMap, so

The most unsuccessful implementation of the GetHashCode function in the .NET Framework is the default implementation in structures. The fact is that this function for structures does the following. She uses reflection to search through all the fields and tries to get a hash code. Having found the first field from which the hash code can be obtained, the function ends its work by returning this value. In most cases, it returns the hash code of the first field that is found. However, if the structure consists of reference types, and all of them are set to null, then the function, by analogy with the class, returns the sequence number of the object.


This approach has two drawbacks. The first is that such a hash code can and often turns out to be incorrect, since it is difficult to identify the entire structure by one field. Secondly, due to the use of reflection, this method is far from ideal performance . Therefore, if it is necessary to obtain hash values ​​from structures, it is better to implement the corresponding functions independently.

a source

@igumnov Regarding memory measurements, there is a good article " Processing large amounts of data in memory in C # "

  • As for the implementation of GetHashCode holy truth. Reflection is pretty hard. I have, however, the suspicion that there is optimization for the basic types. And for custom structures GetHashCode would be nice, of course, to manually handle it. - VladD
  • Hmm, apparently, in .NET 4, the GetHashCode function has changed. Test: class HashTest {struct Test {public int x; public int y; } public static void Run () {T (new Test () {x = 1, y = 2}); T (new Test () {x = 1, y = 100}); T (new Test () {x = 200, y = 2}); } static void T (Test t) {Console.WriteLine ("x = {0}, y = {1}, hash = {2}, hash ({0}) = {3}, hash ({1}) = {4} ", tx, ty, t.GetHashCode (), txGetHashCode (), tyGetHashCode ()); }} - VladD
  • The result:> x = 1, y = 2, hash = 1745183319, hash (1) = 1, hash (2) = 2> x = 1, y = 100, hash = 1745183281, hash (1) = 1, hash ( 100) = 100> x = 200, y = 2, hash = 1745183390, hash (200) = 200, hash (2) = 2 - VladD
  • However, for those who will independently implement GetHashCode , here is a useful article . - VladD
  • @VladD fun fact: the implementation of GetHashCode changed - but the implementation of Equals remains the same! - Pavel Mayorov

First, recall that with map the first parameter is the type of the key, the second is the type of the value. You add the key seed = 1664525 * seed + 1013904223 in each entry; where seed = 1 is constant. Accordingly, the key value is transmitted separately and the same, which is the worst value to insert.

Secondly, and most importantly. map is effective for finding values ​​by key and is very inefficient when adding or removing elements. Your test does only add. The main costs of such a test are not allocation, but adding to the tree (red and black, as you correctly noted). If you want to compare the dictionnary in dotnet and map in stl, you need to write a test in which values ​​are filled in first, and then measure the access time to random values ​​in a loop. If you need to frequently search for values ​​in the application, use the map, but then we write another test. If you need to add many times and many values ​​- use vector, if you add / delete - list. Then the test (for adding) will be different:

 vector<int> vec; for (int i = 0; i < 1000000; i++) { vec.push_back(i); } 

Your code for 1,000,000 elements is executed in my 8386 ms, an example with a vector is 251 ms., 33 times faster.

  • Are you answering a question or an answer? :) - VladD
  • (Well, try to go through the code in the debugger, you will see that the seed is changing.) - VladD
  • 1. Comments on the question of the illegality of testing. 2. Yes, you are right. Screwed up. seed is changing. An interesting algorithm. - Denis Pantyushev

I will write here, since my comment limit is over

I tried the speed of pure allocation to test the @Dith hypothesis: 10 million times the constructor of the empty class is launched.

This test shows the lack of knowledge of the compiler and the runtime. In C ++, the code will be leaked, but in Sharipov it will not, since the GC will clean everything up. This is the first bell that the tests are not equivalent. Added at least a call to the destructor. (But tests show that this does not greatly speed up the process).

BUT! In the Sharp code, in fact, the amount of allocated memory will not increase - the object will be destroyed, and the new one will be placed in the old memory. Since the object is empty, initialization will be fast. Plus, the memory is already allocated. Formally, the assignment of a pair of pointers + memset ().

But the jit optimizer can see that the object is created, but not used, and the constructor is empty and discarded. And in fact - drive an empty cycle. Although, ideally, then he can throw it away. But it is necessary to look separately.

I am not so strong in C #, but I think that if I start outputting the address of an object that is being created, it will always be the same (or there will be a dozen different addresses that are looped through in the cycle).

That is, in fact, with ++ the code looked somewhere like this (very much simplified code, on the knee):

 Foo * f; // это глобальная переменная. for (int i = 0; i < size; i++) { // это как бы конструктор Foo * t; if (f == NULL) // если нет объекта, создадим t = alloc_memery(); init(t); else { t = f; init(t); // а это часть конструктора. Память то уже выделена, нам только поля поправить. f = NULL; // объект мы забрали } // а это деструктор. f = t; // вернули объект назад. t = NULL; } 

It turned out such a pool of objects on one object.

But you can do better, which will make the tests more correct.

 char * t = new char[1000]; // выделим себе немножко памяти. for (int i = 0; i < 10000000; i++) { new (t) Foo(); // видите странный оператор new?:) } delete[] t; 

(the operator new, which in code is such a special form, is called placement new ). The most interesting thing is that now there is no leakage, and the code works 10 times faster than the original. I consider this code to be the most plausible analogue of a Sharpe code. But on the other hand, you can write more beautiful

 for (int i = 0; i<10000000; i++) Foo f(); // да, здесь будет создаваться объект и вызываться конструктор-деструктор 

I would not be surprised if the Sharpe optimizer also did in the original code of memory allocation on the stack (this is possible, I know for sure that 7 java can do this). And memory allocation on the stack is very cheap in terms of speed.

Someone will say, in C # I took it straight and wrote it quickly, but in C ++ you need to add something else. But in C ++, “hacks” are made easy and not forced, and in C # you can quickly run into obstacles (yes, I know about Unsafe code). But I don’t consider the above code hack. And I most likely will not create millions with the help of new objects in a cycle. And if it is needed, I will preallocate to the required number of objects. And it will be faster.

  • This is an extra answer. I will not even vote. @VladD just made a benchmark to show that the difference is already in the memory allocation model . How this degenerates while in C# no longer has any meaning. The difference was calculated, it filled the gap of the first tests and this is the main thing. - mega
  • I deliberately made memory leak to avoid calling destructors in C ++. (My last comment to the answer is exactly what it says.) I want to measure only 10 million designers + memory allocations on the heap. Objects are not deleted (well, or in C # are deleted by garbage collection). It is easy to verify that objects are created by removing the constructor call in the loop. - VladD 2:49 pm
  • with 10kk objects, momory leak already plays a role In your example with ++, there were 10 million allocations + initializations. in Sharpev - 1-2 allocations, 10 million initializations. @mega is not an answer, it's just a comment :) but I wanted to show that the test is very contrived and does not explain anything in the original question. - KoVadim 2:55 pm
  • one
    If you look closely at my examples, you can see that 1. Optimization as in the examples @VladD does not work, because we put everything in an array. 2. Garbage collection also does not work, and if this is not obvious, you can be sure of comparing it by comparing pastebin.com/FqNYLrXm and pastebin.com/4Nv1kdEh There will be no overhead with the list, the cache will be the same misses in both languages. Since the speed of the radiant sisharp is more important here than common sense and the desire to get to the point, all I have to do is leave the discussion. - Dith
  • one
    > this is not the answer, this is just a comment. For this supposedly not the answer, the maximum that you deserve is a minus, because you are trying to confuse the participants. > and I wanted to show that the test is very contrived and does not explain anything in the original question. The test explains everything, because in the original algorithm it does not matter which operations go below the new operator, they do not affect this algorithm, therefore, for the purity of the experiment, it was necessary to exclude the time it took under similar conditions from both sides, which was made by both parties: @VladD and @Dith. (That his "purity" is doubtful and so clear to everyone - mega