Since the question is about the fastest implementation, it makes sense to try the simplest approach as well: using a whole int for each bit. Recall that the question was about speed, not about a memory-efficient implementation.
I used the @avp code to compare the speed. MSVC 2015, Release mode, x64.
Code:
// Benchmark: three ways to represent a set of 1,000,000 boolean flags.
//   1) std::bitset
//   2) a hand-rolled bit array packed into unsigned longs
//   3) a plain "naive" int-per-flag array
// Each variant seeds rand() identically, sets ~1M random bits, then counts the
// set bits; the whole cycle is repeated 100 times and timed with clock().
// (The original used an MSVC precompiled header "stdafx.h"; removed so the
// program is portable standard C++.)
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <limits.h>
#include <time.h>
#include <iostream>
#include <bitset>

using namespace std;

int main(int ac, char *av[]) {
    const int n = 1000000;

    // --- Variant 1: std::bitset ---
    bitset<1000000> bcp;
    printf("bcp %d\n", (int)sizeof(bcp));

    // --- Variant 2: manual bit array ---
    const unsigned long ebits = (sizeof(unsigned long) * CHAR_BIT); // bits per word
    const unsigned long sz = (n + sizeof(unsigned long) * CHAR_BIT - 1) / ebits; // words, rounded up
    unsigned long bc[sz], bysz = sizeof(bc);
    printf("bc sz = %d ebits = %d bysz = %d\n", (int)(sz), (int)ebits, (int)bysz);

    // --- Variant 3: one int per flag ---
    int* crude_array = new int[n]; // contents are uninitialized until cleared below
    printf("crude array size %d\n", n);

    clock_t s, e;
    int nn = 0;

    // Timed loop 1: std::bitset.
    s = clock();
    for (int r = 0; r < 100; r++) {
        srand(9);
        bcp.reset();
        for (int i = 0; i < n; i++) bcp.set(rand() % n);
        for (int i = 0; i < n; i++) if (bcp.test(i)) nn++;
    }
    e = clock();
    cout << "nn = " << nn << " t = " << e - s << '\n';

    // Timed loop 2: hand-rolled bit array.
    nn = 0;
    s = clock();
    for (int r = 0; r < 100; r++) {
        srand(9);
        memset(bc, 0, bysz);
        for (int i = 0; i < n; i++) {
            int bi = rand() % n;
            bc[bi / ebits] |= (1UL << (bi % ebits));
        }
        for (int i = 0; i < n; i++) {
            if (bc[i / ebits] & (1UL << (i % ebits))) nn++;
        }
    }
    e = clock();
    cout << "nn = " << nn << " t = " << e - s << '\n';

    // Timed loop 3: naive int-per-flag array.
    nn = 0;
    s = clock();
    for (int r = 0; r < 100; r++) {
        srand(9);
        // BUG FIX: the original wrote memset(crude_array, 0, sizeof(crude_array)).
        // crude_array is a POINTER, so sizeof yields 8 bytes on x64 — only the
        // first two ints were ever cleared, and the rest of the new[]'d array was
        // read uninitialized on the first pass. Clear the full n*sizeof(int) bytes.
        memset(crude_array, 0, n * sizeof(int));
        for (int i = 0; i < n; i++) {
            int bi = rand() % n;
            crude_array[bi] = 1;
        }
        for (int i = 0; i < n; i++) {
            if (crude_array[i]) nn++;
        }
    }
    e = clock();
    cout << "nn = " << nn << " t = " << e - s << '\n';
    delete[] crude_array;
}
Result:
bcp 125000 bc sz = 31250 ebits = 32 bysz = 125000 crude array size 1000000 nn = 3276800 t = 2158 nn = 3276800 t = 2067 nn = 3276800 t = 1979
Repeated test runs show consistent results.
Note that it is not at all self-evident that the "naive" implementation would turn out to be the fastest. After all, it has poor cache locality, so it could easily have been slower than the others. The advice, therefore, is in any case to profile, profile, and profile your application code again on real data. Guessing the outcome "by pure reasoning" is rather unrealistic (except in very trivial cases).