The expediency of using nibble strings with dictionary size N <= 16

Question

The human genome, yes, and in general, genomics. Work with strings of high degeneracy (dictionary size does not exceed 16 members). It is very interesting to try to save memory and make a replacement:

'A' -> 0000 'C' -> 0001 и т.д

. And the lines are in the form of array of bytes.
But then, most likely, you will have to rewrite the search functions, most likely, on assembler. Will the game be worth the candle?

will ... And in order not to rewrite everything - use the brazen hack.
Where there used to be a regular character - type 2 characters (16 bits first and 16 bits of the second).
The search will not even break almost) But without seeing the load profile at all without a profiler and without knowing the task, it is impossible to answer.
But the usual search functions are not very suitable - they can be used, of course, but the substring will have to be formed, and this is already a chore for even a 4-letter alphabet.
Although you have to watch But in any case you have to rewrite the functions of a fuzzy search.
Maybe not quite on the topic but ... You know about prefix trees?
But while I read about these trees, there were references that they are popular in the field of dna processing.
In particular, there are algorithms for compressing these trees.

Accepted Answer · 2017-11-28T14:38:39

The temptation is great :) But a clear win is only visible in order to save memory, where our lines will be stored. I suspect that at the current price of RAM, plus the fact that most of the mat. boards support at least 16 gigabytes of RAM (de facto, the minimum standard for bioinformatics tasks for today), this is not so relevant. Yes, save twice (in severe cases, if suddenly someone mistakenly decided to use WideString - in 4 :)). This is in 2006, when I wrote the first in the world (well, I must brag :)) a full-featured triter for the human genome that worked on a regular PC, saving memory was very relevant. But 11 years have passed.
But the speed is likely to lose. And not on search functions, but on converting string -> array, which will eat up all the winnings. After all, sequences and SAM / VCF / BED, - all use the usual string type. Is that your data will be stored in half the size, and their loading / saving, too, will save time a little.

However, the advantages can still be:

Work on weak machines. Many of these are on the devices, sometimes you need something quickly process
Converting once, and then a lot of work. If the functions are optimized, you see, and there will be a gain.
Work with very big data. At a minimum, the human genome, with cross-scalable research. Or even coniferous genomes.

Search functions have to be rewritten. I can give as an example a few of my own, who used it. Written for 64-bit applications, but minor ones can be rewritten for 32-bit systems.

Moreover, I used the following pseudo-string representation:

 type ByteArray = array of byte; TBA = record oddity:integer; ar:ByteArray; end;

Why is that? oddity is needed in order to specify whether it is an even number in a string or not. Or it is necessary to sacrifice one member of the alphabet and mark the end of the line with a nibble byte (or byte) - an analogue of Z-lines. Or the first? bytes will set the length of such a string. Both of these options, in my opinion, will lose using the record.

Here, for example, two search functions for three-letter substrings:

 function Search3Values(ba:ByteArray; mask:word):int64; overload; asm push rbx mov bx, dx // дублируем маску rol dx, 4 // и сразу же её сдвигаем mov r9,[rcx-8] // сохраняем количество байтов в байтовом массиве @loop: mov ax, [rcx] // загрузка байтов push ax // сохраняем для ещё одной проверки push bx or ax, $F000 // накладываем маску and bx, ax cmp bx, ax // сравниваем pop bx pop ax // восстанавливаем AX jne @not_found shl r10, 1 // умножаем на 2, указывая адрес в "полубайтах" mov rax, r10 pop rbx ret @not_found: or ax, $000F push dx and dx, ax cmp dx, ax pop dx jne @not_found_ror shl r10, 1 // умножаем на 2, указывая адрес в "полубайтах" inc r10 // и увеличиваем на 1 - был сдвиг mov rax, r10 pop rbx ret @not_found_ror: inc rcx inc r10 cmp r10, r9 jne @loop @not_found_global: mov rax, - 1 // возвращаем "-1": ничего не нашли pop rbx ret end; function Search3Values(ba:ByteArray; mask:word; pos:integer):int64; overload; asm push rbx mov bx, dx // дублируем маску rol dx, 4 // и сразу же её сдвигаем // xor r10,r10 mov r9,[rcx-8] // сохраняем количество байтов в байтовом массиве mov r10, r8 shr r10,1 // делим пополам dec r9 sub r9, r10 // сколько осталось проверять js @not_found_global jz @not_found_global mov r9,[rcx-8] add rcx, r10 // перемещаем указатель dec r9 //мы сравниваем байты со словом с шагом в один байт, таких сравнений будет length - 1, поэтому сразу уменьшаем счётчик на 1 mov ax, [rcx] // загрузка байтов bt r8w,0 // проверка на чётность jc @not_found @loop: mov ax, [rcx] // загрузка байтов push ax // сохраняем для ещё одной проверки push bx or ax, $F000 // накладываем маску and bx, ax cmp bx, ax // сравниваем pop bx pop ax // восстанавливаем AX jne @not_found shl r10, 1 // умножаем на 2, указывая адрес в "полубайтах" mov rax, r10 pop rbx ret @not_found: or ax, $000F push dx and dx, ax cmp dx, ax pop dx jne @not_found_ror shl r10, 1 // умножаем на 2, указывая адрес в "полубайтах" inc r10 // и увеличиваем на 1 - был сдвиг mov rax, r10 pop rbx ret @not_found_ror: inc rcx inc r10 cmp r10, r9 jne @loop @not_found_global: mov rax, - 1 // возвращаем "-1": ничего не нашли pop rbx ret end;

And here - to search for a seven-letter substring:

 function Search7Values(ba:ByteArray; mask:cardinal; pos:integer):int64; overload; asm push rbx xor rax,rax mov ebx, edx // дублируем маску rol edx, 4 // и сразу же её сдвигаем // xor r10,r10 mov r9,[rcx-8] // сохраняем количество байтов в байтовом массиве mov r10, r8 shr r10,1 // делим пополам sub r9, r10 // сколько осталось проверять sub r9, 3 js @not_found_global jz @not_found_global mov r9,[rcx-8] add rcx, r10 // перемещаем указатель sub r9,3 //мы сравниваем байты со словом с шагом в одно слово, таких сравнений будет length - 3, поэтому сразу уменьшаем счётчик на 3 mov eax, [rcx] // загрузка байтов bt r8w,0 // проверка на чётность jc @not_found @loop: mov eax, [rcx] // загрузка байтов push rax // сохраняем для ещё одной проверки push rbx or eax, $F0000000 and ebx, eax cmp eax, ebx pop rbx pop rax // восстанавливаем AX jne @not_found shl r10, 1 // умножаем на 2, указывая адрес в "полубайтах" mov rax, r10 pop rbx ret @not_found: push rax push rdx or eax, $0000000F and edx, eax cmp eax, edx pop rdx pop rax jne @not_found_ror shl r10, 1 // умножаем на 2, указывая адрес в "полубайтах" inc r10 // и увеличиваем на 1 - был сдвиг mov rax, r10 pop rbx ret @not_found_ror: inc rcx inc r10 cmp r10, r9 jl @loop @not_found_global: mov rax, - 1 // возвращаем "-1": ничего не нашли pop rbx ret end;

The principle is trivial: when searching for long substrings, the string, depending on the length, beat on a combination of smaller ones and the corresponding functions were used. For very long substrings (and not very old processors) movdqu / vmovdqu (SSE2 / AVX) bundled with psrlw proved to be good (shift in register by 4 bits): when using different registers, the profiler showed almost double acceleration.
In general, for good, it is necessary to write functions of a fuzzy search for such things. In theory, this is where you can dig out a gold mine :)

Update 1

Strangely enough, everything new is well forgotten old :) Now, optimizing some libraries for Zen cores (new fleet of counting machines - AMD Ryzen), we noticed that the standard data when working with genes, if used in the nibble format , very well "fit" into the kernel cache Zen. Processing speed grows very seriously.

Sometimes I just google :) But usually on the table are branded "bricks" from Intel (once certified, so they still send documentation free of charge :)) and print out AMD references.
All this can be downloaded from the support sites of the manufacturers of CPU.
to Update1: we do not have Zen yet, although now we are considering the purchase of 2 new machines for processing.
For us, they are the best in terms of quantity * performance / price - for our tasks.

The expediency of using nibble strings with dictionary size N <= 16

1 answer 1

More articles: