On the question of multiplication, square root, import substitution, and Milander

“Entropy, ergodic source, multidimensional message space, bits, multi-sense, Markov process — all these words sound quite impressive, in whatever order they are arranged. If we arrange them in the correct order, they acquire a certain theoretical content. And a real expert can sometimes with their help find a solution to everyday practical tasks. ”

John PEARCE "See no evil"

This post is full of discussions about the fine optimization of performing mathematical operations on MCs with limited resources, as well as subjective evaluations of various aspects of embedded software development.

Those whom this warning is not scared, please under the cat.

Before we proceed to the description of the procedure for extracting the square root of an integer, the operation inverse to squaring and, accordingly, multiplication, let's talk about the latter.

Suppose that we have the ability to multiply an 8-bit number by an 8-bit one, getting a 16-bit result (8 * 8 = 16), how to get an implementation of the operation 16 * 16 = 32 on the basis of this operation. The obvious way is to represent 16 as the sum of two 8, then we get

А(16)*Б(16)=(а1(8)*256+а2(8))*б1(8)*256+б2(8)) =а1*б1*256*256+а1*б2*256+а2*б1*256+а2*б2

If the resulting expression replaces the multiplication by 256 by a left shift of 8 bits, then we get a completely working algorithm. We estimate the time spent on implementation - we will need 4 multiplications of 8 * 8 = 16 and 4 additions of 4 byte numbers 32 + 32 = 32. For AVR type MCs, we get 4 * 2 + 4 * 4 = 24 clocks, but this is for a head-on solution. Let's try to improve the result. The fact that we need not 4, but 3 additions and one assignment simplifies the situation somewhat, since the initial zeroing of the result is not required, but we still did not take it into account, although it was necessary and the total time should be 24 + 4 = 28 cycles. But, if we take into account the presence of the shift in the first three terms (respectively, we have that the youngest (two lower bytes) is zero and there is no sense to add it to the result), then we will have to add not 4 bytes, but three and two, which will reduce execution time for 1 * 2 + 2 = 4 clock cycles and we get 20 clock cycles. Next, we can draw attention to the fact that the first and last addends do not intersect at all, which will replace the zeroing of the upper half of the result with the assignment of the first addend and reduce the execution time by 2 more cycles to 18. Next, using the architecture features, namely the presence of the register transfer command couples, save two more bars and the final result - 16 cycles instead of the original 28 - a trifle, but nice.

Similar optimization methods work for operation 32 * 32 = 32, for which you can shorten the execution time from the expected 4 * 4 * (2 + 4) + 4 = 100 cycles to (3 + 5 + 4 + 3) + (5 + 3 3) + (4 + 3) + 3 = 36 cycles, which is quite good. Well, at the end of the consideration of various multiplication options, we note that 16 * 16 = 16 can be obtained in 3 + 3 + 3 = 9 cycles. Note that all these considerations are valid only on the assumption of the operation 8 * 8 = 16 for 2 cycles, and if it is not on the target MC, the execution time of all other versions of the operation will not become faster.

Let us summarize the time required for the multiplication (8 * 8 = 8 2, 8 * 8 = 16 9, 16 * 16 = 16 16, 16 * 16 = 32 36) and consider now the original problem.

We need to extract the square integer root of 32 bit number H, that is, to find the largest 16 bit number n such that n * n <= H. We all from the middle school course know the method of successive approximation to the square root (n = (N / N '+ n) / 2), but when using it we will have to divide integers, and this is a very time consuming operation.

Therefore, other computational schemes have been developed, one of which is the bitwise approximation method, which in the pseudo-code looks like this:

initial values -> n = 0; b = 0x8000;
perform 16 times -> if ((n + b) * (n + b)> = H) n = n + b; b = b >> 1;

You can immediately estimate the time spent on this option 16 (number of bits of the result) * (2 (cycle organization) +2 (addition) + X (multiplication) +5 (comparison and solution) +2 (result modification) / 2 (on average half times) +2 (bit shift)) = 16 * (12 + X). You ask, why in the formula X instead of the number 16, and it turns out that an ambush awaited us, since we write in C, and not in assembler. The fact is that in the standard library there is no multiplication operation with changing the digit capacity and we cannot use 16 * 16 = 32, but are forced to use 32 * 32 = 32, which leads to X = 36 instead of X = 16 and the final figure is 16 * 48 = 768 cycles to extract the integer value of the square root of a 32-bit number.

Of course, this is much better than Newton's method, but a bit too much, let's see what can be done.
So, it is obvious that most of the time is spent on the calculation of the next multiplication result. Of course, you can rewrite in assembler and use a less expensive version of multiplication, getting 16 * (12 + 16) = 448 cycles, but we will leave this way as a last resort. Consider the process more carefully and see that we do not calculate the random number multiplying by itself, but the multiplication of the previous value with some increase, and the square of the previous value is known. Hence, we can resort to the difference scheme, based on the formula (n + b) * (n + b) = n * n + 2 * n * b + b * b. At first glance, it looks like a mockery: instead of one multiplication, we will need to perform four pieces, and even two additions of long (32-bit) numbers. But let's start to understand: n * n we already have, b * b, taking into account the fact that b = b '/ 2 is easy to get, like b' * b '/ 4, and similarly 2 * n * b = 2 * n * b '/ 2.

The following calculation scheme appears:

initial values -> nn = 0; n = 0; b = 0x8000; bb = b * b;
repeat 16 times -> if (nn + n + bb> = n) {n = n + b; nn = nn + bb + n}; bb >> 2; b> 1;

We estimate the implementation costs 16 * (2 (cycle organization) +12 (assignment and two additions) +5 (comparison and solution) + (2 (addition) +8 (two additions)) / 2 (about half times) +8 (shift to the right by 2) +2 (shift to the right) = 16 * 34 = 544 cycles. Already better than with the wrong multiplication 32 * 32, and we still have reserves.

What are they - pay attention to the most costly operation - the addition and comparison of a total of 17 cycles and remake the main loop of the algorithm:
2. repeat 16 times -> T = Nbbn; if (T> = 0) {H = T; n = n + b);}; bb >> 2; b> 1;
Then the cycle time will be 16 * (2 (organization of the cycle) +12 (calculation of the new difference) +1 (comparison and solution) + ((4 (assignment) +2 (addition)) / 2 (an average half times) +8 +2) = 16 * 28 = 448 cycles, if we take into account the peculiarities of the architecture, you can save another 2 + 2 = 4 * 16 = 64 cycles and fit in less than 400 cycles.

We get even a slightly better result, as when using the correct multiplication of 16 * 16 = 32, but without assembler, "in pure C". However, there is a significant disadvantage - if in the version with multiplication everything is intuitively clear, then the version with a difference scheme without comments gives the impression of a black magic session, you choose. We also note that we exchanged the number of cycles for additional memory for intermediate variables, which usually happens.

The necessary remark is that we did not receive a significant (at times) gain compared to multiplications, since we have a fast implementation of 8 * 8 = 16. If it is absent in MK (and this happens) or not so fast (and this also happens), then the difference scheme becomes multiple times faster, since it uses only standard addition and shift operations, which are guaranteed to be in any MK.

It seemed that it would be better not to succeed, but it turns out there are still reserves for improving the performance of the algorithm. Let's try to use another classic method of acceleration - divide and conquer. What if we first extract the square root from the upper half of the argument, and then refine it? First of all, we show that it is fundamentally possible. Indeed, we represent the argument in the form H = H '<< 16 + H' 'and the result in the form n = n' << 8 + n ''. Since n '' <256, then the square of it will certainly be less than the square of the number n = n '<< 8 + 256 = (n' + 1) << 8. It follows that the highest part of the result does not exceed the square root of the highest part of the argument.

The implementation of this approach to leave to the inquisitive reader.
What a similar approach will give us, because the total number of iterations will remain unchanged - we can spend the first half of iterations with numbers of shorter length, and this leads to a decrease in time costs. This approach can be applied to the multiplication option and the difference option, the total gain will be up to a quarter of the total execution time.

The necessary note is that the applicability of this approach is not at all obvious; when implemented for AVR-type MCs, execution acceleration does take place, but for some architectures, for example, for x86, a slowdown appeared. Apparently, working with non-native data (16 bit) in this architecture is significantly more expensive in time than with native (32 bit) data. I did not conduct a deep research, but the fact took place and should report about it in order to avoid misunderstanding.

But that's not all. Since we have already embarked on the path of separation and rule, then why not go further - extract the root from the bits, step by step, starting with the older ones (starting with the younger ones is counter-productive in our case). The algorithm of the algorithm is the same - we add the next bit of bits to the current result and try to add the next bit to the result, checking whether we have gone beyond the root value. The peculiarity is that we can only check the high bits of the argument, until we reach the younger bits.

When implementing, we use another trick - instead of moving our subtracted numbers to the right, we will move our diminishing argument to the left, the meaning of this does not change, and the speed increases. Increasing due to two factors - 1) we only need to subtract only 16 bit numbers (there is one feature, and it must be taken into account, but we consider the training example, howl) and 2) we will not need to shift the square of the next bit, because it will always equal to one. But for everything in this world you have to pay and we will have a shift of the extended difference (6 byte) to the left, and by 2 bits per clock. We look at the pseudocode

initial values -> n = 0; H1 = 0;
repeat 16 times -> (H1, H) << 2; T = H1-n-1; if (T> 0) {H1 = T; n = n + 2}; n << 1;

and estimate the execution time by getting 16 * (12 (extended shift) +4 (calculation of the race) +1 (decision) +2 (assignment) +1 (increase) +2 (shift)) = 16 * 22 = 352 cycles, perhaps The result is close to perfect. When implementing this option there are small pitfalls, I leave it again to the inquisitive reader (well, he gets the work).

Well, in conclusion of the section that I was inspired to write this post. There is an absolutely wonderful McuCpp library, authored by Anton Chizhov, in which, based on the loki class of authorship, Andriescu is unusually elegant (well, as far as elegance can be applied to C ++ templates), the work with the pins <a " github.com/KonstantinChizhov/ Mcucpp »I have great respect for the mentioned author (for both of them) and recently, due to circumstances that I will say later, I looked at the source code of this library and once again admired.

However, among other files I saw template_utils.h, in which some auxiliary subroutines were implemented, and among them an integer root from a 32-bit number. The fact that it uses the simplest algorithm of sequential approximation with multiplication is not terrible, because this algorithm does not lose much in speed, and in comprehensibility it gives a lot of points ahead and still wins. But I didn’t like the fact that it was implemented a little sloppy (in terms of speed), because "children can see it." Inaccuracy consists in representing a selectable number of 32 bits, because we firmly know that the root of the 32-bit number does not exceed 16 bits, so why should we make zero-byte shifts. And this is exactly the case when the compiler itself would never guess to optimize and should be helped.

Obvious function conversion

 static inline uint32_t sqrt(uint32_t value) { uint16_t result = 0; uint16_t add = 0x8000; for (uint8_t i = 16; i !=0; ++i) { uint32_t rootGuess = result | add; uint32_t guess = rootGuess * rootGuess; if (value >= guess) { result = rootGuess; } add >>= 1; } return result; }

allows us to save 2 clocks on a bit shift and 2 clocks on creating the next multiplier on each cycle, and organizing the cycle in the specified form is 4 more clocks (I know that the compiler can do this optimization for us, but why not help it explicitly ), which is quite good for purely cosmetic code changes that do not affect its clarity at all.

Late note - one comment made me think that it would be more correct

  for (uint_fast8_t i= ...)

Thanks, Oleg, for the hint.

The cherry on the cake is located just below the function of extracting the whole square root of the sign number, which states that √-1 = 65635 = -1. On the other hand, why not, than this result is worse than any other, it’s not an exception call in MK, and the whole square root of a negative number does not exist.

And the conclusion about why I turned to the library of Anton Chizhov. He pushed me to this recent post on the national RTOS for MK called MAKS (MultiAgent Coherent System) - see the epigraph to the post advertised by its creators and which is ported to Milandr produced by MK. Remark - this post is by no means a promotional material and it will soon become clear to readers. From the mcucpp mentioned above, the OS authors used the implementation of the ring buffer (without detracting from the advantages of the library of Anton, I must declare that this part of it is not a reference, and this is still a mild formulation, which I wrote in another post that I can’t post in any way). Since I work closely with MK production, I was interested in the material and I followed the link to the developers site.

Here begins another cry of Yaroslavna.

Last year, when the creation of a national RTOS was first announced, I downloaded a description of a software product from this site, but somehow my hands did not reach the point of studying. By the nature of my business, I have to deal with domestic components (I understand enough ...), so having the appropriate software would be quite good. Remembering how in the last year's release the director of the company told about the millions of rubles spent on development and a large team working on the creation of this software, I decided to see a trial version available for download for free, so I share the results.

For a start, the description for half a year has almost halved in volume (from 115 to 55 pages), and if the disappearance of applications with screenshots describing the process of launching third products from the Program Description is welcome, then the appearance of these materials (to create it was spent, though not very significant, but still time and money) in a document like “Operator's Guide”, I personally am bewildered. Further, in the first sentence of the document, we see a frank deviation from the truth, since the RTOS itself is not intended to “create programs”, for some reason the authors did not allow themselves such statements in the previous version of the document, the influence of the marketing service is felt. It also delivers that if earlier the description was in the / docs folder of the root directory, and it was logical, now it is hidden in / toolchain / macs / docs, okay, as they said in my youth, “everyone is frantic in his own way”, moving on.

I start to look at the description, looking at the source code (it is kindly included in the trial version) and I am surprised to find the absence of any peripheral device drivers adapted to work with this OS. At first I suggested that this is a feature of the trial, then on the forum in the information from the developers I find out that there really are no drivers, but they are working on it. More than half a year (half a year, Carl, generally close to a year) from the moment of OS release for MK, and they work on drivers. Naturally, or as they say, it goes without saying that no third-party products (file system, network stack, USB stack) are out of the question. A funny idea from the authors about the requirements for software development for MK, okay, drove again.

That is, the declared OS, the underlined feature of which is the organization of interaction within the multicontroller system, has no native means of organizing this interaction. What we have in the bottom line - and we have task management, the actual scheduler, the minimum time service and task synchronization tools, and that’s all fun, if not more. Okay, we will look further, even in such a set of components interesting solutions are possible, especially if we consider that on one site (not the manufacturer’s company), following the link, I saw the “expertise” of the source code of this OS. In this document it was said that the software product does not use third-party (imported) components and is distinguished by originality, it is necessary to make sure.

The first remark is that if the original ARM files included in the source package are used to port to a specific Cortex-M0 (1986 BE1T) architecture, this is very similar to the use of third-party (imported) text fragments - personally it seems to me that this is exactly the use, but I guess I do not know everything. Well, secondly, the source code of the sheduler and its associated task management components are really original and have no analogues (at least, they are not known to me), but this is the kind of originality that recalls the phrase of the old shaman from the movie “Evil Spirit of Yambuya” the great hunter: "Ears cut, cooked and eaten - would you have guessed?".

I will try to explain myself - in the design of the OS in general and in the RTOS in particular, one of the most difficult is the question of ensuring access of all processes in the system to a shared resource - processor execution time. Дело в том, что неправильно спроектированная система (и неудачно написанная задача) может заблокировать исполнение всех задач с меньшим приоритетом, что наверняка озадачит программиста. Речь идет не об выполнении запрещенных операций типа управления прерываниями (это тема отдельного разговора и в рамках простых МК решения просто не имеет, хотя авторы рассматриваемой ОС утверждают, что решили эту задачу путем использования MPU), а о непрерывном исполнении без входа в ожидание.

Данная проблема широко известна, можно решать ее по разному, на этот счет имеется обширная литература, как правило, используется набор очередей задач с разным приоритетом и модифицированный шедулер. Это требует определенных затрат для организации очередей с временем доступа О(1) и, например, во FREE-RTOS при попытке задать более 20 возможных уровней приоритета программист получает недоуменный вопрос компилятора, а действительно ли ему нужно столько (и даже, если на этот вопрос имеется утвердительный ответ, без модификации исходного кода столько приоритетов не получить).

Поэтому я был слегка удивлен, обнаружив, что рассматриваемая ОС позволяет иметь количество приоритетов до 60 (и даже больше). Удивление рассеялось, когда я полистал исходники. Вместо отдельных очередей задач с равными приоритетами авторы используют одну очередь (есть еще и вторая очередь заблокированных задач) готовых к исполнению задач, чем достигается экономия памяти (наверное, это было целью такого решения) за счет того, что

операция внесения задачи в очередь становится О(n) и
это делает применение модифицированного шедулера невозможным — по моему, дороговато за 20*(3*4)=240 байт оперативной памяти. Решение необычайно оригинальное, но, с моей точки зрения, это его единственное достоинство.

В общем, я так и не понял, за что авторы собираются брать деньги (но они еще и не решили, делать ли это, судя по форуму) и какие именно решения и особенности позволяют дать программному продукту столь звонкое имя. Особенно учитывая, какой объем софта предоставляют бесплатно многочисленные поставщики МК (конечно же, импортные). При просмотре форума фирмы в попытке найти ответ я увидел отсылки к упомянутому ранее программному продукту mcucpp (авторы МАКС якобы вдохновлялись идеями Чижова — три раза ха), в котором и обнаружил описанный выше мелкий недочет.

Вывод из данного раздела — если и остальные решения по импорто-замещению в области программных средств выполняются аналогичными способами со схожими по качеству результатами, то перспективы построения отечественных встроенных систем мне представляются весьма слабо определенными.

В заключение хотелось бы обратиться к руководству (нет, не разработчика ОС, я даже не хочу упоминать ее рядом с хорошими разработчиками) фирмы-разработчика и изготовителя упомянутого МК — Миландр. Вы делаете неплохие микроконтроллеры (врать не стану, уступающие по параметрам зарубежным аналогам, но не фатально), например, в свое время (в 2013) ВЕ1Т был чуть ли не топовым среди одноклассников, но на дворе 2019 год и за это время его многие догнали и перегнали.

Но, если у выпускаемых фирмой хороших МК нет:

нормальной (лучше бы хорошей, но я реалист) документации (я знаю, что Вы думаете, будто бы она есть, уверяю Вас, Вы заблуждаетесь),
недорогих версий в удобных корпусах (это у Вас есть),
недорогих и удобных в работе (малогабаритных) отладочных плат на базе пункта 2,
набора библиотек работы с периферией типа HAL, CMSIS (кое-что есть),
набора исчерпывающе документированных примеров использования компонент МК,
набора адаптированных и проверенных стеков под стандартные интерфейсы,
адаптированных и проверенных внешних компонент (3rd part), включая ОСРВ,
большого набора готовых примеров реализации стандартных устройств,
выверенных пакетов адаптации под популярные среды программирования,
своей собственной среды разработки, интегрированной с перечисленными компонентами (понимаю, что я замечтался, но может все таки ..) полностью готовой к использованию «из коробки»,
обучающих материалов, базирующихся на вышеперечисленном, включая печатную продукцию и организацию семинаров разного уровня (у Вас МИЭТ под боком, конечно, MIT был бы лучше, но «других писателей у меня для Вас нет»), то сфера возможного применения данных МК несколько сужается, не правда ли (воут?).

Конечно, все это не появится само собой и стоит денег, но мне кажется, что фирма Вашего уровня могла бы себе позволить себе работу 5 человек в течении полугода для создания всего выше перечисленного (за исключением пункта, возможно, пункта 10, хотя сейчас все значимые и многие небольшие производители МК обзавелись собственной IDE). Если это будет реализовано, то исчезнет почва для появления поделок типа ОС, описываемой в данном посте.

Хотя вполне может быть, что я чего то не знаю и на самом деле все не так просто, жаль, если это действительно так.

Заранее приношу свои извинения, если мои заметки покажутся излишне резкими, именно Вас (Миландр) я не хотел обидеть.

On the question of multiplication, square root, import substitution, and Milander

More articles: