On the issue of transformations and other operations

Blue Caterpillar: Well, you can’t kill us. We sit, we know: they are waiting for transformation. Why? And nothing! Sit, smoke, wait ...
Alice- doll: What?
Blue Caterpillar: What, what! Transformations. The house is in smoke, the smoke is in the lady, and the lady is in the mother. That's it. Do not interfere, do not jump forward, otherwise you will turn into some kind of butterfly prematurely.

Looking through the code on one of the forums devoted to Arduino, I discovered a fun way to work with a floating-point number (PT). The second common name for numbers in this format is floating point, but the abbreviation that arises in this case (PZ) personally causes quite different associations to me, so we will use this option. The first impression (from the seen code) - what kind of garbage is written here (I must say that the second is the same, although there are nuances, but more on that later), but the question arises - how is it really necessary - the answer to which is given in further text.

Part one - posing the question

We formulate the task - we need to output to the console (turn into a character representation) a floating-point number, without using the print options, intended for this purpose. Why do we want to do this on our own -

The use of the% f format entails connecting a floating-point library and an extended version of the prntf function (or rather, makes it impossible to use its truncated version), which leads to a significant increase in the size of the executable module,
A standard solution requires considerable time (it always works with double precision), which may be unacceptable in this particular situation,
well (last, but not least), it's just interesting.

To begin, consider the option that was proposed in the above material, something like:

for (float Power10=10000.0; Power10>0.1; Power10/=10.0; ) {char c=(int)(Fdata/Power10); Fdata -=Power10*c; };

and agree that the task he is completely solves. Moreover, this is not the worst option, since its speed can be quite acceptable. Now let's take a closer look at this point - we see the division of numbers of PTs, but if we delve deeper into the essence of the issue, it turns out that it is almost as fast as the division of integers of the corresponding capacity. Actually, before evaluating the performance of the algorithm, you should evaluate the performance of various elementary operations, which we will do.

Part two - performance evaluation of elementary operations

The first interesting operation is addition (subtraction, in the sense of time spent, they are equivalent) for integers, and we can assume that it takes a unit of time (tact) with the following proviso — this is true only for “native” data. For example, for the AVR MK series it is an 8-bit word, for MSP430 it is a 16-bit word (and, of course, smaller), for Cortex-M it is a 32-bit word and so on. Then the operation of adding numbers of length N times larger than the native can be estimated as H cycles. There are exceptions, for example, AddW in AVR controllers, but it does not cancel the rule.

The next operation is the multiplication of integers (but not division, it differs in the sense of speed) and for it is not so simple. First of all, multiplication can be implemented in hardware and, for example, in AVR MEGA it requires 2 ticks, and in the improved 51 integers 6 (to multiply the native numbers).

But consider the case when there is no hardware implementation and we will have to implement multiplication as a subroutine. Since multiplying H bit numbers yields a 2N bit product, we can find an estimate of the classical version with shifts as follows: we need H shifts of a factor with 1 tick per shift, H shifts of the second factor of 2H length with 2 ticks per shift, then N decisions and , on average, N / 2 addition of numbers of length 2H, at the end of the organization of a cycle of 2 cycles. Total H + 2H + H + 2H / 2 + 2H = 7N cycles, and the actual execution of arithmetic operations of them takes only H cycles (efficiency wow, although we managed to get around the engine).

That is, to multiply two 8p numbers by 8p MK 56 clocks will be needed, and to multiply 16p numbers there are already 112 (slightly less, but neglect the exact value) clocks, which is slightly more than desired. Fortunately, the direction of the shifts can be modified and there is, moreover, the only way to carry out multiplication, which will require only H shifts of a number of 2H digits and H / 2 additions of native numbers, which improves the operation time of the multiplication algorithm to 0 + 2 + 1 + 1/2 + 2 = 5.5N - of course, it’s impossible to compare with the hardware implementation, but at least some gain without loss of functionality. There are improvements to this algorithm, for example, analysis of 2 bits per cycle, but they do not fundamentally assess the situation - the time of multiplication is orders of magnitude greater than the time of addition.

But with the division, the situation is worse - even the hardware-implemented division loses the multiplication by almost two times, besides there are MCs with hardware multiplication, but without hardware division. Under certain conditions, it is possible to replace division by multiplication by the reciprocal, but these conditions are specific and give a similar result - two iterations of multiplication with the subsequent sum are required, so the loss is 2 times. If we implement division as a subroutine, then we need H divider shifts of 2H length, H subtractions from the dividend of 2H length, H result shifts, 2H cycle organization, but all this is preceded by alignment, which will take another 5H ticks, so the total figure will be 2 + 2 + 1 + 2 + 5 = 12H, which is about 2 times worse than multiplication.

Now let's consider the operations of the PT, and here the situation is somewhat paradoxical - the multiplication operation takes almost as much time as for integers (the corresponding bit depth, as a rule, 24 bits), since we have to multiply the mantissas and just add up the orders, the normalization is not required. The division is also good, we divide the mantissas and subtract the orders, normalization is again not needed. Therefore, for these two operations, the loss compared with the integers is not very significant, although it does take place.

But the operation of addition and subtraction requires, first of all, alignment of orders (and these are shifts and there may be many, although there are nuances), then the actual operation, and (when subtracting) normalization (if added, too, but no more than 1 shift is required ), which is wasteful in time, therefore, operations of this class are significantly slower for PT than for integers, especially in relative terms.

Let us return to our sheep and agree that, based on earlier estimates, the proposed method may not be too long, especially since it immediately gives a result, but it has a significant limitation - it is applicable to a very limited range of input values of PT. Therefore, it will look for a universal (more or less) solution.

Immediately make a reservation that our solution should not use floating point operations at all (from the word at all) to emphasize the merits of our option. And the puzzling question is where from when there will be a number of this type, if operations are not available, we answer - it may well appear, for example, when reading information from the light sensor (as it was in the original example), which gives the data in PT format.

How exactly the number of PTs are arranged, you can easily find on numerous sites, there was an article on Habré recently, this should not cause problems. Nevertheless, a number of questions are of interest to the PT format in the style of “if I were the director” - why this is so and not otherwise. I will give my answers to some of them, if anyone knows the more correct ones, I ask in the comments.

The first question is why the mantissa is stored in the direct code, and not in the additional code? My answer is because it is easier to work with a normalized mantissa with a hidden (optional) bit.

The second question is why the order is stored with offset, and not in any other way? My answer is because in this case it is easy to compare the modules of two PTs as integers, with other methods it is more difficult.

The third question is why a negative sign is encoded by a unit, and not by zero, because then it would be possible to simply compare two PTs as integers? My answer is - I don’t know, just “it’s so accepted here.”

Part Three - Necessary Explanations

In the previous paragraph, I could give incomprehensible terms, so a little about the representation of numbers. Of course, they are different, otherwise there would be no need to discuss them. Immediately, we note that the memory of the MC (the same is true about computers, although I’m not so categorical about the most modern architectures - they are so complex that you can expect everything) there are no numbers, there are only elementary storage units - bits, grouped in bytes and further into words. When we talk about the representation of a number, we mean that we somehow interpret a set of bits of a specific length, that is, we define a law by which we can find a certain number corresponding to a given set of bits, and nothing more.

There are countless such laws, but some of them will have a number of useful properties in terms of various operations, so they will be more often used in practice. One of these properties, which is implicitly implied, for example, is determinism, and the other - independence from the environment - properties, at first glance, are obvious, although there are nuances. Other properties of the type of one-to-one correspondence are already a subject of discussion and do not always have a place in a particular representation. In itself, the topic of the representation of numbers is extraordinarily fascinating, with Knut (in volume two) it is fully revealed, so that there is a depth behind it, and we will run over the surface.

Assuming that a set of bits has a length n (we number them in a row from 0 to n-1) and weighted evenly in increments of 2 and the least significant bit (number 0) has a weight of 1 (which, generally speaking, is not necessary at all, we just they are used to such things, and they seem obvious to us) we get a binary representation of a number, in which the reduction formula looks like this: a number displayed by a set of bits (Ч2) = В(0)*2^0 + В(1)*2^1 + ... + В(н-1)*2^(н-1) or in cascade form Ч2(н) = В(0)+2*(В(1)+2*(...+2*(В(н-1))..))) , hereinafter B (k) denotes the bit with the number k. Note that under This representation does not impose any restrictions on the location of the bytes of the number in memory, but it would be more logical to place the low byte in the lower addresses (that's how easy and easy I resolved the “eternal dispute between Slavs among themselves” regarding which end it is more convenient to break an egg).

With this interpretation of a set of bits of length n (= 8), we get a representation for numbers from 0 to (2 ^ n) -1 (= 255) (hereinafter, in brackets there will be a specific value for a set of 8 bits), which has a number of remarkable and useful properties, why it is widespread. Unfortunately, it also has a number of drawbacks, one of which is that we cannot, in principle, present negative numbers in such a record.

You can offer a variety of solutions to this problem (representation of negative numbers), among which there are those that have practical significance, they are listed below.

The representation with offset is described by the formula Ч = 22 (n) - the offset (C), where 22 is the number obtained in binary recording with n bits, and C is some preselected value. Then we represent numbers from 0-С to 2 ^ (n) -1-С and, if we choose C = 2 ^ (n-1) -1 (= 127) (this is not at all necessary, but very convenient), then get the range from 0- (2 ^ (n-1) -1) (= - 127) to 2 ^ (n-1) (= 128). The main advantage of this view is monotony (moreover, increase) throughout the interval, there are also drawbacks, among which we highlight asymmetry (there are others related to the complexity of performing operations on the number in this view), but the developers of the IEEE 457 standard (this is the standard for PT) turned this disadvantage into dignity (using the extra value for the nan coding of the situation), which once again underlines the correctness of the class statement: “If you are higher than the enemy, then this is your advantage. If the enemy is higher than you, then this is also your advantage. ”

Note that since the total number of possible combinations of any number of bits is even (if you do not have combinations forbidden for religious reasons), the symmetry between positive and negative representable numbers is fundamentally unattainable (or rather, achievable, but under certain additional conditions, more on) .

Representation in the form of a direct code when one of the bits (most significant) represents the coded sign of the number H = (-1) ^ B (n-1) * P2 (n-1) has a range from 0- (2 ^ (n-1) -1) (= -127) to 2 ^ (n-1) -1 (= 127). It is interesting to note that I have just stated that symmetry is impossible in principle, and here it clearly is - the maximum representable positive number is equal to the modulus of the minimum representable negative number. This result is achieved by having two representations for zero (00 ... 00 and 10 ... 00), which is usually considered the main disadvantage of this method. This is really a disadvantage, but not as scary as is commonly believed, since there are more significant ones that have limited its use.

Representation in the form of a reverse code, when in the direct representation we invert all bits of the value for negative numbers H = (1-B (n-1)) * Ch2 (n-1) + B (n-1) * (2 ^ (n -1) -CH2 (n-1)) - this is from the definition, you can make a much more understandable formula H = P2 (n-1) -B (n-1) * (2 ^ (n-1) -1), which allows to represent numbers from 0-2 ^ (n-1) +1 (= - 127) to 2 ^ (n-1) -1 (= 127). It can be seen that this representation is with displacement, but the displacement changes in steps, which makes this representation not monotonous. Again we have two zeros, which is not very scary, the occurrence of a circular transfer when adding is much worse, which creates certain problems in the implementation of the ALU.

It is extremely simple to eliminate the last drawback of the previous view, it is enough to change the offset by one, then we get H = P2 (n-1) -B (n-1) * 2 ^ (n-1) and we can represent numbers from 0-2 ^ ( n-1) (= - 128) to 2 ^ (n-1) -1 (= 127). It is not difficult to see that the representation is asymmetric, but zero is unique. The following property is much more interesting, “it is absolutely clear that” ring transfer for an operation of the type of addition does not arise, which was the reason (along with other pleasant features) of the general distribution of this particular method of coding negative numbers.

Create a table of interesting values for different ways of coding numbers, denoting by H the value 2 ^ (n-1) (128)

Bits	00..00	01..11	10..00	11..11
H (n)	0	H-1 (127)	H (128)	2 * Н-1 (255)
H (n-1)	0	H-1 (127)	0	H-1 (127)
Offset H	-N + 1 (-127)	0	one	H (128)
Straight	0	H-1 (127)	0	-N + 1 (-127)
Back	0	H-1 (127)	-N + 1 (-127)	0
Additional	0	H-1 (127)	-N (-128)	-one

Well, in the end of the topic, we present the graphs for the listed representations, from which their merits and demerits are immediately visible (of course, not all that bring to mind the interesting saying “The advantage of graphical presentation of information is in clarity, and it has no other advantages”).

Part Four - the actual solution to the original problem (better late than never).

Small retreat

For a start, I wanted to print the PT in hexadecimal format (and ultimately I did it), but quite expectedly / completely unexpectedly (I needed to substitute) came across the following result. What, in your opinion, will be printed as a result of the execution of statements:

 printf("%f %x", 1.0,1.0); printf("%f %x",2.0,2.0); printf("%x %d",1.0,1.0); printf("%x %d",2.0,2.0);

, also pay attention to the following construction and its result:

 printf("%x %x %f",1.0,1.0);

I will not give explanations to this phenomenon, “smart enough”.

However, how can we correctly print a hexadecimal representation of the PT? The first solution is obvious - union, but the second one is for fans of one-liners printf ("% x", * ((int *) (& f))); (I apologize if anyone was offended by the extra brackets, but I could never, and was not going to, remember the priorities of operations, especially if one considers that the brackets of the code do not generate, so I will continue this way further). And here it is, the solution of the task - we see a string of characters, 0x45678, which unambiguously determine the required number for us, but in such a form that we (I don’t know how you are, I’m for sure) cannot say anything intelligible about this number. I think that Academician Karnal, who could have indicated an error in the punched tape with the source code, would have coped with this task, but not everyone is so advanced, so we will continue.

We will try to get information in a more understandable form.

To do this, we return to the PT format (hereinafter, I consider only float), which is a set of bits from which three sets of bits can be extracted (according to certain rules) to represent three numbers — the sign (h), the mantissa (m) and the order (n), and the desired number encoded by these numbers will be determined by the following formula: CH * CHM * CHP. Here, the characters H denote the numbers represented by the corresponding set of bits, therefore, in order to find the desired number, we need to know the laws by which we extract these three sets from the original set of bits, as well as the type of coding for each of them.

In solving this problem, we turn to the IEEE standard and find out that the sign is one (most significant) bit of the original set and the formula for encoding is Fz = (- 1) ^ B (0). The order occupies the next most significant 8 bits, written in a code with an offset of 127, and is a power of two, then PE = 2 ^ (22 (8) -127). The mantissa occupies the following order of 23 digits and is a number Fm = 1 + P2 (23) / 2 ^ 23.

Now we have all the necessary data and we can easily solve the task - to create a string with characters, which, with a certain reading, will represent a number equal to that encoded in the PT. To do this, we should by simple operations extract the above numbers, and then print them out, providing them with the necessary attributes. We assume that we can convert an integer with no more than 32 bits into a character string, then it is not difficult.

Unfortunately, we are just at the beginning, as very few of the readers of this post in the record "+ 1.625 * 2 ^ 3" find out the unlucky number that is encoded by the more familiar decimal "13", and only guess in the record "1.953125 * 2 ^ 9 "simple" 1E3 "or" 1 * 10 ^ 3 "or just the usual" 1000 ", only a few people are capable of anything, I definitely don’t belong to them. It was strange how it happened, because we completed the original task, which once again shows how to be attentive to the wording. And the point is not that decimal is better or worse than binary (in this case we mean two at the base of the degree), but the fact that we are used to decimal from childhood and it’s much harder to remake people than the program, so we’ll bring our write to more familiar.

From the point of view of mathematics, we have a simple operation - there is a record PT = (- 1) ^ c * m * 2 ^ n, and we need to convert it to the form PT = (-1) s' * m '* 10 ^ n'. We equate, transform, and obtain (one of the possible variants) the solutions s '= s', m '= m, n' = n * lg (2). If we leave out the need to multiply by an obviously irrational number (this can be done if the number is rationalized, but we will talk about this later), then the problem looks solved until we see the answer, because if the record is "+1.953125 * 2 ^ 9 "seems to us incomprehensible, then the entry" + 1.953125 * 10 ^ 2.70927 "is even less acceptable, although it seemed that there was no place worse.

Продолжаем улучшать решение и находим следующее решение — уравнения приведения к степени по основанию 10 м'=м * 10^{п * lg(2)}, п'= [п * lg(2)], где фигурными и квадратными скобками обозначены дробная и целая часть некоего числа соответственно. Тогда для рассматриваемого примера получим (1.953125*10^0.7 0927)*10^2=«10*10^2», что уже значительно более приемлемо, хоть и не идеально, но уже можно реализовывать.

Дело за малым, нам надо научиться:

умножать целое число (п) на заранее известное иррациональное (lg(2)) (это несложно при определенных ограничениях по точности результата);
брать целую и дробную часть числа с фиксированной точкой (это несложно);
возводить известное целое (10) в иррациональную степень (мда...);
умножать целое на произвольное иррациональное («мы упростим вычисления, говорили они ...»).

Тем не менее, попробуем двигаться в этом направлении и рассмотрим то, что сделать несложно, а именно пункт 1. Сразу отметим, что задача эта принципиально нерешаемая и мы не можем вычислить н * lg(2), от слова «совсем» не можем, за исключением тривиального случая н=0 (ну и очевидного случая н=к/lg(10)). Интересное заявление, особенно после утверждения «это несложно», но видимое противоречие снимается фразой «с определенной точностью». То есть мы все таки можем вычислить произведение произвольного целого на известное иррациональное и это несложно для результата с определенной точностью. Например, если нас интересует результат с точностью до одного процента, то представив искомый результат п' = п * lg(2) в виде п * [lg(2) *256 + 1/2] / 256 мы получим значение с нужной нам точностью, поскольку возможная относительная погрешность не может превзойти 1/2/77 = 1/144, что явно лучше, чем требуемая 1/100. Следует принять во внимание одно важное соображение — малая величина относительного отклонения ровно ничего не говорит о поведении функции при применении к ней нелинейного преобразования, а операция взятия целой части очевидно нелинейна. Приведем простой пример [4.501]=5, а [4.499]=4 и, несмотря на то, что относительное отклонение в исходных данных составит 0.002/4.5=0.04%, отклонение результата составит целых 1/4=25%. К сожалению, в общем виде задача не решается вообще, при применении любого алгоритма округления. Можно решить только частный случай, когда входные данные ограничены и, более того, принимают фиксированный набор значений, тогда подбором начального смещения и угла наклона можно получить абсолютно точную, в смысле округления, аппроксимацию.

Для нашего случая такой идеальной аппроксимацией будет функция п'=п*77/256

Прежде чем продолжить проектирование алгоритма, мы должны оценить необходимую нам точность. Поскольку мантисса 24 разрядная, то представимое ей число имеет относительную погрешность 2^-24=2^-4*2^-20=16^-1*(2^10)^-2~(10)^-1*(10^3)^-2=10^-7, что означает 7 точных десятичных цифр. Умножение двух 24 битовых чисел будет достаточно для удержания точности в этом диапазоне (ну почти достаточно). Сразу заметим, что переход к 32 битовым числам (оба сомножителя) уменьшает относительную ошибку более, чем в 100 (256) раз, этот факт нам дальше пригодится.

Самая неприятная в смысле точности формула вычисляет новую мантиссу и выглядит так

м' = м * 10^{п * lg(2)}

Почему она самая неприятная — 1) она содержит цепочку вычислений относительно п и погрешность будет накапливаться, 2) в ней есть крайне плохая с точки зрения точности операция, и это не взятие дробной части, поскольку, если делать ее одновременно со взятием целой части, все не так уж и плохо, а экспонента. Если остальные операции — это умножения и относительные погрешности при этом просто складываются, являются прогнозируемыми и зависят исключительно от длины разрядной сетки представления операндов, то для экспоненты все крайне плохо и совершенно очевидно, что относительная погрешность будет весьма велика при больших значениях аргумента.

«Ну да, совершенно очевидно, что»

q(10^x) = Δ(10^x)/10^x = (10^(x +Δx) — 10^x)/10^x = 10^Δx -1 = 10^(x*qx)-1,
10^(x*qx) >~ 10^(x*0) + (10^(x*0))'*qx = 1 + x*ln(10)*10^(0)*qx = 1+x*ln(10)*qx,

отсюда получаем

q (10^{x}) = x * l n (10) * q x .

$q(10^x) = x*ln(10)*qx.$

Что означает приведенное выражение — то, что на краю диапазона значений, при п=127, относительная погрешность вырастет в 292 раза и для того, чтобы удержать точность результата в требуемом пределе, нам надо точность аргумента существенно поднять.

Вспомним, что переход от 24 к 32 бит дает нам требуемое увеличение точности (не совсем дает, но весьма близко), отсюда понимаем, что первое умножение (п*lg(2)) следует проводить с 32 разрядными операндми, то есть с такой точностью следует выразить логарифм двойки, тогда он будет равен 1'292'914'005/2^32. Отметим, что в коде числитель этой константы следует записываться, как (int)((lg(2)*float(2^32))+0.5), но ни в коем случае не как загадочное 0х4d104d42, пусть даже и с комментарием относительно его вычисления, поскольку хорошо написанный код самодокументирован.

Далее нам потребуется целая часть результата, это несложно, поскольку мы точно знаем положение десятичной точки в обоих сомножителях и, соответственно, в результате.
А вот далее нам надлежит вычислить 10 в степени от 0 до 1 и здесь для получения требуемой точности прибегнем к небольшому трюку. Поскольку, согласно формуле для погрешности, точность на правом краю диапазона падает более, чем двое, то если мы представим значение аргумента как сумму логарифма десятичного двух и некоторого остатка, п''=lg(2)*i+(п''-lg(2)*i), то первый член суммы приведет к умножению на 2 в соответствующей степени, что легко осуществить с нулевой погрешностью (пока не наступит переполнение), а оставшийся будет ограничен значением lg(2) и мы не потеряем в точности при вычислении 10^п'' (при условии правильного вычитания).

Тем не менее, экспоненциальную функцию для ограниченного значением lg(2) аргумента все равно придется вычислять и единственный путь, который я вижу, это разложение в ряд Тэйлора. К сожалению, он не слишком быстро сходится на нашем диапазоне, и, например, для достижения точности в 10Е-7 нам потребуется 9 членов суммы, что приводит к необходимости осуществить 1+9*2=19 умножений 32-разрядных целых чисел, что несколько превышает желательные показатели. Также все еще остаются смутные сомнения относительно нашей способности провести вычисление п'=п*lg(2) с такой точностью, чтобы ее хватило при максимальном значении п.

Тем не менее агоритм получился вполне работоспособный и нам требуется для получения результата только 32-разрядное умножение в количестве 1+19+1=21 операций что и определяет вычислительную сложность алгорима

Можно ли уменьшить вычислительную сложность нашего преобразования — вроде бы нет, мы все аккуратно посчитали — но внезапно оказывается, что все таки можно. Несколько неожиданное заявление, ключ к пониманию такой возможности лежит в природе порядка ПТ — он принимает фиксированный (и относительно небольшой) набор значений, а мы с Вами при выводе формул преобразования это не учитывали, а неявно работали с непрерывной величиной.

Простейшее решение — меняем память на время — заранее вычислить для всех возможных (2^8=256) показателей степени п соответствующие значения [п'] (наиболее подходящего показателя степени 10) и {п'} (коректирующего множителя для мантиссы), занести их в таблицу и далее просто использовать в процессе вычислений. Формула получается совсем несложной — ПТ=м*2^п=м*10^п'*(2^п/10^п')=(м*(2^п/10^п'))*10^п'.

В простейшем случае нам потребуется 256*3 (корректирующий множитель из 24 разрядов, больше не нужно) + 256*1 (порядок по основанию 10 гарантированно меньше порядка по основанию 2) = 1кбайт констант. При этом нам остается сделать всего лишь одно умножение 24*24 разряда (скорее всего это будет 32*32), что существенно ускоряет работу по сравнению с вариантом настоящего вычисления.

Посмотрим, что можно сделать с точки зрения экономии памяти (при этом нам опять придется платить временем, так что ищем разумный компромисс). Прежде всего, если мы будем учитывать знак порядка отдельно, то можно обойтись только половиной требуемой памяти (из 256 байт для порядка 10) и проинвертировать результат в случае необходимости. К сожалению, с корректирующим множителем так просто не получится, поскольку

2^-п/10^-п' = 1/(2^п/10^п') != 2^п/10^п',

it's a pity. Мы либо должны оставить длинную таблицу, либо для отрицательных показателей проводить деление на константу для положительных показателей. Конечно, деление — это не 18 умножений, но все равно по скорости оно двум умножениям точно эквивалентно, так что время точно возрастет в два раза, чтобы сэкономить память в два раза, до 512 байт. Стоит ли оно того — вопрос непростой, но, к счастью, у нас есть значительно более красивый способ, который позволяет избавиться от страданий выбора.

Способ этот в общем случае называется кусочно-линейная аппроксимация и заключается в задании констант не для каждой точки исходного значения, а лишь для некоторых и вычисления отсутствующих значений (с требуемой точностью) с использованием заданых значений по несложной формуле. Применительно к нашей задаче получается (не учитывая знак)

ПТ=м*2^п=м*2^(п0+п1)=м*10^п'*(2^(п0+п1)/10^п')=м*(2^п0/10^п')*2^п1*10^п',

где п0-некоторое опорное значение, а п1=п-п0. Тогда новая мантисса вычисляется путем умножение двух чисел с фиксированной точкой с последующими сдвигом результата, который точности не ухудшает.

Тогда возникает законный вопрос — а зачем нам вообще таблица, ведь можно в качестве п0 взять минимальный показатель и обойтись всего одним значением корректирующего множителя? К сожалению, такой подход контрпродуктивен в силу двух взаимодополняющих обстоятельств — необходимости получить наиболее подходящий показатель степени 10 и появлением очень длинных сдвигов при подобном подходе. Последнее обстоятельство наводит на мысль о границах применимости подобного метода — если мы проводим умножение 32*32, а исходная мантисса имеет 24 разряда, то сдвиг на 8 разрядов не приведет к переполнению и нам потребуется одна опорная точка на 8 значений двоичного порядка. Тогда общий объем требуемой памяти составит 256/8*4=32*4=128 байт — хорошая экономия памяти ценой времени выполнения из за необходимости сдвига целого результата произведения на максимум 8 разрядов.

Можно еще немного сократить объем констант за счет симметричности показателя степени относительно 0, о чем я говорил ранее, но экономия составит 32/2=16 байт, не уверен, что это оправдает усложнение (и увеличение длины кода) собственно программы.

Кстати, недавно смотрел код широко известной в узких кругах библиотеки adafruit и был слегка удивлен следующим фрагментом кода

 const UINT8 Bits[] = {0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80}; ... data = data | Bits[n];

с комментарием, что операция 1 << n на AVR исполняется долго. В своем посте я уже показывал, какие чудеса творит компилятор при константной параметре, но это не тот случай.

Мне показалось сомнительным, чтобы взятие битовой маски из массива было быстрее, нежели выполнение непосредственно операций сдвига, и последующий анализ кода (с помощью сайта godbolt, хотя и крайне маловероятно, чтобы его создатель читал Хабр, тем не менее в очередной раз приношу ему свою искреннюю благодарность) показал, что это действительно так.

Сгенерированный компилятором код для обоих вариантов (вот правильный вариант со сдвигами с учетом особенности задачи, ведь нам нужен только 1 бит)

  ldi r18,lo8(4) sbrs r25,1 ldi r18,lo8(1) sbrc r25,0 lsl r18 sbrs r25,2 swap r18

занял ровно одинаковое место в памяти, а если сделать все аккуратно на ассемблере, то вариант с индексом вырывается вперед 8:7 за счет лишних 8 байт программы (разумеется, если мы не воспринимаем всерьез воистину восхитительное решение с отдельным хранением инвертированной маски, которое обойдется в 16 байт — и ЭТО используют повсеместно — «я знал, что будет плохо, но не знал, что так скоро»). Ну да упомянутый пакет — это вообще отдельная песня, как нельзя лучше описываемый следующей цитатой из одной замечательной книги: «Этот замок просился на обложку труда по фортификации с подписью „Как не надо строить замки или найди 12 ошибок“ (»Последний кольценосец", если кто не читал, рекомендую).

Вернемся к нашим баранам с плавающей точкой и построим результирующую формулу

ПТ = м * 2^п=(м * пк[п/8]) * 2^(п%8) * 10^пп[п/8],

где квадратные скобки означают взятие элемента массивов пк-коррекция показателя и пп-порядок показателя. Сразу видна вычислительная сложность алгоритма, которая определяется умножением 32*32(24*24) и последующими сдвигами. Дальше можно учесть возможность объединения в одном 32 разрядном слове показателя степени 10 и корректирующего множителя, это оставим на долю пытливого (и терпеливого, ведь он дочитал до конца) читателя данного поста.

Единственное замечание напоследок — когда будем создавать таблицу констант, ни в коем случае нельзя это делать в следующем стиле

const uint32_t Data[32] PROGMEM = { 0xF82345,… }

и дело, конечно, не в атрибутах описания массива, а в самих данных в виде магических чисел. Как справедливо отмечали авторы, точно не глупее меня, хорошо написанный код самодокументирован и, если мы напишем вышеприведенную константу (и остальные) в виде

 #define POWROUD(pow) ((uint8_t)((pow & 0x07)*log(2)+0.5)) #define MULT(pow) (2^pow / 10^POWROUND(pow)) #define MULTRAW(pow) (uint32_t((MULT(pow) << 24) +0.5)) #define BYTEMASK 0xFF #define POWDATA(pow) ((POWROUND(pow) & BYTEMASK)| (MULTRAW(pow) & (~BYTEMASK))) const uint32_t Data[(BYTEMASK/8)+1] = { POWDATA(0x00),POWDATA(0x08), ..POWDATA(0xF8)}

то нам никто не пришлет недоуменных вопросов, а если кто и пришлет, мы можем точно на них не отвечать, все равно бесполезно.

Можно предложить модификацию данного метода, в которой подходящая степень десяти будет вычисляться не для правого края сегмента, а для левого и тогда результат будет сдвигаться не вправо, чтобы учесть степень двойки, а влево. С точки зрения математики способы абсолютно равнозначные. Посмотрим на полученный результат:

1.953125*2^9=1.953125*2^(8+1)=1.953125*42949673/256/256/256(2.56)*2*10^2=10*10^2

тут уже очень легко узнать 1000. Конечно, надо бы еще преобразовать полученные мантиссу и порядок в строки, аккуратно выполнить округление, подогнать результат к требуемому формату, добавить знак, учесть особые случаи и так далее, но это все уже не так интересно, главную часть преобразования мы выполнили.

Source: https://habr.com/ru/post/439578/

On the issue of transformations and other operations

Part one - posing the question

Part two - performance evaluation of elementary operations

Part Three - Necessary Explanations

Part Four - the actual solution to the original problem (better late than never).Small retreat

More articles:

Part Four - the actual solution to the original problem (better late than never).

Small retreat