Good day, dear community. I have a question that is not strictly about programming, but perhaps some of you have run into a similar problem. I am writing a program that breaks a simple substitution cipher using frequency analysis, and I decided to work with bigram frequencies. Instead of using ready-made frequency statistics, I would like to collect my own: I will feed the program texts in Russian, it will count the bigram frequencies in them and update the statistics table in a file. Here is the question itself: how should I update the statistics? My first thought was to take the arithmetic mean of a bigram's frequency from the file and from the new text and record that, but this is clearly wrong.

PS Forgive me if the question is off topic; I just had nowhere else to ask.

    2 answers

    What is "frequency" in the context of collecting these statistics?

    It is the ratio of the number of occurrences of one specific bigram in the text to the total number of bigrams.


    So, we have two texts and a certain bigram X.

    For the first text, which contains N1 bigrams in total, N1X of which are the bigram X, the frequency is P1X = N1X / N1.
    For the second text, correspondingly, P2X = N2X / N2.
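As an illustration, the per-text frequencies can be computed like this (a minimal Python sketch; the function name and the use of overlapping character pairs are my assumptions, not something fixed by the answer):

```python
from collections import Counter

def bigram_frequencies(text):
    # Overlapping character pairs: "abab" -> "ab", "ba", "ab".
    bigrams = [text[i:i + 2] for i in range(len(text) - 1)]
    counts = Counter(bigrams)
    total = len(bigrams)
    # Frequency of X = (occurrences of X) / (total number of bigrams).
    return {bg: n / total for bg, n in counts.items()}, total

freqs, total = bigram_frequencies("abab")
# freqs == {"ab": 2/3, "ba": 1/3}, total == 3
```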

    Now take the two texts together. As a rough approximation, assume this is a plain concatenation that creates no extra bigram at the junction.

    What we want to get is (N1X + N2X) / (N1 + N2): the number of times this bigram occurs, divided by the total number of bigrams in both texts. Obviously, the arithmetic mean of P1X and P2X gives something completely different. Moreover, it is clear that knowing the frequency alone is not enough: you also need to know at least the total number of bigrams in each text.

    You need a weighted average, where the weights are the sizes (in bigrams) of the corresponding texts. In fact, if in the formula above we express the counts in terms of frequencies, it falls out directly:

    (P1X * N1 + P2X * N2) / (N1 + N2)

    ... and this is all fine as long as the calculations are exact. But if you cannot afford to keep the probabilities to arbitrary precision, it is better to store the counts of the specific bigrams and their total number directly. Or, conversely, if you want to speed up the arithmetic at the cost of some error, you can store the probabilities in a limited-precision type.
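A sketch of the count-based approach the answer recommends (`Counter` and the sample bigram values are my choice; merging is then just addition, and frequencies are derived only when needed):

```python
from collections import Counter

# Counts accumulated from previously processed texts (sample values).
stats = Counter({"аб": 3, "ба": 1})
# Counts from a newly processed text.
new_counts = Counter({"аб": 1, "вг": 2})

stats.update(new_counts)      # exact merge: counts simply add up
total = sum(stats.values())   # total number of bigrams seen so far
freqs = {bg: n / total for bg, n in stats.items()}
# stats["аб"] == 4, total == 7, freqs["аб"] == 4/7
```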

    • Everything works, thank you very much for your help. - Alexander Pozdniakov

    In principle, the task reduces to calculating an arithmetic mean without remembering all the measurements.

    Once upon a time I had a calculator that could compute the average of an array of numbers. For a long time I thought it remembered all the numbers, but it clearly did not have that much memory. I still do not know exactly how it did it, but I found three ways.

    • Store not the average itself, but the sum and the count. The average can then be computed at any time.
    • Use the formula "new average" = ("old average" * "count" + current number) / ("count" + 1), where "count" is the number of values over which the average has already been calculated.
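The second approach can be sketched as a running mean (the incremental form `mean += (x - mean) / count` is the same update rearranged, and it tends to lose less precision):

```python
def running_mean(numbers):
    """Average of a stream without storing it: keep only mean and count."""
    mean = 0.0
    count = 0
    for x in numbers:
        count += 1
        # Same as ("old average" * (count - 1) + x) / count, rearranged.
        mean += (x - mean) / count
    return mean

print(running_mean([1, 2, 3, 4]))  # 2.5
```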

    But these two methods have a significant drawback: if the numbers being averaged span too large a range, they produce too much rounding error.

    That is where the third way comes to the rescue: add the numbers up in two (or three) groups, summing the large numbers in one group and the small ones in another.
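A minimal sketch of that grouping idea (the threshold separating "small" from "large" values is an arbitrary assumption of mine):

```python
def grouped_mean(numbers, threshold=1e6):
    """Sum small and large values in separate accumulators before dividing,
    so small terms are not swallowed by one huge running sum."""
    small_sum = 0.0
    large_sum = 0.0
    count = 0
    for x in numbers:
        if abs(x) < threshold:  # threshold is a hypothetical cut-off
            small_sum += x
        else:
            large_sum += x
        count += 1
    return (small_sum + large_sum) / count
```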

    In general, I think that for your case it is enough to simply remember the total count of each pair found. When you get statistics from the next text, just add up the counts for each pair. The count itself can safely be stored in an ordinary int; most likely that will be enough.