Method for estimating the probability of n-grams

Question

The task itself: An N-gram is a sequence of N words. When solving the task, we can assume that all words are numbered and work with a sequence of N integers.

Learning problem: Suppose there is a sequence X = {x(i)}, where i is the index (i = 1..T). This sequence is usually called the training data set. It is necessary to compute and store the occurrence probability of every subsequence of N consecutive values: p(x(i), x(i+1), .., x(i+N)) = c(x(i), x(i+1), .., x(i+N)) / (T - N), where c(x(i), x(i+1), .., x(i+N)) is the number of occurrences of the subsequence x(i), x(i+1), .., x(i+N) in X.

Calculating the probability of an N-gram: For a sequence of numbers y(i), y(i+1), .., y(i+N), it is necessary to return the probability of encountering it in the sequence X - p(x(i), x(i+1), .., x(i+N)).

The learning task is more or less clear to me, but the probability calculation for an N-gram is not. What are the numbers y(i), y(i+1), .., y(i+N)? Are they arbitrary numbers, whereas x(i+1), .., x(i+N) are numbers that must occur in X? And what is X - p(x(i), x(i+1), .., x(i+N))? X is a set, and p(x(i), x(i+1), .., x(i+N)) is a number. And how do I use this to calculate the probability of occurrence of the numbers y(i), y(i+1), .., y(i+N)? If anyone can, please explain in more detail.

VladD · Answer 1 · 2013-03-09T11:41:12

Well, here's an example; maybe it will make things clearer.

Suppose you have a sequence of zeros and ones:

X = 1 0 0 0 1 0 0 0

and N = 4. Then we get the following subsequences of length 4:

    X = 1 0 0 0 1 0 0 0
       [1 0 0 0]
         [0 0 0 1]
           [0 0 1 0]
             [0 1 0 0]
               [1 0 0 0]

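The probabilities follow from the formula you were given: each count is divided by the number of windows, which is 5 here (T - N + 1 = 8 - 4 + 1 when a window holds N values; the T - N in your statement corresponds to windows x(i), .., x(i+N) of N + 1 values):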

    p([1 0 0 0]) = 2 / 5 = 0.4
    p([0 0 0 1]) = 1 / 5 = 0.2
    p([0 0 1 0]) = 1 / 5 = 0.2
    p([0 1 0 0]) = 1 / 5 = 0.2
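The counting step can be sketched in Python roughly as follows. This is only an illustration: it assumes the sequence is stored in a list and that each count is divided by the number of length-N windows (5 in this example). The function name `ngram_probabilities` is made up for this sketch.

```python
from collections import Counter


def ngram_probabilities(x, n):
    """Map every length-n window of x to its relative frequency."""
    # Slide a window of length n over x; there are len(x) - n + 1 windows.
    windows = [tuple(x[i:i + n]) for i in range(len(x) - n + 1)]
    if not windows:
        # The sequence is shorter than n, so there is nothing to count.
        return {}
    counts = Counter(windows)
    total = len(windows)
    return {window: count / total for window, count in counts.items()}


X = [1, 0, 0, 0, 1, 0, 0, 0]
table = ngram_probabilities(X, 4)
for window, p in table.items():
    print(window, p)
```

For the sequence above this yields four distinct windows, with [1 0 0 0] counted twice and the other three once each.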

Next, we need to compute the probability of some sequence [y1 y2 y3 y4]. Here the y values are simply the query N-gram; they do not have to occur in X. If [y1 y2 y3 y4] is in our list, its probability has already been computed (for example, if [y1 y2 y3 y4] = [0 0 0 1], the probability is 0.2). If it is not, the probability is apparently 0.
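The lookup itself can be sketched in Python as a dictionary query with a default of 0. The table is copied from the example above; the names `TABLE` and `probability` are hypothetical, and returning 0 for unseen sequences is the assumption made in this answer, not something the task statement spells out.

```python
# Probabilities of the 4-grams found in X = 1 0 0 0 1 0 0 0 (from the example).
TABLE = {
    (1, 0, 0, 0): 0.4,
    (0, 0, 0, 1): 0.2,
    (0, 0, 1, 0): 0.2,
    (0, 1, 0, 0): 0.2,
}


def probability(y, table=TABLE):
    """Return the stored probability of y, or 0.0 if y never occurs in X."""
    # Lists are not hashable, so convert the query to a tuple first.
    return table.get(tuple(y), 0.0)


print(probability([0, 0, 0, 1]))  # 0.2
print(probability([1, 1, 1, 1]))  # 0.0, this sequence does not occur in X
```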

Have you dropped those cases in which the number of occurrences in the sequence is 0?
@rekrut: I just selected the subsequences of the given sequence X.
