The task itself: An N-gram is a sequence of N words. When solving, we can assume that all words are numbered, and work with a sequence of N integers.
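The word-numbering step mentioned above could be sketched as follows. This is only an illustration: the function name `encode_words`, the whitespace tokenization, and the choice to number words in order of first appearance are assumptions, not part of the task statement.

```python
def encode_words(text):
    """Map each distinct word to an integer id (order of first appearance)."""
    vocab = {}  # word -> integer id
    ids = []    # the text as a sequence of integer ids
    for word in text.split():
        if word not in vocab:
            vocab[word] = len(vocab)
        ids.append(vocab[word])
    return ids, vocab


ids, vocab = encode_words("the cat saw the dog")
print(ids)  # [0, 1, 2, 0, 3]
```

After this step, an N-gram is simply a run of N consecutive integers in `ids`.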

Learning Problem: Suppose there is a sequence X = {x(i)}, where i is the index (i = 1..T). This set is usually referred to as the training data set. It is necessary to calculate and store the occurrence probability of every subsequence of N consecutive values:

p(x(i), x(i+1), .., x(i+N)) = c(x(i), x(i+1), .., x(i+N)) / (T - N),

where c(x(i), x(i+1), .., x(i+N)) is the number of occurrences of the subsequence x(i), x(i+1), .., x(i+N) in X.

Calculation of probability for N-grams: For a sequence of numbers y(i), y(i+1), .., y(i+N) it is necessary to return the probability of its occurrence in the sequence X - p(x(i), x(i+1), .., x(i+N)).

The learning task is more or less clear to me, but the probability calculation for an N-gram is not. What are these numbers y(i), y(i+1), .., y(i+N)? Arbitrary numbers, whereas x(i+1), .., x(i+N) are numbers that necessarily occur in X? And what is X - p(x(i), x(i+1), .., x(i+N))? X is a set, while p(x(i), x(i+1), .., x(i+N)) is a number. And how do I use this to calculate the probability of occurrence of the numbers y(i), y(i+1), .., y(i+N)? Could someone please explain in more detail?

    1 answer

    Here is an example; maybe it will make things clearer.

    Suppose you have a sequence of zeros and ones:

    X = 1 0 0 0 1 0 0 0 

    and N = 4. Then we get the following subsequences of length 4:

     X = 1 0 0 0 1 0 0 0
        [1 0 0 0]
          [0 0 0 1]
            [0 0 1 0]
              [0 1 0 0]
                [1 0 0 0]

    The probabilities are obtained with the formula you were given, dividing each count by the number of 4-subsequences (5 here):

     p([1 0 0 0]) = 2 / 5 = 0.4
     p([0 0 0 1]) = 1 / 5 = 0.2
     p([0 0 1 0]) = 1 / 5 = 0.2
     p([0 1 0 0]) = 1 / 5 = 0.2

    Next, we need to calculate the probability for some sequence [y1 y2 y3 y4] . If [y1 y2 y3 y4] is in our list, the probability is already calculated (for example, if [y1 y2 y3 y4] = [0 0 0 1] , the probability is 0.2). If not, the probability will be, apparently, 0.
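The procedure described in this answer could be sketched in Python roughly as follows. The function names `train_ngrams` and `ngram_probability` are made up for illustration. The sketch assumes, as in the example, that each count is divided by the number of length-N windows in X, and that an N-gram never seen in X gets probability 0 (no smoothing).

```python
from collections import Counter


def train_ngrams(x, n):
    """Map each length-n window of x to its relative frequency among all windows."""
    windows = [tuple(x[i:i + n]) for i in range(len(x) - n + 1)]
    counts = Counter(windows)
    total = len(windows)  # 5 windows for the 8-element example with n = 4
    return {gram: count / total for gram, count in counts.items()}


def ngram_probability(model, y):
    """Look up the stored probability of y; unseen n-grams get 0 (assumption)."""
    return model.get(tuple(y), 0.0)


X = [1, 0, 0, 0, 1, 0, 0, 0]
model = train_ngrams(X, 4)
print(ngram_probability(model, [1, 0, 0, 0]))  # 0.4
print(ngram_probability(model, [0, 0, 0, 1]))  # 0.2
print(ngram_probability(model, [1, 1, 1, 1]))  # 0.0
```

Here y is an arbitrary query sequence: it does not have to come from X, and the lookup simply reports how often that exact window was seen during training.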

    • Why exactly these subsequences for N = 4? Did you leave out the ones that occur 0 times in the sequence? - rekrut
    • @rekrut: I simply listed the subsequences that occur in the given sequence X. - VladD
    • Got it. Thanks! - rekrut