The task itself: an N-gram is a sequence of N words. For the solution we can assume that every word has been assigned a number, so we work with sequences of N integers.
Learning problem: Suppose there is a sequence X = {x(i)}, where i is the index (i = 1..T). This sequence is usually referred to as the training data set. It is necessary to calculate and store the occurrence probability of every subsequence of N consecutive values: p(x(i), x(i+1), ..., x(i+N)) = c(x(i), x(i+1), ..., x(i+N)) / (T - N), where c(x(i), x(i+1), ..., x(i+N)) is the number of times the subsequence x(i), x(i+1), ..., x(i+N) occurs in X.

Calculation of the probability for N-grams: For a sequence of numbers y(i), y(i+1), ..., y(i+N) it is necessary to return the probability of its occurrence in the sequence X - p(x(i), x(i+1), ..., x(i+N)).

The learning part is more or less clear to me, but the probability calculation for an N-gram is not. What are these numbers y(i), y(i+1), ..., y(i+N)? Are they arbitrary numbers, whereas x(i+1), ..., x(i+N) are numbers that necessarily occur in X? And what is X - p(x(i), x(i+1), ..., x(i+N))? X is a set, while p(x(i), x(i+1), ..., x(i+N)) is a number. And how do I then calculate the probability of occurrence of the numbers y(i), y(i+1), ..., y(i+N)? Could someone please explain this in more detail?
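
One possible reading of the task, as a minimal Python sketch rather than a definitive solution. It assumes that y(i), ..., y(i+N) is simply an arbitrary query window, that the dash in "X - p(...)" is punctuation rather than a subtraction, and that a window never seen in X gets probability 0. The names train_ngram_counts and ngram_probability and the window-length parameter n are made up for illustration. The denominator is the number of windows, len(xs) - n + 1, which equals the T - N from the task when a window runs from x(i) to x(i+N).

```python
from collections import Counter


def train_ngram_counts(xs, n):
    """Learning step: count every window of n consecutive values in xs."""
    windows = [tuple(xs[i:i + n]) for i in range(len(xs) - n + 1)]
    # Return the counts c(...) and the total number of windows (the denominator).
    return Counter(windows), len(windows)


def ngram_probability(counts, total, ys):
    """Query step: relative frequency of the window ys in the training data."""
    if total == 0:
        return 0.0
    # Counter returns 0 for a window that never occurred (assumed behaviour).
    return counts[tuple(ys)] / total


# Toy training sequence of word numbers (made-up data).
X = [1, 2, 3, 1, 2, 4, 1, 2, 3]
counts, total = train_ngram_counts(X, 3)

p_seen = ngram_probability(counts, total, [1, 2, 3])    # window occurs twice in X
p_unseen = ngram_probability(counts, total, [3, 3, 3])  # window never occurs in X
```

Under this reading the query numbers do not have to come from X: the lookup just reports how often the same window was counted during learning.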