I'm working through articles from Habr to learn text processing in scikit-learn. With the test sample everything works fine, but when I load my own database, every text gets classified as 'first'. What am I doing wrong? And a follow-up question: is it possible to show the probability that a text belongs to a given class?

    from sklearn.datasets import load_files
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_extraction.text import TfidfTransformer
    from sklearn.naive_bayes import MultinomialNB

    categories = ['first', 'second', 'third']
    a = load_files('db', encoding='utf-8', categories=categories)

    count_vect = CountVectorizer()
    X_train_counts = count_vect.fit_transform(a.data)
    tfidf_transformer = TfidfTransformer()
    X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)

    clf = MultinomialNB().fit(X_train_tfidf, a.target)

    docs_new = ['God is love', 'OpenGL on the GPU is fast']
    X_new_counts = count_vect.transform(docs_new)
    X_new_tfidf = tfidf_transformer.transform(X_new_counts)
    predicted = clf.predict(X_new_tfidf)

    for doc, category in zip(docs_new, predicted):
        print('%r => %s' % (doc, a.target_names[category]))


    Use the predict_proba() method.

    Example:

    Initial data:

     In [18]: import numpy as np; from sklearn.naive_bayes import MultinomialNB

     In [19]: X = np.random.randint(5, size=(6, 100))

     In [20]: y = np.array([1, 2, 3, 4, 5, 6])

     In [21]: clf = MultinomialNB()

    train the model:

     In [22]: clf.fit(X, y)
     Out[22]: MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

    predict the class:

     In [23]: clf.predict(X[2:3])
     Out[23]: array([3])

    all classes:

     In [24]: clf.classes_
     Out[24]: array([1, 2, 3, 4, 5, 6])

    predict probabilities for all classes:

     In [25]: clf.predict_proba(X[2:3])
     Out[25]:
     array([[  4.69205412e-31,   9.16479809e-30,   1.00000000e+00,
               2.47492746e-28,   2.13947776e-31,   2.04949820e-34]])
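    To see which probability belongs to which class, zip the `predict_proba` row with `clf.classes_`. A minimal sketch on toy data shaped like the session above (the fixed random seed is my addition for reproducibility, so the exact numbers will differ):

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy data: 6 samples, 6 classes, same shape as the session above.
rng = np.random.RandomState(0)
X = rng.randint(5, size=(6, 100))
y = np.array([1, 2, 3, 4, 5, 6])

clf = MultinomialNB().fit(X, y)

# One row of probabilities, one column per entry in clf.classes_.
proba = clf.predict_proba(X[2:3])[0]
for cls, p in zip(clf.classes_, proba):
    print('class %r: %.6f' % (cls, p))
```

    In the question's code the columns of `predict_proba` line up with `clf.classes_` in the same way, and those values index into `a.target_names`, so you can print a probability next to each category name.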