Here is the code from the answer to an earlier question:

    # https://ru.stackoverflow.com/questions/790609
    # Corpus download: http://study.mokoron.com/
    # positive: https://www.dropbox.com/s/fnpq3z4bcnoktiv/positive.csv?dl=0
    # negative: https://www.dropbox.com/s/r6u59ljhhjdg6j0/negative.csv?dl=0
    # join them together: type positive.csv negative.csv > pos_neg.csv
    #cols = 'id tdate tmane ttext ttype trep tfav tstcount tfol tfrien listcount'.split()
    from pathlib import Path
    import pandas as pd
    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import GridSearchCV
    from sklearn.linear_model import LogisticRegression
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import Pipeline, make_pipeline
    from sklearn.externals import joblib


    def fit_log_regression(X, y, **grid_kwargs):
        # pipeline: vectorize tweets (bag-of-words counts), then LogisticRegression
        pipeline = Pipeline([
            ("vect", CountVectorizer()),
            ("LogRegr", LogisticRegression())])
        param_grid = dict(
            vect__min_df=[3, 5],    # [2, 3, 5, 10]
            #vect__ngram_range=[(1,1),(1,2),(1,3),(1,4),(1,5),(2,2),(2,3),(2,4),(2,5)],
            vect__ngram_range=[(1,1), (2,5)],
            vect__analyzer=['word', 'char_wb'],
            LogRegr__C=[10, 100, 500],    # [0.1, 1, 10, 100]
            LogRegr__max_iter=[100, 200])
        # optimize hyperparameters over param_grid
        grid_search = GridSearchCV(pipeline, param_grid=param_grid, **grid_kwargs)
        grid_search.fit(X, y)
        return grid_search


    def fit_multinomial_nb(X, y, **grid_kwargs):
        # pipeline: vectorize tweets (bag-of-words counts), then MultinomialNB
        pipeline = Pipeline([
            ("vect", CountVectorizer()),
            ("MultinomNB", MultinomialNB())])
        param_grid = dict(
            vect__min_df=[3, 5],
            vect__ngram_range=[(1,1), (2,5)],
            vect__analyzer=['word', 'char_wb'],
            MultinomNB__alpha=[0.01, 0.05, 0.1, 0.5, 1.0])
        # optimize hyperparameters over param_grid
        grid_search = GridSearchCV(pipeline, param_grid=param_grid, **grid_kwargs)
        grid_search.fit(X, y)
        return grid_search


    def print_grid_results(grid_search):
        print('Best score {}'.format(grid_search.best_score_))
        print('-' * 70)
        print('Best estimator')
        print(grid_search.best_estimator_)
        print('*' * 70)
        print('Best parameters:')
        print('*' * 70)
        print(grid_search.best_params_)
        print('-' * 70)


    def main(path):
        # read data set into a DF; only the columns ['id','tdate','ttext','ttype']
        df = pd.read_csv(path, sep=';', header=None,
                         names=['id','tdate','ttext','ttype'],
                         usecols=[0,1,3,4])
        # Speed up: randomly select 10% of the data
        # (comment it out for better prediction performance)
        df = df.sample(frac=0.1)
        grid_lr = fit_log_regression(df['ttext'], df['ttype'], cv=3, verbose=1, n_jobs=-1)
        grid_nb = fit_multinomial_nb(df['ttext'], df['ttype'], cv=3, verbose=1, n_jobs=-1)
        print_grid_results(grid_lr)
        print_grid_results(grid_nb)
        # persist trained models
        joblib.dump(grid_lr, 'grid_search_lr.pkl')
        joblib.dump(grid_nb, 'grid_search_nb.pkl')
        features = np.array(grid_lr.best_estimator_.named_steps['vect'].get_feature_names())
        coefs = pd.Series(grid_lr.best_estimator_.named_steps['LogRegr'].coef_.ravel(), features)
        print('top 20 positive features:')
        print(coefs.nlargest(20))
        print('-' * 70)
        print('top 20 negative features:')
        print(coefs.nsmallest(20))
        print('-' * 70)
        test = pd.DataFrame({
            'ttext': ['Погода сегодня полная фигня, но настроение все равно отличное',
                      'Ну сходил я на этот фильм. Отзывы были нормальные, а оказалось - отстой!',
                      'StackOverflow рулит']
        })
        test['expected'] = [1, -1, 1]
        test['pred_lr'] = grid_lr.best_estimator_.predict(test['ttext'])
        test['pred_nb'] = grid_nb.best_estimator_.predict(test['ttext'])
        pd.options.display.expand_frame_repr = False
        print(test)


    if __name__ == "__main__":
        main(r'pos_neg.csv.gz')
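The script persists both fitted grid searches with joblib.dump but never shows loading them back. Below is a minimal, self-contained sketch of the same dump/load round-trip, using a toy pipeline and toy data instead of the real corpus, and the standalone joblib package (sklearn.externals.joblib has been removed from recent scikit-learn releases):

```python
import os
import tempfile

import joblib  # with old scikit-learn: from sklearn.externals import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# toy stand-in for the tweet corpus and its 1 / -1 labels
texts = ["отлично супер класс", "ужасно плохо отстой",
         "супер класс", "плохо ужасно"]
labels = [1, -1, 1, -1]

pipe = Pipeline([("vect", CountVectorizer()),
                 ("LogRegr", LogisticRegression())])
pipe.fit(texts, labels)

# persist, as main() does with grid_search_lr.pkl
path = os.path.join(tempfile.mkdtemp(), "grid_search_lr.pkl")
joblib.dump(pipe, path)

# later (e.g. in another script): reload and predict without retraining
restored = joblib.load(path)
print(restored.predict(["супер отлично"])[0])
```

The same joblib.load call works for the pickled GridSearchCV objects themselves; predicting through `grid.best_estimator_` then needs no refit.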

Output:

    [Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:  1.5min
    [Parallel(n_jobs=-1)]: Done 144 out of 144 | elapsed:  7.1min finished
    [Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:   37.6s
    [Parallel(n_jobs=-1)]: Done 120 out of 120 | elapsed:  2.4min finished
    Fitting 3 folds for each of 48 candidates, totalling 144 fits
    Fitting 3 folds for each of 40 candidates, totalling 120 fits
    Best score 0.999030110655557
    ----------------------------------------------------------------------
    Best estimator
    Pipeline(memory=None,
         steps=[('vect', CountVectorizer(analyzer='char_wb', binary=False,
                decode_error='strict', dtype=<class 'numpy.int64'>,
                encoding='utf-8', input='content', lowercase=True, max_df=1.0,
                max_features=None, min_df=3, ngram_range=(2, 5),
                preprocessor=None, stop_words=None,
                st...ty='l2', random_state=None, solver='liblinear',
                tol=0.0001, verbose=0, warm_start=False))])
    **********************************************************************
    Best parameters:
    **********************************************************************
    {'LogRegr__C': 10, 'LogRegr__max_iter': 100, 'vect__analyzer': 'char_wb',
     'vect__min_df': 3, 'vect__ngram_range': (2, 5)}
    ----------------------------------------------------------------------
    Best score 0.9843935987303267
    ----------------------------------------------------------------------
    Best estimator
    Pipeline(memory=None,
         steps=[('vect', CountVectorizer(analyzer='char_wb', binary=False,
                decode_error='strict', dtype=<class 'numpy.int64'>,
                encoding='utf-8', input='content', lowercase=True, max_df=1.0,
                max_features=None, min_df=5, ngram_range=(2, 5),
                preprocessor=None, stop_words=None, strip_accents=None,
                token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None,
                vocabulary=None)),
            ('MultinomNB', MultinomialNB(alpha=0.01, class_prior=None,
                fit_prior=True))])
    **********************************************************************
    Best parameters:
    **********************************************************************
    {'MultinomNB__alpha': 0.01, 'vect__analyzer': 'char_wb',
     'vect__min_df': 5, 'vect__ngram_range': (2, 5)}
    ----------------------------------------------------------------------
    top 20 positive features:
    )        6.615932
    :d       2.652038
    :d       2.011449
    d        1.995172
    :*       1.726206
    ))       1.631845
    :)       1.618852
    :*       1.362751
    *        1.352714
    ((((     1.032752
    :d       1.018513
    (((((    0.946603
    а)       0.912792
    )        0.857700
    о)       0.855670
    :d       0.776020
    ).       0.743704
    ь)       0.743562
    я)       0.718062
    ;)       0.690497
    dtype: float64
    ----------------------------------------------------------------------
    top 20 negative features:
    (       -8.277490
    ((      -2.706023
    :(      -2.454728
    o_o     -2.329807
    _o      -2.046104
    o_      -1.830299
    :|      -1.535142
    |       -1.450972
    :|      -1.417402
    (       -1.160076
    ;(      -0.902893
    о_о     -0.871438
    о_      -0.870233
    о_о     -0.869731
    _о      -0.861229
    _о      -0.859388
    -/      -0.847609
    :-/     -0.847609
    :|      -0.831238
    :|      -0.831238
    dtype: float64
    ----------------------------------------------------------------------
                                                   ttext  expected  pred_lr  pred_nb
    0  Погода сегодня полная фигня, но настроение все...         1        1       -1
    1  Ну сходил я на этот фильм. Отзывы были нормаль...        -1        1       -1
    2                                StackOverflow рулит         1        1        1

Actually, it is not clear why these particular values were chosen in param_grid, and it is also not entirely clear what each of these parameters is responsible for.

Also, I would ask knowledgeable people to say whether it is normal that various emoticons are the most "influential" features. If not, how can that be fixed? Nothing came to mind except simply deleting them manually (in Sublime Text 2, for example).

Also, I now need to take comments from a database as test data. I do that as follows:

    import pymysql

    db = pymysql.connect(host='localhost', user='root', passwd='',
                         database='mom_db', charset='utf8')
    test = pd.read_sql("SELECT comm FROM comments", db)
    test['comm'] = test['comm'].apply(delete_tabs)
    #test['expected'] = [-1, -1, 1, 1]
    test['pred_lr'] = grid_lr.best_estimator_.predict(test['comm'])
    test['pred_nb'] = grid_nb.best_estimator_.predict(test['comm'])
    pd.options.display.expand_frame_repr = False
    print(test)

Where

    def delete_tabs(text):
        # strip leading and trailing whitespace (tabs included);
        # avoid shadowing the built-in name `str`
        return text.strip()

in order to remove the unnecessary tabs that, for some reason, appear in the comment elements on the site.
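As a side note, the same stripping can be done in one vectorized call instead of applying delete_tabs row by row; a small sketch on made-up comment data:

```python
import pandas as pd

# .str.strip() removes leading/trailing whitespace (tabs included)
# for the whole column at once
test = pd.DataFrame({'comm': ['\tотличный сайт \t', '  плохой сервис\n']})
test['comm'] = test['comm'].str.strip()
print(test['comm'].tolist())  # -> ['отличный сайт', 'плохой сервис']
```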

I retrained the model on the corpus in its original form (without deleting emoticons, etc.); here is part of the output:

    top 20 positive features:
    )        6.175323
    :d       2.393173
    d        2.010635
    :d       1.867269
    :*       1.654194
    ))       1.575834
    :)       1.386353
    :*       1.199153
    *        1.188369
    :d       0.916147
    (((((    0.886899
    а)       0.832452
    ((((     0.828972
    )        0.718146
    ь)       0.711038
    :d       0.708241
    ).       0.669910
    о)       0.594219
    е)       0.591869
    :)       0.589400
    dtype: float64

It is confusing that "((((" and "(((((" appear among the most "positive" features. Why could that be? The corpus really is labeled correctly...

One more thing: do you think it is realistic to teach the model to recognize sarcasm and irony? Maybe someone has an idea?

    1 answer

    I will try to answer some questions:

    1. The values in param_grid were chosen intuitively, based on the answer author's personal experience. You could use a much larger parameter grid, but the running time of GridSearchCV would increase dramatically.
    2. As for what each parameter is responsible for, it is best to consult the scikit-learn documentation for CountVectorizer, LogisticRegression, and MultinomialNB.
    3. Smileys and emoji are, in my opinion, among the most significant features for the tonal/emotional characterization of posts on social networks.
    4. The text can be processed directly in a Pandas DataFrame, using regular expressions with replacements.
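For point 4, here is a minimal sketch of such a regex replacement over the text column. The emoticon pattern below is only an illustrative assumption, not an exhaustive list:

```python
import re

import pandas as pd

# illustrative pattern: ASCII smileys like :D :( ;) plus runs of brackets
emoticon_re = re.compile(r'[:;=][-o*]?[)(dDpP|/*]+|\){2,}|\({2,}')

df = pd.DataFrame({'ttext': ['Погода фигня, но настроение отличное :D',
                             'Ну и фильм... отстой :(']})
# strip emoticons from the whole column before vectorizing
df['ttext'] = df['ttext'].str.replace(emoticon_re, '', regex=True).str.strip()
print(df['ttext'].tolist())
```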
    • Emoticons are, of course, a good signal, but what about a situation where a person writes in a comment "everything is because he is a 'bad person'!!))" with emoticons at the end? The comment is negative, but the smiley pulls everything toward "+". – lynx
    • By the way, the site I take the data from unfortunately uses smiley images. Does that mean I cannot analyze them at all? – lynx
    • @lynx, what do these pictures look like in your data set? Like file names, or something else? – MaxU
    • @MaxU, when parsing the site they are simply deleted, but in theory I could keep them. On the site they look like this: '<img src="images/smilies/crazy.gif" border="0" alt="" title="Crazy" class="inlineimg">' – lynx
    • I meant that the model would need to be trained on your file names. You can add entries with those file names to the training corpus. – MaxU
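MaxU's suggestion could look roughly like this: when parsing, replace each smiley <img> tag with a textual token instead of deleting it, so the model can learn from smileys. The tag format is taken from the comment above; the SMILEY_ token scheme built from the title attribute is an assumption for illustration:

```python
import re

# matches an <img ...> tag and captures its title attribute
img_re = re.compile(r'<img[^>]*\btitle\s*=\s*"([^"]*)"[^>]*>')

html = ('Ну ты даёшь '
        '<img src="images/smilies/crazy.gif" border="0" alt="" '
        'title="Crazy" class="inlineimg">')

# keep the smiley as a learnable token instead of deleting it
text = img_re.sub(r'SMILEY_\1', html)
print(text)  # -> 'Ну ты даёшь SMILEY_Crazy'
```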