Suppose we have the following DataFrame:
    In [74]: df
    Out[74]:
                                   Text  Category
    0                      This is cool  Positive
    1                      Lovely story  Positive
    2             Wow, it is very good!  Positive
    3                 The plot is awful  Negative
    4                         Bad movie  Negative
    5                      Not that bad   Neutral
    6  Actors good, but plot is labored   Neutral
Convert the category labels to numeric codes:
    In [75]: df['cat_no'] = pd.Categorical(pd.factorize(df.Category)[0])

    In [76]: df
    Out[76]:
                                   Text  Category  cat_no
    0                      This is cool  Positive       0
    1                      Lovely story  Positive       0
    2             Wow, it is very good!  Positive       0
    3                 The plot is awful  Negative       1
    4                         Bad movie  Negative       1
    5                      Not that bad   Neutral       2
    6  Actors good, but plot is labored   Neutral       2

    In [77]: df.dtypes
    Out[77]:
    Text          object
    Category      object
    cat_no      category
    dtype: object
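pd.factorize returns two values: the integer codes and the unique labels in order of first appearance. The step above keeps only the codes. A minimal sketch of keeping both, so that numeric predictions can be mapped back to label names later (the name code_to_label is an illustrative assumption, not part of the original answer):

```python
import pandas as pd

df = pd.DataFrame({
    "Category": ["Positive", "Positive", "Positive",
                 "Negative", "Negative", "Neutral", "Neutral"],
})

# codes: one integer per row; uniques: labels in order of first appearance
codes, uniques = pd.factorize(df["Category"])
df["cat_no"] = pd.Categorical(codes)

# Hypothetical helper mapping: numeric code -> original label
code_to_label = dict(enumerate(uniques))
print(code_to_label)
```

Because the codes follow the order of first appearance, Positive gets 0 here only because it occurs first in the data.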
Now we "tokenize" the text and transform it into a numeric form that classifiers can work with:
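    import pandas as pd
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Build TF-IDF features; English stop words are dropped
    vect = TfidfVectorizer(sublinear_tf=True, max_df=0.5, analyzer='word', stop_words='english')
    X = vect.fit_transform(df.Text)

    # get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
    r = pd.DataFrame(X.toarray(), columns=vect.get_feature_names_out())
    t = df[['Category','cat_no']].join(r)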
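To see what the vectorizer actually keeps, it can help to inspect the fitted vocabulary. The sketch below refits the same vectorizer on the seven example texts (repeated here so the snippet is self-contained) and lists the surviving terms; words such as "this", "is", and "not" are removed by stop_words='english':

```python
from sklearn.feature_extraction.text import TfidfVectorizer

texts = [
    "This is cool",
    "Lovely story",
    "Wow, it is very good!",
    "The plot is awful",
    "Bad movie",
    "Not that bad",
    "Actors good, but plot is labored",
]

vect = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                       analyzer="word", stop_words="english")
X = vect.fit_transform(texts)

# vocabulary_ maps each kept term to its column index in X
terms = sorted(vect.vocabulary_)
print(terms)
print(X.shape)  # (number of documents, number of kept terms)
```

Note that max_df=0.5 would additionally drop any term that occurs in more than half of the documents; no term in this small sample is that frequent.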
Result:
    In [82]: t
    Out[82]:
       Category  cat_no    actors     awful       bad  cool      good   labored    lovely     movie      plot     story       wow
    0  Positive       0  0.000000  0.000000  0.000000   1.0  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000
    1  Positive       0  0.000000  0.000000  0.000000   0.0  0.000000  0.000000  0.707107  0.000000  0.000000  0.707107  0.000000
    2  Positive       0  0.000000  0.000000  0.000000   0.0  0.638709  0.000000  0.000000  0.000000  0.000000  0.000000  0.769449
    3  Negative       1  0.000000  0.769449  0.000000   0.0  0.000000  0.000000  0.000000  0.000000  0.638709  0.000000  0.000000
    4  Negative       1  0.000000  0.000000  0.638709   0.0  0.000000  0.000000  0.000000  0.769449  0.000000  0.000000  0.000000
    5   Neutral       2  0.000000  0.000000  1.000000   0.0  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000
    6   Neutral       2  0.544082  0.000000  0.000000   0.0  0.451635  0.544082  0.000000  0.000000  0.451635  0.000000  0.000000
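The values in this table follow from TfidfVectorizer's default L2 normalization: each row is scaled to unit length. That is why a text with a single kept term gets 1.0, and "Lovely story", whose two terms are equally rare, gets 1/sqrt(2) ≈ 0.707107 for each. A small sketch that checks this on the same seven texts (variable names are illustrative):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

texts = [
    "This is cool",
    "Lovely story",
    "Wow, it is very good!",
    "The plot is awful",
    "Bad movie",
    "Not that bad",
    "Actors good, but plot is labored",
]

vect = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
                       analyzer="word", stop_words="english")
dense = vect.fit_transform(texts).toarray()

# Euclidean length of each document vector; the default norm='l2' makes these 1
row_norms = np.linalg.norm(dense, axis=1)
print(row_norms)

# Non-zero weights of "Lovely story" (row 1)
lovely_story = dense[1][dense[1] > 0]
print(lovely_story)
```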