Preparing data from a csv file for learning the model in scikit-learn

Question

There is a csv file containing data in two columns. One of them contains the category "Positive", "Negative" or "Neytral". In another column is a set of words "Bread Light Happyness". The data is entirely text. The task is to convert them into a format suitable for training the classifier. I'm not looking for a turnkey solution. Tell me, please, where can I start? Except for reading data from a file. I read them like this:

import pandas as pd df1 = pd.read_csv('~/projects/text_classifier/learm_ck.csv')

MaxU MaxU 52.2k 6 18 50 · Accepted Answer · 2017-02-24T17:51:37

Suppose we have a trace. DataFrame:

 In [74]: df Out[74]: Text Category 0 This is cool Positive 1 Lovely story Positive 2 Wow, it is very good! Positive 3 The plot is awful Negative 4 Bad movie Negative 5 Not that bad Neutral 6 Actors good, but plot is labored Neutral

Converting a category to a digital value:

 In [75]: df['cat_no'] = pd.Categorical(pd.factorize(df.Category)[0]) In [76]: df Out[76]: Text Category cat_no 0 This is cool Positive 0 1 Lovely story Positive 0 2 Wow, it is very good! Positive 0 3 The plot is awful Negative 1 4 Bad movie Negative 1 5 Not that bad Neutral 2 6 Actors good, but plot is labored Neutral 2 In [77]: df.dtypes Out[77]: Text object Category object cat_no category dtype: object

Now we "tokenize" the text and transform it into a form that is understandable for classifiers:

 #import nltk from sklearn.feature_extraction.text import TfidfVectorizer vect = TfidfVectorizer(sublinear_tf=True, max_df=0.5, analyzer='word', stop_words='english') X = vect.fit_transform(df.Text) r = pd.DataFrame(X.toarray(), columns=vect.get_feature_names()) t = df[['Category','cat_no']].join(r)

Result:

 In [82]: t Out[82]: Category cat_no actors awful bad cool good labored lovely movie plot story wow 0 Positive 0 0.000000 0.000000 0.000000 1.0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1 Positive 0 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.707107 0.000000 0.000000 0.707107 0.000000 2 Positive 0 0.000000 0.000000 0.000000 0.0 0.638709 0.000000 0.000000 0.000000 0.000000 0.000000 0.769449 3 Negative 1 0.000000 0.769449 0.000000 0.0 0.000000 0.000000 0.000000 0.000000 0.638709 0.000000 0.000000 4 Negative 1 0.000000 0.000000 0.638709 0.0 0.000000 0.000000 0.000000 0.769449 0.000000 0.000000 0.000000 5 Neutral 2 0.000000 0.000000 1.000000 0.0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 6 Neutral 2 0.544082 0.000000 0.000000 0.0 0.451635 0.544082 0.000000 0.000000 0.451635 0.000000 0.000000

And if the data in the file is divided by a separator (how to extract them - is understandable)?
When there are only two columns (categories and text for classification), it becomes more clear.
But if the data for classification is divided, and a two-dimensional array is obtained in the data frame, in which the categories are located in the zero column, how to prepare the data for “feeding” to the classifier?
@GlassedMichail, try opening a new question with a reproducible example of your data

Preparing data from a csv file for learning the model in scikit-learn

1 answer 1

More articles: