My task is to fit a linear regression and predict the salary from a job description. Here is what I have done so far:

    import numpy as np
    import pandas as pd

    Location = r'C:\Users\803008\Desktop\salary-train.csv'
    df = pd.read_csv(Location)

Loaded data:

       FullDescription                                    LocationNormalized  ContractTime  SalaryNormalized
    0  International Sales Manager London ****k ****...   London              permanent     33000
    1  An ideal opportunity for an individual that ha...  London              permanent     50000
    2  Online Content and Brand Manager// Luxury Reta...  South East London   permanent     40000
    3  A great local marketleader is seeking a perman...  Dereham             permanent     22500
    4  Registered Nurse / RGN Nursing Home for Young...   Sutton Coldfield    nan           20355
    5  Sales and Marketing Assistant will provide adm...  Crawley             nan           22500
    6  Vacancy Ladieswear fashion Area Manager / Regi...  UK                  permanent     32000

I converted the descriptions to lowercase and replaced everything except letters and digits with spaces:

    from sklearn.feature_extraction.text import TfidfVectorizer

    df['FullDescription'].str.lower()  # has no effect: the result is not assigned
    train1 = df['FullDescription'].str.lower()
    train2 = train1.replace('[^a-zA-Z0-9]', ' ', regex=True)

I kept only the words that occur in at least 5 documents:

    # min_df=5 (an integer) means "at least 5 documents"
    vectorizer = TfidfVectorizer(min_df=5)
    train3 = vectorizer.fit_transform(train2)
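A note on `min_df`: scikit-learn reads an integer as an absolute number of documents and a float as a share of all documents, so `5` and `0.05` are not the same threshold. A minimal sketch on a made-up corpus (the documents below are hypothetical, not taken from salary-train.csv):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: "sales" occurs in 5 of 6 documents,
# every other word occurs in exactly 1 document.
docs = [
    "sales manager",
    "sales assistant",
    "sales director",
    "sales executive",
    "sales nurse",
    "content editor",
]

# Integer min_df: keep only terms found in at least 5 documents.
by_count = TfidfVectorizer(min_df=5).fit(docs)

# Float min_df: keep terms found in at least 5% of the documents.
# With 6 documents, a single occurrence is already about 17%.
by_share = TfidfVectorizer(min_df=0.05).fit(docs)

count_vocab = sorted(by_count.vocabulary_)
share_vocab = sorted(by_share.vocabulary_)
print(count_vocab)
print(share_vocab)
```

On a large training set the two settings can produce very different vocabularies, so it is worth checking which one the task intends.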

I replaced the missing values in the LocationNormalized and ContractTime columns with the string 'nan':

    # fillna(..., inplace=True) returns None, so assign the result back instead
    df['LocationNormalized'] = df['LocationNormalized'].fillna('nan')
    df['ContractTime'] = df['ContractTime'].fillna('nan')

Then I needed a one-hot encoding of the categorical features LocationNormalized and ContractTime:

    from sklearn.feature_extraction import DictVectorizer

    enc = DictVectorizer()
    X_train = enc.fit_transform(df[['LocationNormalized', 'ContractTime']].to_dict('records'))

But I do not know how to combine all of the resulting features into a single object-feature matrix, as the task requires. The task says to use scipy.sparse.hstack. How do I replace the columns of the dataset with the ones I have already transformed (converted to lowercase, separators replaced with spaces) and put everything into one matrix?

    1 answer

    You must save every change to a variable or back into the dataset. For example:

     df['FullDescription'] = df['FullDescription'].str.lower() 

    and then pass the transformed matrices, such as the result of enc.fit_transform, to scipy.sparse.hstack.
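A minimal sketch of how the pieces could fit together, assuming the column names from the question. The three-row DataFrame is made up for illustration, and `min_df=1` is used only because the toy data is too small for a threshold of 5:

```python
import pandas as pd
from scipy.sparse import hstack
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data with the same column names as in the question.
df = pd.DataFrame({
    "FullDescription": ["Sales Manager, London!", "Registered Nurse / RGN", "Sales Assistant"],
    "LocationNormalized": ["London", None, "Crawley"],
    "ContractTime": ["permanent", None, None],
    "SalaryNormalized": [33000, 20355, 22500],
})

# Write each transformation back into the DataFrame so it is not lost.
df["FullDescription"] = (
    df["FullDescription"].str.lower().str.replace("[^a-z0-9]", " ", regex=True)
)
df["LocationNormalized"] = df["LocationNormalized"].fillna("nan")
df["ContractTime"] = df["ContractTime"].fillna("nan")

# Text features: a sparse TF-IDF matrix, one row per job posting.
vectorizer = TfidfVectorizer(min_df=1)
X_text = vectorizer.fit_transform(df["FullDescription"])

# Categorical features: a sparse one-hot matrix, one row per job posting.
enc = DictVectorizer()
X_cat = enc.fit_transform(df[["LocationNormalized", "ContractTime"]].to_dict("records"))

# Stack the two blocks side by side; the row counts must match.
X_train = hstack([X_text, X_cat]).tocsr()
y_train = df["SalaryNormalized"].to_numpy()

print(X_text.shape, X_cat.shape, X_train.shape)
```

The stacked `X_train` has the TF-IDF columns first and the one-hot columns after them, and it can be passed together with `y_train` to a regression model. For test data, the same fitted `vectorizer` and `enc` should be applied with `transform` rather than `fit_transform`, so that the columns line up.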