After reading a CSV with pd.read_csv(), the last column contains only NaN

Question

When I run the code, an error occurs: ValueError: could not convert string to float: 'h'. Here f and h are labels with the names of the object classes (a dataset fragment and the code are attached). Initially the classifier was trained on Fisher's iris dataset and everything worked fine, even though the class column there also contains strings. Please tell me how to fix this; I would be grateful.

    import sklearn
    import pandas as pd
    from sklearn import datasets
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split
    import numpy as np
    import time
    from sklearn.neural_network import MLPClassifier

    df = pd.read_csv('C:\\Users\\Ilyas\\Documents\\StrngStuff\\dft.csv', index_col=0)
    X = df.loc[:, '1f':'2f']  # Features
    y = df.loc[:, 'Pr']  # Labels
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.5)

    def print_accuracy(f):
        print("Accuracy = {0}%".format(100*np.sum(f(X_test) == y_test)/len(y_test)))
        time.sleep(0.5)

    nn1 = MLPClassifier(activation='relu', solver='lbfgs', alpha=1e-1, hidden_layer_sizes=(5, 2), random_state=0)
    nn1.fit(X_train, y_train)
    print_accuracy(nn1.predict)

Fragment of the dataset; in the other part, f is replaced by h:

    ,1f,2f,Pr
    6.78E-09,0.000000029,"f"
    1.71885E-07,7.36621E-07,"f"
    1.1053E-06,4.74247E-06,"f"
    1.09928E-05,0.000047261,"f"
    2.45313E-05,0.000105561,"f"
    4.79299E-05,0.000206426,"f"
    0.000139912,0.000603569,"f"
    0.000217298,0.000938154,"f"
    0.000321944,0.001391043,"f"
    0.00045879,0.001983876,"f"
    0.000848963,0.003676757,"f"
    0.001112029,0.004819806,"f"
    0.001426622,0.006188141,"f"
    0.002227449,0.009676944,"f"

Everything works for me... It seems that in your real program the label (target) column ends up in X_train...
The dataset is much the same, but with 200 rows: 100 labeled f and 100 labeled h.
@MaxU, I will add to the question itself how the data looks after the split.
I cannot understand why this happens, but you are right: X_train does contain the labels for some reason. I do not see how they get there, because y_train, in turn, does not contain them (the first column is duplicated in it).
As I understand it, the problem is in the dataset: for some reason the column names in it are shifted.
@MaxU, if I understood correctly, the comma at the beginning is superfluous?
When I created a dataset similar to the one the code was originally tested on, I did not take the index column into account.

MaxU · Accepted Answer · 2018-04-17T21:34:13

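It looks like the CSV file has an incorrect header. If the header row starts with a comma and you use df = pd.read_csv(fname, index_col=0), the values from the first column are treated as index values, although, judging by the data, the CSV file contains no index values. The remaining values then shift one column to the left relative to their names: the class labels end up under 2f, which is part of X, and Pr, the last column, is left with only NaN.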
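The effect can be reproduced on a small in-memory sample. The sketch below is only an illustration, not the asker's actual file: it assumes the header really begins with a comma, uses three rows copied from the fragment in the question, and the names raw and broken are made up for the example.

```python
import io

import pandas as pd

# Three rows copied from the fragment in the question. The header has a
# leading comma, so it declares four columns while each row has only three.
raw = (
    ',1f,2f,Pr\n'
    '6.78E-09,0.000000029,"f"\n'
    '1.71885E-07,7.36621E-07,"f"\n'
    '1.1053E-06,4.74247E-06,"f"\n'
)

# Same call as in the question: the first column of numbers becomes the index.
broken = pd.read_csv(io.StringIO(raw), index_col=0)

print(broken.columns.tolist())    # column names left after the index was taken
print(broken["2f"].tolist())      # the class labels are found in this column
print(broken["Pr"].isna().all())  # whether the last column holds only NaN
```

On this sample the labels are read into 2f and Pr holds only NaN, which would explain both the ValueError raised by fit and the symptom in the title.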

Try removing the leading comma from the header row (the first row) of the CSV manually, and do not use the index_col=0 parameter:

 df = pd.read_csv('C:\\Users\\Ilyas\\Documents\\StrngStuff\\dft.csv')
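If editing the file by hand is inconvenient, the same result can probably be reached from code. The following is a minimal sketch under assumptions: it works on an in-memory sample with three rows from the question's fragment instead of the real file, and the names raw, fixed and renamed are hypothetical. It shows two options: dropping the leading comma before parsing, or skipping the header line and naming the columns explicitly through the skiprows, header and names parameters of pd.read_csv.

```python
import io

import pandas as pd

# In-memory stand-in for the CSV file, with the stray leading comma in the header.
raw = (
    ',1f,2f,Pr\n'
    '6.78E-09,0.000000029,"f"\n'
    '1.71885E-07,7.36621E-07,"f"\n'
    '1.1053E-06,4.74247E-06,"f"\n'
)

# Option 1: equivalent to deleting the first comma by hand, then reading
# without index_col so that pandas creates a default integer index.
fixed = pd.read_csv(io.StringIO(raw.lstrip(",")))

# Option 2: leave the text untouched, skip its header line and supply
# the three column names explicitly.
renamed = pd.read_csv(
    io.StringIO(raw), skiprows=1, header=None, names=["1f", "2f", "Pr"]
)

# The split into features and labels from the question now behaves as intended.
X = fixed.loc[:, "1f":"2f"]
y = fixed["Pr"]
print(X.dtypes)
print(y.tolist())
```

Both options give the same frame on this sample: X contains only numeric columns and y contains the class labels, so the classifier no longer receives strings as features.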
