Debugging a Python script by calling the ML model from the command line

Question

The script in the listing below loads a trained machine learning model and takes its input from command-line arguments.

The script itself:

    import sys
    import argparse
    import pandas as pd
    import math
    import numpy as np
    import pickle
    import re
    from sklearn.neighbors import KNeighborsRegressor

    if __name__ == '__main__':
        columnsList = ['OGRN', 'cases_0_sum', 'cases_1_sum', 'cases_2_sum', 'cases_3_sum',
                       'cases_4_sum', 'cases_5_sum', 'cases_6_sum', 'cases_7_sum',
                       'cases_8_sum', 'cases_9_sum', 'blocks_0_count' 'blocks_0_sum',
                       'Balance_values_12003', 'Balance_values_12004', 'Balance_values_12303',
                       'Balance_values_12304', 'Balance_values_13103', 'Balance_values_14003',
                       'Balance_values_15003', 'Balance_values_15203', 'profit_class',
                       'executions_0_sum', 'executions_1_sum']
        df_in = pd.DataFrame([sys.argv[1:]], index=columnsList).T

        df_in[['num_successful_executions','num_successful_executions_sum']] = df_in['executions_0_sum'].str.split('\s*на сумму\s*', expand=True)
        df_in.drop(columns='executions_0_sum', inplace=True)
        df_in[['num_continuing_executions','num_continuing_executions_sum']] = df_in['executions_1_sum'].str.split('\s*на сумму\s*', expand=True)
        df_in.drop(columns='executions_1_sum', inplace=True)

        df_in['num_continuing_executions'].fillna('0', inplace=True)
        df_in['num_successful_executions'].fillna('0', inplace=True)
        df_in['num_continuing_executions']=pd.to_numeric(df_in['num_continuing_executions'],errors='coerce')
        df_in['num_successful_executions']=pd.to_numeric(df_in['num_successful_executions'],errors='coerce')
        df_in['num_continuing_executions'].fillna(0, inplace=True)
        df_in['num_successful_executions'].fillna(0, inplace=True)

        for n in ['num_continuing_executions_sum','num_successful_executions_sum']:
            df_in[n]=df_in[n].str.replace('руб', '')
            df_in[n]=df_in[n].str.replace('млн', '00000')
            df_in[n]=df_in[n].str.replace('млрд', '00000000')
            df_in[n]=df_in[n].str.replace('Нет', '0')
            df_in[n].fillna('0', inplace=True)
            df_in[n]=df_in[n].str.replace('.', '')
            df_in[n]=df_in[n].str.replace(',', '')
            df_in[n]=df_in[n].str.replace(' ', '')
            df_in[n]=pd.to_numeric(df_in[n], errors='coerce')

        for n in df_in.columns:
            df_in[n]=pd.to_numeric(df_in[n], errors='coerce')

        df_in['OGRN'] = df_in['OGRN'].map(lambda x: str(x)[3:5])
        df_in['OGRN']=pd.to_numeric(df_in['OGRN'])

        for n in [ 'Balance_values_12003', 'Balance_values_12004', 'Balance_values_12303',
                   'Balance_values_12304', 'Balance_values_13103', 'Balance_values_14003',
                   'Balance_values_15003', 'Balance_values_15203']:
            df_in[n+'_no_data_flag']=np.where(df_in[n]==np.nan,1,0)

        # load the model from disk
        filename1 = 'D:\knn.pickle'
        loaded_model = pickle.load(open(filename1, 'rb'))
        y2_pred = loaded_model.predict(df_in)

        wished_sum =float(input())
        prob = 95*float(y2_pred)/float(wished_sum)
        if prob>=95:
            prob = 95
        print("{:.1f} ".format(y2_pred),'\n',wished_sum, '{:.1f} %'.format(prob))

The arguments passed in:

    107705 0 0 0 0 0 0 0 0 0 0 0 0 150000 120000 100000 90000 10000 200000 200000 170000 -1 "80 на сумму 939 836 руб." "3 на сумму 252 500 руб."

This matches the input the model expects (24 arguments).

However, the script gives an error:

    Traceback (most recent call last):
      File "D:\1\Execution_prediction.py", line 94, in <module>
        y2_pred = loaded_model.predict(df_in)
      File "C:\anaconda\lib\site-packages\sklearn\neighbors\regression.py", line 142, in predict
        X = check_array(X, accept_sparse='csr')
      File "C:\anaconda\lib\site-packages\sklearn\utils\validation.py", line 453, in check_array
        _assert_all_finite(array)
      File "C:\anaconda\lib\site-packages\sklearn\utils\validation.py", line 44, in _assert_all_finite
        " or a value too large for %r." % X.dtype)
    ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

What does this error mean, given that the model trained without problems and receives exactly the input it expects?
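For readers who hit the same message: it means that at least one cell of the frame passed to predict is NaN or infinite after conversion. The following is only a minimal sketch, with a made-up three-column row rather than the real input. It shows that pd.to_numeric with errors='coerce' silently turns unparseable strings into NaN, how isna() can list the affected columns, and that a comparison with np.nan is always False, so a flag built with == np.nan never fires.

```python
import numpy as np
import pandas as pd

# Hypothetical one-row input; every value arrives as a string, as with sys.argv.
raw = pd.DataFrame([{
    "OGRN": "107705",
    "profit_class": "-1",
    "executions_0_sum": "80 на сумму 939 836 руб.",
}])

# errors='coerce' replaces strings that cannot be parsed with NaN instead of raising.
numeric = raw.apply(pd.to_numeric, errors="coerce")

# Columns that would trigger "Input contains NaN" inside predict().
nan_columns = numeric.columns[numeric.isna().any()].tolist()
print(nan_columns)

# NaN never compares equal to anything, including itself,
# so the == np.nan flag stays 0 while the isna() flag is set.
flag_eq = np.where(numeric["executions_0_sum"] == np.nan, 1, 0)
flag_isna = np.where(numeric["executions_0_sum"].isna(), 1, 0)
print(flag_eq, flag_isna)
```

Running a check like this on df_in just before loaded_model.predict(df_in) should show which columns still contain NaN.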

Could you post knn.pickle so that the error can be reproduced?
For me, your code fails at df_in[['num_successful_executions','num_successful_executions_sum']] = df_in['executions_0_sum'].str.split('\s*на сумму\s*', expand=True) with ValueError: Columns must be same length as key

Stepan sokol · Accepted Answer · 2018-12-14T14:26:34

The key error was in this line:

    df_in = pd.DataFrame([sys.argv[1:]], index=columnsList).T

Correct code:

    df_in = pd.DataFrame(sys.argv[1:], index=columnsList).T

That is, the input arguments are passed without the additional brackets. This is where all of the formatting went wrong. The remaining bugs were caught by the classic approach of uncommenting the code line by line.
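The difference between the two calls can be checked on a short list. This is a sketch only: the three values stand in for sys.argv[1:] and the three names for columnsList, both chosen for illustration. A flat list with index=... gives one column whose rows are labelled by the names, and .T turns that into a one-row frame with named columns. With the extra brackets the data becomes a single row of three values, which no longer lines up with a three-entry index (recent pandas versions are expected to reject this with a ValueError).

```python
import pandas as pd

# Stand-ins for columnsList and sys.argv[1:] (illustrative values only).
columns = ["OGRN", "cases_0_sum", "profit_class"]
args = ["107705", "0", "-1"]

# Flat list: three rows labelled by `columns`; .T turns the labels into column names.
df_ok = pd.DataFrame(args, index=columns).T
print(df_ok.shape)
print(list(df_ok.columns))

# Nested list: one row of three values set against a three-entry index.
try:
    pd.DataFrame([args], index=columns).T
    nested_error = None
except ValueError as exc:
    nested_error = str(exc)
print(nested_error)
```

If the nested form is preferred, pd.DataFrame([args], columns=columns) builds the same one-row frame without the transpose.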

And I was just about to take a look... It's great that you found the problem yourself :)
