I built a neural network (a text classifier), but its training accuracy swings anywhere from 30% to 90% between runs, seemingly at random. It can also spontaneously "overtrain" (producing probabilities greater than 1).

The code does not change; I just run it again. The network itself looks like this:

    model = Sequential()
    model.add(Embedding(max_features, 50))
    model.add(LSTM(16, dropout=0.2, recurrent_dropout=0.2))
    model.add(Dense(4, activation='relu'))
    model.add(Dense(int(num_classes), activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

    batch_size = 6
    epochs = 3

It is trained on a sample of 1,200 records, each up to 50 words long, and checked on the same sample (I take the records from the database and compare the predictions against the real labels):

    for i in df.iterrows():
        all += 1
        tx = clean_text(i[1]['body'])
        # Convert the description into a numeric sequence,
        # replacing words with indices from the dictionary.
        textSequences = tokenizer.texts_to_sequences([tx])
        x_train = pad_sequences(textSequences, maxlen=50)
        z = model.predict(x_train)
        if z[0][0] > 1 or z[0][1] > 1:  # "probability" greater than 1
            over += 1
        if z[0][0] > 0.9:  # category 1 detected
            x += 1
            if i[1]['cat_id'] == 39:  # category 1 is id 39
                succ += 1
        if z[0][1] > 0.9:  # category 2 detected
            y += 1
            if i[1]['cat_id'] == 15:  # category 2 is id 15
                succ += 1
        print("--------------------------------------------------------")
        print("t1 (39) : {}".format(z[0][0]))
        print("t2 (15) : {}".format(z[0][1]))
        print("--------------------------------------------------------")

    str = "Total: {}, greater than 1: {}, greater than 0.9: {}, of which correct: {} ({}%)"
    print(str.format(all, over, x + y, succ, round((succ / (x + y)) * 100, 3)))
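(For comparison, the same check could be done in one batched predict call instead of row by row. This is only a sketch, assuming df, clean_text, tokenizer, and model are exactly as above; cat_ids is a hypothetical list mapping each output neuron to its database category id.)

    import numpy as np
    from keras.preprocessing.sequence import pad_sequences

    cat_ids = [39, 15]  # hypothetical mapping: output neuron index -> category id

    texts = [clean_text(body) for body in df['body']]
    x_eval = pad_sequences(tokenizer.texts_to_sequences(texts), maxlen=50)

    z = model.predict(x_eval)          # one forward pass over all 1200 records
    predicted = np.argmax(z, axis=1)   # index of the strongest output per record
    actual = df['cat_id'].to_numpy()

    correct = sum(cat_ids[p] == a for p, a in zip(predicted, actual))
    print("Total: {}, correct: {} ({}%)".format(
        len(df), correct, round(100 * correct / len(df), 3)))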

Each run, this check reports a different share of correct predictions (from 30% to 90%). I have read that a network is "not constant", but this is a very large spread. The counts also differ every time: on one run 800-1,100 of the 1,200 records get a "probability" above 1, on another maybe 50 of the same 1,200. I need advice on what I can do better and what I have not taken into account.

Beyond this, I am also very interested in how to additionally train the network correctly. When it classified something obviously wrong, I tried this:

    tx = [clean_text('Товар не доставлен')]  # "Item not delivered"
    # clean_text strips extra characters, stop words, and so on

    # load our dictionary used for the conversion
    with open('tokenizer.pickle', 'rb') as handle:
        tokenizer = pickle.load(handle)

    # add our words to the dictionary
    tokenizer.fit_on_texts(tx)

    # save the dictionary back to the file
    with open('tokenizer.pickle', 'wb') as handle:
        pickle.dump(tokenizer, handle, protocol=pickle.HIGHEST_PROTOCOL)

    # description prepared: words replaced with indices from the dictionary
    textSequences = tokenizer.texts_to_sequences(tx)
    x_train = pad_sequences(textSequences, maxlen=50)

    model = load_model('my_model.h5')

    # prepare the correct answer for training
    y_train = keras.utils.to_categorical([0], 2)

    history = model.fit(x_train, y_train, batch_size=1, epochs=50)
    model.save('my_model.h5')

That is, in this case the label matches the text, but this "additional training" completely changes the network's behavior: afterwards it assigns any other input to the category I have just trained it on, even if the input is a random string of characters.

I set 50 epochs because each epoch shifts the result by only about 0.02%.
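(One commonly suggested mitigation for this kind of "forgetting", as a sketch rather than a recipe: mix the corrected example into the original training data and fine-tune briefly with a reduced learning rate, instead of fitting 50 epochs on a single record. The names x_orig/y_orig below are hypothetical and stand for the original 1,200-record training arrays; x_train/y_train are the single example prepared above.)

    import numpy as np
    from keras.optimizers import Adam

    # x_orig / y_orig are hypothetical: the original 1200-record training arrays;
    # the new example is repeated a few times so it carries some weight
    x_mix = np.concatenate([x_orig, np.repeat(x_train, 5, axis=0)])
    y_mix = np.concatenate([y_orig, np.repeat(y_train, 5, axis=0)])

    # reduced learning rate so one example cannot overwrite everything learned so far
    model.compile(loss='binary_crossentropy',
                  optimizer=Adam(lr=1e-4),
                  metrics=['accuracy'])
    model.fit(x_mix, y_mix, batch_size=6, epochs=2, shuffle=True)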

How do I train it correctly, so that the results do not all end up tied to one category?

    1 answer

    activation='sigmoid' and loss='binary_crossentropy' are used for binary classification tasks (i.e. when you have only two output classes, usually 0 and 1).
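    For a genuinely binary task, that pairing conventionally goes with a single output neuron, so each label is a plain 0/1 scalar; a minimal sketch of just the output layer:

        model.add(Dense(1, activation='sigmoid'))  # one output: probability of class 1
        model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
        # labels are then plain 0/1 integers, not one-hot vectors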

    In multi-class classification tasks, activation='softmax' and loss='categorical_crossentropy' are commonly used.
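    Applied to the model posted above, a minimal sketch of that change (max_features and num_classes as in the question; the rest of the architecture kept as posted):

        from keras.models import Sequential
        from keras.layers import Embedding, LSTM, Dense

        model = Sequential()
        model.add(Embedding(max_features, 50))
        model.add(LSTM(16, dropout=0.2, recurrent_dropout=0.2))
        model.add(Dense(4, activation='relu'))
        # softmax turns the outputs into a proper distribution:
        # every value is in [0, 1] and the row sums to exactly 1
        model.add(Dense(num_classes, activation='softmax'))
        model.compile(loss='categorical_crossentropy', optimizer='adam',
                      metrics=['accuracy'])

        # labels must then be one-hot vectors, e.g.
        # y_train = keras.utils.to_categorical(labels, num_classes)

    With softmax the outputs always sum to 1, so no output (or sum of outputs) can exceed 1, and the predicted category is simply the argmax over the output vector.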