Good day! I have several sets of texts (say, in .txt format) and I would like to use them as a training set for a Python text-classification program built with the scikit-learn library. All the examples I have found download ready-made training samples; there is no example of creating and using your own.

  • Which classifier are we talking about? Do you need to keep per-line information, i.e. do you want to know which words appear in which lines? - MaxU
  • @MaxU, I do not yet have a clear idea of what would work best, so I would be happy with any option, as long as the work moves forward. - Andrew Gorshenin
  • Do I understand correctly that you have a set of text files and a category for each of them (for training the model)? - MaxU
  • @MaxU I have a set of text files, as well as individual sample documents - Andrew Gorshenin
  • To train a model you need input data (the text files in your case) and the result (the category). In what form do you have the result (the category)? - MaxU

1 answer

In this case I would try the following:

  1. compile a "dictionary": {'full_path_file_1': 'category', ...}
  2. try to strip out comments and constants (for example, all string literals). This is not a trivial task: different languages have different ways of commenting (single-line and multi-line). A rough sketch of steps 2-4 is given after this list.
  3. "tokenize" the source-code text that remains after removing the comments, to get a list of words/commands
  4. select the most frequent commands for each file and append them to an overall list of lists - i.e. the result should be a list whose elements are the lists of selected words/commands of the corresponding source files: [['#include ...', 'printf(...)', ...], ['import ...', 'print(...)', ...]]
  5. "feed" the resulting list to TfidfVectorizer (a sketch of this step follows the update below)

Something along those lines, in general ...
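To make steps 2-4 concrete, here is a minimal sketch. The comment/string regexes, the token pattern and the "top 20 tokens per file" cutoff are illustrative assumptions, not a complete solution:

    import re
    from collections import Counter

    # Assumption: only C-like comments and simple one-line string literals are handled.
    # Python-style '#' comments are deliberately left alone, because stripping them would
    # also remove C preprocessor lines such as '#include'; one reason the task is not trivial.
    COMMENT_PATTERNS = [
        re.compile(r'/\*.*?\*/', re.DOTALL),    # C/C++ block comments
        re.compile(r'//[^\n]*'),                # C/C++ line comments
        re.compile(r'"[^"\n]*"|\'[^\'\n]*\''),  # simple one-line string literals
    ]

    # Crude "word/command" tokenizer; '#' is kept so that '#include' survives as a token
    TOKEN_RE = re.compile(r'[#A-Za-z_][A-Za-z0-9_]*')

    def strip_comments(code):
        """Remove comments and literal strings (step 2, very roughly)."""
        for pattern in COMMENT_PATTERNS:
            code = pattern.sub(' ', code)
        return code

    def top_tokens(path, n=20):
        """Return the n most frequent tokens of one source file (steps 3 and 4)."""
        with open(path, encoding='utf-8', errors='ignore') as f:
            code = strip_comments(f.read())
        counts = Counter(TOKEN_RE.findall(code))
        return [token for token, _ in counts.most_common(n)]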

UPDATE:

Example for the first item:

    In [5]: import re

    In [6]: files = ['O:/categories/cpp/article_2.txt', 'O:/categories/python/article_1.txt']

    In [7]: input_files = {f: re.search(r'\/categories\/([^\/]*)/', f).group(1) for f in files}

    In [8]: print(input_files)
    {'O:/categories/python/article_1.txt': 'python', 'O:/categories/cpp/article_2.txt': 'cpp'}
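Continuing from the input_files dictionary above, step 5 could then look roughly like this. It reuses the hypothetical top_tokens() helper from the earlier sketch; TfidfVectorizer gets a pass-through analyzer because each document is already a list of tokens, and MultinomialNB is just one possible classifier choice:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    paths = list(input_files)                  # file paths from step 1
    labels = [input_files[p] for p in paths]   # 'cpp', 'python', ...
    docs = [top_tokens(p) for p in paths]      # token lists from steps 2-4

    # Documents are already tokenized, so the analyzer just passes them through
    vectorizer = TfidfVectorizer(analyzer=lambda tokens: tokens)

    model = make_pipeline(vectorizer, MultinomialNB())
    model.fit(docs, labels)

    # Classify a new (hypothetical) source file
    print(model.predict([top_tokens('O:/categories/unknown/new_article.txt')]))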
  • I'm sorry, is there any way I could contact you? I would be very glad to have the chance to ask you questions about Python, and sklearn in particular, from time to time - Andrew Gorshenin
  • @AndrewGorshenin, I'm sorry, I do not give out my contacts online. What's wrong with StackOverflow? If your question does not get answered on the Russian-language site, ask it in English: the chances of getting a good answer are much higher there (simply because the number of people answering questions is orders of magnitude larger) - MaxU
  • Well, thank you for answering so quickly. I'll go and create a question now, then - Andrew Gorshenin