Good day! I have several sets of texts (say, in .txt format) and I would like to use them as a training set for a Python text-classification program built with the scikit-learn library. All the examples I have found download ready-made training samples; there is no example of creating and using your own.

  • Which classifier are we talking about? Do you need to keep per-line information, i.e. do you want to know which words appear in which lines? - MaxU
  • @MaxU, I do not yet have a clear idea of what would work best, so I would be happy with any option, as long as the work moves forward. - Andrew Gorshenin
  • Do I understand correctly that you have a set of text files and a category for each of them (for training the model)? - MaxU
  • @MaxU I have a set of text files, as well as individual sample documents - Andrew Gorshenin
  • To train a model you need input data (the text files in your case) and the result (the category). In what form do you have the result (the category)? - MaxU

1 answer

In this case I would try the following:

  1. compile a "dictionary": {'full_path_file_1': 'category', ...}
  2. try to strip out comments and constants (for example, all string literals). This is not a trivial task: different languages have different ways of commenting (single-line and multi-line). A rough sketch of steps 2-4 is given after this list.
  3. "tokenize" the source-code text that remains after removing the comments, to get a list of words/commands
  4. select the most frequent commands for each file and append them to an overall list of lists - i.e. the result should be a list whose elements are the lists of selected words/commands of the corresponding source files: [['#include ...', 'printf(...)', ...], ['import ...', 'print(...)', ...]]
  5. "feed" the resulting list to TfidfVectorizer (a sketch of this step follows the update below)

Something along those lines, in general ...
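To make steps 2-4 concrete, here is a minimal sketch. The comment/string regexes, the token pattern and the "top 20 tokens per file" cutoff are illustrative assumptions, not a complete solution:

    import re
    from collections import Counter

    # Assumption: only C-like comments and simple one-line string literals are handled.
    # Python-style '#' comments are deliberately left alone, because stripping them would
    # also remove C preprocessor lines such as '#include'; one reason the task is not trivial.
    COMMENT_PATTERNS = [
        re.compile(r'/\*.*?\*/', re.DOTALL),    # C/C++ block comments
        re.compile(r'//[^\n]*'),                # C/C++ line comments
        re.compile(r'"[^"\n]*"|\'[^\'\n]*\''),  # simple one-line string literals
    ]

    # Crude "word/command" tokenizer; '#' is kept so that '#include' survives as a token
    TOKEN_RE = re.compile(r'[#A-Za-z_][A-Za-z0-9_]*')

    def strip_comments(code):
        """Remove comments and literal strings (step 2, very roughly)."""
        for pattern in COMMENT_PATTERNS:
            code = pattern.sub(' ', code)
        return code

    def top_tokens(path, n=20):
        """Return the n most frequent tokens of one source file (steps 3 and 4)."""
        with open(path, encoding='utf-8', errors='ignore') as f:
            code = strip_comments(f.read())
        counts = Counter(TOKEN_RE.findall(code))
        return [token for token, _ in counts.most_common(n)]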

UPDATE:

Example for the first item:

    In [5]: import re

    In [6]: files = ['O:/categories/cpp/article_2.txt', 'O:/categories/python/article_1.txt']

    In [7]: input_files = {f: re.search(r'\/categories\/([^\/]*)/', f).group(1) for f in files}

    In [8]: print(input_files)
    {'O:/categories/python/article_1.txt': 'python', 'O:/categories/cpp/article_2.txt': 'cpp'}
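Continuing from the input_files dictionary above, step 5 could then look roughly like this. It reuses the hypothetical top_tokens() helper from the earlier sketch; TfidfVectorizer gets a pass-through analyzer because each document is already a list of tokens, and MultinomialNB is just one possible classifier choice:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    paths = list(input_files)                  # file paths from step 1
    labels = [input_files[p] for p in paths]   # 'cpp', 'python', ...
    docs = [top_tokens(p) for p in paths]      # token lists from steps 2-4

    # Documents are already tokenized, so the analyzer just passes them through
    vectorizer = TfidfVectorizer(analyzer=lambda tokens: tokens)

    model = make_pipeline(vectorizer, MultinomialNB())
    model.fit(docs, labels)

    # Classify a new (hypothetical) source file
    print(model.predict([top_tokens('O:/categories/unknown/new_article.txt')]))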
  • I'm sorry, is there any way I could contact you? I would be very glad to have the chance to ask you questions about Python, and sklearn in particular, from time to time - Andrew Gorshenin
  • @AndrewGorshenin, I'm sorry, I do not give out my contacts online. What's wrong with StackOverflow? If your question does not get answered on the Russian-language site, ask it in English: the chances of getting a good answer are much higher there (simply because the number of people answering questions is orders of magnitude larger) - MaxU
  • Well, thank you for answering so quickly. I'll go and create a question now, then - Andrew Gorshenin