Good day! I would like to know how to take several sets of texts (say, in txt format) and use them as a training set for a Python text classification program built with the scikit-learn library. All the examples I have found suggest downloading ready-made training samples; none of them shows how to create and use a set of my own.
- What classifier are we talking about? Do you need to preserve line information? I.e., do you want to know which words appear in which lines? - MaxU
- @maxu, I don't yet have a clear idea of what would work best, so I'd be happy with any option, as long as the work doesn't stall. - Andrew Gorshenin
- Do I understand correctly that you have a set of text files and a category for each of them (for training the model)? - MaxU
- @maxu I have a set of text files, as well as individual samples of documents - Andrew Gorshenin
- To train a model, you need input data (text files in your case) and the result (category). In what form do you have the result (category)? - MaxU
1 answer
In this case I would try the following:
- compile a dictionary that maps each file to its category:
  {'full_path_file_1':'category', ...}
- try to strip comments and constants (for example, all string literals). This is not a trivial task: different languages have different comment syntax (single-line and multi-line).
- "tokenize" the source code that remains after stripping comments, to get a list of words / commands
- select the most frequent commands for each file and add them to a common list of lists, i.e. you should end up with a list in which each element is the list of selected words / commands of the corresponding source file:
  [['#include ...', 'printf(...)', ...], ['import ...', 'print(...)', ...]]
- "feed" the resulting list to TfidfVectorizer
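The stripping, tokenizing and selection steps could be sketched roughly like this, using only the standard library. The names STRIP_PATTERNS, TOKEN and top_tokens are hypothetical, and the regular expressions are deliberately simplified assumptions: they cover C-style and '#' comments and ordinary quoted strings, but not, for example, Python triple-quoted strings.

```python
import re
from collections import Counter

# Quoted string literals with backslash escapes (single line only).
_DQ = r'"(?:\\.|[^"\\\n])*"'
_SQ = r"'(?:\\.|[^'\\\n])*'"
_STRING = _DQ + "|" + _SQ

# Hypothetical per-language patterns. Strings and comments are matched in
# one pass, so a '//' inside a string is not mistaken for a comment.
STRIP_PATTERNS = {
    "cpp": re.compile(_STRING + r"|/\*.*?\*/|//[^\n]*", re.DOTALL),
    "python": re.compile(_STRING + r"|#[^\n]*"),
}

# Identifiers, optionally prefixed with '#' (e.g. '#include').
TOKEN = re.compile(r"#?[A-Za-z_]\w*")


def top_tokens(source, language, n=20):
    """Return up to n of the most frequent tokens left after stripping."""
    cleaned = STRIP_PATTERNS[language].sub(" ", source)
    counts = Counter(TOKEN.findall(cleaned))
    return [token for token, _ in counts.most_common(n)]


cpp_source = r'''
#include <stdio.h>
/* entry point */
int main() {
    printf("hello // not a comment\n");  // greet
    printf("bye\n");
    return 0;
}
'''

py_source = r'''
import sys  # standard library
print("it's here")
print('done')
'''

# One token list per source file, in the shape described in the list above.
token_lists = [top_tokens(cpp_source, "cpp"), top_tokens(py_source, "python")]
print(token_lists)
```

For real projects a proper lexer for each language would be more reliable than regular expressions; this only illustrates the shape of the data.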
In general, something like that ...
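For the last step, here is a minimal sketch that assumes the per-file token lists are already built. The sample data, the identity helper and the choice of LogisticRegression are assumptions made for illustration; the only thing taken from the answer is TfidfVectorizer.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training data: one token list per source file, plus its category.
token_lists = [
    ["#include", "printf", "int", "main", "return"],
    ["#include", "std", "cout", "int", "main"],
    ["import", "print", "def", "self", "return"],
    ["import", "def", "class", "print", "lambda"],
]
categories = ["cpp", "cpp", "python", "python"]


def identity(tokens):
    """Pass pre-tokenized documents through unchanged."""
    return tokens


# The documents are already lists of tokens, so a callable analyzer is used
# instead of letting TfidfVectorizer split raw strings itself.
model = make_pipeline(
    TfidfVectorizer(analyzer=identity),
    LogisticRegression(),  # arbitrary choice; any sklearn classifier would fit here
)
model.fit(token_lists, categories)

# Classify a new, unseen token list.
print(model.predict([["#include", "int", "main", "printf"]]))
```

In practice the token lists and categories would come from your own files, for example by reading every path in a {path: category} dictionary and tokenizing its contents.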
UPDATE:
Example for the first item:
In [5]: import re
In [6]: files = ['O:/categories/cpp/article_2.txt', 'O:/categories/python/article_1.txt']
In [7]: input_files = {f:re.search(r'\/categories\/([^\/]*)/', f).group(1) for f in files}
In [8]: print(input_files)
{'O:/categories/python/article_1.txt': 'python', 'O:/categories/cpp/article_2.txt': 'cpp'}
- I'm sorry, can I contact you somehow? I would be very glad to be able to ask you questions now and then about Python, and sklearn in particular - Andrew Gorshenin
- @AndrewGorshenin, I'm sorry, I do not give out my contact details online. What's wrong with StackOverflow? If nobody answers your question on the Russian-language site, ask it in English: you are much more likely to get a good answer there (simply because the number of people answering questions is orders of magnitude greater) - MaxU
- Well, thanks for answering so quickly. I'll go and post a question now - Andrew Gorshenin