Frequency analysis of text in C++

Question

I decided to write a frequency analysis of text in C++. The following problem arose: Russian characters have to work on all platforms (Linux, macOS, Windows), so I wanted to work with a data type that “can do anything”.

Class analysis:

    #define rus_size 33 * 2
    #define eng_size 26 * 2
    #define rus_space_size 33 * 2 + 1
    #define eng_space_size 26 * 2 + 1

    class analysis {
    private:
        std::wstring rus_alphabet[rus_size];
        std::wstring eng_alphabet[eng_size];
        std::wstring rus_alphabet_space[rus_space_size];
        std::wstring eng_alphabet_space[eng_space_size];
        int _indexOfSymbol(std::wstring const &, std::wstring[], int);
        int _indexOfSymbol(wchar_t, std::wstring[], int);
        int _indexOfMaxElement(int*, int);
    public:
        analysis();
        void showAlphabet();
        void showAlphabet(std::wstring const &, bool);
        std::wstring analisys(std::wstring const &, std::wstring const &, bool);
    };

The alphabet code (Russian and English) in the class constructor:

    this->rus_alphabet[0] = L"А";
    this->rus_alphabet[1] = L"Б";
    this->rus_alphabet[2] = L"В";
    this->rus_alphabet[3] = L"Г";
    this->rus_alphabet[4] = L"Д";
    this->rus_alphabet[5] = L"Е";
    // and so on

The English alphabet is filled in the same way. Then the main steps:

    std::wstring analysis::analisys(std::wstring const & text, std::wstring const & lang_of_text, bool space) {
        std::wstring result = L"";
        if(lang_of_text == L"rus"){
            if(space){
                int count_of_symbols[rus_space_size];
            }
            else{
                int count_of_symbols[rus_size];
                // New array
                for (int k = 0; k < rus_size; ++k) {
                    count_of_symbols[k] = 0;
                }
                // Count how many times each symbol occurs in the text
                for (int i = 0; i < text.length(); ++i) {
                    if(iswalpha(rus_alphabet[_indexOfSymbol(std::to_wstring(text[i]), rus_alphabet, rus_size)][0])) {
                        count_of_symbols[_indexOfSymbol(std::to_wstring(text[i]), rus_alphabet, rus_size)] += 1;
                    }
                }
                // Report the number of occurrences of each symbol in descending order
                for (int j = 0; j < rus_size; ++j) {
                    int max = _indexOfMaxElement(count_of_symbols, rus_size);
                    result = result + rus_alphabet[max] + L" – " + std::to_wstring(count_of_symbols[max]) + L"\n";
                    count_of_symbols[max] = -1;
                }
                return result;
            }
        }
        return NULL;
    }

The _indexOfSymbol functions (two overloads):

    int analysis::_indexOfSymbol(wchar_t symbol, std::wstring alphabet[], int size) {
        for (int i = 0; i < size; ++i) {
            if(std::to_wstring(symbol) == alphabet[i])
                return i;
        }
        return NULL;
    }

Second overload:

    int analysis::_indexOfSymbol(std::wstring const & symbol, std::wstring alphabet[], int size) {
        for (int i = 0; i < size; ++i) {
            if(symbol == alphabet[i])
                return i;
        }
        return NULL;
    }

And the _indexOfMaxElement search function:

    int analysis::_indexOfMaxElement(int* array, int size) {
        int index_max = 0;
        for (int i = 0; i < size; ++i) {
            if(array[index_max] < array[i])
                index_max = i;
        }
        return index_max;
    }
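
A side note on the lookup functions above, separate from the encoding problem: std::to_wstring has no overload for wchar_t, so the character is promoted to an integer and formatted as a decimal number (L'A' becomes L"65"), which can never equal a one-letter alphabet entry. Also, returning NULL from a function that returns int yields 0, which is indistinguishable from "found at index 0". A minimal sketch of one way around both; symbolToString and indexOfSymbol are hypothetical names and the std::vector parameter is an assumption, neither is taken from the original class:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: build a one-character wide string from a wchar_t.
// std::to_wstring(ch) would instead format the numeric code of ch.
std::wstring symbolToString(wchar_t ch) {
    return std::wstring(1, ch);
}

// Hypothetical variant of the lookup: returns -1 when the symbol is absent,
// so "not found" cannot be confused with index 0.
int indexOfSymbol(wchar_t symbol, const std::vector<std::wstring>& alphabet) {
    for (std::size_t i = 0; i < alphabet.size(); ++i) {
        // Each alphabet entry is assumed to hold exactly one character.
        if (alphabet[i].size() == 1 && alphabet[i][0] == symbol) {
            return static_cast<int>(i);
        }
    }
    return -1;
}
```

With this shape the caller checks for a negative result before indexing into the counters, instead of relying on iswalpha to filter out misses.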

In the main program main.cpp the call looks like this:

    analysis _analysis;
    wstring kek;
    getline(wcin, kek);
    wcout << _analysis.analisys(kek, L"rus", false);

The problem is that it refuses to find characters, compare characters, or do anything else with them. The output should look something like this:

 A - 7; Б - 5; Я - 4; Д - 3;

and so on, in descending order.

But my output is all zeros. I set breakpoints, and here is some of the data from them:

    The text I entered: L"Ð\U0000009fÑ\U00000080Ð¸Ð²ÐµÑ\U00000082 ÐºÐ¾Ð´" (it should be "Привет код").
    In the loop that counts the symbols, the condition is never satisfied.

How to fix this?

Also, check that Russian letters are readable (i.e., that the locale is set up correctly on the system).

Accepted Answer · 2017-02-13T16:30:05

Replacing the UTF-8 encoding with Windows-1251 helped.
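
The breakpoint dump in the question looks like UTF-8 bytes stored one per wide character: Ð followed by \U0000009f corresponds to the bytes 0xD0 0x9F, which is the UTF-8 encoding of П. That would explain why changing the encoding changed the result. As an alternative that does not depend on the platform's wide-character handling, here is a minimal sketch that decodes UTF-8 by hand and counts the symbols. It assumes the input is valid UTF-8 held in a std::string; decodeUtf8 and countSymbols are hypothetical names, not part of the original code, and the decoder does no validation.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Decode UTF-8 bytes into code points.
// Assumption: the input is well-formed; malformed input is not detected.
std::u32string decodeUtf8(const std::string& bytes) {
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        unsigned char lead = static_cast<unsigned char>(bytes[i]);
        char32_t cp;
        std::size_t extra;  // number of continuation bytes that follow
        if (lead < 0x80)      { cp = lead;        extra = 0; }
        else if (lead < 0xE0) { cp = lead & 0x1F; extra = 1; }
        else if (lead < 0xF0) { cp = lead & 0x0F; extra = 2; }
        else                  { cp = lead & 0x07; extra = 3; }
        ++i;
        for (std::size_t k = 0; k < extra && i < bytes.size(); ++k, ++i) {
            // Each continuation byte contributes its low 6 bits.
            cp = (cp << 6) | (static_cast<unsigned char>(bytes[i]) & 0x3F);
        }
        out.push_back(cp);
    }
    return out;
}

// Count every code point except the space and return the pairs
// sorted by count, highest first (ties keep code point order).
std::vector<std::pair<char32_t, int>> countSymbols(const std::u32string& text) {
    std::map<char32_t, int> counts;
    for (char32_t cp : text) {
        if (cp != U' ') {
            ++counts[cp];
        }
    }
    std::vector<std::pair<char32_t, int>> sorted(counts.begin(), counts.end());
    std::stable_sort(sorted.begin(), sorted.end(),
                     [](const std::pair<char32_t, int>& a,
                        const std::pair<char32_t, int>& b) {
                         return a.second > b.second;
                     });
    return sorted;
}
```

The sketch counts upper- and lower-case letters separately and does not filter punctuation; restricting the result to a given alphabet would be a further step on top of it.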
