The task here is this:

Create a program that processes the text of a programming textbook, using the classes Symbol, Word, Sentence, and Punctuation Mark (you should design the structure and class hierarchy yourself).

I am new to OOP and don't understand, in principle, how to approach this. To clarify: without classes, I would parse the strings into an array, pick out the words (by scanning up to a delimiter character), and then, again using arrays, record the words themselves and their repetitions. But how do you do this with classes in C#? I have created the classes String, Word, and Symbol. Fine, but what do I do with them next? I'm not asking anyone to write the code for me, which would be of no use; what matters to me is understanding the principle itself: what we do and how. Thanks.

  • An interesting question, but where does the dictionary appear in the title? - rdorn

2 answers

Probably like this: create a class "Symbol" that wraps a char, then derive "Punctuation Mark" from it (restricting the char to those that are punctuation marks), and a second derived class, "Letter" (restricting the char to those that are letters). Create a class "Word" that holds a List<Letter>, and a class "Sentence" with properties of types List<PunctuationMark> and List<Word>. Offhand, you will also need to think about storing positions: of the word in the sentence, of the character in the word, of the punctuation mark in the sentence, and of the sentence in the text.
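A minimal sketch of that bottom-up hierarchy might look like the following. The class and member names (Symbol, Letter, PunctuationMark, Position, and so on) are illustrative assumptions, as is the use of char.IsLetter and char.IsPunctuation to enforce the restrictions; only the word's position is stored here, and the other positions would be added the same way.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Base class: wraps a single char. Names here are illustrative assumptions.
public abstract class Symbol
{
    public char Value { get; }
    protected Symbol(char value) => Value = value;
}

// A Symbol restricted to letters.
public sealed class Letter : Symbol
{
    public Letter(char value) : base(value)
    {
        if (!char.IsLetter(value))
            throw new ArgumentException($"'{value}' is not a letter.", nameof(value));
    }
}

// A Symbol restricted to punctuation marks.
public sealed class PunctuationMark : Symbol
{
    public PunctuationMark(char value) : base(value)
    {
        if (!char.IsPunctuation(value))
            throw new ArgumentException($"'{value}' is not a punctuation mark.", nameof(value));
    }
}

// A word is a list of letters plus its index within the sentence.
public sealed class Word
{
    public List<Letter> Letters { get; }
    public int Position { get; }

    public Word(string text, int position)
    {
        Letters = text.Select(c => new Letter(c)).ToList();
        Position = position;
    }

    public override string ToString() =>
        new string(Letters.Select(l => l.Value).ToArray());
}

// A sentence holds its words and punctuation marks in separate lists.
public sealed class Sentence
{
    public List<Word> Words { get; } = new List<Word>();
    public List<PunctuationMark> PunctuationMarks { get; } = new List<PunctuationMark>();
}
```

With this in place, new Word("cat", 0) builds three Letter objects, while new Letter('!') throws an ArgumentException, so invalid objects cannot be created in the first place.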

    A formal description of a natural-language text is very complex, especially if you take into account all the peculiarities of the language, such as abbreviations, the ambiguous use of punctuation marks, complex sentences, and so on, not to mention errors and typos. Therefore, processing arbitrary text without operator supervision is an almost impossible task.

    @Bulson suggested a bottom-up hierarchy, so I'll try to offer a top-down one. The truth is somewhere in the middle.

    In the simple case, the text is an array of sentences, whose order is determined by their position in the array.

    A sentence consists of tokens (lexemes): the minimal independent meaningful units of text.

    Tokens can be divided into three classes: a separator (space), a punctuation mark, and a word (a one-letter word is still a word, and a number written in digits is also a word). Each token is one or more characters from the group allowed for that token type.

    A symbol is the minimal unit of text, a building block for tokens. In principle, you can use the built-in Char for it.
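The three token classes can be sketched as a small tokenizer that groups consecutive characters of the same kind. TokenKind, Token, and Tokenizer are hypothetical names, and classifying by char.IsWhiteSpace and char.IsPunctuation is an assumption; this simple version does not yet handle punctuation inside a word, such as abbreviations.

```csharp
using System;
using System.Collections.Generic;

// The three token classes described above (names are illustrative).
public enum TokenKind { Separator, Punctuation, Word }

public sealed class Token
{
    public TokenKind Kind { get; }
    public string Text { get; }

    public Token(TokenKind kind, string text)
    {
        Kind = kind;
        Text = text;
    }
}

public static class Tokenizer
{
    // Decide which group a single character belongs to.
    private static TokenKind Classify(char c)
    {
        if (char.IsWhiteSpace(c)) return TokenKind.Separator;
        if (char.IsPunctuation(c)) return TokenKind.Punctuation;
        return TokenKind.Word; // letters, digits, and anything else
    }

    // Split the input into runs of characters of the same kind.
    public static List<Token> Tokenize(string input)
    {
        var tokens = new List<Token>();
        int start = 0;
        while (start < input.Length)
        {
            TokenKind kind = Classify(input[start]);
            int end = start + 1;
            while (end < input.Length && Classify(input[end]) == kind)
                end++;
            tokens.Add(new Token(kind, input.Substring(start, end - start)));
            start = end;
        }
        return tokens;
    }
}
```

For example, Tokenizer.Tokenize("Hello, world!") yields five tokens: the word "Hello", the punctuation mark ",", a separator, the word "world", and the punctuation mark "!".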

    And the most important thing, in my opinion, is to write down formal rules for the text. For example: a sentence cannot begin or end with a separator; a sentence must end with a punctuation mark; a word cannot begin with a punctuation character but may contain one in the middle or at the end (abbreviations such as "i.e.", initials, and the like); and so on.

    There will be no class hierarchy as such here: the design relies on composition rather than inheritance, although in some cases inheritance is still possible, for example for tokens.
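A rough sketch of this top-down design, under a few assumptions: Text is composed of Sentence objects, each Sentence is composed of tokens, and inheritance is used only for the token types. All class names are hypothetical, and the constructor checks only two of the sentence rules listed above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Inheritance is used only for the token types.
public abstract class Token
{
    public string Value { get; }
    protected Token(string value) => Value = value;
}

public sealed class WordToken : Token { public WordToken(string v) : base(v) { } }
public sealed class PunctuationToken : Token { public PunctuationToken(string v) : base(v) { } }
public sealed class SeparatorToken : Token { public SeparatorToken(string v) : base(v) { } }

// A sentence is composed of tokens and enforces the formal rules on creation.
public sealed class Sentence
{
    public IReadOnlyList<Token> Tokens { get; }

    public Sentence(IEnumerable<Token> tokens)
    {
        var list = tokens.ToList();
        if (list.Count == 0)
            throw new ArgumentException("A sentence must contain at least one token.");
        if (list[0] is SeparatorToken || list[list.Count - 1] is SeparatorToken)
            throw new ArgumentException("A sentence cannot begin or end with a separator.");
        if (!(list[list.Count - 1] is PunctuationToken))
            throw new ArgumentException("A sentence must end with a punctuation mark.");
        Tokens = list;
    }

    // Only the word tokens, in order of appearance.
    public IEnumerable<string> Words => Tokens.OfType<WordToken>().Select(t => t.Value);

    public override string ToString() => string.Concat(Tokens.Select(t => t.Value));
}

// The text is composed of sentences; list order is the order in the text.
public sealed class Text
{
    public List<Sentence> Sentences { get; } = new List<Sentence>();
}
```

Because the rules are checked in the constructor, any Sentence that exists is already valid, and tasks such as counting repeated words reduce to iterating over Text.Sentences and each sentence's Words.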