Hello! Please help me solve my problem. I am trying to count the sentences in text that is read from a file. My code works. But I do not understand how to extend the program so that it takes into account that a period does not always mark the end of a sentence. A period can also appear in abbreviations, such as the abbreviation for "acting".

    package ir_ub2;

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    public class TextCounts {

        public static void main(String[] args) throws IOException {
            // read the file
            FileReader fileName = new FileReader("C:\\Users\\Olga\\Desktop\\ub_2\\inputDE.txt");
            // wrap a BufferedReader around FileReader
            BufferedReader reader = new BufferedReader(fileName);

            int sentenceCount = 0;
            String line;
            String terminalSymbol = ".?!";

            // Continue reading until end of file is reached
            while ((line = reader.readLine()) != null) {
                for (int i = 0; i < line.length(); i++) {
                    // If the delimiters string contains the character
                    if (terminalSymbol.indexOf(line.charAt(i)) != -1) {
                        sentenceCount++;
                    }
                }
            }
            reader.close();
            System.out.println("The number of sentences is " + sentenceCount);
        }
    }
  • I think there is no way to determine this without semantic analysis. How would you distinguish "ув. Иванов" ("Dear Ivanov", abbreviated) from "конец предложения. Начало следующего" ("end of a sentence. Start of the next")? Only by entering a list of all possible abbreviations. - andy.37
  • 1) I do not know Java at all. 2) Can you imagine the amount of work involved in writing a semantic analyzer for arbitrary text? That is a task that takes years. - andy.37
  • If you write like this to everyone who leaves a comment, I am afraid nobody will help you with an answer. - Andrew Bystrov
  • Still, here is some advice. As the separator, use not just one of the characters .!? but one of these characters followed by one or more whitespace characters and then a capital letter or a digit. Initials will be a problem; you could try requiring a lowercase letter before the period. Sorry, I cannot help with the implementation because, I repeat, I do not know Java. A Perl-style regular expression would look something like this: [^A-Z][.!?]\s+[A-Z] - andy.37
  • The question is not how many days or years it takes to learn a particular language; the point is that you really would have to teach the program to understand the text. The programming language has nothing to do with it; this is a question of theory. You cannot distinguish the end of a sentence from an abbreviation on formal criteria alone. Compare, for example: “Pushkin and Dantes participated in the duel” and “My poems are one continuous Pushkin would have shot himself, but would not read that.” - VladD
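
For reference, the separator rule from andy.37's comment (a terminator, then whitespace, then a capital letter, with a non-capital character in front) might look roughly like this in Java. This is only a sketch: the class and method names are invented for illustration, and \p{Lu} is used instead of [A-Z] so that Cyrillic and German capitals are covered as well, which goes beyond what the comment specifies.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BoundaryRegexSketch {
    // A non-capital character, a terminator, whitespace, then a capital letter.
    // \p{Lu} (any uppercase letter) is an assumption replacing [A-Z].
    private static final Pattern BOUNDARY =
            Pattern.compile("[^\\p{Lu}][.!?]\\s+\\p{Lu}");

    static int countSentences(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return 0;
        }
        int count = 1; // the last sentence is not followed by a capital letter
        Matcher m = BOUNDARY.matcher(trimmed);
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Initials are skipped because a capital letter precedes the period.
        System.out.println(countSentences("I met A. S. Pushkin. He wrote poems! Did you know?"));
    }
}
```

As VladD points out, a formal rule like this still miscounts an abbreviation that ends in a lowercase letter and is followed by a capitalized word, so "Dr. Smith arrived." comes out as two sentences.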

8 answers

At the risk of incurring the wrath of the whole feed, I'll still write:

And you, my friends, however you may sit,
To be musicians you are still not fit.

Seriously, though: you do not need a homemade hack here; you need to solve the problem properly.

I see 2 possible ways:

  1. Either feed your text into an NLP model and train it. For a start, you can take OpenNLP and try to train it.
  2. Or take Apache Lucene and try the SentenceTokenizer, which will break the text into sentences. Lucene, as far as I understand, supports the Russian language; by the way, there are also third-party Lucene extensions for Russian morphology.

Something like this.
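
To show what "use an existing sentence splitter" looks like without pulling in a dependency, here is a small sketch built on java.text.BreakIterator from the JDK. Note that this is a stand-in and neither OpenNLP nor Lucene: it is rule-based, not trained, and how it treats abbreviations depends on the JDK's built-in rules. The class and method names are made up for the example.

```java
import java.text.BreakIterator;
import java.util.Locale;

public class BreakIteratorSketch {
    // Counts the non-blank segments that the JDK sentence iterator reports.
    static int countSentences(String text, Locale locale) {
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(text);
        int count = 0;
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            if (!text.substring(start, end).trim().isEmpty()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countSentences("Hello world. How are you? Fine!", Locale.ENGLISH));
    }
}
```

For German or Russian input you would pass Locale.GERMAN or a Russian locale; a trained model would be expected to cope better with abbreviations than these fixed rules.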

  • And did a nightingale happen to fly over at their noise? - Igor

With abbreviations of various kinds (interim, so-called, I. S. Turgenev ...), you need to take into account the number of characters between periods, or the number of characters before a period. Sentences that short do not exist, so such periods can be ignored. You could also allow for sentences like "Yes." and "No."; I do not think there are many such short ones.
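
A rough sketch of this length heuristic in Java might look as follows. The threshold of 4 characters and the whitelist of short sentences are arbitrary assumptions chosen for the example, and the class name is invented.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class LengthHeuristicSketch {
    // Assumed threshold: shorter fragments are treated as abbreviations.
    static final int MIN_LENGTH = 4;
    // Assumed whitelist of short fragments that are still real sentences.
    static final Set<String> SHORT_SENTENCES = new HashSet<>(Arrays.asList("yes", "no"));

    static int countSentences(String text) {
        int count = 0;
        StringBuilder fragment = new StringBuilder(); // characters since the last terminator
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (".?!".indexOf(c) == -1) {
                fragment.append(c);
                continue;
            }
            String f = fragment.toString().trim();
            if (f.length() >= MIN_LENGTH || SHORT_SENTENCES.contains(f.toLowerCase(Locale.ROOT))) {
                count++; // long enough, or a known short sentence
            }
            fragment.setLength(0); // measure the next fragment from this terminator
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countSentences("I. S. Turgenev wrote novels. Yes. He did."));
    }
}
```

The obvious weakness is an abbreviation at the end of a long fragment: "He met the so-called Dr. Smith." is still counted as two sentences, because only the distance between periods is measured.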

In any case, the task is useless, IMHO.

  • @snowhead, why is the task useless? It is definitely used in Information Retrieval. - OlgaM
  • Why do such short sentences not exist? Huh? :-D - Grundy

Offhand, a few points that make this seemingly simple task unsolvable:

  1. Parcellation, where it is not clear whether this is one sentence or several:

    He went too. To the store. Buy cigarettes. (Shukshin)

  2. Direct speech sentences:

    Ignat whispered: "Well, to hell with this task," he laughed nervously.

  3. Scanning errors, typos, and missing punctuation, as in typical Internet communication, where sentence boundaries are not marked in any way:

    EVERYTHING WAS EVERYTHING SO JUST EIGHT FIVE MINUTES BACK I HATE

  4. Interspersed code:

    To display the username, type echo $name.' '.$surname.

This is an open research problem.

Here, for example, people are trying to solve it somehow.

There is a library for Python that copes well with this task (at least for English). If you still need Java, see Jython.

You can use a regex:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Replace yourText with the variable that holds the text whose sentences you want to count
    Matcher m = Pattern.compile("\\.\\s*[A-ZА-Я]").matcher(yourText);
    int count = 1;
    while (m.find()) count++; // Every sentence that starts with a capital letter is counted

Here is what I wrote; I found the task interesting. This code copes with test sentences like these:

    Привет тест. T..... a Тестирование 34. WHAT???

Output ("Sentences, in my opinion: 4"):

    Предложений по моему мнению:4

There is also a debugging feature for testing: after each sentence is counted, the sentence itself is printed. The debug lines can be removed, and then you can use the code.

    package javaapplication3;

    import java.io.ByteArrayOutputStream;

    public class JavaApplication3 {

        public static void main(String[] args) {
            String str = "Привет тест. T..... a Тестирование 34. WHAT???";
            byte[] bytes = str.getBytes();
            byte tocka = '.'; // let the compiler turn the period into 1 byte (a number would work, but this is clearer)
            byte vopr = '?';
            byte voscl = '!';
            int count = 0; // number of sentences
            int max = bytes.length;
            if (max > 0) {
                int i = 0; // offset
                byte a;    // current byte
                // DEBUG: to use the code without debugging, comment out the debug lines
                ByteArrayOutputStream out = new ByteArrayOutputStream(); // buffer for the current sentence
                // debug mode prints the sentences themselves
                // PARSE
                while (i < max) { // main loop
                    while (i < max) { // inner loop; breaking out of it marks the end of a sentence
                        a = bytes[i];
                        i = i + 1;
                        if (i < max) { // there are more characters
                            if (a == tocka || a == vopr || a == voscl) { // end of a sentence
                                // check the next character
                                a = bytes[i];
                                i = i + 1;
                                if (i < max) {
                                    if (a == tocka || a == vopr || a == voscl) { // possibly an ellipsis, or something like ???
                                        while (i < max) {
                                            a = bytes[i];
                                            if (a != tocka && a != vopr && a != voscl) {
                                                break;
                                            }
                                            i = i + 1;
                                        }
                                    }
                                }
                                break;
                            }
                        }
                        // DEBUG
                        out.write(a);
                        //
                    }
                    count = count + 1; // a break occurred or the input ended, so this is a sentence
                    // DEBUG, can be commented out
                    System.err.println(out.toString());
                    out.reset();
                    //
                }
            }
            System.out.println("Предложений по моему мнению:" + String.valueOf(count));
        }
    }

I managed to solve my problem. For those who are interested, here is my answer. Note that the task was for German texts, and I did not need to handle abbreviated initials in names (for example, A. S. Pushkin).

    package ir_ub2;

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    public class TextCounts {

        public static void main(String[] args) throws IOException {
            FileReader file = new FileReader("inputDE.txt"); // read the file
            BufferedReader reader = new BufferedReader(file);
            int sentenceCount = 0;
            String line;
            String delimiter = ".?!";
            String[] singlePointExceptions = { "19. Jahrhundert", "allg.", "bzw.", "bspw.", "etc.", "evtl.",
                    "geb.", "ggf.", "n.Chr", "od.", "s.", "u.", "usw.", "vgl." };
            String[] doublePointExceptions = { "bw", "dh", "di", "n.Chr.", "sa", "so", "su", "ua", "u.Ä.",
                    "uU", "uz", "va", "v.Chr.", "zB", "zT", "z.Zt." };

            // Continue reading until end of file is reached
            while ((line = reader.readLine()) != null) {
                int countQuestionsAndExclamations = countMatchesOfSpecialCharacters(line, "?", "!");
                int countSingles = countMatchesOfSpecialCharacters(line, singlePointExceptions);
                int countDoubles = countMatchesOfSpecialCharacters(line, doublePointExceptions);
                int countPoints = countMatchesOfSpecialCharacters(line, ".");
                sentenceCount += (countQuestionsAndExclamations + (countPoints - (countSingles + countDoubles)));
            }
            reader.close();
            file.close();
            System.out.println("# of founded Sentences: " + sentenceCount);
        }

        // Counts how often any of the given strings occurs in str
        private static int countMatchesOfSpecialCharacters(final String str, final String... specialCharacters) {
            if (null == str || str.isEmpty()) {
                return 0;
            }
            if (null == specialCharacters || specialCharacters.length == 0) {
                return 0;
            }
            int count = 0;
            int index = 0;
            for (int i = 0; i < specialCharacters.length; i++) {
                String special = specialCharacters[i];
                index = 0;
                while ((index = str.indexOf(special, index)) != -1) {
                    count++;
                    index += special.length();
                    if (index >= str.length()) {
                        break;
                    }
                }
            }
            return count;
        }
    }
  • And what about "20. Jahrhundert" or, God forbid, "21."? And what if "z. B." is written with a space? - VladD
  • I agree that my version is not perfect. I already regret posting it. If I had said in my question that I needed this for German, I would hardly have received any answers. - OlgaM
  • @Igor: If the text is a specific one, then the right solution would be something like System.out.println(20); - VladD
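
The two cases VladD raises (arbitrary ordinals such as "20." and abbreviations written with a space such as "z. B.") could be handled by checking whole tokens, not substrings. The sketch below is only one possible approach and is not OlgaM's method: the abbreviation list is a small illustrative sample, the class name is invented, and treating every "digits plus period" token that is followed by another word as an ordinal is an assumption that will sometimes be wrong.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

public class AbbreviationAwareCounter {
    // Illustrative, incomplete list; lower case, written without spaces.
    static final Set<String> ABBREVIATIONS = new HashSet<>(Arrays.asList(
            "z.b.", "d.h.", "u.a.", "usw.", "bzw.", "etc.", "vgl.", "ggf.", "evtl."));
    // Digits followed by a period, e.g. "19." in "19. Jahrhundert".
    static final Pattern ORDINAL = Pattern.compile("\\d+\\.");

    static int countSentences(String text) {
        String[] tokens = text.trim().split("\\s+");
        int count = 0;
        for (int i = 0; i < tokens.length; i++) {
            String token = tokens[i];
            if (token.isEmpty()) continue;
            char last = token.charAt(token.length() - 1);
            if (last == '?' || last == '!') { count++; continue; }
            if (last != '.') continue;
            boolean hasNext = i + 1 < tokens.length;
            if (hasNext && ABBREVIATIONS.contains((token + tokens[i + 1]).toLowerCase(Locale.ROOT))) {
                i++; // spaced form such as "z. B.": skip both parts
                continue;
            }
            if (ABBREVIATIONS.contains(token.toLowerCase(Locale.ROOT))) continue;
            if (hasNext && ORDINAL.matcher(token).matches()) continue; // assumed to be an ordinal
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countSentences(
                "Im 20. Jahrhundert gab es z. B. Autos usw. und Flugzeuge. Wirklich? Ja!"));
    }
}
```

This still undercounts when a sentence really does end with an abbreviation or a number, which is exactly the ambiguity discussed in the comments above.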

You can use the function String.split("\\."). It returns an array of sentences, and with sentences.length you can get the length of the array.
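
One thing to watch with String.split: its argument is a regular expression, so an unescaped "." matches every character and the resulting array is empty. A tiny sketch of the difference (the class and method names are made up):

```java
public class SplitPitfall {
    static int countBySplit(String text) {
        // "\\." is a literal period; trailing empty strings are dropped by split.
        String[] sentences = text.split("\\.");
        return sentences.length; // arrays expose the length field, not a length() method
    }

    public static void main(String[] args) {
        String text = "One. Two. Three.";
        // Unescaped: every character is a separator, so nothing is left.
        System.out.println(text.split(".").length);
        System.out.println(countBySplit(text));
    }
}
```

Even with the escape in place, this counts every period, so abbreviations are still treated as sentence ends.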

  • How did you read the question? The author says: "to take into account that a period does not always mark the end of a sentence. A period can also appear in abbreviations, for example the one for acting." - Egor Trutnev