Forum member Vova loves to write messages that omit spaces after punctuation marks and do not start sentences with a capital letter (for some reason he thinks this gives his messages a special charm). The moderators are tired of reprimanding Vova and have decided to ask the forum's programmers to write a very simple proofreader that will insert the spaces and capital letters for him.

The fix rules are:

  1. Sentences consist of words, spaces, quotes, punctuation marks, line breaks.
  2. Words consist of letters of the Russian and Latin alphabet.
  3. Sentences end with a period, exclamation point or question mark.
  4. The first word in the sentence should begin with a capital letter, all other letters in the sentence should be lowercase.
  5. There should be no space before a punctuation mark (comma, period, exclamation mark, question mark, colon, ellipsis), and there should be a space after each punctuation mark.

Offer your own version of the proofreader. Text goes in, text comes out.

Closed because the question needs to be reformulated so that an objectively correct answer can be given. Closed by participants Oceinic, Kromster, Cyrus, Max Mikheyenko, BogolyubskiyAlexey on 12 Oct '15 at 12:25.

The question gives rise to endless debates and discussions based on opinions rather than knowledge. To get an answer, rephrase your question so that it can be given an unambiguously correct answer, or delete the question altogether. If the question can be reformulated according to the rules set out in the help section, edit it.

  • 12
    Why bother to ask us these tasks? - gammaker
  • 5
    > Forum member Vova likes to write messages, omitting spaces after punctuation marks and forgetting to put capital letters at the beginning of a sentence (for some reason it seems to him that this gives his messages a special charm). The moderators are already tired of making comments to Vova and decided to ask the programmers of the forum to write the simplest proofreader, which will set up spaces and capitalize letters for Vova. Why get personal? =) - jmu
  • 3
    @GLmonster There is an answer to your question on the blog . - Nicolas Chabanovsky
  • 3
    I knew that Vova was very limited, even in his use of punctuation marks (judging by the first answer, apparently he is an acquaintance... shhh). Vova does not use direct speech, or even any brackets, and this on a forum, well, you understand; plus he does not separate long sentences made of parts with different meanings with a semicolon! To hell with Vova and the answers... the easiest way would be to ban such "Vovas" - 4Denis
  • 2
    The task is cool. But incomplete. For example, there is no way to recognize abbreviations. Clearly, abbreviations have their own spelling rules, and after an abbreviation the sentence may either continue or end. There is also no handling of proper names and their forms. - gecube

6 answers

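    # Python 3
    import re

    def correction(text):
        def callback(m):
            punct, word = m.groups()
            # Capitalize after '.', '!', '?' or at the start of the text; lowercase otherwise
            word = word.capitalize() if punct in ('.', '!', '?', '') else word.lower()
            return '%s %s' % (punct if punct else '', word)
        # flags must be passed by keyword: the fourth positional argument of re.sub is count
        return re.sub(r'\s*(^|[.!?,:\s]|\.\.\.)?\s*([a-zA-Zа-яА-Я]+|$)', callback, text, flags=re.U)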

Check how it works:

    >>> print(correction(u'раз ,два, три , четыре, ПЯТЬ - вышел ZaIcHiK погулять.ТУТ охотник выбегает ...прямо в зайчика СТРЕЛЯЕТ ! КОНЕЦ .'))
    Раз, два, три, четыре, пять - вышел zaichik погулять. Тут охотник выбегает... прямо в зайчика стреляет! Конец.
  • 2
    Python? Zer Gud, Waldemar, Zer Gud (s) - gecube
  • Will line breaks and indents at the start of lines be preserved? - avp
  • 1
    Actually, there should be a capital letter after an ellipsis - Andrey Siunov
  • Again, what about quotes and brackets? - Sleeping Owl
  • 1
    @avp they are not preserved, since the task had no such requirement. However, line breaks can be preserved if \s in the regular expression is replaced with a space. Indents are harder, but also possible. @FAndES, > Sentences end with a period, exclamation point or question mark. Nothing is said there about an ellipsis. @Sleeping Owl, what about them? - Ilya Pirogov
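
The idea from the last comment, matching only spaces and tabs instead of `\s` so that line breaks are never consumed, can be sketched as follows. This is not Ilya Pirogov's code: the helper name `fix_spacing_keep_newlines` is hypothetical, and the sketch covers only rule 5 (spacing), not capitalization.

```python
import re

def fix_spacing_keep_newlines(text):
    """Hypothetical helper: normalize spaces around punctuation, leaving '\n' alone."""
    # Drop spaces/tabs (never newlines) in front of a punctuation mark or an ellipsis.
    text = re.sub(r'[ \t]+(\.\.\.|[.,!?:])', r'\1', text)
    # Put exactly one space after the mark when ordinary text follows on the same line.
    text = re.sub(r'(\.\.\.|[.,!?:])[ \t]*(?=[^\s.,!?:])', r'\1 ', text)
    return text

print(fix_spacing_keep_newlines('one ,two,three\nfour , five .Six'))
```

Because neither pattern can match a newline, the two input lines stay two lines. A number such as `3.14` would still be split, so a real proofreader would need an extra rule for digits.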

    use utf8;

    sub correct {
        my $string = shift;
        $string =~ s/(\w+)/\L$1/g;                      # lowercase every word
        $string =~ s/(\w+)\s*([.,!?:]+)\s*/$1$2 /g;     # no space before punctuation, one space after
        $string =~ s/(\A|[.!?]+)(\s*)(\w+)/$1$2\u$3/g;  # capitalize the first word of each sentence
        return $string;
    }

    print correct "раз ,два, три , четыри, ПЯТЬ - вышел ZaIcHiK погулять!ТУТ охотник выбегает ...прямо в зайчика СТРЕЛЯЕТ ? КОНЕЦ .";

Result: One, two, three, four, five - zaichik went out for a walk! Then the hunter runs out ... Shoots right at the bunny? The end.

P.S. I borrowed the test example from Ilya Pirogov; I hope he won't be offended :)
P.P.S. This is a solution in Perl.

        string Corrector(string innerText, UserInfo userInfo)
        {
            var reg1 = new Regex(@"[\.\,\:\!\?][^\s]");  // no space or line break after a punctuation mark
            var reg2 = new Regex(@"[\.]\ [^А-ЯA-Z]");    // no capital letter after a period and a space
            var reg3 = new Regex(@"\b[^А-ЯA-Z]");        // no capital letter after a line break
            var reg4 = new Regex(@"\ [\.\,\:\!\?]");     // space before a punctuation mark
            if (reg1.IsMatch(innerText) || reg2.IsMatch(innerText) || reg3.IsMatch(innerText) || reg4.IsMatch(innerText))
            {
                return "I beg your pardon, comrades. Unfortunately, I am a hopelessly illiterate scoundrel! For this reason I am unable to express my thoughts to you in writing.";
            }
            return innerText;
        }
    • 1) I would correct reg2: why is only a single space allowed after the period, and not several, or a line feed? Also see below: 2) reg3 is not very clear (think of verse written in a column), and sentences can begin with something other than a letter :-) But that is a complaint about rule 4 - user6550
    • reg3 actually means any character except a capital letter to the right of a word boundary. Since a letter or a digit cannot stand to the right (otherwise it would not be a word boundary), this expression matches any word boundary except the one at the end of the line. As for reg2: a sentence can begin with a digit, and reg2 rejects that. - ReinRaus
    • 1. You forgot about quotes, brackets and dashes. 2. A punctuation mark can be followed by another punctuation mark: a period after a period, for example. - Sleeping Owl
    • I slipped up in reg3; it should have been \n. But in general I was just kidding, guys - svsokrat
    • BASIC, or what? - gecube
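
    The remarks about reg3 can be checked with a short sketch. It is written in Python rather than C#, on the assumption that `\b` behaves the same way for this pattern in both regex engines; the helper name `flagged` is hypothetical.

```python
import re

# Same pattern text as reg3 in the C# answer above.
REG3 = re.compile(r'\b[^А-ЯA-Z]')

def flagged(text):
    """Hypothetical helper: True if the reg3 pattern finds a match in the text."""
    return REG3.search(text) is not None

print(flagged('Hello'))        # a single capitalized word, nothing after it
print(flagged('Hello world'))  # the space after "Hello" follows a word boundary
print(flagged('hello'))        # a lowercase letter at the start of a word
```

    Under that assumption, almost any text of more than one word is flagged, including correctly written text, because the character after the end of a word (a space or a punctuation mark) is never a capital letter.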

    In PHP:

        function corrector($string) {
            $string = strtolower($string);

            // Remove spaces around each punctuation mark and put exactly one space after it
            $patterns = array('/ *\. */', '/ *\, */', '/ *\! */', '/ *\? */', '/ *\: */');
            $replacements = array('. ', ', ', '! ', '? ', ': ');
            $string = preg_replace($patterns, $replacements, $string);

            // Glue an ellipsis back together
            $string = preg_replace('/\. \. \. /', '... ', $string);

            // Capitalize the first letter after a sentence-ending mark
            // (anonymous functions are used because create_function() was removed in PHP 8)
            $string = preg_replace_callback(
                '/([\.\!\?]|(\.\.\.)) [a-zA-Zа-яА-Я]/',
                function ($matches) { return strtoupper($matches[0]); },
                $string
            );

            // Capitalize the first letter of the text
            $string = preg_replace_callback(
                '/^ *[a-zA-Zа-яА-Я]/',
                function ($matches) { return strtoupper($matches[0]); },
                $string
            );

            return $string;
        }

        $string = " test . ..test ?teSt .";
        echo corrector($string);

      Here is what I got in Java + OOP, without regexps. I simply did not have the patience to implement separate formatting for hyphens and dashes =/

          input string (strings are shown in quotes for clarity):
          ' hello DUMMY-user ! a lot of thanks ,- this is a "Hello World" Text ... just to check '

          result string:
          'Hello dummy - user! A lot of thanks, - this is a "hello world" text... just to check.'

      USE_CUSTOM_FORMATTERS - a switch of additional formatters that are needed, but were not explicitly specified in the statement of the problem

          import java.util.Collections;
          import java.util.HashSet;
          import java.util.LinkedList;
          import java.util.List;
          import java.util.Set;

          public class FormatStr {
              // debug mode switcher - dump results after each formatter
              private static boolean DEBUG = false;
              // additional formatters switcher - use additional formatters
              private static boolean USE_CUSTOM_FORMATTERS = true;

              public static void main(String[] args) {
                  // create list
                  List<ABaseFormatter> formatters = new LinkedList<ABaseFormatter>();
                  // add extra formatters
                  if (USE_CUSTOM_FORMATTERS) {
                      formatters.add(new DashFormatter());
                      formatters.add(new TrimStringFormatter());
                      formatters.add(new RemoveExtraWhitespacesFormatter());
                  }
                  // add required formatters
                  formatters.add(new LastDotFormatter());
                  formatters.add(new StringToLowerCaseFormatter());
                  formatters.add(new FirstCharToCapitalFormatter());
                  formatters.add(new SentenceFromUpperCaseCharFormatter());
                  formatters.add(new BeforePunctMarkFormatter());
                  formatters.add(new AfterPunctMarkFormatter());

                  // input str
                  String str = " hello DUMMY-user ! a lot of thanks ,- this is a \"Hello World\" Text ... just \nto check ";
                  dump("input string:", str);

                  String result = str;
                  for (ABaseFormatter sf : formatters) {
                      result = sf.format(result);
                      if (DEBUG) {
                          dump("applied formatter " + sf.getClass().getName(), result);
                      }
                  }
                  dump("result string:", result);
              }

              private static void dump(final String purpose, final String str) {
                  System.out.println(purpose + "\n'" + str + "'\n");
              }
          }

          abstract class ABaseFormatter {
              public static Set<Character> SENTENCE_SEPARATOR_CHARS = new HashSet<Character>();
              public static Set<Character> PUNCTUATION_CHARS = new HashSet<Character>();

              static {
                  Collections.addAll(SENTENCE_SEPARATOR_CHARS, '.', '!', '?');
                  Collections.addAll(PUNCTUATION_CHARS, ',', '.', '!', '?', ':');
              }

              /**
               * Format string
               *
               * @param string
               */
              public final String format(String string) {
                  if (null == string) {
                      return null;
                  }
                  return formatStr(string);
              }

              /**
               * Override to apply formatting to whole string
               *
               * @param str
               */
              protected String formatStr(String str) {
                  StringBuilder buff = new StringBuilder(str);
                  // default behavior - apply per character formatting
                  for (int i = 0; i < buff.length();) {
                      i = formatChar(buff, i, buff.charAt(i));
                  }
                  return buff.toString();
              }

              /**
               * Override to apply formatting per char
               *
               * @param buff
               * @param pos
               * @param ch
               */
              protected int formatChar(StringBuilder buff, int pos, char ch) {
                  return pos + 1;
              }
          }

          class SentenceFromUpperCaseCharFormatter extends ABaseFormatter {
              @Override
              protected int formatChar(StringBuilder buff, int pos, char ch) {
                  if (pos == 0) {
                      // find first letter if possible
                      while (pos < buff.length()) {
                          if (Character.isLetter(buff.charAt(pos))) {
                              buff.setCharAt(pos, Character.toUpperCase(buff.charAt(pos)));
                              break;
                          }
                          pos++;
                      }
                  }
                  return pos + 1;
              }
          }

          class AfterPunctMarkFormatter extends ABaseFormatter {
              @Override
              protected int formatChar(StringBuilder buff, int pos, char ch) {
                  // add whitespace after punctuation mark
                  if (PUNCTUATION_CHARS.contains(ch)) {
                      // if dots are coming just move to next one;
                      // also skip a mark that is the last character of the buffer
                      if ('.' != ch && pos + 1 < buff.length()) {
                          // add whitespace between punct mark and word
                          ch = buff.charAt(pos + 1);
                          if (' ' != ch && '\n' != ch) {
                              buff.insert(pos + 1, ' ');
                          }
                      }
                  }
                  return pos + 1;
              }
          }

          class BeforePunctMarkFormatter extends ABaseFormatter {
              @Override
              protected int formatChar(StringBuilder str, int pos, char ch) {
                  // delete leading whitespaces before punctuation marks
                  if (PUNCTUATION_CHARS.contains(ch)) {
                      // count leading whitespaces to the punctuation mark
                      for (int start = pos; start > 1; start--) {
                          char prevCh = str.charAt(start - 1);
                          if (' ' != prevCh && '\n' != prevCh) {
                              str.replace(start, pos, "");
                              pos = start + 1;
                              break;
                          }
                      }
                      // new sentence should start from upper case letter
                      if (SENTENCE_SEPARATOR_CHARS.contains(ch) && (pos > 0 && str.charAt(pos - 1) != '.')) {
                          // find first letter if possible
                          while (pos < str.length()) {
                              if (Character.isLetter(str.charAt(pos))) {
                                  str.setCharAt(pos, Character.toUpperCase(str.charAt(pos)));
                                  break;
                              }
                              pos++;
                          }
                      }
                  }
                  return pos + 1;
              }
          }

          class LastDotFormatter extends ABaseFormatter {
              @Override
              protected String formatStr(String str) {
                  // scan backwards from the end of the string
                  for (int i = str.length() - 1; i >= 0; i--) {
                      char ch = str.charAt(i);
                      if (SENTENCE_SEPARATOR_CHARS.contains(ch)) {
                          break;
                      }
                      // add end of sentence char to the last character
                      // of the last sentence
                      if (Character.isLetter(ch)) {
                          return str + '.';
                      }
                  }
                  return str;
              }
          }

          class FirstCharToCapitalFormatter extends ABaseFormatter {
              @Override
              public String formatStr(String string) {
                  if (string.isEmpty()) {
                      return string;
                  }
                  return Character.toUpperCase(string.charAt(0)) + string.substring(1);
              }
          }

          class RemoveExtraWhitespacesFormatter extends ABaseFormatter {
              @Override
              public String formatStr(String string) {
                  // collapse runs of spaces: replace two spaces with one until none are left
                  while (-1 != string.indexOf("  ")) {
                      string = string.replaceAll("  ", " ");
                  }
                  return string;
              }
          }

          class StringToLowerCaseFormatter extends ABaseFormatter {
              @Override
              public String formatStr(String string) {
                  return string.toLowerCase();
              }
          }

          class TrimStringFormatter extends ABaseFormatter {
              @Override
              public String formatStr(String string) {
                  return string.trim();
              }
          }

          class DashFormatter extends ABaseFormatter {
              @Override
              protected int formatChar(StringBuilder buff, int pos, char ch) {
                  if ('-' == ch) {
                      // add whitespace before dash
                      if (pos > 0) {
                          char prev = buff.charAt(pos - 1);
                          if (' ' != prev && '\n' != prev) {
                              buff.insert(pos, ' ');
                              pos = pos + 1; // the dash has moved one position to the right
                          }
                      }
                      // add whitespace after dash
                      if (pos + 1 < buff.length()) {
                          char next = buff.charAt(pos + 1);
                          if (' ' != next && '\n' != next) {
                              buff.insert(pos + 1, ' ');
                          }
                      }
                  }
                  return pos + 1;
              }
          }

        And I decided to implement it in C++. Maybe it will be useful to HashCode, who knows... even though the task is almost elementary (and, I think, standard). I did everything with classes (how else!?), so it can be embedded into any system without going into the details of how the class works, knowing only one of its methods. Encapsulation, in other words. In any case, it makes little sense to implement such things without classes, since there is behavior here. Please welcome:

            #include <algorithm>
            #include <cctype>
            #include <clocale>
            #include <cstdlib>
            #include <fstream>
            #include <iostream>
            #include <string>
            #include <vector>
            using namespace std;

            class Corrector // our class implementing the proofreader
            {
            private:
                const char* filename;            // name of the file that holds the BAD-TEXT
                std::string text;                // text from that file
                std::vector<std::string> lines;  // all tokens (separator = ' ')

                static bool IsSentenceEnd(char c) { return c == '!' || c == '?' || c == '.'; }

                // cast through unsigned char: plain char may be negative for Cyrillic bytes
                static char Upper(char c) { return static_cast<char>(std::toupper(static_cast<unsigned char>(c))); }
                static char Lower(char c) { return static_cast<char>(std::tolower(static_cast<unsigned char>(c))); }

                void CreateText()
                {
                    for (vector<string>::iterator itr = lines.begin(); itr != lines.end(); ++itr)
                        text += *itr + " ";
                    transform(text.begin(), text.end(), text.begin(), Lower);
                    if (!text.empty())
                        text[0] = Upper(text[0]);
                    Correct();
                }

            public:
                Corrector(const char* fn) : filename(fn) { setlocale(LC_ALL, "Russian"); }

                void LoadText() // load the text
                {
                    std::string str;
                    ifstream f(filename);
                    while (f >> str) // stop as soon as a read fails
                        lines.push_back(str);
                    f.close();
                    CreateText();
                }

                void Correct() // fix the source text
                {
                    // indices instead of iterators: insert() and erase() invalidate iterators
                    for (string::size_type i = 0; i < text.size(); ++i)
                    {
                        char c = text[i];
                        // a space must follow a punctuation mark
                        if (c == ',' || c == '!' || c == '?' || c == '.' || c == ';')
                            if (i + 1 >= text.size() || text[i + 1] != ' ')
                                text.insert(i + 1, 1, ' ');
                        // no space before a sentence-ending mark
                        if (IsSentenceEnd(c) && i > 0 && text[i - 1] == ' ')
                            text.erase(i - 1, 1);
                        // capitalize the character two positions after a sentence-ending mark
                        if (i >= 2 && IsSentenceEnd(text[i - 2]))
                            text[i] = Upper(text[i]);
                    }
                }

                void SaveText(const char* fn) // save the formatted text to the file "fn"
                {
                    ofstream f(fn);
                    f << text << endl;
                    f.close();
                }

                void Print() { cout << text << endl; } // debugging method
            };

            int main()
            {
                setlocale(LC_ALL, "Russian");
                Corrector corr("badtext.txt");  // the proofreader
                corr.LoadText();                // load the text
                corr.Print();                   // debugging call
                corr.SaveText("formatted.txt");
                system("Pause");
            }

        Suppose the source text is:

         когда-то я был таким же шалопаем,как и вы,поэтому приношу свои извинения !! совершенно безграмотная сволочь(нет,нет,нет!) но а вы,а !? Так,что,"довай,до свыдааания!" !!! 

        Then the output is:

         Когда-то я был таким же шалопаем, как и вы, поэтому приношу свои извинения!! Совершенно безграмотная сволочь(нет, нет, нет! ) но а вы, а!? Так, что, "довай, до свыдааания! "!!! 

        The code was written in a hurry, so if any errors turn up, I will fix them...

        P.S. I apologize for the bulk, but this is C++... still, no harm done: the code is fully embeddable and scalable. Waiting for comments (on style too!) =)))

        And Vova should be banned for this.

        • 1
          I wonder, guys, what the downvote was for? Someone gave one. Better to explain where and what is wrong. - Salivan
        • 2
          @Asen, don't you think that the output is worse than the input text? There are spaces before the closing parenthesis and before the closing quotation mark. And before the opening parenthesis, on the contrary, there is no space. - Sleeping Owl
        • @Asen, you lose all runs of spaces and empty lines (that is, the author's structure of the text). The spec does not call for that. - avp
        • 1
          @Sleeping Owl, no, I don't think so) --- @avp, this is done on purpose, to collapse multiple spaces into one. - Salivan
        • @Asen, what do you mean, "I don't think so"? I can see what is written in your result. It has all the errors I pointed out (which were not in the original line). - Sleeping Owl
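
The defects pointed out in these comments (a space inserted before a closing parenthesis or quotation mark) come from adding a space after every punctuation mark unconditionally. One possible way around it, shown here as a minimal Python sketch and not as a fix to the C++ answer, is to add the space only when a letter follows; the helper name `space_after_punct` is hypothetical.

```python
import re

def space_after_punct(text):
    """Hypothetical helper: add a space after punctuation only if a letter follows."""
    # [^\W\d_] matches any Unicode letter (a word character that is not a digit or '_'),
    # so closing brackets, quotes and further punctuation marks are left untouched.
    return re.sub(r'([.,!?:;])(?=[^\W\d_])', r'\1 ', text)

print(space_after_punct('no,no,no!) but you?'))
print(space_after_punct('"go,go!"'))
```

This handles only the "space after" half of rule 5 and deliberately leaves runs such as `!!` and `!?` alone.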