I have a large text of about 400,000 characters, read from a .txt file, that needs to be split into fragments shorter than about 9,100 characters, and each fragment must end with a period. So far, the only solution I have come up with is to use str.split("."), count the characters in each sentence and, when the counter approaches 9,100, cut the fragment at the nearest period and continue. Are there any libraries that could simplify this? Many thanks to anyone who helps.

  • Post an example of your solution. - S. Nick
  • repl.it/@shardakov/MediumVelvetyCone-1 This is all I have managed to come up with so far - dm.shardakov
  • And what if a period in the text does not mark the end of a sentence? For example: decimal numbers, initials (Last Name, O.), ellipses. - mkkik
  • I thought about this, but nothing has come to mind so far. I wrote this only to translate one book and tested it on that book; now I'll think about how to improve the code. - dm.shardakov
  • Perhaps something similar has already been implemented in natural-language tokenization tools such as NLTK. - mkkik
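
To illustrate the point raised in the comments about periods that do not end a sentence, here is a minimal standard-library sketch. It is only a heuristic and assumes English text: the name `split_sentences` and the boundary rule (whitespace after `.`, `!` or `?`, not preceded by a single-letter initial, followed by an uppercase letter or an opening quote or bracket) are illustrative assumptions, not something taken from the thread. A dedicated tokenizer will handle far more cases.

```python
import re

# Heuristic sentence boundary (an assumption, tuned for English text):
# whitespace that follows '.', '!' or '?', is not preceded by a
# single-letter initial such as "O.", and is followed by an uppercase
# letter or an opening quote/bracket.
SENTENCE_BOUNDARY = re.compile(r'(?<=[.!?])(?<!\b[A-Z]\.)\s+(?=[A-Z"\'(])')


def split_sentences(text):
    """Split text into sentences using the heuristic boundary above."""
    return [s for s in SENTENCE_BOUNDARY.split(text.strip()) if s]


sample = "Pi is about 3.14. Smith, O. wrote it... or so they say. Is that all? Yes!"
for sentence in split_sentences(sample):
    print(sentence)
```

The decimal point in "3.14", the initial "O." and the ellipsis are not treated as sentence ends here. The trade-off is that a sentence that really does end in a single capital letter ("plan B.") will not be split either.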

3 answers

Try my example and let me know how it goes :)

    from textwrap import wrap  # <---

    text_input = '1.txt'
    with open(text_input, 'r') as f1:
        lines = f1.read()

    listWrap = wrap(lines, width=9000)

    text_output = '2.txt'
    with open(text_output, 'w') as f2:
        moveTextToNewLine = ''
        for line in listWrap:
            point = max(line.rfind(". "), line.rfind("! "), line.rfind("? "))
            if point == -1:
                moveTextToNewLine = moveTextToNewLine + line
            else:
                newLine = "{}{}\n".format(moveTextToNewLine, line[0:point+1])
                f2.write(newLine)
                moveTextToNewLine = line[point+2:] + " "
        f2.write(moveTextToNewLine + "\n")

  • Thanks for the answer, your solution does the job. - dm.shardakov 1:28 pm

Assuming the task can always be completed (i.e. there is no fragment without a period that is longer than the limit):

    s = 'abc.d.ef.g.ij.klmn.op.rq.a.tu.vwx.yz.'
    sz = 5
    lastpoint = -1
    start = 0
    for i in range(len(s)):
        if s[i] == '.':
            lastpoint = i
        if i - start >= sz:
            print(s[start:lastpoint+1])
            start = lastpoint + 1
    if start < len(s):
        print(s[start:len(s)])

Output:

    abc.d.
    ef.g.
    ij.
    klmn.
    op.rq.
    a.tu.
    vwx.
    yz.

For sz = 7:

    abc.d.
    ef.g.ij.
    klmn.op.
    rq.a.tu.
    vwx.yz.
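
The same idea can be wrapped in a reusable function. The sketch below is one possible variant, not the code from the answer: it uses `str.rfind` instead of a character-by-character loop, treats `limit` as the maximum fragment length in characters, and raises an error when the "there is always a period" assumption does not hold. The name `split_at_periods` is made up for illustration.

```python
def split_at_periods(text, limit):
    """Yield consecutive pieces of `text`, each at most `limit` characters
    long and ending with a period (the last piece is whatever remains)."""
    start = 0
    while len(text) - start > limit:
        # Last period that still fits in the window text[start:start + limit].
        cut = text.rfind(".", start, start + limit)
        if cut == -1:
            raise ValueError(
                "no period within {} characters of position {}".format(limit, start)
            )
        yield text[start:cut + 1]
        start = cut + 1
    if start < len(text):
        yield text[start:]


s = 'abc.d.ef.g.ij.klmn.op.rq.a.tu.vwx.yz.'
for piece in split_at_periods(s, 6):
    print(piece)
```

With `limit=6` this prints the same fragments as the loop above does for sz = 5, because that loop lets a fragment grow to sz + 1 characters.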

The point of the task I set myself was to feed parts of the text to the Yandex translator, but since it accepts only 10,000 characters at a time, the text had to be divided into smaller fragments. In the end I got something similar, but I still need to tweak a few things to get the final version.

    import requests

    URL = "https://translate.yandex.net/api/v1.5/tr.json/translate"
    # Your API key
    KEY = "trnsl.1.1.20190429T065522Z.18f4830c20ee2f4b.b940037513ea0a3e99dd7129948c0a456985e3d6b"

    text_input = r'1.txt'
    text_output = r'2.txt'
    text_ru = r'3.txt'


    def rep_pdf_to_txt(text_in, text_out):
        # Join hard-wrapped lines; keep a break after lines that end with a period.
        with open(text_in, 'r') as f1, open(text_out, 'w') as f2:
            for line in f1:
                if line.endswith(".\n"):
                    f2.write(line + "\n")
                else:
                    f2.write(line.replace("\n", " ") + " ")


    def split_text(text_out):
        with open(text_out, 'r') as f2:
            sentences = f2.read().split(".")
        n = 0
        tmp_str = ""
        tmp_list = []
        for sentence in sentences:
            n += len(sentence)
            tmp_str += sentence + "."
            if n > 9000:
                tmp_list.append(tmp_str)
                tmp_str = ""
                n = 0
        # Keep the final fragment, which is usually shorter than the limit.
        if tmp_str.strip(". \n"):
            tmp_list.append(tmp_str)
        return tmp_list


    def translate(my_text):
        params = {
            "key": KEY,
            "text": my_text,
            "lang": 'en-ru'
        }
        response = requests.get(URL, params=params)
        return response.json()


    # input parameters
    '''
    my_text = "Test params for my text input for translate"
    json = translate(my_text)
    print(''.join(json["text"]))
    '''

    rep_pdf_to_txt(text_input, text_output)
    t_list = split_text(text_output)

    with open(text_ru, 'w') as f3:
        for i in t_list:
            f3.write(''.join(translate(i)["text"]))
            # Attribution line: "Translated by the Yandex.Translate service"
            f3.write("\n Переведено сервисом «Яндекс.Переводчик» http://translate.yandex.ru/ \n")
  • "Переведено сервисом «Яндекс.Переводчик»" ("Translated by the Yandex.Translate service"): but why add this? :) - gil9red
  • It is Yandex's requirement for how translation results are presented: tech.yandex.ru/translate/doc/dg/concepts/… Although I messed up, that line needs to be moved - dm.shardakov
  • Wow, thanks :) At least their service can be used without an API key, and thanks for that :D - gil9red
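
One way to read the remark about moving the attribution line is that it should be written once rather than after every fragment. The sketch below shows that arrangement under that assumption; where exactly the attribution has to appear is defined by Yandex's terms, which are not reproduced in this thread. `write_translation` and its `translate_fragment` parameter are hypothetical names: the callable stands in for whatever function returns the translated string for one fragment.

```python
ATTRIBUTION = "Переведено сервисом «Яндекс.Переводчик» http://translate.yandex.ru/"


def write_translation(fragments, translate_fragment, out_path):
    """Translate each fragment, write the results to out_path, and add
    the attribution line once at the end (assumed placement)."""
    with open(out_path, "w", encoding="utf-8") as out:
        for fragment in fragments:
            # translate_fragment is a hypothetical callable: str -> str
            out.write(translate_fragment(fragment))
            out.write("\n")
        out.write("\n" + ATTRIBUTION + "\n")
```

With the code from the answer, the callable could be something like `lambda text: ''.join(translate(text)["text"])`.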