Tell me the algorithm for splitting a string into substrings

Question

Given: a string S and a number N. The string must be split into substrings so that:

1. Splitting occurs at spaces (the space itself is removed).
2. The length of each substring does not exceed N characters, except when a single word is longer than N characters; such a word is not broken.

Example: S = "One two three large long", N = 6
Output: "One", "two", "three", "large", "long".

I need this algorithm to fit a long line into a table column of limited width.

Apparently, this needs to be done using regular expressions, but I can't find the right one.

N = 10, should the output be: "One two", "three", "large long"?
A regex cannot filter on a condition such as N > 6; what it can do, for example, is cut the text into fragments of 6 characters each.

KoVadim · Accepted Answer · 2016-10-24T14:48:47

The algorithm for such a split is as follows.

 start = 0
 limit = 6
 separator = 0
 while start < length of string:
     separator = find a separator, searching backwards from position start+limit
     if start == separator:
         this is a long substring
         separator = find a separator, searching from position start+limit towards the end of the string
     add substring [start, separator) to the list
     start = separator + 1

The code is written in Python-like pseudocode, since the language is not specified in the original question.
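For readers who want something runnable, here is a minimal Python sketch of the greedy idea described above. It is an interpretation, not the answer author's code: the function name `split_by_width` is hypothetical, words are assumed to be separated by single spaces, and an explicit check that the remainder already fits has been added because the pseudocode leaves that case open.

```python
def split_by_width(s: str, limit: int) -> list[str]:
    """Greedy split at spaces; a word longer than `limit` is kept whole."""
    parts = []
    start = 0
    while start < len(s):
        if len(s) - start <= limit:
            # The rest of the string fits into one piece.
            end = len(s)
        else:
            # Search backwards: last space at an index <= start + limit.
            end = s.rfind(" ", start, start + limit + 1)
            if end == -1:
                # No space in the window, so this is a long word:
                # search forwards for the next space instead.
                end = s.find(" ", start + limit)
                if end == -1:
                    end = len(s)
        if end > start:  # skip empty pieces produced by repeated spaces
            parts.append(s[start:end])
        start = end + 1  # step over the space itself
    return parts


print(split_by_width("One two three large long", 6))
print(split_by_width("One two three large long", 10))
```

With the example from the question, N = 6 should yield five single-word pieces and N = 10 should yield "One two", "three", "large long".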

ReinRaus · Answer 2 · 2016-10-26T08:10:09

This solution is only possible in languages that support lookbehind assertions (PHP, Python, and everything that supports PCRE properly).
The problem is solved not by splitting (split) but by searching for matches (match_all). By splitting it can be solved only in languages that support lookbehind of unlimited length (C#, .NET), and even there it may not be possible to implement it correctly.

You can experiment here:
https://regex101.com/r/Rvvwwz/2

Regular expression:

 (?<=\s|^)(\S{11,}|\S.{0,9})(?<!\s)(?=\s|$)

It means something very simple (let N = 10):

(?<=\s|^) the match must be preceded by whitespace or the start of the line
The match itself is either of the alternatives:
1. \S{11,} a run of 11 or more (N + 1 or more) non-whitespace characters
2. \S.{0,9} a non-whitespace character followed by up to 9 (N - 1) arbitrary characters
(?<!\s) the last character of the match cannot be whitespace
(?=\s|$) the match must be followed by whitespace or the end of the line
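As a rough illustration of the expression above, here is a minimal Python sketch that builds the pattern for an arbitrary N and collects all matches. Two things are assumptions rather than part of the answer: the function name `split_with_regex` is hypothetical, and the leading `(?<=\s|^)` is replaced with `(?<!\S)`, because Python's built-in `re` module only accepts fixed-width lookbehind and rejects an alternation of `\s` and `^`. The two forms should accept the same positions (start of string or after whitespace).

```python
import re


def split_with_regex(s: str, n: int) -> list[str]:
    """Collect pieces of at most n characters; longer words stay whole."""
    # (?<!\S)      not preceded by a non-space (start of string or after a space)
    # \S{n+1,}     alternative 1: a word longer than n characters
    # \S.{0,n-1}   alternative 2: up to n characters starting with a non-space
    # (?<!\s)      the piece must not end in whitespace
    # (?=\s|$)     the piece must be followed by whitespace or the end
    pattern = r"(?<!\S)(?:\S{%d,}|\S.{0,%d})(?<!\s)(?=\s|$)" % (n + 1, n - 1)
    return re.findall(pattern, s)


print(split_with_regex("One two three large long", 10))
```

The greedy `.{0,n-1}` first grabs as much as it can and then backtracks to the nearest position where a word ends, which is what produces the longest piece that still fits.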

You just need to call CreateObject "Application.Regex" to get all the power of PCRE.

Answer 3 · 2016-10-24T13:59:30

In my opinion, it would be just easier to do without regular expressions. And it will be faster.

 function isWordCharacter(c) {
     return c >= 'а' && c <= 'я'; // TODO
 }

 var s = "один два три большоеидлинное а б в г д е ж большое,пусть а б в";
 var result = [], from = 0, prevSpace = 0, limit = 6;
 for (var i = 0; i <= s.length; i++) {
     var c = s[i];
     if (c == null || i - from >= limit) {
         if (!isWordCharacter(c)) {
             result.push(s.substring(from, i));
             from = c == ' ' ? i + 1 : i;
             prevSpace = -1;
         } else if (prevSpace != -1) {
             result.push(s.substring(from, prevSpace));
             from = prevSpace + 1;
             prevSpace = -1;
         }
     }
     if (c == ' ') prevSpace = i;
 }

Result:

 ["один", "два", "три", "большоеидлинное", "а б в", "г д е", "ж", "большое", ",пусть", "а б в"]

Try the following input string: var s = "а б в г д е ж";
And with "One, two, three large, but, let, a large, long", will text get lost?
You chose the wrong language for comparing regular expressions with a "pure" implementation.
In any language that runs on bytecode and implements regular expressions natively (as the JavaScript you used does), regular expressions will be much faster than a "pure" implementation (and to head off further argument: even V8's compilation does not help; regular expressions still win).

Answer 4 · 2016-10-24T14:56:58

Here is an example in Python:
 s = 'One two three large long'
 print(s.split())

Result:
 ['One', 'two', 'three', 'large', 'long']

Andrey Kruzlik


When N = 10, it should be: "One two", "three", "large long" - Reffum
It's not clear what N is - Andrei Kruzlik
@AndreyKruzlik N is the maximum number of characters in any one of the output substrings. - smellyshovel
@AndreyKruzlik That is, if 2 words, together with the space between them, are shorter than N, then these 2 words should form one output line, not two. - smellyshovel
@smellyshovel Yes. If 2 words, together with the space between them, are shorter than N, they must be one line. - Reffum
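For Python specifically, the behaviour clarified in the comments above looks close to what the standard library's `textwrap.wrap` does when long words are not broken. The sketch below rests on that assumption; the wrapper name `wrap_for_column` is hypothetical, and `break_on_hyphens=False` is added so that splitting happens only at whitespace.

```python
import textwrap


def wrap_for_column(s: str, n: int) -> list[str]:
    """Split s into lines of at most n characters, keeping long words whole."""
    return textwrap.wrap(
        s,
        width=n,
        break_long_words=False,  # a word longer than n stays on its own line
        break_on_hyphens=False,  # split only at whitespace
    )


print(wrap_for_column("One two three large long", 6))
print(wrap_for_column("One two three large long", 10))
```

Note that `textwrap.wrap` also collapses runs of whitespace by default, which may or may not be wanted for table cells.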
