Given: a string S and a number N. The string must be split into substrings so that:

  1. Splits occur only at spaces (the space itself is removed).
  2. The length of each substring does not exceed N characters, except when a single word is longer than N characters; such a word is not broken.

Example: S = "One two three largeandlong", N = 6
Output: "One", "two", "three", "largeandlong".

I need this algorithm to fit a long line into a table column of limited width.

Apparently, this needs to be done using regular expressions, but I can't find the right one.

  • But with input S = "One two three largeandlong" and N = 10, should the output be "One two", "three", "largeandlong"? - KoVadim
  • A regex cannot filter on a condition like N > 6; what a regex can do is, for example, cut the text into fragments of 6 characters each. - nick_n_a
  • @KoVadim Yes, exactly. - Reffum


The algorithm for such a split is as follows.

    start = 0
    limit = 6
    separator = 0
    while start < length of string:
        separator = find a separator, searching backward from position start+limit
        if start == separator:
            # this is a long substring
            separator = find a separator, searching forward from position start+limit to the end of the string
        add substring [start, separator) to the list
        start = separator + 1

The code is written in Python-like pseudocode, since the original question does not specify a language.
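
The pseudocode above can be turned into runnable Python roughly as follows. This is only a sketch under a few assumptions of mine: the function name `split_by_limit` is made up for illustration, words are separated by single spaces, and the string has no leading or trailing spaces.

```python
def split_by_limit(text: str, limit: int) -> list[str]:
    """Split text at spaces into chunks of at most `limit` characters.

    A single word longer than `limit` is kept whole.
    Assumes single spaces between words and no leading/trailing spaces.
    """
    chunks = []
    start = 0
    while start < len(text):
        if len(text) - start <= limit:
            # The remainder fits into one chunk: take it all.
            sep = len(text)
        else:
            # Search backward for the last space at index <= start + limit.
            sep = text.rfind(" ", start, start + limit + 1)
            if sep == -1:
                # No space in the window: this is a long word,
                # so search forward for its end instead.
                sep = text.find(" ", start + limit)
                if sep == -1:
                    sep = len(text)
        chunks.append(text[start:sep])
        start = sep + 1  # skip the separator itself
    return chunks


print(split_by_limit("One two three largeandlong", 6))
# ['One', 'two', 'three', 'largeandlong']
print(split_by_limit("One two three largeandlong", 10))
# ['One two', 'three', 'largeandlong']
```

The backward search window includes index `start + limit`, so a chunk of exactly `limit` characters is still accepted.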

    This solution is only possible in languages with lookbehind support (PHP, Python, and everything that PCRE normally supports).
    The problem is solved not by splitting ( split ) but by searching for all matches ( match_all ). Splitting would work only in languages that support variable-length lookbehind (C#, .NET), and even there it may not be possible to implement correctly.

    You can experiment here:
    https://regex101.com/r/Rvvwwz/2

    Regular expression:

     (?<=\s|^)(\S{11,}|\S.{0,9})(?<!\s)(?=\s|$) 

    Its meaning is quite simple (take N = 10 ):

    1. (?<=\s|^) the match must be preceded by whitespace or the start of the line
    2. The match itself is either of two alternatives:
      1. \S{11,} a run of at least 11 (N + 1) non-whitespace characters, i.e. a word longer than N
      2. \S.{0,9} a non-whitespace character followed by up to 9 (N - 1) arbitrary characters
    3. (?=\s|$) the match must be followed by whitespace or the end of the line
    4. (?<!\s) the last character of the match cannot be whitespace
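
For anyone who wants to try this from Python, here is a hedged sketch. Python's built-in `re` module only accepts fixed-width lookbehind, so `(?<=\s|^)` is swapped for the equivalent `(?<!\S)`; the rest of the pattern is unchanged. The function name `split_with_regex` and the sample string are my own, for illustration only.

```python
import re


def split_with_regex(text: str, limit: int) -> list[str]:
    """Collect chunks of at most `limit` characters that start and end on word boundaries.

    Assumes limit >= 1. Words longer than `limit` are returned whole.
    """
    pattern = re.compile(
        r"(?<!\S)"                 # preceded by whitespace or start of string
        r"(?:\S{%d,}|\S.{0,%d})"   # a long word, or up to `limit` characters
        r"(?<!\s)"                 # the match must not end in whitespace
        r"(?=\s|$)"                # followed by whitespace or end of string
        % (limit + 1, limit - 1)
    )
    return [m.group(0) for m in pattern.finditer(text)]


print(split_with_regex("One two three largeandlong", 10))
# ['One two', 'three', 'largeandlong']
print(split_with_regex("One two three largeandlong", 6))
# ['One', 'two', 'three', 'largeandlong']
```

The quantifier bounds are filled in from `limit`, so for `limit = 10` the compiled pattern contains `\S{11,}` and `\S.{0,9}`, as in the explanation above.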
    • I am writing the program in Visual Basic in Excel. Unfortunately, it does not accept this expression. - Reffum
    • You are wrong. VB is a language of the .NET family. You just need to call CreateObject "Application.Regex" and you get all the power of PCRE. I do not remember the exact CLASSID of the object; check with Google. - ReinRaus

    In my opinion, it would be simpler to do this without regular expressions. It would also be faster.

     function isWordCharacter(c) {
         return c >= 'а' && c <= 'я'; // TODO
     }

     var s = "один два три большоеидлинное а б в г д е ж большое,пусть а б в";
     var result = [], from = 0, prevSpace = 0, limit = 6;

     for (var i = 0; i <= s.length; i++) {
         var c = s[i]; // undefined one step past the end, which flushes the last chunk
         if (c == null || i - from >= limit) {
             if (!isWordCharacter(c)) {
                 // Not inside a word: cut right here.
                 result.push(s.substring(from, i));
                 from = c == ' ' ? i + 1 : i;
                 prevSpace = -1;
             } else if (prevSpace != -1) {
                 // Inside a word: cut at the last space seen.
                 result.push(s.substring(from, prevSpace));
                 from = prevSpace + 1;
                 prevSpace = -1;
             }
         }
         if (c == ' ') prevSpace = i; // remember the most recent space
     }

    Result:

     ["один", "два", "три", "большоеидлинное", "а б в", "г д е", "ж", "большое", ",пусть", "а б в"] 
    • Your algorithm does not work. Take the following input string: var s = "а б в г д е ж"; - KoVadim
    • And what if the input is "One, two, three large, but, let, a large, long"? Will something get lost? - nick_n_a
    • Ah, it seems I misunderstood the task. I will fix it now. - Surfin Bird
    • You chose the wrong language for comparing regular expressions with a "pure" implementation. In any bytecode-based language where regular expressions are implemented natively (and the JavaScript you used is one of them), regular expressions will be much faster than a "pure" implementation. And to head off further argument: even V8's compilation does not help, regular expressions still win. - ReinRaus
    • @ReinRaus, hm, what about JIT? - Surfin Bird

    Here is an example in Python:
    s = 'One two three largeandlong'
    print(s.split())

    Result:
    ['One', 'two', 'three', 'largeandlong']

    • When N = 10, the output should be: "One two", "three", "largeandlong" - Reffum
    • It is not clear what N is. - Andrei Kruzlik
    • @AndreyKruzlik N is the maximum number of characters in any one of the output substrings. - smellyshovel
    • @AndreyKruzlik That is, if two words together, including the space between them, are shorter than N, then they should form one output line, not two (even though they are separated by a space). - smellyshovel
    • @smellyshovel Yes. If two words together, including the space between them, are shorter than N, they must be on one line. - Reffum
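
The requirement spelled out in these comments (keep packing words into one line while it stays within N) looks close to what Python's standard `textwrap` module already does. A minimal sketch, assuming that collapsing runs of whitespace is acceptable; the wrapper name `wrap_for_column` is hypothetical:

```python
import textwrap


def wrap_for_column(text: str, width: int) -> list[str]:
    """Pack words into lines of at most `width` characters without breaking words."""
    return textwrap.wrap(
        text,
        width=width,
        break_long_words=False,   # a word longer than `width` stays whole
        break_on_hyphens=False,   # split only at whitespace
    )


print(wrap_for_column("One two three largeandlong", 10))
# ['One two', 'three', 'largeandlong']
print(wrap_for_column("One two three largeandlong", 6))
# ['One', 'two', 'three', 'largeandlong']
```

Unlike a plain `s.split()`, this joins neighbouring words back together as long as the line, including the spaces, fits into the width.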