1. The ellipsis must not break a word.
2. The ellipsis must not follow punctuation marks (even across one or more spaces). For example, endings such as ,… !!!… !!… !… - … must not be produced.
3. Special Unicode characters such as stress marks must be handled correctly.

Any programming language, or a ready-made tool.

Item 3 is optional.
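On item 3, a small sketch of why stress marks need care: a combining acute accent (U+0301) is a separate code point that is not alphabetic, so a naive `isalpha()` test sees a word boundary inside a stressed word. The helper name `is_word_char` is hypothetical and shows only one possible check, assuming that Unicode categories L* (letters) and M* (marks) should both count as part of a word.

```python
import unicodedata

def is_word_char(ch):
    """Treat letters and combining marks (Unicode categories L* and M*) as word characters."""
    return unicodedata.category(ch)[0] in ("L", "M")

# "замок" with a combining acute accent after the first "а"
stressed = "за\u0301мок"

# The combining accent is its own code point, and isalpha() is False for it.
print([ch.isalpha() for ch in stressed])
# With the category-based check every code point belongs to the word.
print(all(is_word_char(ch) for ch in stressed))
```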

Why regular expressions? A regular expression is compiled into an automaton, whereas alternative solutions written as explicit logic in scripting languages often create temporary arrays and the like. On the other hand, many script engines now use JIT compilation, so a solution with a plain loop may also turn out fine.

Example. Let N = 40 and take the following line:

 Lorem ipsum dolor sit amet!! Consectetur?! Adipiscing elit... Nam tincidunt ultricies congue (turpis duis). 

Then the result should be:

 Lorem ipsum dolor sit amet... 
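To make the requirements concrete, here is a minimal sketch of a checker for rules 1 and 2. It assumes that the ellipsis is counted in N (which the example above suggests, since "Consectetur" would fit in 40 characters without it) and that "word character" means `isalnum()`. The function name `check_truncation` and the violation messages are hypothetical.

```python
def check_truncation(source, result, n, suffix="..."):
    """Return a list of rule violations for a truncated `result` of `source`."""
    problems = []
    if len(result) > n:
        problems.append("longer than N")
    if not result.endswith(suffix):
        problems.append("missing ellipsis")
        return problems
    body = result[: len(result) - len(suffix)]
    if not source.startswith(body):
        problems.append("not a prefix of the source")
        return problems
    # Rule 2: the ellipsis must directly follow a letter or digit.
    if not body or not body[-1].isalnum():
        problems.append("ellipsis follows punctuation or a space")
    # Rule 1: the cut must not fall inside a word.
    elif len(body) < len(source) and source[len(body)].isalnum():
        problems.append("word is broken")
    return problems

text = ("Lorem ipsum dolor sit amet!! Consectetur?! Adipiscing elit... "
        "Nam tincidunt ultricies congue (turpis duis).")
print(check_truncation(text, "Lorem ipsum dolor sit amet...", 40))
print(check_truncation(text, "Lorem ipsum dolor sit amet!!...", 40))
print(check_truncation(text, "Lorem ipsum dol...", 40))
```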
  • Do you have to use regexps? I suspect this cannot be done with them at all. If I am right and you are only interested in the algorithm, then you should probably change the tag to "algorithm". - Flowneee
  • "Any language." Then it is impossible: each has its own nuances. By the way, what is the actual problem? The task is stated, but where is the code or the expression, and what does not work? - Wiktor Stribiżew
  • @WiktorStribiżew it is probably still meant to be a programming language (I hope, of course). - Flowneee
  • @Flowneee, I added to the question why regexps would be better. Any programming language, of course. - user239133

4 answers

Something like this (Python 3), without regexps; the ellipsis is counted in N:

    def truncate_string(str, N):
        substr = str[0: N]
        last_alpha = 0
        for i in range(0, N - 1):
            if str[i].isalpha() and not str[i+1].isalpha():
                last_alpha = i
        return substr[0: last_alpha + 1] + "…"

Check

Naturally, I did not add any checks such as "will an empty string be returned" or "is the input an empty string", since you did not describe how the function should behave in such cases, but I think you can add them yourself.

UPDATE

I slightly changed the logic so that the traversal starts from the end of the string, which is more efficient:

    def truncate_string(str, N):
        substr = str[0: N]
        last_alpha = 0
        for i in range(N - 1, 0, -1):
            if str[i-1].isalpha() and not str[i].isalpha():
                last_alpha = i
                break
        return substr[0: last_alpha] + "…"

Check

  • Everything seems OK, it only fails on strings shorter than N characters (but these are details, of course) - andy.37
  • @andy.37 Well, I did say that I added no checks; I was interested in the algorithm as a whole - Flowneee
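The failure on short input mentioned in the comments comes from indexing past the end of the string. Below is a minimal sketch of a guarded variant. It assumes (the question does not say) that a string which already fits is returned unchanged, that only the ellipsis is returned when no word boundary is found, and that N is at least the length of the ellipsis. The name `truncate_string_safe` is hypothetical.

```python
def truncate_string_safe(text, n, suffix="…"):
    """Cut `text` to at most `n` characters, suffix included, at the end of a word."""
    if len(text) <= n:
        return text  # assumption: short input is returned as-is
    limit = n - len(suffix)  # room left for the text itself
    # Walk backwards to the last position where a letter is followed by a non-letter.
    for i in range(limit, 0, -1):
        if text[i - 1].isalpha() and not text[i].isalpha():
            return text[:i] + suffix
    return suffix  # assumption: no usable word boundary within the limit

text = ("Lorem ipsum dolor sit amet!! Consectetur?! Adipiscing elit... "
        "Nam tincidunt ultricies congue (turpis duis).")
print(truncate_string_safe(text, 40))
print(truncate_string_safe("short", 40))
```

Like the original, this treats only `isalpha()` characters as parts of a word, so digits count as separators.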

With a regular expression (N = 6, not counting the ellipsis):

    import re

    # At most 6 characters, ending with a word character at a word boundary
    p = re.compile(r'^.{0,5}\w\b')
    for s in ('abc def', 'abcdefg', 'a, b, c, d', 'abcdef ghi'):
        m = p.search(s)
        if m:
            print(m.group(0) + '...')

Output:

    abc...
    a, b...
    abcdef...

Questions: what to do if

a) the line consists entirely of punctuation marks?

b) the first word in the string is longer than N characters?

UPDATE: the quantifiers must be "greedy", but that is almost always the default.
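For the two questions above, a quick Python 3 check shows what this pattern does: in both cases there is simply no match, so the caller has to decide what to return. In the sketch below the function name `truncate_re` and the `fallback` parameter are hypothetical; returning an empty string is just one possible choice, not something the question specifies.

```python
import re

def truncate_re(s, n, suffix="...", fallback=""):
    """Longest prefix of at most n characters that ends at a word boundary, plus suffix."""
    if n < 1:
        return fallback
    pattern = re.compile(r"^.{0,%d}\w\b" % (n - 1))
    m = pattern.search(s)
    return m.group(0) + suffix if m else fallback

print(repr(truncate_re("?!... --", 6)))    # a) only punctuation: there is no \w to end on
print(repr(truncate_re("abcdefg hi", 6)))  # b) the first word is longer than N
print(repr(truncate_re("abc def", 6)))     # ordinary case
```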

I don't think this is the final version, and I will try to extend it later, but perhaps it will point you toward a solution. My idea was to split the string into two substrings, then find the first few words in the second substring and append an ellipsis to them.

    <?php
    $text = "Lorem ipsum dolor sit amet!! Consectetur?! Adipiscing elit... Nam tincidunt ultricies congue (turpis duis).";
    $pre = substr($text, 0, 40);
    $after = substr($pre, 20, 40);
    $pattern = '/^((?:\S+\s+){2}\S+)/';
    preg_match($pattern, $after, $matches);
    $k = strlen($matches[1]) - 1;
    $str = $matches[1];
    while ($k > 0 && !ctype_alpha($str[$k])) {
        $str = substr($str, 0, -1);
        $k--;
    }
    echo substr($pre, 0, 20), $str, '...';

The solution works and can be used, but it is far from perfect and needs refinement.

    UPDATE

I found a bug, both in my own and in the other solutions on this page: a hyphenated phrase like test-test will always be cut at the first test if the function treats it as the last word, and the rest of the phrase is dropped. So, after playing with regex, I found a more elegant pattern ( /^(\w+(-|\s?)\w+)/ ), which finds either the first two words in $after or one hyphenated phrase like test-test. Any characters that follow do not affect the result.

    <?php
    $text = "Lorem ipsum dolor sit amet!! Consectetu-r?! Adipiscing elit... Nam tincidunt ultricies congue (turpis duis).";
    $pre = substr($text, 0, 40);
    $after = substr($pre, 20, 40);
    $pattern = '/^(\w+(-|\s?)\w+)/';
    preg_match($pattern, $after, $matches);
    $k = strlen($matches[1]) - 1;
    $str = $matches[1];
    while ($k > 0 && !ctype_alpha($str[$k])) {
        $str = substr($str, 0, -1);
        $k--;
    }
    echo substr($pre, 0, 20), $str, '...';
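The hyphen issue is not specific to PHP. As an illustration only, here is a Python sketch of one way to treat a hyphenated phrase as a single word. It assumes the ellipsis is counted in N and that a word is a run of `\w` characters optionally joined by single hyphens; the name `truncate_keep_hyphens` is hypothetical, and this is a different technique from the two-substring approach above.

```python
import re

# A word is a run of word characters, optionally joined by single hyphens.
WORD = re.compile(r"\w+(?:-\w+)*")

def truncate_keep_hyphens(text, n, suffix="..."):
    """Cut after the last whole word (hyphenated words count as one) that fits with the suffix."""
    if len(text) <= n:
        return text
    end = 0
    for m in WORD.finditer(text):
        if m.end() + len(suffix) > n:
            break
        end = m.end()
    return text[:end] + suffix

# "test-test" is either kept whole or dropped whole, never cut at the hyphen.
print(truncate_keep_hyphens("Lorem ipsum test-test dolor", 19))
print(truncate_keep_hyphens("Lorem ipsum test-test dolor", 24))
```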

Let me propose a variant in Java.

The basic idea of the algorithm: move from the beginning of the original string s toward the end, taking one word at a time (together with the punctuation marks in front of it), check that the result plus the ellipsis does not exceed the specified length, and append the word if it fits. The StringBuilder class is used to build the result.

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public static String truncateLine(String s, int n, String suf) {
        if (s.length() <= n) return s;
        if (n < suf.length()) throw new IllegalArgumentException();
        StringBuilder b = new StringBuilder(n);
        // Zero or more non-letters followed by one or more letters
        Pattern p = Pattern.compile("(?u)\\P{L}*\\p{L}+");
        Matcher m = p.matcher(s);
        while (m.find()) {
            String w = m.group();
            // Stop as soon as the next word plus the suffix no longer fits
            if (b.length() + w.length() + suf.length() > n) break;
            b.append(w);
        }
        return b.toString() + suf;
    }

    public static String truncateLine(String s, int n) {
        return truncateLine(s, n, "...");
    }

Usage

       truncateLine("Мама мыла раму", 12); 

      Result

    Мама мыла...