A fairly common gotcha: code like

r = "" for c in s: r += c 

may have O(len(s)**2) complexity in Python. But does it really, and if so (or not), why?

  • The complexity is clearly linear: ideone.com/wqfd8C, here increasing the size 10 times changes the speed 10 times. It would most likely be quadratic if a string were an immutable object that had to be copied entirely on each operation, but apparently that is not the case here (or the optimizer kicked in) - pavel
  • 2
    @pavel it is not always linear: Python strings are immutable. Try running it under Jython, for example. Here is a detailed description from the creator of the language. CPython (since 2.5) can sometimes optimize code like r += c as a special case. Since it is hard to explain to newcomers when that optimization applies, the advice is to always use ''.join instead of s +=. Here is a good answer from Alex Martelli to a related question. - jfs
  • @pavel: to understand the fragility of the optimization, compare: python -mtimeit -s 's="a"*1000000' $'r = ""\nfor c in s:\n    r += c' and python -mtimeit -s 's="a"*1000000' $'r = ""\nfor c in s:\n    r += c\nt=r' - jfs
  • @jfs so maybe post that as an answer right away? :) - pavel
  • @pavel an answer would need to explicitly describe, in accessible form, why the immutability of strings leads to quadratic behavior. You seem to understand why this happens, so try to write up your understanding as an answer. - jfs

1 answer

Let's consider this question in general, across different languages.

Case 1: strings are mutable. A string is a pointer to some array (or other data structure) of characters, possibly with a wrapper providing helper functions. Example languages: C (char *) and C++ (std::string).

In this case the code from the question essentially degenerates into the ordinary appending of an element to an array. And unless the implementation is a poorly written one-off, the amortized complexity of an append is no more than the length of what is being appended.

So the complexity of one iteration is obviously O(1), and the total is O(len).
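Python itself has no mutable str, but bytearray is a mutable byte-string and behaves like the growable buffer described here: each append is amortized O(1), so building byte by byte stays linear. A minimal sketch of the idea:

```python
# bytearray is Python's mutable byte-string: appending is amortized O(1),
# analogous to appending a character to a growable buffer in C/C++.
s = b"hello world"
buf = bytearray()
for b in s:
    buf.append(b)      # amortized O(1) per byte
result = bytes(buf)    # one final O(len) copy to an immutable value
assert result == s
```

The same reasoning is why CPython's list.append is cheap: both grow an underlying array with spare capacity instead of reallocating on every element.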

The advantages of this approach are speed: fast appends, the ability to replace characters in place, cheap creation of substrings. The downside is that you must be very careful about what changes the string and how, especially in multi-threaded code.

Case 2: strings are immutable. That is, a string object is created once and cannot change its state, directly or indirectly. This is typical for Java (and for Python).

In this case every concatenation creates a temporary object: the entire old string is copied, plus whatever is being appended. Roughly speaking (pseudocode):

 # s - string
 r[0] = ''
 for (i = 1; i <= s.length(); i++)
     r[i] = String.gen(r[i-1], s[i-1])

So we create Len strings with lengths from 0 to Len inclusive; it is easy to see that their total length is Len * (Len + 1) / 2 ~ O(Len^2).
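The arithmetic can be checked directly: building a string of length n one character at a time copies 1 + 2 + ... + n = n(n+1)/2 characters in total. A small sketch (the helper name is mine, for illustration):

```python
def chars_copied(n):
    # Naive immutable concatenation: step i copies the old string of
    # length i-1 plus 1 new character, i.e. i characters in total.
    return sum(i for i in range(1, n + 1))

assert chars_copied(4) == 10                     # 1 + 2 + 3 + 4
assert chars_copied(1000) == 1000 * 1001 // 2    # n*(n+1)/2, i.e. ~n^2
```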

Incidentally, without a decent garbage collector (though languages of this kind usually have one), all those discarded temporaries can easily exhaust memory.

The pros and cons are the reverse of case 1: immutable strings are easier and safer to work with, but slower.

But every such language has a mechanism for building strings without losing performance. It may be a string constructor that takes a collection, or special utility classes (StringBuilder and the like). Using them speeds the process up considerably and achieves O(len) asymptotics in this example.

A smart compiler, if it can prove the code equivalent (which is not difficult in such a simple situation), can replace the direct string concatenation with the utility version, ensuring a speed gain. This is exactly what happened with Python in the example http://ideone.com/wqfd8C .

However, relying on this is dangerous. Quoting the comments:

Try running it under Jython, for example. Here is a detailed description from the creator of the language. CPython (since 2.5) can sometimes optimize code like r += c as a special case. Since it is hard to explain to newcomers when that optimization applies, the advice is to always use ''.join instead of s +=. Here is a good answer from Alex Martelli to a related question.

(The link to the question is https://stackoverflow.com/a/1350289/4279 )

To understand the fragility of the optimization, compare: python -mtimeit -s 's="a"*1000000' $'r = ""\nfor c in s:\n    r += c' and python -mtimeit -s 's="a"*1000000' $'r = ""\nfor c in s:\n    r += c\nt=r'

In the second case the compiler lacks the "quick-wittedness" and executes exactly what is written, head-on, which is much slower. Hence, I repeat: use the special language constructs and make the compiler's life easier.
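The "compiler" here compiles Python source to bytecode, which the interpreter then executes; you can inspect that bytecode with the standard dis module. A sketch (note: on CPython 3.11+ the in-place add shows up as a BINARY_OP instruction rather than INPLACE_ADD):

```python
import dis

def build(s):
    r = ""
    for c in s:
        r += c   # compiled to a single in-place-add instruction
    return r

ops = {ins.opname for ins in dis.get_instructions(build)}
# The same instruction runs whether or not CPython's string optimization
# applies; the optimization lives inside that instruction's implementation.
assert "INPLACE_ADD" in ops or "BINARY_OP" in ops
```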

  • I.e. the reason is that the longer the string, the longer it takes to create it? - Vladimir Gamalyan
  • 2
    The beginning is good, but the part about the compiler is wrong. It is not about the compiler, but about the implementation of the interpreter. In Jython, and even in PyPy with JIT, this is a quadratic operation. Python code is compiled into bytecode, which is then interpreted. Both with the optimization and without it (t=r), the same INPLACE_ADD operation is performed; the only difference is in its specific implementation for strings: presumably += notices that r is the only reference to the string and therefore modifies it in place (instead of creating a new string and discarding the existing one). t=r creates another reference to the string, which disables the optimization. - jfs
  • 1
    I wonder: if, when appending a character, we realloc doubling the buffer size (a common tactic), what will the complexity be, n log n? - avp
  • @avp no, it is n. In total, over the whole run, 2n memory will be allocated. - pavel
  • @jfs edited :) I said right away that I only knew the answer part; I can make it more general to keep it simpler. - pavel
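On the doubling question from the comments: with capacity doubling, the total memory ever allocated is a geometric series (1 + 2 + 4 + ... < 4n cells), so appends stay amortized O(1) and the whole build is O(n), not O(n log n). A quick sketch of the accounting (the helper name is mine, for illustration):

```python
def total_allocated(n):
    # Grow a buffer to hold n items, doubling capacity on each realloc.
    # Capacities go 1, 2, 4, ..., so the total allocation is geometric.
    cap, total = 1, 1
    while cap < n:
        cap *= 2
        total += cap
    return total

# The sum stays within a small constant factor of n, not n log n.
for n in (10, 1000, 10**6):
    assert total_allocated(n) < 4 * n
```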