📜 ⬆️ ⬇️

Defective Embedding of Functions in Go


Is the performance below equivalent in performance?


// (A). Вызов HasPrefix будет встроен. return strings.HasPrefix(s, "#") // (B). Ручное встраивание тела HasPrefix. return len(s) >= len("#") && s[:len("#")] == "#" 

The answer is no .


For details and explanations I ask under cat.




Good day, before you open the topic, I would like to introduce myself.
My name is Iskander and I send commits to the golang / go repository from time to time.


image

I used to do it on behalf of the Intel Go team, but our paths diverged and now I am an independent contributor. Recently I work in vk in the infrastructure team.


In my free time I make different tools for Go, such as go-critic and go-consistent . I also draw gophers .




Measure it!


Immediately proceed to the comparison and outline the benchmark:


 package benchmark import ( "strings" "testing" ) var s = "#string" func BenchmarkHasPrefixCall(b *testing.B) { for i := 0; i < bN; i++ { _ = strings.HasPrefix(s, "#") _ = strings.HasPrefix(s, "x") } } func BenchmarkHasPrefixInlined(b *testing.B) { for i := 0; i < bN; i++ { _ = len(s) >= len("#") && s[:len("#")] == "#" _ = len(s) >= len("x") && s[:len("x")] == "x" } } 

Instead of recommending benchstat to you, I will show you benchrun .


With the help of one command, we can run both benchmarks and get a comparison:


 go-benchrun HasPrefixCall HasPrefixInlined -v -count=10 . Benchstat results: name old time/op new time/op delta HasPrefixCall-8 9.15ns ± 1% 0.36ns ± 3% -96.09% (p=0.000 n=10+9) 

The version with manual embedding is much faster than the code that was obtained by embedding the function body with the compiler. Let's try to figure out why this is happening.


strings.HasPrefix


Recall the strings.HasPrefix implementation:


 // HasPrefix tests whether the string s begins with prefix. func HasPrefix(s, prefix string) bool { return len(s) >= len(prefix) && s[0:len(prefix)] == prefix } 

The HasPrefix function HasPrefix built in by the compiler.
You can check this as follows:


 go build -gcflags='-m=2' strings 2>&1 | grep 'can inline HasPrefix' 

To call strings.HasPrefix from option (A) we get the following machine code:


  MOVQ (TLS), CX CMPQ SP, 16(CX) JLS more_stack fn_body: SUBQ $40, SP MOVQ BP, 32(SP) LEAQ 32(SP), BP XCHGL AX, AX MOVQ s+56(SP), AX CMPQ AX, $1 JGE compare_strings XORL AX, AX MOVB AL, ~ret1+64(SP) MOVQ 32(SP), BP ADDQ $40, SP return: RET compare_strings: MOVQ s+48(SP), AX MOVQ AX, (SP) LEAQ go.string."#"(SB), AX MOVQ AX, 8(SP) MOVQ $1, 16(SP) CALL runtime.memequal(SB) MOVBLZX 24(SP), AX JMP return more_stack: CALL runtime.morestack_noctxt(SB) JMP fn_body 

Do not pay attention to what the code looks like noodles.


What you should pay attention to:



But what then is generated for the manually built-in variant, the code from the example (B) ?


  MOVQ s+16(SP), AX CMPQ AX, $1 JLT different_length MOVQ s+8(SP), AX CMPB (AX), $35 // 35 - код символа "#" SETEQ AL return: MOVB AL, "".~ret1+24(SP) RET different_length: XORL AX, AX JMP 22 

And here the compiler does not generate a runtime.memequal call and compares a single character directly. Ideally, he should have done the same for the first option.


We observe the weak side of the Go optimizer, and we will analyze it.


Optimize constant expressions


The reason why the call to strings.HasPrefix(s, "#") can be optimized is because the prefix argument is constant. We know its length and content. It makes no sense to call runtime.memequal for short lines, it is faster to make a comparison of characters "in place", avoiding an extra call.


As you know, compilers usually have at least two parts: the compiler frontend and the compiler backend . The first works with a higher-level view, the second is closer to the machine and the intermediate view will look like a stream of instructions. Already several versions in Go use the SSA representation for optimizations in the backend part of the compiler.


The folding of constants, such as {10*2 => 20} , is implemented in the backend. In general, most of the operations associated with lowering the computational cost of expressions are in this part of the compiler. But there are exceptions.


One exception is the optimization of constant string comparisons. When the compiler sees a string comparison (or substrings) in which one or both operands are constants, a more efficient code is generated than a runtime.memequal call.


You can view the source code for this in cmd / compile / internal / gc / walk.go: 3362 .


Embedding of functions occurs before running these optimizations, but also in the frontend part of the compiler.


It would seem that all the same does not allow this optimization to work in our case?


How Go builds function calls


This is how embedding will occur:


 // Вот как выглядел вызов: return strings.HasPrefix(s, "#") // Вот сигнатура: func HasPrefix(s, prefix string) bool // А вот результат встраивания: _s, _prefix := s, "#" return len(s) >= len(prefix) && s[:len(prefix)] == prefix 

When embedding functions, the compiler assigns arguments to temporary variables, which breaks the optimization, since the algorithm in walk.go does not see the constants, but the arguments with variables. This is the problem.


By the way, optimizations from backend, which are available to SSA, do not interfere. But there are other problems there, for example, the inability to restore high-level language constructs for their effective comparison (work to eliminate this shortcoming has been “planned” for several years).


Another example: escape analysis


Imagine a function that is important to allocate a temporary buffer on the stack:


 func businessLogic() error { buf := make([]byte, 0, 16) // buf используется только локально // для временного хранения результатов. return nil } 

Since buf does not "run away", the compiler will be able to allocate these 16 bytes on the stack, without allocation on the heap. Again, all thanks to the constant value when invoking make . To allocate memory on the stack, it is important for us to know the required size, which will be part of the frame allocated for the function call.


Suppose in the future we wanted to allocate temporary buffers of different sizes and encapsulate some logic in the methods. We introduced a new abstraction and decided to use the new tmpBuf type tmpBuf . The design function is extremely simple:


 func newTmpBuf(sizeHint int) tmpBuf { return tmpBuf{buf: make([]byte, 0, sizeHint)} } 

Adapt the original example:


 func businessLogic() error { - buf := make([]byte, 0, 16) + buf := newTmpBuf(16) // buf используется только локально // для временного хранения результатов. return nil } 

The constructor will be embedded, but the allocation will now always be on the heap, for the same reason passing arguments through temporary variables. Escape analysis will see make([]byte, 0, _sizeHint) that does not fall under its recognition pattern of optimized make calls.


If we had "everything just like people", there would be no problem, after embedding the constructor newTmpBuf it would be clear that the size is still known at the compilation stage.


This saddens almost more than the situation with string comparison.


Horizons Go 1.13


The situation can be corrected quite easily and I have already sent the first part of the solution .


image

If you think that the problem described in the article really needs a solution, please put a thumbs up in the appropriate issue .





My position is that embedding the code with your hands just because it runs faster in the current version of Go is wrong. It is necessary to correct this flaw in the optimizer, at least to the level that the examples described above work without unexpected performance regressions.


If everything goes according to plan, this optimization will go into the release of Go 1.13.


Thanks for attention.


Addition: proposed solution


This section is for the bravest, those who are not yet tired of reading.


So, we have several places that work worse when using variables instead of their values ​​directly. The proposed solution is to introduce a new function in the frontend part of the compiler, which allows you to get the last bound value by name. After that, in each optimization that expects a constant value, do not give up when a variable is detected, but receive this previously saved state.


The signature of our new feature might look like this:


 func getConstValue(n *Node) *Node 

The Node definition can be viewed in the syntax.go file.


Each variable definition corresponds to a Node with an ONAME tag. Inside Node.Name.Defn for most of these variables there is an initialization value.


If the Node already a literal, nothing needs to be done, and we simply return n . If this is ONAME (variable), then you can try to extract from the n.Name.Defn the same initialization value.


But what about the modifications between declaring and reading the variable for which we call getConstValue ? If we confine ourselves to read-only variables, then there is no problem. In the frontend go, there are already special node flags that mark similar names. If the variable has been modified, getConstValue will not return an initial value.


Programmers usually do not modify the input arguments of numeric and string types, and this makes it possible to cover a fairly large number of cases with this primitive algorithm.


Now we are ready to consider the implementation of:


 func getConstValue(n *Node) *Node { // Мы раскрываем только ONAME у которых доступен definition. if n.Op != ONAME || n.Name.Defn == nil { return n } // Проверка на то, что инициализирующее значение не изменялось. // Заметим, что это очень консервативно, но нашу задачу по // исправлению проблемы встраивания функций и escape analysis'а решает. maybeModified := n.Assigned() || n.Name.Defn.Assigned() || n.Addrtaken() if maybeModified { return n } // OAS - Node типа присваивания. // n.Name.Defn.Left - это LHS. // n.Name.Defn.Right - это RHS. // consttype(v) возвращает константный тип инициализирующего выражения. // Если это CTxxx, то переданное выражение не является константным. if n.Name.Defn.Op == OAS { v := n.Name.Defn.Right if v != nil && consttype(v) != CTxxx { return v } } return n } 

Something like this changes the code, which depends on the constants:


 - i := indexconst(r) + i := indexconst(getConstValue(r)) 

Great, and it even works:


 n := 10 xs := make([]int, n) // Теперь не убегает в кучу! 

Before this change, escape analysis could not get a value of 10 through n , which made it an assumption that it was necessary to place xs on the heap.


The code above is syntactically similar to the situation observed when embedding. n may be a temporary variable that is added when the argument is passed.


Unfortunately, there are nuances.


We solved the problem for local variables entered through OAS , but Go initializes the variables for the built-in functions through OAS2 . Because of this, we will need a second change that extends the getConstValue function and slightly modifies the code of the inliner itself, because, among other things, OAS2 does not have a suitable Defn field.


That was bad news. Good news: #gocontributing channel appeared in the Russian-language slak , where you can share your ideas and plans, ask questions, and discuss everything related to participation in the development of Go.



Source: https://habr.com/ru/post/438636/