How to worsen performance, improving it

We wanted the best, but it turned out as always.
Viktor Chernomyrdin
Russian statesman

There are cases in life when you seem to be doing everything right, but something goes wrong.
This story is about one of these cases.

Once I looked at this code and thought about speeding it up:

public String appendBounds(Data data) { int beginIndex = data.beginIndex; int endIndex = data.endIndex; return new StringBuilder() .append('L') .append(data.str, beginIndex, endIndex) .append(';') .toString(); }

First, I wanted to calculate the final string length, use the beginIndex and endIndex (and the fact that, in addition to the stripped string, 2 more characters will be added to the StringBuilder ), and pass this value to the StringBuilder constructor to immediately select the array of the required size . This thought seemed too obvious to me, so I decided to try something else. I was pushed to the right thought by the fact that this code was not highlighted in any way by the “Idea”, although this clever girl usually suggests replacing a short string from StringBuilder::append with string addition, which is shorter and easier to understand.

A hindrance to this simplification is the use of the StringBuilder.append(CharSequence, int, int) method StringBuilder.append(CharSequence, int, int) . Given that the data.str field is a string, using String.substring(beginIndex, endIndex) you can select a substring from it and pass it to the StringBuilder.append(String) method.

Code after conversion:

 public String appendBounds(Data data) { int beginIndex = data.beginIndex; int endIndex = data.endIndex; String subString = data.str.substring(beginIndex, endIndex); return new StringBuilder() .append('L') .append(subString) .append(';') .toString(); }

And now the “Idea” offers a simplification:

 public String appendBounds(Data data) { int beginIndex = data.beginIndex; int endIndex = data.endIndex; return 'L' + data.str.substring(beginIndex, endIndex) + ';'; }

However, our goal in this case is not so much readability as performance. Compare both ways:

 @BenchmarkMode(Mode.AverageTime) @OutputTimeUnit(TimeUnit.NANOSECONDS) @Fork(jvmArgsAppend = {"-Xms2g", "-Xmx2g"}) public class StringBuilderAppendBenchmark { @Benchmark public String appendSubString(Data data) { String latinStr = data.latinStr; String nonLatinStr = data.nonLatinStr; int beginIndex = data.beginIndex; int endIndex = data.endIndex; String substring = data.nonLatin ? nonLatinStr.substring(beginIndex, endIndex) : latinStr.substring(beginIndex, endIndex); return new StringBuilder() .append('L') .append(substring) .append(';') .toString(); } @Benchmark public String appendBounds(Data data) { String latinStr = data.latinStr; String nonLatinStr = data.nonLatinStr; int beginIndex = data.beginIndex; int endIndex = data.endIndex; String appended = data.nonLatin ? nonLatinStr : latinStr; return new StringBuilder() .append('L') .append(appended, beginIndex, endIndex) .append(';') .toString(); } @State(Scope.Thread) public static class Data { String latinStr; String nonLatinStr; @Param({"true", "false"}) boolean nonLatin; @Param({"5", "10", "50", "100", "500", "1000"}) private int length; private int beginIndex; private int endIndex; private ThreadLocalRandom random = ThreadLocalRandom.current(); @Setup public void setup() { latinStr = randomString("abcdefghijklmnopqrstuvwxyz"); nonLatinStr = randomString("абвгдеёжзиклмнопрстуфхцчшщьыъэюя"); beginIndex = 1; endIndex = length + 1; } private String randomString(String alphabet) { char[] chars = alphabet.toCharArray(); StringBuilder sb = new StringBuilder(length + 2); for (int i = 0; i < length + 2; i++) { char c = chars[random.nextInt(chars.length)]; sb.append(c); } return sb.toString(); } } }

The benchmark is as simple as two kopecks: a random string is added to the StringBuilder , the size of which is determined by the length field, and since it is 2019 in the yard, you need to check it as a string containing only the main Latin characters (i.e. corresponds to 1 byte), and a string with non-Latin characters (each character is represented by 2 bytes).

When looking at the surface, the appendSubString method appendSubString to be slower, because the amount of data to be pasted is the same as in the appendBounds method, but the appendSubString method appendSubString has an explicit creation of a substring, i.e., allocating memory to a new object and copying the contents from data.latinStr / data.nonLatinStr .

All the more surprising (but only at first glance) are the results of the measurement that I performed using JDK11 on my home machine (Intel Core i5-4690, 3.50 GHz):

 Benchmark nonLatin length Score Error Units appendBounds true 5 44,6 ± 0,4 ns/op appendBounds true 10 45,7 ± 0,7 ns/op appendBounds true 50 129,0 ± 0,5 ns/op appendBounds true 100 218,7 ± 0,8 ns/op appendBounds true 500 907,1 ± 5,5 ns/op appendBounds true 1000 1626,4 ± 13,0 ns/op appendSubString true 5 28,6 ± 0,2 ns/op appendSubString true 10 30,8 ± 0,2 ns/op appendSubString true 50 65,6 ± 0,4 ns/op appendSubString true 100 106,6 ± 0,6 ns/op appendSubString true 500 430,1 ± 2,4 ns/op appendSubString true 1000 839,1 ± 8,6 ns/op appendBounds:·gc.alloc.rate.norm true 5 184,0 ± 0,0 B/op appendBounds:·gc.alloc.rate.norm true 10 200,0 ± 0,0 B/op appendBounds:·gc.alloc.rate.norm true 50 688,0 ± 0,0 B/op appendBounds:·gc.alloc.rate.norm true 100 1192,0 ± 0,0 B/op appendBounds:·gc.alloc.rate.norm true 500 5192,0 ± 0,0 B/op appendBounds:·gc.alloc.rate.norm true 1000 10200,0 ± 0,0 B/op appendSubString:·gc.alloc.rate.norm true 5 136,0 ± 0,0 B/op appendSubString:·gc.alloc.rate.norm true 10 160,0 ± 0,0 B/op appendSubString:·gc.alloc.rate.norm true 50 360,0 ± 0,0 B/op appendSubString:·gc.alloc.rate.norm true 100 608,0 ± 0,0 B/op appendSubString:·gc.alloc.rate.norm true 500 2608,0 ± 0,0 B/op appendSubString:·gc.alloc.rate.norm true 1000 5104,0 ± 0,0 B/op appendBounds false 5 20,8 ± 0,1 ns/op appendBounds false 10 24,0 ± 0,2 ns/op appendBounds false 50 66,4 ± 0,4 ns/op appendBounds false 100 111,0 ± 0,8 ns/op appendBounds false 500 419,2 ± 2,7 ns/op appendBounds false 1000 840,4 ± 7,8 ns/op appendSubString false 5 25,3 ± 0,3 ns/op appendSubString false 10 25,7 ± 0,2 ns/op appendSubString false 50 36,0 ± 0,1 ns/op appendSubString false 100 52,8 ± 0,4 ns/op appendSubString false 500 206,1 ± 6,1 ns/op appendSubString false 1000 388,1 ± 1,6 ns/op appendBounds:·gc.alloc.rate.norm false 5 80,0 ± 0,0 B/op appendBounds:·gc.alloc.rate.norm false 10 88,0 ± 0,0 B/op appendBounds:·gc.alloc.rate.norm false 50 320,0 ± 0,0 B/op appendBounds:·gc.alloc.rate.norm false 100 544,0 ± 0,0 B/op appendBounds:·gc.alloc.rate.norm false 500 2144,0 ± 0,0 B/op appendBounds:·gc.alloc.rate.norm false 1000 4152,0 ± 0,0 B/op appendSubString:·gc.alloc.rate.norm false 5 96,0 ± 0,0 B/op appendSubString:·gc.alloc.rate.norm false 10 112,0 ± 0,0 B/op appendSubString:·gc.alloc.rate.norm false 50 192,0 ± 0,0 B/op appendSubString:·gc.alloc.rate.norm false 100 288,0 ± 0,0 B/op appendSubString:·gc.alloc.rate.norm false 500 1088,0 ± 0,0 B/op appendSubString:·gc.alloc.rate.norm false 1000 2088,0 ± 0,0 B/op

In the overwhelming majority of cases (including always for non-Latin strings), the appendSubString method, appendSubString refuted our assumption, turned out to be both faster and less voracious (and this despite the fact that String::substring returns a new object). How did it happen?

I look in the book, I see a fig

To lift the veil of secrecy will help the study of the source code StringBuilder . Both methods used pass the call to the same AbstractStringBuilder methods:

 public final class StringBuilder extends AbstractStringBuilder implements java.io.Serializable, Comparable<StringBuilder>, CharSequence { @Override public StringBuilder append(String str) { super.append(str); return this; } @Override public StringBuilder append(CharSequence s, int start, int end) { super.append(s, start, end); return this; } }

Go to AbstractStringBuilder.append(String) :

 public AbstractStringBuilder append(String str) { if (str == null) { return appendNull(); } int len = str.length(); ensureCapacityInternal(count + len); putStringAt(count, str); count += len; return this; } private final void putStringAt(int index, String str) { if (getCoder() != str.coder()) { inflate(); } str.getBytes(value, index, coder); }

What is interesting here? The AbstractStringBuilder::inflate , as the name suggests, expands the AbstractStringBuilder.value array when merging heterogeneous strings. Data transfer occurs in the String::getBytes :

 void getBytes(byte[] dst, int dstBegin, byte coder) { if (coder() == coder) { System.arraycopy(value, 0, dst, dstBegin << coder, value.length); } else { // this.coder == LATIN && coder == UTF16 StringLatin1.inflate(value, 0, dst, dstBegin, value.length); } }

What is important? If the strings are homogeneous, then the intrinsic System::arraycopy is used to move the data, in the opposite case, StringLatin1::inflate , which through delegation leads us to the StringUTF16::inflate :

 // inflatedCopy byte[] -> byte[] @HotSpotIntrinsicCandidate public static void inflate(byte[] src, int srcOff, byte[] dst, int dstOff, int len) { // We need a range check here because 'putChar' has no checks checkBoundsOffCount(dstOff, len, dst); for (int i = 0; i < len; i++) { putChar(dst, dstOff++, src[srcOff++] & 0xff); } } @HotSpotIntrinsicCandidate static void putChar(byte[] val, int index, int c) { assert index >= 0 && index < length(val) : "Trusted caller missed bounds check"; index <<= 1; val[index++] = (byte)(c >> HI_BYTE_SHIFT); val[index] = (byte)(c >> LO_BYTE_SHIFT); }

Thus, if the strings are homogeneous, then the platform-dependent method System::arraycopy is used for data movement, otherwise the cycle (also intrinsicified) is used. This means that when pasting two strings, where all characters belong to the main Latin alphabet (to put it simply, they fit into 1 byte), the performance should be much better than when sticking together heterogeneous strings. The benchmark confirms this (see output for nonLatin = false ).

Now the AbstractStringBuilder.append(CharSequence, int, int) method AbstractStringBuilder.append(CharSequence, int, int) :

 @Override public AbstractStringBuilder append(CharSequence s, int start, int end) { if (s == null) { s = "null"; } checkRange(start, end, s.length()); int len = end - start; ensureCapacityInternal(count + len); appendChars(s, start, end); return this; } private final void appendChars(CharSequence s, int off, int end) { if (isLatin1()) { byte[] val = this.value; for (int i = off, j = count; i < end; i++) { char c = s.charAt(i); if (StringLatin1.canEncode(c)) { val[j++] = (byte)c; } else { count = j; inflate(); StringUTF16.putCharsSB(this.value, j, s, i, end); count += end - i; return; } } } else { StringUTF16.putCharsSB(this.value, count, s, off, end); } count += end - off; }

Here, the approach is similar to the existing one in the previous example: for homogeneous strings a simpler mechanism is used (here, cognitive copying in a loop), for heterogeneous StringUTF16 - a call to StringUTF16 , however note that the StringUTF16::putCharsSB method StringUTF16::putCharsSB not intrinsic.

 public static void putCharsSB(byte[] val, int index, CharSequence s, int off, int end) { checkBoundsBeginEnd(index, index + end - off, val); for (int i = off; i < end; i++) { putChar(val, index++, s.charAt(i)); } }

So, the internal structure of both methods and the reason for their different performance are more or less clear to us. The question naturally arises - what to do with the knowledge gained next? There are several options at once:

1) keep it in your head and change it with your hands when a suspicious code is detected
2) go to Tagir and ask him to file a check that will do the work for us
3) make changes to the JDK so that the code does not change at all.

Of course, we start with the third. Ready to take a chance?

Dive into the abyss

We will train ~~on cats~~ on the source code of the eleventh Java, you can download it here .

The simplest and most obvious way to improve is to select the substring already inside the AbstractStringBuilder.append(CharSequence, int, int) method AbstractStringBuilder.append(CharSequence, int, int) :

 // было public AbstractStringBuilder append(CharSequence s, int start, int end) { if (s == null) { s = "null"; } checkRange(start, end, s.length()); int len = end - start; ensureCapacityInternal(count + len); appendChars(s, start, end); return this; } // стало public AbstractStringBuilder append(CharSequence s, int start, int end) { if (s == null) { s = "null"; } checkRange(start, end, s.length()); return append(s.subSequence(start, end).toString()); }

Now you need to build the JDK, run the tests and run the benchmark StringBuilderAppendBenchmark::appendBounds on it, the results of which need to be compared with the results of the same benchmark on the original JDK:

 # здесь колонка before содержит данные прогона на исходном JDK, # after - на изменённом Benchmark nonLatin length before after Units avgt true 5 44,6 64,4 ns/op avgt true 10 45,7 66,3 ns/op avgt true 50 129,0 168,9 ns/op avgt true 100 218,7 281,9 ns/op avgt true 500 907,1 1116,2 ns/op avgt true 1000 1626,4 2002,5 ns/op gc.alloc.rate.norm true 5 184,0 264,0 B/op gc.alloc.rate.norm true 10 200,0 296,0 B/op gc.alloc.rate.norm true 50 688,0 904,0 B/op gc.alloc.rate.norm true 100 1192,0 1552,0 B/op gc.alloc.rate.norm true 500 5192,0 6752,0 B/op gc.alloc.rate.norm true 1000 10200,0 13256,0 B/op avgt false 5 20,8 38,0 ns/op avgt false 10 24,0 37,8 ns/op avgt false 50 66,4 82,9 ns/op avgt false 100 111,0 138,8 ns/op avgt false 500 419,2 531,9 ns/op avgt false 1000 840,4 1002,7 ns/op gc.alloc.rate.norm false 5 80,0 152,0 B/op gc.alloc.rate.norm false 10 88,0 168,0 B/op gc.alloc.rate.norm false 50 320,0 440,0 B/op gc.alloc.rate.norm false 100 544,0 688,0 B/op gc.alloc.rate.norm false 500 2144,0 2688,0 B/op gc.alloc.rate.norm false 1000 4152,0 5192,0 B/op

What is called, suddenly! Improvements not only did not happen, but worsened. Damn it, but how?

The fact is that at the very beginning, in the description of the StringBuilder::append method StringBuilder::append I made one small but critical omission. The method was described like this:

 public final class StringBuilder { @Override public StringBuilder append(String str) { super.append(str); return this; } }

And here is its full look:

 public final class StringBuilder { @Override @HotSpotIntrinsicCandidate public StringBuilder append(String str) { super.append(str); return this; } }

The Java code, which we considered above, being warmed up and compiled at the C2 level, does not matter, since it is not performed, but the intrinsic. This is easily proved by removing the profile using the async-profiler . Hereinafter, the profile is removed for length = 1000 and nonLatin = true :

 # профиль метода `appendSubString`, JDK без наших изменений ns percent samples top ---------- ------- ------- --- 19096340914 43.57% 1897673 jbyte_disjoint_arraycopy <--------- 13500185356 30.80% 1343343 jshort_disjoint_arraycopy <--------- 4124818581 9.41% 409533 java.lang.String.<init> # создание подстроки 2177311938 4.97% 216375 java.lang.StringUTF16.compress # создание подстроки 1557269661 3.55% 154253 java.util.Arrays.copyOfRange # создание подстроки 349344451 0.80% 34823 appendSubString_avgt_jmhStub 279803769 0.64% 27862 java.lang.StringUTF16.newString 274388920 0.63% 27312 org.openjdk.jmh.infra.Blackhole.consume 160962540 0.37% 15946 SpinPause 122418222 0.28% 11795 __memset_avx2

The StringBuilder code (and AbstractStringBuilder ) doesn't even smell here, almost 3/4 of the profile is occupied by an intrinsic. I would like to observe approximately the same picture in the StringBuilder.append(CharSequence, int, int) profile of the "improved" by us.

In fact, we have this:

  ns percent samples top ---------- ------- ------- --- 19071221451 43.78% 1897827 jbyte_disjoint_arraycopy 6409223440 14.71% 638348 jlong_disjoint_arraycopy 3933622128 9.03% 387403 java.lang.StringUTF16.newBytesFor 2067248311 4.75% 204193 java.lang.AbstractStringBuilder.ensureCapacityInternal 1929218737 4.43% 194751 java.lang.StringUTF16.compress 1678321343 3.85% 166458 java.util.Arrays.copyOfRange 1621470408 3.72% 160849 java.lang.String.checkIndex 969180099 2.22% 96018 java.util.Arrays.copyOf 581600786 1.34% 57818 java.lang.AbstractStringBuilder.<init> 417818533 0.96% 41611 appendBounds_jmhTest 406565329 0.93% 40479 java.lang.String.<init> 340972882 0.78% 33727 java.lang.AbstractStringBuilder.append 299895915 0.69% 29982 java.lang.StringBuilder.toString 183885595 0.42% 18136 SpinPause 168666033 0.39% 16755 org.openjdk.jmh.infra.Blackhole.consume

You say: "Here they are, intrinsiki, at the very top!" Indeed, only this is a little wrong intrinsiki (incl. Compare the name of the second from above). Recall:

 public final class StringBuilder { @Override @HotSpotIntrinsicCandidate public StringBuilder append(String str) { super.append(str); return this; } }

Here the intrinsic replaces the call to StringBuilder.append(String) , but in our patch there is no call! Called AbstractStringBuilder.append(String) . The jbyte_disjoint_arraycopy call we jbyte_disjoint_arraycopy is an intrinsic for StringLatin1::inflate , called from AbstractStringBuider::putStringAt via String::getBytes . That is, unlike StringBuilder::append not only StringBuilder::append platform dependent, but also Java code,

With the cause of the failure figured out, try to succeed otherwise. It is easy to guess that we need to somehow refer to StringBuilder::append . You can do this by tearing off the previous patch and making changes to the StringBuilder itself:

 public final class StringBuilder { // было @Override public StringBuilder append(CharSequence s, int start, int end) { super.append(s, start, end); return this; } // стало @Override public StringBuilder append(CharSequence s, int start, int end) { if (s == null) { s = "null"; } checkRange(start, end, s.length()); return this.append(s.subSequence(start, end).toString()); } }

Now everything is done according to the mind: an intrinsic StringBuilder :: append is called.
Reassemble, run, compare:

 # здесь колонка before содержит данные прогона на исходном JDK, # after - на изменённом Benchmark nonLatin length before after Units avgt true 5 44,6 60,2 ns/op avgt true 10 45,7 59,1 ns/op avgt true 50 129,0 164,6 ns/op avgt true 100 218,7 276,2 ns/op avgt true 500 907,1 1088,8 ns/op avgt true 1000 1626,4 1959,4 ns/op gc.alloc.rate.norm true 5 184,0 264,0 B/op gc.alloc.rate.norm true 10 200,0 296,0 B/op gc.alloc.rate.norm true 50 688,0 904,0 B/op gc.alloc.rate.norm true 100 1192,0 1552,0 B/op gc.alloc.rate.norm true 500 5192,0 6752,0 B/op gc.alloc.rate.norm true 1000 10200,0 13256,0 B/op avgt false 5 20,8 37,9 ns/op avgt false 10 24,0 37,9 ns/op avgt false 50 66,4 80,9 ns/op avgt false 100 111,0 125,6 ns/op avgt false 500 419,2 483,6 ns/op avgt false 1000 840,4 893,8 ns/op gc.alloc.rate.norm false 5 80,0 152,0 B/op gc.alloc.rate.norm false 10 88,0 168,0 B/op gc.alloc.rate.norm false 50 320,0 440,0 B/op gc.alloc.rate.norm false 100 544,0 688,0 B/op gc.alloc.rate.norm false 500 2144,0 2688,0 B/op gc.alloc.rate.norm false 1000 4152,0 5187,2 B/op

I really am very sad, but not better. Now the new profile:

  ns percent samples top ---------- ------- ------- --- 19614374885 44.12% 1953620 jbyte_disjoint_arraycopy 6645299702 14.95% 662146 jlong_disjoint_arraycopy 4065789919 9.15% 400167 java.lang.StringUTF16.newBytesFor 2374627822 5.34% 234746 java.lang.AbstractStringBuilder.ensureCapacityInternal 1837858014 4.13% 183822 java.lang.StringUTF16.compress 1472039604 3.31% 145956 java.util.Arrays.copyOfRange 1316397864 2.96% 130747 appendBounds_jmhTest 956823151 2.15% 94959 java.util.Arrays.copyOf 573091712 1.29% 56933 java.lang.AbstractStringBuilder.<init> 434454076 0.98% 43202 java.lang.String.<init> 368480388 0.83% 36439 java.lang.AbstractStringBuilder.append 304409057 0.68% 30442 java.lang.StringBuilder.toString 272437989 0.61% 26833 SpinPause 201051696 0.45% 19985 java.lang.StringBuilder.<init> 198934052 0.45% 19810 appendBounds_avgt_jmhStub

Little has changed. It remains unclear to me why the intrinsic did not work when accessing StringBuilder.append(String) from within StringBuilder . There is a suspicion that inserting (inline) the body of the StringBuilder.append(String) method into the StringBuilder.append(CharSequence, int, int) body StringBuilder.append(CharSequence, int, int) changes something in the processing of VM calls to methods.

Anyway, it's a fiasco, bro. It was not possible to patch the JDK, but we can still do the manual replacement where it makes sense.

Literary digression about failures

Response encryption came in two days. Navigator does not want to part with the "Oto Velara", a company that builds amazingly fast and powerful warships. The navigator does not want to read the encryption to me. He simply repeats the command response: "No." Encryption does not clarify why no. “No” means in any case that he is a person known to a large computer. If nothing was known about him, the answer would be positive: try. It's a pity. It is a pity to lose such an interesting person. A commander, probably sorry for me. Maybe the first time is a pity. He sees me breaking into the Varygs. And he doesn’t want to push me again in greyhounds.
He is silent. But I know that in providing a wild shortage of workers:
- I, comrade general, work tomorrow in support. Allow me to go?
- Go. - And suddenly smiles. - You know, every cloud has a silver lining.
- I, comrade general, are always bad without good.
- Well, no. You have been forbidden to meet him, this is bad. But to the treasures of our experience, we have added another grain.

Findings:

The code of JDK methods in some cases is not related to the present execution, because instead of the method body, an intrinsic can be executed that is hidden in the bowels of the VM.
such methods can be recognized, in particular, they are indicated by the @HotSpotIntrinsicCandidate label, although some methods are intrinsicized without any hints, for example String::equals (and many, many others ).
The conclusion from the first two is that our reasoning about how JDK code works may contradict reality. C'est la vie

PS
Other possible replacements:

 StringBuilder sb = new StringBuilder(); sb.append(str, 0, endIndex); // --> StringBuilder sb = new StringBuilder(str.substring(o, endIndex));

Pps
The developers of Orakla rightly point out that

It seems to me to create a code path into
sb.append (cs, int, int) that allocates memory
only sometimes makes things run faster. As you observed, the performance
tradeoffs aren't obvious.

Instead, if you want to optimize sb.append (cs, int, int) maybe we should just go
by or behind the intrinsics.

The proposed solution is intrinsification of StringBuilder.append(CharSequence, int, int) .

→ Task
→ Discussion

Pps
Interestingly, at the moment when writing something like

 StringBuilder sb = new StringBuilder(); sb.append(str.substring(0, endIndex));

"Idea" offers to simplify the code to

 StringBuilder sb = new StringBuilder(); sb.append(s, 0, endIndex);

If the performance in this place is not very important to you, then it is probably more correct to use the second, simplified version. Still, most of the code we write for our comrades, not for machines.

How to worsen performance, improving it

I look in the book, I see a fig

Dive into the abyss

More articles: