There is true and false sharing, by which processors exchange cache lines. How can a visibility problem exist at all?

If sharing lets cores see the contents of each other's caches, then what problem does volatile actually solve?

Or, to rephrase: what shortcomings of the sharing mechanisms fail to prevent stale data from leaking through in a race?

Perhaps sharing works, but only for those processors that already have the variable in their cache, while newly arriving threads may read a value from main memory that is no longer current, because the threads that have long been working with that variable changed it after it was last flushed to memory?

That is, in the interval between the first read of a variable from memory and its first modification, sharing does not work? (just a hypothesis).

Update: "if the cache coherency protocol requires processor caches to keep a memory cell in a consistent state, then why do we need volatile, which does the same thing?"

  • 3
    I think the problem is that false sharing itself is not guaranteed: the variables may well sit on different cache lines. - VladD
  • And what about your example with counters? Since volatile is not enough for a safe increment, that may be the problem. - VladD
  • @VladD I don't have a specific example, just a pile of similar ones where adding volatile makes everything work; the theoretical question is exactly which shortcomings of the sharing mechanisms volatile turns out to cure. By "sitting on different cache lines", do you mean that a variable does not fit into one cache line, or something else? - Pavel
  • 1
    @Pavel: Could you put together an example where there is a leak without volatile? Because to me the code looks fine. - VladD
  • 2
    I like this question, and read the answer with interest. So I'll keep the bounty running - a little extra attention and rep, and maybe another answer. - Nick Volynkin

3 Answers

False Sharing

False sharing is a term for the mechanism of undesirable performance degradation that occurs when different threads modify independent variables that happen to lie on the same cache line.

Read the wonderful article on Habr, where these mechanisms are described. In short, here is a quote:

In this case, if one of the threads modifies a field of its structure, the entire cache line is declared invalid for the remaining processor cores in accordance with the cache coherency protocol. The other thread can no longer use its structure, even though it already sits in the L1 cache of its core. On old P4 processors such a situation required a long synchronization with main memory: the modified data would be written out to main memory and then read back into the L1 cache of the other core.
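A minimal sketch of the scenario from the quote (class and field names are mine, not from the answer): two threads update independent volatile fields that, on most JVMs, land on the same cache line, so every write by one thread invalidates the line in the other core's L1 cache. Field layout is not guaranteed by the specification, so this is only a likely layout, not a certain one.

```java
public class FalseSharingDemo {
    // Adjacent fields of one class: likely (but not guaranteed!) to share
    // a 64-byte cache line, producing false sharing under contention.
    static volatile long a = 0;
    static volatile long b = 0;

    static void run() throws InterruptedException {
        Thread t1 = new Thread(() -> { for (int i = 0; i < 1_000_000; i++) a++; });
        Thread t2 = new Thread(() -> { for (int i = 0; i < 1_000_000; i++) b++; });
        t1.start(); t2.start();
        t1.join();  t2.join();
        // Each field has exactly one writer, so the *result* is correct;
        // false sharing only hurts throughput, never correctness.
        System.out.println(a + " " + b); // prints "1000000 1000000"
    }

    public static void main(String[] args) throws InterruptedException {
        run();
    }
}
```

Note that the values come out right: false sharing is purely a performance problem, which is exactly the distinction the answer below draws against volatile.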

Volatile

The volatile modifier is a Java keyword introduced into the language to support safe multi-threaded programming. It imposes some additional conditions on reads and writes of the variable. It is important to understand three things about volatile variables:

  1. Read and write operations on a volatile variable are atomic.

  2. The result of a write to a volatile variable by one thread becomes visible to all other threads that subsequently read it.

  3. The volatile keyword prohibits certain optimizations/reorderings in the processor and/or compiler.

That is, comparing these concepts is not entirely correct: volatile is a keyword for writing safe multi-threaded programs, while false sharing is a term describing performance degradation.

Watch the wonderful lectures by Alexey Shipilev on the Java memory model (and more), where he lays everything out clearly.

If you have questions, I can try to elaborate by updating my answer.

UPD: Answers to the questions below.

And where is it written in the spec that volatile guarantees us atomic operations?

Links: the essentials from the Oracle specification

Isn't that why we use synchronized, to prevent the consequences of non-atomicity? If volatile guaranteed atomicity, would it alone solve all the problems, or not?

It is important to understand that there are two aspects of thread safety: (1) execution control and (2) memory visibility. The first is about controlling how code executes (including the order of instructions) and allowing or disallowing certain blocks of the program to run concurrently. The second is about which memory actions are visible to other threads. Because each processor has several cache levels between itself and shared memory, threads running on different cores can see "different memory" at the same moment due to the processor-local caches.

Synchronized

Using synchronized prevents another thread from acquiring the monitor (or lock) on the same object, thus preventing concurrent execution of the code enclosed in the synchronized block. It is also important to note that synchronization creates a so-called happens-before relationship. This relationship lets the thread that acquires the monitor "see" all changes made by another thread before that thread released the monitor. In practice this corresponds (as a rough approximation) to the processor refreshing its caches when the monitor is acquired and flushing writes to memory when it is released. These operations are relatively expensive.

Volatile

Using volatile makes operations on the variable go through main memory, "bypassing" the processor's cache. This is useful when we need the variable to be visible across threads but do not need compound operations on it to be atomic. Also, on 32-bit JVMs, writes to long and double become atomic when the variable is declared volatile. The JSR-133 specification (Java 5) strengthened volatile semantics: it added visibility rules and rules prohibiting certain compiler/JVM optimizations.
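A sketch of the long/double point above (class and field names are illustrative): on a 32-bit JVM a plain long write may be torn into two 32-bit halves, while volatile makes it a single atomic write that is also published to other threads.

```java
public class Heartbeat {
    // Without volatile, a reader on a 32-bit JVM could observe a "torn"
    // value: half from the old write, half from the new one.
    private volatile long lastSeenMillis; // atomic reads/writes even on 32-bit JVMs

    void beat(long now) { lastSeenMillis = now; }  // one atomic write
    long lastSeen()     { return lastSeenMillis; } // never a torn value
}
```

On a 64-bit JVM plain long writes are atomic in practice anyway, but only volatile makes that a guarantee of the language rather than an accident of the platform.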

Examples

Volatile - will help

Suppose we have some immutable object whose reference is available to many threads, which constantly use it in their calculations. volatile fits this situation perfectly: we need the other threads to start using the new object as soon as it is published (by this I mean changing the reference from the existing object to a new, fully configured one). At the same time, we do not need any special synchronization or cache flushing for this update.
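The pattern just described might look like this (class and field names are mine, for illustration): one writer swaps a volatile reference to a fully constructed immutable object, and readers always see either the complete old object or the complete new one.

```java
// An immutable object: all fields final, fully initialized in the constructor.
final class Config {
    final int timeoutMs;
    Config(int timeoutMs) { this.timeoutMs = timeoutMs; }
}

public class ConfigHolder {
    // volatile guarantees that once a reader sees the new reference,
    // it also sees the fully initialized object behind it.
    private volatile Config current = new Config(1000);

    void update(Config fresh) { current = fresh; } // single atomic publish
    Config get()              { return current; }  // always a complete object
}
```

All mutation happens by replacing the object, never by mutating it, which is what keeps the single volatile write sufficient.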

Volatile - does not help

Take the usual counter:

 volatile int counter = 0;

 public void update() {
     counter++; // or: counter = counter + 1;
 }

The increment operation is not atomic; it consists of three steps: read, increment, write. In this example the following situation can occur:

  • Thread 1: enters the method and reads the value "0";
  • Thread 1: increments the value to "1";
  • Execution switches to the second thread;
  • Thread 2: reads the value "0";
  • Thread 2: increments the value to "1";
  • Thread 2: writes "1" to counter;
  • Execution switches back to the first thread;
  • Thread 1: writes "1" to counter.

As a result, instead of "2", the value "1" ends up in the counter. Here synchronizing the update() method, or using AtomicInteger and the like, will help; that, however, is beyond the scope of this question.
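To make the fix mentioned above concrete, here is a sketch with AtomicInteger: the read-increment-write becomes a single atomic step (implemented with a hardware compare-and-swap), so no updates are lost.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class SafeCounter {
    private final AtomicInteger counter = new AtomicInteger(0);

    public void update() { counter.incrementAndGet(); } // atomic ++, no lost updates
    public int value()   { return counter.get(); }

    public static void main(String[] args) throws InterruptedException {
        SafeCounter c = new SafeCounter();
        // Two threads race on the same counter, as in the example above.
        Runnable work = () -> { for (int i = 0; i < 10_000; i++) c.update(); };
        Thread t1 = new Thread(work), t2 = new Thread(work);
        t1.start(); t2.start();
        t1.join();  t2.join();
        System.out.println(c.value()); // prints 20000
    }
}
```

With a plain volatile int the same race would typically print something less than 20000.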

Summing up all of the above: volatile variables are appropriate when all operations on the object are "atomic", as in the first example (the reference to a fully constructed object is swapped, writes come from one single thread), and there is no contention over the object's state.

  • And where is it written in the spec that volatile guarantees us atomic operations? Isn't that why we use synchronized, to prevent the consequences of non-atomicity? If volatile guaranteed atomicity, would it alone solve all the problems, or not? - Pavel
  • Updated the answer. In the evening I will try to give a code example. - damintsew
  • 1. What is the point of making a variable volatile if it is immutable? Let the threads cache it - what problems can arise here? Final primitives are already thread-safe, so why limit performance by adding volatile and forbidding threads to cache a constant? Caches are faster to reach than main memory, aren't they? Or maybe I misunderstood what you meant by an "immutable object"... - Pavel
  • 2. If you add synchronization in the second example, it works without volatile. So it turns out volatile is not needed at all as far as visibility leaks go, although it is quite useful for treating the performance drop from false sharing. Is that right? Or am I confused? - Pavel
  • 1. The simplest example I can give: imagine this shared object is a set of constants for converting one currency to another, something like volatile const = {final rateUsd, final rateEur, ...}. The constants themselves are final, but the reference to the object is not. If I replace the object, some threads might use the reference to the new object and others the old one. With volatile, the JMM guarantees that all threads will see the new value on their next read. - damintsew

I'll write a couple of clarifications.

As far as I understand, the question itself actually sounds a bit different: "if the cache coherency protocol obliges the processor caches to keep a memory cell in a consistent state, then why do we need volatile, which does the same thing?"

First, these are two different levels. The JLS operates inside the JVM; a cache coherence protocol exists only in a specific processor architecture. Cache coherence does not have to exist on the architecture the JVM is compiled for and runs on, so the JLS turns an optional feature into a mandatory one (in fact, the guarantee is slightly more than just coherence - see below). I am fairly sure that 99% of multi-core processors today have such a protocol, but Java cannot rely on something that is not guaranteed: all Java applications are supposed to run the same on all architectures (except where they interact with the OS, where behavior may legitimately differ). So the JLS was practically obliged to introduce such a concept, even if most systems provide it out of the box, because even a JVM implemented on top of, say, Python still has to execute code the same way as on any other system.

Secondly, if we take the definition from Wikipedia:

(free translation) a multiprocessor cache is coherent if all writes to the same memory location are performed in some sequential order

here it is worth paying attention to "memory location". Java has data types that can occupy more than one machine word that the processor operates on - at the very least, on a 32-bit system double and long each take up two words. If I understand correctly, on such a system the following situation can arise:

 cache line 1: <other data><upper or lower 32 bits of the double>
 cache line 2: <rest of the double><other data>

In this case the processor, even under strict cache coherence, is entitled to update exactly half of the double, and threads are then entitled to see garbage instead of the real value. volatile forbids this situation by guaranteeing atomicity of writes to any variable.

Thirdly, besides the purely "hardware" problems, the compiler (indirectly) participates in code execution. I do not know how applicable this is to modern Java, but an aggressive compiler is entitled to apply the following optimization:

 boolean flag = true;
 while (flag) {
     doProcessing();
 }

 // hmm, flag is not marked volatile, so the programmer assumes it can only be updated locally
 // let me cache it in a processor register, that will be faster
 eax = load(flag);
 while (eax) {
     doProcessing();
 }

Here the register will never be updated, and this has nothing to do with the cache coherence protocol. Again, I do not know how existing Java compilers actually behave, but this particular example is given in the JLS as unsafe.
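A concrete Java version of that loop (class and method names are mine): with volatile the JIT is not allowed to hoist the read of the flag into a register, so the worker thread is guaranteed to eventually see stop().

```java
public class StopFlag {
    // Drop volatile here and a JIT-compiled runLoop() may legally spin forever,
    // exactly as in the register-caching transformation above.
    private volatile boolean running = true;

    void runLoop() {
        while (running) {
            Thread.onSpinWait(); // stand-in for doProcessing() (Java 9+)
        }
    }

    void stop() { running = false; } // volatile write: guaranteed to become visible
}
```

Whether the non-volatile version actually hangs depends on the JIT and the platform, which is precisely why the JLS treats it as unsafe rather than as "works on my machine".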

And finally, volatile semantics affect the program execution order. The JLS requires the following conditions:

  • All actions within a single thread have a happens-before relationship with each other - i.e., the result of an earlier action in the code is always visible to a later action.
  • All actions on a volatile variable have a happens-before relationship with each other - if someone writes a value to a volatile field, no subsequent read is allowed to see the outdated value.
  • The happens-before relation is transitive: if A happens-before B and B happens-before C, then A happens-before C - so C will see all changes made by A.

The compiler, the JVM and the processor are allowed to move expressions around as they please, as long as these conditions hold. If you take the following code

 int result = 0;
 boolean done = false;
 ....
 this.result = 1;
 this.done = true;

then it has every right to turn it into

 this.done = true;
 this.result = 1;

because all subsequent expressions in the same thread will still see the same result. In this example, another thread that has seen done == true can still read 0 from result. However, if done is declared volatile, the write to result must occur before the write of true to done, and any read that observes true must occur after it - and thus you can guarantee the visibility of the changes to listening threads. This does not rule out more than one write to result happening in the meantime; it only guarantees that by the time done is read as true, result holds the corresponding or a later value.
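The fixed version of the snippet above, assembled into a small class for illustration: the volatile write to done "releases" the plain write to result, and the volatile read "acquires" it.

```java
public class Publisher {
    int result = 0;                // plain field
    volatile boolean done = false; // publication flag

    void publish() {
        this.result = 1;  // ordinary write...
        this.done = true; // ...which may not be reordered past this volatile write
    }

    int awaitResult() {
        while (!done) {           // volatile read "acquires" the writes above
            Thread.onSpinWait();
        }
        return result;            // guaranteed to be 1 here
    }
}
```

Remove volatile from done and awaitResult() is allowed to return 0, or even never to terminate, by the reorderings just described.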

Update

In addition to all of the above, there is another amusing case. Java frankly suffers on large heaps - more precisely, from GC pause times on such heaps. Naturally, people try to fight this with garbage collectors that run concurrently with the application. One tactic here is evacuating live objects from the region being cleaned, so that it can then simply be declared free for reuse. During this, two copies of an object (one at the old address and the evacuated one) can live in the JVM at the same time, which would seem to require synchronizing reads and writes. Fortunately for the implementers, the JMM promises nothing for plain (non-volatile) reads, so most operations can skip that synchronization, and at some point all writes may go to one copy while reads come from the other, until access is synchronized. This, like all the examples above, is in full agreement with cache coherence, yet it allows anomalies while the application runs (and for the same reasons: cache coherence works at the level of individual memory blocks, the JVM at the level of objects and fields). This paragraph refers to the Shenandoah GC expected in Java 10, but similar ways to shoot yourself in the foot can easily be expected in other situations.

  • The abbreviation JLS bursts into the text rather suddenly. Could you expand and define it? - Nick Volynkin
  • @NickVolynkin Java Language Specification, a set of rules and conditions by which a Java program should work - etki
  • @etki Thank you for the answer - your more precise formulation of the question is exactly what I was trying to ask. You write "the register will never be updated"; could you expand a little on the concept of a register? It is a completely new detail in the argument: I thought there were just several cache levels and that was it, and now a "register" appears. What kind of beast is it? - Pavel
  • 1
    @Pavel the processor has its own data storage area needed for operating on data - the registers. Roughly speaking, the processor has to store intermediate data somewhere and accumulate the results of calculations, and it cannot use RAM for this, because that would slow computation down by orders of magnitude. So each core has its own tiny storage, represented by registers, each of which holds one processor word. The simplest example of register use is storing the argument and/or the result of an add operation. - etki
  • 1
    @Pavel not quite: registers can be used by the programmer at their discretion, and only things explicitly written/requested by instructions travel to and from memory. Some processor instructions, including the addition above, read/write strictly specified registers. - etki

What was written above about false sharing is quite correct.

In general, there is a simple rule: if there is one writer and many readers, volatile fits perfectly.

If there are many writers, atomic operations or other synchronization primitives are needed.
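The one-writer/many-readers rule as code (class and method names are illustrative): a single worker thread advances a volatile counter, and any number of threads may read it without locks or atomics.

```java
public class Progress {
    private volatile long processed = 0; // written by exactly one thread

    void advance()  { processed++; }       // safe ONLY because there is one writer
    long snapshot() { return processed; }  // safe from any reader thread
}
```

The moment a second writer appears, processed++ becomes the lost-update race from the counter example above, and AtomicLong or synchronized is needed instead.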

  • 2
    "There is a simple rule: if there is one writer and many readers, volatile fits perfectly" - well, that is the case if you have a primitive data structure, like a single flag. If you have two volatile flags, things already go bad. - VladD
  • Upvoted - you are right, of course :) The same applies to atomic operations. - Akzhan Abdulin
  • @AkzhanAbdulin Thank you for the interesting comment. But I would like to back VladD here: yes, with primitives it is still easier; the real hell begins with complex objects with a deep tree structure... - Pavel