C / C++ Memory Model (Atomics et al.)
Data race
Two conflicting actions in different threads, at least one of which is not atomic, and neither happens before the other.
data race == undefined behavior.
Conflicting actions: two threads access the same memory location, and at least one of the accesses is a write.
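A minimal sketch of how the definition plays out, assuming two threads that bump one shared counter. The helper name `count_without_race` is made up for illustration. With a plain `int` the increments would conflict and neither would happen before the other, i.e. a data race; making the counter `std::atomic<int>` removes the race.

```cpp
#include <atomic>
#include <thread>

// Hypothetical helper: two threads each add `per_thread` to a shared counter.
// With a plain `int counter` this would be a data race (undefined behavior).
// With std::atomic<int>, every increment is an atomic read-modify-write.
int count_without_race(int per_thread) {
    std::atomic<int> counter{0};
    auto worker = [&counter, per_thread] {
        for (int i = 0; i < per_thread; ++i) {
            // Relaxed is enough here: only atomicity is needed, not ordering.
            counter.fetch_add(1, std::memory_order_relaxed);
        }
    };
    std::thread a(worker);
    std::thread b(worker);
    a.join();
    b.join();
    return counter.load();
}
```

Calling `count_without_race(100000)` should return exactly 200000, since no increment can be lost.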
Register Allocation
Q: Is it always true that a local auto variable gets allocated in memory?
#include <iostream>

void foo()
{
    int bar;
    std::cout << "bar is " << &bar;
}
A: If the address of a variable is taken, the variable had better be stored in a location that has an address. A register doesn’t have an address, so the variable has to be stored in memory whether a register is available or not.
Synchronization Operation
The library defines a number of atomic operations (7.17) and operations on mutexes (7.26.4) that are specially identified as synchronization operations.
These operations play a special role in making assignments in one thread visible to another. A synchronization operation on one or more memory locations is either an acquire operation, a release operation, both an acquire and release operation, or a consume operation.
A synchronization operation without an associated memory location is a fence and can be either an acquire fence, a release fence, or both an acquire and release fence.
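A sketch of what "making assignments in one thread visible to another" can look like in C++, assuming a producer/consumer pair. The function name `message_passing` and the variables are illustrative only. The release store and the acquire load are the synchronization operations; `payload` itself is a plain variable.

```cpp
#include <atomic>
#include <thread>

// Hypothetical helper: hand a plain int from one thread to another.
int message_passing() {
    int payload = 0;                 // ordinary, non-atomic data
    std::atomic<bool> ready{false};  // the synchronization variable
    int seen = -1;

    std::thread producer([&] {
        payload = 42;                                  // plain write
        ready.store(true, std::memory_order_release);  // release operation
    });
    std::thread consumer([&] {
        // Acquire operation: once it reads `true`, everything written
        // before the release store is visible to this thread.
        while (!ready.load(std::memory_order_acquire)) {
        }
        seen = payload;                                // plain read, no race
    });

    producer.join();
    consumer.join();
    return seen;
}
```

The consumer can only read `payload` after it has observed `ready == true`, so the function should return 42.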
Atomics
Atomics are a way to constrain compiler optimizations and the reordering of memory accesses (by both the compiler and the CPU).
Acquire and Release
- The compiler can only see the memory accesses of a single thread.
- How to think about atomics: acquire and release are one-way barriers.
A; // A can move after atomic_read();
// move down: OK
atomic_read(); // <=> lock.acquire() (acquire exclusion)
// move up: BLOCKED
B; // B cannot move before atomic_read() or after atomic_write()
// move down: BLOCKED
atomic_write(); // <=> lock.release() (release exclusion)
// move up: OK
C; // C can move before atomic_write();
The compiler:
- cannot move B outside of the region.
- can move A and C inside the region.
- cannot move C before the region: it cannot move up across an acquire.
- cannot move A after the region: it cannot move down across a release.
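The lock analogy above can be sketched as a tiny spinlock, assuming `std::atomic_flag` as the underlying atomic. The class `SpinLock` and the function `count_under_spinlock` are made-up names. The acquire in `lock()` keeps the protected code from moving up, and the release in `unlock()` keeps it from moving down.

```cpp
#include <atomic>
#include <thread>

// Hypothetical spinlock: acquire on entry, release on exit.
class SpinLock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;

public:
    void lock() {
        // Acquire: later accesses cannot be moved before this point.
        while (flag.test_and_set(std::memory_order_acquire)) {
        }
    }
    void unlock() {
        // Release: earlier accesses cannot be moved after this point.
        flag.clear(std::memory_order_release);
    }
};

// Two threads increment a plain int; the increment plays the role of B.
int count_under_spinlock(int per_thread) {
    SpinLock lock;
    int counter = 0;
    auto worker = [&] {
        for (int i = 0; i < per_thread; ++i) {
            lock.lock();
            ++counter;  // B: stays between acquire and release
            lock.unlock();
        }
    };
    std::thread a(worker);
    std::thread b(worker);
    a.join();
    b.join();
    return counter;
}
```

Because B cannot leave the region, `count_under_spinlock(50000)` should return 100000 even though `counter` is not atomic.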
In addition, in the sequentially consistent model, a release cannot be reordered with a following acquire, e.g., this:
atomic_write();
atomic_read();
cannot be reordered into:
atomic_read();
atomic_write();
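One way to see why this matters is the classic "store buffering" test, sketched here with made-up names (`count_forbidden_outcomes`, `r1`, `r2`). Each thread writes one variable and then reads the other. If the write and the read could swap, both threads could read 0. With the default sequentially consistent operations that outcome is not allowed.

```cpp
#include <atomic>
#include <thread>

// Hypothetical litmus test: returns how many rounds saw r1 == 0 && r2 == 0.
int count_forbidden_outcomes(int rounds) {
    int forbidden = 0;
    for (int i = 0; i < rounds; ++i) {
        std::atomic<int> x{0};
        std::atomic<int> y{0};
        int r1 = -1;
        int r2 = -1;

        // Default memory order for store() and load() is seq_cst.
        std::thread t1([&] {
            x.store(1);     // atomic write
            r1 = y.load();  // atomic read, must not move above the write
        });
        std::thread t2([&] {
            y.store(1);
            r2 = x.load();
        });
        t1.join();
        t2.join();

        if (r1 == 0 && r2 == 0) {
            ++forbidden;  // would require a write/read reordering
        }
    }
    return forbidden;
}
```

With weaker orderings (e.g. release for the store and acquire for the load) the standard permits the "both zero" outcome; with seq_cst the function should always return 0.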
MV: So apparently an atomic read prevents load fusing:
E.g. given:
read variable
if set
    foo
read variable
if set
    bar
the two loads can be fused into:
register = variable
if register set
    foo
    bar
With atomic read:
// nothing can move up
vvv atomic_read vvv
if set
foo
// nothing can move up
vvv atomic_read vvv
if set
bar
At most, bar can be executed before foo.
std::atomic
The default memory order is sequentially consistent (std::memory_order_seq_cst). Each individual write of an atomic is:
- guaranteed to be all or nothing
- guaranteed to be executed in order
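A sketch of the "all or nothing" guarantee, assuming a 64-bit value that a writer flips between all zeros and all ones. The function name `count_torn_reads` is illustrative. A reader should never observe a half-written mix of the two patterns.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>

// Hypothetical check: counts reads that saw a partially written value.
int count_torn_reads(int writes) {
    constexpr std::uint64_t kAllOnes = ~std::uint64_t{0};
    std::atomic<std::uint64_t> value{0};
    std::atomic<bool> done{false};
    int torn = 0;

    std::thread writer([&] {
        for (int i = 0; i < writes; ++i) {
            // Plain assignment is value.store(..., std::memory_order_seq_cst).
            value = (i % 2 == 0) ? kAllOnes : 0;
        }
        done = true;
    });
    std::thread reader([&] {
        while (!done) {
            // Implicit conversion is value.load(std::memory_order_seq_cst).
            std::uint64_t v = value;
            if (v != 0 && v != kAllOnes) {
                ++torn;  // a mix of old and new bytes
            }
        }
    });

    writer.join();
    reader.join();
    return torn;
}
```

`count_torn_reads(100000)` should return 0: each load sees either the old or the new value in full.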
LLVM on atomics
Folding a load: Any atomic load from a constant global can be constant-folded, because it cannot be observed. Similar reasoning allows SROA (scalar replacement of aggregates) with atomic loads and stores.
std::atomic_ref
It is a class template that provides atomic operations on a non-atomic variable: it applies atomic operations in place to an object that was not declared atomic.
C++20 introduced std::atomic_ref to allow atomic access to any kind of data, even when not declared atomic. For the lifetime of the std::atomic_ref object, the object it references is considered an atomic object.
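A minimal sketch of std::atomic_ref on a plain `int`, assuming a C++20 standard library. The helper name `count_with_atomic_ref` is made up. The mutex branch is only a fallback added here so the sketch still compiles on toolchains without std::atomic_ref; it is not part of the technique.

```cpp
#include <atomic>
#include <mutex>
#include <thread>

// Hypothetical helper: two threads increment an ordinary int.
int count_with_atomic_ref(int per_thread) {
    int counter = 0;  // not declared atomic

#if defined(__cpp_lib_atomic_ref)
    auto worker = [&counter, per_thread] {
        // While any atomic_ref to `counter` exists, all accesses to it
        // must go through an atomic_ref.
        std::atomic_ref<int> ref(counter);
        for (int i = 0; i < per_thread; ++i) {
            ref.fetch_add(1, std::memory_order_relaxed);
        }
    };
#else
    // Fallback for pre-C++20 libraries (an assumption of this sketch).
    std::mutex m;
    auto worker = [&counter, &m, per_thread] {
        for (int i = 0; i < per_thread; ++i) {
            std::lock_guard<std::mutex> guard(m);
            ++counter;
        }
    };
#endif

    std::thread a(worker);
    std::thread b(worker);
    a.join();
    b.join();
    // Both threads are joined and no atomic_ref is alive any more,
    // so a plain read is fine here.
    return counter;
}
```

`count_with_atomic_ref(100000)` should return 200000.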
Fences
Explicit barriers against reordering.
CONS:
- Not portable across OSes and architectures.
- You need to write the right one each time you use a shared variable.
- They worsen performance: a fence does not affect only one shared variable, it constrains every memory access around it.
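For comparison, a sketch using the standard library's own fence, std::atomic_thread_fence, which is specified by the C++ standard rather than by a particular OS or CPU. The function name `message_passing_with_fences` is made up. Note how the fences are not tied to `ready`: they order all accesses around them.

```cpp
#include <atomic>
#include <thread>

// Hypothetical helper: message passing with explicit fences.
int message_passing_with_fences() {
    int payload = 0;
    std::atomic<bool> ready{false};
    int seen = -1;

    std::thread producer([&] {
        payload = 42;
        // Release fence: earlier writes cannot move below it.
        std::atomic_thread_fence(std::memory_order_release);
        ready.store(true, std::memory_order_relaxed);
    });
    std::thread consumer([&] {
        while (!ready.load(std::memory_order_relaxed)) {
        }
        // Acquire fence: later reads cannot move above it.
        std::atomic_thread_fence(std::memory_order_acquire);
        seen = payload;
    });

    producer.join();
    consumer.join();
    return seen;
}
```

The function should return 42, as with a release store paired with an acquire load on `ready`.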
Misc 2
The compiler can optimize only when it has visibility, e.g. it has visibility of what happens inside a function.
The compiler has to assume that any opaque function call is a full barrier, so it cannot move things around it.
Opaque function: a function the compiler has no prior information about. This implies that the compiler can make no assumptions about the side effects of the function call.
Tests with atomic
Atomic alone is not magic: it is not enough to force the compiler to do something.
E.g.
std::atomic<int> foo;
foo = 1;
foo = 2;
foo = 3;
while (foo) {
    sleep(1);
}
This gets the hell optimized out of it, exactly as if foo were a plain int:
0x100003f80 <main()+16>: mov $0x1,%edi
0x100003f85 <main()+21>: call 0x100003f8c
0x100003f8a <main()+26>: jmp 0x100003f80 <main()+16>
This was observed with clang-12 -O3.
By using volatile:
static inline void loop(volatile int& bar)
{
    // Read through the volatile reference, not through the outer variable.
    while (bar) {
        sleep(1);
    }
}
int foo;
foo = 1;
foo = 2;
foo = 3;
loop(foo);
you force the compiler to always do a memory access (the three stores get folded into a single store of immediate 3).
If we insert a library call to an opaque function before the loop (even if we know the call takes the argument as read-only, and APPARENTLY even if the function argument is const int*) then things start to change:
int foo;
foo = 1;
foo = 2;
printf("foo: %p\n", &foo);
//...
- if the loop is a sleep(1), sleep can be considered as another opaque function, so it behaves again as a barrier and there is no folding:
//...
while (foo) {
    sleep(1);
}
- if the loop is just incrementing a local counter:
  - plain int makes the compiler fold and optimize out loads.
```c
//...
int ctr = 0;
while (foo) { ctr++; }
printf("ctr is %d\n", ctr);
```
  - declaring foo as std::atomic acts as a barrier and forces loads.
```c
//...
int ctr = 0;
while (foo) { ctr++; }
printf("ctr is %d\n", ctr);
```
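The loops above run in a single thread, so nothing can ever change foo while they spin. A sketch of the case where the forced loads are needed, assuming a second thread that clears foo after a short delay. The function name `spin_until_cleared`, the `stopper` thread, and the 10 ms delay are all made up for illustration.

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// Hypothetical helper: spin on an atomic until another thread clears it.
long spin_until_cleared() {
    std::atomic<int> foo{-1};

    // The second thread is the only way the loop below can ever end.
    std::thread stopper([&foo] {
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
        foo = 0;
    });

    long ctr = 0;
    while (foo) {  // each iteration performs an atomic load of foo
        ctr++;
    }

    stopper.join();
    return ctr;
}
```

The function returns once the store from `stopper` becomes visible to the loop. With a plain `int` this would be a data race, and the compiler would be free to hoist the load out of the loop.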
Atomics start to make sense when, e.g., the variable is made static instead of local. Now, using atomic instead of plain int prevents the compiler from optimizing out the loads.
static std::atomic<int> foo;
int main() {
    foo = -1;
    int ctr = 0;
    while (foo) {
        ctr++;
    }