Can x86 reorder a narrow store with a wider load that fully contains it?

#1
[Intel® 64 and IA-32 Architectures Software Developer’s Manual][1] says:

> **8.2.3.4 Loads May Be Reordered with Earlier Stores to Different Locations**
> The Intel-64 memory-ordering model allows a load to be reordered with an earlier store to a different location. However, loads are not reordered with stores to the same location.

What about loads that partially or fully overlap previous stores, but don't have the same start address? (See the end of this post for a specific case)

---

Suppose the following C-like code:

    // lock - pointer to an aligned int64 variable
    // threadNum - integer in the range 0..7
    // volatiles here just to show direct r/w of the memory as it was suggested in the comments
    int TryLock(volatile INT64* lock, INT64 threadNum)
    {
        if (0 != *lock)
            return 0; // another thread already had the lock

        ((volatile INT8*)lock)[threadNum] = 1; // take the lock by setting our byte

        if (1LL << 8*threadNum != *lock)
        {   // another thread set its byte between our 1st and 2nd check. unset ours
            ((volatile INT8*)lock)[threadNum] = 0;
            return 0;
        }

        return 1;
    }

Or its x64 asm equivalent:

    ; rcx - address of an aligned int64 variable
    ; rdx - integer in the range 0..7
    TryLock PROC
        cmp  qword ptr [rcx], 0
        jne  @fail

        mov  r8, rdx
        mov  rax, 8
        mul  rdx

        mov  byte ptr [rcx+r8], 1

        bts  rdx, rax
        cmp  qword ptr [rcx], rdx
        jz   @success

        mov  byte ptr [rcx+r8], 0

    @fail:
        mov  rax, 0
        ret

    @success:
        mov  rax, 1
        ret

---

Then suppose that TryLock is concurrently executed in two threads:

    INT64 lock = 0;

    void Thread_1() { TryLock(&lock, 1); }
    void Thread_5() { TryLock(&lock, 5); }

### The question:

The `((INT8*)lock)[1] = 1;` and `((INT8*)lock)[5] = 1;` stores aren't to the same location as the 64-bit load of `lock`. However, they are each fully contained by that load, so does that "count" as the same location? It seems implausible that a CPU would reorder a store with a load that overlaps it, but the manual's wording doesn't obviously cover this case.

What about `((INT8*)lock)[0] = 1`? The address of the store is then the same as the address of the following load. Are these operations "to the same location", even if the earlier case wasn't?

P.S. Please note that the question isn't about the C/asm code; it's about the behaviour of x86 CPUs.

[1]:

#2
Can `mov byte [rcx+r8], 1` reorder with the `cmp qword [rcx], rdx` load that follows it? This is the `lock[threadNum]=1` store and the following load to make sure nobody else wrote a byte.

The load must return data that includes the store, because the executing thread always observes its own actions to happen in program order. (This is true even on weakly-ordered ISAs).
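
As a trivial single-threaded illustration (my own example, not from the original answer): a wider reload is guaranteed to see the thread's own earlier byte store, regardless of what other cores are doing.

    #include <assert.h>
    #include <stdint.h>

    int main(void)
    {
        volatile uint64_t word = 0;
        ((volatile uint8_t *)&word)[3] = 1;   // narrow store into byte 3
        uint64_t v = word;                    // wider reload by the same thread
        assert(v & (1ULL << 24));             // x86 is little-endian: byte 3 is bits 24..31,
                                              // and our own store is always visible to us
        return 0;
    }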

---

**It turns out this exact locking idea has been proposed before (for the Linux kernel), and [Linus Torvalds explained that x86 really does allow this kind of reordering][1]**

Despite the term ["store-forwarding failure or stall"][2], it doesn't mean the data has to commit to cache before the load can read it. It actually can be read from the store buffer while the cache line is still in S state ([MESI][3]). (And on in-order Atom cores, you don't even get a store-forwarding stall at all.)

Real hardware does work this way (as Alex's tests show): the CPU will merge data from L1D with data from the store buffer, without committing the store to L1D.

This by itself isn't reordering *yet*<sup>1</sup> (the load sees the store's data, and they're adjacent in the global order), but it leaves the door open for reordering. The cache line can be invalidated by another core after the load, but before the store commits. A store from another core can become globally visible after our load, but before our store.

So the load includes data from our own store, but not from the other store from another CPU. The other CPU can see the same effect for its load, and thus both threads enter the critical section.

---

<sup>1</sup>
(This is the point I was making [in comments on Alex's answer][4]. If x86 didn't allow this reordering, CPUs could still do the store-forwarding speculatively before the store becomes globally visible, and shoot it down if another CPU invalidated the cache line before the store committed. That part of Alex's answer didn't prove that x86 worked the way it does. Only experimental testing and careful reasoning about the locking algo gave us that.)

If x86 did disallow this reordering, a store/partially-overlapping-reload pair would work like an MFENCE: earlier stores couldn't become globally visible after our store, our store couldn't become globally visible after the reload, and the reload would have to become globally visible before any following loads or stores. Nothing could cross the pair in either direction.

Given this reasoning, it's not totally obvious why an exactly-overlapping store/reload (same address and size) isn't equivalent to an MFENCE as well. Perhaps it actually is, and x86 only manages to make spill/reload or arg-passing on the stack fast with speculative execution!

---

### The locking scheme:

It looks like `TryLock` can fail for both/all callers: They all see it initially zero, they all write their byte, then they all see at least two non-zero bytes each. This is not ideal for heavily-contended locks, compared to using a `lock`ed instruction. There is a hardware arbitration mechanism to handle conflicting `lock`ed insns. (TODO: find the Intel forum post where an Intel engineer posted this in response to another software retry loop vs. `lock`ed instruction topic, IIRC.)
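
For comparison, here is a minimal sketch (my own code, not from this answer) of the `lock`ed-instruction alternative, using the GCC/Clang `__atomic` builtins: a single atomic RMW replaces the byte store + wide reload, so hardware arbitration picks exactly one winner and there is no StoreLoad window to worry about.

    #include <stdint.h>

    // Hypothetical locked-RMW version of TryLock: same per-thread bit layout,
    // but acquisition is a single atomic OR instead of store-then-reload.
    int TryLock_locked(volatile uint64_t *lock, unsigned threadNum)
    {
        uint64_t mine = 1ULL << (8 * threadNum);
        // Compiles to a lock-prefixed RMW on x86, which is a full barrier.
        uint64_t old = __atomic_fetch_or(lock, mine, __ATOMIC_ACQUIRE);
        if (old != 0) {
            // Someone else already held a byte; retract our bit and fail.
            __atomic_fetch_and(lock, ~mine, __ATOMIC_RELEASE);
            return 0;
        }
        return 1;  // we were first: all other bytes were zero when we set ours
    }

On x86 the `lock`-prefixed RMW is already a full memory barrier, so no separate `mfence` is needed.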

The narrow-write / wide-read will always trigger a store-forwarding stall on modern x86 hardware. I think this just means the load result isn't ready for several cycles, not that execution of other instructions stalls (at least not in an OOO design).

In a lightly-contended lock that's used frequently, the branch will be correctly predicted to take the no-conflict path. Speculative execution down that path until the load finally completes and the branch can retire shouldn't stall, because store-forwarding stalls aren't quite long enough to fill up the ROB.

Store-forwarding stall penalties (figures mostly from Agner Fog's microarch guide):

* SnB: ~12 cycles longer than when store-forwarding works (~5c)
* HSW: ~10c longer
* SKL: ~11c longer than when store-forwarding works (4c for 32- and 64-bit operands, which is 1c less than previous CPUs)
* AMD K8/K10: Agner Fog doesn't give a number.
* AMD Bulldozer-family: 25-26c (Steamroller)
* Atom: "Unlike most other processors, the Atom can do store forwarding even if the read operand is larger than the preceding write operand or differently aligned", and there is only 1c latency. It only fails when crossing a cache-line boundary.
* Silvermont: ~5c extra (base: 7c)
* AMD Bobcat/Jaguar: 4-11c extra (base: 8c/3c)

So *if* the whole locking scheme works, it might do well for lightly-contended locks.

I think you could turn it into a multiple-readers/single-writer lock by using bit 1 in each byte for readers and bit 2 for writers. TryLock_reader would ignore the reader bits in other bytes. TryLock_writer would work like the original, requiring a zero in all bits in other bytes.
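
A hedged sketch of that reader/writer variant (my own code and bit assignment: bit 0 of each byte = reader, bit 1 = writer; it keeps the per-byte idea, so it needs the same full barrier between the narrow store and the wide reload that is discussed in this thread):

    #include <stdint.h>

    #define READ_BIT    0x01ULL
    #define WRITE_BIT   0x02ULL
    #define ALL_WRITERS 0x0202020202020202ULL   // the writer bit of every byte

    int TryLock_reader(volatile uint64_t *lock, unsigned threadNum)
    {
        ((volatile uint8_t *)lock)[threadNum] = READ_BIT;   // announce our read
        __atomic_thread_fence(__ATOMIC_SEQ_CST);            // full barrier (mfence); see answer #3
        if (*lock & ALL_WRITERS) {                          // any writer bit anywhere?
            ((volatile uint8_t *)lock)[threadNum] = 0;      // back out
            return 0;
        }
        return 1;   // reader bits in other bytes are ignored
    }

    int TryLock_writer(volatile uint64_t *lock, unsigned threadNum)
    {
        if (*lock != 0)
            return 0;
        ((volatile uint8_t *)lock)[threadNum] = WRITE_BIT;
        __atomic_thread_fence(__ATOMIC_SEQ_CST);
        // every bit outside our own byte must still be zero (no readers, no writers)
        if (*lock != (WRITE_BIT << (8 * threadNum))) {
            ((volatile uint8_t *)lock)[threadNum] = 0;
            return 0;
        }
        return 1;
    }

Like the original, this can still fail for all callers at once under contention; it only changes what each side has to check.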

----


BTW, for memory ordering stuff in general, [Jeff Preshing's blog is excellent][5].


[1]:
[2]:
[3]:
[4]:
[5]:


#3
> Can x86 reorder a narrow store with a wider load that fully contains
> it?

Yes, x86 can reorder a narrow store with a wider load that fully contains it.

That is why your lock algorithm is broken: `shared_value` does not end up equal to 800000 (a sketch of the kind of test harness involved follows the results below):

1. GCC 6.1.0 x86_64 - link to assembler code:
   * `shared_value = `**`662198`**
2. Clang 3.8.0 x86_64 - link to assembler code:
   * `shared_value = `**`538246`**

A corrected example is given below.
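
First, for context, here is a rough sketch of the kind of harness that produces these lost-update totals (identifiers, thread choice, and iteration count are my own, not taken from the linked code; the counts are chosen so that a correct lock would end at `shared_value = 800000`): two threads take the naive byte-store/qword-reload lock, bump an unprotected counter, and release.

    #include <pthread.h>
    #include <stdint.h>
    #include <stdio.h>

    #define ITERS 400000

    static volatile int64_t lock_word;      // byte N belongs to thread N
    static volatile int64_t shared_value;   // deliberately unprotected counter

    static int try_lock(int n)
    {
        if (lock_word != 0)
            return 0;                              // someone already holds a byte
        ((volatile int8_t *)&lock_word)[n] = 1;    // narrow store: claim our byte
        if (lock_word != (1LL << (8 * n))) {       // wider reload: anyone else?
            ((volatile int8_t *)&lock_word)[n] = 0;
            return 0;
        }
        return 1;                                  // we think we own the lock
    }

    static void *worker(void *arg)
    {
        int n = (int)(intptr_t)arg;
        for (int i = 0; i < ITERS; i++) {
            while (!try_lock(n))
                ;                                      // spin until "acquired"
            shared_value = shared_value + 1;           // critical section
            ((volatile int8_t *)&lock_word)[n] = 0;    // release our byte
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t5;
        pthread_create(&t1, NULL, worker, (void *)(intptr_t)1);
        pthread_create(&t5, NULL, worker, (void *)(intptr_t)5);
        pthread_join(t1, NULL);
        pthread_join(t5, NULL);
        printf("shared_value = %lld (expected %d)\n",
               (long long)shared_value, 2 * ITERS);
        return 0;
    }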

----------


> The question:
>
> The ((INT8*)lock)[ 1 ] = 1; and ((INT8*)lock)[ 5 ] = 1; stores aren't to
> the same location as the 64bit load of lock. However, they are each
> fully contained by that load, so does that "count" as the same
> location?
>

No, it does not count as the same location.


> [Intel® 64 and IA-32 Architectures Software Developer’s Manual][1] says:
>
> 8.2.3.4 Loads May Be Reordered with Earlier Stores to Different Locations The Intel-64 memory-ordering model allows a load to be
> reordered with an earlier store to a different location. However,
> loads are not reordered with stores to the same location.

This is a simplified rule for the case when the STORE and the LOAD are the same size.

The general rule is that a write to memory is delayed for a time: the STORE (address + value) is queued in the Store Buffer and waits for the cache line to reach the exclusive state (E), i.e. until that cache line has been invalidated (I) in the caches of the other CPU cores. You can use the asm operation `MFENCE` (or any operation with a `LOCK` prefix) to force a wait until the write is done, so that any following instructions execute only after the Store Buffer has been drained and the STORE has become visible to all CPU cores.

About the reordering of these two lines:

    ((volatile INT8*)lock)[threadNum] = 1; // STORE
    if (1LL << 8*threadNum != *lock)       // LOAD

* If the STORE and LOAD sizes are equal, the loading CPU core does a store-forwarding lookup in the Store Buffer and sees all the required data - the load can get the current data right away, before the STORE has completed.

* If the STORE and LOAD sizes differ - STORE (1 byte) and LOAD (8 bytes) - then even if the loading CPU core does a Store Buffer lookup, it finds only 1/8 of the required data there, so the load cannot get all of the current data before the STORE has completed. There are two possible CPU behaviours:

1. **case-1:** The CPU core loads the rest of the data from the cache line, which is still in the shared state (S), and merges in the 1 byte from the Store Buffer, while the STORE stays in the Store Buffer waiting for the cache line to reach the exclusive state (E) before modifying it - i.e. the core reads the data before the STORE has completed. In your example this is a data race (error): the STORE-LOAD pair is reordered to LOAD-STORE in the globally visible order. **This is exactly what happens on x86_64.**

2. **case-2:** The CPU core waits until the Store Buffer has been flushed - the STORE waits for the exclusive state (E) of the cache line and completes - and only then loads all the required data from the cache line. The STORE-LOAD pair isn't reordered in the globally visible order. But this is the same as if you had used `MFENCE`.

Conclusion: you must use `MFENCE` after the STORE in either case:

1. It completely solves the problem in **case-1**.
2. It has no effect on behaviour or performance in **case-2**: an explicit `MFENCE` on an empty Store Buffer completes immediately.

----------


**The correct example on C and x86_64 asm:**

We force the CPU core to act as in **case-2** by using `MFENCE`; consequently there **is no StoreLoad reordering**.

* GCC 6.1.0 (uses `mfence` to flush the Store Buffer)
* Clang 4.0 (uses `[LOCK] xchgb reg, [addr]` to flush the Store Buffer)

Note: `xchgb` with a memory operand always has an implicit `LOCK` prefix, so it usually is not written in the asm or is only indicated in brackets.

All other compilers can be selected manually on the links above: PowerPC, ARM, ARM64, MIPS, MIPS64, AVR.


C code - it should use Sequential Consistency for the first STORE and the next LOAD:

    #ifdef __cplusplus
      #include <atomic>
      #include <cstdint>
      using namespace std;
    #else
      #include <stdatomic.h>
      #include <stdint.h>
    #endif

    // lock - pointer to an aligned int64 variable
    // threadNum - integer in the range 0..7
    // volatiles here just to show direct r/w of the memory as it was suggested in the comments
    int TryLock(volatile uint64_t* lock, uint64_t threadNum)
    {
        //if (0 != *lock)
        if (0 != atomic_load_explicit((atomic_uint_least64_t*)lock, memory_order_acquire))
            return 0; // another thread already had the lock

        //((volatile uint8_t*)lock)[threadNum] = 1; // take the lock by setting our byte
        uint8_t* current_lock = ((uint8_t*)lock) + threadNum;
        atomic_store_explicit((atomic_uint_least8_t*)current_lock, (uint8_t)1, memory_order_seq_cst);

        //if (1LL << 8*threadNum != *lock)
        // You already know that your own flag is set and should not have to check it.
        if (0 != ((~(1LL << 8*threadNum)) &
                  atomic_load_explicit((atomic_uint_least64_t*)lock, memory_order_seq_cst)))
        {   // another thread set its byte between our 1st and 2nd check. unset ours
            //((volatile uint8_t*)lock)[threadNum] = 0;
            atomic_store_explicit((atomic_uint_least8_t*)current_lock, (uint8_t)0, memory_order_release);
            return 0;
        }

        return 1;
    }

GCC 6.1.0 x86_64 asm output - note the `MFENCE` after the first STORE:

    TryLock(unsigned long volatile*, unsigned long):
            movq    (%rdi), %rdx
            xorl    %eax, %eax
            testq   %rdx, %rdx
            je      .L7
    .L1:
            rep ret
    .L7:
            leaq    (%rdi,%rsi), %r8
            leaq    0(,%rsi,8), %rcx
            movq    $-2, %rax
            movb    $1, (%r8)
            rolq    %cl, %rax
            mfence
            movq    (%rdi), %rdi
            movq    %rax, %rdx
            movl    $1, %eax
            testq   %rdi, %rdx
            je      .L1
            movb    $0, (%r8)
            xorl    %eax, %eax
            ret

Full example of how it works:

    shared_value = 800000


----------


**What happens if you do not use `MFENCE` - Data Races**

There is **StoreLoad reordering** as described above in **case-1** (i.e. if you don't use Sequential Consistency for the STORE):

* GCC 6.1.0 x86_64 - `shared_value = 610307`
* Clang 3.8.0 x86_64 - `shared_value = 678949`


I changed the memory order for the STORE from `memory_order_seq_cst` to `memory_order_release`, which removes the `MFENCE` - and now there are data races: `shared_value` is not equal to 800000.



[1]:

