07-24-2023, 02:52 AM
[Intel® 64 and IA-32 Architectures Software Developer’s Manual][1] says:
> **8.2.3.4 Loads May Be Reordered with Earlier Stores to Different Locations**<br/>The Intel-64 memory-ordering model allows a load to be reordered with an earlier store to a different location.
> However, loads are not reordered with stores to the same location.

What about loads that partially or fully overlap previous stores, but don't have the same start address? (See the end of this post for a specific case.)
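
The reordering that 8.2.3.4 permits is usually demonstrated with a store-buffering litmus test, where each thread stores to one location and then loads the other. Here is a minimal sketch in C, not taken from the manual. The names `x`, `y`, `count_both_zero` and the spin gate are hypothetical. It assumes that relaxed atomics plus `atomic_signal_fence` compile to plain `mov` instructions with no hardware fence, as GCC and Clang typically do on x86-64. Whether the "both loads saw 0" outcome shows up, and how often, depends on the machine and on timing.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

static atomic_int x, y;      /* two different locations, both start at 0 */
static atomic_int arrived;   /* start gate so both threads run together  */
static int r1, r2;           /* the value each thread loaded             */

/* Spin until both threads have reached this point. */
static void wait_for_both(void) {
    atomic_fetch_add(&arrived, 1);
    while (atomic_load(&arrived) < 2) { /* spin */ }
}

static void *store_x_load_y(void *arg) {
    (void)arg;
    wait_for_both();
    atomic_store_explicit(&x, 1, memory_order_relaxed);   /* store to x */
    atomic_signal_fence(memory_order_seq_cst);            /* compiler barrier only (assumption) */
    r1 = atomic_load_explicit(&y, memory_order_relaxed);  /* later load from y */
    return NULL;
}

static void *store_y_load_x(void *arg) {
    (void)arg;
    wait_for_both();
    atomic_store_explicit(&y, 1, memory_order_relaxed);   /* store to y */
    atomic_signal_fence(memory_order_seq_cst);            /* compiler barrier only (assumption) */
    r2 = atomic_load_explicit(&x, memory_order_relaxed);  /* later load from x */
    return NULL;
}

static void start(pthread_t *t, void *(*fn)(void *)) {
    if (pthread_create(t, NULL, fn, NULL) != 0)
        abort();  /* cannot run the test without both threads */
}

/* Runs the litmus test `rounds` times and returns how many rounds ended
   with r1 == 0 and r2 == 0, i.e. both loads appeared to run before the
   other thread's store became visible. */
int count_both_zero(int rounds) {
    assert(rounds >= 0);
    int both_zero = 0;
    for (int i = 0; i < rounds; i++) {
        pthread_t a, b;
        atomic_store(&x, 0);
        atomic_store(&y, 0);
        atomic_store(&arrived, 0);
        start(&a, store_x_load_y);
        start(&b, store_y_load_x);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        if (r1 == 0 && r2 == 0)
            both_zero++;
    }
    return both_zero;
}
```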
---
Consider the following C-like code:

    // lock - pointer to an aligned int64 variable
    // threadNum - integer in the range 0..7
    // volatile is used here only to force direct reads/writes of memory, as suggested in the comments
    int TryLock(volatile INT64* lock, INT64 threadNum)
    {
        if (0 != *lock)
            return 0;                               // another thread already had the lock
        ((volatile INT8*)lock)[threadNum] = 1;      // take the lock by setting our byte
        if ((1LL << 8*threadNum) != *lock)
        {   // another thread set its byte between our 1st and 2nd check. unset ours
            ((volatile INT8*)lock)[threadNum] = 0;
            return 0;
        }
        return 1;
    }
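Or its x64 asm equivalent:

    ; rcx - address of an aligned int64 variable
    ; rdx - integer in the range 0..7
    TryLock PROC
        cmp qword ptr [rcx], 0      ; first check: is any byte of the lock set?
        jne @fail
        mov r8, rdx                 ; save threadNum, since mul overwrites rdx
        mov rax, 8
        mul rdx                     ; rax = 8*threadNum, rdx = 0
        mov byte ptr [rcx+r8], 1    ; narrow store: set our byte
        bts rdx, rax                ; rdx = 1 << (8*threadNum)
        cmp qword ptr [rcx], rdx    ; wide load: is our byte the only one set?
        jz @success
        mov byte ptr [rcx+r8], 0    ; another thread got in: clear our byte
    @fail:
        mov rax, 0
        ret
    @success:
        mov rax, 1
        ret
    TryLock ENDP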
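
The second comparison in `TryLock` only works because the byte store lands inside the 8 bytes that the following 64-bit load reads. A small single-threaded sketch of that overlap, assuming a little-endian layout as on x86 (the function names are hypothetical and not part of the code above):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Value a 64-bit load returns after a 1 is stored into byte `thread_num`
   of a zeroed 8-byte lock. The layout is copied through a byte array so
   the overlap is explicit. */
int64_t lock_value_after_byte_store(int thread_num) {
    assert(thread_num >= 0 && thread_num < 8);
    int64_t lock = 0;
    unsigned char bytes[sizeof lock];
    memcpy(bytes, &lock, sizeof lock);
    bytes[thread_num] = 1;               /* the narrow store */
    memcpy(&lock, bytes, sizeof lock);   /* the wide load */
    return lock;
}

/* The value TryLock compares against: 1 << (8 * threadNum). */
int64_t expected_mask(int thread_num) {
    assert(thread_num >= 0 && thread_num < 8);
    return (int64_t)1 << (8 * thread_num);
}
```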
---
Then suppose that `TryLock` is executed concurrently in two threads:

    INT64 lock = 0;
    void Thread_1() { TryLock(&lock, 1); }
    void Thread_5() { TryLock(&lock, 5); }
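
To try this scenario on real hardware, a harness along the following lines could be used. It is only a sketch under several assumptions: it uses POSIX threads and `<stdint.h>` types instead of `INT64`/`INT8`, the names `try_lock`, `worker` and `count_double_acquisitions` are hypothetical, and the mixed-size `volatile` accesses are formally a data race in C. They are kept only to reproduce the machine-level access pattern from the post.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

static volatile int64_t g_lock;   /* shared lock word */
static atomic_int g_arrived;      /* start gate */

typedef struct { int64_t thread_num; int acquired; } worker_t;

/* Same logic as TryLock in the post, with portable type names. */
int try_lock(volatile int64_t *lock, int64_t thread_num) {
    assert(thread_num >= 0 && thread_num < 8);
    if (0 != *lock)
        return 0;                                   /* lock already taken */
    ((volatile int8_t *)lock)[thread_num] = 1;      /* narrow store */
    if (((int64_t)1 << (8 * thread_num)) != *lock) {/* wide load */
        ((volatile int8_t *)lock)[thread_num] = 0;  /* back off */
        return 0;
    }
    return 1;
}

static void *worker(void *arg) {
    worker_t *w = arg;
    atomic_fetch_add(&g_arrived, 1);
    while (atomic_load(&g_arrived) < 2) { /* spin until both are ready */ }
    w->acquired = try_lock(&g_lock, w->thread_num);
    return NULL;
}

/* Runs threads 1 and 5 against a fresh lock `rounds` times and returns
   how many rounds ended with BOTH threads believing they hold the lock. */
int count_double_acquisitions(int rounds) {
    assert(rounds >= 0);
    int both = 0;
    for (int i = 0; i < rounds; i++) {
        worker_t w1 = {1, 0}, w5 = {5, 0};
        pthread_t t1, t5;
        g_lock = 0;
        atomic_store(&g_arrived, 0);
        if (pthread_create(&t1, NULL, worker, &w1) != 0) abort();
        if (pthread_create(&t5, NULL, worker, &w5) != 0) abort();
        pthread_join(t1, NULL);
        pthread_join(t5, NULL);
        if (w1.acquired && w5.acquired)
            both++;
    }
    return both;
}
```

A nonzero return value would mean that, in some round, each thread's 64-bit load saw its own byte but not the other thread's. A return value of zero proves nothing either way, since the window is narrow and creating threads per round makes tight overlap unlikely.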
### The question:
The `((INT8*)lock)[1] = 1;` and `((INT8*)lock)[5] = 1;` stores aren't to the same location as the 64-bit load of `lock`. However, each of them is fully contained by that load, so does that "count" as the same location? It seems impossible that a CPU could reorder the load ahead of a store that it fully contains.
What about `((INT8*)lock)[0] = 1`? The address of the store is then the same as the address of the following load. Are these operations "to the same location", even if the earlier case wasn't?
P.S. Please note that this question isn't about the C or asm code; it's about the behaviour of x86 CPUs.
[1]: