INC instruction vs ADD 1: Does it matter?

#1
From Ira Baxter's answer on *Why do the INC and DEC instructions **not** affect the Carry Flag (CF)?*:

> Mostly, I stay away from `INC` and `DEC` now, because they do partial condition code updates, and this can cause funny stalls in the pipeline, and `ADD`/`SUB` don't. So where it doesn't matter (most places), I use `ADD`/`SUB` to avoid the stalls. I use `INC`/`DEC` only when keeping the code small matters, e.g., fitting in a cache line where the size of one or two instructions makes enough difference to matter. This is probably pointless nano[literally!]-optimization, but I'm pretty old-school in my coding habits.

I would like to ask why it can cause stalls in the pipeline while `ADD` doesn't. After all, both `ADD` and `INC` update the flags register. The only difference is that `INC` doesn't update `CF`. Why does that matter?

#2
Depending on the CPU's implementation of the instructions, a partial update of the flags register may cause a stall. According to Agner Fog's optimization guide, page 62:

> For historical reasons, the `INC` and `DEC` instructions leave the carry flag unchanged, while the other arithmetic flags are written to. This causes a false dependence on the previous value of the flags and costs an extra μop. To avoid these problems, it is recommended that you always use `ADD` and `SUB` instead of `INC` and `DEC`. For example, `INC EAX` should be replaced by `ADD EAX,1`.

See also page 83 on "Partial flags stalls" and page 100 on "Partial flags stall".

#3
Update: the [Efficiency cores][1] on Alder Lake are [Gracemont][2], and run `inc reg` as a single uop, but at only 1/clock, vs. 4/clock for `add reg, 1`. This may be a false dependency on FLAGS like P4 had; the uops.info tests didn't try adding a dep-breaking instruction. Other than the TL:DR, I haven't updated other parts of this answer.

Update2: [One 2021 source claims][3] Ice Lake can't macro-fuse `inc`/`dec` (or instructions with memory operands) with `jcc`. If that's actually true, use `sub ecx, 1` / `jnz` instead of the traditional `dec ecx` / `jnz` in cases like the bottom of loops. But [Agner Fog][4] and [uiCA][5] both say ICL can still fuse `inc`/`dec`. It's possible it changed with a microcode update between 2019 and 2021, but [testing on Tiger Lake][6] showed it doesn't work that way.
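For concreteness, here's a minimal sketch of the two loop-bottom idioms being compared (the labels and counter register are just illustrative):

    count_down_dec:            # traditional form: 1 byte smaller
        # ... loop body ...
        dec ecx
        jnz count_down_dec

    count_down_sub:            # use this form if inc/dec really can't macro-fuse with jcc
        # ... loop body ...
        sub ecx, 1
        jnz count_down_sub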

----

**TL:DR/advice for modern CPUs: Probably use `add`; Intel Alder Lake's E-cores are relevant for "generic" tuning and seem to run `inc` slowly**.

Other than Alder Lake and earlier Silvermont-family, use `inc` except with a memory destination; that's fine on mainstream Intel or any AMD. (e.g. like gcc `-mtune=core2`, `-mtune=haswell`, or `-mtune=znver1`). `inc mem` costs an extra uop vs. `add` on Intel P6 / SnB-family; the load can't micro-fuse.

If you care about Silvermont-family (including KNL in Xeon Phi, and some netbooks, chromebooks, and NAS servers), probably avoid `inc`. `add 1` only costs 1 extra byte in 64-bit code, or 2 in 32-bit code. But it's not a performance disaster (just locally 1 extra ALU port used, not creating false dependencies or big stalls), so if you don't care *much* about SMont then don't worry about it.

Writing CF instead of leaving it unmodified can potentially be useful with other surrounding code that might benefit from CF dep-breaking, e.g. shifts. See below.

If you want to inc/dec without touching *any* flags, `lea eax, [rax+1]` runs efficiently and has the same code-size as `add eax, 1`. (It usually runs on fewer possible execution ports than add/inc, though, so add/inc are better when destroying FLAGS is not a problem.)
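As a sketch of the kind of spot where the flag-preserving increment matters (the surrounding code and the `out_of_range` label are hypothetical):

    # cmp sets the flags the branch will test; lea bumps a counter in between
    # without disturbing them (add/inc here would clobber them).
        cmp eax, ebx
        lea ecx, [rcx + 1]     # same size as add ecx, 1, but writes no flags
        jae out_of_range       # still tests the flags from the cmp, not from the increment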

---

**On modern CPUs, `add` is never *slower* than `inc` (except for indirect code-size / decode effects), but usually it's not faster either, so you should prefer `inc` for code-size reasons**. Especially if this choice is repeated many times in the same binary (e.g. if you are a compiler-writer).

`inc` saves 1 byte in 64-bit mode, or 2 bytes in 32-bit mode (opcodes 0x40..0x4F are the `inc r32`/`dec r32` short forms, re-purposed as the REX prefixes for x86-64). This makes a small percentage difference in total code size. This helps instruction-cache hit rates, iTLB hit rate, and the number of pages that have to be loaded from disk.
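As a worked example of the size difference (standard encodings for the register form; other registers encode the same way):

    # Worked size comparison:
    #   32-bit mode:  inc eax    = 40          (1 byte, short form)
    #                 add eax,1  = 83 C0 01    (3 bytes)
    #   64-bit mode:  inc eax    = FF C0       (2 bytes; 0x40..0x4F are REX prefixes now)
    #                 add eax,1  = 83 C0 01    (3 bytes)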

Advantages of `inc`:

* **code-size** directly
* Not using an immediate can have uop-cache effects on Sandybridge-family, which could offset the better micro-fusion of `add`. (See [Agner Fog's table 9.1 in the Sandybridge section of his microarch guide][7].) Perf counters can easily measure issue-stage uops, but it's harder to measure how things pack into the uop cache and uop-cache read bandwidth effects.
* Leaving CF unmodified is an advantage in some cases, on CPUs where you can read CF after `inc` without a stall. (Not on Nehalem and earlier.)

----

There is one exception among modern CPUs: **Silvermont/Goldmont/Knight's Landing** decodes `inc`/`dec` efficiently as 1 uop, but expands to 2 in the allocate/rename (aka issue) stage. The extra uop merges partial flags. [`inc` throughput is only 1 per clock, vs. 0.5c (or 0.33c Goldmont) for independent `add r32, imm8`][8] because of the dep chain created by the flag-merging uops.

Unlike P4, the register result doesn't have a false-dep on flags (see below), so out-of-order execution takes the flag-merging off the latency critical path when nothing uses the flag result. (But the OOO window is much smaller than mainstream CPUs like Haswell or Ryzen.) Running `inc` as 2 separate uops is probably a win for Silvermont in most cases; most x86 instructions write all the flags without reading them, breaking these flag dependency chains.
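A sketch of the kind of unrolled sequence that shows this (assuming my reading of the numbers above is right):

    # The register dep chains are independent, but on Silvermont each inc expands
    # to inc + a flag-merging uop, and each merge reads the FLAGS value produced
    # by the previous merge -- so the sequence runs at ~1 inc per clock.
    # The same sequence written with add reg, 1 has no chain through FLAGS.
        inc eax
        inc ebx
        inc esi
        inc edi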

SMont/KNL has a queue between decode and allocate/rename (See [Intel's optimization manual, figure 16-2][9]) so expanding to 2 uops during issue can fill bubbles from decode stalls (on instructions like one-operand `mul`, or `pshufb`, which produce more than 1 uop from the decoder and cause a 3-7 cycle stall for microcode). Or on Silvermont, just an instruction with more than 3 prefixes (including escape bytes and mandatory prefixes), e.g. REX + any SSSE3 or SSE4 instruction. But note that there is a ~28 uop loop buffer, so small loops don't suffer from these decode stalls.

`inc`/`dec` aren't the only instructions that decode as 1 but issue as 2: `push`/`pop`, `call`/`ret`, and `lea` with 3 components do this too. So do KNL's AVX512 gather instructions. Source: [Intel's optimization manual][9], 17.1.2 Out-of-Order Engine (KNL). It's only a small throughput penalty (and sometimes not even that if anything else is a bigger bottleneck), so it's generally fine to still use `inc` for "generic" tuning.

----

Intel's optimization manual still recommends `add 1` over `inc` in general, to avoid risks of partial-flag stalls. But since Intel's compiler doesn't do that by default, it's not too likely that future CPUs will make `inc` slow in all cases, like P4 did.

[Clang 5.0 and Intel's ICC 17 (on Godbolt)][10] do use `inc` when optimizing for speed (`-O3`), not just for size. `-mtune=pentium4` makes them avoid `inc`/`dec`, but the default `-mtune=generic` doesn't put much weight on P4.

ICC17 [`-xMIC-AVX512`][11] (equivalent to gcc's `-march=knl`) does avoid `inc`, which is probably a good bet in general for Silvermont / KNL. But it's not usually a performance disaster to use `inc`, so it's probably still appropriate for "generic" tuning to use `inc`/`dec` in most code, especially when the flag result isn't part of the critical path.

---

**Other than Silvermont, this is mostly-stale optimization advice left over from Pentium4**. On modern CPUs, there's only a problem if you actually read a flag that wasn't written by the last insn that wrote *any* flags. [e.g. in BigInteger `adc` loops.][12] (And in that case, you need to preserve CF so using `add` would break your code.)

`add` writes all the condition-flag bits in the EFLAGS register. Register-renaming makes write-only easy for out-of-order execution: see [write-after-write and write-after-read hazards][13]. `add eax, 1` and `add ecx, 1` can execute in parallel because they are fully independent of each other. (Even Pentium4 renames the condition flag bits separate from the rest of EFLAGS, since even `add` leaves the interrupts-enabled and many other bits unmodified.)

**On P4, `inc` and `dec` depend on the previous value of all the flags**, so they can't execute in parallel with each other or preceding flag-setting instructions. (e.g. `add eax, [mem]` / `inc ecx` makes the `inc` wait until after the `add`, even if the add's load misses in cache.) **This is called a false dependency**. Partial-flag writes work by reading the old value of the flags, updating the bits other than CF, then writing the full flags.
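Written out as code, the sequence from that parenthetical (with `[mem]` standing in for any memory operand):

    # On P4: inc reads the old FLAGS value, so it can't execute until the add
    # (and its possibly cache-missing load) has produced its flag result, even
    # though ecx has nothing to do with eax or [mem].
        add eax, [mem]
        inc ecx
        # add ecx, 1 here would be fully independent and could execute early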

**All other out-of-order x86 CPUs (including AMD's) rename different parts of flags separately, so internally they do a write-only update to all the flags except CF**. (Source: [Agner Fog's microarchitecture guide][7].) Only a few instructions, like `adc` or `cmc`, truly read and then write flags. But also `shl r, cl` (see below).

----

**Cases where `add dest, 1` is preferable to `inc dest`, at least for Intel P6/SnB uarch families**:

* **Memory-destination**: `add [rdi], 1` can [micro-fuse the store and the load+add on Intel Core2 and SnB-family][14], so it's 2 fused-domain uops / 4 unfused-domain uops.
`inc [rdi]` can only micro-fuse the store, so it's 3F / 4U.
According to Agner Fog's tables, AMD and Silvermont run memory-dest `inc` and `add` the same, as a single macro-op / uop.

But beware of uop-cache effects with `add [label], 1` which needs a 32-bit address and an 8-bit immediate for the same uop.
* **Before a variable-count shift/rotate** to break the dependency on flags and avoid partial-flag merging: `shl reg, cl` has an input dependency on the flags, because of unfortunate CISC history: [it has to leave them unmodified if the shift count is 0][15].

On Intel SnB-family, variable-count shifts are 3 uops (up from 1 on Core2/Nehalem). AFAICT, two of the uops read/write flags, and an independent uop reads `reg` and `cl`, and writes `reg`. It's a weird case of having better latency (1c + inevitable resource conflicts) than throughput (1.5c), and only being able to achieve max throughput if mixed with instructions that break dependencies on flags. ([I posted more about this][16] on Agner Fog's forum). Use BMI2 `shlx` when possible; it's 1 uop and the count can be in any register.

Anyway, `inc` (writing flags but leaving `CF` unmodified) before variable-count `shl` leaves it with a false dependency on whatever wrote CF last, and on SnB/IvB can require an extra uop to merge flags.

Core2/Nehalem manage to avoid even the false dep on flags: Merom runs a loop of 6 independent `shl reg,cl` instructions at nearly two shifts per clock, same performance with cl=0 or cl=13. Anything better than 1 per clock proves there's no input-dependency on flags.

I tried loops with `shl edx, 2` and `shl edx, 0` (immediate-count shifts), but didn't see a speed difference between `dec` and `sub` on Core2, HSW, or SKL. I don't know about AMD.

Update: The nice shift performance on Intel P6-family comes at the cost of a large performance pothole which you need to avoid: when an instruction depends on the flag result of a shift instruction, **the *front end* stalls until the shift is *retired*.** (Source: [Intel's optimization manual, Section 3.5.2.6: Partial Flag Register Stalls][17].) So `shr eax, 2` / `jnz` is pretty catastrophic for performance on Intel pre-Sandybridge, I guess! Use `shr eax, 2` / `test eax,eax` / `jnz` if you care about Nehalem and earlier. Intel's examples make it clear this applies to immediate-count shifts, not just count=`cl`.

> In processors based on Intel Core microarchitecture [this means Core 2 and later], shift immediate by 1 is handled by special hardware such that it does not experience partial flag stall.

Intel actually means the special opcode with no immediate, which shifts by an implicit `1`. I think there is a performance difference between the two ways of encoding `shr eax,1`: the short encoding (using the original 8086 opcode [`D1 /5`][15]) produces a write-only (partial) flag result, while the longer encoding (`C1 /5, imm8` with an immediate `1`) doesn't have its immediate checked for 0 until execution time, and apparently doesn't have its flag output tracked in the out-of-order machinery.

Since looping over bits one at a time is common, but looping over every 2nd bit (or any other stride) is very uncommon, this seems like a reasonable design choice. This explains why compilers like to `test` the result of a shift instead of directly using the flag results from `shr`.
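To make the two patterns from the last few paragraphs concrete (the `keep_going` label is just illustrative):

    # Branching directly on the shift's flag result: on Nehalem and earlier the
    # front-end stalls until the shr retires.
        shr eax, 2
        jnz keep_going

    # Re-testing the result instead: test writes all the flags, so jnz reads a
    # complete, freshly-written flag result and there's nothing to merge.
        shr eax, 2
        test eax, eax
        jnz keep_going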

Update: for variable count shifts on SnB-family, Intel's optimization manual says:

> **3.5.1.6 Variable Bit Count Rotation and Shift**
>
> In Intel microarchitecture code name Sandy Bridge, The “ROL/ROR/SHL/SHR reg, cl” instruction has three micro-ops. **When the flag result is not needed, one of these micro-ops may be discarded, providing better performance in many common usages**. When these instructions update partial flag results that are subsequently used, the full three micro-ops flow must go through the execution and retirement pipeline, experiencing slower performance. In Intel microarchitecture code name Ivy Bridge, executing the full three micro-ops flow to use the updated partial flag result has additional delay.
>
> Consider the looped sequence below:
>
>     loop:
>         shl eax, cl
>         add ebx, eax
>         dec edx        ; DEC does not update carry, causing SHL to execute slower three micro-ops flow
>         jnz loop
>
> The DEC instruction does not modify the carry flag. Consequently, the
> SHL EAX, CL instruction needs to execute the three micro-ops flow in
> subsequent iterations. The SUB instruction will update all flags. So
> replacing `DEC` with `SUB` will allow `SHL EAX, CL` to execute the two
> micro-ops flow.


----

# Terminology

**Partial-flag stalls happen when flags are read**, if they happen at all. P4 never has partial-flag stalls, because they never need to be merged. It has false dependencies instead.

Several answers / comments mix up the terminology. They describe a false dependency, but then call it a partial-flag stall. It's a slowdown which happens because of writing only some of the flags, but the term "partial-flag **stall**" is what happens on pre-SnB Intel hardware when partial-flag writes have to be merged. Intel SnB-family CPUs insert an extra uop to merge flags without stalling. Nehalem and earlier stall for ~7 cycles. I'm not sure how big the penalty is on AMD CPUs.

(Note that partial-register penalties are not always the same as partial-flags, see below).

### Partial flag stall on Intel P6-family CPUs:
    bigint_loop:
        adc eax, [array_end + rcx*4]   # partial-flag stall when adc reads CF
        inc rcx                        # rcx counts up from negative values towards zero
        # test rcx,rcx                 # eliminate partial-flag stalls by writing all flags, or better use add rcx,1
        jnz bigint_loop
    # This loop doesn't do anything useful; it's not normally useful to loop the carry-out back to the carry-in for the same accumulator.
    # Note that `test` would change the input to the next adc, and so would replacing inc with add 1.

---

Other cases are fine, e.g. a partial-flag write followed by a full flag write, or a read of only the flags that `inc` did write. On SnB-family CPUs, [`inc`/`dec` can even macro-fuse with a `jcc`, the same as `add`/`sub`][18].

After P4, Intel mostly gave up on trying to get people to re-compile with `-mtune=pentium4` or modify hand-written asm as much to avoid serious bottlenecks. (Tuning for a specific microarchitecture will always be a thing, but **P4 was unusual in deprecating so many things that used to be fast on previous CPUs**, and thus were common in existing binaries.) P4 wanted people to use a RISC-like subset of the x86, and also had branch-prediction hints as prefixes for JCC instructions. (It also had other serious problems, like the trace cache that just wasn't good enough, and weak decoders that meant bad performance on trace-cache misses. Not to mention the whole philosophy of clocking very high ran into the power-density wall.)

When Intel abandoned P4 (NetBurst uarch), they returned to P6-family designs (Pentium-M / Core2 / Nehalem) which inherited their partial-flag / partial-reg handling from earlier P6-family CPUs (PPro to PIII) which pre-dated the netburst mis-step. (Not everything about P4 was inherently bad, and some of the ideas re-appeared in Sandybridge, but overall NetBurst is widely considered a mistake.) Some very-CISC instructions are still slower than the multi-instruction alternatives, e.g. `enter`, [`loop`][19], or `bt [mem], reg` (because the value of reg affects which memory address is used), but these were all slow in older CPUs so compilers already avoided them.

Pentium-M even improved hardware support for partial-regs (lower merging penalties). In Sandybridge, Intel kept partial-flag and partial-reg renaming and made it much more efficient when merging is needed (merging uop inserted with no or minimal stall). SnB made major internal changes and is considered a new uarch family, even though it inherits a lot from Nehalem, and some ideas from P4. (But note that SnB's decoded-uop cache is *not* a trace cache, though, so it's a very different solution to the decoder throughput/power problem that NetBurst's trace cache tried to solve.)

---

**For example, `inc al` and `inc ah` can run in parallel on P6/SnB-family CPUs, but reading `eax` afterwards requires merging**.

PPro/PIII stall for 5-6 cycles when reading the full reg. Core2/Nehalem stall for only 2 or 3 cycles while inserting a merging uop for partial regs, but partial flags are still a longer stall.

SnB inserts a merging uop without stalling, like for flags. Intel's optimization guide says that for merging AH/BH/CH/DH into the wider reg, inserting the merging uop takes an entire issue/rename cycle during which no other uops can be allocated. But for low8/low16, the merging uop is "part of the flow", so it apparently doesn't cause additional front-end throughput penalties beyond taking up one of the 4 slots in an issue/rename cycle.

**In IvyBridge (or at least Haswell), Intel dropped partial-register renaming for low8 and low16 registers, keeping it only for high8 registers (AH/BH/CH/DH). Reading high8 registers has extra latency. Also, `setcc al` has a false dependency on the old value of rax, unlike in Nehalem and earlier (and probably Sandybridge). See [this HSW/SKL partial-register performance Q&A][20] for the details.**

(I've previously claimed that Haswell could merge AH with no uop, but that's not true and not what Agner Fog's guide says. I skimmed too quickly and unfortunately repeated my wrong understanding in lots of comments and other posts.)

AMD CPUs, and Intel Silvermont, don't rename partial regs (other than flags), so `mov al, [mem]` has a false dependency on the old value of eax. (The upside is no partial-reg merging slowdowns when reading the full reg later.)
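A common way to sidestep that false dependency (a sketch; the pointer register is arbitrary) is to zero-extend the load so it writes the full register:

    # mov al, [rdi] would merge into the old value of rax, creating a false
    # dependency on rax on CPUs that don't rename partial registers.
    # movzx writes the whole register, starting a fresh dependency chain.
        movzx eax, byte [rdi]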

----

Normally, the only time `add` instead of `inc` will make your code faster on AMD or mainstream Intel is when your code actually depends on the doesn't-touch-CF behaviour of `inc`. i.e. **usually `add` only helps when it would break your code**, but note the `shl` case mentioned above, where the instruction reads flags but usually your code doesn't care about that, so it's a false dependency.

If you *do* actually want to leave CF unmodified, pre-SnB-family CPUs have serious problems with partial-flag stalls, but on SnB-family the overhead of having the CPU merge the partial flags is very low, so it can be best to keep using `inc` or `dec` as part of a loop condition when targeting those CPUs, with some unrolling. (For details, see the BigInteger `adc` Q&A I linked earlier.) It can be useful to use `lea` to do arithmetic without affecting flags at all, if you don't need to branch on the result.

---

## Skylake doesn't have partial-flag merging costs

**Update: Skylake doesn't have partial-flag merging uops at all: CF is just a separate register from the rest of FLAGS. Instructions that need both parts (like `cmovbe`) read both inputs separately. That makes `cmovbe` a 2-uop instruction, but most other `cmovcc` instructions 1-uop on Skylake.**

`adc` *only* reads CF so it can be single-uop on Skylake with no interaction at all with an `inc` or `dec` in the same loop.

(TODO: rewrite earlier parts of this answer.)


[1]: [To see links please register here]
[2]: [To see links please register here]
[3]: [To see links please register here]
[4]:
[5]: [To see links please register here]
[6]: [To see links please register here]
[7]: [To see links please register here]
[8]: [To see links please register here]
[9]: [To see links please register here]
[10]: [To see links please register here]
[11]: [To see links please register here]
[12]: [To see links please register here]
[13]: [To see links please register here]
[14]: [To see links please register here]
[15]: [To see links please register here]
[16]: [To see links please register here]
[17]: [To see links please register here]
[18]: [To see links please register here]
[19]: [To see links please register here]
[20]: [To see links please register here]