Thursday, April 19, 2007, 05:31 PM - Optimization
Today, I played a bit with loop unrolling and AMD CodeAnalyst.
One of the ways to unroll loops in C is to use Duff's device. While it looks a bit strange at first, it is quite handy for quick loop unrolling.
While having a look at the generated assembly code with and without it (compiler is Visual Studio 2005/8), I noticed a few interesting things:
*VC8 is able to unroll loops, provided that the number of iterations is known at compile time. My loop runs 16x, and VC8 unrolled it 4x. It might seem trivial for a compiler, but I don't remember previous versions having this ability.
*On a Core2, my "manual" 4x unrolling (using Duff's device) is still faster than the 4x auto-unrolled code produced by VC8, due to different instruction scheduling.
The generated code flow is a bit different in both cases:
*VC8 auto-4x-unroll features a "continue" jump at the end of the code block for looping, targeting the beginning of the code block. In my 16x loop, this jump is followed 3 times, and skipped the last time.
*My Duff's version features an "exit" jump after the first part of the unrolling (1*code - conditional jump - 3*code). This jump is skipped the first 3 times, and followed on the last pass.
The interesting point is provided by CodeAnalyst and its pipeline simulator. I used it to simulate an Athlon64 pipeline, and looked at the result:
In the case of Duff's device, the exit jump is mispredicted 3 times, making the Duff's version slower than VC8's automatic unrolling. Being mispredicted all 3 times means that an Athlon64 is unable to predict such a branch, always predicting it to be followed.
However, the code is faster using Duff's device when running on a Core2. That means that a trick like this one will perhaps increase performance a bit on Core2, but will noticeably reduce speed on Athlon64 by conflicting with its branch predictor.
Considering that VC8 is able to unroll some loops by itself, we had better think twice before playing with tricks such as Duff's device.
Thursday, December 14, 2006, 11:08 AM - Optimization
When switching from x86 to x64, it seems that Lame gets about a 15% speed increase. Something to consider is that in x86 mode we are using some hand-coded MMX and 3DNow! functions, and some SSE intrinsic functions (well, only 1 function right now), while in x64 we are only using the SSE intrinsic functions. So the comparison is:
x86: MMX + 3DNow! + SSE
x64: SSE only => +15% speed
In many benchmarks comparing x86 vs x64, when there is a speed increase, the speedup is bigger on K8-based processors (Athlon64/Opteron) than on Core-based processors (Core2). I've seen several articles implying, based on such tests, that Core processors might be performing sub-optimally in x64.
Let's try to find a possible explanation, based on the Lame results.
*Lame does not use 64-bit integer arithmetic, so the speedup cannot be caused by the ability to process this kind of computation in fewer cycles.
*Lame is heavily floating-point based.
The speed increase could come from the compiler vectorizing code to use SSE/SSE2 operations, which are always available in x64. However, experience has demonstrated that current compilers are not good at fully automatic vectorization, so it's unlikely to be the case.
As Lame uses single-precision floating-point arithmetic, we can also discard any (unlikely) potential speed increase of double-precision arithmetic in x64.
The only remaining point that I can think of is that the SSE ISA is register based while the x87 ISA is stack based, and that x64 adds even more registers to the SSE ISA.
Could it be because x64 increases the number of internal registers of the floating-point units? Unlikely, as the K8 floating-point core has 120 internal registers available, which is plenty.
The likely explanation is that SSE code is more compact than x87 code. SSE is register based, while x87 is stack based. In a stack-based model, you first have to push your operands onto the top of the stack, and only then can you do the calculation. In contrast, in a register-based scheme you can keep data in other registers while operating on new data, as you can compute on data stored in any register, not just on top of the stack.
A point to consider is that the SSE ISA uses 128-bit registers, which the K8 must process in 2 chunks of 64 bits. This means that unvectorized SSE, with only 32 bits of data in each register, has to use the floating-point computation units twice (with 1 run totally wasted), compared to only once for x87, just to get the same computational result.
Despite this, the K8 provides a substantial floating-point speedup when comparing SSE-based x64 against x87-based x86 mode. Obviously, this means that the computation units are not fully loaded, otherwise the doubled computation would decrease speed.
It seems that the K8 features good execution units, but that it is not optimally efficient at feeding them, and that SSE-based floating point in x64 helps it feed its units.
This would mean that the speed increase witnessed in x64 is not due to the 64-bit mode itself, but to the fact that it helps out a suboptimal decoding stage of the K8. It would explain a few things:
*There is a bigger speed increase when going from x86 to x64 on K8 than on Core, as Core features a more efficient decoding stage (micro- and macro-op fusion).
*The speed increase when going from x86 to x64 is bigger (on K8) in floating-point-based software than in integer-based software.