The dec/jnz pair runs faster than loopnz for several reasons. First, dec and jnz are handled by different modules of the NetBurst pipeline, so they can be executed simultaneously. Second, dec and jnz each take only a few cycles to execute, while loopnz (like all the loop instructions) takes more cycles to complete. For these reasons, good compilers rarely emit loop instructions.
Manual Optimization
The following lines of assembly code are not optimized, but they can be optimized very easily. Can you find a way to optimize these lines?
Duff’s device
The famous “Duff’s device” in C makes use of the fact that a case label is still legal within a sub-block of its matching switch statement. Tom Duff used this for an optimized output loop. Duff’s device is an optimized implementation of a serial copy that uses loop unwinding (also called loop unrolling), a technique widely applied in assembly language. It is perhaps the most dramatic use of case label fall-through in the C programming language to date.
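The idea can be sketched in C as follows. This is a minimal illustration, not Duff's original code: the function name duff_copy and its parameters are hypothetical, and the destination pointer is incremented here to make an ordinary memory copy, whereas the output loop described above wrote to a single output location. The switch jumps into the middle of an eight-way unrolled do-while loop to handle the remainder, and the case labels then fall through to the end of the loop body.

```c
#include <stddef.h>

/* Hypothetical helper: copy count bytes from src to dst using an
   eight-way unrolled loop in the style of Duff's device. */
static void duff_copy(unsigned char *dst, const unsigned char *src, size_t count)
{
    if (count == 0)
        return;                        /* the do-while would otherwise copy 8 bytes */

    size_t passes = (count + 7) / 8;   /* trips through the loop, rounded up */

    switch (count % 8) {               /* enter the loop body part-way through */
    case 0: do { *dst++ = *src++;
    case 7:      *dst++ = *src++;
    case 6:      *dst++ = *src++;
    case 5:      *dst++ = *src++;
    case 4:      *dst++ = *src++;
    case 3:      *dst++ = *src++;
    case 2:      *dst++ = *src++;
    case 1:      *dst++ = *src++;
            } while (--passes > 0);    /* later passes copy a full 8 bytes */
    }
}
```

With count equal to 10, for example, the switch enters at case 2 and copies two bytes, then the loop runs once more and copies the remaining eight. The loop condition is tested once per eight bytes instead of once per byte, which is the saving that loop unwinding is after. Whether this beats a plain loop on a given compiler and processor is something to measure rather than assume.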