Hello!
Your question touches on several related subjects, not the least of which is the following:
In Donald Knuth's paper "Structured Programming with go to Statements", he wrote: "Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%."
The lesson is that you shouldn't be concerned with replacing your C++ code with assembler until you know, because you have actually measured the program's performance, that (a) it's necessary, and (b) it will be effective.
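As a rough illustration of the "measure first" step, here is a minimal C++ sketch of a wall-clock timer. The name `seconds_taken` is made up for this example, and a real profiler will tell you far more than this does; it only shows the idea of checking where the time goes before rewriting anything.

```cpp
#include <chrono>
#include <utility>

// Hypothetical helper (not from your code): runs `work` once and
// returns the elapsed wall-clock time in seconds.
template <typename F>
double seconds_taken(F&& work) {
    const auto start = std::chrono::steady_clock::now();
    std::forward<F>(work)();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

You would wrap the C++ version and the assembler version of the same routine in a call like `seconds_taken([&] { /* routine under test */ })`, run each several times, and compare the numbers.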
Now, with respect to your example, I can make several comments and/or suggestions.
First, most compilers (including gcc) can write out the assembler they generate so that you can examine it. I recommend that you do so, because it is almost certain to show you just why your hand-written assembler isn't as fast as that generated by the compiler.
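As a sketch of what that looks like with gcc: the function below is only a hypothetical stand-in for your loop (the name `add_repeatedly` and the file name are made up), and the commands in the comments use gcc's `-S` option, which stops after compilation and writes the assembler to a file.

```cpp
// add_loop.cpp (hypothetical file name)
//
// To see the assembler gcc generates for this file:
//   g++ -O2 -S add_loop.cpp -o add_loop.s
// To have gcc annotate the output with extra comments:
//   g++ -O2 -S -fverbose-asm add_loop.cpp -o add_loop.s

// Stand-in for the original example: adds x + y to a running
// total n times and returns the total.
int add_repeatedly(int x, int y, int n) {
    int sum = 0;
    for (int i = 0; i < n; ++i) {
        sum += x + y;
    }
    return sum;
}
```

Comparing `add_loop.s` at `-O0` and at `-O2` is usually instructive, since the optimizer may well restructure or even eliminate a loop like this one.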
Now, looking at the assembler code you wrote, I can find places that are correct (i.e., they work) but aren't optimal, and so result in unnecessary slowness.
Here are some suggestions:
- In the inner loop, begun at `label` and ended with `loop label`, you are loading x and y every time. Move these loads outside of the loop, instead. You'll avoid the push/pop, too.
- Rather than loading the values to be added into registers eax and ebx and then storing the result, load 1 into a register, then add that register to memory directly. This avoids a load, a store, and an extra register.
- The loop instruction, though convenient, is notoriously slower than the equivalent (decrement register, jump if not 0).
- You can remove the `cmp edi,0` instruction, as the previous `dec edi` already sets the zero flag that the following `jnz start` tests.
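Put together, the suggestions above might look something like the sketch below. This is not your code: it is a hypothetical function (`add_in_loop`) written with gcc's extended inline assembler in its default AT&T syntax, so the operand order is the reverse of what you wrote, and it assumes an x86 target (there is a plain C++ fallback for anything else).

```cpp
#include <cstdint>

// Hypothetical example: adds `step` to a running total `iterations`
// times, using add-to-memory and dec/jnz instead of the loop instruction.
std::uint32_t add_in_loop(std::uint32_t start, std::uint32_t step,
                          std::uint32_t iterations) {
    std::uint32_t total = start;
    if (iterations == 0) {
        return total;  // the dec/jnz loop below must run at least once
    }
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    asm volatile(
        "1:\n\t"
        "addl %[step], %[total]\n\t"  // add the register straight to memory
        "decl %[count]\n\t"           // dec sets the zero flag at 0 ...
        "jnz 1b\n\t"                  // ... so no cmp is needed before jnz
        : [total] "+m"(total), [count] "+r"(iterations)
        : [step] "r"(step)            // loaded once, outside the loop
        : "cc");
#else
    // Portable fallback with the same behaviour.
    do {
        total += step;
    } while (--iterations != 0);
#endif
    return total;
}
```

The memory operand is there only to mirror the add-to-memory suggestion; keeping the total in a register for the whole loop would likely be faster still, which is the kind of thing the compiler's own output tends to show.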