Hello!
"Cache-friendly" code means that the code in question is constructed (either the instructions themselves, or the data they reference, or both) such that they take advantage of the behavior of the instruction, data, or both caches of the processor on which they're executing. To explain what I mean by this, I need to talk a bit about what a cache *is*, and how they work.
The instructions that a processor executes, and the data that these instructions manipulate, are in memory. We usually call this memory the RAM (random access memory) of the machine. Very typically, however, a processor can execute the equivalent of hundreds of instructions between requesting a value (or instruction) from an arbitrary memory location and actually having that value (or instruction) to act upon. This delay has actually gotten worse over time, because (a) processors today can execute instructions much faster than they used to, while (b) the speed of the pathway(s) between the processor and normal memory hasn't improved nearly as much. There has almost always been *some* kind of speed gap, though, which is why caches were invented long ago.
The premise is this: access to memory, whether for the instruction to execute or the data to manipulate, tends to have periods of regularity. Rather than jumping around randomly within memory, a program will often reference the same instructions over and over (in a loop), or the same data. The program might execute many instructions in a row, or walk a long list of data from one end to the other. The bulk of memory is cheap(er), but slower than the processor. What if we had a more expensive, smaller, but much faster memory in which we kept the instructions or data we reference most often? *This* pathway would be much faster, and the processor wouldn't have to wait (or, at least, not as long) if the instructions or data it needed next were in this smaller, faster memory.
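To make that concrete, here is a small, purely illustrative C sketch (the function names, the size `N`, and the `order` index array are all invented for the example). The first loop walks an array from one end to the other, so every byte of each cache line it fetches gets used before the next line is needed; the second does the same amount of work, but in a scattered order, which defeats that reuse:

```c
#include <stddef.h>

#define N (1 << 20)   /* 1M elements; assumed much larger than the cache */

/* Cache-friendly: touches memory strictly in order, one end to the other. */
long sum_sequential(const int *a)
{
    long total = 0;
    for (size_t i = 0; i < N; i++)
        total += a[i];
    return total;
}

/* Same arithmetic, but the visiting order is shuffled by an index array,
 * so consecutive accesses usually land on different cache lines. */
long sum_scattered(const int *a, const size_t *order)
{
    long total = 0;
    for (size_t i = 0; i < N; i++)
        total += a[order[i]];
    return total;
}
```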
So, that's what the processor and memory architects did. The cache(s) are generally several orders of magnitude smaller than the normal, physical memory; this is because they are expensive to design and implement, and because there's a diminishing-returns issue, too. There are a variety of details in the design of caches which influence how a program needs to behave to take best advantage of them, and these details tend to vary with the particular implementation (version) of the processor architecture. So, "one size fits all" is definitely not the way to go here, generally.
However, all of that said, the difference between "cache-friendly" and "cache-unfriendly" code is not whether the code in question will work properly, but how fast it will execute. All other things remaining equal, "cache-friendly" code will execute faster than "cache-unfriendly" code, simply because the processor will find the instructions and/or data it next needs in the cache(s) (the closer to the processor the better), avoiding the need to go all the way out to physical memory as much as possible.
So, how might one accomplish this, you ask?
Generally, you first need to know the organization of the cache(s) of the processor in question. How long is a cache 'line'? (I.e., when a cache gets instructions and/or data from memory, how big a chunk does it get each time?) Is there a separate instruction (I) cache and data (D) cache, or are they combined? Is there more than one level of cache? How big is each cache?
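On Linux with glibc you can even ask the C library for the line size at run time. A minimal sketch, assuming glibc's `_SC_LEVEL1_DCACHE_LINESIZE` extension (some systems simply don't report a value, so the result is checked):

```c
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* glibc-specific query for the L1 data-cache line size, in bytes. */
    long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);

    if (line > 0)
        printf("L1 data cache line size: %ld bytes\n", line);
    else
        printf("Cache line size not reported on this system\n");
    return 0;
}
```

On other systems you would consult the CPU documentation, or (on Linux) files such as /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size.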
Armed with this information, you can start making plans:
- If the cache lines are e.g. 64 bytes (which is pretty common these days), try to arrange instructions so that loops are small(er) and can fit into one-to-a-few cache lines.
- Start functions and loops at the start of a cache line. (If lines are 64 bytes long, the cache will load lines starting on 64-byte boundaries.)
- Try to keep the most frequently used variables (a) close together and (b) on one-to-a-few cache lines. (There's a structure-layout sketch just after this list.)
- Try to engineer algorithms that don't 'jump around' randomly, either in the code being executed or in the data being accessed. Sequential is best. (Some caches are designed to notice, e.g., that the processor commonly loads the next sequential line after the last one, such as when code 'walks' an array from one end to the other, and so they pre-load the next line even before the processor asks for it.) A traversal sketch follows the structure-layout sketch below.
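For the "keep hot variables together" point, one common trick in C is to group the frequently touched fields at the front of a structure and align them to a line boundary. This is only a sketch, assuming a 64-byte line; the structure and field names are invented for the example:

```c
#include <stdalign.h>
#include <stdint.h>

struct connection {
    /* Fields touched on every packet: kept together at the start of the
     * struct and aligned so they begin on a (assumed 64-byte) cache line. */
    alignas(64) uint64_t packets_in;
    uint64_t packets_out;
    uint32_t state;
    uint32_t flags;

    /* Rarely touched configuration data follows, so the hot fields above
     * don't have to share their cache line with it. */
    char     name[128];
    uint64_t created_at;
};
```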
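And for the "stay sequential" point: C stores two-dimensional arrays row by row, so the nesting order of the loops determines whether memory is touched sequentially or in large strides. Another sketch, with invented sizes:

```c
#include <stddef.h>

#define ROWS 1024
#define COLS 1024

/* Cache-friendly: inner loop walks along a row, i.e. through memory
 * sequentially, so each fetched cache line is fully used. */
long sum_row_major(int m[ROWS][COLS])
{
    long total = 0;
    for (size_t r = 0; r < ROWS; r++)
        for (size_t c = 0; c < COLS; c++)
            total += m[r][c];
    return total;
}

/* Cache-unfriendly: inner loop walks down a column, jumping a whole
 * row's worth of bytes on every access. */
long sum_column_major(int m[ROWS][COLS])
{
    long total = 0;
    for (size_t c = 0; c < COLS; c++)
        for (size_t r = 0; r < ROWS; r++)
            total += m[r][c];
    return total;
}
```

Both functions do exactly the same arithmetic, yet on a matrix much larger than the cache the row-major version will typically run several times faster, purely because of how it uses the cache.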