Prefetching in a Loop
Memory latency completely hidden
Notes:
When the stride does not change from one iteration to the next, we can issue a prefetch. This is shown in our loop here: the memory instruction references the current address, and at the same time a prefetch is issued for the address plus the stride. The hardware issues this prefetch automatically, but it first checks whether the data is already in the cache. If it is, the prefetch is simply discarded. If our prediction is correct, we should get a cache hit on the next iteration of the loop.
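To make the mechanism concrete, here is a minimal sketch in C that models a constant-stride loop with a toy direct-mapped cache. Everything in it is an assumption made for illustration: the names (ToyCache, count_demand_misses), the cache geometry (64 lines of 64 bytes), and above all the simplification that a prefetched line arrives before the next iteration starts. It is not a model of any particular processor.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_LINES  64u  /* hypothetical cache size, in lines */
#define LINE_BYTES 64u  /* hypothetical cache line size, in bytes */

/* Toy direct-mapped cache: one tag per line, no timing. */
typedef struct {
    bool     valid[NUM_LINES];
    uint64_t tag[NUM_LINES];
} ToyCache;

static bool cache_contains(const ToyCache *c, uint64_t addr) {
    uint64_t line = addr / LINE_BYTES;
    size_t idx = (size_t)(line % NUM_LINES);
    return c->valid[idx] && c->tag[idx] == line;
}

static void cache_fill(ToyCache *c, uint64_t addr) {
    uint64_t line = addr / LINE_BYTES;
    size_t idx = (size_t)(line % NUM_LINES);
    c->valid[idx] = true;
    c->tag[idx] = line;
}

/* Walk `iterations` addresses starting at `base` with a constant stride
 * and return the number of demand misses. With prefetching enabled, each
 * access also requests address + stride, as described in the notes. */
unsigned count_demand_misses(uint64_t base, uint64_t stride,
                             unsigned iterations, bool prefetch_enabled) {
    assert(stride != 0);        /* a zero stride gives nothing to predict */
    ToyCache cache = {0};
    unsigned misses = 0;

    for (unsigned i = 0; i < iterations; i++) {
        uint64_t addr = base + (uint64_t)i * stride;

        /* Demand access to the current address. */
        if (!cache_contains(&cache, addr)) {
            misses++;
            cache_fill(&cache, addr);
        }

        /* Prefetch for address + stride. If the line is already cached,
         * the prefetch is discarded. Assumption: the fill completes
         * before the next iteration begins. */
        if (prefetch_enabled) {
            uint64_t next = addr + stride;
            if (!cache_contains(&cache, next)) {
                cache_fill(&cache, next);
            }
        }
    }
    return misses;
}
```

Under these assumptions, calling count_demand_misses(0, 256, 100, true) reports a single miss (the very first access), while the same call with prefetching disabled misses on every one of the 100 accesses, because each access touches a new cache line.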
Let us not forget the timing of the prefetch, though. The memory latency is hidden only if one loop iteration takes at least as long as the memory latency. If it does not, the prefetched data will not have arrived by the time the next iteration needs it, and we will still incur a miss.
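The timing condition can be written down as a small calculation. The sketch below assumes a deliberately simple model: the prefetch is issued at the start of an iteration, the data is needed at the start of the next one, and both durations are measured in cycles. The function names and the cycle counts used in the example are hypothetical.

```c
#include <stdbool.h>

/* True when one loop iteration is long enough to cover the memory latency. */
bool latency_hidden(unsigned iteration_cycles, unsigned memory_latency_cycles) {
    return iteration_cycles >= memory_latency_cycles;
}

/* Cycles the next access still has to wait for the prefetched data.
 * Zero means the latency is completely hidden. */
unsigned residual_stall_cycles(unsigned iteration_cycles,
                               unsigned memory_latency_cycles) {
    if (latency_hidden(iteration_cycles, memory_latency_cycles)) {
        return 0u;
    }
    return memory_latency_cycles - iteration_cycles;
}
```

For example, with a 200-cycle memory latency, a 250-cycle loop body gives residual_stall_cycles(250, 200) equal to 0, whereas a 50-cycle loop body gives residual_stall_cycles(50, 200) equal to 150: the prefetch was issued, but too late to hide the full latency.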