The old implementation strided packed matrices to prevent nearby columns from aliasing each other. However, the columns concurrently consumed by the kernel are already shuffled into contiguous blocks, so that striding was pointless. The best rationale I can reconstruct is that this is a relic from early ruy code, when it wasn't yet clear that we would retain packing code at all. We have since decided to retain packing code for a variety of other reasons, in particular to allow kernels to dictate shuffled layouts, which can be required by the SIMD instruction set (e.g. ARM NEON dotprod).
I tried simply removing RUY_OPT(AVOID_ALIASING). That is an improvement in a majority of cases, but it is a 15% regression on Pixel4 little cores on NxNxN int8 matmuls when N is a multiple of 2048. PMU measurements show that the regression is explained by a 4x increase in the L1D refill rate, and moreover that the performance difference goes away when caching pre-packed matrices. My interpretation is that the source and packed matrix data alias each other, making the packing code thrash L1. This interpretation is confirmed by the effectiveness of the following solution, implemented in this CL:
We extend the Allocator interface to allow specifying a "pointer to avoid aliasing with". When we allocate the packed buffer, we pass the source matrix buffer as the buffer to avoid aliasing with. The only difference is thus in the address of the packed buffer, and it has the following impact on little-core latencies:
Pixel4, 1 little core, kNeonDotprod
size (NxNxN) | Gop/s before | Gop/s without AVOID_ALIASING | Gop/s after
-------------+--------------+------------------------------+-------------
        1024 |        40.39 |                         41.1 |       41.17
        2048 |        41.81 |                        36.45 |       42.33
        3072 |        42.21 |                        43.33 |       42.94
        4096 |        42.45 |                        36.71 |       42.72
The other measurable benefit is in PMU metrics, particularly cache refill rates. The old AVOID_ALIASING regressed the L1D refill rate by 1.2x to 1.5x; this is fixed by the new AVOID_ALIASING. There were also consistent but more moderate regressions in the refill rates of other cache levels with the old AVOID_ALIASING, also fixed by this new implementation.
Of course, the other benefit is that the new AVOID_ALIASING is much better understood heuristically: it is now explicitly about preventing the source and packed matrices from aliasing each other.
PiperOrigin-RevId: 319904379