The old implementation strided packed matrices to prevent nearby columns from aliasing each other. However, the columns concurrently consumed by the kernel are already shuffled into contiguous blocks, so that striding was pointless. The best rationale I can reconstruct is that this is a relic from early ruy code, when it wasn't yet clear that we would retain packing code at all. We have since decided to retain packing code for a variety of other reasons, in particular to allow kernels to dictate shuffled layouts, which can be required by the SIMD instruction set (e.g. ARM NEON dotprod).
I tried simply removing RUY_OPT(AVOID_ALIASING). That is an improvement in a majority of cases, but it is a 15% regression on Pixel4 little cores on NxNxN int8 matmuls when N is a multiple of 2048. PMU measurements show that the regression is explained by a 4x increase in the L1D refill rate, and moreover that the performance difference goes away when caching pre-packed matrices. My interpretation is that the source and packed matrix data alias each other, making the packing code thrash L1. This interpretation is confirmed by the effectiveness of the following solution, implemented in this CL:
We extend the Allocator interface to allow specifying a "pointer to avoid aliasing with". When we allocate the packed buffer, we pass the source matrix buffer as the buffer to avoid aliasing with. The only difference is thus in the address of the packed buffer, and it has the following impact on little-core latencies:
Pixel4, 1 little core, kNeonDotprod
size (NxNxN) | Gop/s before | Gop/s without AVOID_ALIASING | Gop/s after
-------------+--------------+------------------------------+-------------
        1024 |        40.39 |                         41.1 |       41.17
        2048 |        41.81 |                        36.45 |       42.33
        3072 |        42.21 |                        43.33 |       42.94
        4096 |        42.45 |                        36.71 |       42.72
The other measurable benefit is in PMU metrics, particularly cache refill rates. The old AVOID_ALIASING regressed the L1D refill rate by 1.2x to 1.5x; this is fixed by the new AVOID_ALIASING. There were also consistent but more moderate regressions in the refill rates of other cache levels with the old AVOID_ALIASING, also fixed by this new implementation.
Of course, the other benefit is that the new AVOID_ALIASING is much better understood heuristically: it is now explicitly about preventing the source and packed matrices from aliasing each other.
PiperOrigin-RevId: 319904379