Age | Commit message | Author
private.
PiperOrigin-RevId: 305317717
PiperOrigin-RevId: 304653289
The motivation is that having source files in the repository root runs into a number of corner cases with copybara setups and with external CMake build systems. Enclosing all code in ruy/ avoids that, while making our setup much more similar to that of other related projects (TensorFlow, IREE).
PiperOrigin-RevId: 303448881
PiperOrigin-RevId: 300165710
PiperOrigin-RevId: 300149932
implemented. A mistake there would not be caught in matrix multiplication tests as it would be a performance-only bug (or even a memory-locality-only bug not necessarily affecting latencies).
PiperOrigin-RevId: 292199511
code to .cc files in separate libraries meant that defining the RUY_OPT_SET token in those targets no longer affected the internal code being compiled.
PiperOrigin-RevId: 291532431
a certain size threshold.
Renames cache_friendly_traversal_threshold to local_data_cache_size so it's more explicit about what it is in practice. Introduces shared_data_cache_size, needed to decide whether to use a Hilbert curve: a Hilbert curve is more expensive to decode and only worth it if it reduces DRAM accesses, which depends on shared_data_cache_size. Centralizes defaults in a new :cpu_cache_size library, and centralizes the reading of these defaults in Spec so that users can override them consistently by passing their own spec (either to provide more accurate/runtime values or for test coverage purposes).
On Pixel 4, this does not significantly affect latencies, apart from a 1%-2% improvement on 4 threads on very large matrix sizes.
The motivation for this is that it reduces DRAM accesses: the PMU observes typically a 10% reduction, up to 20%, of 'L3 data cache refill' events on very large matrix multiplications (1000x1000 and above). DRAM accesses should be an increasing function of that, perhaps even more or less proportional to it, so this indicates that this change will significantly reduce DRAM accesses and thus power usage. This was observed consistently on all 2x2=4 combinations of {1, 4} threads on {little, big} cores on Pixel 4.
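The cache-size-driven decision described above can be sketched as follows. This is a hypothetical simplification written for illustration: UseHilbertCurve and the byte-size working-set heuristic are not ruy's actual code, only shared_data_cache_size is a name from this commit.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical simplification: only fall back to the more expensive
// Hilbert-curve block traversal when the working set of the matrix
// multiplication overflows the shared (e.g. L3) data cache, i.e. when
// reducing DRAM accesses can actually pay for the higher decode cost.
bool UseHilbertCurve(std::int64_t lhs_bytes, std::int64_t rhs_bytes,
                     std::int64_t dst_bytes,
                     std::int64_t shared_data_cache_size) {
  const std::int64_t working_set_bytes = lhs_bytes + rhs_bytes + dst_bytes;
  return working_set_bytes > shared_data_cache_size;
}
```

For a 1000x1000 float multiplication (~12 MB of matrix data) against a 4 MiB shared cache this returns true, matching the "very large matrices" regime where the improvement was measured.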
PiperOrigin-RevId: 291531754
Introduce a RUY_PREFETCH_STORE optset, separate from RUY_PREFETCH_LOAD. Unlike RUY_PREFETCH_LOAD which is detrimental in some contexts (matmul-only benchmarking), RUY_PREFETCH_STORE seems never detrimental, so we need to control it separately.
This is a substantial speedup when the destination matrix stride is close to a multiple of the L1 cache aliasing periodicity. For example, on a typical ARM CPU the periodicity is 1024 bytes, so for float matrices this happens whenever the destination matrix has a number of rows close to 1024/sizeof(float) = 256. The impact varies gradually as one gets closer to such values; typically, it is large when one is within 64 bytes (one cache line) of the nearest such value.
For shallow shapes (small depth, e.g. depth=16 as is common in MobileNet v3) this can be a 2x speedup, as there is not much arithmetic to amortize the high cost of these writes.
As far as I understand, what we really want here is a non-temporal store instruction such as STNP, but I'm not aware of a NEON non-temporal store. Using a prefetch-stream instruction just before a regular store conveys a strictly weaker hint (we are not conveying that the order of observation does not matter) but already seems effective at avoiding cache aliasing issues.
I also experimented with 'pstl3strm' instead of 'pstl1strm' but that wasn't better.
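The aliasing condition above can be checked numerically. The 1024-byte periodicity and the 64-byte cache-line radius are the figures from this message; DistanceToAliasingPeriod is a hypothetical helper, not ruy code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Distance, in bytes, from a destination column stride to the nearest
// multiple of the L1 cache aliasing periodicity (1024 bytes on the ARM
// CPUs discussed here). Store-prefetching matters most when this distance
// is within one cache line (64 bytes) of zero.
std::int64_t DistanceToAliasingPeriod(std::int64_t stride_bytes) {
  const std::int64_t period = 1024;
  const std::int64_t r = stride_bytes % period;
  return std::min(r, period - r);
}
```

For a float destination with 256 rows the stride is 256 * 4 = 1024 bytes, distance 0: the worst case. At 250 rows the distance is 24 bytes, still inside the 64-byte radius; at 200 rows it is 224 bytes and aliasing is not a concern.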
PiperOrigin-RevId: 291440490
That was the last gemmlowp dependency in ruy.
PiperOrigin-RevId: 289684215
Introduce the 'ruy profiler', a more modern descendant of it: pure C++11, correct, tested (including under TSan), with more useful features such as formatted parametrized labels, better reporting of multi-thread profiles, a treeview-manipulation API, more documentation, and better accuracy.
Port ruy to the ruy profiler (TFLite should follow). Add per-GEMM-shape profiling labels, now very easy thanks to formatted parametrized labels; previously this was too cumbersome to be worth submitting, so we had to keep unsubmitted patches for that common profiling need.
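A minimal illustration of what a formatted parametrized label buys: one instrumentation site yields a distinct label per GEMM shape. The snprintf-based FormatLabel below is a hypothetical stand-in, not the ruy profiler's API, which additionally records scopes per thread and builds treeviews.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical stand-in for a formatted profiling label: the shape
// parameters are baked into the label text, so per-GEMM-shape profiles
// fall out of a single instrumentation site.
std::string FormatLabel(int rows, int depth, int cols) {
  char buf[64];
  std::snprintf(buf, sizeof(buf), "Mul shape %dx%dx%d", rows, depth, cols);
  return std::string(buf);
}
```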
PiperOrigin-RevId: 289680118
PiperOrigin-RevId: 289116846
PiperOrigin-RevId: 283795957
PiperOrigin-RevId: 282660302
PiperOrigin-RevId: 280686463
- protected by TFLITE_WITH_RUY_GEMV
PiperOrigin-RevId: 278873724
default policy is off.
PiperOrigin-RevId: 277722288
PiperOrigin-RevId: 277280732
This change applies "-O3" to all optimized targets, including on Linux.
For debug builds ("-c dbg"), the additional flag "--copt=-O0" is no longer required.
PiperOrigin-RevId: 275760820
- Drop unwanted dependency on TFLite macros (prereq ahead of future move out of tflite).
- There doesn't seem to be a compelling flavor of Google logging macros that we could use here without adding a large dependency and/or a large increase in binary size. We only need the most basic assertion functionality; this implementation achieves minimal binary-size impact by using only snprintf, fprintf and abort.
- Still has decent logging of compared values, and supports C++11 enum classes for now by logging their numerical values (this will be possible to improve when C++ reflection becomes available).
Also bump the threshold for the found_distinct_values check, which was flaky (it could give false negatives depending on pseudorandom values).
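The minimal-assertion approach described above (snprintf/fprintf/abort only, enum classes logged as numbers) might look roughly like this sketch. RUY_CHECK_EQ here is a simplified illustration written for this note, not the actual ruy implementation; note it evaluates its arguments twice, which the real code would avoid.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>

// Simplified sketch: log the two compared values numerically (which also
// covers C++11 enum classes, via the static_cast) and abort. Only fprintf
// and abort are used, keeping the binary-size impact minimal.
#define RUY_CHECK_EQ(a, b)                                                   \
  do {                                                                       \
    if (!((a) == (b))) {                                                     \
      std::fprintf(stderr, "%s:%d: check failed: %s == %s (%lld vs %lld)\n", \
                   __FILE__, __LINE__, #a, #b,                               \
                   static_cast<long long>(a), static_cast<long long>(b));    \
      std::abort();                                                          \
    }                                                                        \
  } while (0)
```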
PiperOrigin-RevId: 271631262
explicitly empirically derived.
These values have been obtained on a Qualcomm S855 device (arm64). This would need to be tuned differently to achieve optimal performance on other hardware.
PiperOrigin-RevId: 270268717
PiperOrigin-RevId: 268953192
PiperOrigin-RevId: 265455751
Also edit a comment in ruy/BUILD about debugging.
PiperOrigin-RevId: 265083281
PiperOrigin-RevId: 265063615
PiperOrigin-RevId: 264939162
PiperOrigin-RevId: 264920884
PiperOrigin-RevId: 264915833
PiperOrigin-RevId: 264913505
PiperOrigin-RevId: 264911552
PiperOrigin-RevId: 264690662
This amounts to disabling ruy paths when the cpuid instruction's results lack the selected features.
PiperOrigin-RevId: 264683681
in size_util.
Part of it was AllocateFast not checking if ptr_ is null
before using it (null deref with offset, so didn't look like a null deref).
Part of it was using round_up_pot with a large size_t value that got implicitly cast to int, as round_up_pot took an int argument. This showed it's safer to templatize the helpers in size_util.h so they accept either int32 or int64 (guarded in floor_log2, the only one of these functions that cares).
I just changed the allocator to use only signed types (std::size_t --> std::ptrdiff_t) because I didn't want to deal with the extra complexity of handling both signed and unsigned in size_util.
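The templated, signed-only helpers described above could look like the following sketch; these are illustrative reimplementations, not the actual size_util.h.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Sketch of size helpers templatized on any signed integer type
// (std::int32_t, std::int64_t, std::ptrdiff_t, ...), so no implicit
// narrowing to int can occur at the call site. floor_log2 is the only
// one of these functions that needs to guard its input.
template <typename Integer>
Integer floor_log2(Integer n) {
  static_assert(std::is_signed<Integer>::value, "signed types only");
  assert(n > 0);
  Integer log = 0;
  while (n >= 2) {
    n >>= 1;
    ++log;
  }
  return log;
}

template <typename Integer>
Integer ceil_log2(Integer n) {
  assert(n > 0);
  return n == 1 ? 0 : floor_log2(n - 1) + 1;
}

// Rounds up to the nearest power of two, in the argument's own type.
template <typename Integer>
Integer round_up_pot(Integer n) {
  return Integer(1) << ceil_log2(n);
}
```

With the original `int round_up_pot(int)`, a large std::size_t argument was silently truncated; here `round_up_pot(std::int64_t(...))` stays 64-bit throughout.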
PiperOrigin-RevId: 264668215
PiperOrigin-RevId: 264602338
PiperOrigin-RevId: 263201086
PiperOrigin-RevId: 262999863
See the comment.
PiperOrigin-RevId: 262564280
PiperOrigin-RevId: 262217958
PiperOrigin-RevId: 261951453
PiperOrigin-RevId: 260750932
The implementation so far was prematurely optimized. It had all threads record directly into a shared vector indexed by block_ids. The idea was (1) to avoid the overhead of locking or other synchronization primitives when tracing a multi-thread execution, and (2) to avoid overhead of growing heap buffers. The new implementation is much more straightforward, as is most evident from the fact that it doesn't use relaxed_atomic_store anymore (yet still runs free of TSan errors), and that we were able to remove the ProcessedTrace class.
The above-mentioned issues (1) and (2) that drove the earlier design are now addressed as follows in the new design: (1) Each thread now records to its own specific vector of trace entries; these thread-specific vectors are only coalesced into a global vector when dumping a trace. This removed the need for any locking or atomic operations. (2) We are less careful than before about avoiding heap allocations. We just reserve upfront a rather large buffer size, large enough to avoid most subsequent heap reallocations and small enough to still not matter in practical tracing situations.
The proximate motivation for this change is that the existing design, requiring indexing of trace entries by block_id, is now inconvenient as we need to experiment with TrMul implementation changes where packing is not necessarily directly associated with a block_id anymore.
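A minimal sketch of the new scheme, with per-thread entry vectors coalesced only when dumping; the names and the TraceEntry payload are hypothetical simplifications of the actual trace code.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <mutex>
#include <thread>
#include <vector>

struct TraceEntry {
  int thread_id;
  std::int64_t time_point;
};

// Each thread records into its own vector, so the recording fast path
// needs no locking or atomics; the mutex is only taken at thread start
// and when the trace is dumped.
class Trace {
 public:
  std::vector<TraceEntry>* StartThread() {
    std::lock_guard<std::mutex> lock(mutex_);
    per_thread_entries_.emplace_back(new std::vector<TraceEntry>);
    std::vector<TraceEntry>* entries = per_thread_entries_.back().get();
    // Reserve a buffer large enough to avoid most subsequent heap
    // reallocations, small enough to not matter in practice.
    entries->reserve(16384);
    return entries;
  }

  // Coalesces all thread-specific vectors into one global vector.
  std::vector<TraceEntry> Dump() {
    std::lock_guard<std::mutex> lock(mutex_);
    std::vector<TraceEntry> all;
    for (const auto& entries : per_thread_entries_) {
      all.insert(all.end(), entries->begin(), entries->end());
    }
    return all;
  }

 private:
  std::mutex mutex_;
  std::vector<std::unique_ptr<std::vector<TraceEntry>>> per_thread_entries_;
};
```

Because each worker writes only to its own vector, this runs clean under TSan without any relaxed_atomic_store, mirroring the claim above.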
PiperOrigin-RevId: 259996147
ruy code taking advantage of LHS<->RHS code symmetry to remove
some redundancy.
The key motivation was that I want to experiment with some nontrivial
changes to how TrMulTask handles the packing of blocks, and I didn't want
to have to maintain two copies of this nontrivial code. With this change,
this code is now in an EnsurePacked method, which is all I'll have to edit.
PiperOrigin-RevId: 259980220
PiperOrigin-RevId: 259809541
instead of the /proc/cpuinfo code we're using currently.
PiperOrigin-RevId: 258491846
#defines - they were tested directly by #ifdef, and were being defined
by path.h. As tune.cc did not #include path.h, it did not enable its
platform-specific tuning code, resulting in a performance regression
in cases relying on tuning for maximal performance, i.e. in-order ARM cores.
To prevent that from happening again, this moves the platform defines
to a new platform.h and forces users to use a RUY_PLATFORM(X) function
macro, so that if they fail to #include platform.h, they get a compilation
error.
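The forcing mechanism works because a function-style macro, unlike a plain identifier in `#ifdef`, cannot be silently satisfied when its defining header is missing: an undefined function-style macro is a compile error in both `#if` expressions and ordinary code. A sketch of the pattern, with illustrative names rather than the real ones from platform.h:

```cpp
#include <cassert>

// In (a sketch of) platform.h: detection results live in tokens that are
// not meant to be tested directly...
#if defined(__aarch64__) || defined(__arm__)
#define SKETCH_DONOTUSEDIRECTLY_ARM 1
#else
#define SKETCH_DONOTUSEDIRECTLY_ARM 0
#endif

// ...and are only queried through a function-style macro. A file that
// forgets to include this header gets a compilation error from
// SKETCH_PLATFORM(ARM), instead of silently compiling with tuning off.
#define SKETCH_PLATFORM(X) ((SKETCH_DONOTUSEDIRECTLY_##X) != 0)
```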
PiperOrigin-RevId: 258372624
PiperOrigin-RevId: 257833458
PiperOrigin-RevId: 256200412
PiperOrigin-RevId: 256189495
PiperOrigin-RevId: 256057185
PiperOrigin-RevId: 256048263