Age | Commit message | Author
private.
PiperOrigin-RevId: 305317717
PiperOrigin-RevId: 304653289
The motivation is that having source files in the repository root runs into a number of corner cases with copybara setups and with external CMake build systems. Enclosing all code in ruy/ avoids that, while making our setup much more similar to that of other related projects (TensorFlow, IREE).
PiperOrigin-RevId: 303448881
PiperOrigin-RevId: 300165710
PiperOrigin-RevId: 300149932
implemented. A mistake there would not be caught in matrix multiplication tests as it would be a performance-only bug (or even a memory-locality-only bug not necessarily affecting latencies).
PiperOrigin-RevId: 292199511
code to .cc files in separate libraries meant that defining the RUY_OPT_SET token in those targets no longer affected the internal code being compiled.
PiperOrigin-RevId: 291532431
a certain size threshold.
Renames cache_friendly_traversal_threshold to local_data_cache_size so it's more explicit about what it is in practice. Introduces shared_data_cache_size, needed to decide whether to use a Hilbert curve: a Hilbert curve is more expensive to decode and only worth it if it reduces DRAM accesses, which depends on shared_data_cache_size. Centralizes defaults in a new :cpu_cache_size library, and centralizes the reading of these defaults in Spec so that users can override them consistently by passing their own spec (either to provide more accurate/runtime values or for test coverage purposes).
On Pixel 4, this does not significantly affect latencies, apart from a 1%-2% improvement on 4 threads on very large matrix sizes.
The motivation for this is that it reduces DRAM accesses: the PMU observes typically a 10% reduction, up to 20%, of 'L3 data cache refill' events on very large matrix multiplications (1000x1000 and above). DRAM accesses should be an increasing function of that, perhaps even more or less proportional to it, so this indicates that this change will significantly reduce DRAM accesses and thus power usage. This was observed consistently on all 2x2=4 combinations of {1, 4} threads on {little, big} cores on Pixel 4.
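The cache-size-driven decision described above can be sketched as follows. This is a hypothetical simplification written for illustration: UseHilbertCurve and the byte-size working-set heuristic are not ruy's actual code, only shared_data_cache_size is a name from this commit.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical simplification: only fall back to the more expensive
// Hilbert-curve block traversal when the working set of the matrix
// multiplication overflows the shared (e.g. L3) data cache, i.e. when
// reducing DRAM accesses can actually pay for the higher decode cost.
bool UseHilbertCurve(std::int64_t lhs_bytes, std::int64_t rhs_bytes,
                     std::int64_t dst_bytes,
                     std::int64_t shared_data_cache_size) {
  const std::int64_t working_set_bytes = lhs_bytes + rhs_bytes + dst_bytes;
  return working_set_bytes > shared_data_cache_size;
}
```

For a 1000x1000 float multiplication (~12 MB of matrix data) against a 4 MiB shared cache this returns true, matching the "very large matrices" regime where the improvement was measured.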
PiperOrigin-RevId: 291531754
Introduce a RUY_PREFETCH_STORE optset, separate from RUY_PREFETCH_LOAD. Unlike RUY_PREFETCH_LOAD which is detrimental in some contexts (matmul-only benchmarking), RUY_PREFETCH_STORE seems never detrimental, so we need to control it separately.
This is a substantial speedup when the destination matrix stride is close to a multiple of the L1 cache aliasing periodicity. For example, on a typical ARM CPU the periodicity is 1024 bytes, so for float matrices this happens whenever the destination matrix has a number of rows close to 1024/sizeof(float) = 256. The impact varies gradually as one gets closer to such values; typically, it is large when one is within 64 bytes (one cache line) of the nearest such value.
For shallow shapes (small depth, e.g. depth=16 as is common in MobileNet v3) this can be a 2x speedup, as there is not much arithmetic to amortize the high cost of these writes.
As far as I understand, what we really want here is a non-temporal store instruction such as STNP, but I'm not aware of a NEON non-temporal store. Using a prefetch-stream instruction just before a regular store conveys a strictly weaker hint (we are not conveying that the order of observation does not matter) but already seems effective at avoiding cache aliasing issues.
I also experimented with 'pstl3strm' instead of 'pstl1strm' but that wasn't better.
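The aliasing condition above can be checked numerically. The 1024-byte periodicity and the 64-byte cache-line radius are the figures from this message; DistanceToAliasingPeriod is a hypothetical helper, not ruy code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Distance, in bytes, from a destination column stride to the nearest
// multiple of the L1 cache aliasing periodicity (1024 bytes on the ARM
// CPUs discussed here). Store-prefetching matters most when this distance
// is within one cache line (64 bytes) of zero.
std::int64_t DistanceToAliasingPeriod(std::int64_t stride_bytes) {
  const std::int64_t period = 1024;
  const std::int64_t r = stride_bytes % period;
  return std::min(r, period - r);
}
```

For a float destination with 256 rows the stride is 256 * 4 = 1024 bytes, distance 0: the worst case. At 250 rows the distance is 24 bytes, still inside the 64-byte radius; at 200 rows it is 224 bytes and aliasing is not a concern.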
PiperOrigin-RevId: 291440490
That was the last gemmlowp dependency in ruy.
PiperOrigin-RevId: 289684215
Introduce the 'ruy profiler', a more modern descendant of it: pure C++11, correct, tested (including under TSan), with more useful features such as formatted parametrized labels, better reporting of multi-thread profiles, a treeview-manipulation API, more documentation, and better accuracy.
Port ruy to the ruy profiler (TFLite should follow). Add per-GEMM-shape profiling labels, now very easy thanks to formatted parametrized labels; previously this was too cumbersome to be worth submitting, so we had to keep unsubmitted patches for that common profiling need.
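A minimal illustration of what a formatted parametrized label buys: one instrumentation site yields a distinct label per GEMM shape. The snprintf-based FormatLabel below is a hypothetical stand-in, not the ruy profiler's API, which additionally records scopes per thread and builds treeviews.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical stand-in for a formatted profiling label: the shape
// parameters are baked into the label text, so per-GEMM-shape profiles
// fall out of a single instrumentation site.
std::string FormatLabel(int rows, int depth, int cols) {
  char buf[64];
  std::snprintf(buf, sizeof(buf), "Mul shape %dx%dx%d", rows, depth, cols);
  return std::string(buf);
}
```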
PiperOrigin-RevId: 289680118
PiperOrigin-RevId: 289116846
PiperOrigin-RevId: 283795957
PiperOrigin-RevId: 282660302
PiperOrigin-RevId: 280686463
- protected by TFLITE_WITH_RUY_GEMV
PiperOrigin-RevId: 278873724
default policy is off.
PiperOrigin-RevId: 277722288
PiperOrigin-RevId: 277280732
This change applies "-O3" to all optimized targets, including on Linux.
For debug builds ("-c dbg"), the additional flag "--copt=-O0" is no longer required.
PiperOrigin-RevId: 275760820
- Drop unwanted dependency on TFLite macros (prereq ahead of future move out of tflite).
- There doesn't seem to be a compelling flavor of Google logging macros that we could use here without adding a large dependency and/or a large increase in binary size. We only need the most basic assertion functionality; this implementation achieves minimal binary-size impact by using only snprintf, fprintf and abort.
- Still has decent logging of compared values, and supports C++11 enum classes for now by logging their numerical values (this will be possible to improve when C++ reflection becomes available).
Also bump the threshold for the found_distinct_values check, which was flaky (it could give false negatives depending on pseudorandom values).
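The minimal-assertion approach described above (snprintf/fprintf/abort only, enum classes logged as numbers) might look roughly like this sketch. RUY_CHECK_EQ here is a simplified illustration written for this note, not the actual ruy implementation; note it evaluates its arguments twice, which the real code would avoid.

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>

// Simplified sketch: log the two compared values numerically (which also
// covers C++11 enum classes, via the static_cast) and abort. Only fprintf
// and abort are used, keeping the binary-size impact minimal.
#define RUY_CHECK_EQ(a, b)                                                   \
  do {                                                                       \
    if (!((a) == (b))) {                                                     \
      std::fprintf(stderr, "%s:%d: check failed: %s == %s (%lld vs %lld)\n", \
                   __FILE__, __LINE__, #a, #b,                               \
                   static_cast<long long>(a), static_cast<long long>(b));    \
      std::abort();                                                          \
    }                                                                        \
  } while (0)
```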
PiperOrigin-RevId: 271631262
explicitly empirically derived.
These values have been obtained on a Qualcomm S855 device (arm64). This would need to be tuned differently to achieve optimal performance on other hardware.
PiperOrigin-RevId: 270268717
PiperOrigin-RevId: 268953192
PiperOrigin-RevId: 265455751
Also edit a comment in ruy/BUILD about debugging.
PiperOrigin-RevId: 265083281
PiperOrigin-RevId: 265063615
PiperOrigin-RevId: 264939162
PiperOrigin-RevId: 264920884
PiperOrigin-RevId: 264915833
PiperOrigin-RevId: 264913505
PiperOrigin-RevId: 264911552
PiperOrigin-RevId: 264690662
This amounts to disabling ruy paths when the cpuid instruction's results lack the selected features.
PiperOrigin-RevId: 264683681
in size_util.
Part of it was AllocateFast not checking if ptr_ is null
before using it (null deref with offset, so didn't look like a null deref).
Part of it was using round_up_pot with a large size_t value that got implicitly cast to int, as round_up_pot took an int argument. This showed it's safer to templatize the helpers in size_util.h so they accept either int32 or int64 (guarded in floor_log2, the only one of these functions that cares).
I just changed the allocator to use only signed types (std::size_t --> std::ptrdiff_t) because I didn't want to deal with the extra complexity of handling both signed and unsigned in size_util.
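The templated, signed-only helpers described above could look like the following sketch; these are illustrative reimplementations, not the actual size_util.h.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Sketch of size helpers templatized on any signed integer type
// (std::int32_t, std::int64_t, std::ptrdiff_t, ...), so no implicit
// narrowing to int can occur at the call site. floor_log2 is the only
// one of these functions that needs to guard its input.
template <typename Integer>
Integer floor_log2(Integer n) {
  static_assert(std::is_signed<Integer>::value, "signed types only");
  assert(n > 0);
  Integer log = 0;
  while (n >= 2) {
    n >>= 1;
    ++log;
  }
  return log;
}

template <typename Integer>
Integer ceil_log2(Integer n) {
  assert(n > 0);
  return n == 1 ? 0 : floor_log2(n - 1) + 1;
}

// Rounds up to the nearest power of two, in the argument's own type.
template <typename Integer>
Integer round_up_pot(Integer n) {
  return Integer(1) << ceil_log2(n);
}
```

With the original `int round_up_pot(int)`, a large std::size_t argument was silently truncated; here `round_up_pot(std::int64_t(...))` stays 64-bit throughout.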
PiperOrigin-RevId: 264668215
PiperOrigin-RevId: 264602338
PiperOrigin-RevId: 263201086
PiperOrigin-RevId: 262999863
See the comment.
PiperOrigin-RevId: 262564280
PiperOrigin-RevId: 262217958
PiperOrigin-RevId: 261951453
PiperOrigin-RevId: 260750932
The implementation so far was prematurely optimized. It had all threads record directly into a shared vector indexed by block_ids. The idea was (1) to avoid the overhead of locking or other synchronization primitives when tracing a multi-thread execution, and (2) to avoid overhead of growing heap buffers. The new implementation is much more straightforward, as is most evident from the fact that it doesn't use relaxed_atomic_store anymore (yet still runs free of TSan errors), and that we were able to remove the ProcessedTrace class.
The above-mentioned issues (1) and (2) that drove the earlier design are now addressed as follows in the new design: (1) Each thread now records to its own specific vector of trace entries; these thread-specific vectors are only coalesced into a global vector when dumping a trace. This removed the need for any locking or atomic operations. (2) We are less careful than before about avoiding heap allocations. We just reserve upfront a rather large buffer size, large enough to avoid most subsequent heap reallocations and small enough to still not matter in practical tracing situations.
The proximate motivation for this change is that the existing design, requiring indexing of trace entries by block_id, is now inconvenient as we need to experiment with TrMul implementation changes where packing is not necessarily directly associated with a block_id anymore.
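A minimal sketch of the new scheme, with per-thread entry vectors coalesced only when dumping; the names and the TraceEntry payload are hypothetical simplifications of the actual trace code.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <mutex>
#include <thread>
#include <vector>

struct TraceEntry {
  int thread_id;
  std::int64_t time_point;
};

// Each thread records into its own vector, so the recording fast path
// needs no locking or atomics; the mutex is only taken at thread start
// and when the trace is dumped.
class Trace {
 public:
  std::vector<TraceEntry>* StartThread() {
    std::lock_guard<std::mutex> lock(mutex_);
    per_thread_entries_.emplace_back(new std::vector<TraceEntry>);
    std::vector<TraceEntry>* entries = per_thread_entries_.back().get();
    // Reserve a buffer large enough to avoid most subsequent heap
    // reallocations, small enough to not matter in practice.
    entries->reserve(16384);
    return entries;
  }

  // Coalesces all thread-specific vectors into one global vector.
  std::vector<TraceEntry> Dump() {
    std::lock_guard<std::mutex> lock(mutex_);
    std::vector<TraceEntry> all;
    for (const auto& entries : per_thread_entries_) {
      all.insert(all.end(), entries->begin(), entries->end());
    }
    return all;
  }

 private:
  std::mutex mutex_;
  std::vector<std::unique_ptr<std::vector<TraceEntry>>> per_thread_entries_;
};
```

Because each worker writes only to its own vector, this runs clean under TSan without any relaxed_atomic_store, mirroring the claim above.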
PiperOrigin-RevId: 259996147
ruy code taking advantage of LHS<->RHS code symmetry to remove
some redundancy.
The key motivation was that I want to experiment with some nontrivial
changes to how TrMulTask handles the packing of blocks, and I didn't want
to have to maintain two copies of this nontrivial code. With this change,
this code is now in an EnsurePacked method, which is all I'll have to edit.
PiperOrigin-RevId: 259980220
PiperOrigin-RevId: 259809541
instead of the /proc/cpuinfo code we're using currently.
PiperOrigin-RevId: 258491846
#defines - they were tested directly by #ifdef, and were being defined
by path.h. As tune.cc did not #include path.h, it did not enable its
platform-specific tuning code, resulting in a performance regression
in cases relying on tuning for maximal performance, i.e. in-order ARM cores.
To prevent that from happening again, this moves the platform defines
to a new platform.h and forces users to use a RUY_PLATFORM(X) function
macro, so that if they fail to #include platform.h, they get a compilation
error.
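The forcing mechanism works because a function-style macro, unlike a plain identifier in `#ifdef`, cannot be silently satisfied when its defining header is missing: an undefined function-style macro is a compile error in both `#if` expressions and ordinary code. A sketch of the pattern, with illustrative names rather than the real ones from platform.h:

```cpp
#include <cassert>

// In (a sketch of) platform.h: detection results live in tokens that are
// not meant to be tested directly...
#if defined(__aarch64__) || defined(__arm__)
#define SKETCH_DONOTUSEDIRECTLY_ARM 1
#else
#define SKETCH_DONOTUSEDIRECTLY_ARM 0
#endif

// ...and are only queried through a function-style macro. A file that
// forgets to include this header gets a compilation error from
// SKETCH_PLATFORM(ARM), instead of silently compiling with tuning off.
#define SKETCH_PLATFORM(X) ((SKETCH_DONOTUSEDIRECTLY_##X) != 0)
```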
PiperOrigin-RevId: 258372624
PiperOrigin-RevId: 257833458
PiperOrigin-RevId: 256200412
PiperOrigin-RevId: 256189495
PiperOrigin-RevId: 256057185
PiperOrigin-RevId: 256048263