Age | Commit message | Author | |
---|---|---|---|
2020-04-30 | Remove need of passing template parameter which can be deduced by a compiler (branch: static) | Mateusz Chudyk | |
2020-04-29 | Fix assert in tile_test.inl | Mateusz Chudyk | |
2020-04-28 | Fix void * error on g++ 8.4.0. Weird error. | Kenneth Heafield | |
2020-04-28 | Smaller tile for compiling checked in code | Kenneth Heafield | |
2020-04-25 | Memoising benchmark program to decide tile size, but it likes 1x16 | Kenneth Heafield | |
2020-04-24 | Rudimentary tile benchmark. Keep in mind Multiply still needs optimization. | Kenneth Heafield | |
2020-04-24 | Silence compiler warnings on 1<< overflow | Kenneth Heafield | |
2020-04-24 | Extract randomly generated matrix class | Kenneth Heafield | |
2020-04-24 | Comment | Kenneth Heafield | |
2020-04-24 | Oops use memcmp in test for whole array | Kenneth Heafield | |
2020-04-24 | Basic general sized multiply, not optimized yet | Kenneth Heafield | |
2020-04-24 | Add empty check for Tile | Kenneth Heafield | |
2020-04-23 | Comment ends of ifdefs | Kenneth Heafield | |
2020-04-23 | General write working on AVX512, at least for tested cases | Kenneth Heafield | |
2020-04-23 | Insane implementation of most cases for writing C. Still missing offset scatter. | Kenneth Heafield | |
2020-04-23 | Tests for unrolled inner dimension are tricky | Kenneth Heafield | |
2020-04-22 | Lots of tests, including inner failing | Kenneth Heafield | |
2020-04-22 | Fix TestMultiplyNoOverhangShapes to call kernel | Kenneth Heafield | |
2020-04-22 | Merge remote-tracking branch 'origin/master' into static | Kenneth Heafield | |
2020-04-20 | Merge pull request #73 from kpu/absolute_std | Kenneth Heafield | |
Add option for absolute value STD | |||
2020-04-20 | Rename and fix interface (branch: absolute_std) | Nikolay Bogoychev | |
2020-04-20 | Rename and move the if outside the hot loop | Nikolay Bogoychev | |
2020-04-20 | Merge branch 'master' into absolute_std | Nikolay Bogoychev | |
2020-04-20 | Fix OMP parallel wrap typing for Shift | Kenneth Heafield | |
2020-04-20 | Workaround gcc bug producing extra move instructions | Kenneth Heafield | |
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94663 Improvement ranges from 3% (1x64x8) to 35% (8x2048x256) and is often 21-25%. Benchmark program output, all 8-bit AVX512VNNI, listed as BEFORE then AFTER: Multiply 1 64 8 Samples=75: BEFORE 64 65.4933 0.875698, AFTER 62 64.8533 1.36256; Multiply 8 256 256 Samples=75: BEFORE 13296 13385.3 36.0012, AFTER 10754 10873.9 31.3479; Multiply 8 2048 256 Samples=75: BEFORE 86800 86974.3 59.9597, AFTER 64222 65428.6 222.893; Multiply 8 256 2048 Samples=75: BEFORE 106780 107392 232.955, AFTER 86176 88366.1 402.335; Multiply 320 256 256 Samples=75: BEFORE 531720 533687 1419.3, AFTER 436536 437186 352.487; Multiply 472 256 256 Samples=75: BEFORE 785026 787784 2068.05, AFTER 646240 647382 416.252; Multiply 248 256 256 Samples=75: BEFORE 412282 413484 971.843, AFTER 338368 338656 141.354; Multiply 200 256 256 Samples=75: BEFORE 332578 333463 742.297, AFTER 272890 273103 77.2789; Multiply 256 256 256 Samples=75: BEFORE 425654 427240 1095.53, AFTER 349418 349580 80.8586; Multiply 512 512 512 Samples=75: BEFORE 3122382 3.13179e+06 4215.88, AFTER 2493984 2.51602e+06 6052.1; Multiply 1024 1024 1024 Samples=3: BEFORE 24927622 2.49795e+07 44940.9, AFTER 19210646 1.9229e+07 17037; Multiply 4096 4096 128 Samples=3: BEFORE 49870840 4.99655e+07 133057, AFTER 46146812 4.62847e+07 205448 | |||
2020-04-19 | Don't catch clang with the gcc hack, move VNNI to a function | Kenneth Heafield | |
2020-04-19 | Fix comment | Kenneth Heafield | |
2020-04-19 | Work around gcc _mm512_dpbusds_epi32 spurious vmovdqa64 instructions | Kenneth Heafield | |
Use asm ("vpdpbusds %2, %1, %0" : "+x"(c) : "x"(a), "mx"(b)); instead of c = _mm512_dpbusds_epi32(c, a, b); | |||
2020-04-19 | template argument for shuffle immediate | Kenneth Heafield | |
makes clang happy | |||
2020-04-19 | Remove StaticLoop | Kenneth Heafield | |
2020-04-19 | Change tile_test to variadic index_sequence | Kenneth Heafield | |
2020-04-19 | Sum16To32 using variadic templates | Kenneth Heafield | |
2020-04-19 | Replace StaticLoop with variadic template | Kenneth Heafield | |
2020-04-19 | Document unordered_unfurl | Kenneth Heafield | |
2020-04-19 | Header for std::size_t | Kenneth Heafield | |
2020-04-19 | Change Index to size_t | Kenneth Heafield | |
2020-04-19 | Switch reduce to taking RegisterPair | Kenneth Heafield | |
2020-04-19 | Change to integer sequence for unrolling kernels | Kenneth Heafield | |
2020-04-18 | Even more test configurations | Kenneth Heafield | |
2020-04-18 | Test statically unrolled multiplies too | Kenneth Heafield | |
2020-04-18 | Tiled multiply with basic testing work | Kenneth Heafield | |
2020-04-18 | Merge remote-tracking branch 'origin/master' into static | Kenneth Heafield | |
2020-04-13 | Just use posix_memalign everywhere | Kenneth Heafield | |
2020-04-06 | Merge pull request #77 from kpuatamazon/master | Kenneth Heafield | |
OMP parallelization for Multiply | |||
2020-04-05 | Comments | Kenneth Heafield | |
2020-04-04 | Test SSE2 | Kenneth Heafield | |
2020-04-04 | Rename Pack to Reduce | Kenneth Heafield | |
2020-04-04 | More thoroughly test reduction code | Kenneth Heafield | |
2020-04-04 | Does AVX512 reduce work? | Kenneth Heafield | |
2020-04-04 | Reduce working for SSE2 and AVX2, working on AVX512 | Kenneth Heafield | |