https://github.com/JuliaLang/julia
Revision 36b7d3bed73a6e9419b5ebbb073e41252e2de028 authored by Harmen Stoppels on 09 February 2024, 14:37:01 UTC, committed by GitHub on 09 February 2024, 14:37:01 UTC
Adds a convenient way to enable PGO+LTO on Julia and LLVM together:

1. `cd contrib/pgo-lto`
2. `make -j$(nproc) stage1`
3. `make clean-profiles`
4. `./stage1.build/julia -O3 -e 'using Pkg; Pkg.add("LoopVectorization"); Pkg.test("LoopVectorization")'`
5. `make -j$(nproc) stage2`
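
The five steps above can be strung together in a small script. The following is a minimal sketch, assuming it is started from the root of a Julia checkout. The names `run`, `pgo_lto_build`, and the `DRY_RUN` switch are hypothetical helpers introduced here for illustration and are not part of the repository.

```shell
# run: hypothetical helper; prints the command when DRY_RUN=1, otherwise executes it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# pgo_lto_build: hypothetical wrapper around the five steps listed above.
# The body runs in a subshell so the caller's working directory is unchanged.
pgo_lto_build() (
    # Fall back to a single job where nproc is unavailable.
    njobs="$(nproc 2>/dev/null || echo 1)"
    run cd contrib/pgo-lto || exit 1
    # Stage 1: instrumented build.
    run make "-j${njobs}" stage1 || exit 1
    # Drop profiles collected while building, so only the workload below counts.
    run make clean-profiles || exit 1
    # Training workload: generates the profile data.
    run ./stage1.build/julia -O3 -e \
        'using Pkg; Pkg.add("LoopVectorization"); Pkg.test("LoopVectorization")' || exit 1
    # Stage 2: optimized build that consumes the collected profiles.
    run make "-j${njobs}" stage2
)
```

Calling `pgo_lto_build` with `DRY_RUN=1` set only prints the commands, which is a cheap way to check the sequence before committing to two full builds.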

<details>
<summary>Output looks roughly as follows</summary>

```console
$ make -C contrib/pgo-lto top 
make: Entering directory '/dev/shm/julia/contrib/pgo-lto'
llvm-profdata show --topn=50 /dev/shm/julia/contrib/pgo-lto/profiles/merged.prof | c++filt
Instrumentation level: IR  entry_first = 0
Total functions: 85943
Maximum function count: 7867557260
Maximum internal block count: 3468437590
Top 50 functions with the largest internal block counts: 
  llvm::BitVector::operator|=(llvm::BitVector const&), max count = 7867557260
  LateLowerGCFrame::ComputeLiveness(State&), max count = 3468437590
  llvm::hashing::detail::hash_combine_recursive_helper::hash_combine_recursive_helper(), max count = 1742259834
  llvm::SUnit::addPred(llvm::SDep const&, bool), max count = 511396575
  llvm::LiveRange::overlaps(llvm::LiveRange const&, llvm::CoalescerPair const&, llvm::SlotIndexes const&) const, max count = 508061762
  llvm::StringMapImpl::LookupBucketFor(llvm::StringRef), max count = 505682177
  std::map<llvm::BasicBlock*, BBState, std::less<llvm::BasicBlock*>, std::allocator<std::pair<llvm::BasicBlock* const, BBState> > >::operator[](llvm::BasicBlock* const&), max count = 395628888
  llvm::LiveRange::advanceTo(llvm::LiveRange::Segment const*, llvm::SlotIndex) const, max count = 384642728
  llvm::LiveRange::isLiveAtIndexes(llvm::ArrayRef<llvm::SlotIndex>) const, max count = 380291040
  llvm::PassRegistry::enumerateWith(llvm::PassRegistrationListener*), max count = 352313953
  ijl_method_instance_add_backedge, max count = 349608221
  llvm::SUnit::ComputeHeight(), max count = 336604330
  llvm::LiveRange::advanceTo(llvm::LiveRange::Segment*, llvm::SlotIndex), max count = 331030109
  llvm::SmallPtrSetImplBase::insert_imp(void const*), max count = 272966545
  llvm::LiveIntervals::checkRegMaskInterference(llvm::LiveInterval&, llvm::BitVector&), max count = 257449540
  LateLowerGCFrame::ComputeLiveSets(State&), max count = 252096274
  /dev/shm/julia/src/jltypes.c:has_free_typevars, max count = 230879464
  ijl_get_pgcstack, max count = 216953592
  LateLowerGCFrame::RefineLiveSet(llvm::BitVector&, State&, std::vector<int, std::allocator<int> > const&), max count = 188013152
  /dev/shm/julia/src/flisp/flisp.c:apply_cl, max count = 174863813
  /dev/shm/julia/src/flisp/builtins.c:fl_memq, max count = 168621603
```
</details>
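
To work with this listing further, the `max count` lines can be turned into tab-separated `count<TAB>name` records. This is a rough sketch that assumes the text layout shown in the output above; `hot_functions` is a hypothetical helper name, not something the Makefile provides.

```shell
# hot_functions: hypothetical filter. Reads `llvm-profdata show --topn=N` text
# on stdin and prints "count<TAB>function" for every "..., max count = N" line.
# Header lines such as "Total functions: ..." do not match and are skipped.
hot_functions() {
    awk 'match($0, /, max count = [0-9]+$/) {
        name = substr($0, 1, RSTART - 1)   # text before ", max count = "
        sub(/^[ \t]+/, "", name)           # strip the leading indentation
        count = substr($0, RSTART + 14)    # 14 = length of ", max count = "
        printf "%s\t%s\n", count, name
    }'
}
```

For example, `make -C contrib/pgo-lto top | hot_functions | sort -rn | head` would list the hottest entries by count.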


This often yields substantial speedups in time to first X, because it
reduces the time spent in LLVM optimization passes by 25% or even 30%.

Example 1:

```julia
using LoopVectorization
function f!(a, b)
    @turbo for i in eachindex(a)
        a[i] *= b[i]
    end
    return a
end
f!(rand(1), rand(1))
```

```console
$ time ./julia -O3 lv.jl
```

Without PGO+LTO: 14.801s
With PGO+LTO: 11.978s (-19%)

Example 2:

```console
$ time ./julia -e 'using Pkg; Pkg.test("Unitful");'
```

Without PGO+LTO: 1m47.688s
With PGO+LTO: 1m35.704s (-11%)
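
The percentages in Examples 1 and 2 follow from the `time` readings as (after - before) / before. A small sketch that reproduces them; `to_seconds` and `pct_change` are hypothetical helper names, and the durations are the ones quoted above.

```shell
# to_seconds: convert a `time`-style duration (1m47.688s or 14.801s) to seconds.
to_seconds() {
    awk -v t="$1" 'BEGIN {
        sub(/s$/, "", t)                 # drop the trailing "s"
        n = split(t, parts, "m")         # "1m47.688" -> parts[1]=1, parts[2]=47.688
        secs = (n == 2) ? parts[1] * 60 + parts[2] : parts[1]
        printf "%.3f\n", secs
    }'
}

# pct_change: relative change from before ($1) to after ($2), as a whole percent.
pct_change() {
    awk -v b="$(to_seconds "$1")" -v a="$(to_seconds "$2")" \
        'BEGIN { printf "%.0f\n", (a - b) / b * 100 }'
}
```

`pct_change 14.801s 11.978s` prints `-19`, and `pct_change 1m47.688s 1m35.704s` prints `-11`.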

Example 3 (taken from issue #45395, which spends almost all of its time in LLVM):

```console
$ JULIA_LLVM_ARGS=-time-passes ./julia script-45395.jl
```

Without PGO+LTO:

```
===-------------------------------------------------------------------------===
                      ... Pass execution timing report ...
===-------------------------------------------------------------------------===
  Total Execution Time: 101.0130 seconds (98.6253 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
  53.6961 ( 54.7%)   0.1050 (  3.8%)  53.8012 ( 53.3%)  53.8045 ( 54.6%)  Unroll loops
  25.5423 ( 26.0%)   0.0072 (  0.3%)  25.5495 ( 25.3%)  25.5444 ( 25.9%)  Global Value Numbering
   7.1995 (  7.3%)   0.0526 (  1.9%)   7.2521 (  7.2%)   7.2517 (  7.4%)  Induction Variable Simplification
   5.0541 (  5.1%)   0.0098 (  0.3%)   5.0639 (  5.0%)   5.0561 (  5.1%)  Combine redundant instructions #2
```

With PGO+LTO:

```
===-------------------------------------------------------------------------===
                      ... Pass execution timing report ...
===-------------------------------------------------------------------------===
  Total Execution Time: 72.6507 seconds (70.1337 wall clock)

   ---User Time---   --System Time--   --User+System--   ---Wall Time---  --- Name ---
  36.0894 ( 51.7%)   0.0825 (  2.9%)  36.1719 ( 49.8%)  36.1738 ( 51.6%)  Unroll loops
  16.5713 ( 23.7%)   0.0129 (  0.5%)  16.5843 ( 22.8%)  16.5794 ( 23.6%)  Global Value Numbering
   5.9047 (  8.5%)   0.0395 (  1.4%)   5.9442 (  8.2%)   5.9438 (  8.5%)  Induction Variable Simplification
   4.7566 (  6.8%)   0.0078 (  0.3%)   4.7645 (  6.6%)   4.7575 (  6.8%)  Combine redundant instructions #2
```

That is 28% less time spent in LLVM: total execution time drops from
101.0 s to 72.7 s.

`perf` reports show that this comes mostly from fewer instructions
executed and fewer icache misses.

---

Finally, there is a significant reduction in binary size. For libLLVM.so:

```
79M	usr/lib/libLLVM-13jl.so (before)
67M	usr/lib/libLLVM-13jl.so (after)
```

It can be reduced by another 2 MB with `--icf=safe` when LLD is used as
the linker.
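
As a rough illustration of how that option is usually passed, the fragment below appends it to `LDFLAGS`. `-fuse-ld=lld` selects LLD in the clang and gcc drivers, and `-Wl,` forwards `--icf=safe` to the linker. Whether and how a given build picks up `LDFLAGS` is an assumption here, not something this change documents.

```shell
# Assumption: the build honors LDFLAGS at link time.
# Append to any existing flags instead of overwriting them.
LDFLAGS="${LDFLAGS:+$LDFLAGS }-fuse-ld=lld -Wl,--icf=safe"
export LDFLAGS
```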

- [x] Two out-of-source builds would be better than a single in-source
build, so that it's easier to find good profile data

---------

Co-authored-by: Oscar Smith <oscardssmith@gmail.com>
Co-authored-by: Lilith Orion Hafner <lilithhafner@gmail.com>
1 parent 27b31d1
Add PGO+LTO Makefile (#45641)