Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A faster log function #8869

Closed
simonbyrne opened this issue Nov 1, 2014 · 108 comments
Closed

A faster log function #8869

simonbyrne opened this issue Nov 1, 2014 · 108 comments
Labels
maths Mathematical functions performance Must go faster

Comments

@simonbyrne
Copy link
Contributor

The log function can often be the dominant cost in certain algorithms, particularly those related to sampling and fitting statistical models, e.g. see this discussion.

I've implemented the Tang (1990) table-based algorithm here:
https://github.com/simonbyrne/libm.jl/blob/master/src/log_tang.jl

It seems to be about 10% faster than the openlibm function, but there is probably still room for improvement. Does anyone have any suggestions?

For reference:

Once we have native fused multiply-add operations (#8112), we can probably speed it up further (on processors that support them).

@nalimilan
Copy link
Member

If you can give me a simple way of doing benchmarks, I can compare openlibm and system libm on OS X 10.6, to see whether the performance improvement was already present on that version. But it's not my machine, so I can't install development tools.

@ViralBShah
Copy link
Member

I also found the Apple log to be faster in tests I had done a while back when I was putting openlibm together.

@ViralBShah
Copy link
Member

Have you compared with Intel's log in MKL? They may also have fast vectorized versions. Perhaps @ArchRobison knows something about this.

@simonbyrne
Copy link
Contributor Author

@nalimilan I have a very, very rough test here:
https://github.com/simonbyrne/libm.jl/blob/master/test/perf/log.jl
But we probably should use something more rigorous (e.g. testing different ranges).

@ViralBShah I don't have access to MKL, so have not compared it.

@ViralBShah
Copy link
Member

@simonbyrne It is intalled on julia.mit.edu, if you have an account there. If you would like it, I can create one for you - just drop me an email. Pretty beefy machine for julia development - 80 cores / 1TB RAM.

@simonbyrne
Copy link
Contributor Author

Here's the timings using the above test script on julia.mit.edu:

Openlibm:
elapsed time: 0.366460477 seconds (13824 bytes allocated)

System:
elapsed time: 0.752183874 seconds (96 bytes allocated)

Intel math:
elapsed time: 0.597063228 seconds (96 bytes allocated)

Julia fdlibm:
elapsed time: 0.368376388 seconds (96 bytes allocated)

Julia tang:
elapsed time: 0.284002105 seconds (96 bytes allocated)

(here Intel math is the ICC libimf.so, not the MKL routines: I'm still figuring those out).

So that's a 25% boost over openlibm (much more than on my machine).

@JeffBezanson JeffBezanson added the performance Must go faster label Nov 1, 2014
@simonbyrne
Copy link
Contributor Author

@ViralBShah I haven't had any luck figuring out how to call the MKL vector math routines from within julia: any tips?

@ViralBShah
Copy link
Member

Sorry, none of those are not hooked up. Has been a while since I looked up the Intel manual. IMF has the libm stuff that you can get to with USE_INTEL_LIBM.

@andreasnoack
Copy link
Member

Have a look here: https://github.com/simonster/VML.jl

@ViralBShah
Copy link
Member

Perhaps worth also looking at Yeppp.

@simonbyrne
Copy link
Contributor Author

Thanks, I'll try out the package. I had a look at Yeppp's code: their log function is pretty crazy. As they outline in this talk, they intentionally avoid table lookups and divisions, which means they need a 20-term polynomial expansion. This is pretty much the opposite of Tang's approach, which uses the table as a way to get a shorter polynomials.

@ViralBShah
Copy link
Member

That implementation is crazy! I wonder if we can leverage Yeppp or integrate parts of it into openlibm. The build process is completely crazy too.

@nalimilan
Copy link
Member

@simonbyrne On Mac OS X 10.6 with Julia 0.2.1, I get these results (with an old 2GHz Core 2 Duo). Unfortunately I could not test your implementation as cloning a package fails.

julia> @testsum "System:" syslog X
       @testsum "Openlibm:" log X
Warning: Possible conflict in library symbol log

System:
elapsed time: 0.463031635 seconds (64 bytes allocated)

Openlibm:
elapsed time: 0.455063121 seconds (64 bytes allocated)

Is the warning expected or does it indicate that the same library was used for both calls? If not, it would seem to indicate that Apple's libm improved since 10.6, or that it got better than openlibm on recent CPUs.

@simonbyrne
Copy link
Contributor Author

@nalimilan Thanks for looking at it. I'm not sure what the warning is, but it could be getting mixed up between Base.log and libm.log (my package: I really should rename it...)?

Perhaps the reason they refuse to release the source is that they've licensed a proprietary library?

@simonbyrne
Copy link
Contributor Author

It turns that julia.mit.edu doesn't support AVX. @simonster Any ideas on which library I should link to?

@staticfloat
Copy link
Member

@nalimilan That warning is talking about a symbol being imported from a shared library; e.g. it's warning that a log function is already defined in the Julia process, and you're loading a library that redefines that function. I don't remember what we changed in the last year, but that warning might mean that the same log is being called in both cases.

@nalimilan
Copy link
Member

Actually, the error only appears after calling log once, which I think loads openlibm. But the timings are the same before and after that, so the benchmark is probably fine. (Libm.jl isn't involved since I could not even install it on 0.2.1.)

@simonbyrne
Copy link
Contributor Author

It looks as though Apple have changed their algorithm, as they get more accurate results than a standard implementation of Tang's algorithm would provide. On 10 million points:

Algorithm Max relative error (units in last place)
Openlibm 0.863
OS X 10.9 libm 0.508
Tang's algorithm 0.555 (theoretical bound of 0.566)

@ntessore
Copy link

ntessore commented Nov 7, 2014

I have an application that does a lot of raw number crunching. Redefining the basic special functions log, exp, ... to use the system libm as per @simonbyrne's example immediately cut down the runtime by ~20%. So I would be very much interested in (1) either a faster implementation of these basic functions, or (2) a way to tell Julia to use the system libm automatically.

@nolta
Copy link
Member

nolta commented Nov 7, 2014

For (2), compile julia with USE_SYSTEM_LIBM=1.

@ntessore
Copy link

ntessore commented Nov 7, 2014

@nolta Ah, interesting! Thanks. I was using Homebrew, where such options cannot be passed directly. I will try that.

@nolta
Copy link
Member

nolta commented Nov 7, 2014

If we implement log in Base, will we eventually abandon openlibm in favor of a pure julia implementation? I'm ok with that.

@ViralBShah
Copy link
Member

That is right. We can abandon openlibm with a pure julia implementation, which might be easier to maintain and optimize for vectorization, etc. Otherwise we just have to pile on more libraries.

@eschnett
Copy link
Contributor

eschnett commented Nov 8, 2014

Prompted by a different thread I began to port a SIMD math library to Julia. Currently, exp and log are available as a proof of concept. See https://github.com/eschnett/Vecmathlib.jl .

@ViralBShah
Copy link
Member

This looks really cool!

@simonbyrne
Copy link
Contributor Author

@ntessore if you're computing log on arrays, and you're on OS X, you might want to try out the Accelerate library that comes with the OS. It is blazingly fast compared with the standard log (from what I understand it uses SIMD operations and a bunch of extra threads), though does appear to sacrifice some accuracy. I have started putting together a package here:

https://github.com/simonbyrne/Accelerate.jl

It lacks documentation, but if you just call Accelerate.log you should get what you want (if not, please file an issue). Once I have wrapped it up I'll put it on METADATA, but I'd be grateful for any feedback.

@ntessore
Copy link

@simonbyrne Thanks for creating this, I will try and let you know if anything comes up.

Maybe we could put together a general Fast.jl package that offers fast but potentially less accurate math functions on all systems, choosing whatever options are best for a given platform. That way, code is less platform dependent, even though the performance might vary.

@simonbyrne
Copy link
Contributor Author

I tried looking at the Apple log function with a debugger, but they seem to have some basic reverse engineering protection in place (interestingly, this post is by one of their floating point gurus). However I did take a look at a hex dump of the library, and they have a table of pairs of values:

[(log(F),1/F) for F = 1.0:0x1p-8:2.0-0x1p-8]

which suggests that they are using a table lookup approach, similar (but different from) the Tang algorithm. I'll try playing around with it and see how I go.

@yuyichao
Copy link
Contributor

yuyichao commented Mar 8, 2020

I'm pretty sure the julia version had the computation constant propagated out.

@ViralBShah
Copy link
Member

Yeah probably right. Here's a better example:

julia> a = abs.(randn(10^3));

julia> @benchmark log.(a)
BenchmarkTools.Trial: 
  memory estimate:  8.00 KiB
  allocs estimate:  4
  --------------
  minimum time:     7.796 μs (0.00% GC)
  median time:      8.399 μs (0.00% GC)
  mean time:        9.844 μs (8.69% GC)
  maximum time:     1.795 ms (99.32% GC)
  --------------
  samples:          10000
  evals/sample:     4

julia> f(x::Float64) = ccall((:log, :libopenlibm), Float64, (Float64, ), x)
f (generic function with 1 method)

julia> @benchmark f.(a)
BenchmarkTools.Trial: 
  memory estimate:  8.00 KiB
  allocs estimate:  4
  --------------
  minimum time:     9.934 μs (0.00% GC)
  median time:      10.366 μs (0.00% GC)
  mean time:        10.828 μs (0.00% GC)
  maximum time:     39.387 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> g(x::Float64) = ccall((:log, :libm), Float64, (Float64, ), x)
g (generic function with 1 method)

julia> @benchmark g.(a) 
BenchmarkTools.Trial: 
  memory estimate:  8.00 KiB
  allocs estimate:  4
  --------------
  minimum time:     6.013 μs (0.00% GC)
  median time:      6.494 μs (0.00% GC)
  mean time:        7.372 μs (11.12% GC)
  maximum time:     1.082 ms (99.04% GC)
  --------------
  samples:          10000
  evals/sample:     6

We are in between the Apple libm and openlibm.

@yuyichao
Copy link
Contributor

yuyichao commented Mar 8, 2020

I believe you actually just want log($(1000.0)). a should also be $a but it matters less in this particular case.

@ViralBShah
Copy link
Member

Using $a speeds up all the 3 cases by 0.5 μs.

@chriselrod
Copy link
Contributor

That seems to work, and is simpler than the Ref approach I've been using:

julia> @benchmark log($(Ref(1000.0))[])
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     4.541 ns (0.00% GC)
  median time:      4.582 ns (0.00% GC)
  mean time:        4.592 ns (0.00% GC)
  maximum time:     14.723 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> @benchmark log($(1000.0))
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     4.463 ns (0.00% GC)
  median time:      4.488 ns (0.00% GC)
  mean time:        4.498 ns (0.00% GC)
  maximum time:     14.917 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> @benchmark log(1000.0)
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.103 ns (0.00% GC)
  median time:      1.105 ns (0.00% GC)
  mean time:        1.111 ns (0.00% GC)
  maximum time:     11.283 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

For comparison:

julia> g(x::Float64) = ccall((:log, "/usr/lib64/libm.so"), Float64, (Float64, ), x)
g (generic function with 1 method)

julia> g(100.0)
ERROR: could not load library "/usr/lib64/libm.so"
/usr/lib64/libm.so: invalid ELF header
Stacktrace:
 [1] g(::Float64) at ./REPL[16]:1
 [2] top-level scope at REPL[17]:1

julia> g(x::Float64) = ccall((:log, "/usr/lib64/libm.so.6"), Float64, (Float64, ), x)
g (generic function with 1 method)

julia> g(100.0)
4.605170185988092

julia> f(x::Float64) = ccall((:log, :libopenlibm), Float64, (Float64, ), x)
f (generic function with 2 methods)

julia> @benchmark f($(1000.0))
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     6.245 ns (0.00% GC)
  median time:      6.261 ns (0.00% GC)
  mean time:        6.273 ns (0.00% GC)
  maximum time:     16.794 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> @benchmark g($(1000.0))
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.936 ns (0.00% GC)
  median time:      3.946 ns (0.00% GC)
  mean time:        3.958 ns (0.00% GC)
  maximum time:     14.315 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

Any idea why I may be getting an invalid elf header error?

@non-Jedi
Copy link
Contributor

non-Jedi commented Mar 8, 2020

For what it's worth, on linux, log appears to be slower than log from glibc libm:

julia> @btime ccall((:log, :libopenlibm), Float64, (Float64,), $1000.0)
  8.754 ns (0 allocations: 0 bytes)
6.907755278982137

julia> @btime ccall((:log, "/usr/lib/libm.so.6"), Float64, (Float64,), $1000.0)
  5.306 ns (0 allocations: 0 bytes)
6.907755278982137

julia> @btime log(Ref($1000.0)[])
  6.297 ns (0 allocations: 0 bytes)
6.907755278982137

EDIT: beat to the punch it looks like.

@yuyichao
Copy link
Contributor

yuyichao commented Mar 8, 2020

libm.so is usually a linker script. In fact, all .so files in LD_LIBRARY_PATH are mainly meant for compile time.

@non-Jedi
Copy link
Contributor

non-Jedi commented Mar 8, 2020

julia> @benchmark log($(1000.0))
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     4.463 ns (0.00% GC)
  median time:      4.488 ns (0.00% GC)
  mean time:        4.498 ns (0.00% GC)
  maximum time:     14.917 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

Huh. This doesn't work for avoiding constant propagation on my machine. Wonder what the difference is.

julia> @benchmark log($(1000.0))
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     0.017 ns (0.00% GC)
  median time:      0.019 ns (0.00% GC)
  mean time:        0.019 ns (0.00% GC)
  maximum time:     0.033 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

@ViralBShah
Copy link
Member

On my computer, FMA_NATIVE should be true, using the test here:

const FMA_NATIVE = muladd(nextfloat(1.0),nextfloat(1.0),-nextfloat(1.0,2)) != 0

but, it is not getting used. I've built Julia from source on this computer:

julia> const FMA_NATIVE = muladd(nextfloat(1.0),nextfloat(1.0),-nextfloat(1.0,2)) != 0
true

julia> Base.Math.FMA_NATIVE
false

@chriselrod
Copy link
Contributor

chriselrod commented Mar 8, 2020

@non-Jedi I'm also on Linux, and found "/usr/lib(64)/libm.so.6" to be the fastest.
libmvec.so on Linux is also extremely fast, and provides SIMD implementations of a few functions. SLEEFPirates/LoopVectorization will use it when it can find it under "/usr/lib64/", "/usr/lib", or "/lib/x86_64-linux-gnu". Any list giving all the directories to search to cover most of the Linux distros?

I'm on the latest Julia master. What version are you on?

@ViralBShah I have the same problem. Here is a issue:
#33011

I'll try hard coding it to true, recompiling, and checking the difference that makes.

@ViralBShah
Copy link
Member

My tests suggest that the hardcoding the FMA to true has very little effect. Looking into the code to see what values it is triggered for.

@ViralBShah
Copy link
Member

For log, the proc2 method that uses FMA is only invoked for a small range of values:

if 0.9394130628134757 < x < 1.0644944589178595

Even in those cases, I can't see a noticeable difference.

@ViralBShah
Copy link
Member

@chriselrod It is probably easier to experiment with libm.jl for this purpose

@chriselrod
Copy link
Contributor

Interesting. I'm not seeing much of a difference either.
I already rebuilt Julia, but Libm makes it easy to test both at the same time:

julia> Libm.is_fma_fast(Float64)
false

julia> @benchmark Base.log($(1.04))
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.653 ns (0.00% GC)
  median time:      3.678 ns (0.00% GC)
  mean time:        3.698 ns (0.00% GC)
  maximum time:     20.121 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> @benchmark Libm.log($(1.04))
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     3.656 ns (0.00% GC)
  median time:      3.917 ns (0.00% GC)
  mean time:        3.884 ns (0.00% GC)
  maximum time:     13.617 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1000

julia> Base.Math.FMA_NATIVE
true

It is faster than the system libm version for 1.04 on this computer (which again takes about 3.94ns).

@ViralBShah
Copy link
Member

Speedwise, the llvm log seems faster. It is using the Tang algorithm as well: https://github.com/llvm-mirror/libclc/blob/master/generic/lib/math/log1p.cl

julia> l(x) = ccall("llvm.log.f64", llvmcall, Float64, (Float64, ), x)
l (generic function with 1 method)

julia> @btime l(Ref($1000.0)[])
  4.303 ns (0 allocations: 0 bytes)
6.907755278982137

julia> @btime log(Ref($1000.0)[])
  5.790 ns (0 allocations: 0 bytes)
6.907755278982137

@ViralBShah ViralBShah reopened this Mar 8, 2020
@ViralBShah
Copy link
Member

Reopening in order to consider whether we want to use the llvm version.

@chriselrod
Copy link
Contributor

I get nearly identical performance between the llvm and Julia logs

julia> l(x) = ccall("llvm.log.f64", llvmcall, Float64, (Float64, ), x)
l (generic function with 1 method)

julia> g(x::Float64) = ccall((:log, "/usr/lib64/libm.so.6"), Float64, (Float64, ), x)
g (generic function with 1 method)

julia> @btime l(Ref($1000.0)[])
  4.583 ns (0 allocations: 0 bytes)
6.907755278982137

julia> @btime log(Ref($1000.0)[])
  4.571 ns (0 allocations: 0 bytes)
6.907755278982137

julia> @btime g(Ref($1000.0)[])
  3.938 ns (0 allocations: 0 bytes)
6.907755278982137

If they're both using the same algorithm, it's interesting that you see a difference.

@ViralBShah
Copy link
Member

ViralBShah commented Mar 8, 2020

Note that what I have linked is log1p, but it does illustrate doing a few things differently:

https://github.com/llvm-mirror/libclc/blob/9aa6f350a6ce0f2cfc7e489495af8899ca74e079/generic/lib/math/log1p.cl#L105

    // Note that we use a lookup table of size 64 rather than 128,
    // and compensate by having extra terms in the minimax polynomial
    // for the kernel approximation.

@ViralBShah
Copy link
Member

Where does one find the glibc log implementation?

@non-Jedi
Copy link
Contributor

non-Jedi commented Mar 8, 2020

On someone's private github mirror, here I think:

https://github.com/bminor/glibc/blob/92b963699aae2da1e25f47edc7a0408bf3aee4d2/sysdeps/ieee754/dbl-64/e_log.c

EDIT: note that glibc is obvious licensed under the GPL, so any implementation that could be considered "derived" from glibc must also be so licensed.

@ViralBShah
Copy link
Member

Looks surprisingly like the musl log: https://git.musl-libc.org/cgit/musl/tree/src/math/log.c

But one is MIT licensed and the other is GPL. Hmm.

@yuyichao
Copy link
Contributor

yuyichao commented Mar 8, 2020

Speedwise, the llvm log seems faster. It is using the Tang algorithm as well: https://github.com/llvm-mirror/libclc/blob/master/generic/lib/math/log1p.cl

julia> l(x) = ccall("llvm.log.f64", llvmcall, Float64, (Float64, ), x)
l (generic function with 1 method)

IIUC that's not what you are using. I'm 99% sure the LLVM intrinsics just got lowered to libm functions. The one you linked is the OpenCL C library, which shouldn' be what you are using directly....

@ViralBShah
Copy link
Member

ViralBShah commented Mar 8, 2020

IIUC that's not what you are using. I'm 99% sure the LLVM intrinsics just got lowered to libm functions. The one you linked is the OpenCL C library, which shouldn' be what you are using directly....

Ok. That's what I thought originally. However, the llvm log seems slightly faster than the system libm on my mac.

@yuyichao
Copy link
Contributor

yuyichao commented Mar 8, 2020

You can check code_native and/or in the debugger to make sure/compare.

Also note that I'm not sure what log function has the best performance and I'm merely responding to obvious issues in my notification. A few other issues to consider,

  1. Most FMA check (at least any that derived from the one in base) is unreliable.
  2. Last time I check LLVM still doesn't clear floating point registers aggressive enough on x86. Calling into a different library compiled using different compilers/flags also affect this. Comparing different implementation could be unfair due to this. (A julia implementation, for example would be compiled with exactly the same flag as the benchmark caller, a C implementation likely won't, glibc libm has uarch dispatch, so does sleef, libmvec and in some sense julia, whereas openlibm doesn't)
  3. And this is not taking into account the obvious compiler difference. I've seen the same math function compiled with gcc and clang to have very different performance (to the level of the difference you are seeing here.) I usually see GCC wins IIRC but Clang also comes ahead sometimes.

Anyway, I have no idea what's the best log implemenation, or any libm function for that matter..... However, if you are interested in algorithm, just blinding doing random testing may not give the most sensible result and I've certainly seen cases where the small effects matters at this level......

@ViralBShah
Copy link
Member

We clearly have the right algorithm and the original purpose of this issue is resolved. The julia log is overall reasonably competitive if not the fastest. There may be value in calling the system log in certain cases (mac and linux64 it appears), but we should perhaps open a new issue for that and it would need significant testing in cases where we use the system provided versions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
maths Mathematical functions performance Must go faster
Projects
None yet
Development

No branches or pull requests