A faster log function #8869
Comments
If you can give me a simple way of doing benchmarks, I can compare openlibm and system libm on OS X 10.6, to see whether the performance improvement was already present on that version. But it's not my machine, so I can't install development tools. |
I also found the Apple |
Have you compared with Intel's |
@nalimilan I have a very, very rough test here: @ViralBShah I don't have access to MKL, so have not compared it. |
@simonbyrne It is installed on julia.mit.edu, if you have an account there. If you would like it, I can create one for you - just drop me an email. Pretty beefy machine for julia development - 80 cores / 1TB RAM. |
Here's the timings using the above test script on julia.mit.edu:
(here Intel math is the ICC libimf.so, not the MKL routines: I'm still figuring those out). So that's a 25% boost over openlibm (much more than on my machine). |
@ViralBShah I haven't had any luck figuring out how to call the MKL vector math routines from within julia: any tips? |
Sorry, none of those are hooked up. It has been a while since I looked at the Intel manual. IMF has the libm stuff that you can get to with USE_INTEL_LIBM. |
Have a look here: https://github.com/simonster/VML.jl |
Perhaps worth also looking at Yeppp. |
Thanks, I'll try out the package. I had a look at Yeppp's code: their log function is pretty crazy. As they outline in this talk, they intentionally avoid table lookups and divisions, which means they need a 20-term polynomial expansion. This is pretty much the opposite of Tang's approach, which uses the table as a way to get a shorter polynomial. |
That implementation is crazy! I wonder if we can leverage Yeppp or integrate parts of it into openlibm. The build process is completely crazy too. |
@simonbyrne On Mac OS X 10.6 with Julia 0.2.1, I get these results (with an old 2GHz Core 2 Duo). Unfortunately I could not test your implementation as cloning a package fails.
Is the warning expected or does it indicate that the same library was used for both calls? If not, it would seem to indicate that Apple's libm improved since 10.6, or that it got better than openlibm on recent CPUs. |
@nalimilan Thanks for looking at it. I'm not sure what the warning is, but it could be getting mixed up between Base.log and libm.log (my package: I really should rename it...)? Perhaps the reason they refuse to release the source is that they've licensed a proprietary library? |
It turns out that julia.mit.edu doesn't support AVX. @simonster Any ideas on which library I should link to? |
@nalimilan That warning is talking about a symbol being imported from a shared library; e.g. it's warning that a |
Actually, the error only appears after calling |
It looks as though Apple have changed their algorithm, as they get more accurate results than a standard implementation of Tang's algorithm would provide. On 10 million points:
|
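The accuracy comparison mentioned above can be reproduced in outline by measuring the error of a `log` implementation in ulps against a `BigFloat` reference. This is only a minimal sketch: `ulp_error` and `max_ulp_error` are hypothetical helper names, the sample grid is an arbitrary choice (far smaller than the 10 million points used above), and it does not reproduce the figures from the thread.

```julia
# Error of a Float64 log implementation in units in the last place (ulps),
# measured against a BigFloat reference.
# `ulp_error` and `max_ulp_error` are hypothetical names for this sketch.

function ulp_error(f, x::Float64)
    y = f(x)                     # value under test
    ref = log(big(x))            # BigFloat reference (256 bits by default)
    # Distance from the reference, scaled by the Float64 spacing at the
    # rounded reference value.
    return Float64(abs(big(y) - ref) / eps(Float64(ref)))
end

max_ulp_error(f, xs) = maximum(x -> ulp_error(f, x), xs)

# Sample points spread over several binades (an arbitrary choice).
xs = exp.(range(-10.0, stop = 10.0, length = 10_001))
println("max error of Base.log: ", max_ulp_error(log, xs), " ulps")
```

Any function taking and returning a `Float64` can be passed as `f`, for example a `ccall` wrapper around the system libm.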
I have an application that does a lot of raw number crunching. Redefining the basic special functions log, exp, ... to use the system libm as per @simonbyrne's example immediately cut down the runtime by ~20%. So I would be very much interested in (1) either a faster implementation of these basic functions, or (2) a way to tell Julia to use the system libm automatically. |
For (2), compile julia with |
@nolta Ah, interesting! Thanks. I was using Homebrew, where such options cannot be passed directly. I will try that. |
If we implement |
That is right. We can abandon openlibm with a pure julia implementation, which might be easier to maintain and optimize for vectorization, etc. Otherwise we just have to pile on more libraries. |
Prompted by a different thread I began to port a SIMD math library to Julia. Currently, |
This looks really cool! |
@ntessore if you're computing https://github.com/simonbyrne/Accelerate.jl It lacks documentation, but if you just call |
@simonbyrne Thanks for creating this, I will try and let you know if anything comes up. Maybe we could put together a general Fast.jl package that offers fast but potentially less accurate math functions on all systems, choosing whatever options are best for a given platform. That way, code is less platform dependent, even though the performance might vary. |
I tried looking at the Apple log function with a debugger, but they seem to have some basic reverse engineering protection in place (interestingly, this post is by one of their floating point gurus). However I did take a look at a hex dump of the library, and they have a table of pairs of values: [(log(F),1/F) for F = 1.0:0x1p-8:2.0-0x1p-8] which suggests that they are using a table lookup approach, similar to (but different from) the Tang algorithm. I'll try playing around with it and see how I go. |
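To illustrate how a table of `(log(F), 1/F)` pairs like the one described above can be used, here is a rough sketch of a table-lookup log. It is not Apple's code and not the Tang implementation linked in this issue: `LOG_TABLE` and `table_log` are hypothetical names, the polynomial is just the truncated Taylor series of `log1p`, and the sketch skips things a production implementation needs, such as splitting `log(2)` into high and low parts and a separate path for inputs near 1 (where the sum below loses relative accuracy through cancellation).

```julia
# Table-lookup log sketch. All names are hypothetical.
# Requires Julia 1.4 or later for `evalpoly`.

# 256 pairs (log(F), 1/F) for F = 1, 1 + 2^-8, ..., 2 - 2^-8
const LOG_TABLE = [(log(F), 1 / F) for F = 1.0:0x1p-8:2.0-0x1p-8]

function table_log(x::Float64)
    # Only positive, normal, finite inputs are handled in this sketch.
    (isfinite(x) && x >= floatmin(Float64)) || throw(DomainError(x))
    k = exponent(x)                      # x = m * 2^k
    m = significand(x)                   # m in [1, 2)
    j = floor(Int, (m - 1.0) * 256)      # F = 1 + j/256 is the table point <= m
    F = 1.0 + j / 256
    logF, invF = LOG_TABLE[j + 1]
    r = (m - F) * invF                   # r = m/F - 1, so 0 <= r < 2^-8
    # log1p(r) = r - r^2/2 + r^3/3 - ..., truncated after the r^7 term;
    # the table keeps r small, so few terms are needed.
    p = r * evalpoly(r, (1.0, -1/2, 1/3, -1/4, 1/5, -1/6, 1/7))
    # log(x) = k*log(2) + log(F) + log1p(r)
    return k * log(2.0) + (logF + p)
end
```

The design trade-off is the one discussed earlier in the thread: a larger table gives a smaller `r` and therefore a shorter polynomial, at the cost of a memory lookup.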
I'm pretty sure the julia version had the computation constant propagated out. |
Yeah probably right. Here's a better example:
We are in between the Apple libm and openlibm. |
I believe you actually just want |
Using |
That seems to work, and is simpler than the
julia> @benchmark log($(Ref(1000.0))[])
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 4.541 ns (0.00% GC)
median time: 4.582 ns (0.00% GC)
mean time: 4.592 ns (0.00% GC)
maximum time: 14.723 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 1000
julia> @benchmark log($(1000.0))
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 4.463 ns (0.00% GC)
median time: 4.488 ns (0.00% GC)
mean time: 4.498 ns (0.00% GC)
maximum time: 14.917 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 1000
julia> @benchmark log(1000.0)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 1.103 ns (0.00% GC)
median time: 1.105 ns (0.00% GC)
mean time: 1.111 ns (0.00% GC)
maximum time: 11.283 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 1000
For comparison:
julia> g(x::Float64) = ccall((:log, "/usr/lib64/libm.so"), Float64, (Float64, ), x)
g (generic function with 1 method)
julia> g(100.0)
ERROR: could not load library "/usr/lib64/libm.so"
/usr/lib64/libm.so: invalid ELF header
Stacktrace:
[1] g(::Float64) at ./REPL[16]:1
[2] top-level scope at REPL[17]:1
julia> g(x::Float64) = ccall((:log, "/usr/lib64/libm.so.6"), Float64, (Float64, ), x)
g (generic function with 1 method)
julia> g(100.0)
4.605170185988092
julia> f(x::Float64) = ccall((:log, :libopenlibm), Float64, (Float64, ), x)
f (generic function with 2 methods)
julia> @benchmark f($(1000.0))
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 6.245 ns (0.00% GC)
median time: 6.261 ns (0.00% GC)
mean time: 6.273 ns (0.00% GC)
maximum time: 16.794 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 1000
julia> @benchmark g($(1000.0))
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 3.936 ns (0.00% GC)
median time: 3.946 ns (0.00% GC)
mean time: 3.958 ns (0.00% GC)
maximum time: 14.315 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 1000
Any idea why I may be getting an invalid ELF header error? |
For what it's worth, on Linux:
julia> @btime ccall((:log, :libopenlibm), Float64, (Float64,), $1000.0)
8.754 ns (0 allocations: 0 bytes)
6.907755278982137
julia> @btime ccall((:log, "/usr/lib/libm.so.6"), Float64, (Float64,), $1000.0)
5.306 ns (0 allocations: 0 bytes)
6.907755278982137
julia> @btime log(Ref($1000.0)[])
6.297 ns (0 allocations: 0 bytes)
6.907755278982137
EDIT: beaten to the punch, it looks like. |
libm.so is usually a linker script. In fact, all |
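A quick way to tell a real shared object from a linker script is to look at the first four bytes of the file: ELF binaries start with `0x7f 'E' 'L' 'F'`, while a linker script is plain text. The sketch below uses a hypothetical helper name, `is_elf`, and the paths it tries are the ones quoted in this thread, which may not exist on a given system.

```julia
# Check whether a file starts with the ELF magic number.
# `is_elf` is a hypothetical helper name for this sketch.
function is_elf(path::AbstractString)
    isfile(path) || return false
    magic = open(path, "r") do io
        read(io, 4)                      # up to the first four bytes
    end
    return magic == UInt8[0x7f, 0x45, 0x4c, 0x46]   # "\x7fELF"
end

# Paths taken from the comments above; adjust for your distribution.
for path in ("/usr/lib64/libm.so", "/usr/lib64/libm.so.6")
    status = !isfile(path) ? "not found" :
             is_elf(path)  ? "ELF binary" :
                             "not ELF (possibly a linker script)"
    println(path, " => ", status)
end
```

If the unversioned file turns out not to be ELF, pointing `ccall` at the versioned name (as done above with `libm.so.6`) avoids the "invalid ELF header" error.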
Huh. This doesn't work for avoiding constant propagation on my machine. I wonder what the difference is.
julia> @benchmark log($(1000.0))
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 0.017 ns (0.00% GC)
median time: 0.019 ns (0.00% GC)
mean time: 0.019 ns (0.00% GC)
maximum time: 0.033 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 1000 |
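A sub-nanosecond timing like the one above means the call was folded away at compile time. One way to sidestep the question of how a benchmark macro treats interpolated constants is to time the function over data that only exists at run time. The following is a rough sketch using only Base: `time_per_call` is a hypothetical helper, and the number it reports includes loop and memory-load overhead, so it is a throughput estimate rather than a replacement for the `@benchmark` results in this thread.

```julia
# Mean time per call of `f` over run-time data, best of `reps` passes.
# `time_per_call` is a hypothetical helper name for this sketch.
function time_per_call(f, xs::Vector{Float64}; reps::Int = 10)
    s = 0.0
    best = Inf
    for _ in 1:reps
        t = @elapsed begin
            for x in xs
                s += f(x)        # accumulate so the calls cannot be discarded
            end
        end
        best = min(best, t)
    end
    return best / length(xs), s  # seconds per call, and the checksum
end

xs = 1.0 .+ 999.0 .* rand(10^6)          # inputs unknown at compile time
time_per_call(log, xs; reps = 1)         # warm-up pass to trigger compilation
secs, _ = time_per_call(log, xs)
println("Base.log: ", round(secs * 1e9; digits = 2), " ns per call")
```

The same helper can be given a `ccall` wrapper such as the `g` or `f` functions defined earlier, so that all candidates are timed on identical inputs.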
On my computer, FMA_NATIVE should be true, using the test here: Line 144 in 17ad922
but, it is not getting used. I've built Julia from source on this computer:
|
@non-Jedi I'm also on Linux, and found "/usr/lib(64)/libm.so.6" to be the fastest. I'm on the latest Julia master. What version are you on? @ViralBShah I have the same problem. Here is an issue: I'll try hard coding it to true, recompiling, and checking the difference that makes. |
My tests suggest that hardcoding FMA to true has very little effect. Looking into the code to see what values it is triggered for. |
For Line 258 in 17ad922
Even in those cases, I can't see a noticeable difference. |
@chriselrod It is probably easier to experiment with libm.jl for this purpose |
Interesting. I'm not seeing much of a difference either.
julia> Libm.is_fma_fast(Float64)
false
julia> @benchmark Base.log($(1.04))
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 3.653 ns (0.00% GC)
median time: 3.678 ns (0.00% GC)
mean time: 3.698 ns (0.00% GC)
maximum time: 20.121 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 1000
julia> @benchmark Libm.log($(1.04))
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 3.656 ns (0.00% GC)
median time: 3.917 ns (0.00% GC)
mean time: 3.884 ns (0.00% GC)
maximum time: 13.617 ns (0.00% GC)
--------------
samples: 10000
evals/sample: 1000
julia> Base.Math.FMA_NATIVE
true
It is faster than the system libm version for 1.04 on this computer (which again takes about 3.94 ns). |
Speedwise, the llvm log seems faster. It is using the Tang algorithm as well: https://github.com/llvm-mirror/libclc/blob/master/generic/lib/math/log1p.cl
|
Reopening in order to consider whether we want to use the llvm version. |
I get nearly identical performance between the LLVM and Julia logs:
julia> l(x) = ccall("llvm.log.f64", llvmcall, Float64, (Float64, ), x)
l (generic function with 1 method)
julia> g(x::Float64) = ccall((:log, "/usr/lib64/libm.so.6"), Float64, (Float64, ), x)
g (generic function with 1 method)
julia> @btime l(Ref($1000.0)[])
4.583 ns (0 allocations: 0 bytes)
6.907755278982137
julia> @btime log(Ref($1000.0)[])
4.571 ns (0 allocations: 0 bytes)
6.907755278982137
julia> @btime g(Ref($1000.0)[])
3.938 ns (0 allocations: 0 bytes)
6.907755278982137
If they're both using the same algorithm, it's interesting that you see a difference. |
Note that what I have linked is
|
Where does one find the glibc log implementation? |
On someone's private github mirror, here I think: EDIT: note that glibc is obviously licensed under the GPL, so any implementation that could be considered "derived" from glibc must also be so licensed. |
Looks surprisingly like the musl log: https://git.musl-libc.org/cgit/musl/tree/src/math/log.c But one is MIT licensed and the other is GPL. Hmm. |
IIUC that's not what you are using. I'm 99% sure the LLVM intrinsics just get lowered to libm functions. The one you linked is the OpenCL C library, which shouldn't be what you are using directly.... |
Ok. That's what I thought originally. However, the llvm log seems slightly faster than the system libm on my mac. |
You can check Also note that I'm not sure what log function has the best performance and I'm merely responding to obvious issues in my notification. A few other issues to consider,
Anyway, I have no idea what's the best log implementation, or any libm function for that matter..... However, if you are interested in the algorithm, just blindly doing random testing may not give the most sensible result, and I've certainly seen cases where small effects matter at this level...... |
We clearly have the right algorithm and the original purpose of this issue is resolved. The julia log is overall reasonably competitive if not the fastest. There may be value in calling the system |
The log function can often be the dominant cost in certain algorithms, particularly those related to sampling and fitting statistical models, e.g. see this discussion.
I've implemented the Tang (1990) table-based algorithm here:
https://github.com/simonbyrne/libm.jl/blob/master/src/log_tang.jl
It seems to be about 10% faster than the openlibm function, but there is probably still room for improvement. Does anyone have any suggestions?
For reference:
Once we have native fused multiply-add operations (#8112), we can probably speed it up further (on processors that support them).
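To make the fused multiply-add point concrete: the polynomial part of a log implementation is typically evaluated with Horner's rule, where every step has the form `x * acc + c`. Written plainly, that step rounds twice; `fma` rounds once (and may fall back to slow software emulation on hardware without the instruction), while `muladd` lets the compiler pick whichever is faster. The sketch below is only an illustration: `horner` and `plain` are hypothetical names, and the coefficients are the leading Taylor terms of `log1p(r)/r`, not the ones used in the linked implementation.

```julia
# Horner's rule with a pluggable multiply-add step.
# `horner` and `plain` are hypothetical names for this sketch.
function horner(step, x::Float64, coeffs)
    acc = coeffs[end]
    for i in length(coeffs)-1:-1:1
        acc = step(x, acc, coeffs[i])    # acc = x * acc + coeffs[i]
    end
    return acc
end

plain(a, b, c) = a * b + c               # separate multiply and add, two roundings

coeffs = (1.0, -1/2, 1/3, -1/4, 1/5)     # leading terms of log1p(r)/r
r = 0x1p-9

println("plain:  ", horner(plain, r, coeffs))
println("muladd: ", horner(muladd, r, coeffs))   # compiler may or may not fuse
println("fma:    ", horner(fma, r, coeffs))      # always a single rounding per step
```

The three results should agree to within a few ulps; the difference that matters for this issue is speed on processors with a native FMA instruction.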