HyperHessians.jl is a package for computing Hessians using forward mode automatic differentiation.
It works similarly to `ForwardDiff.hessian` but should have better run-time and compile-time performance in all cases.
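As a quick sanity check, the result can be compared against `ForwardDiff.hessian`. Below is a minimal sketch (the test function here is arbitrary, borrowed from the example further down):

```julia
using HyperHessians, ForwardDiff

g(x) = sum(cumprod(x))   # arbitrary test function
x = rand(4)

# the two packages should agree up to floating point round-off
HyperHessians.hessian(g, x) ≈ ForwardDiff.hessian(g, x)
```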
There are some limitations compared to ForwardDiff.jl:
- Only support for basic numeric types (`Float64`, `Float32`, etc.).
- Currently no "tagging" of numbers to avoid perturbation confusion.
- Not as many primitives implemented.
To compute the Hessian of a function `f` w.r.t. the vector `x`, the quick and dirty way is to call `HyperHessians.hessian(f, x)`:
```julia
julia> x = [1.0, 2.0, 3.0, 4.0];

julia> HyperHessians.hessian(x -> sum(cumprod(x)), x)
4×4 Matrix{Float64}:
  0.0  16.0  10.0  6.0
 16.0   0.0   5.0  3.0
 10.0   5.0   0.0  2.0
  6.0   3.0   2.0  0.0
```
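To see where these entries come from, note that here f(x) = x₁ + x₁x₂ + x₁x₂x₃ + x₁x₂x₃x₄, so for example ∂²f/∂x₁∂x₂ = 1 + x₃ + x₃x₄ = 1 + 3 + 3·4 = 16, matching the `[1, 2]` and `[2, 1]` entries, and the diagonal is zero since f is linear in each individual xᵢ.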
This also works for scalar functions:
```julia
julia> f(x) = exp(x) / sqrt(sin(x)^3 + cos(x)^3);

julia> HyperHessians.hessian(f, 2.0)
82.55705026089272
```
When the input is a vector, the basic usage will, however, not give the best performance.
For best performance we want to do the following things:

- Cache the input array of custom numbers that HyperHessians uses so they can be reused over multiple calls to `hessian`.
- Decide on a good "chunk size", which is roughly the size of the section of the Hessian computed in every call to the function.
- Pre-allocate the output Hessian matrix.
Steps 1 and 2 are done by creating a `HessianConfig` object:

```julia
x = rand(32);                                      # input array
chunk_size = HyperHessians.Chunk{8}()              # chosen chunk size
cfg = HyperHessians.HessianConfig(x, chunk_size)   # create the config object
```
The larger the chunk size, the larger the part of the Hessian that is computed on every call to `f` (if the chunk size equals the length of the input vector, the whole Hessian is computed in one call to `f`).
However, with a larger chunk size the special numbers HyperHessians uses become larger, and if they become too large this can lead to inefficient execution.
The choice of chunk size is therefore a trade-off, and the optimal one is likely to depend on the particular function being differentiated. A decent overall choice seems to be a chunk size of 8. It is also, in general, a good idea to pick a chunk size that is a multiple of 4 to use SIMD effectively.
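If performance matters, it can be worth benchmarking a few chunk sizes for the function at hand. Below is a minimal sketch (the input length and the candidate chunk sizes are arbitrary), using the `ackley` function from DiffTests.jl:

```julia
using HyperHessians, BenchmarkTools, DiffTests

x = rand(64)
H = similar(x, length(x), length(x))

# time hessian! for a few candidate chunk sizes
for c in (4, 8, 16)
    cfg = HyperHessians.HessianConfig(x, HyperHessians.Chunk{c}())
    t = @belapsed HyperHessians.hessian!($H, DiffTests.ackley, $x, $cfg)
    println("chunk size $c: ", round(t * 1e6; digits = 1), " μs")
end
```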
The `chunk_size` argument can be left out, and HyperHessians will try to determine a reasonable choice.
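For example, the config above could also be created without specifying a chunk:

```julia
cfg = HyperHessians.HessianConfig(x)   # chunk size chosen automatically
```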
If the chunk size `c` is smaller than the length `n` of the input vector, the function will be called `k * (k + 1) ÷ 2` times, where `k = ceil(Int, n / c)`, each time computing a part of the Hessian:
julia> mysum(x) = (@info "got called"; sum(x));
julia> x = rand(8); n = length(x); c = 4;
julia> cfg = HyperHessians.HessianConfig(x, HyperHessians.Chunk{c}());
julia> HyperHessians.hessian(mysum, x, cfg)
[ Info: got called
[ Info: got called
[ Info: got called
julia> k=ceil(Int, n / c); k*(k+1)÷2
3
Finally, it is also a good idea to pre-allocate the output Hessian and use the in-place `hessian!` function instead.
Putting it all together, the result would look something like this:
```julia
julia> f(x) = sum(cumprod(x));   # a function taking a vector input

julia> x = rand(8);

julia> H = similar(x, 8, 8);

julia> cfg = HyperHessians.HessianConfig(x, HyperHessians.Chunk{8}());

julia> HyperHessians.hessian!(H, f, x, cfg);
```
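The in-place result should match the basic usage shown earlier:

```julia
H ≈ HyperHessians.hessian(f, x)   # should hold: both compute the same Hessian
```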
In chunked mode (when the chunk size is smaller than the input vector length) it is possible to use multithreading to compute chunks in parallel.
This is done by creating a `HessianConfigThreaded` object instead of a `HessianConfig` object.
Below is a benchmark:
```julia
julia> using HyperHessians, DiffTests, BenchmarkTools

julia> Threads.nthreads()
8

julia> x = rand(128);

julia> cfg = HyperHessians.HessianConfig(x);

julia> cfg_thread = HyperHessians.HessianConfigThreaded(x);

julia> H = similar(x, length(x), length(x));

julia> @btime HyperHessians.hessian!(H, DiffTests.ackley, x, cfg);
  883.716 μs (0 allocations: 0 bytes)

julia> @btime HyperHessians.hessian!(H, DiffTests.ackley, x, cfg_thread);
  128.609 μs (41 allocations: 3.89 KiB)
```
To get an estimate of the performance of HyperHessians, we here benchmark it against the ForwardDiff.jl package for two common benchmark functions, `rosenbrock` and `ackley`.
The results can be reproduced with `benchmark/fdiff.jl`.
| Function | Input length | Time ForwardDiff | Time HyperHessians | Speedup |
|---|---|---|---|---|
| `ackley` | 1 | 126.322 ns | 64.480 ns | 2.0 |
| `ackley` | 8 | 1.636 μs | 446.505 ns | 3.7 |
| `ackley` | 128 | 3.533 ms | 823.218 μs | 4.3 |
| `rosenbrock_1` | 1 | 84.264 ns | 36.022 ns | 2.3 |
| `rosenbrock_1` | 8 | 1.119 μs | 379.582 ns | 2.9 |
| `rosenbrock_1` | 128 | 3.392 ms | 854.630 μs | 4.0 |