The key for efficient implementations of rather complex functions such as the trigonometric ones is first to understand how much precision actually is needed (and in which range) and then choosing the proper math which approximates this goal the best. Instead of the go-to Taylor series approximations I’m showing a continued fraction series expansion for the tanh function as an example here. In addition some special cases and outlook are presented as well.
A complete and precise tanh computation as provided by some math libraries is computationally very, very expensive and in DSP land this is typically avoided whenever it’s possible. In my current example I was in need of a fast computation method but one which should also gives an absolute precise transfer curve and harmonic content match even if driven at around +12dBFS.
For this particular application I’ve chosen a continued fraction for tanh which converges quite fast with just a number of iterations. After testing a little bit it turned out that 7 divisions in series are already enough to match the desired goal. Though, 7 divisions is also summing up to some amount of cpu consumption which is not negligible if one wants to compute some other serious stuff on top (e.g oversampling) and want to compute that on realtime audio streams without taxing the cpu that much.
But since this is finite now, one can simplify the expression by some basic math or even better let the math program do such boring stuff. This directly leads to
which is computationally already much better now since just one single division is left. All other stuff can be expressed in series of multiplications and additions (which are cpu wise the cheapest, except the basic bit operations). In this case, I basically followed the Horner scheme for denominator and divisor separately.
This will lead to super efficient assembler code and the reason is that it also minimizes the number of register loads: As written below in the code just one single register is used for a(x) now where a simple multiply/add stream is then performed on. The same is done with b(x). At the end the one single division is applied.
So the code ended up into being just 13 multiply/adds plus a division (and the few register loads of course) which translates into around 20 cpu cycles or so on a modern cpu architecture. Still it exactly matches the harmonic content of the model and also the transfer curve turns out to be a precise match in-between the desired dynamic ranges.
Since such approximations are “blowing up” sooner or later which basically means that at some higher input they typically go beyond the [-1 … 1] range output wise, it’s a common practise to add a hard-clipper which simply clamps the output values back to [-1 … 1] afterwards. This should be considered carefully though. In case of a series which converges not that fast respectively blows up early the hard-clipper will break that curves continuity which causes nasty distortion and the clipper’s distortion itself will dominate and cover the tanh harmonic spectrum.
When even 20 cpu cycles is too much one should have a look at the Pade approximations as an alternative. The simpler ones can be implemented with saving some further cpu cycles but however, they might not match the models curve and spectrum anymore. Anyway, this is always subject to what the desired goal is and which tradeoff is acceptable for that specific design.