The key for efficient implementations of rather complex functions such as the trigonometric ones is first to understand how much precision actually is needed (and in which range) and then choosing the proper math which approximates this goal the best. Instead of the go-to Taylor series approximations I’m showing a continued fraction series expansion for the tanh function as an example here. In addition some special cases and outlook are presented as well.

A complete and precise tanh computation as provided by some math libraries is computationally very, very expensive and in DSP land this is typically avoided whenever it’s possible. In my current example I was in need of a fast computation method but one which should also gives an absolute precise transfer curve and harmonic content match even if driven at around +12dBFS.

For this particular application I’ve chosen a continued fraction for tanh which converges quite fast with just a number of iterations. After testing a little bit it turned out that 7 divisions in series are already enough to match the desired goal. Though, 7 divisions is also summing up to some amount of cpu consumption which is not negligible if one wants to compute some other serious stuff on top (e.g oversampling) and want to compute that on realtime audio streams without taxing the cpu that much.

But since this is finite now, one can simplify the expression by some basic math or even better let the math program do such boring stuff. This directly leads to

which is computationally already much better now since just one single division is left. All other stuff can be expressed in series of multiplications and additions (which are cpu wise the cheapest, except the basic bit operations). In this case, I basically followed the Horner scheme for denominator and divisor separately.

This will lead to super efficient assembler code and the reason is that it also minimizes the number of register loads: As written below in the code just one single register is used for a(x) now where a simple multiply/add stream is then performed on. The same is done with b(x). At the end the one single division is applied.

So the code ended up into being just 13 multiply/adds plus a division (and the few register loads of course) which translates into around 20 cpu cycles or so on a modern cpu architecture. Still it exactly matches the harmonic content of the model and also the transfer curve turns out to be a precise match in-between the desired dynamic ranges.

Since such approximations are “blowing up” sooner or later which basically means that at some higher input they typically go beyond the [-1 … 1] range output wise, it’s a common practise to add a hard-clipper which simply clamps the output values back to [-1 … 1] afterwards. This should be considered carefully though. In case of a series which converges not that fast respectively blows up early the hard-clipper will break that curves continuity which causes nasty distortion and the clipper’s distortion itself will dominate and cover the tanh harmonic spectrum.

When even 20 cpu cycles is too much one should have a look at the Pade approximations as an alternative. The simpler ones can be implemented with saving some further cpu cycles but however, they might not match the models curve and spectrum anymore. Anyway, this is always subject to what the desired goal is and which tradeoff is acceptable for that specific design.

Related links:

Great post!

Thanks!

this was a nice read, it’s kinda interesting to see how only a handful of mathematical operations can “copy” nature. (assuming the tanh curve is a part of nature π )

Thanks! Didn’t understand all this, but it leads me to one question:

Do most plug-ins run best at higher dbFS?

In these days we commonly record at lower dbFS, and I don’t have the habit of normalizing as I find something like -8dbFS for the master to be perfectly ok.

It depends on how such algorithms are scaled to the outside of the actual plug-in. But there seems to be no commonly accepted standard between developers except maybe in the broadcast world. However, if there is sufficient gain adjustments offered, then this should be no prob.

Thanks! So, gaining up in front and down out is maybe a good habit..

Well your very own FerricTDS claims to best operate at 0dBFS on the input, isn’t it ? π

Well it features ‘TRIM’ so you can adjust the operating reference level on your own in between -24 and +24dB range, ain’t that cool? π

well again with this debate about dbfs.. every developer chooses its reference value, every emulated (“analog”, “true” analog, “hyper”analog, “madafokke” analog or even the italian “socc’mel” analog) has a different reference value,simply because the original effect was developed in that way. again, uk is different from america, both are different from russia, and so on.take your time, and find the sweetspot.

a good developer’s work is to provide the plugin with anything necessary to ensure precise calculations at every setting in the range of the possibilities of the plugin. and this is what this post is about. the rest is up to you. (otherwise herbert would provide your plugin download with a little tony maserati miniaturised)

once found that sweetspot, as a general and personal rule, i found that using compressors, saturators and other level-dependant inserts , there is effectively a Β± 10db range, that roughly reflects the -21(or-18,or even -8,whatever) to 0 dbfs range. if you set the input low, the effect will play as transparent mixing tool, and setting it high, it’ll work as a colouring effect.

Of course fine adjustment is needed, but once found this sweetspot you should tweak the parameters so that the music won’t play shit passing through them.

an example:with comps, the release timing should be adjusted to compensate the input variation, and with saturators you should put two eqs, one before and one after the effect, with inverted curves (or approximately inverted if you need it) to make the saturator react better with your audio content. it introduces frequency dependance to every effect, without any audible phase distortion.

if you love simplicity,choose another job, make money, and buy those fabulous plugins that transform music in gold only by owning them.

otherwise, develop your own ideas and build your own tools. thanks herbert for choosing the last one.

Great post! Which assembler was used in the figure labeled “after conversion into assembler code”? SynthMaker’s inline assembler?

You just need SSE instruction set support and adopt the constants declaration cause that differs in all those (inline assembler) compilers.

Nice sounding and efficiant for sure π

Thanx!

Hi!

Did you try “A Fast, Compact Approximation of the Exponential Function” for tanh() calculation?

Click to access Schraudolph99.pdf

No, because that does not save you the one division and introduces other tradeoffs…

You’re right. Just checked it. This implementation has very low precision and “stepped” rounding and due to this not appliable for sound processing. Never checked it for sound before.

BTW, for modern CPUs with SSE it’s much slower than even C library exp() due to float point -> integer convertion and memory access.

Here’s my performance test results:

C library exp() (SSE2, double precision):

Test with 192000 random samples:

Time: 6 ms, Cycles: 14883k

Test with 192000 impulse samples:

Time: 6 ms, Cycles: 14868k

(x^7 + 378 * x^ 5 …) / (…) with SSE2, double precision, C compiler optimization

Test with 192000 random samples:

Time: 3 ms, Cycles: 8695k

Test with 192000 impulse samples:

Time: 3 ms, Cycles: 8655k

(((x^2 + 378) * x^2 …))) with SSE2, double precision, C compiler optimization

Test with 192000 random samples:

Time: 3 ms, Cycles: 7484k

Test with 192000 impulse samples:

Time: 3 ms, Cycles: 7372k

And this is “fast” exp() from article (same compiler settings):

Test with 192000 random samples:

Time: 7 ms, Cycles: 20028k

Test with 192000 impulse samples:

Time: 7 ms, Cycles: 20067k

interesting – thanks for sharing!

Regarding blowing up at high inputs: rational polynomials can have finite limits in infinities as long as the order of the numerator and denominator polynomials are equal (eg x/(1 + x) has finite limits). This obviously limits one to even functions, but one can fix that with variety of ways (multiply by signum, abs the denominator etc; eg x/(1 + abs(x)) is a sigmoid with finite limits). Obviously matching derivatives on both sides of zero is a good idea. Avoids the need for separate clipper though.

Effing brilliant!

This is a clever approach! Thank you for sharing this information! I’m curious as to how you determined that you needed a depth of 7 divisions of the continued fraction. The way I am thinking of this is that an approximation for tanh used in a simple algorithm output = tanh(A * input), where A is how much the input signal is driven, must be accurate up to the resolution of the audio signal. Would you then set up an inequality to determine this? For example: | tanh(4) – tanhApprox(4) | < 1 / 2^16, where tanhApprox is the continued fraction, where the 4 comes from a maximum of 12 db of gain, or roughly 4 times louder, and the value 1 / 2^16, coming from 16-bit audio resolution. When I solve this I find that tanhApprox would indeed need to be at least 7 levels deep to achieve this accuracy. Is this similar to how you calculated that you needed 7 divisions too?