The 2012 discovery of the Higgs boson at the ATLAS experiment in the LHC relied crucially on the ability to track charged particles with exquisite precision (10 microns over a 10m length) and high reliability (over 99% of roughly 1000 charged particles per collision correctly identified).
In an attempt to speed up the calculation, researchers found that merely changing the underlying math library (which should only affect at most the last bit) resulted in some collisions being missed or misidentified.[1]
I was a user contributing to the LHC@Home BOINC project[2], where they ran into similar problems. They simulated beam stability, iterating the positions of the simulated particles over millions of steps. As is normal in BOINC, each work unit is computed at least three times, and if the results don't match, the work unit is queued for additional runs.
They noticed that they got a lot of work units that failed the initial check compared to other BOINC projects. Digging into it, they noticed that if a work unit had been computed by CPUs from a single manufacturer, i.e. all Intel, the runs agreed as expected. But if the work unit had been processed by mixed CPUs, i.e. at least one run on Intel and one on AMD, the runs very often disagreed.
That's when they discovered[3] this very issue: the rounding of various floating-point functions differed between vendors.
After switching to crlibm[4] for the elementary functions they used, the mixed-vendor problem went away.
[1]: https://www.davidhbailey.com/dhbtalks/dhb-icerm-2020.pdf
[2]: https://en.wikipedia.org/wiki/LHC@home
[3]: https://accelconf.web.cern.ch/icap06/papers/MOM1MP01.pdf
[4]: https://ens-lyon.hal.science/ensl-01529804/file/crlibm.pdf
That Zimmermann paper is far more useful than the article. Notably LLVM libm is correctly rounded for most single precision ops.
A notable omission is the crlibm/rlibm/core-math etc. libs, which claim to be more correct, but I suppose we can already be pretty confident about them.
lifthrasiir 32 days ago [-]
CORE-MATH is working directly with LLVM devs to get their routines to the LLVM libm, so no additional column is really required.
AlotOfReading 31 days ago [-]
I've found that LLVM is slightly worse than GCC on reproducibility once you get past the basic issues in libm. For example, LLVM will incorrectly round sqrt on ppc64el in some unusual circumstances:
https://github.com/J-Montgomery/rfloat/blob/8a58367db32807c8...
...on -ffast-math. Of course you'll have arbitrary behavior of (at least) a couple ULPs on -ffast-math, but it'll be faster (hopefully)! That's, like, the flag's whole idea.
AlotOfReading 30 days ago [-]
I didn't say it was a bug for exactly that reason, but as the comment explains, this doesn't affect any platform or compiler combination except ppc64 LLVM, nor even most usages of sqrt for that target.
LLVM has lots of small reproducibility issues like this that GCC doesn't, but also much better documentation around its limitations. The point of this library is to eliminate as many of those as possible without performance costs.
dzaima 30 days ago [-]
"doesn't affect other things" is in no way ever whatsoever a reason to do anything in any way related to thinking, believing, or hoping that it might, would, or should affect nothing, especially around compiler optimizations (fun sentence to write, and one I violate myself for some things, but trivially true regardless). I'd be curious to hear about actual issues though.
The ppc64 case looks like llvm very intentionally not using the existing square root instruction, instead emitting a sequence of manual operations that supposedly run faster. And it's entirely in its right to do so, and it should affect no correct code (not that it's even really possible to write "correct code" under -ffast-math).
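For illustration, that kind of expansion looks roughly like the following (my own sketch, not the actual instruction sequence LLVM emits; approx_rsqrt stands in for a hardware reciprocal-sqrt estimate, faked here with the classic bit trick):

    #include <cstdint>
    #include <cstring>

    static float approx_rsqrt(float x) {
        // Stand-in for a hardware rsqrt-estimate instruction.
        std::uint32_t i;
        std::memcpy(&i, &x, sizeof i);
        i = 0x5f3759df - (i >> 1);
        float r;
        std::memcpy(&r, &i, sizeof r);
        return r;
    }

    float fast_sqrt(float x) {
        // Real compiler expansions also special-case x == 0 and denormals; omitted here.
        float r = approx_rsqrt(x);
        r = r * (1.5f - 0.5f * x * r * r);   // Newton-Raphson step 1
        r = r * (1.5f - 0.5f * x * r * r);   // Newton-Raphson step 2
        return x * r;                        // sqrt(x) = x * rsqrt(x); fast, but not correctly rounded
    }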
AlotOfReading 30 days ago [-]
Note that this isn't even a case where a single sqrt() call is non-reproducible, but only sequences of inlined sqrt calls within an unrolled loop, and I agree with you.
Some broader context is probably warranted though. This originated out of a discussion with the authors of P3375 [0] about the actual performance costs of reproducibility. I suspected that existing compilers could already do it without a language change and no runtime cost using inline assembly magic. This library was the experiment to see if I was full of it.
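For the curious, the core of the "inline assembly magic" is just an empty asm statement used as an optimization fence. A minimal sketch (my illustration, not the library's actual code; the "+x" constraint shown is x86-64/SSE-specific, other targets need a different register constraint):

    static inline double fence(double x) {
        // The empty asm claims it may read/modify x, so the compiler cannot
        // constant-fold, reassociate, or contract across this point.
        asm volatile("" : "+x"(x));
        return x;
    }

    double example(double a, double b, double c) {
        // Fencing the product prevents it from being fused into an FMA with the add.
        return fence(a * b) + c;
    }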
There were only a few minor limitations it found. One was this issue, which happens "outside" what the library is attempting to do (though potentially still fixable as your godbolt link demonstrated). Another was that Clang and GCC have slightly different interpretations of "creative" register constraints. Clang's interpretation is closer to GCC's docs than GCC itself, but produces worse code.
Otherwise, this gives you reproducibility right up to the deeper compiler limitations like NaN propagation at essentially no performance cost. I wasn't able to find any "real" cases where it's not reproducible, only incredibly specific situations like this one across all 3 major compilers and even the minor ones I tried.
[0] https://isocpp.org/files/papers/P3375R2.html
> but only sequences of inlined sqrt calls within an unrolled loop
Somewhat-relatedly, that's also a problem with vectorized math libraries, affecting both gcc and clang, where the vectorized function has a different result from the scalar standard-libm one (and indeed gcc wants at least "-fno-math-errno -funsafe-math-optimizations -ffinite-math-only" to even allow using vector math libraries, even though it takes explicit flags to enable one; clang is fine with just "-fno-math-errno").
For what it's worth, clang has __arithmetic_fence for doing the exact thing you're using inline asm for, I believe; and the clang/llvm instruction-level constrained arith I noted would be the sane way to achieve this.
The code sample shown in P3375 should be just consequences of fma contraction on gcc/clang I believe? i.e. -ffp-contract=off makes the results consistent for both gcc and clang. I do think it's somewhat funky that -ffp-contract=on is the default, but oh well the spec allows it and it is a perf boost (and a precision boost.. if one isn't expecting the less precise result) and one can easily opt out.
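A tiny self-contained demo of the difference contraction makes (my own example, with values picked so a*b is inexact): the explicit std::fma call shows the contracted result, and whether the plain expression matches it depends on -ffp-contract.

    #include <cmath>
    #include <cstdio>

    int main() {
        // a*b is exactly 1 - 2^-54, which rounds to 1.0 as a standalone multiply.
        volatile double a = 1.0 + 0x1p-27;   // volatile blocks constant folding
        volatile double b = 1.0 - 0x1p-27;
        double c = -1.0;
        double fused   = std::fma(a, b, c);  // single rounding: -0x1p-54
        double unfused = a * b + c;          // 0x0p+0 unless the compiler contracts it into an fma
        std::printf("%a vs %a\n", fused, unfused);
    }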
Outside of -ffast-math and -ffp-contract=on (and pre-SSE x86-32 (i.e. ≥26-year-old CPUs) where doing things properly is a massive slowdown) I don't think clang and gcc should ever be doing any optimizations that change numerical values (besides NaN bit patterns).
Just optimization-fencing everything, while a nice and easy proof-of-concept, isn't something compiler vendors would just accept as the solution to implement; that's a ~tripling of IR instructions for each fp operation, which'd probably turn into a quite good compilation speed slowdown, besides also breaking a good number of correct optimizations. (though, again, this shouldn't even be necessary)
And -ffast-math should be left alone, besides perhaps desiring support to disable it at a given scope/function; I can definitely imagine that, were some future architecture to add a division instruction that can have 1ULP of error and is faster than the regular division, that compilers would absolutely use it for all divisions on -ffast-math, and you couldn't work around that with just optimization fences.
AlotOfReading 30 days ago [-]
Wasn't aware of __arithmetic_fence, though there's an open bug ticket noting that it doesn't protect against contraction (https://github.com/llvm/llvm-project/issues/91674). Still worth trying though. I was aware of GCC __builtin_assoc_barrier, but it wasn't documented to prevent contraction when I last checked. Appears they've fixed that since. Hadn't considered the IR / compilation speed issue. I'm aware of the broken optimizations, but they're not really a problem in practice as far as I can tell? You mainly lose some loop optimizations that weren't significant in the "serious" loop heavy numerics code I tested against at work.
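For reference, the GCC barrier is used like this (minimal sketch, GCC 12+; it just returns its argument while preventing reassociation across it):

    double sum_in_order(double a, double b, double c) {
        // Even under -ffast-math, (a + b) is evaluated as written and not
        // reassociated with the trailing + c.
        return __builtin_assoc_barrier(a + b) + c;
    }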
P3375 is mainly about contraction, but there's other issues that can crop up. Intermediate promotion occasionally happens and I've also seen cases of intermediate expressions optimized down to constants without rounding error. Autovectorization is also a problem for me given the tendency of certain SIMD units to have FTZ set. I also have certain compilers that are less well-behaved than GCC and Clang in this respect.
My concern isn't accuracy though. Compilers do that fine, no need to second guess them. My hot take is that accuracy is relatively unimportant in most cases. Most code is written by people who have never read a numerical analysis book in their life and built without a full awareness of the compiler flags they're using or what those flags mean for their program. That largely works out because small errors are not usually detectable in high level program behavior except as a consequence of non-reproducibility. I would much rather accept a small amount of rounding error than deal with reproducibility issues across all the hardware I work on.
I didn't really mean the loop thing as much of a problem for the goal of reproducibility (easy enough to just not explicitly request a vector math library).
aarch32 NEON does have an implicit FTZ, and, yeah, such are annoying; though gcc and clang don't use it without -ffast-math (https://godbolt.org/z/3b11dW559)
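(Quick illustration of what the implicit flush changes, in case it's not obvious: the true product below is subnormal, so an FTZ unit gives 0.0f while an IEEE-compliant one gives 0x1p-141f.)

    float ftz_demo(void) {
        volatile float tiny = 0x1p-140f;  // already subnormal (< 2^-126); volatile blocks compile-time folding
        return tiny * 0.5f;               // IEEE: 0x1p-141f; with FTZ/DAZ: 0.0f
    }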
I do agree that getting consistent results would definitely make sense as the default.
pclmulqdq 31 days ago [-]
The current correctly rounded libraries work well with 32-bit float, but only provably produce correct rounding with 64-bit float in some cases. It turns out it's rather difficult to prove that you will produce a correctly rounded result in all cases, even if you have an algorithm that works. That means that each of these libraries supports only a subset of operations.
lifthrasiir 32 days ago [-]
> All the standard Maths libraries which claim to be IEEE complaint I have tested are not compliant with this constraint.
It is possible to read the standard in a way that they still remain compliant. The standard, as of IEEE 754-2019, does not require recommended operations to be implemented in accordance with the standard in my understanding; implementations are merely recommended to ("should") define recommended operations with the required ("shall") semantics. So if an implementation doesn't claim that a given recommended operation is indeed compliant, the implementation remains compliant in my understanding.
One reason for which I think this might be the case is that not all recommended operations have a known correctly rounded algorithm, in particular bivariate functions like pow (in fact, pow is the only remaining one at the moment IIRC). Otherwise no implementation would ever be compliant as long as those operations are defined!
stephencanon 31 days ago [-]
Right; IEEE 754 (2019) _recommends_ that correctly rounded transcendental functions be provided, but does not require it. The next revision of the standard may require that a subset of correctly-rounded transcendental functions be provided (this is under active discussion), but would still not require how they are bound in the language (i.e. a C stdlib might provide both `pow` and `cr_pow` or similar), and might not require that all functions be correctly rounded over their entire range (e.g. languages might require only that `cr_sin` be correctly rounded on [-π,π]).
adgjlsfhk1 32 days ago [-]
It's really hard for me to think that this is a good solution. Normal users should be using Float64 (where there is no similar solution), and Float32 should only be used when Float64 computation is too expensive (e.g. GPUs). In such cases, it's hard to believe that doing the math in Float64 and converting will make anyone happy.
jefftk 31 days ago [-]
> Float32 should only be used when Float64 computation is too expensive
Or when you're bottlenecked on memory and want to store each number in four bytes instead of eight.
saagarjha 32 days ago [-]
Single-precision is too expensive for GPUs, unfortunately.
pclmulqdq 31 days ago [-]
They may be too expensive for ML (or really not pareto-optimal for ML), but people use GPUs for a lot of things.
alkonaut 32 days ago [-]
I think you mean double precision?
saagarjha 32 days ago [-]
No.
alkonaut 31 days ago [-]
Can you elaborate? Outside of AI workloads, almost all computations on GPUs are single precision. Double precision is pretty rare, and smaller precision is mostly useless outside AI (and obviously irrelevant for trig precision)
saagarjha 31 days ago [-]
Unfortunately almost all computations on GPUs are AI workloads.
pclmulqdq 30 days ago [-]
Uh, what?
The main usage of GPUs is graphics processing, and it's not even close. That is what Graphics Processing Units are built for. AI is probably the main use of datacenter GPUs today, but even that isn't "almost all" in comparison to the HPC work out there.
winocm 32 days ago [-]
Off topic but tangentially related, here’s a fun fact: DEC Alpha actually ends up transparently converting IEEE single precision floats (S-float) to double precision floats (T-float, or register format) when performing register loads and operations.
dzaima 31 days ago [-]
For 32-bit float single-operand ops it's simple enough to rely on a brute-force check. For 64-bit floats, though, while the goal of "sin(x) in library A really should match sin(x) in library B" is nice, it essentially ends up as "sin(x) in library A really should match... ...well, there's no option other than library A if I want sin(x) to not take multiple microseconds of bigint logic".
Though, a potentially useful note for two-argument functions: a correctly rounded implementation means it's possible to specialize certain constant operands to much better implementations while preserving the same result (log(2,x), log(10,x), pow(x, 0.5), pow(x, 2), pow(x, 3), etc.; floor(log(int,x)) being potentially especially useful if an int log isn't available).
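To make that concrete, a hypothetical sketch (cr_pow is a stand-in name for a correctly rounded pow, e.g. from CORE-MATH; the identities only hold because correct rounding pins down the exact result):

    #include <cmath>

    double cr_pow(double x, double y);  // stand-in: assumed correctly rounded pow

    double pow_specialized(double x, double y) {
        // x*x rounds the exact x^2 once, which is precisely what a correctly
        // rounded pow(x, 2.0) must return.
        if (y == 2.0) return x * x;
        // IEEE 754 requires sqrt to be correctly rounded, so it matches a
        // correctly rounded pow(x, 0.5) for positive finite x (special cases
        // like -0.0 and -inf need separate handling).
        if (y == 0.5 && x > 0.0) return std::sqrt(x);
        return cr_pow(x, y);
    }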
vogu66 32 days ago [-]
Would working with unums side-step the issue for math problems?
jcranmer 31 days ago [-]
No, the fundamental issue is the difficulty of proving correctly-rounded results, which means implementations end up returning different results in practice. Unums do nothing to address that issue, except possibly by virtue of not having multiple implementations in the first place.
stncls 32 days ago [-]
Floating-point is hard, and standards seem like they cater to lawyers rather than devs. But a few things are slightly misleading in the post.
1. It correctly quotes the IEEE754-2008 standard:
> A conforming function shall return results correctly rounded for the applicable rounding direction for all operands in its domain
and even points out that the citation is from "Section 9. *Recommended* operations" (emphasis mine). But then it goes on to describe this as a "*requirement*" of the standard (it is not). This is not just a mistype; the post actually implies that implementations not following this recommendation are wrong:
> [...] none of the major mathematical libraries that are used throughout computing are actually rounding correctly as demanded in any version of IEEE 754 after the original 1985 release.
or:
> [...] ranging from benign disregard for the standard to placing the burden of correctness on the user who should know that the functions are wrong: “It is following the specification people believe it’s following.”
As far as I know, IEEE754 mandates correct rounding for elementary operations and sqrt(), and only for those.
2. All the mentions of 1 ULP in the beginning are a red herring. As the article itself mentions later, the standard never cares about 1 ULP. Some people do care about 1 ULP, just because it is something that can be achieved at a reasonable cost for transcendentals, so why not do it. But not the standard.
3. The author seems to believe that 0.5 ULP would be better than 1 ULP for numerical accuracy reasons:
> I was resounding told that the absolute error in the numbers are too small to be a problem. Frankly, I did not believe this.
I would personally also tell that to the author. But there is a much more important reason why correct rounding would be a tremendous advantage: reproducibility. There is always only one correct rounding. As a consequence, with correct rounding, different implementations return bit-for-bit identical results. The author even mentions falling victim to FP non-reproducibility in another part of the article.
4. This last point is excusable because the article is from 2020, but "solving" the fp32 incorrect-rounding problem by using fp64 is naive (not guaranteed to always work, although it will with high probability) and inefficient. It also does not say what to do for fp64. We can do correct rounding much faster now [1, 2]. So much faster that it is getting really close to non-correctly-rounded, so some libm may one day decide to switch to that.
[1] https://github.com/rutgers-apl/rlibm
[2] https://github.com/taschini/crlibm
3. >> I was resounding told that the absolute error in the numbers are too small to be a problem. Frankly, I did not believe this.
> I would personally also tell that to the author. But there is a much more important reason why correct rounding would be a tremendous advantage: reproducibility.
This is also what the author wants based on his own experiences, but failed to realize/state explicitly: "People on different machines were seeing different patterns being generated which meant that it broke an aspect of our multiplayer game."
So yes, the reasons mentioned as a rationale for more accurate functions are in fact rationale for reproducibility across hardware and platforms. For example going from 1 ulp errors to 0.6 ulp errors would not help the author at all, but having reproducible behavior would (even with an increased worst case error).
Correctly rounded functions mean the rounding error is the smallest possible, and as a consequence every implementation will always return exactly the same results: this is the main reason why people (and the author) advocate for correctly rounded implementations.
jefftk 31 days ago [-]
> "solving" the fp32 incorrect-rounding problem by using fp64 is naive (not guaranteed to always work, although it will with high probability)
The key thing is there are only 2^32 float32s so you can check all of them. It sounds to me like the author did this, and realized they needed some tweaks for correct answers with log.
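A sketch of what that exhaustive check can look like (the reference/candidate pair here is just a stand-in; a real check would compare against an MPFR-backed oracle, and the loop over all 2^32 bit patterns takes a while):

    #include <cmath>
    #include <cstdint>
    #include <cstring>
    #include <cstdio>

    int main() {
        std::uint64_t mismatches = 0;
        for (std::uint64_t bits = 0; bits <= 0xFFFFFFFFull; ++bits) {
            float x;
            std::uint32_t u = static_cast<std::uint32_t>(bits);
            std::memcpy(&x, &u, sizeof x);
            if (std::isnan(x)) continue;                    // NaN payloads may legitimately differ
            float candidate = std::log(x);                  // routine under test (logf)
            float reference = (float)std::log((double)x);   // stand-in oracle: compute in double, round once
            if (std::memcmp(&candidate, &reference, sizeof candidate) != 0)
                ++mismatches;
        }
        std::printf("%llu mismatches\n", (unsigned long long)mismatches);
    }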
Also, there's a group of people who have been running tests on common libms, reporting their current accuracy states here: https://members.loria.fr/PZimmermann/papers/accuracy.pdf (that paper is updated ~monthly).