EDIT: after double-checking my work, I realized I have a better bound on maximum error, but not a better average error. So, the magic number depends on the goal or metric, but mean relative error seems reasonable. Leaving my original comment here, but note the big caveat that I’m half wrong.
One can do better still - 0x7EF311BC is a near optimal value at least for inputs in the range of [0.001 .. 1000].
The simple explanation here is:
The post’s number 0x7EFFFFFF results in an approximation that is always equal to or greater than 1/x. The value 0x7EEEEBB3 is better, but it’s less than 1/x around 2/3rds of the time. My number 0x7EF311BC appears to be as well balanced as you can get, half the time greater and half the time less than 1/x.
To find this number, I have a Jupyter notebook that plots the maximum absolute value of relative error over a range of inputs, for a range of magic constants. Once it’s setup, it’s pretty easy to manually binary search and find the minimum. The plot of max error looks like a big “V”. (Edit while the plot of mean error looks like a big “U” near the minimum.
The optimal number does depend on the input range, and using a different range or allowing all finite floats will change where the optimal magic value is. The optimal magic number will also change if you add one or more Newton iterations, like in that github snippet (and also seen in the ‘quake trick’ code).
PPS maybe 0x7EF0F7D0 is a pretty good candidate for minimizing the average relative error…?
mbitsnbites 17 days ago [-]
Your suggestion got me intrigued. I have a program that does an exhaustive check for maximum and average error, so I'll give your numbers a spin.
mbitsnbites 17 days ago [-]
Given my search criteria, the optimal magic number turns out to be: 0x7ef311c2
Initial approximation:
Good bits min: 4
Good bits avg: 5.242649912834
Error max: 0.0505102872849 (4.30728 bits)
Error avg: 0.0327344845327 (4.93304 bits)
1 NR step:
Good bits min: 8
Good bits avg: 10.642581939697
Error max: 0.00255139507338 (8.61450 bits)
Error avg: 0.00132373889641 (9.56117 bits)
2 NR steps:
Good bits min: 17
Good bits avg: 19.922843217850
Error max: 6.62494557693e-06 (17.20366 bits)
Error avg: 2.62858584054e-06 (18.53728 bits)
3 NR steps:
Good bits min: 23
Good bits avg: 23.674004554749
Error max: 1.19249960972e-07 (22.99951 bits)
Error avg: 3.44158509521e-08 (24.79235 bits)
Here, "good bits" is 24 minus the number of trailing non-zero-bits in the integer difference between the approximation and the correct value, looking at the IEEE 754 binary representation (if that makes sense).
Also, for the NR steps I used double precision for the inner (2.0 - x * y) part, then rounded to single precision, to simulate FMA, but single precision for the outer multiplication.
dahart 16 days ago [-]
Ah very nice, I was close with using max error - 0.05051 is the same number I got. Pretty sure 0x7ef311c2 came up for me at least a few times as I was fiddling with parameters. Is this using minimum good bits as the deciding criteria, or is it the best overall number using one of the averages and also 1-3 NR steps? Did you limit the input range, or use all finite floats? Having the min/avg error in bits is nice, it’s more intuitive than relative error.
I like the FMA simulation, that’s smart; I didn’t think about it. I did my search in Python. I don’t have it in front of me right now, and off the top of my head I’m not even sure whether my NR steps are in Python precision or fp32. :P My posts in this thread were with NR turned off, I wanted to find the best raw approximation and noticed I got a different magic number when using refinement. It really is an amazing trick, right? Even knowing how it works it still looks like magic when plotting the result.
Thanks for the update!
BTW I was also fiddling with another possible trick that is specific to reciprocal. I suspect you can simply negate all the bits except the sign and get something that’s a decent starting point for Newton iters, though it’s a much worse approximation than the subtraction. So maybe (x ^ 0x7fffffff). Not sure if negating the mantissa helps or if it’s better to negate only the exponent. I haven’t had time to analyze it properly yet, and I don’t know of any cases where it would be preferred, but I still think it’s another interesting/cute observation about how fp32 bits are stored.
mbitsnbites 16 days ago [-]
When measuring the errors I exhaustively iterate over all possible floats in the range [1, 2), by enumerating all IEEE 754 single precision representations in that range. That's "only" 2^23 numbers, so perfectly doable.
My selection criteria was abit complex, but something like this:
1. Maximize number of accurate bits in the approximation.
2. Same in NR step 1, then NR step 2 etc.
3. Minimize the max error in the approximation, and then the avg ertor in the approximation.
4. Same for NR step 1, 2, ...
dahart 18 days ago [-]
This code is technically UB in C++, right? [1] Has anyone run into a case where it actually didn’t work? Just curious. I’ve often assumed that if C++ compilers didn’t compile this code, all hell would break loose.
It might be nice to start sharing modern/safe versions of this snippet & the Quake thing. Is using memcpy the only option that is safe in both C and C++? That always felt really awkward to me.
Yes, most casts via pointers are formally UB in both C and C++, and you can induce weird behavior by compilers if you transmit casted pointers through a function boundary (see [0] for a standard example: notice how the write to *pf vanishes into thin air). Since people like doing it in practice, GCC and Clang have an -fno-strict-aliasing flag to disable this optimization, and the MSVC compiler doesn't use it in the first place (except for restrict pointers in C). They don't go too far with it regardless, since lots of code using the POSIX socket API casts around struct pointers as a matter of course.
Apart from memcpy(), the 'allowed' methods include unions in C (writing to one member and reading from another), and bit_cast<T>() and std::start_lifetime_as<T>() in C++.
There are two additional ways of making this work.
The first is to allocate the memory using char a[sizeof(float)]. In C, char pointers may alias anything, so then you can do pointer conversions that would normally be undefined behavior and it should work. The other option is to use the non-standard __attribute__((__may_alias__)) on the pointer.
By the way, using union types for this is technically undefined behavior in the C and C++ standards, but GCC and Clang decided to make it defined as an implementation choice. Other compilers might not.
gpderetta 18 days ago [-]
The union trick is actually defined in C.
And note that while char can alias anything, the reverse is not true: i.e. you can't generally cast a char array to anything else and expect sensible behaviour. There are ways to make this work (placement new in C++ for example), but it is not a way to escape TBAA: if you store a float in char array you can't then cast it to int with impunity.
atiedebee 18 days ago [-]
To be more precise, it is defined since c99[0]. In c89 it was undefined, but type punning is the most used/sensible behaviour, so they changed it in c99.
A read of the C11 standard draft, which would have this amendment if it were accepted by the C standards committee, shows that this has not been added:
Type punning via union types is therefore undefined behavior unless your compiler implements an extension to define it like GCC and Clang do.
gpderetta 17 days ago [-]
Hum, from your wg14 link: 6.5.2.3 comma 3 and note 95. I thought that was the note that was added on TC3.
Also the note is non-normative, so it is only clarifying preexisting behaviour.
But I'm far from an expert on the C standard. Also that was the C11 draft, maybe the note was removed before the final standard.
Edit: I believe the alias rules are in 6.5 comma 7; specifically:
> An object shall have its stored value accessed only by an lvalue expression that has one of
the following types:
[...]
>an aggregate or union type that includes one of the aforementioned types among its
members (including, recursively, a member of a subaggregate or contained union),
Edit2: neither commas nor the note have changed in the 202y Draft.
ryao 17 days ago [-]
You need more language in order to say that type punning is allowed. Implicitly, only the type of the last write is permitted reads, and anything else is undefined behavior. At least, this is my understanding based on my own read and the guidance from the GCC developers.
gpderetta 17 days ago [-]
From a cursory search I can't find any languages in the C standard that disallow reading not from the last written member.
I'm familiar with such language in the C++ standard.
Edit: On the contrary, this note 92:
> 92)If the member used to read the contents of a union object is not the same as the member last used to store a value in the
object the appropriate part of the object representation of the value is reinterpreted as an object representation in the new
type as described in 6.2.6 (a process sometimes called type punning). This can possibly be a non-value representation.
Edit2: and that's specifically the text that was added by dr283. I think you might be confusing with a different DR (don't remember the number) that specifically asked if generalized type punning was possible as long as an union containing the aliased types was visible in the translation unit. I think that's still open although GCC definitely forbids it.
ryao 17 days ago [-]
Where did you find that note? I do not see it in the C standard draft I linked.
gpderetta 17 days ago [-]
It is in n3301, the last 202y draft. In the 201x draft it was note 95.
ryao 17 days ago [-]
I see it now. I guess the GCC developers were wrong then.
gpderetta 17 days ago [-]
I think it is a bit more complicated. The rule, together with the aliasing rule, if taken a face value, means you could do unrestricted aliasing as long as you cast to an union type on access. I believe that's the interpretation the GCC Devs reject as is makes TBAA ineffective.
Instead they interpret it narrowly to only allow punning through objects that are actual unions (as described in the GCC docs).
Unsurprisingly the standard is kind of a mess.
17 days ago [-]
17 days ago [-]
17 days ago [-]
atiedebee 17 days ago [-]
Oh that's interesting. I guess I should actually look at the standard instead of taking cppreference's word for it next time
gpderetta 18 days ago [-]
Yes, it 2025, I thought that we could at least imply C99 when talking about plain C :).
I'm probably an optimist.
ryao 17 days ago [-]
It does not matter. The C99 standard does not define this behavior:
For C++, `bit_cast<uint32_t>(0.f)` should be Well Defined, right? I'm curious, in C, is union-casting float->uint32_t also Perfectly Legal And Well Defined?
(I am not a C or C++ expert.)
ckjellqv 18 days ago [-]
It should be well-defined with reinterpret_cast<uint32_t&> though.
LegionMammal978 18 days ago [-]
No, reinterpret_cast doesn't change the type of the underlying object. The rule in [basic.lval]/11 (against accessing casted values) applies to all glvalues, whether they come from pointers or references.
Asooka 18 days ago [-]
It would be better if compilers were fixed. Current treatment of UB is insane. The default should be no UB and a pragma that lets you turn it on for specific parts. That used to be the case and some of the world's most performant software was written for compilers that didn't assume no pointer aliasing nor no integer overflow (case in point: Quake).
The big problem, apart from "memcpy is a bit tedious to write", is that it is impossible to guarantee absence of UB in a large program. You will get inlined functions from random headers and suddenly someone somewhere is "dereferencing" a pointer by storing it as a reference, and then your null pointer checks are deleted. Or an integer overflows and your "i>=0" array bounds assert is deleted. I have seen that happen several times and each time the various bits of code all look innocent enough until you sit down and thoroughly read each function and the functions it calls and the functions they call. So we compile with the worst UB turned off (null is a valid address, integers can overflow, pointers can alias, enums can store any integer) and honestly, we cannot see a measurable difference in performance.
UB as it is currently implemented is simply an enormous footgun. It would be a lot more useful if there were a profiler that could find parts of the code, which would benefit from UB optimisations. Then we could focus on making those parts UB-safe, add asserts for the debug/testing builds, and turn on more optimisations for them. I am quite certain nobody can write a fully UB-free program. For example, did you know that multiplying two "unsigned short" variables is UB? Can you guarantee that across all the template instantiations for custom vector types you've written you never multiply unsigned shorts? I'll leave it as an exercise to the reader as to why that is.
dzaima 18 days ago [-]
Strict aliasing in particular is very dependent on coding style; I too just turn it off with essentially no harm, but certain styles of code that don't involve carefully manually CSEing everything might benefit a good bit (though with it being purely type-based imo it's not that useful).
Outside of freestanding/kernel scenarios, null treatment shouldn't affect anything, and again can similarly benefit probably a decent amount of code by removing what is unquestionably just dead code. There's the question of pointer addition resulting in/having an argument of null being UB, which is rather bad for offsetof shenanigans or misusing pointers as arbitrary data, but that's a question of provenance, not anything about null.
Global integer overflow UB is probably the most questionable thing in that list, and I'd quite prefer having separate wrap-on-overflow and UB-on-overflow types (allowing having unsigned UB-on-overflow too). That said, even having had multiple cases of code that could hit integer overflow and thus UB, I don't recall any of them actually resulting in the compiler breaking things (granted, I know not to write a+1<a & similar for signed ints), whereas I have had a case where the compiler "fixed" the code, turning a `a-b < 0` into the more correct (at least in the scenario in question) `a<b`!
I do think that it would make sense to have an easy uniform way to choose some specific behavior for most kinds of UB.
imtringued 18 days ago [-]
There is no point in talking with the people that worship UB for the sake of UB.
They don't understand that one of the biggest barriers to developers writing and adopting more C in their projects is the random jankiness that you get from the compilers. Instead they make C this elite thing for the few people who have read every single line of C code they themselves compiled and ran on their Gentoo installation. Stuff like having no bounds checks is almost entirely irrelevant outside of compute kernels. It doesn't get you much performance, because the branches are perfectly predictable. It merely reduces code size.
There is also the problem that the C development culture is uttlery backwards compared even to the semiconductor industry. If you want to have these ultra optimized release builds, then your development builds must scream when they encounter UB and it also means that no C program or library without an extensive test suite should be allowed onto any Linux distribution's package repository. Suddenly the cost of C programming appears to be unaffordably high.
ncruces 18 days ago [-]
> Stuff like having no bounds checks is almost entirely irrelevant outside of compute kernels. It doesn't get you much performance, because the branches are perfectly predictable. It merely reduces code size.
This is just not true, or languages with bounds checks wouldn't invest so much in bounds checks elimination (both automatically, in the compiler, and by programmers, leaving hints to the compiler so it can remove bounds checks).
tekknolagi 18 days ago [-]
We used memcpy everywhere in our runtime and after the 10th or so time doing it, it becomes less awkward.
dahart 18 days ago [-]
And it’s always reliably optimized out in release builds, I assume?
LegionMammal978 18 days ago [-]
On platforms thar require aligned loads and stores (not x86 nor ARM), a direct pointer cast sometimes uses an aligned load/store where a memcpy uses multiple byte loads/stores, even on a good compiler, since memcpy() doesn't require that the pointers are aligned. This can be mitigated by going through a local variable, but it gets pretty verbose.
ack_complete 18 days ago [-]
Some ARM CPUs do require aligned loads and stores, such as the Cortex-M0+ in a Raspberry Pi Pico.
stouset 18 days ago [-]
Sounds like a good place for a macro?
mbitsnbites 17 days ago [-]
We have memcpy behind a C++ template function that mimics the interface of std::bit_cast.
Asooka 18 days ago [-]
For MSVC you have to add "/Oi", otherwise it is always a function call at lower optimisation levels. Clang and GCC treat it as an intrinsic always, even in debug builds.
tekknolagi 18 days ago [-]
I haven't manually checked every case but it's normally folded into the load or shift or whatever and completely erased
Conscat 18 days ago [-]
A few years ago with GCC 10, I had some strange UD2 instructions (I think) inserted around some inline assembly that was fixed by replaced my `reinterpret_cast`s with `bit_cast`.
mbitsnbites 17 days ago [-]
The proper solution is to use std::bit_cast in modern C++ or otherwise use memcpy, and of course know what you're doing.
Some things that could mess with you:
* Floating-point endianity is not the same as integer endianity.
* Floating-point alignment requirements differ from integer alignment requirements.
* The compuler is configured to use something else than 32-bit binary32 IEEE 754 for the type "float".
* The computer does not use two's complement arithmetic for integers.
In practice, these are not real problems.
Earw0rm 18 days ago [-]
Yes, if implemented as shown.
You could use intrinsics for the bit casting & so on and it would be well-defined to the extent that those intrinsics are.
(I understand some people consider SIMD intrinsics in general to be UB at language level, in part because they let you switch between floats & ints in a hardware-specific way.)
A better value is 0x7eb504f3. You can follow that up with Newton-raphson to refine the approximation, or approximate the Newton-raphson too by multiplying 1.385356f*x.
In some cases, it can be faster than hardware division.
dahart 17 days ago [-]
Not sure I understand your number 0x7eb504f3 - does it require using the 1.385 factor? Is it possible there is a typo in the number? That value doesn’t make sense to me. I’m measuring error here as the absolute value of relative error, e.g., | (v - v_approx) / v |. With that constant alone plugged into the poster’s code, I get a much less accurate approximation, with a minimum error of at least ~0.27 (meaning it’s always far away from the target) and worst case error of ~0.293 over the input range [1/1000…1000]. For comparison, the original 0x7effffff error is min:0 max:0.125. With 0x7eeeebb3, the error I get is min:0 max:0.0667. With 0x7ef311bc, error is min:0 max:0.0505. It’s important that the min error over a large interval is zero, because it means the approximation actually touches the target at least once.
Y_Y 19 days ago [-]
Quick comparison with exact and one Newton-Raphson:
For some more explanation: the main idea behind both tricks is that the IEEE floating point formats are designed so that the most significant bits represent its exponent, that is, floor(log2(x)). Hence reinterpret-casting a float x to an unsigned integer uint(x) approximates a multiple of log2(x). So these kinds of approximations work like logarithm arithmetic: float(C + a*uint(x)) approximates x^a, for a suitable constant C. Quake's invsqrt is a=-1/2, this post is a=-1.
IEEE754 floats are a very well-designed binary format, and the fact that these approximations are possible is part of this design; indeed, the first known instance of this trick is for a=1/2 by W. Kahan, the main designer of IEE754.
mbitsnbites 20 days ago [-]
Yep. Along the same lines. This one is even simpler, though, as it requires only a single integer CPU instruction (and the simplest of all instructions too).
If you want full precision, you need to do three Newton-Raphson iterations after the initial approximation. One iteration is:
y = y * (2.0F - x * y);
magicalhippo 20 days ago [-]
It's a neat trick, and could be very useful on microcontrollers which doesn't have hardware division but does have hardware multiplication.
actionfromafar 19 days ago [-]
Hm, like 68000?
pechay 18 days ago [-]
The 68k has both signed and unsigned 32/16 bit divide instructions.
quanto 18 days ago [-]
A quick intuition:
magic number 7e ffffff is negating by two's complement both the mantissa and exponent.
1. 7e: the first sig bit has to be zero to preserve the sign of the overall number.
2. ffffff: due to Taylor series 1/(1 + X) = 1 - X ..., negating gives the multiplicative inverse.
Although an IEEE float has a sign bit (thus 2's complement does not work on the float itself) but mantissa and the exponent individually work with 2's complement for different reasons. The exponent has a biased range due to being chopped by half; where as the mantissa has a biased range due to its definition and constant offset. The exponent's 1 offset (7e instead of 7f) is a bit more difficult to see at first -- the devil is in the details, but 2's complement is the basic intuition.
There are couple of tricks you can do if you fiddle with the bits of a floating point value using integer arithmetic and binary logic.
That was a thing back in the 90th..
I wonder how hard the performance hit from moving values between integer and float pipeline is nowadays.
Last time I looked into that was the Cortex-A8 (first I-Phone area). Doing that kind of trick costed around 26 cycles (back and forth) due to pipeline stalls back then.
stephencanon 19 days ago [-]
There are basic integer operations in the FP/SIMD units on most CPUs, so there’s no generally need to “move back and forth” unless you need to branch on the result of a comparison, use a value as an address, or do some more specialized arithmetic.
gpderetta 18 days ago [-]
On x86 there is sometimes (depending on the specific microarchitecture) an extra cycle additional latency when using an integer operation on a xmm register last used with a float operation.
I have seen it explained as the integer and foat ALUs's being physically distant and the forwarding network needing an extra cycle to transport the operands.
stephencanon 17 days ago [-]
This is correct, but it’s happily pretty rare for it to matter in practice (because the domain bypass penalty is small and does not directly impact throughput, only latency).
stephencanon 18 days ago [-]
(For that matter, though, most modern FP/SIMD units have a direct approximate-reciprocal instruction that is single-cycle throughput or better and much more accurate--generally around 10-12 bits, so there's no need for this sort of thing. See, e.g. FRECPE on ARM NEON and [V]RCPP[S/D] on x86.)
ack_complete 18 days ago [-]
These kinds of tricks are still used today. They're not so useful if you need a reciprocal or square root, since CPUs now have dedicated hardware for that, but it's different if you need a _cube_ root or x^(1/2.4).
mitthrowaway2 18 days ago [-]
I wonder to what extent the dedicated hardware is essentially implementing the same steps but at the transistor level.
mbitsnbites 17 days ago [-]
The big cores do. They essentially pump division through something like an FMA (fused multiply-add) unit, possibly the same unit that is used for multiplication and addition. That's for the Newton-Raphson steps, or Goldschmidt steps.
In hardware it's much easier to do a LUT-based approximation for the initial estimate rather than the subtraction trick, though.
It's common for CPUs to give 6-8 accurate bits in the approximation. x86 gives 13 accurate bits. Back in 1975, the Cray 1 gave 30 (!) accurate bits in the first approximation, and it didn't even have a division instruction (everything about that machine was big and fast).
mbitsnbites 17 days ago [-]
There is also the case with machines that lack FP support, like some ARM Cortex M variants.
somat 18 days ago [-]
Whenever I see these tricks(see also: the quake 3 fast inverse sqrt) involving using, not casting, but using integers as floats directly and floats as integers, I wonder if there is a way to do it without the jank.
Because what do you really want? some sort of exponent or exponent math right, some variant of the log function should work. is the problem is all the log functions are gated behind the function call interface. where as the subtract function is less heavy being behind the operator interface. or are they trying to use a floating point accelerated log aka floating point subtract?
dahart 17 days ago [-]
Yes of course there’s a more graceful approach. You can use an actual divide instruction or routine instead of the bit-casting subtraction trick.
It’s not about interfaces. The point is that a single subtract is simpler math, and (probably) faster than a divide, and it gets you surprisingly close to the right answer. The reason it works is subtracting two exponents is a subtraction in log space which is equivalent to a divide in linear space.
People don’t use this trick in practice that much if at all, especially these days with GPUs that do reciprocals and square roots in hardware, it’s just an interesting thing to know about that helps solidify our understanding of floating point numbers and logarithms.
jasomill 17 days ago [-]
For more general algorithms along these lines, see exercises 25–28 in section 1.2.2 of Knuth[1] (and note that the printed solution to exercise 28 in early printings has an error[2]).
[1] D.E. Knuth, The art of computer programming. Vol. 1, third edition, Addison-Wesley, Reading, MA, 1997.
That knuth check story was pretty great. it reminded me of when I had to write pow() from scratch many years ago. However mine was not optimized in any way, and was a very sluggish naive implementation of an algorithm I found in the nist mathematical function library.
I was young and worked night shifts feeding tapes to a ibm mainframe, in the down time I would amuse myself by reading the ibm manuals laying around and writing scripts on the operator console, some useful and some not so much.
One of my more useless scripts was a sort terminal characters drawing routine, I wanted to draw lines at a specific angle which would require sin(), cos(), however, the scripting language used had no included trig functions. So I looked it up. and smuggled in a printout of the nist reference sin function. At which point I found, to my dismay, there also was no floating point power function. So the next night I smuggled in the printout for that one as well. And now, armed with a highly questionable implementation of sin(), cos(), pow() I could finally draw the hands for my stupid analog clock screensaver on the 3270 terminal. Honestly probably the highlight of my time there, but I was too scared to let anyone know about that script, didn't want them to think I had too much downtime.
If you're not constrained to software solutions you have a whole world of opportunities. E.g. if it's a graphics or neural net pipeline you can pour tricks like this (or better) onto it. If it's a CPU then you can add special instructions that do exponent manipulation and the likes.
brudgers 18 days ago [-]
Engineering mathematical calculations always looks like "jank." Take a look at Plauger's The C Standard Library. It is full of "magic" numbers.
But then again, representing irrational numbers in binary is inherently janky.
One can do better still - 0x7EF311BC is a near optimal value at least for inputs in the range of [0.001 .. 1000].
The simple explanation here is:
The post’s number 0x7EFFFFFF results in an approximation that is always equal to or greater than 1/x. The value 0x7EEEEBB3 is better, but it’s less than 1/x around 2/3rds of the time. My number 0x7EF311BC appears to be as well balanced as you can get, half the time greater and half the time less than 1/x.
To find this number, I have a Jupyter notebook that plots the maximum absolute value of relative error over a range of inputs, for a range of magic constants. Once it’s setup, it’s pretty easy to manually binary search and find the minimum. The plot of max error looks like a big “V”. (Edit while the plot of mean error looks like a big “U” near the minimum.
The optimal number does depend on the input range, and using a different range or allowing all finite floats will change where the optimal magic value is. The optimal magic number will also change if you add one or more Newton iterations, like in that github snippet (and also seen in the ‘quake trick’ code).
PPS maybe 0x7EF0F7D0 is a pretty good candidate for minimizing the average relative error…?
Also, for the NR steps I used double precision for the inner (2.0 - x * y) part, then rounded to single precision, to simulate FMA, but single precision for the outer multiplication.
I like the FMA simulation, that’s smart; I didn’t think about it. I did my search in Python. I don’t have it in front of me right now, and off the top of my head I’m not even sure whether my NR steps are in Python precision or fp32. :P My posts in this thread were with NR turned off, I wanted to find the best raw approximation and noticed I got a different magic number when using refinement. It really is an amazing trick, right? Even knowing how it works it still looks like magic when plotting the result.
Thanks for the update!
BTW I was also fiddling with another possible trick that is specific to reciprocal. I suspect you can simply negate all the bits except the sign and get something that’s a decent starting point for Newton iters, though it’s a much worse approximation than the subtraction. So maybe (x ^ 0x7fffffff). Not sure if negating the mantissa helps or if it’s better to negate only the exponent. I haven’t had time to analyze it properly yet, and I don’t know of any cases where it would be preferred, but I still think it’s another interesting/cute observation about how fp32 bits are stored.
My selection criteria was abit complex, but something like this:
1. Maximize number of accurate bits in the approximation.
2. Same in NR step 1, then NR step 2 etc.
3. Minimize the max error in the approximation, and then the avg ertor in the approximation.
4. Same for NR step 1, 2, ...
It might be nice to start sharing modern/safe versions of this snippet & the Quake thing. Is using memcpy the only option that is safe in both C and C++? That always felt really awkward to me.
[1] https://tttapa.github.io/Pages/Programming/Cpp/Practices/typ...
Apart from memcpy(), the 'allowed' methods include unions in C (writing to one member and reading from another), and bit_cast<T>() and std::start_lifetime_as<T>() in C++.
[0] https://godbolt.org/z/dxMMfazoq
The first is to allocate the memory using char a[sizeof(float)]. In C, char pointers may alias anything, so then you can do pointer conversions that would normally be undefined behavior and it should work. The other option is to use the non-standard __attribute__((__may_alias__)) on the pointer.
By the way, using union types for this is technically undefined behavior in the C and C++ standards, but GCC and Clang decided to make it defined as an implementation choice. Other compilers might not.
And note that while char can alias anything, the reverse is not true: i.e. you can't generally cast a char array to anything else and expect sensible behaviour. There are ways to make this work (placement new in C++ for example), but it is not a way to escape TBAA: if you store a float in char array you can't then cast it to int with impunity.
[0]: https://en.cppreference.com/w/c/language/union
https://open-std.org/jtc1/sc22/wg14/www/docs/dr_283.htm
It is not part of C99. It also is not part of the C standard since no subsequent C standard adopted it according to the GCC developers:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118141#c13
A read of the C11 standard draft, which would have this amendment if it were accepted by the C standards committee, shows that this has not been added:
https://open-std.org/jtc1/sc22/wg14/www/docs/n1570.pdf
Type punning via union types is therefore undefined behavior unless your compiler implements an extension to define it like GCC and Clang do.
Also the note is non-normative, so it is only clarifying preexisting behaviour.
But I'm far from an expert on the C standard. Also that was the C11 draft, maybe the note was removed before the final standard.
Edit: I believe the alias rules are in 6.5 comma 7; specifically:
> An object shall have its stored value accessed only by an lvalue expression that has one of the following types:
[...]
>an aggregate or union type that includes one of the aforementioned types among its members (including, recursively, a member of a subaggregate or contained union),
Edit2: neither commas nor the note have changed in the 202y Draft.
I'm familiar with such language in the C++ standard.
Edit: On the contrary, this note 92:
> 92)If the member used to read the contents of a union object is not the same as the member last used to store a value in the object the appropriate part of the object representation of the value is reinterpreted as an object representation in the new type as described in 6.2.6 (a process sometimes called type punning). This can possibly be a non-value representation.
Edit2: and that's specifically the text that was added by dr283. I think you might be confusing with a different DR (don't remember the number) that specifically asked if generalized type punning was possible as long as an union containing the aliased types was visible in the translation unit. I think that's still open although GCC definitely forbids it.
Instead they interpret it narrowly to only allow punning through objects that are actual unions (as described in the GCC docs).
Unsurprisingly the standard is kind of a mess.
I'm probably an optimist.
https://news.ycombinator.com/item?id=42568271m
(I am not a C or C++ expert.)
The big problem, apart from "memcpy is a bit tedious to write", is that it is impossible to guarantee absence of UB in a large program. You will get inlined functions from random headers and suddenly someone somewhere is "dereferencing" a pointer by storing it as a reference, and then your null pointer checks are deleted. Or an integer overflows and your "i>=0" array bounds assert is deleted. I have seen that happen several times and each time the various bits of code all look innocent enough until you sit down and thoroughly read each function and the functions it calls and the functions they call. So we compile with the worst UB turned off (null is a valid address, integers can overflow, pointers can alias, enums can store any integer) and honestly, we cannot see a measurable difference in performance.
UB as it is currently implemented is simply an enormous footgun. It would be a lot more useful if there were a profiler that could find parts of the code, which would benefit from UB optimisations. Then we could focus on making those parts UB-safe, add asserts for the debug/testing builds, and turn on more optimisations for them. I am quite certain nobody can write a fully UB-free program. For example, did you know that multiplying two "unsigned short" variables is UB? Can you guarantee that across all the template instantiations for custom vector types you've written you never multiply unsigned shorts? I'll leave it as an exercise to the reader as to why that is.
Outside of freestanding/kernel scenarios, null treatment shouldn't affect anything, and again can similarly benefit probably a decent amount of code by removing what is unquestionably just dead code. There's the question of pointer addition resulting in/having an argument of null being UB, which is rather bad for offsetof shenanigans or misusing pointers as arbitrary data, but that's a question of provenance, not anything about null.
Global integer overflow UB is probably the most questionable thing in that list, and I'd quite prefer having separate wrap-on-overflow and UB-on-overflow types (allowing having unsigned UB-on-overflow too). That said, even having had multiple cases of code that could hit integer overflow and thus UB, I don't recall any of them actually resulting in the compiler breaking things (granted, I know not to write a+1<a & similar for signed ints), whereas I have had a case where the compiler "fixed" the code, turning a `a-b < 0` into the more correct (at least in the scenario in question) `a<b`!
I do think that it would make sense to have an easy uniform way to choose some specific behavior for most kinds of UB.
They don't understand that one of the biggest barriers to developers writing and adopting more C in their projects is the random jankiness that you get from the compilers. Instead they make C this elite thing for the few people who have read every single line of C code they themselves compiled and ran on their Gentoo installation. Stuff like having no bounds checks is almost entirely irrelevant outside of compute kernels. It doesn't get you much performance, because the branches are perfectly predictable. It merely reduces code size.
There is also the problem that the C development culture is uttlery backwards compared even to the semiconductor industry. If you want to have these ultra optimized release builds, then your development builds must scream when they encounter UB and it also means that no C program or library without an extensive test suite should be allowed onto any Linux distribution's package repository. Suddenly the cost of C programming appears to be unaffordably high.
This is just not true, or languages with bounds checks wouldn't invest so much in bounds checks elimination (both automatically, in the compiler, and by programmers, leaving hints to the compiler so it can remove bounds checks).
Some things that could mess with you:
* Floating-point endianity is not the same as integer endianity.
* Floating-point alignment requirements differ from integer alignment requirements.
* The compuler is configured to use something else than 32-bit binary32 IEEE 754 for the type "float".
* The computer does not use two's complement arithmetic for integers.
In practice, these are not real problems.
You could use intrinsics for the bit casting & so on and it would be well-defined to the extent that those intrinsics are.
(I understand some people consider SIMD intrinsics in general to be UB at language level, in part because they let you switch between floats & ints in a hardware-specific way.)
I did this a few days ago in the approximate division thread: https://news.ycombinator.com/item?id=42481612#42489596
In some cases, it can be faster than hardware division.
[1]: https://en.wikipedia.org/wiki/Fast_inverse_square_root
More detail on https://en.wikipedia.org/wiki/Fast_inverse_square_root#Alias... .
IEEE754 floats are a very well-designed binary format, and the fact that these approximations are possible is part of this design; indeed, the first known instance of this trick is for a=1/2 by W. Kahan, the main designer of IEE754.
If you want full precision, you need to do three Newton-Raphson iterations after the initial approximation. One iteration is:
1. 7e: the first sig bit has to be zero to preserve the sign of the overall number.
2. ffffff: due to Taylor series 1/(1 + X) = 1 - X ..., negating gives the multiplicative inverse.
Although an IEEE float has a sign bit (thus 2's complement does not work on the float itself) but mantissa and the exponent individually work with 2's complement for different reasons. The exponent has a biased range due to being chopped by half; where as the mantissa has a biased range due to its definition and constant offset. The exponent's 1 offset (7e instead of 7f) is a bit more difficult to see at first -- the devil is in the details, but 2's complement is the basic intuition.
It's in Go, but ports easily to other languages.
See also: https://stackoverflow.com/questions/32042673/optimized-low-a...
That was a thing back in the 90th..
I wonder how hard the performance hit from moving values between integer and float pipeline is nowadays.
Last time I looked into that was the Cortex-A8 (first I-Phone area). Doing that kind of trick costed around 26 cycles (back and forth) due to pipeline stalls back then.
I have seen it explained as the integer and foat ALUs's being physically distant and the forwarding network needing an extra cycle to transport the operands.
In hardware it's much easier to do a LUT-based approximation for the initial estimate rather than the subtraction trick, though.
It's common for CPUs to give 6-8 accurate bits in the approximation. x86 gives 13 accurate bits. Back in 1975, the Cray 1 gave 30 (!) accurate bits in the first approximation, and it didn't even have a division instruction (everything about that machine was big and fast).
Because what do you really want? some sort of exponent or exponent math right, some variant of the log function should work. is the problem is all the log functions are gated behind the function call interface. where as the subtract function is less heavy being behind the operator interface. or are they trying to use a floating point accelerated log aka floating point subtract?
It’s not about interfaces. The point is that a single subtract is simpler math, and (probably) faster than a divide, and it gets you surprisingly close to the right answer. The reason it works is subtracting two exponents is a subtraction in log space which is equivalent to a divide in linear space.
People don’t use this trick in practice that much if at all, especially these days with GPUs that do reciprocals and square roots in hardware, it’s just an interesting thing to know about that helps solidify our understanding of floating point numbers and logarithms.
[1] D.E. Knuth, The art of computer programming. Vol. 1, third edition, Addison-Wesley, Reading, MA, 1997.
[2] https://www.werkema.com/2021/10/14/my-knuth-check/
I was young and worked night shifts feeding tapes to a ibm mainframe, in the down time I would amuse myself by reading the ibm manuals laying around and writing scripts on the operator console, some useful and some not so much.
One of my more useless scripts was a sort terminal characters drawing routine, I wanted to draw lines at a specific angle which would require sin(), cos(), however, the scripting language used had no included trig functions. So I looked it up. and smuggled in a printout of the nist reference sin function. At which point I found, to my dismay, there also was no floating point power function. So the next night I smuggled in the printout for that one as well. And now, armed with a highly questionable implementation of sin(), cos(), pow() I could finally draw the hands for my stupid analog clock screensaver on the 3270 terminal. Honestly probably the highlight of my time there, but I was too scared to let anyone know about that script, didn't want them to think I had too much downtime.
https://dlmf.nist.gov/6.6
But then again, representing irrational numbers in binary is inherently janky.