I'm sure TFA's conclusion is right; but its argument would be strengthened by providing the codegen for both versions, instead of just the better version. Quote:
"The second wrong thing with the supposedly optimizer [sic] version is that it actually runs much slower than the original version [...] wasting two multiplications and one or two additions. [...] But don't take my word for it, let's look at the generated machine code for the relevant part of the shader"
—then proceeds to show only one codegen: the one containing no multiplications or additions. That proves the good version is fine; it doesn't yet prove the bad version is worse.
azeemba 33 days ago [-]
The main point is that the conditional didn't actually introduce a branch.
Showing the other generated version would only show that it's longer. It is not expected to have a branch either. So I don't think it would have added much value
comex 33 days ago [-]
But it's possible that the compiler is smart enough to optimize the step() version down to the same code as the conditional version. If true, that still wouldn't justify using step(), but it would mean that the step() version isn't "wasting two multiplications and one or two additions" as the post says.
(I don't know enough about GPU compilers to say whether they implement such an optimization, but if step() abuse is as popular as the post says, then they probably should.)
MindSpunk 33 days ago [-]
Okay but how does this help the reader? If the worse code happens to optimize to the same thing it's still awful and you get no benefits. It's likely not to optimize down unless you have fast-math enabled because the extra float ops have to be preserved to be IEEE754 compliant
account42 33 days ago [-]
Fragment and vertex shaders generally don't target strict IEEE754 compliance by default. Transforming a * (b ? 1.0 : 0.0) into b ? a : 0.0 is absolutely something you can expect a shader compiler to do - that only requires assuming a is not NaN.
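For concreteness, a tiny GLSL sketch of the two forms being compared (the names are placeholders, and whether a given driver actually folds one into the other is compiler-dependent):

    // the "step-style" form people write to avoid a perceived branch
    float masked(float a, bool b) {
        return a * (b ? 1.0 : 0.0);
    }

    // what a shader compiler can be expected to fold it into; the two are only
    // equivalent if a is never NaN (NaN * 0.0 is NaN, not 0.0)
    float selected(float a, bool b) {
        return b ? a : 0.0;
    }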
burnished 33 days ago [-]
..how is it awful if it has the same result?
dcrazy 33 days ago [-]
Because it perpetuates a misconception and is harder to read.
seba_dos1 33 days ago [-]
Just look at it.
idunnoman1222 33 days ago [-]
Unless you’re writing an essay on why you’re right…
chrisjj 33 days ago [-]
> Unless you’re writing an essay on why you’re right…
He's writing an essay on why they are wrong.
"But here's the problem - when seeing code like this, somebody somewhere will invariably propose the following "optimization", which replaces what they believe (erroneously) are "conditional branches" by arithmetical operations."
Hence his branchless codegen samples are sufficient.
Further, regarding the side-issue "The second wrong thing with the supposedly optimizer [sic] version is that it actually runs much slower", no amount of codegen is going to show lower /speed/.
ncruces 33 days ago [-]
The other either optimizes the same, or has an additional multiplication, and it's definitely less readable.
TheRealPomax 33 days ago [-]
Correct: it would show proof instead of leaving it up to the reader to believe them.
Lockal 33 days ago [-]
You missed the second part, where the article says that "it actually runs much slower than the original version", "wasting two multiplications and one or two additions", based on the idea that the compiler is unable to do a very basic optimization, implying that the compiler will actually multiply by one. No benchmarks, no checking of the assembly, just straightforward misinformation.
There are 10 types of people in this world. Those who can extrapolate from missing data, and
account42 33 days ago [-]
Making assumptions about performance when you can measure is generally not a good idea.
robertlagrant 33 days ago [-]
and what? AND WHAT?
alkonaut 33 days ago [-]
I wish there was a good way of knowing when an if forces an actual branch and when it doesn't. The reason people do potentially more expensive mix/lerps is that, while it might cost a tiny overhead, they are scared of making it a branch.
I do like that the most obvious v = x > y ? a : b; actually works, but it's also concerning that we have syntax where an if is sometimes a branch and sometimes not. In a context where you really can't branch, you'd almost like branch-if and non-branching-if to be different keywords. The non-branching one would fail compilation if the compiler couldn't do it without branching. The branching one would warn if it could be done with branching.
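For reference, a minimal GLSL sketch of the two styles in question (the names are illustrative); per the article, the first already compiles to a select rather than a jump, while the second only adds arithmetic on top:

    // the obvious version: compiles to a conditional select, not a jump
    float pick(float x, float y, float a, float b) {
        return (x > y) ? a : b;
    }

    // the "scared of branches" workaround: also branchless, but it pays for
    // the extra multiplies/adds hidden inside mix() and step()
    float pickNoIf(float x, float y, float a, float b) {
        return mix(b, a, 1.0 - step(x, y)); // 1.0 - step(x, y) is 1.0 exactly when x > y
    }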
pandaman 33 days ago [-]
>The reason people do potentially more expensive mix/lerps is because while it might cost a tiny overhead, they are scared of making it a branch.
And the reason for that is the confusing documentation from NVidia and its cg/CUDA compilers. I believe they did not want to scare programmers at first and hid the execution model, talking about "threads" and then they kept using that abstraction to hype up their GPUs ("it has 100500 CUDA threads!"). The result is people coding for GPUs with some bizarre superstitions though.
You actually want branches in the code. Those are quick. The problem is that you cannot have a branch off a SIMD way, so instead of a branch the compiler will emit code for both branches and the results will be masked out based on the branch's condition.
So, to answer your question - any computation based on shader inputs (vertices, compute shader indices and what not) cannot and won't branch. It will all be executed sequentially with masking. Even in the TFA example, both values of the ? operator are computed; the same happens with any conditional on a SIMD value. There can be shortcut branches emitted by the compiler to quickly bypass computations when all ways are the same value, but in the general case everything will be computed for every condition being true as well as being false.
Only conditionals based on scalar registers (shader constants/uniform values) will generate branches, and those are super quick.
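Roughly, in GLSL terms (a made-up sketch; uXRayMode and the shading math are just placeholders):

    uniform bool uXRayMode;  // scalar: the same value for every invocation

    vec3 shade(vec3 baseColor, vec3 xrayColor, float mask) {
        if (uXRayMode) {
            // the condition lives in a scalar/uniform register, so the
            // compiler can emit a real branch - every lane takes the same path
            return xrayColor;
        }
        if (mask > 0.5) {
            // the condition is per-invocation: in the general case both sides
            // are executed and the results are masked
            return baseColor * 2.0;
        }
        return baseColor;
    }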
account42 33 days ago [-]
> So, to answer your question - any computation based on shader inputs (vertices, computer shader indices and what not) cannot and won't branch.
It can do an actual branch if the condition ends up the same for the entire workgroup - or to be even more pedantic, for the part of the workgroup that is still alive.
You can also check that explicitly, to e.g. take a faster special-case branch when possible for the entire workgroup, and otherwise a slower general-case branch, but also for the entire workgroup, instead of doing both and then selecting.
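That explicit check can be written with the subgroup vote operations - a sketch assuming the GL_KHR_shader_subgroup_vote extension is available (the shading math is a stand-in):

    #extension GL_KHR_shader_subgroup_vote : enable

    // take the cheap path only when every active invocation in the subgroup
    // agrees; otherwise run the general path uniformly. The general path still
    // gives the right answer for the fully lit lanes (their shadowFactor is
    // 1.0), so we're only skipping work, not changing results.
    vec4 shade(vec3 albedo, float shadowFactor) {
        if (subgroupAll(shadowFactor >= 1.0)) {
            return vec4(albedo, 1.0);             // fast special case
        }
        return vec4(albedo * shadowFactor, 1.0);  // general case, whole subgroup
    }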
pandaman 33 days ago [-]
And this is why I wrote "There can be shortcut branches emitted by the compiler to quickly bypass computations when all ways are the same value, but in the general case everything will be computed for every condition being true as well as being false."
ribit 31 days ago [-]
Execution with masking is pretty much how branching works on GPUs. What's more relevant, however, is that conditional statements add overhead in terms of additional instructions and execution state management. Eliminating small branches using conditional moves or manual masking can be a performance win.
pandaman 31 days ago [-]
No, branching works on GPUs just like everywhere else - the instruction pointer gets changed to another value. But you cannot branch on a vector value unless every element of the vector is the same, which is why branching on vector values is a bad idea. However, if your vectorized computation is naturally divergent then there is no way around it; conditional moves are not going to help, as they also evaluate both sides of the conditional. The best you can do is to arrange it in such a way that you only add computation instead of alternating between paths, i.e. you do if() ... instead of if() ... else ..., then you only take as long as the longest path.
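As a trivial GLSL illustration of that restructuring (base and detail stand in for bigger computations):

    // if/else on a divergent condition: a diverged wave pays for both sides
    vec3 shadeEither(vec3 base, vec3 detail, bool detailed) {
        if (detailed) { return base + detail; }
        else          { return base; }
    }

    // restructured so the conditional only adds work: the diverged wave now
    // costs the longest path rather than the sum of two separate paths
    vec3 shadeAdditive(vec3 base, vec3 detail, bool detailed) {
        vec3 c = base;
        if (detailed) { c += detail; }
        return c;
    }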
This reminds me that people who believe the GPU is not capable of branching do stupid things like writing multiple shaders instead of branching off a shader constant, e.g. you have some special mode, say x-ray vision, in a game, and instead of doing a branch in your materials, you write an alternative version of every shader.
ryao 33 days ago [-]
You can always have the compiler dump the assembly output so you can examine it. I suspect few do that.
vanderZwan 33 days ago [-]
Does this also apply to shaders? And is it even useful, given the enormous variation in hardware capabilities out there? My impression was that it's all JIT compiled unless you know which hardware you're targeting, e.g. Valve precompiling highly optimized shaders for the Steam Deck.
(I'm not a graphics programmer, mind you, so please correct any misunderstandings on my end)
swiftcoder 33 days ago [-]
It's all JIT'd based on the specific driver/GPU, but the intermediate assembly language is sufficient to inspect things like branches and loop unrolling.
grg0 32 days ago [-]
Not really. DXIL in particular will still have branches and not care much about unrolling. You need to look at the assembly that is generated. And yes, that depends on the target hardware and compiler/driver.
account42 33 days ago [-]
You will have to check for the different GPUs you are targeting. But GPU vendors don't start from scratch for each hardware generation, so you will often see similar results.
torginus 32 days ago [-]
I'll comment this here as I got downvoted when I made the point in a standalone comment - this is mostly an academic issue, since you don't want to use step or pixel-level if statements in your shader code, as they will lead to ugly aliasing artifacts as the pixel color transitions from a to b.
What you want is to use smoothstep, which blends a bit between these two values, and for that you need to compute both paths anyway.
pandaman 32 days ago [-]
It's absurd to claim that you'd never use step(), even in pixel shaders (there are all kinds of shaders not related to pixels at all).
torginus 32 days ago [-]
>since you don't want to use step or pixel-level if statements in your shader code
The observation relates to pixel shaders, and even within that, it relates to values that vary based on pixel-level data. In these cases having if statements without any sort of interpolation introduces aliasing, which tends to look very noticeable.
Now you might be fine with that, or have some way of masking it, so it might be fine in your use case, but in the most common, naive case the issue does show up.
pandaman 32 days ago [-]
I don't know how many graphics products you shipped and when, but, say, clamping values at 0 is pretty common even in most basic shaders. It's not magic and won't introduce "aliasing" just for the fact of using it. On the other hand, for example, using negative dot products in your lighting computation will introduce bizarre artifacts. And yes, everyone has been using various forms of MSAA for the past 15 years or so, even in games. Welcome to the 21st century.
torginus 32 days ago [-]
The way you write seems to imply you have professional experience in the matter, which makes it very strange you're not getting what I'm writing about.
Nobody ever talked about clamping - and it's not even relevant to the discussion, as it doesn't introduce a discontinuity that can cause aliasing.
What I'm referring to is shader aliasing, which MSAA does nothing about - MSAA is for geometry aliasing.
To illustrate what I'm talking about, here's an example that draws a red circle on a quad:
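(The snippets didn't make it into the comment as captured here; presumably they look something like this GLSL sketch, with uv, center and radius as assumed inputs:)

    // hard cutoff: aliased, pixelated contour
    vec3 circleHard(vec2 uv, vec2 center, float radius) {
        float d = length(uv - center);
        float inside = step(d, radius);                 // 1.0 inside, 0.0 outside
        return mix(vec3(1.0), vec3(1.0, 0.0, 0.0), inside);
    }

    // smoothed cutoff: blend over roughly one pixel's width
    vec3 circleSoft(vec2 uv, vec2 center, float radius) {
        float d = length(uv - center);
        float w = fwidth(d);                            // screen-space size of one pixel in d
        float inside = 1.0 - smoothstep(radius - w, radius + w, d);
        return mix(vec3(1.0), vec3(1.0, 0.0, 0.0), inside);
    }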
The first version has a hard boundary for the circle which has an ugly aliased and pixelated contour, while the latter version smooths it. This example might not be egregious, but this can and does show up in some circumstances.
ajross 33 days ago [-]
> it's also concerning that we have syntax where an if is some times a branch and some times not.
That's true on scalar CPUs too though. The CMOV instruction arrived with the P6 core in 1995, for example. Branches are expensive everywhere, even in scalar architectures, and compilers do their best to figure out when they should use an alternative strategy. And sometimes get it wrong, but not very often.
masklinn 33 days ago [-]
For scalar CPUs, historically CMOV used to be relatively slow on x86, and notably for reliable branching patterns (>75% reliable) branches could be a lot faster.
cmov also has dependencies on all three inputs, so if there's a heavy bias towards one outcome and the unlikely input has a much higher latency than the likely one, a cmov can cost a fair amount of waiting.
Finally cmov were absolutely terrible on P4 (10-ish cycles), and it's likely that a lot of their lore dates back to that.
account42 33 days ago [-]
You got this the wrong way around: For GPUs conditional moves are the default and real branches are a performance optimization possible only if the branch is uniform (=same side taken for the entire workgroup).
mpreda 33 days ago [-]
Exactly. Consider this example:
    a = f(z);
    b = g(z);
    v = x > y ? a : b;
Assuming computing the two function calls f() and g() is relatively expensive, it becomes a trade-off whether to emit conditional code or to compute both followed by a select. So it's not a simple choice, and the decision is made by the compiler.
dragontamer 33 days ago [-]
This is a GPU focused article.
The GPU will almost always execute f and g due to GPU differences vs CPU.
You can avoid the f vs g if you can ensure a scalar Boolean / if statement that is consistent across the warp. So it's not 'always' but requires incredibly specific coding patterns to 'force' the optimizer + GPU compiler into making the branch.
justsid 33 days ago [-]
It depends. If the code flow is uniform for the warp, only one side of the branch needs to be evaluated. But you could still end up with pessimistic register allocation because the compiler can't know it is uniform. It's sometimes weirdly hard to reason about how exactly code will end up executing on the GPU.
danybittel 33 days ago [-]
f or g may have side effects too. Like writing to memory.
Now a conditional has a different meaning.
You could also have some fun stuff, where f and g return a boolean, because thanks to short circuit evaluation && || are actually also conditionals in disguise.
account42 33 days ago [-]
Side effects will be masked, the GPU is still executing exactly the same code for the entire workgroup.
plagiarist 33 days ago [-]
I think that capability in the shader language would be interesting to have. One might even want it to two-color all functions in the code. Anything annotated nonbranching must have if statements compile down to conditional moves and must only call nonbranching functions.
catlifeonmars 33 days ago [-]
This is also very relevant for cryptography use cases, where branching is a potential side channel for leaking secret information.
grg0 32 days ago [-]
The good way of knowing is to look at the assembly generated by the compiler. Maybe not a completely satisfying answer given that the result is heavily vendor-dependent, but unless the high-level language exposes some way of explicitly controlling it, then assembly is what you got.
> The reason people do potentially more expensive mix/lerps is because while it might cost a tiny overhead, they are scared of making it a branch.
This is a problem, though. People shouldn't do things potentially, they should look at the actual code that is generated and executed.
mwkaufma 33 days ago [-]
One can do precisely what's done in the article -- inspect the assembly.
chrisjj 33 days ago [-]
The good way is to inspect the code :)
> it's also concerning that we have syntax where an if is some times a branch and some times not.
It would be more concerning if we didn't. We might get a branch on one GPU and none on another.
phkahler 33 days ago [-]
>> The good way is to inspect the code :)
The best way is to profile the code. Time is what we are after, so measure that.
chrisjj 31 days ago [-]
For sure.
nice_byte 33 days ago [-]
Godbolt has the RGA compiler now; you can always paste in HLSL and look at the actual RDNA instructions that are generated (what the GPU actually runs, not SPIR-V).
NohatCoder 33 days ago [-]
But you don't generally need to care if the shader code contains a few branches; modern GPUs handle those reasonably well, and the compiler will probably make a reasonable guess about what is fastest.
account42 33 days ago [-]
You do need to care about large non-uniform branches as in the general case the GPU will have to execute both sides.
NohatCoder 32 days ago [-]
A non-branching version of the same algorithm will also run code equivalent to both branches. The branching version may sometimes skip one of the branches, the non-branching version can't. So if the functionality you want is best described by a branch, then use a branch.
nosferalatu123 33 days ago [-]
A lot of the myth that "branches are slow on GPUs" is because, way back on the PlayStation 3, they were quite slow. NVIDIA's RSX GPU was on the PS3; it was documented that it was six cycles IIRC, but it always measured slower than that to me.
That was for even a completely coherent branch, where all threads in the warp took the same path. Incoherent branches were slower because the IFEH instruction took six cycles, and the GPU would have to execute both sides of the branch.
I believe that was the origin of the "branches are slow on GPUs" myth that continues to this day. Nowadays GPU branching is quite cheap especially coherent branches.
dahart 32 days ago [-]
If someone says branching without qualification, I have to assume it’s incoherent. The branching mechanics might have lower overhead today, but the basic physics of the situation is that throughput on each side of the branch is reduced to the percentage of active threads. If both sides of a branch are taken, and both sides are the same instruction length, the average perf over both sides is at least cut in half. This is why the belief that branches are slow on GPUs is both persistent and true. And this is why it’s worth trying harder to reformulate the problem without branching, if possible.
nice_byte 33 days ago [-]
coherent branches are "free" but the extra instructions increase register pressure. that's the main reason why dynamic branches are avoided, not that they are inherently "slow".
aappleby 33 days ago [-]
These sort of avoid-branches optimizations were effective once upon a time as I profiled them on the XBox 360 and some ancient Intel iGPUs, but yeah - don't do this anymore.
Same story for bit extraction and other integer ops - we used to emulate them with float math because it was faster, but now every GPU has fast integer ops.
Agentlien 33 days ago [-]
> now every GPU has fast integer ops.
Is that true and to what extent? Looking at the ISA for RDNA2[0] for instance - which is the architecture of both PS5 and Xbox Series S|X - all I can find is 32-bit scalar instructions for integers.
You're likely going to have a rough time with 64-bit arithmetic in any GPU. (At least on Nvidia GPUs, the instruction set doesn't give you anything but a 32-bit add-with-carry to help.) But my understanding is that a lot of the arithmetic hardware used for 53-bit double-precision ops can also be used for 32-bit integer ops, which hasn't always been the case.
ryao 33 days ago [-]
The PTX ISA for Nvidia GPUs supports 64-bit integer arithmetic:
It needs to support 64-bit integer arithmetic for handling 64-bit address calculations efficiently. The SASS ISA since Volta has explicit 32I-suffixed integer instructions alongside the regular integer instructions, so I would expect the regular instructions to be 64-bit, although the documentation leaves something to be desired:
Hmm, it looks like the underlying SASS does have 64-bit integer instructions now, but only with the 12.0 capability level in the recent Blackwell processors. Older versions emulate it via chained 32-bit instructions. Take this example kernel:
So if you want real 64-bit support, have fun getting your hands on a 5070! But even on sm_120, things like 64-bit × immediate 32-bit take a UIMAD.WIDE.U32 + UIMAD + UIADD3 sequence, so the support isn't all that complete.
(I've been looking into the specifics of CUDA integer arithmetic for some time now, since I've had the mad idea of doing 'horizontal' 448-bit integer arithmetic by storing one word in each thread and using the warp-shuffle instructions to send carries up and down. Given that the underlying arithmetic is all 32-bit, it doesn't make any sense to store more than 31 bits per thread. Then again, I don't know whether this mad idea makes any sense in the first place, until I implement and profile it.)
Agentlien 33 days ago [-]
I'm less concerned about it being 32-bit and more about them being exclusively scalar instructions, no vector instructions. Meaning only useful for uniforms, not thread-specific data.
[Update: I remembered and double checked. While there are only scalar 32-bit integer instructions you can use 24-bit integer vector instructions. Essentially ignoring the exponent part of the floats.]
ryao 33 days ago [-]
The programming model is that all threads in the warp / thread block run the same instruction (barring masking for branch divergence). Having SIMD instructions at the thread level is a rarity given that the way SIMD is implemented is across warps / thread blocks (groups of warps). It does exist, but only within 32-bit words and really only for limited use cases, since the proper way to do SIMD on the GPU is by having all of the threads execute the same instruction:
Note that I am using the Nvidia PTX documentation here. I have barely looked at the AMD RDNA documentation, so I cannot cite it without doing a bunch of reading.
Agentlien 33 days ago [-]
I know all of that. I was talking about RDNA2, which is AMD. There, instructions come in two flavours:
1. Scalar - run once per thread group, only acting on shared memory. So these won't be SIMD.
2. Vector - run across all threads, each threads accesses its own copy of the variables. This is what you typically think of GPU instructions doing.
LegionMammal978 33 days ago [-]
That does sound like it would be a pretty big limitation. But there appear to be plenty of vector instructions for 32-bit integers in RDNA2 and RDNA3 [0] [1]. They're named V_*_U32 or V_*_I32 (e.g., V_ADD3_U32), even including things like a widening multiply V_MAD_U64_U32. The only thing missing is integer division, which is apparently emulated using floating-point instructions.
It's always been less of a big deal than it used to be -- at least on "big" GPUs -- but the article isn't really about avoiding branches.
The code presented is already branchless.
The people giving out the advice seem to think they are avoiding branches as an optimisation, but their understanding of what branching code is seems to be based on whether they can see some sort of conditional construct.
“If you consult the internet about writing a branch of a GPU, you might think they open the gates of hell and let demons in. They will say you should avoid them at all costs, and that you can avoid them by using the ternary operator or step() and other silly math tricks. Most of this advice is outdated at best, or just plain wrong.
Let’s correct that.”
magicalhippo 34 days ago [-]
Processors change, compilers change. If you care about such details, best to ship multiple variants and pick the fastest one at runtime.
As I've mentioned here several times before, I've made code significantly faster by removing the hand-rolled assembly and replacing it with plain C or similar. While the assembly might have been faster a decade or two ago, things have changed...
Amadiro 33 days ago [-]
I think figuring out the fastest version of a shader at runtime is very non-trivial, I'm not aware of any game or engine that can do this.
I think it'd be possible in principle, because most APIs (D3D, GL, Vulkan etc) expose performance counters (which may or may not be reliable depending on the vendor), and you could in principle construct a representative test scene that you replay a couple times to measure different optimizations. But a lot of games are quite dynamic, having dynamically generated scenes and also dynamically generated shaders, so the number of combinations you might have to test seems like an obstacle. Also you might have to ask the user to spend time waiting on the benchmark to finish.
You could probably just do this ahead of time with a bunch of different GPU generations from each vendor if you have the hardware, and then hard-code the most important decision. So not saying it'd be impossible, but yeah I'm not aware of any existing infrastructure for this.
hansvm 33 days ago [-]
The last time I did anything like this (it was for CPU linear algebra code designed to run in very heterogeneous clusters), I first came up with a parameterization that approximated how I'd expect an algorithm to perform. Then, once for each hardware combination, you sweep through the possible parameterization space. I used log-scaled quantization to make it cheap to index into an array of function pointers based on input specifics.
The important thing to note is that you can do that computation just once, like when you install the game, and it isn't that slow. Your parameterization won't be perfect, but it's not bad to create routines that are much faster than any one implementation on nearly every architecture.
ijustlovemath 33 days ago [-]
you'd only have to test worst/median case scenes, which you could find with a bit of profiling!
dist-epoch 34 days ago [-]
Funnily enough, this is sort of what the NVIDIA drivers do: they intercept game shaders and replace them by custom ones optimized by NVIDIA. Which is why you see stuff like this in NVIDIA drivers changelog: "optimized game X, runs 40% faster"
Cieric 33 days ago [-]
I don't work on the nvidia side of things but it's likely to be the same. Shader replacement is only one of a whole host of things we can do to make games run faster. It's actually kind of rare for us to do them, since it bloats the size of the driver so much. A lot of our options do change how shaders work though, like forcing a shader to use double precision floats instead of the single it was compiled with.
chrisjj 33 days ago [-]
> A lot of our options do change how shaders work though, like forcing a shader to use double precision floats instead of the single it was compiled with.
That will break code sufficiently reliant on the behaviour of single precision, though.
Cieric 33 days ago [-]
In the case that does happen, then we don't apply that setting. Most of the changes applied are extensively tested and toggles like that are more often used for already broken shaders.
chrisjj 31 days ago [-]
> In the case that does happen
You meant "in the case we find where that does happen"?
Little consolation for the cases you don't find because e.g. you cannot afford to play a game to the end.
Cieric 29 days ago [-]
Fair enough, I can't say anything I've done has ever caused an issue like that (A new ticket would have been made and sent to me.) But I also can't say that it has never happened, so I'm not really in a position to disagree. We do have a good QA team though and we have an "open" beta program that also catches a lot of issues before they become more widely public.
chrisjj 29 days ago [-]
I for one would be most unhappy to find my game failing on card X due to drivers perverting my float type.
david-gpu 33 days ago [-]
Obviously, which is the reason you don't do something like that without appropriate amounts of testing.
chrisjj 33 days ago [-]
Testing on platforms that pervert the precision falls outside "appropriate", I would say.
Dylan16807 33 days ago [-]
Code depending on specific low precision is very likely to be so fragile it won't make it anywhere near release.
account42 33 days ago [-]
"Fragile shader code won't make it anywhwere near release" is a pretty bold claim.
If it happens to work with the Nvidia driver it's getting shipped.
Dylan16807 32 days ago [-]
But since the premise is that any difference breaks the code, will it work the exact same way across different devices? That's what I'm skeptical of.
Unless every dev and tester that touches this shader is using the same hardware, which seems like an obvious mistake to avoid...
Cieric 32 days ago [-]
I will note, half of the customer-facing bugs I get are "works on nvidia", only to find out that it is a problem with the game and not the driver. Nvidia allows you to ignore a lot of the spec, and it causes game devs to miss a lot of obvious bugs. A few examples:
1) Nvidia allows you to write to read-only textures; game devs will forget to transition them to writable, and it will appear as corruption on other cards.
2) Nvidia automatically works with diverging texture reads, so devs will forget to mark them as a nonuniform resource index, which shows up as corruption on other cards.
3) Floating point calculations aren't IEEE compliant. One bug I fixed was x/width*width != x: on Nvidia this ends up a little higher and on our cards a little lower. The game this happened on ended up flooring that value and doing a texture read, which, as you can guess, showed up as corruption on our cards (see the sketch below).
1 and 2 are specifically required by the Microsoft DirectX 12 spec, but most game devs aren't reading that and bugs creep in. 3 is a difference in how the ALU is designed, our cards being a little closer to IEEE compliant. A lot of these issues are related to how the hardware works, so they stay pretty consistent between the different GPUs of a manufacturer.
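A hypothetical GLSL-flavoured sketch of the kind of pattern described in 3) - the names and exact arithmetic are invented, and the original would have been HLSL/DXIL:

    // a coordinate is divided and re-multiplied by the texture width before
    // being floored; whether the round trip lands just above or just below the
    // integer depends on the GPU's rounding, so floor() can differ by one
    int texelColumn(float x, float width) {
        float roundTripped = (x / width) * width;   // "should" equal x
        return int(floor(roundTripped));            // off by one on some hardware
    }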
Side note: I don't blame the devs for #3, the corruption was super minor and the full calculation was spread across multiple functions (assumed by reading the dxil). The only reason it sticks out in my brain though is because the game devs were legally unable to ever update the game again, so I had to fix it driver side. That game was also Nvidia sponsored, so it's likely our cards weren't tested till very late into the development. (I got the ticket a week before the game was to release.) That is all I'm willing to say on that, I don't want to get myself in trouble.
Cieric 31 days ago [-]
> Floating point calculations aren't IEEE compliant
Too late to edit, but I want to half-retract this statement: they are IEEE compliant, but due to optimizations that can be applied by the driver developers they aren't guaranteed to be. This is assuming that the accuracy of a multiply and divide is specified in the IEEE floating point spec; I'm seeing hints that it is, but I can't find anything concrete.
chrisjj 31 days ago [-]
> the game devs were legally unable to ever update the game again, so I had to fix it driver side.
Could they not intercept the calls to inject a fix?
Cieric 29 days ago [-]
I'm just going off what I was told there; I was forced to make the fix since the game developers were no longer partnered with the company that owned the license to the content.
chrisjj 31 days ago [-]
PS Why is /width*width not eliminated on compilation?
Cieric 29 days ago [-]
Good question. I'm assuming it's due to the calculation happening across a memory barrier of some kind, or due to all the branches in between, so llvm is probably avoiding the optimization. It was quite a while ago, so it is something I could re-investigate and actually try to fix. I would have to wait for downtime with all the other tickets I'm getting though. It's also just something that dxc itself should be doing, but I have no control over that.
chrisjj 31 days ago [-]
> But since the premise is that any difference breaks the code,
Not any.
> will it work the exact same way across different devices
Yes, where they run IEEE FP format.
Dylan16807 31 days ago [-]
> Not any.
Just about any. It's pretty difficult to write code where changing the rounding of the last couple bits breaks it (as happens if you use wider types during the calculation), but other changes don't break it.
What real code have you seen with that behavior?
chrisjj 29 days ago [-]
> but other changes don't break it.
Why should that be a requirement? Obviously the driver can make other changes that can break any code.
Dylan16807 28 days ago [-]
> Why should that be a requirement?
Originally I said "the premise is that any difference breaks the code".
You replied with "Not any."
That is where the requirement comes from, your own words. This is your scenario, and you said not all differences would break the hypothetical code.
This is your choice. Are we talking about code where any change breaks it (like a seeded/reproducible RNG), or are we talking about code where there are minor changes that don't break it but using extra precision breaks it? (I expect this category to be super duper rare)
chrisjj 27 days ago [-]
> Originally I said "the premise is that any difference breaks the code".
> You replied with "Not any."
> That is where the requirement comes from, your own words.
I wasn't stating a requirement. I was disputing your report of the premise.
> or are we talking about code where there are minor changes that don't break it but using extra precision breaks it?
Yup.
Dylan16807 24 days ago [-]
> > or are we talking about code where there are minor changes that don't break it but using extra precision breaks it?
> Yup
Then I will ask again, have you ever seen code that falls into this category? I haven't. And an RNG would not fall into this category.
chrisjj 33 days ago [-]
Consider an RNG.
Dylan16807 33 days ago [-]
I consider a floating point RNG in a shader that is seeded for reproducibility to be a bad idea.
chrisjj 31 days ago [-]
Why, pray?
Dylan16807 31 days ago [-]
GPUs often have weird sizes for optimal behavior and you have to depend on math-based optimizations never being applied to your code.
In my opinion, floating point shaders should be treated as a land of approximations.
You asked in another comment why /width*width isn't optimized out by the compiler. But it's changes just like that that will break an RNG!
chrisjj 29 days ago [-]
> In my opinion, floating point shaders should be treated as a land of approximations.
Fine, but that leaves you responsible for the breakage of the shader of an author who holds the opposite opinion, as he is entitled to do. Precision =/= accuracy.
Dylan16807 28 days ago [-]
But do you want /width*width to be optimized? Or associative and commutative operations in general? Then you have to reject the opposite opinion.
chrisjj 26 days ago [-]
> But do you want /width*width to be optimized?
I do. But I don't see such optimisation as having anything to do with your changing the float type.
Dylan16807 24 days ago [-]
Changes like that will break just as much code as adding extra precision will. Because it will change how things round, and not much else, just like adding extra precision. They're both slightly disruptive, and they tend to disrupt the same kind of thing, unlike removing precision which is very disruptive all over.
SpaghettiCthulu 33 days ago [-]
> A lot of our options do change how shaders work though, like forcing a shader to use double precision floats instead of the single it was compiled with.
What benefit would that give? Is double precision faster than single on modern hardware?
Cieric 33 days ago [-]
That's specifically because GPUs aren't IEEE compliant, and calculations will drift differently on different GPUs. Double precision can help avoid divide-by-zero errors in some shaders, because most don't guard against that, and NaNs propagate easily and show up as visual corruption.
dahart 32 days ago [-]
What GPUs aren’t IEEE compliant? I don’t think that’s been true for more than a decade.
Cieric 31 days ago [-]
After a bunch of testing and looking around I think I should actually change my statement. GPUs do offer IEEE floating point compliance by default, but don't strictly adhere to it. Multiple optimizations that can be applied by the driver developers can massively affect the floating point accuracy.
This is all kind of on the assumption that the accuracy of floating point multiplication and division is in the IEEE spec; I was told before that it was, but searching now I can't seem to find it one way or the other.
I believe one of the optimizations done by nvidia is to drop f32 variables down to f16 in a shader, which would technically break the accuracy requirement (as before, if it exists). I don't have anything I can offer as proof of that due to NDA, sadly. I will note that most of my testing and work is done in PIX for Windows, and most games don't have anti-cheat so they're easy to capture.
dahart 31 days ago [-]
What shaders (presumably GLSL & HLSL) do precision wise isn’t an IEEE compliance issue, it’s either a DX/Vulkan spec issue, OR a user compiler settings issue. Dropping compliance is and should be allowed when the code asks for it. This is why GLSL has lowp, mediump, and highp settings. I think all GPUs are IEEE compliant and have been for a long time.
Cieric 29 days ago [-]
I agree on the dropping-compliance-when-asked-for aspect; the problem I'm referring to is more the driver dropping compliance without the game asking for it. If the underlying system can randomly drop compliance whenever it thinks it's fine, without telling the user and without the user asking, I would not consider that compliant.
dahart 28 days ago [-]
That is fair, if true. But I'm very skeptical any modern drivers are dropping compliance without being asked. The one possibility I could buy is that you ran into a case of someone having dropped precision intentionally in a specific game in order to "optimize" a shader. Otherwise, precision and IEEE compliance are the prerogative of the compiler. The compiler is sometimes in the driver, but it never decides on its own what precision to use; it uses either default or explicit precision settings. The only reason it would not produce IEEE compliant code is if it was being asked to.
dahart 33 days ago [-]
Only more precision. But no, doubles are not faster. At best they’re the same instruction latency & throughput as singles, and that’s only on a few expensive pro/datacenter GPUs. Even if they are technically the same instruction speed, they’re still 2x the memory & register usage, which can compromise perf in other ways. Doubles on consumer GPUs are typically anywhere from 16 to 64 times slower than singles.
FWIW, I’ve never heard of shader replacement to force doubles. It’d be interesting to hear when that’s been used and why, and surprising to me if it was ever done for a popular game.
kevingadd 33 days ago [-]
I would guess it's done for compatibility or quality reasons
crazygringo 33 days ago [-]
Wow, how did they pick which games to optimize?
Does the studio pay them to do it? Because Nvidia wouldn't care otherwise?
Does Nvidia do it unasked, for competitive reasons? To maximize how much faster their GPU's perform than competitors' on the same games? And therefore decide purely by game popularity?
Or is it some kind of alliance thing between Nvidia and studios, in exchange for something like the studios optimizing for Nvidia in the first place, to further benefit Nvidia's competitive lead?
Cieric 33 days ago [-]
I can't give details on how we do our selections (not nvidia but another gpu manufacturer). But we do have direct contacts into a lot of studios, and we do try to help them fix their game if possible before ever putting something in the driver to fix it. Studios don't pay us; it's mutually beneficial for us to improve the performance of the games. It also helps the games run better on our cards by avoiding some of the really slow stuff.
In general, if our logo is in the game, we helped them by actually writing code for them; if it's not, then we might have only given them directions on how to fix issues in their game, or put something in the driver to tweak how things execute. From an outside perspective (but still inside the gpu space), nvidia does give advice to keep their competitive advantage. In my experience so far: ignoring barriers that are needed as per the spec, defaulting to massive numbers when the gpu isn't known ("batman and tessellation" should be enough to find that), and then doing outright weird stuff that doesn't look like something any sane person would do in writing shaders (I have a thought in my head for that one, but it's not considered public knowledge).
flohofwoe 33 days ago [-]
AFAIK NVIDIA and AMD do this unasked for popular game releases, because it gives them a competitive advantage if 'popular game X' runs better on NVIDIA than AMD (and vice versa). If you're an AAA studio you typically also have a 'technical liaison' at the GPU vendors though.
It's basically an arms race. This is also the reason why graphics drivers for Windows are so frigging big (also AFAIK).
diegoperini 33 days ago [-]
Double "AFAIK" makes me trust you more :)
I think this is very accurate. The exception is probably those blockbuster games. Those probably get direct consultancy from NVIDIA during development to make them NVIDIA-ready from day 1.
account42 33 days ago [-]
Even smaller games often get some amount of consultation from Nvidia/AMD - they are typically in the credits in some form.
esperent 34 days ago [-]
I'd love to read more about this, what kind of changes they make and how many games they do it for. Do they ever release technical articles about it?
sigmoid10 33 days ago [-]
The other commenter makes it sound a bit more crazy than it is. "Intercept shaders" sounds super hacky, but in reality, games simply don't ship with compiled shaders. Instead they are compiled by your driver for your exact hardware. Naturally that allows the compiler to perform more or less aggressive optimisations, similar to how you might be able to optimise CPU programs by shipping C code and only compiling everything on the target machine once you know the exact feature sets.
flohofwoe 33 days ago [-]
Graphics drivers on Windows definitely do plenty of 'optimizations' for specific game executables, from replacing entire shaders to 'massaging/fixing' 3D-API calls.
And AFAIK Proton does things like this too, but for different reasons (fixing games that don't adhere to the D3D API documentation and/or obviously ignored D3D validation layer messages).
crazygringo 33 days ago [-]
I don't know -- if that other commenter is correct, it does sound pretty crazy.
Improving your compiler for everybody's code is one thing.
But saying, if the shader that comes in is exactly this code from this specific game, then use this specific precompiled binary, or even just apply these specific hand-tuned optimizations that aren't normally applied, that does seem pretty crazy to me.
But I don't know which it is?
Cieric 33 days ago [-]
Fingerprinting based on shaders is quite rare; really, most of the time we detect things like the exe name calling us, or sometimes, very rarely, they will give us a better name through an extension (unreal engine does this automatically). From there all the options are simple, but full shader replacements are one. In the api I work on the shaders have a built in hash value, so that, along with the game being identified, means we know exactly what shader it is. Most of the replacements aren't complicated though; it's just replacing slow things with faster things for our specific hardware. In the end we are the final compiler, so us tweaking things to work better should be expected to a degree.
chrisjj 33 days ago [-]
> most of the time we detect things like the exe name calling us
What could possibly go wrong? :)
macbr 33 days ago [-]
In the case of the Minecraft mod Sodium, which replaces much of Minecraft's rendering internals, Nvidia optimisations caused the game to crash.
So the mod devs had to implement workarounds to stop the driver from detecting that Minecraft is running... (changing the window title among other things)
And plain Minecraft did the opposite by adding -XX:JavaHeapDumpPath=MojangTricksIntelDriversForPerformance_javaw.exe_minecraft.exe.heapdump to the command line, when they changed the way the game started so that it wasn't detected as minecraft.exe any more. (Side effect: if you trigger a heap dump, it gets that name)
chrisjj 31 days ago [-]
Yuuuk.
Benanov 33 days ago [-]
quack.exe performing noticeably worse in benchmarks on certain cards than quake.exe in the late 90's/early 2000's?
> or games running like crap because they're not properly tested on our cards.
The decision on what game testing is proper surely lies with the game dev, not the card dev.
Plus game devs generally know that non-testing cannot cause a game to run like crap. Card devs ought to know that too.
Card devs also ought to respect that a game dev may have a good business reason for leaving his game running like crap on certain cards.
Cieric 29 days ago [-]
I would gauge properly testing on our cards to be running the game occasionally on our cards throughout the whole development process. Currently it's been a lot of games, two weeks out from release, dropping a ton of issues on us to try and figure out. I don't even care if they don't do any optimization for our cards, as long as they give us a fighting chance to figure out the issues on our side.
> Card devs also ought to respect that a game dev may have a good business reason for leaving his game running like crap on certain cards.
I can't agree with this though; business decisions getting in the way of gamers enjoying games should never be something that is settled for. If the hardware is literally too old and you just can't get it to work, fine, but when a top of the line card is running below 60 fps, that's just negligence.
chrisjj 29 days ago [-]
> business decisions getting in the way of gamers enjoying games should never be something that is settled for.
That's between the game dev and his customer. None of a card dev's business.
snicker7 34 days ago [-]
Imagine being the dev of competing game Y and seeing the changelog.
surajrmal 34 days ago [-]
It wouldn't be surprising to find out Nvidia talks directly with game developers to give them hints as to how to optimize their games
account42 33 days ago [-]
It wouldn't be surprising at all seeing as they are even in the credits for most games.
chrisjj 33 days ago [-]
> "optimized game X, runs 40% faster"
... and looks 40% crappier? E.g. stuttery, because the driver does not get to see the code ahead of time.
alexvitkov 33 days ago [-]
This would be acceptable if it meant adding one more shader, but with "modern" graphics APIs forcing us to sometimes have thousands of permutations for the same shader, every variant you add multiplies that count by 2x.
We also don't have an infinite amount of time to work on each shader. You profile on the hardware you care about, and if the choice you've made is slower on some imaginary future processor, so be it - hopefully that processor is faster enough that this doesn't matter.
account42 33 days ago [-]
Graphics APIs don't force you to have thousands of shaders. The abstraction in your engine might.
qwery 33 days ago [-]
Some of the mistakes/confusion being pointed out in the article is being replicated here, it seems.
The article is not claiming that conditional branches are free. In fact, the article is not making any point about the performance cost of branching code, as far as I can tell.
The article is pointing out that conditional logic in the form presented does not get compiled into conditionally branching code.
And that people should not continue to propagate the harmful advice to cover up every conditional thing in sight[0].
Finally, on actually branching code: that branching code is more complicated to execute is self-evident. There are no free branches. Avoiding branches is likely (within reason) to make any code run faster. Luckily[1], the original code was already branchless. As always, there is no universal metric to say whether optimisation is worthwhile.
[0] the "in sight" is important -- there's no interest in the generated code, just in the source code not appearing to include conditional anythings.
[1] No luck involved, of course ... (I assume people wrote to IQ to suggest apparently glaringly obvious (and wrong) improvements to their shader code, lol)
londons_explore 34 days ago [-]
So why isn't the compiler smart enough to see that the 'optimised' version is the same?
Surely it understands "step()" and can optimize the "step()==0.0" and "step()==1.0" cases separately?
This is presumably always worth it, because you would at least remove one multiplication (usually turning it into a conditional load/store/something else)
NohatCoder 33 days ago [-]
It may very well be; it is the type of optimisation where it is quite possible that some compilers do it some of the time, but it is definitely also possible to write a version that the compiler can't grok.
mbel 33 days ago [-]
Yup, they most likely do. After all everything is LLVM based nowadays.
account42 33 days ago [-]
That's not true for shader compilers included in drivers - some use LLVM but definitely not all of them.
Cieric 32 days ago [-]
The other part of the optimization issue is that you can't take too long to try anything and everything. Most of the optimizations happen on the driver side, and anything that takes too long will show up as shader compilation stutter. I can't say currently if this is or isn't done; it's just always something you have to think about.
ttoinou 34 days ago [-]
Thanks Inigo !
The second wrong thing with the supposedly optimizer version is that it actually runs much slower than the original version. The reason is that the step() function is actually implemented like this:
    float step( float x, float y )
    {
        return x < y ? 1.0 : 0.0;
    }
How are we supposed to know which OpenGL functions are emulated rather than calling GPU primitives?
Const-me 34 days ago [-]
The only way is to do what OP did – compile your shader, disassemble, and read the assembly.
I do that quite often with my HLSL shaders; I've learned a lot about that virtual instruction set. For example, it's interesting that GPUs have a sincos instruction, but inverse trigonometry is emulated while compiling.
account42 33 days ago [-]
Why are you supposed to know?
Because you care about performance?
step being implemented as a library function on top of a conditional doesn't really say anything about its performance vs being a dedicated instruction. Don't worry about the implementation.
Because you are curious about GPU architectures?
Look at disassembly, (open source) driver code (including LLVM) and/or ISA documentation.
SideQuark 33 days ago [-]
I've never seen a GPU with special primitives for any functions beyond what you'd see in PC-style assembly. Every time I've looked at a decompiled shader, it's always been pretty much what you think of in C.
Also, specs like OpenGL specify a lot of intrinsic behavior, which is then implemented as per the spec, using standard assembly instructions.
Find an online site that decompiles to various architectures.
Waterluvian 34 days ago [-]
This is a great question that I see everywhere in programming and I think it is core to why you measure first when optimizing.
You generally shouldn't know or care how a built-in is implemented. If you do care, you're probably thinking about optimization. At that point the answer is "measure and find out what works better."
TeMPOraL 33 days ago [-]
EDIT: I see my source of confusion must be that "branch" must have a well-understood hardware-specific meaning that goes beyond the meaning I grew up with, which is that a conditional is a branch, because the path control takes (at the machine code level) is chosen at runtime. This makes a conditional jump a branch by definition.
> How are we supposed to know what OpenGL functions are emulated rather than calling GPU primitives?
To me the problem was obvious, but then again I'm having trouble with both your and author's statements about it.
The problem I saw was, obviously by going for a step() function, people aren't turning logic into arithmetic, they're just hiding logic in a library function call. Just because step() is a built-in or something you'd find used in mathematical paper doesn't mean anything; the definition of step() in mathematics is literally a conditional too.
Now, the way to optimize it properly to have no conditionals, is you have to take a continuous function that resembles your desired outcome (which in the problem in question isn't step() but the thing it was used for!), and tune its parameters to get as close as it can to your target. I.e. typically you'd pick some polynomial and run the standard iterative approximation on it. Then you'd just have an f(x) that has no branching, just a bunch of extra additions and multiplications and some "weirdly specific" constants.
Where I don't get the author is in insisting that conditional move isn't "branching". I don't see how that would be except in some special cases, where lack of branching is well-known but very special implementation detail - like where the author says:
> also note that the abs() call does not become a GPU instruction and instead becomes an instruction modifier, which is free.
That's because we standardized on two's complement representation for ints, which has the convenient quality of isolating sign as the most significant bit, and for floats the representation (IEEE-754) was just straight up designed to achieve the same. So in both cases, abs() boils down to unconditionally setting the most significant bit to 0 - or, equivalently, masking it off for the instruction that's reading it.
step() isn't like that, nor any other arbitrary ternary operation construct, and nor is - as far as I know - a conditional move instruction.
As for where I don't get 'ttoinou:
> How are we supposed to know what OpenGL functions are emulated rather than calling GPU primitives
The basics like abs() and sqrt() and basic trigonometry are standard knowledge, the rest... does it even matter? step() obviously has to branch somewhere; whether you do it yourself, let a library do it, or let the hardware do it, shouldn't change the fundamental nature.
dahart 33 days ago [-]
> the meaning I grew up with, which is that a conditional is a branch
A conditional jump is a branch. But a branch has always had a different meaning than a generic “conditional”. There are conditional instructions that don’t jump, e.g. CMP, and the distinction is very important. Branch or conditional jump means the PC can be set to something other than ‘next instruction’. A conditional, such a conditional select or conditional move, one that doesn’t change the PC, is not a branch.
> take a continuous function […] Then you’d just have an f(x) that has no branching
One can easily implement conditional functions without branching. You can use a compare instruction followed by a Heaviside function on the result, evaluate both sides of the result, and sum it up with a 2D dot product (against the compare result and its negation). That is occasionally (but certainly not always) faster on a GPU than using if/else, but only if the compiler is otherwise going to produce real branch instructions.
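For what it's worth, a minimal GLSL version of that compare-plus-dot-product select (purely illustrative):

    // branch-free select: evaluate both sides, then combine with a 0/1 weight
    float selectNoBranch(float x, float y, float a, float b) {
        float t = step(0.0, x - y);               // Heaviside of the compare: 1.0 when x >= y
        return dot(vec2(a, b), vec2(t, 1.0 - t)); // a when t is 1.0, b when t is 0.0
    }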
dkersten 33 days ago [-]
Maybe I’m misunderstanding why branching is slow on a GPU. My understanding was that it’s because both sides of the branch are always executed, just one is masked out (I know the exact mechanics of this have changed), so that the different cores in the group can use the same program counter. Something to that effect, at least.
But in this case, would calculating both sides and then using a way to conditionally set the result not perform the same amount of work? Whether you’re calculating the result or the core masks the instructions out, it’s executing instructions for both sides of the branch in both cases, right?
On a CPU, the performance killer is often branch prediction and caches, but the GPU itself is executing a mostly linear set of instructions - or is my understanding completely off? I guess I don't really understand what it's doing, especially for loops.
dahart 32 days ago [-]
The primary concern is usually over the masking you’re talking about, the issue being simply that you’re proportionally cutting down the number of threads doing useful work. Using Nvidia terminology, if only one thread in a warp is active during a branch, the GPU throughput is 32x slower than it could be with a full warp.
Not all GPU branches are compiled in a straight line without jumps, so branching on a GPU does sometimes share the same instruction cache churn that the CPU has. That might be less of a big deal than thread masking, but GPU stalls still take dozens of cycles. And GPUs waiting on memory loads, whether it’s to fill the icache or anything else, are up to 32x more costly than CPU stalls, since all threads in the warp stall.
Loops are just normal branches, if they’re not unrolled. The biggest question with a loop is will all threads repeat the same number of times, because if not, the threads that exit the loop have to wait until the last thread is done. You can imagine what that might do for perf if there’s a small number of long-tail threads.
mymoomin 33 days ago [-]
A "branch" here is a conditional jump. This has the issues the article mentions, which branchless programming avoids:
Again, there is no branching - the instruction pointer isn't manipulated, there's no branch prediction involved, no instruction cache to invalidate, no nothing.
This has nothing to do with whether the behaviour of some instruction depends on its arguments. Looking at the Microsoft compiler output from the article, the iadd (signed add) instruction will get different results depending on its arguments, and the movc (conditional move) will store different values depending on its arguments, but after each the instruction pointer will just move onto the next instruction, so there are no branches.
chrisjj 33 days ago [-]
> a conditional is a branch, because the path control takes (at the machine code level) is chosen at runtime. This makes a conditional jump a branch by definition.
s/conditional is a/conditional jump is a/
Problem solved.
Non-jump conditionals have been a thing for decades.
ttoinou 33 days ago [-]
It kinda does when you’re wondering what’s going on backstage and working with shaders on multiple OSes, drivers and hardware.
> Now, the way to optimize it properly to have no conditionals, is you have to take a continuous
I suspect that we shader authors really like Clean Math, and that’s also why we like to think such “optimizations” with the step function are a nice modification :-)
burch45 33 days ago [-]
Branching means different instruction paths, so it requires reading instructions from different memory, which causes a delay when jumping to those new instructions rather than plowing ahead on the current stream. So a conditional jump is a branch, but a conditional move is just an instruction that moves one of two values into a register and doesn’t affect what code is executed next.
mirsadm 34 days ago [-]
I've been caught by this. Even Claude/ChatGPT will suggest it as an optimisation. Every time I've measured a performance drop doing this. Sometimes significant.
WJW 34 days ago [-]
Is that weird? LLMs will just repeat what is in their training corpus. If most of the internet is recommending something wrong (like this conditional move "optimization") then that is what they will recommend too.
xbar 33 days ago [-]
Not weird but important to note.
diath 33 days ago [-]
> Even Claude/ChatGPT will suggest it as an optimisation.
LLMs just repeat what people on the internet say, and people are often wrong.
cwillu 33 days ago [-]
Hmm, godbolt is showing branches in the vulkan output:
The Vulkan shader opcode language is not what’s executed. It’s a platform-neutral intermediate language, so it won’t have the special-purpose optional instructions most GPUs do, since GPUs aren’t required to have them.
It likely compiles down on the relevant platforms as the original article did.
toredo1729_2 34 days ago [-]
Unrelated, but somehow similar: I really hate it that it's not possible to force gcc to transform things like this into a conditional move:
x > c ? y : 0.;
It annoyed me many times and it still does.
fweimer 34 days ago [-]
What do you mean? Do you want to annotate the condition as unpredictable, so that the compiler always assumes that a conditional move is beneficial?
(Compilers obviously do this transformation, including GCC, but it is not always beneficial, especially on x86-64.)
toredo1729_2 33 days ago [-]
Yes, that would be great. It's not always beneficial, but in some (rare, but for me important) cases it's better. Currently, the only way to ensure a conditional move is used is to use inline assembly. This is not portable and also less maintainable than a "proper" solution.
tjalfi 33 days ago [-]
clang has the __builtin_unpredictable() intrinsic[0] for this purpose.
[0] https://clang.llvm.org/docs/LanguageExtensions.html#builtin-...
And it's not always possible! E.g. most RISC-V CPUs don't support it yet.
dzaima 33 days ago [-]
Eh, it takes ~3-4 instrs to do a branchless "x ? y : z" on baseline rv64i (depending on the format you have the condition in) via "y^((y^z)&x)", and with Zicond that only goes down to 3 instrs (they really don't want to standardize GPR instrs with 3 operands so what Zicond adds is "x ? y : 0" and "x ? 0 : y" ¯\_(ツ)_/¯; might bring the latency down by an instr or two though).
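As a concrete C sketch of that idiom (my wording of it, written so the select reads mask ? y : z), with the condition already expanded to an all-ones/all-zeros mask:

    #include <stdint.h>

    /* Branchless "mask ? y : z": with mask all-ones the xor terms cancel to y,
       with mask zero the expression collapses to z. */
    static inline uint64_t select_mask(uint64_t cond, uint64_t y, uint64_t z)
    {
        uint64_t mask = -(cond & 1);   /* expand an LSB-style condition to 0 / all-ones */
        return z ^ ((y ^ z) & mask);
    }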
IshKebab 33 days ago [-]
It's more about removing branches than instruction counts or latency.
dzaima 33 days ago [-]
The "y^((y^z)&x)" method is already branchless, and close in performance to the Zicond variant, is my point; i.e. Zicond doesn't actually add much.
IshKebab 33 days ago [-]
Are you sure? As soon as you add actual computations in you're heading through the whole execution pipeline & forwarding network, tying up ALUs, etc. Zicond can probably be handled without all that.
Also that isn't actually equivalent since `x` needs to be all 1s or all 0s surely? Neither GCC nor Clang use that method, but they do use Zicond.
dzaima 33 days ago [-]
Zicond's czero.eqz & czero.nez (& the `or` to merge those together for the 3-instr impl of the general `x?y:z`) still have to go through the execution pipeline, forwarding network, an ALU, etc just as much as an xor or and need to. It's just that there's a shorter dependency chain and maybe one less instr.
Indeed you may need to negate `x` if you have only the LSB set in it; hence "3-4 instrs ... depending on the format you have the condition in" in my original message.
I assume gcc & clang just haven't bothered considering the branchless baseline impl, rather than it being particularly bad.
Note that there's another way some RISC-V hardware supports doing branchless conditional stores - a jump over a move instr (or in some cases, even some arithmetic instructions), which they internally convert to a branchless update.
https://www.godbolt.org/z/ffEvvjhz8
PS: and it also doesn't matter whether a ternary is used or a traditional if (as one would expect):
https://www.godbolt.org/z/zjb4KdqvK
(the float version also appears to not use branches: https://www.godbolt.org/z/98bdheKK4)
For such a simple expression I would expect the compiler to pick the right output pattern based on the target CPU though...
dzaima 33 days ago [-]
Not always - https://www.godbolt.org/z/zYxeahf3T. And for any modern (as in, made in the last two decades) x86 processor the branchless version will be hilariously better if the condition is unpredictable (which is a thing the compiler can't know by itself, hence wanting to have an explicit way to request a conditional move instr) and the per-branch code takes less than like multiple dozens of cycles.
dzaima 33 days ago [-]
Worse, doing one of the idioms for a conditional move ends up getting gcc to actually produce a conditional move, but clang doesn't, even with its __builtin_unpredictable: https://www.godbolt.org/z/bq9axzvjG
ryao 33 days ago [-]
You want to pass -mllvm -x86-cmov-converter=false. I assume LLVM has a pass to undo conditional moves on x86 whenever a heuristic determines skipping a calculation by branching is cheaper than doing the calculation and using a conditional move.
Unfortunately, the heuristic that calculates the expense often gets things wrong. That is why OpenZFS passes -mllvm -x86-cmov-converter=false to Clang for certain files where the LLVM heuristic was found to do the wrong thing:
https://github.com/openzfs/zfs/commit/677c6f8457943fe5b56d7a...
There is an open LLVM issue regarding this:
https://github.com/llvm/llvm-project/issues/62790
The issue explains why __builtin_unpredictable() does not address the problem. In short, the metadata is dropped when an intermediate representation is generated inside clang since the IR does not have a way to preserve the information.
dzaima 33 days ago [-]
Yeah, the need for the "-mllvm -x86-cmov-converter=false" hack is stupid; forgot to check with it. In my mind I guess I conflated that being fixed with https://reviews.llvm.org/D118118 (and indeed __builtin_unpredictable does work with ?:), but, no, that flag still improves things today for the rest of the cases.
33 days ago [-]
ryao 33 days ago [-]
It just occurred to me that Clang and GCC did not necessarily fail to use conditional moves in your examples. While they failed to use explicit cmov instructions, a cmp followed by a jump over a single mov is actually an idiom for an implicit cmov. Some CPU instruction decoders are able to turn it into a cmov without an explicit cmov instruction. In the case of RISC-V, the designers are philosophically opposed to explicit cmov instructions and expect compilers to generate this idiom for CPUs that support cmov. I asked them to implement cmov virtual instructions to be nice to people reading RISC-V assembly, but I am not sure if anything will come of it:
https://github.com/riscv/riscv-bitmanip/issues/185
I do not know if any x86 CPUs recognize the implicit cmov idiom offhand, but if any do, then while an extra instruction was used, the conditional move would still be done on those that recognize the idiom.
By the way, I just noticed a case where you really don’t want the compiler to generate a cmov, explicit or otherwise since it would risk division by zero:
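The exact snippet isn't reproduced here, but a hypothetical example of the shape being described (an outer ternary guarding a division, with MIN() supplying the inner ternary) might look something like this:

    /* Hypothetical sketch, not the actual code referenced above. */
    #define MIN(a, b) ((a) < (b) ? (a) : (b))

    int scaled(int x, int y)
    {
        /* The outer ternary must stay a real branch: lowering it to a cmov
           would evaluate y / x unconditionally and trap when x == 0.
           The inner MIN() ternary is safe to lower to a cmov. */
        return x != 0 ? MIN(100, y / x) : 0;
    }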
Interestingly, Clang correctly does not generate a cmov (implicit or explicit) for the outer ternary operation, while it does generate an explicit cmov for the inner ternary operator in MIN() without -mllvm -x86-cmov-converter=false. Passing -mllvm -x86-cmov-converter=false to Clang does not change the output, which makes Clang’s behavior correct.
GCC will not generate cmov for either ternary operator, which while also technically correct, is slow. This could still have been an implicit conditional move had GCC not avoided the implicit cmov idiom.
Using GCC’s __builtin_expect_with_probability() in MIN() does not cause GCC to change its output. If we remove the outer ternary, GCC will happily generate a cmov instruction. Given that GCC generally assumes that undefined behavior is not invoked to make code faster and will happily generate the cmov when there is a division by 0 bug, it is odd that upon seeing a check that verifies the assumption GCC made is true, GCC decides to stop generating a cmov. I am sure the way GCC does things is much more complicated than my interpretation of the output, but the behavior is odd enough to merit a comment.
dzaima 33 days ago [-]
I haven't heard of anything outside RISC-V having jump-over-mov as an idiom (though I've heard of potentially some CPUs having the ability to unwind only necessary parts on mispredictions over small bits of code or something; still some misprediction penalty though I believe; and even with -march=haswell behavior doesn't change).
I find the RISC-V solution (which fwiw I mentioned in a sibling thread[0]) rather sad; there's no way to check whether it's implemented, and even where it is I could imagine it being problematic (i.e. if the instructions cross a fetch block or cacheline or something and it gets ran as a branch, or some instrs around it break the fusion pattern checking), and where it's unsupported or otherwise doesn't work properly it'll "work" but be horrifically slow.
fwiw I haven't ever seen __builtin_expect_with_probability actually do anything for unpredictable branches; I just included it in my compiler explorer link for completeness.
Using a version of MIN that caches the X/Y computations gets gcc to produce a cmov, but makes clang's output longer: https://www.godbolt.org/z/6h8obxKG8
You might want to give feedback to the risc-v developers (although it might be too late at this point). What is the way to check if implicit cmov instructions are implemented in the CPU instruction decoder?
If AMD did not implement this in Zen 5, maybe we could ask them to add it in Zen 7 or 8. I assume it would be too late to ask them to add this in Zen 6.
Thanks for the caching tip.
dzaima 33 days ago [-]
There's of course no "way" to check, as it's a microarchitectural property. Your best bet is comparing performance of the same code on predictable vs unpredictable branches.
I don't think there's any need for x86 cores to try to handle this; it's just a waste of silicon for something doable in one instruction anyway (I'd imagine that additionally instruction fusion is a pretty hot path, especially with jumps involved; and you'll get into situations of conflicting fusions as currently cmp+jcc is fused, so there's the question of whether cmp+jcc+mov becomes (cmp+jcc)+mov or cmp+(jcc+mov), or if you have a massive three-instruction four-input(?) fusion).
Oh, another thing I don't like about fusing condjump+mv - it makes it stupidly more non-trivial to intentionally use branches on known-predictable conditions for avoiding the dependency on both branches.
ryao 33 days ago [-]
> There's of course no "way" to check, as it's a microarchitectural property. Your best bet is comparing performance of the same code on predictable vs unpredictable branches.
I was afraid the answer to my question would be that, but since my read of your previous comment “there's way to check whether it's implemented” seemed to suggest you knew a way I did not, I had my fingers crossed. At least, it had been either that you knew a trick I did not, or that a typo had deleted the word “no”.
> I don't think there's any need for x86 cores to try to handle this; it's just a waste of silicon for something doable in one instruction anyway (I'd imagine that additionally instruction fusion is a pretty hot path, especially with jumps involved; and you'll get into situations of conflicting fusions as currently cmp+jcc is fused, so there's the question of whether cmp+jcc+mov becomes (cmp+jcc)+mov or cmp+(jcc+mov), or if you have a massive three-instruction four-input(?) fusion).
Interestingly, the RISC-V guys seem to think that adding an explicit instruction is a waste of silicon while adding logic to detect the idiom to the instruction decoder is the way to go. x86 cores spend enormous amounts of silicon on situational tricks to make code run faster. I doubt spending silicon on one more trick would be terrible, especially since a number of other tricks to extract more performance from things likely apply to even more obscure situations. As for what happens in the x86 core, the instruction decoder would presumably emit what it emits for the explicit version when it sees the implicit version. I have no idea what that is inside an x86 core. I suspect that there are some corner cases involving the mov instruction causing a fault to handle (as you would want the cpu to report that the mov triggered the fault, not the jmp), but it seems doable given that they already had to handle instruction faults in other cases of fusion.
Also, if either of us were sufficiently motivated, we might be able to get GCC to generate better code through a plugin that will detect the implicit cmov idiom and replace it with an explicit cmov:
Note that I have not confirmed whether their plugins are able to hook the compiler backend where they would need to hook to do this.
Of course, such plugins won’t do anything for all of the existing binaries that have the implicit idiom or any new binaries built without the plugins, but they could at least raise awareness of the issue. It is not a full solution since compilers don’t emit the implicit cmov idiom in all cases where a cmov would be beneficial, but it would at least address the cases where they do.
dzaima 33 days ago [-]
> since my read of your previous comment seemed to suggest you knew a way I did not, I had my fingers crossed.
Whoops, typo! edited.
> Interestingly, the RISC-V guys seem to think that adding an explicit instruction is a waste of silicon while adding this to the instruction decoder is the way to go
From what I've read, the thing they're against (or at least is a major blocker) is having a standard GPR instruction that takes 3 operands, as all current GPR instrs take a max of two. I cannot imagine there being any way that fusing instructions is less silicon than a new instruction whatsoever; if anything, it'd be not wanting to waste opcode space, or being fine with the branchy version (which I'm not).
Zen 4, at least as per Agner's microarchitecture optimization guide, only fuses nops and cmp/test/basic_arith+jcc; not that many tricks, only quite necessary ones (nops being present in code alignment, and branches, well, being basically mandatory every couple instructions).
No need for a plugin; it is possible to achieve branchess moves on both as-is: https://www.godbolt.org/z/eojqMseqs. A plugin wouldn't be any more stable than that mess. (also, huh, __builtin_expect_with_probability actually helped there!)
I'd imagine a major problem for the basic impls is that the compiler may early on lose the info that the load can be ran in both cases, at which point doing it unconditionally would be an incorrect transformation.
ryao 33 days ago [-]
I had suggested the virtual instructions to the RISC-V developers to eliminate branchy cmov assembly, as I am not happy with it either. It is surprising to realize that x86 cores are not making more use of macro-ops fusion, contrary to my expectation, but I guess it makes sense now that I think about it. Their designers have plenty of other knobs for tuning performance and the better their branch predictor becomes, the less this actually matters outside of the cases where developers go out of their way to use cmov.
A plugin would handle cases where the implicit idiom is emitted without needing the developer to explicitly try to force this. As far as I know, most people don’t ever touch conditional moves on the CPU and the few that do (myself included) only bother with it for extremely hot code paths, which leaves some low-hanging fruit on the table, particularly when the compiler is emitting the implicit version by coincidence. The safety of the transformation as a last pass in the compiler backend is not an issue since the output would be no more buggy than it previously was (as both branches are already calculated). Trying to handle all cases (the not-so-low-hanging fruit) is where you have to worry about incorrect transformations.
dzaima 33 days ago [-]
Ah, your gcc example does have the branchful branch that could be done branchlessly; I was thinking about my original example with a load, which can't be transformed back.
I think the core problem is that when writing code like this you need experience to be sure that it won’t have a conditional branch. How many operations past the conditional cause a branch? Which operations can the compiler elide to bring the total below this count? I’m all for writing direct code and relying on smart compilers, but it’s often hard to know if and where I’m going to get bitten. Do I always have to inspect the assembly? Do I need a performance testing suite to check for accidental regressions? I find it much easier if I can give the compiler a hint on what I expect it to do; this would be similar to a @tailcall annotation. That way I can explore the design space without worrying that I’ll accidentally overstep some hard-to-reason-about boundary that will tank the performance.
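There's no portable way to demand this today; as a sketch, the closest existing knob is clang's __builtin_unpredictable mentioned earlier in the thread, and it is only a hint, not a guarantee:

    float clamp_low(float x, float c, float y)
    {
    #if defined(__clang__)
        /* Hint that the condition is unpredictable, nudging clang toward a
           conditional move rather than a branch. Still not a guarantee. */
        return __builtin_unpredictable(x > c) ? y : 0.0f;
    #else
        return x > c ? y : 0.0f;
    #endif
    }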
DrNosferatu 33 days ago [-]
This should be quantified and generalized for a full set of cases - that way the argument would stand far more clearly.
It's funny because I've rarely seen this (wrong approach) done anywhere else, but I picked it up by myself (like a lot of people did, I presume) and am still the first to do it every time I see the occasion - not so much for optimization (though I admit I thought it wouldn't hurt) but for the flow and natural look of it. It feels somehow more right to me to compose effects by signal interpolation rather than clear ternary branch instructions.
Now I'll have to change my ways in fear of being rejected socially for this newly approved bad practice.
At least in WebGPU's WGSL we have the `select` instruction that does that ternary operation hidden as a method, so there is that.
grumpy_coder 33 days ago [-]
I believe the conclusion is correct in 2025, but the article in a way just perpetuates the 'misinformation', making it seem like finding if your code will compile to a dynamic branch or not is easier than it is.
The unfortunate truth with shaders is that they are compiled by the user's machine at the point of use. So compiling it on just your machine isn't nearly good enough. NVIDIA pricing means large numbers of customers are running 10 year old hardware. Depending on target market you might even want the code to run on 10 year old integrated graphics.
Does 10 year old integrated graphics across the range of drivers people actually have running prefer conditional moves over more arithmetic ops.. probably, but I would want to keep both versions around and test on real user hardware if this shader was used a lot.
mahkoh 34 days ago [-]
So, if you ever see somebody proposing this
float a = mix( b, c, step( y, x ) );
The author seems unaware of
float a = mix( b, c, y > x );
which encodes the desired behavior and also works for vectors:
The variants of mix where a is genBType select which vector each returned component comes from. For a component of a that is false, the corresponding component of x is returned. For a component of a that is true, the corresponding component of y is returned.
Thorrez 34 days ago [-]
The author doesn't seem to say that mix should be avoided. Just that you shouldn't replace a ternary with step+mix. In your quote, you left out the 2nd half of the sentence: "as an optimization to [ternary]".
mahkoh 33 days ago [-]
The author frames his post to be about education:
please correct them for me. The misinformation has been around for 20 years
But his education will fail as soon as you're operating on more than scalars. It might in fact do more harm than good since it leads the uneducated to believe that mix is not the right tool to choose between two values.
dahart 33 days ago [-]
If you only pass a boolean 0 or 1 for the “a” mix parameter, when is using mix better than a ternary? Can you give an example? I’m not sure mix is ever the right tool to choose between two values. It’s a great tool for blending two values, for linear interpolation when “a” is between 0 and 1. But if “a” is only 0 or 1, I don’t think mix will help you, and it could potentially hurt if the two values you mix are expensive function calls.
mahkoh 33 days ago [-]
a can be a vector of booleans.
Thorrez 33 days ago [-]
Does ternary work for that? Inigo is saying "if ternary is working, leave it as ternary". I think you're talking about a situation where ternary wouldn't work, so Inigo isn't saying anything about that situation.
dahart 32 days ago [-]
So what? That’s a language convenience, not a performance feature. Calling mix with a vec4 is no faster than using 4 scalar mix calls, and possibly much slower than using 4 ternaries. If your x & y params are expensive functions, and you use a vector of booleans, then you might be really shooting yourself in the foot.
sva_ 33 days ago [-]
I'm going to go out on a limb and claim the author is aware:
I don’t know enough about these implementations to know if this can be interpreted as a blanket ‘conditionals are fine’ or, rather, ‘ternary operations which select between two themselves non-branching expressions are fine’.
Like does this apply if one of the two branches of a conditional is computationally much more expensive? My (very shallow) understanding was that having, eg, a return statement on one branch and a bunch of work on the other would hamstring the GPU’s ability to optimize execution.
dahart 33 days ago [-]
A GPU/SIMT branch works by running both sides, unless all threads in the thread group (warp/wavefront) make the same branch decision. As long as both paths have at least one thread, the GPU will run both paths sequentially and simply set the active mask of threads for each side of the branch. In other words, the threads that don’t take a given branch sit idle while the active threads do their work. (Note “sit idle” might involve doing all the work and throwing away the result.)
If you have two branches, and one is trivial while the other is expensive, and if the compiler doesn’t optimize away the branch already, it may be better for performance to write the code to take both branches unconditionally, and use a conditional assignment at the end.
It’s worth knowing that often there are clever techniques to completely avoid branching. Sometimes these techniques are simple, and sometimes they’re invasive and difficult to implement. It’s easy (for me, anyway) to get stuck thinking in a single-threaded CPU way and not see how to avoid branching until you’ve bumped into and seen some of the ways smart people solve these problems.
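A sketch of that pattern, with hypothetical cheap_path()/expensive_path() functions standing in for the two sides of the original branch:

    float cheap_path(float v)     { return v * 0.5f; }
    float expensive_path(float v) { return v * v * v + 1.0f; }

    float shade(float v, int take_expensive)
    {
        float a = cheap_path(v);        /* compute both sides unconditionally */
        float b = expensive_path(v);
        return take_expensive ? b : a;  /* conditional assignment (a select), no jump */
    }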
TinkersW 34 days ago [-]
A real branch is useful if you can realistically skip a bunch of work, but this requires all the lanes to agree; on a GPU that means 32 to 64 lanes need to all agree, and for something basic like a few arithmetic ops there is no point.
CountHackulus 33 days ago [-]
I love seeing the codegen output, makes it easy to understand the issue, but claiming that it's faster or slower without actual benchmarks is a bit disappointing.
leeoniya 33 days ago [-]
this.
why waste brain cells on theory when you should simply bench both versions and validate without buying into any kind of micro-optimization advice at face value.
ryao 34 days ago [-]
Do shader compilers have optimization passes to undo this mistake and if not, could they be added?
DRAGONERO 34 days ago [-]
I’d expect most vendors do, at least in their closed source drivers.
You could also check in the mesa project if this is implemented but it’s definitely possible to do
ryao 33 days ago [-]
Shader compilers tend to be very latency sensitive, so “it takes too long to run” would be a valid reason why it is not done if it is not done.
DRAGONERO 33 days ago [-]
Shader compilers mostly use LLVM even though runtime is a constraint. If the pattern is common enough it’s definitely easy to match (it’s just two intrinsics after all), meaning you can do it cheaply in instcombine, which you’re going to be running anyway.
ryao 33 days ago [-]
For some reason, I feel like this is harder to implement than you expect. The way to find out would be to get a bunch of examples of people doing this “optimization” in shader code, look at the generated IR compared to the optimal version, and figure out a set of rules to detect the bad versions and transform them into good versions. Keep in mind that in the example, the addition operators could be replaced with logical OR operators, so there are definitely multiple variations that need to be detected and corrected.
DRAGONERO 33 days ago [-]
I've checked and on "certain vendors" the mix + step is actually (slightly) better: same temp usage, lower instructions/cycles.
mgaunard 33 days ago [-]
"of course real branches happen in GPU code"
My understanding was that they don't. All executions inside a "branch" always get executed, they're simply predicated to do nothing if the condition to enter is not true.
ack_complete 33 days ago [-]
That's only if execution is incoherent. If all threads in a warp follow the branch the same way, then all of the instructions in the not taken branch are skipped.
lerp-io 32 days ago [-]
I've been doing this from day 1 because I just assumed you are not supposed to have loops or if/else blocks in your shader code. Now I know better, thanks iq ur g.
arbitrandomuser 33 days ago [-]
What are the AMD and Microsoft shader compilers, and how do I generate and inspect these intermediate codes on my computer?
> For the record, of course real branches do happen in GPU code
Well, for some definition of "real". There are hardware features (on some architectures) that implement semantics that evaluate the same way that "branched" scalar code would. There is no branching at the instruction level, and there can't be on SIMD (because the other parallel shaders being evaluated by the same instructions might not have taken the same branch!)
account42 33 days ago [-]
There is real branching on the hardware level, but yes it needs to take the same branch for the whole workgroup and anything else needs to be "faked" in some form.
ajross 32 days ago [-]
Yeah, exactly: I argue that a global backwards-only branch used to implement loops by checking all lanes for an "end" state is not actually a "real" branch.
It's a semantic argument, but IMHO an important one. Way, way too many users of GPUs don't understand how the code generation actually works, leading to articles like this one. Ambiguous use of terms like "branch" are the problem.
leguminous 32 days ago [-]
Using `mix()` isn't necessarily bad. Using boolean operations like `lessThan()` is probably better than `step()`. I just tested two ways of converting linear RGB to sRGB. On AMD, they compile to the same assembly.
It is weird how long misinformation like this sticks around; the conditional move/select approach has been superior for decades on both CPU & GPU, but somehow some people still write the other approach as an "optimization".
Lockal 33 days ago [-]
"Conditional move is superior on CPU" is an oversimplification, in reality it was explained in 2007 by Linus[1] and nothing changed since then. Or in fact, branch predictors are constantly improving[2], while cmov data dependency problem can't be solved. Yes, cmov is better in unpredictable branches, but the universal truth is "programmers are notoriously bad at predicting how their programs actually perform"[3]
I really only care about performance in a SIMD sense, as not using SIMD means you aren't targeting performance to begin with (GPU == SIMD, CPU SSE/AVX == SIMD).
Branches fall off vs cmov/select as your lane count increases; they are still useful, but only when you are certain the probability of all lanes agreeing is reasonably high.
account42 33 days ago [-]
> cmov data dependency problem can't be solved
Can't is a pretty strong word.
flowzai4 34 days ago [-]
[dead]
yahya_6666608 33 days ago [-]
[flagged]
torginus 33 days ago [-]
I'm not going to second guess IQ, who is one of the greatest modern authorities on shaders, but I do have some counterarguments.
- Due to how SIMD works, it's quite likely both paths of the conditional statement get executed, so it's a wash
- Most importantly, if statements look nasty on the screen. Having an if statement means a discontinuity in visuals, which means jagged and ugly pixels on the output. Of course having a step function doesn't change this, but that means the code is already in the correct form to replace it with smoothstep, which means you can interpolate between the two variations, which does look good.
torginus 33 days ago [-]
Why does this keep getting downvoted? This is fundamentally true, and good advice borne of experience. Would somebody at least care to weigh in as to why they disagree?
grg0 32 days ago [-]
I am not 100% sure nor did I downvote, but it doesn't look like a counter-argument at all.
> Due to how SIMD works, it's quite likely both paths of the conditional statement get executed, so its a wash
It's not just quite likely, it's what IQ is showing in the disassembly. For a ternary op like this one with trivial expressions on each side, the GPU evals both and then masks the result given the result of the condition.
> a step function doesnt change this, but that means the code is already in the correct form to replace it with smoothstep, which means you can interpolate between the two variations, which does look good.
A smoothstep does smooth interpolation of two values. It seems unrelated to the issue in the post. step() relates to the ternary op in the sense that both can be used to express conditionals. The post explains why you wouldn't necessarily want to use step() vs ternary op. smoothstep is related to step in some sense, but not in a way that relates to the article? i.e., going from step() to smoothstep() will entirely change the semantics of the program precisely because of the 'smooth' part.
torginus 32 days ago [-]
Step vs ternary op vs if statement are equivalent - you are correct in that. What I'm saying is not about optimization or assembly - optimization doesn't matter when the end result looks bad.
What I'm saying is that no matter how you express it in code, abrupt transitions of values introduce aliasing (or 'edge shimmer'), which looks unpleasant. The way you get rid of it is by smoothly blending between 2 values, with smoothstep for example.
"The second wrong thing with the supposedly optimizer [sic] version is that it actually runs much slower than the original version [...] wasting two multiplications and one or two additions. [...] But don't take my word for it, let's look at the generated machine code for the relevant part of the shader"
—then proceeds to show only one codegen: the one containing no multiplications or additions. That proves the good version is fine; it doesn't yet prove the bad version is worse.
Showing the other generated version would only show that it's longer. It is not expected to have a branch either. So I don't think it would have added much value
(I don't know enough about GPU compilers to say whether they implement such an optimization, but if step() abuse is as popular as the post says, then they probably should.)
He's writing an essay on why they are wrong.
"But here's the problem - when seeing code like this, somebody somewhere will invariably propose the following "optimization", which replaces what they believe (erroneously) are "conditional branches" by arithmetical operations."
Hence his branchless codegen samples are sufficient.
Further, regarding.the side-issue "The second wrong thing with the supposedly optimizer [sic] version is that it actually runs much slower", no amount of codegen is going to show lower /speed/.
https://shader-playground.timjones.io/5d3ece620f45091678dcee...
I do like that the most obvious v = x > y ? a : b; actually works, but it's also concerning that we have syntax where an if is sometimes a branch and sometimes not. In a context where you really can't branch, you'd almost like branch-if and non-branching-if to be different keywords. The non-branching one would fail compilation if the compiler couldn't do it without branching. The branching one would warn if it could be done with branching.
And the reason for that is the confusing documentation from NVidia and its cg/CUDA compilers. I believe they did not want to scare programmers at first and hid the execution model, talking about "threads" and then they kept using that abstraction to hype up their GPUs ("it has 100500 CUDA threads!"). The result is people coding for GPUs with some bizarre superstitions though.
You actually want branches in the code. Those are quick. The problem is that you cannot have a branch off a SIMD way, so instead of a branch the compiler will emit code for both branches and the results will be masked out based on the branch's condition.
So, to answer your question - any computation based on shader inputs (vertices, compute shader indices and what not) cannot and won't branch. It will all be executed sequentially with masking. Even in the TFA example, both values of the ? operator are computed; the same happens with any conditional on a SIMD value. There can be shortcut branches emitted by the compiler to quickly bypass computations when all ways are the same value, but in the general case everything will be computed for the condition being true as well as being false.
Only conditionals based on scalar registers (shader constants/uniform values) will generate branches, and those are super quick.
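As a toy illustration of that masking model in plain C (a scalar simulation of what the hardware does per lane, purely illustrative, not real GPU code):

    #define LANES 8

    void per_lane_select(const float x[LANES], const float y[LANES], float out[LANES])
    {
        for (int lane = 0; lane < LANES; lane++) {
            int take_a = x[lane] > y[lane];   /* per-lane condition */
            float a = x[lane] * 2.0f;         /* both sides always computed */
            float b = y[lane] + 1.0f;
            out[lane] = take_a ? a : b;       /* result masked per lane */
        }
    }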
It can do an actual branch if the condition ends up the same for the entire workgroup - or to be even more pedantic, for the part of the workgroup that is still alive.
You can also check that explicitly to e.g. take a faster special case branch if possible for the entire workgroup and otherwise a slower general case branch but also for the entire workgroup instead of doing both and then selecting.
This reminds me that people who believe that GPU is not capable of branches do stupid things like writing multiple shaders instead of branching off a shader constant e.g. you have some special mode, say x-ray vision, in a game and instead of doing a branch in your materials, you write an alternative version of every shader.
(I'm not a graphics programmer, mind you, so please correct any misunderstandings on my end)
What you want is to use smoothstep which blends a bit between these two values and for that you need to compute both paths anyway.
The observation relates to pixel shaders, and even within that, it relates to values that vary based on pixel-level data. In these cases having if statements without any sort of interpolation introduces aliasing, which tends to look very noticeable.
Now you might be fine with that, or have some way of masking it, so it might be fine in your use case, but most in the most common, naive case the issue does show up.
Nobody ever talked about clamping - and it's not even relevant to the discussion as it doesn't introduce a discontinuity that can cause aliasing.
What I'm referring to is shader aliasing, which MSAA does nothing about - MSAA is for geometry aliasing.
To illustrate what I'm talking about, here's an example that draws a red circle on a quad:
The bad version:
The good version:
The first version has a hard boundary for the circle which has an ugly aliased and pixelated contour, while the latter version smooths it. This example might not be egregious, but this can and does show up in some circumstances.
That's true on scalar CPUs too though. The CMOV instruction arrived with the P6 core in 1995, for example. Branches are expensive everywhere, even in scalar architectures, and compilers do their best to figure out when they should use an alternative strategy. And sometimes get it wrong, but not very often.
cmov also has data dependencies on all three inputs, so if the branch is heavily biased and the unlikely input has a much higher latency than the likely one, a cmov can cost a fair amount of waiting.
Finally, cmov was absolutely terrible on the P4 (10-ish cycles), and it's likely that a lot of its lore dates back to that.
The GPU will almost always execute f and g due to GPU differences vs CPU.
You can avoid the f vs g if you can ensure a scalar Boolean / if statement that is consistent across the warp. So it's not 'always' but requires incredibly specific coding patterns to 'force' the optimizer + GPU compiler into making the branch.
You could also have some fun stuff where f and g return a boolean, because thanks to short-circuit evaluation, && and || are actually also conditionals in disguise.
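A tiny C illustration of that point (assuming nothing beyond standard short-circuit semantics):

    /* The right-hand side is only evaluated when the left-hand side allows it,
       so && hides a conditional: *p must not be read when p is NULL. */
    int guarded(const int *p)
    {
        return p != NULL && *p > 0;
    }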
> The reason people do potentially more expensive mix/lerps is because while it might cost a tiny overhead, they are scared of making it a branch.
This is a problem, though. People shouldn't do things potentially, they should look at the actual code that is generated and executed.
> it's also concerning that we have syntax where an if is some times a branch and some times not.
It would be more concerning if we didn't. We might get a branch on one GPU and none on another.
The best way is to profile the code. Time is what we are after, so measure that.
Same story for bit extraction and other integer ops - we used to emulate them with float math because it was faster, but now every GPU has fast integer ops.
Is that true and to what extent? Looking at the ISA for RDNA2[0] for instance - which is the architecture of both PS5 and Xbox Series S|X - all I can find is 32-bit scalar instructions for integers.
[0] https://www.amd.com/content/dam/amd/en/documents/radeon-tech...
https://docs.nvidia.com/cuda/parallel-thread-execution/index...
https://docs.nvidia.com/cuda/parallel-thread-execution/index...
It needs to support 64-bit integer arithmetic for handling 64-bit address calculations efficiently. The SASS ISA since Volta has explicit 32I suffixed integer instructions alongside the regular integer instructions, so I would expect the regular instructions to be 64-bit, although the documentation leaves something to be desired:
https://docs.nvidia.com/cuda/cuda-binary-utilities/index.htm...
(I've been looking into the specifics of CUDA integer arithmetic for some time now, since I've had the mad idea of doing 'horizontal' 448-bit integer arithmetic by storing one word in each thread and using the warp-shuffle instructions to send carries up and down. Given that the underlying arithmetic is all 32-bit, it doesn't make any sense to store more than 31 bits per thread. Then again, I don't know whether this mad idea makes any sense in the first place, until I implement and profile it.)
[Update: I remembered and double checked. While there are only scalar 32-bit integer instructions you can use 24-bit integer vector instructions. Essentially ignoring the exponent part of the floats.]
https://docs.nvidia.com/cuda/parallel-thread-execution/index...
Note that I am using the Nvidia PTX documentation here. I have barely looked at the AMD RDNA documentation, so I cannot cite it without doing a bunch of reading.
1. Scalar - run once per thread group, only acting on shared memory. So these won't be SIMD.
2. Vector - run across all threads, each threads accesses its own copy of the variables. This is what you typically think of GPU instructions doing.
[0] https://www.amd.com/content/dam/amd/en/documents/radeon-tech..., p. 259, Table 83, "VOP3A Opcodes"
[1] https://www.amd.com/content/dam/amd/en/documents/radeon-tech..., p. 160, Table 85, "VOP3 Opcodes"
“If you consult the internet about writing a branch of a GPU, you might think they open the gates of hell and let demons in. They will say you should avoid them at all costs, and that you can avoid them by using the ternary operator or step() and other silly math tricks. Most of this advice is outdated at best, or just plain wrong.
Let’s correct that.”
As I've mentioned here several times before, I've made code significantly faster by removing the hand-rolled assembly and replacing it with plain C or similar. While the assembly might have been faster a decade or two ago, things have changed...
I think it'd be possible in principle, because most APIs (D3D, GL, Vulkan etc) expose performance counters (which may or may not be reliable depending on the vendor), and you could in principle construct a representative test scene that you replay a couple times to measure different optimizations. But a lot of games are quite dynamic, having dynamically generated scenes and also dynamically generated shaders, so the number of combinations you might have to test seems like an obstacle. Also you might have to ask the user to spend time waiting on the benchmark to finish.
You could probably just do this ahead of time with a bunch of different GPU generations from each vendor if you have the hardware, and then hard-code the most important decision. So not saying it'd be impossible, but yeah I'm not aware of any existing infrastructure for this.
The important thing to note is that you can do that computation just once, like when you install the game, and it isn't that slow. Your parameterization won't be perfect, but it's not bad to create routines that are much faster than any one implementation on nearly every architecture.
That will break code sufficiently reliant on the behaviour of single precision, though.
You meant "in the case we find where that does happen"?
Little consolation for the cases you don't find because e.g. you cannot afford to play a game to the end.
If it happens to work with the Nvidia driver it's getting shipped.
Unless every dev and tester that touches this shader is using the same hardware, which seems like an obvious mistake to avoid...
1) Nvidia allows you to write to read only textures, game devs will forget to transition them to writable and will appear as corruption on other cards.
2) Nvidia automatically work with diverging texture reads, so devs will forget to mark them as a nonuniform resource index, which shows up as corruption on other cards.
3) Floating point calculations aren't IEEE compliant. One bug I fixed was x/width*width != x: on Nvidia this ends up a little higher and on our cards a little lower. The game this happened on ended up flooring that value and doing a texture read, which, as you can guess, showed up as corruption on our cards.
1 and 2 are specifically required by the microsoft directx 12 spec, but most game devs aren't reading that and bugs creep in. 3 is a difference in how the ALU is designed, our cards being a little closer to IEEE compliant. A lot of these issue are related to how the hardware works, so stays pretty consistent between the different gpus of a manufacturer.
Side note: I don't blame the devs for #3, the corruption was super minor and the full calculation was spread across multiple functions (assumed by reading the dxil). The only reason it sticks out in my brain though is because the game devs were legally unable to ever update the game again, so I had to fix it driver side. That game was also Nvidia sponsored, so it's likely our cards weren't tested till very late into the development. (I got the ticket a week before the game was to release.) That is all I'm willing to say on that, I don't want to get myself in trouble.
Too late to edit, but I want to half retract this statement: they are IEEE compliant, but due to optimizations that can be applied by the driver developers they aren't guaranteed to be. This is assuming that the accuracy of a multiply and divide is specified in the IEEE floating point spec; I'm seeing hints that it is, but I can't find anything concrete.
Could they not intercept the calls to inject a fix?
Not any.
> will it work the exact same way across different devices
Yes, where they run IEEE FP format.
Just about any. It's pretty difficult to write code where changing the rounding of the last couple bits breaks it (as happens if you use wider types during the calculation), but other changes don't break it.
What real code have you seen with that behavior?
Why should that be a requirement? Obviously the driver can make other changes that can break any code.
Originally I said "the premise is that any difference breaks the code".
You replied with "Not any."
That is where the requirement comes from, your own words. This is your scenario, and you said not all differences would break the hypothetical code.
This is your choice. Are we talking about code where any change breaks it (like a seeded/reproducible RNG), or are we talking about code where there are minor changes that don't break it but using extra precision breaks it? (I expect this category to be super duper rare)
> You replied with "Not any."
> That is where the requirement comes from, your own words.
I wasn't stating a requirement. I was disputing your report of the premise.
> or are we talking about code where there are minor changes that don't break it but using extra precision breaks it?
Yup.
> Yup
Then I will ask again, have you ever seen code that falls into this category? I haven't. And an RNG would not fall into this category.
In my opinion, floating point shaders should be treated as a land of approximations.
You asked in another comment why /width*width isn't optimized out by the compiler. But it's changes just like that that will break an RNG!
Fine, but that leaves you responsible for breaking the shader of an author who holds the opposite opinion, as he is entitled to do. Precision =/= accuracy.
I do. But I don't see such optimisation as having anything to do with your changing the float type.
What benefit would that give? Is double precision faster than single on modern hardware?
This is all kind of on the assumption that the accuracy of floating point multiplication and division is in the IEEE spec, I was told before that it was but searching now I can't seem to find it one way or the other.
I believe one of the optimizations done by nvidia is to drop f32 variables down to f16 in a shader. Which would technically break the accuracy requirement (as before if it exists). I don't have anything I can offer as proof of that due to NDA sadly though. I will note that most of my testing and work is done in PIX for Windows, and most don't have anti-cheat so they're easy to capture.
FWIW, I’ve never heard of shader replacement to force doubles. It’d be interesting to hear when that’s been used and why, and surprising to me if it was ever done for a popular game.
Does the studio pay them to do it? Because Nvidia wouldn't care otherwise?
Does Nvidia do it unasked, for competitive reasons? To maximize how much faster their GPU's perform than competitors' on the same games? And therefore decide purely by game popularity?
Or is it some kinda of alliance thing between Nvidia and studios, in exchange for something like the studios optimizing for Nvidia in the first place, to further benefit Nvidia's competitive lead?
In general if our logo is in the game, we helped them by actually writing code for them, if it's not then we might have only given them directions on how to fix issues in their game or put something in the driver to tweak how things execute. From an outside perspective (but still inside on the gpu space) nvidia does give advice to keep their competitive advantage. In my experience so far ignoring barriers that are needed as per the spec, defaulting to massive numbers when the gpu isn't known ("batman and tessellation" should be enough to find that), and then doing out right weird stuff that doesn't look like something any sane person would do in writing shaders (I have a thought in my head for that one, but it's not considered public knowledge. )
It's basically an arms race. This is also the reason why graphics drivers for Windows are so frigging big (also AFAIK).
I think this is very accurate. The exception is probably those block buster games. Those probably get direct consultancy from NVIDIA during the development to make them NVIDIA-ready from day 1.
And AFAIK Proton does things like this too, but for different reasons (fixing games that don't adhere to the D3D API documentation and/or obviously ignored D3D validation layer messages).
Improving your compiler for everybody's code is one thing.
But saying, if the shader that comes in is exactly this code from this specific game, then use this specific precompiled binary, or even just apply these specific hand-tuned optimizations that aren't normally applied, that does seem pretty crazy to me.
But I don't know which it is?
What could possibly go wrong? :)
https://github.com/CaffeineMC/sodium/issues/1486
https://github.com/CaffeineMC/sodium/issues/1816
https://web.archive.org/web/20230819072628/https://techrepor...
[1] https://www.neowin.net/news/yandex-alleges-amds-windows-driv...
The decision on what game testing is proper surely lies with the game dev, not the card dev.
Plus game devs generally know that non-testing cannot cause a game to run like crap. Card devs ought to know that too.
Card devs also ought to respect that a game dev may have a good business reason for leaving his game running like crap on certain cards.
> Card devs also ought to respect that a game dev may have a good business reason for leaving his game running like crap on certain cards.
I can't agree with this though; business decisions getting in the way of gamers enjoying games should never be something that is settled for. If the hardware is literally too old and you just can't get it to work, fine, but when a top of the line card is running below 60 fps that's just negligence.
That's between the game dev and his customer. None of a card dev's business.
... and looks 40% crappier? E.g. stuttery, because the driver does not get to see the code ahead of time.
We also don't have an infinite amount of time to work on each shader. You profile on the hardware you care about, and if the choice you've made is slower on some imaginary future processor, so be it - hopefully that processor is faster enough that this doesn't matter.
The article is not claiming that conditional branches are free. In fact, the article is not making any point about the performance cost of branching code, as far as I can tell.
The article is pointing out that conditional logic in the form presented does not get compiled into conditionally branching code. And that people should not continue to propagate the harmful advice to cover up every conditional thing in sight[0].
Finally, on actually branching code: that branching code is more complicated to execute is self-evident. There are no free branches. Avoiding branches is likely (within reason) to make any code run faster. Luckily[1], the original code was already branchless. As always, there is no universal metric to say whether optimisation is worthwhile.
[0] the "in sight" is important -- there's no interest in the generated code, just in the source code not appearing to include conditional anythings.
[1] No luck involved, of course ... (I assume people wrote to IQ to suggest apparently glaringly obvious (and wrong) improvements to their shader code, lol)
Surely it understands "step()" and can optimize the "step()==0.0" and "step()==1.0" cases separately?
This is presumably always worth it, because you would at least remove one multiplication (usually turning it into a conditional load/store/something else)
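In C-like pseudocode, the fold being suggested might look like this (a sketch only; step(edge, x) is 0.0 when x < edge and 1.0 otherwise):

    /* before: a * step(edge, x)  ==  a * (x >= edge ? 1.0f : 0.0f) */
    /* after:  the multiply by 0/1 folds into the select itself     */
    float mul_by_step_folded(float a, float edge, float x)
    {
        return (x >= edge) ? a : 0.0f;
    }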
I do that quite often with my HLSL shaders, learned a lot about that virtual instruction set. For example, it’s interesting GPUs have instruction sincos, but inverse trigonometry is emulated while compiling.
Because you care about performance? step being implemented as a libray function on top of a conditional doesn't really say anything about its performance vs being a dedicated instruction. Don't worry about the implementation.
Because you are curious about GPU architectures? Look at disassembly, (open source) driver code (including LLVM) and/or ISA documentation.
Also, specs like OpenGL specify much of the intrinsics' behavior, which is then implemented per the spec, using standard assembly instructions.
Find an online site that decompiles to various architectures.
You generally shouldn’t know or care how a built-in is implemented. If you do care, you’re probably thinking about optimization. At that point the answer is “measure and find out what works better.”
> How are we supposed to know what OpenGL functions are emulated rather than calling GPU primitives?
To me the problem was obvious, but then again I'm having trouble with both your and the author's statements about it.
The problem I saw was, obviously by going for a step() function, people aren't turning logic into arithmetic, they're just hiding logic in a library function call. Just because step() is a built-in or something you'd find used in a mathematical paper doesn't mean anything; the definition of step() in mathematics is literally a conditional too.
Now, the way to optimize it properly to have no conditionals, is you have to take a continuous function that resembles your desired outcome (which in the problem in question isn't step() but the thing it was used for!), and tune its parameters to get as close as it can to your target. I.e. typically you'd pick some polynomial and run the standard iterative approximation on it. Then you'd just have an f(x) that has no branching, just a bunch of extra additions and multiplications and some "weirdly specific" constants.
Where I don't get the author is in insisting that conditional move isn't "branching". I don't see how that would be, except in some special cases where the lack of branching is a well-known but very special implementation detail - like where the author says:
> also note that the abs() call does not become a GPU instruction and instead becomes an instruction modifier, which is free.
That's because the float representation (IEEE-754) was straight up designed so the sign lives in the most significant bit, so float abs() boils down to unconditionally setting that bit to 0 - or, equivalently, masking it off for the instruction that's reading it. (Two's complement ints also keep the sign in the most significant bit, though integer abs() needs a branchless conditional negate rather than a single bit clear.)
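For the float case, a minimal C sketch (assuming 32-bit IEEE-754 floats):

    #include <stdint.h>
    #include <string.h>

    float abs_by_mask(float x)
    {
        uint32_t bits;
        memcpy(&bits, &x, sizeof bits);   /* reinterpret the float's bits */
        bits &= 0x7FFFFFFFu;              /* clear the sign (most significant) bit */
        memcpy(&x, &bits, sizeof bits);
        return x;
    }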
step() isn't like that, nor any other arbitrary ternary operation construct, and nor is - as far as I know - a conditional move instruction.
As for where I don't get 'ttoinou:
> How are we supposed to know what OpenGL functions are emulated rather than calling GPU primitives
The basics like abs() and sqrt() and basic trigonometry are standard knowledge, the rest... does it even matter? step() obviously has to branch somewhere; whether you do it yourself, let a library do it, or let the hardware do it, shouldn't change the fundamental nature.
A conditional jump is a branch. But a branch has always had a different meaning than a generic “conditional”. There are conditional instructions that don’t jump, e.g. CMP, and the distinction is very important. Branch or conditional jump means the PC can be set to something other than ‘next instruction’. A conditional, such a conditional select or conditional move, one that doesn’t change the PC, is not a branch.
> take a continuous function […] Then you’d just have an f(x) that has no branching
One can easily implement conditional functions without branching. You can use a compare instruction followed by a Heaviside function on the result, evaluate both sides of the result, and sum it up with a 2D dot product (against the compare result and its negation). That is occasionally (but certainly not always) faster on a GPU than using if/else, but only if the compiler is otherwise going to produce real branch instructions.
But in this case, would calculating both sides and then conditionally setting the result not perform the same amount of work? Whether you calculate both results yourself or the core masks the instructions out, it's executing instructions for both sides of the branch either way, right?
On a CPU, the performance killers are often branch misprediction and caches, but isn't the GPU itself executing a mostly linear set of instructions - or is my understanding completely off? I guess I don't really understand what it's doing, especially for loops.
Not all GPU branches are compiled in a straight line without jumps, so branching on a GPU does sometimes share the same instruction cache churn that the CPU has. That might be less of a big deal than thread masking, but GPU stalls still take dozens of cycles. And GPUs waiting on memory loads, whether it’s to fill the icache or anything else, are up to 32x more costly than CPU stalls, since all threads in the warp stall.
Loops are just normal branches, if they're not unrolled. The biggest question with a loop is whether all threads repeat the same number of times, because if not, the threads that exit the loop early have to wait until the last thread is done. You can imagine what that does for perf if there's a small number of long-tail threads.
s/conditional is a/conditional jump is a/
Problem solved.
Non-jump conditionals have been a thing for decades.
LLMs just repeat what people on the internet say, and people are often wrong.
It likely compiles down the same way on the relevant platforms as in the original article.
x > c ? y : 0.;
It annoyed me many times and it still does.
(Compilers obviously do this transformation, including GCC, but it is not always beneficial, especially on x86-64.)
[0] https://clang.llvm.org/docs/LanguageExtensions.html#builtin-...
Also, that isn't actually equivalent, since `x` needs to be all 1s or all 0s, surely? Neither GCC nor Clang uses that method, but they do use Zicond.
Indeed you may need to negate `x` if you have only the LSB set in it; hence "3-4 instrs ... depending on the format you have the condition in" in my original message.
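For example, a sketch of that negate-and-mask baseline in C (assuming the condition arrives as 0 or 1 in the low bit):

    #include <stdint.h>

    /* Negate the 0/1 condition to get an all-zeros / all-ones mask,
       then select without any branch or conditional-move instruction. */
    uint64_t select_by_mask(uint64_t cond, uint64_t a, uint64_t b) {
        uint64_t mask = -(cond & 1);       /* 0x000...0 or 0xFFF...F */
        return (a & mask) | (b & ~mask);
    }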
I assume gcc & clang just haven't bothered considering the branchless baseline impl, rather than it being particularly bad.
Note that there's another way some RISC-V hardware supports doing branchless conditional stores - a jump over a move instr (or in some cases, even some arithmetic instructions), which they internally convert to a branchless update.
https://www.godbolt.org/z/ffEvvjhz8
PS: and it also doesn't matter whether a ternary is used or a traditional if (as one would expect):
https://www.godbolt.org/z/zjb4KdqvK
(the float version also appears to not use branches: https://www.godbolt.org/z/98bdheKK4)
For such simple expression I would expect the compiler to pick the right output pattern based on the target CPU though...
Unfortunately, the heuristic that estimates the cost often gets things wrong. That is why OpenZFS passes -mllvm -x86-cmov-converter=false to Clang for certain files where the LLVM heuristic was found to do the wrong thing:
https://github.com/openzfs/zfs/commit/677c6f8457943fe5b56d7a...
There is an open LLVM issue regarding this:
https://github.com/llvm/llvm-project/issues/62790
The issue explains why __builtin_unpredictable() does not address the problem. In short, the metadata is dropped when an intermediate representation is generated inside clang since the IR does not have a way to preserve the information.
https://github.com/riscv/riscv-bitmanip/issues/185
I do not know offhand if any x86 CPUs recognize the implicit cmov idiom, but if any do, then even though an extra instruction is used, the conditional move would still be performed on those that recognize the idiom.
By the way, I just noticed a case where you really don't want the compiler to generate a cmov, explicit or otherwise, since it would risk division by zero:
https://github.com/openzfs/zfs/commit/f47f6a055d0c282593fe70...
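The shape of the hazard, as a hedged C sketch (not the actual OpenZFS code):

    #include <stdint.h>

    /* If the compiler converts this ternary into a cmov, the division is
       executed unconditionally and traps when d == 0; an actual branch
       skips the division entirely. */
    uint64_t safe_ratio(uint64_t n, uint64_t d) {
        return d != 0 ? n / d : 0;
    }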
Here is a godbolt link showing some output:
https://www.godbolt.org/z/4daKTKqfr
Interestingly, Clang correctly does not generate a cmov (implicit or explicit) for the outer ternary operation, while it does generate an explicit cmov for the inner ternary operator in MIN() without -mllvm -x86-cmov-converter=false. Passing -mllvm -x86-cmov-converter=false to Clang does not change the output, which makes Clang’s behavior correct.
GCC will not generate a cmov for either ternary operator, which, while also technically correct, is slow. This could still have been an implicit conditional move had GCC not avoided the implicit cmov idiom.
Using GCC's __builtin_expect_with_probability() in MIN() does not cause GCC to change its output. If we remove the outer ternary, GCC will happily generate a cmov instruction. Given that GCC generally assumes undefined behavior is not invoked in order to make code faster, and will happily generate the cmov when there is a division-by-0 bug, it is odd that, upon seeing a check verifying that the assumption GCC made is true, GCC decides to stop generating a cmov. I am sure the way GCC does things is much more complicated than my interpretation of the output, but the behavior is odd enough to merit a comment.
I find the RISC-V solution (which fwiw I mentioned in a sibling thread[0]) rather sad; there's no way to check whether it's implemented, and even where it is, I could imagine it being problematic (i.e. if the instructions cross a fetch block or cacheline or something and it gets run as a branch, or some instrs around it break the fusion pattern checking), and where it's unsupported or otherwise doesn't work properly it'll "work" but be horrifically slow.
fwiw I haven't ever seen __builtin_expect_with_probability actually do anything for unpredictable branches; I just included it in my compiler explorer link for completeness.
Using a version of MIN that caches the X/Y computations gets gcc to produce a cmov, but makes clang's output longer: https://www.godbolt.org/z/6h8obxKG8
[0]: https://news.ycombinator.com/item?id=42992533
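For reference, a sketch of the kind of operand-caching MIN meant above (GNU C statement-expression style; the name is mine):

    /* Evaluate each operand once, then compare the cached copies; this is
       the shape that nudged gcc into emitting a cmov in the link above. */
    #define MIN_CACHED(a, b) ({           \
            __typeof__(a) _a = (a);       \
            __typeof__(b) _b = (b);       \
            _a < _b ? _a : _b;            \
        })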
If AMD did not implement this in Zen 5, maybe we could ask them to add it in Zen 7 or 8. I assume it would be too late to ask them to add this in Zen 6.
Thanks for the caching tip.
I don't think there's any need for x86 cores to try to handle this; it's just a waste of silicon for something doable in one instruction anyway (I'd imagine that additionally instruction fusion is a pretty hot path, especially with jumps involved; and you'll get into situations of conflicting fusions as currently cmp+jcc is fused, so there's the question of whether cmp+jcc+mov becomes (cmp+jcc)+mov or cmp+(jcc+mov), or if you have a massive three-instruction four-input(?) fusion).
Oh, another thing I don't like about fusing condjump+mov: it makes it significantly less trivial to intentionally use branches on known-predictable conditions in order to avoid a data dependency on both sides.
I was afraid the answer to my question would be that, but since my read of your previous comment “there's way to check whether it's implemented” seemed to suggest you knew a way I did not, I had my fingers crossed. At least, it had been either that you knew a trick I did not, or that a typo had deleted the word “no”.
> I don't think there's any need for x86 cores to try to handle this; it's just a waste of silicon for something doable in one instruction anyway (I'd imagine that additionally instruction fusion is a pretty hot path, especially with jumps involved; and you'll get into situations of conflicting fusions as currently cmp+jcc is fused, so there's the question of whether cmp+jcc+mov becomes (cmp+jcc)+mov or cmp+(jcc+mov), or if you have a massive three-instruction four-input(?) fusion).
Interestingly, the RISC-V guys seem to think that adding an explicit instruction is a waste of silicon while adding logic to detect the idiom in the instruction decoder is the way to go. x86 cores spend enormous amounts of silicon on situational tricks to make code run faster. I doubt spending silicon on one more trick would be terrible, especially since a number of the other tricks used to extract more performance likely apply to even more obscure situations. As for what happens in the x86 core, the instruction decoder would presumably emit what it emits for the explicit version when it sees the implicit version. I have no idea what that is inside an x86 core. I suspect that there are some corner cases involving the mov instruction causing a fault to handle (as you would want the CPU to report that the mov triggered the fault, not the jmp), but it seems doable given that they already had to handle instruction faults in other cases of fusion.
Also, if either of us were sufficiently motivated, we might be able to get GCC to generate better code through a plugin that will detect the implicit cmov idiom and replace it with an explicit cmov:
https://gcc.gnu.org/onlinedocs/gccint/Plugins.html
A similar plugin likely could be written for LLVM:
https://llvm.org/docs/WritingAnLLVMNewPMPass.html#registerin...
Note that I have not confirmed whether their plugins are able to hook the compiler backend where they would need to hook to do this.
Of course, such plugins won’t do anything for all of the existing binaries that have the implicit idiom or any new binaries built without the plugins, but they could at least raise awareness of the issue. It is not a full solution since compilers don’t emit the implicit cmov idiom in all cases where a cmov would be beneficial, but it would at least address the cases where they do.
Whoops, typo! edited.
> Interestingly, the RISC-V guys seem to think that adding an explicit instruction is a waste of silicon while adding this to the instruction decoder is the way to go
From what I've read, the thing they're against (or at least a major blocker) is having a standard GPR instruction that takes 3 operands, as all current GPR instrs take a max of two. I cannot imagine any way that fusing instructions costs less silicon than a new instruction; if anything, it'd be not wanting to waste opcode space, or being fine with the branchy version (which I'm not).
Zen 4, at least as per Agner's microarchitecture optimization guide, only fuses nops and cmp/test/basic_arith+jcc; not that many tricks, only quite necessary ones (nops being present in code alignment, and branches, well, being basically mandatory every couple instructions).
No need for a plugin; it is possible to achieve branchless moves on both as-is: https://www.godbolt.org/z/eojqMseqs. A plugin wouldn't be any more stable than that mess. (Also, huh, __builtin_expect_with_probability actually helped there!)
I'd imagine a major problem for the basic impls is that the compiler may early on lose the info that the load can be run in both cases, at which point doing it unconditionally would be an incorrect transformation.
A plugin would handle cases where the implicit idiom is emitted without the developer explicitly trying to force this. As far as I know, most people never touch conditional moves on the CPU, and the few that do (myself included) only bother for extremely hot code paths, which leaves some low-hanging fruit on the table, particularly when the compiler emits the implicit version by coincidence. The safety of the transformation as a last pass in the compiler backend is not an issue, since the output would be no more buggy than it previously was (as both branches are already calculated). Trying to handle all cases (the non-low-hanging fruit) is where you have to worry about incorrect transformations.
On fusion, https://dougallj.github.io/applecpu/firestorm.html mentions ones that Apple's M1 does - arith+branch, and very specialized stuff.
https://doliveira4.github.io/gpuconditionals/
(no warranty)
Now I'll have to change my ways for fear of being rejected socially for this newly approved "bad practice".
At least in WebGPU's WGSL we have the `select` built-in, which does that ternary operation hidden behind a function call, so there is that.
The unfortunate truth with shaders is that they are compiled on the user's machine at the point of use, so compiling it on just your machine isn't nearly good enough. NVIDIA pricing means large numbers of customers are running 10-year-old hardware. Depending on the target market, you might even want the code to run on 10-year-old integrated graphics.
Does 10-year-old integrated graphics, across the range of drivers people actually have running, prefer conditional moves over more arithmetic ops? Probably, but I would want to keep both versions around and test on real user hardware if this shader were used a lot.
https://www.shadertoy.com/user/iq
Like does this apply if one of the two branches of a conditional is computationally much more expensive? My (very shallow) understanding was that having, eg, a return statement on one branch and a bunch of work on the other would hamstring the GPU’s ability to optimize execution.
If you have two branches, and one is trivial while the other is expensive, and if the compiler doesn’t optimize away the branch already, it may be better for performance to write the code to take both branches unconditionally, and use a conditional assignment at the end.
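A minimal C-style sketch of that shape (purely illustrative; the "expensive" path here is a stand-in):

    /* Compute both sides unconditionally, then do a conditional assignment
       at the end instead of branching between them. */
    float shade(float x, int use_fancy) {
        float cheap = x * 0.5f;
        float fancy = x * x * (3.0f - 2.0f * x);   /* stand-in for the costly path */
        return use_fancy ? fancy : cheap;          /* conditional select, no jump  */
    }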
It’s worth knowing that often there are clever techniques to completely avoid branching. Sometimes these techniques are simple, and sometimes they’re invasive and difficult to implement. It’s easy (for me, anyway) to get stuck thinking in a single-threaded CPU way and not see how to avoid branching until you’ve bumped into and seen some of the ways smart people solve these problems.
Why waste brain cells on theory when you can simply bench both versions and validate, rather than buying into any kind of micro-optimization advice at face value?
My understanding was that they don't. All instructions inside a "branch" still get executed; they're simply predicated to do nothing if the condition to enter is not true.
Well, for some definition of "real". There are hardware features (on some architectures) that implement semantics which evaluate the same way that "branched" scalar code would. There is no branching at the instruction level, and there can't be on SIMD (because the other parallel shaders being evaluated by the same instructions might not have taken the same branch!).
It's a semantic argument, but IMHO an important one. Way, way too many users of GPUs don't understand how the code generation actually works, leading to articles like this one. Ambiguous use of terms like "branch" is the problem.
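A scalar C sketch of what that per-lane predication amounts to (illustrative only, not how any particular GPU encodes it):

    /* Every lane runs both sides; a per-lane predicate decides which result
       is kept. The program counter never diverges. */
    void select_lanes(int n, const int *cond,
                      const float *then_v, const float *else_v, float *out) {
        for (int i = 0; i < n; i++) {
            float t = then_v[i];         /* "taken" side, always executed   */
            float e = else_v[i];         /* "not taken" side, also executed */
            out[i] = cond[i] ? t : e;    /* per-lane select                 */
        }
    }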
Method 1.
Method 2:[1] https://yarchive.net/comp/linux/cmov.html
[2] https://chipsandcheese.com/p/zen-5s-2-ahead-branch-predictor...
[3] https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html#index...
Branches fall off vs. cmov/select as your lane count increases; they are still useful, but only when you are confident the probability of all lanes agreeing is reasonably high.
Can't is a pretty strong word.
- Due to how SIMD works, it's quite likely both paths of the conditional statement get executed, so it's a wash
- Most importantly, if statements look nasty on the screen. Having an if statement means a discontinuity in the visuals, which means jagged and ugly pixels in the output. Of course, having a step function doesn't change this, but that means the code is already in the correct form to replace it with smoothstep, which means you can interpolate between the two variations, which does look good.
> Due to how SIMD works, it's quite likely both paths of the conditional statement get executed, so it's a wash
It's not just quite likely, it's what IQ is showing in the disassembly. For a ternary op like this one, with trivial expressions on each side, the GPU evaluates both and then masks the result based on the condition.
> a step function doesn't change this, but that means the code is already in the correct form to replace it with smoothstep, which means you can interpolate between the two variations, which does look good.
A smoothstep does smooth interpolation of two values. It seems unrelated to the issue in the post. step() relates to the ternary op in the sense that both can be used to express conditionals. The post explains why you wouldn't necessarily want to use step() vs ternary op. smoothstep is related to step in some sense, but not in a way that relates to the article? i.e., going from step() to smoothstep() will entirely change the semantics of the program precisely because of the 'smooth' part.
What I'm saying is that no matter how you express it in code, abrupt transitions of values introduce aliasing (or 'edge shimmer'), which looks unpleasant. The way you get rid of it is by smoothly blending between the two values, with smoothstep for example.
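A small C sketch of that blend (standard smoothstep formula; the band width is a placeholder value):

    /* Hard cut (aliases along the edge):  v = x > edge ? a : b;  */

    float smoothstep_f(float e0, float e1, float x) {
        float t = (x - e0) / (e1 - e0);
        t = t < 0.0f ? 0.0f : (t > 1.0f ? 1.0f : t);   /* clamp to [0, 1] */
        return t * t * (3.0f - 2.0f * t);
    }

    /* Blend over a small band around the edge instead of switching abruptly. */
    float blend(float x, float edge, float a, float b) {
        float t = smoothstep_f(edge - 0.01f, edge + 0.01f, x);  /* 0.01f: placeholder width */
        return (1.0f - t) * b + t * a;
    }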