- High allocation rate. I think his benchmarks do have high allocation rate so this part is fine.
- Large heap. I think splay has a large heap so this part is fine.
- Lots of objects in that large heap that simply survive one GC after another, while the allocation rate is mostly due to objects that die immediately. This is the part that splay doesn’t have. Splay churns all of its large heap.
Empirically, really big programs written in GC’d languages have all three of these qualities. They have heaps that are large enough for GenGC to be profitable. They allocate at a high enough rate. And most of the allocated objects die almost immediately, while most of the objects that survive GC survive for many GC cycles.
You need that kind of test for it to be worth it, and you would have such a test if you had big enough software to run.
oorza 33 days ago [-]
I remember back in the before times... when escape analysis debuted for the JVM - which allows it to scalar replace or stack allocate small enough short-lived objects that never escape local scope and therefore bypass GC altogether - our Spring servers spent something like 60% less time in garbage collection. Saying enterprise software allocates a ton of short lived objects is quite an understatement.
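To make the mechanism concrete, here is a minimal Java sketch (purely illustrative, not from those servers) of the kind of allocation escape analysis can eliminate: the temporary object never leaves the method, so the JIT can scalar-replace it into plain locals and no garbage is ever created.

    // Hypothetical example: Point never escapes distSq(), so HotSpot's
    // escape analysis can scalar-replace it (no heap allocation, no GC work).
    final class Point {
        final double x, y;
        Point(double x, double y) { this.x = x; this.y = y; }
    }

    final class EscapeDemo {
        static double distSq(double x, double y) {
            Point p = new Point(x, y);        // candidate for scalar replacement
            return p.x * p.x + p.y * p.y;     // only the doubles survive, in registers
        }
        // If p were returned or stored in a field, it would "escape" and have
        // to be heap-allocated as usual.
    }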
titzer 33 days ago [-]
Another thing big programs have is a huge ramp-up phase where they construct the main guts of their massive heap. Splay, for example, benefits from having an enormous nursery in the beginning, because during the startup phase the generational hypothesis doesn't really hold; most objects survive. So the first copy or couple of copies are a waste. V8 at one time had a "high promotion mode" that would kick in for such programs at the start, using various heuristics. In high promotion mode, entire pages are simply promoted en masse to the old gen, and the nursery size gets quickly enlarged.
kazinator 32 days ago [-]
In these systems it might make sense to have the image start with GC disabled, and the application will turn it on past a certain point in its initialization, and do a full GC at that point to compact everything to the mature generation.
fweimer 33 days ago [-]
That's probably it. I found a weird Java version of the benchmark here:
At least for some parameters, non-generational Shenandoah is faster than generational G1 (both overall run time and time spent in the benchmark phase, and overall CPU usage). But I expect it's possible to get wildly varying benchmark behavior depending on the parameter choices (the defaults are of course much too low).
CyberDildonics 32 days ago [-]
This is exactly why so much garbage collection optimization is apples and oranges. It is trying to mitigate the performance problems from things you shouldn't do in the first place if you care about speed.
Heap allocation, heap deallocation, and the pointer chasing that comes along with the memory fragmentation are possibly the biggest killers of speed outside of scripting languages, but Java was designed in a way that forces people into that style of programming. As a result, enormous research has gone into trying to deal with something that could have been avoided completely.
dan-robertson 33 days ago [-]
So maybe it would be interesting if allocations could be tracked better for the test? Maybe see the distribution of allocation life in terms of how many minor GCs are survived (perhaps you need to hack in some whole-heap liveness checking each time there’s a minor gc, for the purposes of the experiment). I think it would be nice to be able to rule out ‘most objects are promoted but have short lives’ as a cause.
hinkley 33 days ago [-]
One of the hallmarks of pure functional languages is that each “edit” of a data collection creates a new structure that refers back to part of the original. Which should be showcased well in a reasonable benchmark.
But I haven’t looked at Scheme in 32 years and I was angry about it then so I’m not going to start today.
So I will just agree that something is up.
caspper69 33 days ago [-]
His blog post is based upon the premise that his generational GC allocator doesn't seem to provide the performance benefits that generational GC is claimed to provide vs. other, more traditional GC approaches (e.g. mark-and-sweep, also his implementation).
My initial take is that either his implementations are deficient in some way (or not sufficiently modern), there's some underlying latent issue that arises from the Scheme to C compiler or the kind of C code it generates, or perhaps the benchmarks he is using are not indicative of real-world workloads.
But- I am out of my depth to analyze those things critically, and he seems to write about GC quite a bit, so maybe he's very in tune with the SOTA and he has uncovered an unexpected truth about generational gc.
It certainly wouldn't be the first time that an academic approach failed to deliver the benefits (or perhaps I should say the benefits weren't as great in as many scenarios as originally opined).
As an idiot programmer, my understanding is that Java, .NET & Go all have generational GC that is quite performant compared to older approaches, and that steady progress is made regularly across those ecosystems' gc (multiple gcs in the case of Java).
P.S. and now I see in a comment below (or maybe above now) that Go doesn't use a generational gc. I'm surprised.
neonsunset 33 days ago [-]
Go has a somewhat "exotic" non-generational design optimized for latency. As a result it has poor throughput and performs quite badly outside the scenarios it was designed for.
But on moderate to light allocation traffic it is really nice. Just not very general-purpose.
Java has by far the most advanced GC implementations (it lives and dies by GC perf/efficiency after all) with .NET being a very close competitor.
schmichael 33 days ago [-]
Go optimizing for latency over throughput means the GC very rarely interferes with my application's SLOs, at the cost (literally $) of presumably requiring more total compute than a GC that allows more fine-tuned tradeoffs between latency and throughput.
As someone who is not directly paying the bills but has wasted far too much of my life staring at JVM GC graphs and carefully tuning knobs, I vastly prefer Go’s opinionated approach. Obviously not a universally optimal choice but I’m so thankful it works for me! I don’t miss poring over GC docs and blog posts for days trying to save my service's P99 from long pauses.
gf000 33 days ago [-]
You are comparing very old Java if you had to touch anything other than the heap size.
Especially since Java's GCs are by far the very best; everything else is significantly behind (partially because other platforms may not be as reliant on object allocation, but it depends on your use case).
neonsunset 33 days ago [-]
.NET is not far behind at all. It is also better at heap size efficiency and plays nicely with interop. Plus the fact that the average allocation traffic is much lower in .NET applications on comparable code than in Java also helps.
schmichael 33 days ago [-]
This is absolutely true! I haven’t stared at a JVM GC graph in over 8 years.
naasking 33 days ago [-]
I wonder how much of that is truly GC improvements vs. increased hardware speed dropping pause times.
adgjlsfhk1 33 days ago [-]
With traditional low-latency GC designs (e.g. Shenandoah/G1), faster hardware provides almost no benefit, because the GC pause is bounded by the time for core-to-core communication, which hasn't decreased much (since we keep adding extra cores, the fight is just to keep it from getting slower).
naasking 33 days ago [-]
The benchmarks don't necessarily use all of the cores though, and memory bandwidth has increased considerably since those original benchmarks, so you should expect pause times to decrease from that alone.
hiddew 33 days ago [-]
You can use GC defaults, but tuning can provide valuable throughput or latency improvements for the application if you tune the GC parameters according to the workload. Especially latency sensitive applications may benefit from generational ZGC in modern JVMs.
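For example (a hedged illustration, and the exact flags depend on your JDK: -XX:+ZGenerational is needed on JDK 21 and becomes the default in later releases), generational ZGC with a fixed heap might be enabled like this:

    # illustrative only; adjust heap sizes to the workload
    java -XX:+UseZGC -XX:+ZGenerational -Xms8g -Xmx8g -jar app.jar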
neonsunset 33 days ago [-]
> the GC very rarely interferes with my applications SLO
Somewhat random data point, but coraza-waf, a WAF component for e.g. Caddy, severely regresses on larger payloads and the GC scaling issues are a major contributor to this. In another data point, Twitch engineering back in the day had to do incredibly silly tricks like doing huge allocations at the application start to balloon the heap size and avoid severe allocation throttling. There is no free lunch!
Go's GC also does not scale with cores linearly at all, which both Java and .NET do quite happily, up to very high core counts (another platform that does it well - BEAM, thanks to per-process isolated GCs).
The way Java GCs require an upfront configuration is not necessarily the only option. .NET's approach is quite similar to Go's - it tries to provide the best defaults out of the box. It also tries to adapt to the workload profile automatically as much as possible. The problem with Go here is that it offers no escape hatches whatsoever - you cannot tune heap sizes (beyond just limits), the memory watermark, collection aggressiveness and frequency, the latency/throughput tradeoff, or other knobs to best fit your use case. It's either Go's way or the highway.
Philosophically, I think there's an issue where if you have a GC or another feature that is very misuse-resistant, this allows badly written code to survive until it truly bites you. This was certainly an issue that caused a lot of poorly written async code in .NET back in the day to not be fixed until the community went into "over-correction". So in both Java and C# spaces developers just expect the GC to deal with whatever they throw at it, which can be orders of magnitude more punishing than what Go's GC can work with.
cyberax 33 days ago [-]
It's not that Go doesn't provide escape hatches out of malice, it just doesn't really _have_ them. Its GC is very simplistic and non-generational, so pretty much all you can control is the frequency of collections.
bob1029 33 days ago [-]
The .NET GC is impressive in its ability to keep things running longer than they probably should.
In most cases with a slow memory leak I've been able to negotiate an interim solution where the process is bounced every day/week/month. Not ideal, but buys time and energy to rewrite using streams or spans or whatever.
The only thing that I don't like about the .NET GC is the threshold for the large object heap. Every time a byte array gets to about 10k long, a little voice in my head starts to yell. The #1 place this comes up for me is deserialization of large JSON documents. I've been preferring actual SQL columns over JSON blobs to avoid hitting LOH. I also keep my ordinary blobs in their own table so that populating a row instance will not incur a large allocation by default.
How much of the .NET GC's performance is attributable to hard coding the threshold at 85k? If we made this configurable in the csproj file, would we suffer a severe penalty?
neonsunset 33 days ago [-]
> I've been preferring actual SQL columns over JSON blobs to avoid hitting LOH. I also keep my ordinary blobs in their own table so that populating a row instance will not incur a large allocation by default.
Are you using Newtonsoft.Json? I found System.Text.Json to be very well-behaved in terms of GC (assuming you are not allocating a >85K string). Also, a 10k-element byte array is still just ~10KB. If you are taking data in larger chunks, you may want to use an array pool. Regardless, even if you are hitting LOH, it should not pose much of an issue under Server GC. The only way to cause problems is if there's something which permanently roots objects in Gen2 or LOH in a way that, beyond leaking, causes high heap fragmentation, forcing non-concurrent Gen2/LOH collections under high memory pressure, which .NET really tries to avoid but sometimes has no choice but to do.
> How much of the .NET GC's performance is attributable to hard coding the threshold at 85k? If we made this configurable in the csproj file, would we suffer a severe penalty?
You could try it and see; it should not be a problem unless the number is unreasonable. It's important to consider whether large objects will indeed die in Gen0/1 and not just be copied around generations unnecessarily. Alternate solutions include segmented lists/arrays, pooling, or using more efficient data structures. LOH allocations themselves are never the source of a leak, and if there is a bug in the implementation, it must be fixed instead. It's quite easy to get a dump with 'dotnet-dump' and then feed it into Visual Studio, dotMemory or plain 'dotnet-dump analyze'.
zozbot234 33 days ago [-]
Another way of putting it is that Golang optimizes for latency over throughput because it would suck at latency if it only optimized for throughput. That can only be called a sensible choice.
A weak point of Golang, though, is its terrible interop with C and all C-compatible languages. That means you can't optimize parts of a Golang app to dispense with GC altogether, unless using a totally separate toolchain w/ "CGo".
coder543 33 days ago [-]
> unless using a totally separate toolchain w/ "CGo".
CGo is built into the primary Go toolchain... it's not a 'totally separate toolchain' at all, unless you're referring to the C compiler used by CGo for the C code... but that's true of every language that isn't C or C++ when it is asked to import and compile some C code. You could also write assembly functions without CGo, and that avoids invoking a C compiler.
> Means you can't optimize parts of a Golang app to dispense with GC altogether
This is also not true... by default, Go stack allocates everything. Things are only moved to the heap when the compiler is unable to prove that they won't escape the current stack context. You can write Go code that doesn't heap allocate at all, and therefore will create no garbage at all. You can pass a flag to the compiler, and it will emit its escape analysis. This is one way you can see whether the code in a function is heap allocating, and if it is, you can figure out why and solve that. 99.99% of the time, no one cares, and it just works. But if you need to "dispense with GC altogether", it is possible.
You can also disable the GC entirely if you want, or just pause it for a critical section. But again... why? When would you need to do this?
Go apps typically don't have much GC pressure in my experience because short-lived values are usually stack allocated by the compiler.
neonsunset 33 days ago [-]
> You can write Go code that doesn't heap allocate at all
In practice this proves to be problematic because there is no guarantee whether escape analysis will in fact do what you want (as in, you can't force it, and you don't control dependencies unless you want to vendor). It is pretty good, but it's very far from being bullet-proof. As a result, Go applications have to resort to sync.Pool.
Go is good at keeping allocation profile at bay, but I found it unable to compete with C# at writing true allocation-free code.
coder543 33 days ago [-]
As I mentioned in my comment, you can also observe the escape analysis from the compiler and know whether your code will allocate or not, and you can make adjustments to the code based on the escape analysis. I was making the point that you technically can write allocation-free code, it is just extremely rare for it to matter.
sync.Pool is useful, but it solves a larger class of problems. If you are expected to deal with dynamically sized chunks of work, then you will want to allocate somewhere. sync.Pool gives you a place to reuse those allocations. C# ref structs don't seem to help here, since you can't have a dynamically sized ref struct, AFAIK. So, if you have a piece of code that can operate on N items, and if you need to allocate 2*N bytes of memory as a working set, then you won't be able to avoid allocating somewhere. That's what sync.Pool is for.
Oftentimes, sync.Pool is easier to reach for than restructuring code to be allocation-free, but sync.Pool isn't the only option.
neonsunset 33 days ago [-]
> sync.Pool is useful, but it solves a larger class of problems. If you are expected to deal with dynamically sized chunks of work, then you will want to allocate somewhere. sync.Pool gives you a place to reuse those allocations. C# ref structs don't seem to help here, since you can't have a dynamically sized ref struct, AFAIK. So, if you have a piece of code that can operate on N items, and if you need to allocate 2*N bytes of memory as a working set, then you won't be able to avoid allocating somewhere. That's what sync.Pool is for.
Ref structs (which really are just structs that can hold 'ref T' pointers) are only one feature of the type system among many which put C# in the same performance weight class as C/C++/Rust/Zig. And they do help. Unless significant changes happen to Go, it will remain disadvantaged against C# in writing this kind of code.
Only the whiskers are touching, and the same applies to several other languages too. Yes, the median is impressively low… for anything other than those three. And it is still separate.
C# has impressive performance, but it is categorically separate from those three languages, and it is disingenuous to claim otherwise without some extremely strong evidence to support that claim.
My interpretation is supported not just by the Benchmarks Game, but by all evidence I’ve ever seen up to this point, and I have never once seen anyone make that claim about C# until now… because C# just isn’t in the same league.
> Ref structs (which really are just structs that can hold 'ref T' pointers)
A ref struct can hold a lot more than that. The uniquely defining characteristic of a ref struct is that the compiler guarantees it will not leave the stack, ever. A ref struct can contain a wide variety of different values, not just ref T, but yes, it can also contain other ref T fields.
igouy 33 days ago [-]
> just isn’t in the same league
I wonder what would be needed to even accept that programs were comparable?
(Incidentally, at-present the measurements summarized on the box plots are C# jit not C# naot.)
coder543 33 days ago [-]
I’m saying C# as a whole, not C# on one example. But, I have already agreed that C#’s performance has become pretty impressive. I also still believe that idiomatic Rust is going to be faster than idiomatic C#, even if C# now supports really advanced (and non-idiomatic) patterns that let you rewrite chunks of code to be much faster when needed.
It would be interesting to see the box plot updated to include the naot results — I had assumed that it was already.
igouy 32 days ago [-]
The benchmarks game's half-dozen tiny tiny examples have nothing to say about "C# as a whole".
coder543 32 days ago [-]
They have more to say than a single cherry picked benchmark from the half dozen.
I had asked the other person for additional benchmarks that supported their cause. They refused to point at a single shred of evidence. I agree the Benchmarks Game isn’t definitive. But it is substantially more useful than people making completely unsupported claims.
I find most discussions of programming language performance to be pointless, but some discussions are even more pointless than others.
igouy 32 days ago [-]
I curate the benchmarks game, so to me it's very much an initial step, a starting point. It's a pity that those next steps always seem like too much effort.
neonsunset 33 days ago [-]
This is a distribution of submissions. I suggest you look at the actual implementations and how they stack up performance-wise and what kind of patterns each respective language enables. You will quickly find out that this statement is incorrect and they behave rather closely on optimized code. Another good exercise would be to actually use a disassembler for once and see how it goes when writing a performant algorithm implementation. It will become apparent that C# for all intents and purposes must be approached quite similarly, with practically identical techniques and data structures as the systems programming family of languages, and will produce a comparable performance profile.
> No…? https://learn.microsoft.com/en-us/dotnet/csharp/language-ref...
> A ref struct can hold a lot more than that. What’s unique about a ref struct is that the compiler guarantees it will not leave the stack, ever. A ref struct can contain all sorts of different stack-allocatable values, not just references.
Do you realize this is not a mutually exclusive statement? Ref structs are just structs which can hold byref pointers aka managed references. This means that, yes, because managed references can only ever be placed on the stack (but not the memory they point to), a similar restriction is placed on ref structs alongside the Rust-like lifetime analysis to enforce memory safety. Beyond this, their semantics are identical to regular structs.
I.e.
> C# ref structs don't seem to help here, since you can't have a dynamically sized ref struct, AFAIK
Your previous reply indicates you did not know the details until reading the documentation just now. This is highly commendable, because reading documentation as a skill seems to be in short supply nowadays. However, it misses the point that memory (including dynamic, whatever you mean by this, I presume reallocations?) can originate from anywhere - stackalloc buffers, malloc, inline arrays, regular arrays or virtually any source of memory, which can be wrapped into Span<T>'s or addressed with unsafe byref arithmetic (or pinning and using raw pointers).
Ref structs help with this a lot and enable many data structures which reference arbitrary memory in a generalized way (think writing a tokenizer that wraps a span of chars, much like you would do in C but retaining GC compatibility without the overhead of carrying the full string like in Go).
You can also trivially author fully identical Rust-like e.g. Vec<T>[0] with any memory source, even on top of Jemalloc or Mimalloc (which has excellent pure C# reimplementation[1] fully competitive with the original implementation in C).
None of this is even remotely possible in any other GC-based language.
People have had a long time to submit better C# implementations. You are still providing no meaningful evidence.
> Do you realize this is not a mutually exclusive statement?
It doesn’t have to be mutually exclusive. You didn’t seem to understand why people care about ref structs, since you chose to focus on something that is an incidental property, not the reason that ref structs exist.
caspper69 33 days ago [-]
Brigading you is not my intent, so please don't take my comments that way.
I just want to add that C# is getting pretty fast, and it's not just because people have had a long time to submit better implementations to a benchmark site.
The language began laying the groundwork for AOT and higher performance in general with the introduction of Span<T> and friends 7 or so years ago. Since then, they have been making strides on a host of fronts to allow programmers the freedom to express most patterns expected of a low level language, including arbitrary int pointers, pointer arithmetic, typed memory regions, and an unsafe subset.
In my day-to-day experience, C# is not as fast as the "big 3" non-GCed languages (C/C++/Rust), especially in traditional application code that might use LINQ, code generation or reflection (which are AOT unfriendly features- i.e. AOT LINQ is interpreted at runtime), but since I don't tend to re-write the same code across multiple languages simultaneously I can't quantify the extent of the current speed differences.
I can say, however, that C# has been moving forward every release, and those benchmarks demonstrate that it is separating itself from the Java/Go tier (and I consider Go to be a notch or two above JITed Java, but no personal experience with GraalVM AOT yet) and it definitely feels close to the C/C++/Rust tier.
It may not ever attain even partial parity on that front, for a whole host of reasons (its reliance on its own compiler infrastructure and not a gcc or llvm based backend is a big one for me), but the language itself has slowly implemented the necessary constructs for safe (and unsafe) arbitrary memory manipulation, including explicit stack & heap allocation, and the skipping of GC, which are sort of the fundamental "costs of admission" for consideration as a high performance systems language.
I don't expect anyone to like or prefer C#, nor do I advocate forcing the language on anyone, and I really hate being such a staunch advocate here on HN (I want to avoid broken record syndrome), but as I have stated many times here, I am a big proponent of programmer ergonomics, and C# really seems to be firing on all cylinders right now (recent core library CVEs notwithstanding).
coder543 33 days ago [-]
> C# really seems to be firing on all cylinders right now
I just don’t like seeing people make bold claims without supporting evidence… those tend to feel self-aggrandizing and/or like tribalism. It also felt like a bad faith argument, so I stopped responding to that other person when there was nothing positive I could say. If the evidence existed, then they should have provided evidence. I asked for evidence.
I like C#, just as I like Go and Rust. But languages are tools, and I try to evaluate tools objectively.
> I can say, however, that C# has been moving forward every release, and those benchmarks demonstrate that it is separating itself from the Java/Go tier
I also agree. I have been following the development of C# for a very long time. I like what I have seen, especially since .NET Core. As I have mentioned in this thread already, C#’s performance is impressive. I just don’t accept a general claim that it’s as fast as Rust at this point, but not every application needs that much performance. I wish I could get some real world experience with C#, I just haven’t found any interesting tech jobs that use C#… and the job market right now doesn’t seem great, unfortunately.
caspper69 32 days ago [-]
I have hopes that adoption outside of stodgy enterprises will pick up, which would of course help the job situation (in due time of course).
Sometimes it's hard to shake the rep you have when you're now a ~25 year old language.
Awareness takes time. People need to be told, then they need to tinker here and there. Either they like what they see when they kick the tires or they don't.
I'm pretty language agnostic tbh, but I would like to see it become a bit more fashionable given its modernization efforts, cross-platform support, and MIT license.
neonsunset 33 days ago [-]
Please read through the description and follow-up articles on the BenchmarksGame website. People did submit benchmarks, but submitting yet another SIMD+unsafe+full parallelization implementation is not the main goal of the project. However, this is precisely the subject (at which Go is inadequate) that we are discussing here. And for it, my suggestions in the previous comment stand.
Your comment posted as I was making my edit re: Go.
Thank you for the detailed clarification.
whitehexagon 33 days ago [-]
I spent quite some years performance-tuning large Java systems, pre-warming server JVMs and then spending many hours staring at visualgc, a small tool from a Sun engineer for watching the various memory pools, including the generational GC. It was very satisfying work, and also pretty useful for uncovering bugs and race conditions when the systems were under load.
The GC advances that came along helped a lot, especially over the stop-the-world early days of Java GC, along with all the JVM tuning parameters that were gradually exposed for tweaking. But visualgc gave a real feel for how the generational GC was running just by watching the saw-tooth shaped graphs. Interestingly most of the garbage was string copying, especially with one company's system that had more abstractions than a Spring oak has leaves, say no more lest I have nightmares.
citrin_ru 30 days ago [-]
Go creates less garbage (more allocated on stack) so generational GC is less essential.
sfink 33 days ago [-]
My understanding is that splay is more of a benchmark that you try to make generational GC not hurt too much. It allocates lots of long-lived objects that it holds onto for the full test, and for JS at least it then hangs strings off of those objects. It's pretty easy to make a generational GC just add an extra copy for no benefit with splay.
In SpiderMonkey, we had to add pretenuring in order to avoid slowing down splay too much. As in, identify specific allocation sites that create long-lived allocations, and allocate them directly from the older generation instead of having them go through the nursery. It's sort of selectively disabling generational GC on an allocation site granularity. (Also, make sure you're not storing nursery strings inside of tenured objects. While it's possible they'll be quickly overwritten by different strings, it's much more likely that they're going to last as long as the object does.)
Within Octane, the RegExp subtest had the biggest gain from allocating strings in the nursery. (But that's going from generational objects -> generational objects and strings. Non-generational objects -> generational objects might show up more on something else.)
bjourne 33 days ago [-]
Wouldn't it be very difficult to know a priori whether the object allocated at a callsite is likely to be long-lived? The object has to be allocated before you can make some other object reference it. So when you do, even if you know that the reference is from an old-gen object, "the mistake has already been made" and the new object has already been allocated in the nursery.
rcxdude 32 days ago [-]
I think the idea is you can then adjust future allocations from the same callsite, assuming that object lifetime is fairly well correlated with allocating callsite.
sfink 32 days ago [-]
Exactly. During minor GC, you update the promotion rate of each callsite. If it gets too high, you start allocating things from that callsite directly from the tenured heap, since your statistics are telling you that the things allocated at that callsite are usually going to end up living long enough that you're just wasting time scanning and copying them from the nursery to the tenured heap.
And yes, it assumes that lifetime is correlated pretty well with allocating callsite. Also that the correlation persists over time. The latter part often doesn't hold, so it's helpful to have a mechanism for changing the decision if you're finding a lot of dead stuff in the tenured heap that came from a pretenured callsite. But it's a delicate tradeoff: tracking the allocation sites costs time and memory, and so you might only want to do that for nursery allocations, but then you don't have a way to associate dead tenured objects with their allocation sites in case you need to stop pretenuring... lots of possibilities in the design space.
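As a rough illustration of the bookkeeping (a toy sketch in Java, not SpiderMonkey's actual code), each allocation site can carry a couple of counters that minor GC updates, flipping the site to pretenured once the observed promotion rate crosses a threshold:

    // Toy per-callsite pretenuring heuristic. Names and thresholds are made up.
    final class AllocSite {
        long allocated;      // objects allocated from this site in the current window
        long promoted;       // of those, how many survived a minor GC and were tenured
        boolean pretenure;   // when true, allocate directly in the tenured heap

        void onMinorGc(double threshold, long minSamples) {
            if (allocated >= minSamples) {
                double rate = (double) promoted / allocated;
                pretenure = rate > threshold;   // flip (or un-flip) the decision
                allocated = promoted = 0;       // start a fresh observation window
            }
        }
    }

    // In the allocator: site.pretenure ? allocateTenured(size) : allocateInNursery(size)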
ComputerGuru 33 days ago [-]
C# (and .NET at large) uses generational GC to great effect, and there are some good writeups on the different modes you can run the GC in and its performance profiles. .NET has had a generational GC since forever and you can’t swap out the GC engine, so you won’t be able to find an analogous comparison, but it’s probably the current generational GC SOTA for a one-size-fits-all collector. It has admittedly two different allocation/cleanup roles (server and workstation), and, new to .NET 9, an option to configure the GC to act as if it were running alone on the system (the old default, i.e. not sharing resources with other apps on the same OS) or to more cooperatively try to manage allocation patterns.
Just mentioning this in case someone wants to read up on a different GC and play around with benchmarks. There is a fair amount of inner workings info and real-world results to dive into.
Thanks for sharing that; now I have something to add to my list of things to play with when I get a chance!
rurban 33 days ago [-]
I was in his FOSDEM GC talk about Whippet and found out that he doesn't know anything about the really good GC strategies. He only knows and has implemented dirt-slow mark-sweep, which is only needed for pointer stability from C callbacks, and then Immix, which is the slow variant of a copying GC, just without the double-heap requirement. But a conventional Cheney copying collector with 1-2 generations for minor sweeps is far better than those.
Until he adds a proper GC to Whippet I don't trust anything he says about GCs. He even has the luxury of a precise GC because Scheme carries the types along with its values.
And he doesn't know about colored pointers, using 2-5 bits for the GC state, nor NaN-tagging. Probably not about forwarding pointers either.
milesrout 33 days ago [-]
You can see from his blog archives (and this post) that the author is well aware of GC strategies other than mark-sweep including Cheney copying.
He has posted about conservative GC (contrary to your implication he has only implemented precise GC).
Maybe "forwarding pointers" is overloaded but given he has a post about a semispace collectors which uses something he refers to as forwarding pointers I don't see how he can be said not to know about it.
He also has posts referring to NaN-tagging and pointer tagging unless my memory betrays me.
Well, that would be good, because when we asked him in the Q&A at FOSDEM he had no idea about colored pointers and did not mention semi-space collectors at all, and had no plan to add it.
I didn't attend the talk and don't know him at all so wouldn't want to speak for him but from personal experience one can find oneself saying some bizarre things under stress/pressure when giving a presentation in front of lots of people. And then one looks back and goes: "why did I say that??"
naasking 33 days ago [-]
Immix is not a slow variant of a copying GC. It's pretty state of the art for copying GCs in fact, so claiming he doesn't know anything and hasn't implemented a "proper" GC is just incorrect.
NeutralForest 33 days ago [-]
I was at his talk as well and I found it interesting. Maybe you could open an issue in the Whippet repo to point at possible improvements?
cesarb 33 days ago [-]
I wonder how much the perceived advantages of generational GC are only due to Java's tendency to generate a lot of short-lived garbage.
mananaysiempre 33 days ago [-]
“A lot” is relative—I seem to remember a discussion long ago mentioning that Haskell/Scala-style programs were a problem for the JVM because the JVM is not designed for so many objects dying so quickly.
gf000 33 days ago [-]
The JVM is more than happy with short-lived objects, they are barely more expensive than a normal stack allocation (mostly by the header only).
The JVM has a so-called Thread-Local Allocation Buffer (TLAB), which is basically a simple buffer with a pointer to the next free slot. It can simply be bumped up on a new object's allocation, and later the still-alive objects will be evacuated to another generation and the whole thing can be cleared. This is faster than any malloc implementation C++/Rust/whatever might use, unless they are also doing arena allocations.
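Roughly, the idea looks like this (a minimal sketch of the bump-pointer scheme, not HotSpot's actual code):

    // Each thread owns a buffer; allocation is a bounds check plus a pointer
    // bump, with no locking. After survivors are evacuated, reuse is a reset.
    final class Tlab {
        private final byte[] buffer;   // stand-in for a chunk of the young generation
        private int top;               // next free offset

        Tlab(int size) { this.buffer = new byte[size]; }

        // Returns the offset of the new object, or -1 if the TLAB is exhausted
        // (the real VM would then grab a fresh TLAB or trigger a minor GC).
        int allocate(int size) {
            if (top + size > buffer.length) return -1;
            int obj = top;
            top += size;               // the entire "allocation"
            return obj;
        }

        void reset() { top = 0; }      // clearing the nursery is just this
    }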
Haskell has a few tricks up its sleeve that can help with its kind of allocation patterns, but speaking specifically about Haskell, it's mostly the laziness that makes it quite different.
lowbloodsugar 33 days ago [-]
But it’s not as fast as just allocating a structure on the stack instead of (*checks graph*) 20 objects on the heap, all pointing to the next with 64-bit pointers. I’ve got Rust structs that are 80 bytes and the same structure in Java is 800 bytes.
gf000 33 days ago [-]
What's the difference between the stack and another thread-local, constantly hot, in-cache region of the same memory? Especially when in many cases the objects are allocated right next to each other, so fetching happens from cache again. Not saying that expertly written Rust code can't beat naive Java, and especially with object headers Java will have a performance disadvantage in many cases - but that's a tiny disadvantage in most real-world use cases, multiple times offset by its advantages.
(Besides, Java has scalar replacement via escape analysis, which can do much the same thing as Rust does, but the JIT-compiler gods have to be in the correct alignment for that)
bjourne 33 days ago [-]
Deallocating stack-allocated objects is completely free. Allocating in a thread-local buffer may be as fast as allocating on the stack, but the allocations still increase the frequency with which the GC has to collect garbage.
gf000 33 days ago [-]
Not quite - the still reachable objects can be evacuated concurrently, and then the whole buffer is reset with literally zero work (just setting the pointer to the buffer's start).
bjourne 32 days ago [-]
Even if collection is run concurrently it is still the same amount of work.
lowbloodsugar 33 days ago [-]
What’s the difference…?
1. The size.
2. I don’t see that happening and I look at the output of the C2 compiler when I’m dealing with hotspots.
I'm aware of the cool shit the JVM can do, and done right, areas of code can be as fast as C or Rust. But in general, my experience is that Rust is much faster. Sometimes 10x faster where it matters to me (sustained server load).
munksbeer 29 days ago [-]
>Sometimes 10x faster where it matters to me (sustained server load).
I don't think that is a reasonable stat. What do you mean by faster? I'd be extremely surprised if this is representative of any real-world examples where someone has taken moderate care to write good Java code.
lowbloodsugar 24 days ago [-]
Well, just the fact that a structure can fit in one cache line vs ten is a start. A lot of the systems I deal with aren’t “ephemeral”, meaning the data gets copied out of Eden into survivor and then tenured just in time to die. What else? Interface call sites that don’t always point to a single type, so they don’t get hard-coded. Interfaces are quite a bit worse than even abstract base classes in Java if they actually have to be dispatched dynamically. Let's see. “Clever bit tricks”. Like fitting a Set implementation into a cache line if it only has 8 elements and using a regular hash set if it’s bigger. So much opportunity.
hinkley 33 days ago [-]
A lot of Lisps don’t even have a stack. It’s effectively a second heap with only call frames on it. That’s one way to implement tail call elimination for instance.
hinkley 33 days ago [-]
When Java got thread-local allocation buffers, stdlib’s malloc was still a blocking operation that didn’t scale with thread count. They were out a couple of years before I heard solid stories of someone pushing for a concurrent malloc implementation to be added to stdlib.
So however wide the gap was between stack and heap, it's even wider once you add concurrency.
kgeist 33 days ago [-]
I remember that a long time ago JBullet, a physics simulation library, had to maintain an error-prone pool of vectors and matrices because all the calculations involving the creation of temporary vectors and matrices put significant stress on the GC. Today, Minecraft allocates around 600 MB/s for a similar reason (Position3D objects & friends); however, on my PC, I don't notice any pauses.
gf000 33 days ago [-]
Is that really error-prone for a physics simulation engine? It's pretty common to refer to data only by indices in other places - sure, there are languages that can deal with it a bit better (e.g. operator overloading, manual memory management and/or value types), but as far as I know this is often done even in Rust, which has those properties (it's done to avoid the borrow checker there).
Also, Minecraft was famously written inefficiently.
kgeist 33 days ago [-]
By "error-prone," I mean that if you accidentally stored a reference to a pooled vector and used it after it was returned to the pool, all sorts of unexpected things could happen since everything is by reference. You had to be very careful.
hinkley 33 days ago [-]
I had a friend/former mentor who proved that object pools are slower in Java once GenGC was introduced, because they create pointers from the old generation to the new. He also authored the fastest XML parser of that era and previously helped me land a project everyone but us and our boss said was impossible because Java Is Too Slow, so I take him at his word.
So unless those pools are full of primitive data types, you’re going to cause more frequent full GC pauses, and longer due to all the back pointers.
ickyforce 32 days ago [-]
> So unless those pools are full of primitive data types, you’re going to cause more frequent full GC pauses, and longer due to all the back pointers.
If you have enough objects pooled to get low allocation rate then you never trigger full GC. That's what "low latency Java" aims for.
I think that the main downside of object pools in Java is that it's very easy to accidentally forget to return an object to a pool or return it twice. It turns out that it's hard to keep track of what "owns" the object and should return it if the code is not a simple try{}finally{}. This was a significant source of bugs at my previous job (algo trading).
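A minimal sketch of the ownership discipline (hypothetical names): when the borrow is confined to one method, try/finally makes the return unconditional; the hard bugs appear when the pooled object is handed off and "who releases it" becomes ambiguous.

    import java.util.ArrayDeque;

    final class Order { /* reused, mutable fields */ }

    final class PoolUsage {
        static void handle(ArrayDeque<Order> pool) {
            Order o = pool.poll();
            if (o == null) o = new Order();
            try {
                // ... fill and process the order ...
            } finally {
                pool.push(o);   // released exactly once, even if processing throws
            }
        }
    }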
hinkley 32 days ago [-]
You only avoid full GCs as long as the entire app makes no allocations, not just the hot path. If you allocate garbage at all, eventually you will trigger one.
fiddlerwoaroof 33 days ago [-]
If you could add some sort of hint to the object pool instance that it should always be in the new generation, then you wouldn’t have this problem. This could probably be provided generically as a class named something like NurseryCollection.
However, I thought the people that really were serious about performance would allocate the pool in memory the GC doesn’t control (back in the day using internal APIs, but I think there are new official APIs for this) such that the object pool and every object in it would just be ignored by the GC.
hinkley 32 days ago [-]
Looks like an arena allocator is available in JDK 20. But I’m fairly sure it’s been in the RT spec for a lot longer.
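If I'm thinking of the same thing, that's the java.lang.foreign Arena (preview in JDK 20/21, final in JDK 22). A minimal sketch: the memory lives outside the GC-managed heap and is freed deterministically when the arena closes.

    import java.lang.foreign.Arena;
    import java.lang.foreign.MemorySegment;
    import java.lang.foreign.ValueLayout;

    final class ArenaDemo {
        public static void main(String[] args) {
            try (Arena arena = Arena.ofConfined()) {
                MemorySegment buf = arena.allocate(1024);      // off-heap, never scanned by the GC
                buf.set(ValueLayout.JAVA_INT, 0, 42);
                System.out.println(buf.get(ValueLayout.JAVA_INT, 0));
            }   // every segment allocated from the arena is freed here
        }
    }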
steveklabnik 33 days ago [-]
I have long suspected that there is some sort of deeper truth here. It’s not really about Java. If you squint hard enough, a C program has the stack as its nursery, and the heap as a singular old generation. The stack is even arena allocated, of sorts!
This is of course a bit hand-wavy, I haven’t had the time to truly try and investigate this in a more rigorous way.
jerf 32 days ago [-]
In general, memory management needs to be understood as a continuum and not a binary "uses GC or uses manual memory management". Per the very discussion we're having, GC is not an atomic thing, and on the other end, what is commonly called "manual memory management" is itself not. malloc still makes non-trivial decisions, as evidenced by the fact that some programs can be greatly benefited or harmed by switching out the malloc it uses. Beyond what most people call "manual memory management" lies "arena allocation", and in the really extreme case, fully static memory layout with no allocation at all. The latter is how NES games were generally written and sufficiently small embedded projects still get written that way today.
Further pushing this into a "continuum" rather than a binary flag is that programs can and do freely mix and match. Nothing stops a program from using two mallocs, some arenas, GC for certain values, and static allocations for yet others. The Memory Management Police do not come along and arrest you for having Impure Designs or anything. There are languages like (I believe) Zig where the ability to cleanly do this is more-or-less a first-class language feature.
pfdietz 33 days ago [-]
Generational GC was perceived as useful even before Java existed, for example in Lisp.
actuallyalys 33 days ago [-]
Funnily enough, my impression is that Clojure, the most popular Lisp on the JVM, creates a lot of garbage—probably more than Java—but I believe that’s due to it using immutable collections and not due to it being a Lisp.
fiddlerwoaroof 33 days ago [-]
Yeah, immutable collections have to use structure-sharing to be implemented efficiently and this frequently results in common operations allocating lots of small objects rather than updating some object in-place.
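A tiny persistent list in Java (a sketch of the general idea, not Clojure's actual data structures) shows the pattern: every "update" allocates one small node and shares the entire existing list as its tail.

    // Persistent singly-linked list: cons() never mutates, it only allocates.
    final class PList<T> {
        final T head;
        final PList<T> tail;   // shared with every older version of the list
        private PList(T head, PList<T> tail) { this.head = head; this.tail = tail; }

        static <T> PList<T> cons(T head, PList<T> tail) {
            return new PList<>(head, tail);   // the one small new allocation
        }
    }

    // PList<Integer> a = PList.cons(1, null);   // [1]
    // PList<Integer> b = PList.cons(2, a);      // [2, 1] - shares all of a
    // a is still valid and unchanged; only the single new node was allocated.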
zozbot234 33 days ago [-]
You can use a borrow checker to replace immutable objects with update-in-place ("ephemeral") ones where feasible.
fiddlerwoaroof 33 days ago [-]
And, escape analysis and "dynamic extent" hints can have similar effects by stack-allocating intermediate values. Also, you can design your transformations to compose before handing off to a result builder, which is how modern Clojure reduces this problem.
davidgay 33 days ago [-]
You've clearly never observed the allocation behaviour of a continuation-based compiler (every function call causes a short-lived heap allocation - the heap is the "stack")...
wbl 33 days ago [-]
And the stack the heap with Cheney on the MTA.
bjourne 33 days ago [-]
Maybe it is about the safe points? In a conventional generational GC every thread has its own nursery (plus a semi-space for copying collection), so you can collect one thread's garbage without stopping any other thread (so data that threads share must not be allocated in the nursery).
> So, for this test with eight threads, on my 8-core Ryzen 7 7840U laptop, the nursery is 16MB including the copy reserve, which happens to be the same size as the L3 on this CPU.
Sounds way too small. The speed of copying collection is proportional to the number of survivors, and with only 16 MB you risk having lots of false survivors.
mike_hearn 33 days ago [-]
It does work, at least for imperative languages that allocate things on the heap a lot. Look at the evolution of Java's open source pauseless GCs. Both ZGC and Shenandoah started out non-generational for ease of implementation reasons. They both now went fully generational, with big improvements in real world and benchmark performance (throughput as they were pauseless already).
kazinator 33 days ago [-]
To see a benefit from generational garbage collection, you need a lot of mature data in the heap of your test case. If there's not a big difference between a full traversal and a partial traversal, the benefit won't be realized.
I suspect it's a huge win in most modern manage application settings, because application images are huge with a lot of cruft in them. If your algorithm is churning away allocating lots of temporary objects, and it's sitting in the middle of a behemoth image, generational GC is a big win.
In dynamic languages, the functions themselves are objects that can be garbage collected, and the function bindings have to be traversed in a full GC. Yet functions rarely change; they quickly recede into the mature generation.
You can investigate generational vs. simple mark-and-sweep with my TXR Lisp. At compile time you can configure it to disable generational GC. This is not regularly tested, but I did check it for regressions fairly recently.
It is not a copying collector. Objects stay where they are. The nursery is just a global array of pointers. There are two other arrays that help with old-pointing-to-young mutations. Resetting these arrays costs almost nothing: we just set their fill indices to zero. You can easily tune their sizes, and there's a configuration for small memory which changes several defaults at the same time.
This, along with other sensible tradeoffs, is why Go ate a large slice of the network software market.
gf000 33 days ago [-]
"Optimized" as in stops the user thread from doing useful work, when a lot of allocation happens?
At least a couple of years ago Go had a very simplistic GC, but even today it is absolutely nowhere close to how good Java's GCs are (there is ZGC, for example, whose pause times are completely decoupled from heap size, so it can actually keep sub-millisecond pause times - the OS causes bigger pauses than that).
At most Go puts slightly less work on its GC due to having value types.
Looking at only the CPU numbers from this benchmark is misleading. This site requires the use of default configurations for each language runtime, and JVMs tend to have a much larger default heap than the Go runtime. Tracing GCs tend to have a CPU/memory tradeoff built into them [1]. Compare the memory footprint of the best Go and best Java programs in terms of wall time [2] (the site doesn't make it easy, you have to go back and forth between the two links) and the difference is enormous (these Go programs are running with much smaller total heap sizes, so much less runway for the GC).
If you use GOGC and GOMEMLIMIT to even the playing field (and note, use a Go program that isn't using sync.Pool) the difference in wall time is far less stark (though it's still there, maybe 5-15%; don't quote me, it's been a long time since I measured this and I don't remember exactly). (The difference in total CPU time is bigger.)
And finally, keep in mind this benchmark is hammering as hard as it can on the GC. How it impacts real applications depends on how much the application relies on the heap.
(But stuff on those pages will be moving around for a few days.)
mknyszek 33 days ago [-]
I just meant that you cannot easily see wall time and memory use together on the same page. (I would love to be wrong about that.)
igouy 32 days ago [-]
Are you reading on your phone? Portrait orientation may lose columns.
On the page you referenced:
— secs is elapsed seconds aka wall time aka wall clock
— mem is memory use
— gz is source code size
— cpu secs is cpu seconds
secs, mem, cpu secs as reported by BenchExec
mknyszek 32 days ago [-]
Yep! I was on my phone, sorry about that.
Did it ever used to only show wall time? Or am I just completely misremembering?
igouy 25 days ago [-]
It's been showing both for at-least 15 years (and before that claimed to be showing CPU secs, not Elapsed secs).
gf000 33 days ago [-]
I can see a ~3x difference in memory usage in case of JIT-compiled Java vs Go, but AOT-compiled Java comes in at less memory usage while still being faster than Go. I believe Java AOT uses the Serial GC, while the "normal" version defaults to G1, so there is that (the GC code is actually reused between Graal and OpenJDK, so we could remove this "variable")
Don't forget that the JVM has to allocate memory for all its subsystems as well, like the JIT compiler, so that 3x memory is not entirely heap usable by the program.
And I deliberately linked this benchmark, as the topic at hand is the GC itself.
mknyszek 33 days ago [-]
Fascinating. I could see the Serial GC, if it's generational, just totally crush this particular benchmark. I wonder what the heap size heuristic is for the Serial GC.
> Don't forget that the JVM has to allocate memory for all its subsystems as well, like the JIT compiler, so that 3x memory is not entirely heap usable by the program.
That's fair. I recall doing my due diligence here and confirming it is actually using mostly heap memory, but again it's been a while and I could be wrong. (Also if the actual heap size is only ~100s of MiB and the rest of the subsystems need north of a GiB, that's much more than I would have anticipated.)
> And I deliberately linked this benchmark, as the topic at hand is the GC itself.
Sure. Not trying to suggest the benchmark doesn't have any utility, just that even for just GC performance, it doesn't paint a complete picture, especially if you're only looking at wall time.
gf000 33 days ago [-]
> That's fair. I recall doing my due diligence here and confirming it is actually using mostly heap memory, but again it's been a while and I could be wrong.
No, I think you are right - I'm not trying to claim that most of that 1.7 GB is used by the core JVM, just that that's another factor (besides simply the different base heuristics on how much space to claim).
The only fair thing would probably be to re-run the benchmark multiple times with different available RAM (via cgroup) and see the graph. Though I'm fairly confident that Java would beat Go in this particular benchmark at any heap size.
neonsunset 33 days ago [-]
Binary-trees is very interesting. Showcases how Java can easily compete with native code that uses hand-managed arena allocators. While investigating why it is so much better than .NET* in this case I reached a conclusion that one of the major contributing factors is that it uses TLAB which "inlines" the GC allocation code completely into the callers, making the allocations indeed just thread-local pointer bumps. .NET has something similar (allocation context) but you do have to go through a call. I assume TLAB allocations and TLAB refills are just this much better regardless.
(* in all the number crunching ones we have every other GC language, including Java and Go, confidently beat <3 )
gf000 33 days ago [-]
Yeah, it's probably a bit unfair given that GC research itself is mostly done on Java.
Also, there might be some philosophical difference at play here: Java (the JVM) tends to expose only a very limited API (no structs, pointers, etc.), but this allows more flexibility on the runtime's part, while .NET lets the developer touch/control everything, but that might mean less room for doing some advanced stuff on the runtime side.
The age-old generic-specific balance.
(Also, I like this playful competition between languages, much better than pointless flamewars :D)
kgeist 33 days ago [-]
IIRC Go has "GC assist" which forces goroutines that allocate too frequently to assist in GC work (share CPU time, i.e. they're slowed down), and I've never seen anything like that described for Java/C#. Interesting approach.
gf000 33 days ago [-]
In other words, it slows down the user's code randomly. I wouldn't call that a necessarily good tradeoff.
kgeist 33 days ago [-]
It's a back-pressure mechanism. I think slightly slower code on average is better than unexpected long GC pauses.
gf000 33 days ago [-]
But important code that has to allocate a lot will be disproportionately affected by this. Also, it doesn't solve "unexpectedly long GC pauses", it just somewhat lowers their frequency.
Java's ZGC is multiple generations ahead, and I believe is a fairer tradeoff (overall lower throughput (due to read barriers over write barriers), but guaranteed <1ms pause time due to basically everything happening concurrently)
kgeist 33 days ago [-]
Go's garbage collector is concurrent, too. GC assist makes sure the GC can keep up with allocations.
gf000 33 days ago [-]
Correct me if I'm wrong, but concurrent GC only means that certain phases of GC can be done while the application is running.
GC pauses are a necessity exactly for the parts that can't be done concurrently, and Go is definitely not using ZGC's next-generation design that decouples the pause length from the heap size, so we still have our "arbitrarily long pause" problem on the table.
kgeist 33 days ago [-]
From Go's docs:
>The Go GC avoids making the length of any global application pauses proportional to the size of the heap, and that the core tracing algorithm is performed while the application is actively executing.
>Brief stop-the-world pauses when the GC transitions between the mark and sweep phases.
I didn't benchmark it myself, though. Go is mostly used in stateless microservices where huge heap sizes are very uncommon. We usually scale horizontally, not vertically. Also, Go has value objects which can be allocated on the stack or embedded inside other objects, so it allocates far less garbage than Java. So practically for me Go's GC has been a non-issue, I barely remember it even exists.
yvdriess 32 days ago [-]
Go GC's stop-the-world pauses are essentially synchronization barriers at the start and end of a GC phase. That pause length is a function of the number of worker threads, not the size of the heap. You can check this yourself by running the tile38 and cockroachdb benchmarks with GODEBUG=gctrace=1.
gf000 32 days ago [-]
It's a function of both, as even the sibling comment says.
Obviously there is not much to synchronize when there is only a single thread.
yvdriess 32 days ago [-]
Any concurrent GC will have an unpredictable impact on application threads. The more obvious one is the necessary synchronization on GC startup and shutdown, e.g. to turn on write barriers during GC. Less obvious but perhaps more impactful is the cache thrashing incurred by the GC threads' heap traversal.
nobodyandproud 33 days ago [-]
My guess: in real-world scenarios, there's overhead such as swap space and virtual memory, analogous to a cache miss.
A full mark-and-sweep GC would have to deal with this extra overhead.
Whereas a generational GC would reduce the cost of those misses because it only considers a subset of objects, and the memory least likely to be out on disk or in slower swap space is the most recently allocated objects.
milesrout 33 days ago [-]
The issue of the generational hypothesis is interesting. Of course if a benchmark doesn't exhibit "generational" behaviour then it won't be a good test of a generational collector. But taken too far, that creates a bias: you are selecting benchmarks that fit what you know a generational collector is good at.
The question then is: are real world programs as generational as we think? This might depend on whether we assume we are using a generational collector. If you assume short-lived object allocation is ~free, then you will produce software conforming to the generational hypothesis, but that is not good evidence that the generational hypothesis is true.
It is a bit like saying "caches are important because of locality of reference" and exhibiting as evidence code optimised to make good use of caches.
We have to be careful that we are not just measuring things designed around themselves.
mjburgess 33 days ago [-]
I'd imagine generational GCs perform better on OO languages, especially ones where almost everything is boxed.
I have recently prototyped my own non-OO, multifn-style polymorphic language using tagged pointers (60-bit payload, 4-bit tag) -- you can fit a float32, an int/uint, etc. into 60 bits, as well as a case-insensitive 12-character alphanumeric string. With various unboxed container types available (e.g., matrix of doubles, etc.), you can do a lot by just allocating into a local arena for a given scope, then resetting the arena. In this case, I lean towards a very simple GC for all the heap stuff, since it's probably going to be mostly large, long-lived objects.
GMoromisato 32 days ago [-]
BTW, I'm using NaN boxing, and it's great--I can fit a float64 into 8 bytes and have bits left over for tagging pointers, ints, etc.
Basically, the idea is to encode all non-float values into the NaN representation. A NaN in a float64 leaves 52 bits undefined, and you can use those to encode the other values.
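A minimal sketch of the idea in Go (the bit layout, tag values, and names below are made up for illustration, not taken from any particular runtime): ordinary float64 bit patterns pass through untouched, and other values are packed into the otherwise-unused quiet-NaN payload bits.

    package main

    import (
        "fmt"
        "math"
    )

    // Everything lives in one 64-bit word. Real floats keep their normal
    // encoding; boxed values sit inside the quiet-NaN space, with a small
    // tag and a 48-bit payload (enough for a pointer on common platforms).
    type Value uint64

    const (
        quietNaN    = 0x7FF8_0000_0000_0000 // exponent all ones + quiet bit
        tagShift    = 48
        tagMask     = 0x7             // 3-bit tag; 0 is reserved for real NaNs
        payloadMask = 1<<tagShift - 1 // low 48 bits
    )

    func BoxFloat(f float64) Value { return Value(math.Float64bits(f)) }

    func Box(tag, payload uint64) Value {
        return Value(quietNaN | tag<<tagShift | payload&payloadMask)
    }

    // IsFloat reports whether v is an ordinary float64 (including real NaNs,
    // which carry tag 0 in this scheme).
    func (v Value) IsFloat() bool {
        return uint64(v)&quietNaN != quietNaN || uint64(v)>>tagShift&tagMask == 0
    }

    func (v Value) Float() float64  { return math.Float64frombits(uint64(v)) }
    func (v Value) Tag() uint64     { return uint64(v) >> tagShift & tagMask }
    func (v Value) Payload() uint64 { return uint64(v) & payloadMask }

    func main() {
        f := BoxFloat(3.14)
        n := Box(1, 42) // tag 1 = "small integer" in this sketch
        fmt.Println(f.IsFloat(), f.Float())            // true 3.14
        fmt.Println(n.IsFloat(), n.Tag(), n.Payload()) // false 1 42
    }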
barrkel 32 days ago [-]
Generational GC is based on the assumption that most objects die young, and objects that survive are around for a long time - a kind of bathtub object lifetime curve.
This assumption is approximately true for GUI apps, and a lot more true for server apps. The idea is that all the objects allocated during an event dispatch or a server request are dead when control returns to the event loop.
To make those assumptions fit the program better, the size of the generations should be tuned (manually or automatically) such that for the rate of allocation this assumption holds true.
You need three generations: the first, where objects are born and only a few survive long enough to get out; the second, for objects which are destined to die young, but happen to still be alive when you collect the first generation; and the third, for objects that hopefully never die, so you never have to collect it.
The first generation should be sized to fit in something close to a CPU cache, so it’s really quick to scan.
The second generation should be sized so that it isn't collected so often that it still has live objects in it - its collection interval should outlive the longest event / request dispatch you've got, and then a little bit more.
If your application doesn’t feel like a GUI or a server app that dispatches events or requests, generational GC probably isn’t the right choice for you.
snikeris 33 days ago [-]
I think the idea is to reduce the set of memory that needs to be frequently collected. Long lived objects age to the old generation which can be large and infrequently collected. I've used this kind of collector in the past for applications which held a large and mostly static dataset in memory.
lern_too_spel 33 days ago [-]
You need to understand the object lifetime distribution.
But I haven’t looked at Scheme in 32 years and I was angry about it then so I’m not going to start today.
So I will just agree that something is up.
My initial take is that either his implementations are deficient in some way (or not sufficiently modern), there's some underlying latent issue that arises from the Scheme to C compiler or the kind of C code it generates, or perhaps the benchmarks he is using are not indicative of real-world workloads.
But I am out of my depth to analyze those things critically, and he seems to write about GC quite a bit, so maybe he's very in tune with the SOTA and has uncovered an unexpected truth about generational GC.
It certainly wouldn't be the first time that an academic approach failed to deliver the benefits (or perhaps I should say the benefits weren't as great in as many scenarios as originally opined).
As an idiot programmer, my understanding is that Java, .NET & Go all have generational GC that is quite performant compared to older approaches, and that steady progress is made regularly across those ecosystems' gc (multiple gcs in the case of Java).
P.S. and now I see in a comment below (or maybe above now) that Go doesn't use a generational gc. I'm surprised.
But on moderate to light allocation traffic it is really nice. Just not very general-purpose.
Java has by far the most advanced GC implementations (it lives and dies by GC perf/efficiency after all) with .NET being a very close competitor.
As someone who is not directly paying the bills but has wasted far too much of my life staring at JVM GC graphs and carefully tuning knobs, I vastly prefer Go's opinionated approach. Obviously not a universally optimal choice, but I'm so thankful it works for me! I don't miss poring over GC docs and blog posts for days trying to save my services' P99 from long pauses.
Especially that Java's GCs are by far the very best; everything else is significantly behind (partially because other platforms may not be as reliant on object allocation, but it depends on your use case).
Somewhat random data point, but coraza-waf, a WAF component for e.g. Caddy, severely regresses on larger payloads and the GC scaling issues are a major contributor to this. In another data point, Twitch engineering back in the day had to do incredibly silly tricks like doing huge allocations at the application start to balloon the heap size and avoid severe allocation throttling. There is no free lunch!
Go's GC also does not scale with cores linearly at all, which both Java and .NET do quite happily, up to very high core counts (another platform that does it well - BEAM, thanks to per-process isolated GCs).
The way Java GCs require an upfront configuration is not necessarily the only option. The .NET approach is quite similar to Go's - it tries to provide the best defaults out of the box. It also tries to adapt to the workload profile automatically as much as possible. The problem with Go here is that it offers almost no escape hatches - beyond limits, a memory watermark, and collection aggressiveness and frequency, you cannot tune heap sizes, the latency/throughput tradeoff, or other knobs to fit your use case best. It's either Go's way or the highway.
Philosophically, I think there's an issue where if you have a GC or another feature that is very misuse-resistant, this allows badly written code to survive until it truly bites you. This was certainly an issue that caused a lot of poorly written async code in .NET back in the day to not be fixed until the community went into "over-correction". So in both Java and C# spaces developers just expect the GC to deal with whatever they throw at it, which can be orders of magnitude more punishing than what Go's GC can work with.
In most cases with a slow memory leak I've been able to negotiate an interim solution where the process is bounced every day/week/month. Not ideal, but buys time and energy to rewrite using streams or spans or whatever.
The only thing that I don't like about the .NET GC is the threshold for the large object heap. Every time a byte array gets to about 10k long, a little voice in my head starts to yell. The #1 place this comes up for me is deserialization of large JSON documents. I've been preferring actual SQL columns over JSON blobs to avoid hitting LOH. I also keep my ordinary blobs in their own table so that populating a row instance will not incur a large allocation by default.
How much of the .NET GC's performance is attributable to hard coding the threshold at 85k? If we made this configurable in the csproj file, would we suffer a severe penalty?
Are you using Newtonsoft.Json? I found System.Text.Json to be very well-behaved in terms of GC (assuming you are not allocating a >85K string). Also, a 10k-element byte array is still just ~10KB. If you are taking data in larger chunks, you may want to use an array pool. Regardless, even if you are hitting the LOH, it should not pose many issues under Server GC. The only way to cause problems is if something permanently roots objects in Gen2 or the LOH in a way that, beyond leaking, causes high heap fragmentation, forcing non-concurrent Gen2/LOH collections under high memory pressure, which .NET really tries to avoid but sometimes has no choice but to do.
> How much of the .NET GC's performance is attributable to hard coding the threshold at 85k? If we made this configurable in the csproj file, would we suffer a severe penalty?
You could try it and see; it should not be a problem unless the number is unreasonable. It's important to consider whether large objects will indeed die in Gen0/1 and not just be copied around generations unnecessarily. Alternate solutions include segmented lists/arrays, pooling, or using more efficient data structures. LOH allocations themselves are never the source of a leak; if there is a bug in the implementation, it must be fixed instead. It's quite easy to get a dump with 'dotnet-dump' and then feed it into Visual Studio, dotMemory, or plain 'dotnet-dump analyze'.
A weak point of Golang, though, is its terrible interop with C and all C-compatible languages. Means you can't optimize parts of a Golang app to dispense with GC altogether, unless using a totally separate toolchain w/ "CGo".
CGo is built into the primary Go toolchain... it's not a 'totally separate toolchain' at all, unless you're referring to the C compiler used by CGo for the C code... but that's true of every language that isn't C or C++ when it is asked to import and compile some C code. You could also write assembly functions without CGo, and that avoids invoking a C compiler.
> Means you can't optimize parts of a Golang app to dispense with GC altogether
This is also not true... by default, Go stack allocates everything. Things are only moved to the heap when the compiler is unable to prove that they won't escape the current stack context. You can write Go code that doesn't heap allocate at all, and therefore will create no garbage at all. You can pass a flag to the compiler, and it will emit its escape analysis. This is one way you can see whether the code in a function is heap allocating, and if it is, you can figure out why and solve that. 99.99% of the time, no one cares, and it just works. But if you need to "dispense with GC altogether", it is possible.
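A tiny sketch of what inspecting this looks like in practice (the file name and types are hypothetical; build with `go build -gcflags=-m` to have the compiler print its escape-analysis decisions, whose exact wording varies by compiler version):

    // escape.go: toy example for inspecting Go's escape analysis.
    package main

    import "fmt"

    type point struct{ x, y int }

    // sum's point never leaves the function, so it stays on the stack
    // and creates no garbage.
    func sum(a, b int) int {
        p := point{a, b}
        return p.x + p.y
    }

    // leak returns a pointer, so the point must be heap-allocated;
    // the compiler reports it as escaping.
    func leak(a, b int) *point {
        return &point{a, b}
    }

    func main() {
        fmt.Println(sum(1, 2))
        fmt.Println(leak(3, 4))
    }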
You can also disable the GC entirely if you want, or just pause it for a critical section. But again... why? When would you need to do this?
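For completeness, a minimal sketch of doing exactly that: debug.SetGCPercent(-1) is the programmatic equivalent of GOGC=off, and the critical-section function below is a placeholder, not a real workload.

    package main

    import (
        "runtime"
        "runtime/debug"
    )

    func main() {
        // Turn automatic collection off for a latency-critical section;
        // the return value is the previous GOGC setting.
        old := debug.SetGCPercent(-1)

        criticalSection()

        // Restore the previous setting and force a cycle now, while we're
        // at a convenient point, instead of waiting for the pacer.
        debug.SetGCPercent(old)
        runtime.GC()
    }

    func criticalSection() {
        // Placeholder for latency-sensitive work; the heap still grows
        // while the collector is off, so memory use is the caller's problem.
        _ = make([]byte, 1<<20)
    }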
Go apps typically don't have much GC pressure in my experience because short-lived values are usually stack allocated by the compiler.
In practice this proves to be problematic because there is no guarantee whether escape analysis will in fact do what you want (as in, you can't force it, and you don't control dependencies unless you want to vendor). It is pretty good, but it's very far from being bullet-proof. As a result, Go applications have to resort to sync.Pool.
Go is good at keeping allocation profile at bay, but I found it unable to compete with C# at writing true allocation-free code.
sync.Pool is useful, but it solves a larger class of problems. If you are expected to deal with dynamically sized chunks of work, then you will want to allocate somewhere. sync.Pool gives you a place to reuse those allocations. C# ref structs don't seem to help here, since you can't have a dynamically sized ref struct, AFAIK. So, if you have a piece of code that can operate on N items, and if you need to allocate 2*N bytes of memory as a working set, then you won't be able to avoid allocating somewhere. That's what sync.Pool is for.
Oftentimes, sync.Pool is easier to reach for than restructuring code to be allocation-free, but sync.Pool isn't the only option.
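A minimal sync.Pool sketch along those lines (the function, pool name, and sizes are made up; the point is just that each call reuses a scratch buffer instead of producing fresh garbage):

    package main

    import (
        "bytes"
        "fmt"
        "sync"
    )

    // A pool of reusable scratch buffers, so the per-call working set
    // doesn't turn into new garbage on every request.
    var bufPool = sync.Pool{
        New: func() any { return new(bytes.Buffer) },
    }

    // encode needs roughly 2*len(items) bytes of scratch space.
    func encode(items []int) int {
        buf := bufPool.Get().(*bytes.Buffer)
        buf.Reset()
        for _, it := range items {
            buf.WriteByte(byte(it))
            buf.WriteByte(byte(it >> 8))
        }
        n := buf.Len()
        bufPool.Put(buf) // hand the buffer (and its capacity) back for reuse
        return n
    }

    func main() {
        fmt.Println(encode([]int{1, 2, 3})) // 6
    }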
Ref structs (which really are just structs that can hold 'ref T' pointers) are only one feature of the type system among many which put C# in the same performance weight class as C/C++/Rust/Zig. And they do help. Unless significant changes happen to Go, it will remain disadvantaged against C# in writing this kind of code.
Only the whiskers are touching, and the same applies to several other languages too. Yes, the median is impressively low… for anything other than those three. And it is still separate.
C# has impressive performance, but it is categorically separate from those three languages, and it is disingenuous to claim otherwise without some extremely strong evidence to support that claim.
My interpretation is supported not just by the Benchmarks Game, but by all evidence I've ever seen up to this point, and I have never once seen anyone make that claim about C# until now… because C# just isn't in the same league.
> Ref structs (which really are just structs that can hold 'ref T' pointers)
No…? https://learn.microsoft.com/en-us/dotnet/csharp/language-ref...
A ref struct can hold a lot more than that. The uniquely defining characteristic of a ref struct is that the compiler guarantees it will not leave the stack, ever. A ref struct can contain a wide variety of different values, not just ref T, but yes, it can also contain other ref T fields.
I wonder what would be needed to even accept that programs were comparable?
https://benchmarksgame-team.pages.debian.net/benchmarksgame/...
(Incidentally, at-present the measurements summarized on the box plots are C# jit not C# naot.)
It would be interesting to see the box plot updated to include the naot results — I had assumed that it was already.
I had asked the other person for additional benchmarks that supported their cause. They refused to point at a single shred of evidence. I agree the Benchmarks Game isn’t definitive. But it is substantially more useful than people making completely unsupported claims.
I find most discussions of programming language performance to be pointless, but some discussions are even more pointless than others.
> No…? https://learn.microsoft.com/en-us/dotnet/csharp/language-ref...
> A ref struct can hold a lot more than that. What's unique about a ref struct is that the compiler guarantees it will not leave the stack, ever. A ref struct can contain all sorts of different stack-allocatable values, not just references.
Do you realize this is not a mutually exclusive statement? Ref structs are just structs which can hold byref pointers aka managed references. This means that, yes, because managed references can only ever be placed on the stack (but not the memory they point to), a similar restriction is placed on ref structs alongside the Rust-like lifetime analysis to enforce memory safety. Beyond this, their semantics are identical to regular structs.
I.e.
> C# ref structs don't seem to help here, since you can't have a dynamically sized ref struct, AFAIK
Your previous reply indicates you did not know the details until reading the documentation just now. This is highly commendable, because reading documentation as a skill seems to be in short supply nowadays. However, it misses the point that memory (including dynamic memory, whatever you mean by that; I presume reallocations?) can originate from anywhere - stackalloc buffers, malloc, inline arrays, regular arrays, or virtually any source of memory - which can be wrapped into Span<T>s or addressed with unsafe byref arithmetic (or pinning and raw pointers).
Ref structs help with this a lot and enable many data structures which reference arbitrary memory in a generalized way (think writing a tokenizer that wraps a span of chars, much like you would do in C but retaining GC compatibility without the overhead of carrying the full string like in Go).
You can also trivially author, e.g., a Vec<T>[0] fully identical to Rust's, with any memory source, even on top of Jemalloc or Mimalloc (which has an excellent pure C# reimplementation[1] fully competitive with the original implementation in C).
None of this is even remotely possible in any other GC-based language.
[0]: https://github.com/neon-sunset/project-anvil/blob/master/Sou... (pluggable allocators ala Zig, generics are fully monomorphized here, performance is about on par with Rust)
[1]: https://github.com/terrafx/terrafx.interop.mimalloc (disclaimer: outdated description, it used to be just a bindings library, just open the src folder to verify it's no longer the case)
> Do you realize this is not a mutually exclusive statement?
It doesn’t have to be mutually exclusive. You didn’t seem to understand why people care about ref structs, since you chose to focus on something that is an incidental property, not the reason that ref structs exist.
I just want to add that C# is getting pretty fast, and it's not just because people have had a long time to submit better implementations to a benchmark site.
The language began laying the groundwork for AOT and higher performance in general with the introduction of Span<T> and friends 7 or so years ago. Since then, they have been making strides on a host of fronts to allow programmers the freedom to express most patterns expected of a low level language, including arbitrary int pointers, pointer arithmetic, typed memory regions, and an unsafe subset.
In my day-to-day experience, C# is not as fast as the "big 3" non-GCed languages (C/C++/Rust), especially in traditional application code that might use LINQ, code generation or reflection (which are AOT-unfriendly features; i.e., AOT LINQ is interpreted at runtime), but since I don't tend to rewrite the same code across multiple languages simultaneously, I can't quantify the extent of the current speed differences.
I can say, however, that C# has been moving forward every release, and those benchmarks demonstrate that it is separating itself from the Java/Go tier (and I consider Go to be a notch or two above JITed Java, but no personal experience with GraalVM AOT yet) and it definitely feels close to the C/C++/Rust tier.
It may not ever attain even partial parity on that front, for a whole host of reasons (its reliance on its own compiler infrastructure and not a gcc or llvm based backend is a big one for me), but the language itself has slowly implemented the necessary constructs for safe (and unsafe) arbitrary memory manipulation, including explicit stack & heap allocation, and the skipping of GC, which are sort of the fundamental "costs of admission" for consideration as a high performance systems language.
I don't expect anyone to like or prefer C#, nor do I advocate forcing the language on anyone, and I really hate being such a staunch advocate here on HN (I want to avoid broken record syndrome), but as I have stated many times here, I am a big proponent of programmer ergonomics, and C# really seems to be firing on all cylinders right now (recent core library CVEs notwithstanding).
I actually agree completely: https://news.ycombinator.com/item?id=42898372
I just don’t like seeing people make bold claims without supporting evidence… those tend to feel self-aggrandizing and/or like tribalism. It also felt like a bad faith argument, so I stopped responding to that other person when there was nothing positive I could say. If the evidence existed, then they should have provided evidence. I asked for evidence.
I like C#, just as I like Go and Rust. But languages are tools, and I try to evaluate tools objectively.
> I can say, however, that C# has been moving forward every release, and those benchmarks demonstrate that it is separating itself from the Java/Go tier
I also agree. I have been following the development of C# for a very long time. I like what I have seen, especially since .NET Core. As I have mentioned in this thread already, C#’s performance is impressive. I just don’t accept a general claim that it’s as fast as Rust at this point, but not every application needs that much performance. I wish I could get some real world experience with C#, I just haven’t found any interesting tech jobs that use C#… and the job market right now doesn’t seem great, unfortunately.
Sometimes it's hard to shake the rep you have when you're now a ~25 year old language.
Awareness takes time. People need to be told, then they need to tinker here and there. Either they like what they see when they kick the tires or they don't.
I'm pretty language agnostic tbh, but I would like to see it become a bit more fashionable given its modernization efforts, cross-platform support, and MIT license.
Thank you for the detailed clarification.
The GC advances that came along helped a lot, especially over the stop-the-world early days of Java GC, along with all the JVM tuning parameters that were gradually exposed for tweaking. But visualgc allowed for a real feel for how the generational GC was running just by watching the saw-tooth shaped graphs. Interestingly, most of the garbage was string copying, especially with one company's system that had more abstractions than a Spring Oak has leaves, say no more lest I have nightmares.
In SpiderMonkey, we had to add pretenuring in order to avoid slowing down splay too much. As in, identify specific allocation sites that create long-lived allocations, and allocate them directly from the older generation instead of having them go through the nursery. It's sort of selectively disabling generational GC on an allocation site granularity. (Also, make sure you're not storing nursery strings inside of tenured objects. While it's possible they'll be quickly overwritten by different strings, it's much more likely that they're going to last as long as the object does.)
Within Octane, the RegExp subtest had the biggest gain from allocating strings in the nursery. (But that's going from generational objects -> generational objects and strings. Non-generational objects -> generational objects might show up more on something else.)
And yes, it assumes that lifetime is correlated pretty well with allocating callsite. Also that the correlation persists over time. The latter part often doesn't hold, so it's helpful to have a mechanism for changing the decision if you're finding a lot of dead stuff in the tenured heap that came from a pretenured callsite. But it's a delicate tradeoff: tracking the allocation sites costs time and memory, and so you might only want to do that for nursery allocations, but then you don't have a way to associate dead tenured objects with their allocation sites in case you need to stop pretenuring... lots of possibilities in the design space.
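For readers who want to poke at the idea, here is a toy sketch of that bookkeeping in Go (nothing to do with SpiderMonkey's actual code; the site names, thresholds, and reset policy are all made up): count how many nursery allocations from each site survive a minor GC, and flip the site to tenured allocation once the survival rate is high enough.

    package main

    import "fmt"

    type siteStats struct {
        allocated uint64 // nursery allocations observed from this site
        survived  uint64 // how many of those were still live at the next minor GC
        pretenure bool   // current decision for this site
    }

    type pretenureTracker struct {
        sites     map[string]*siteStats
        minSample uint64  // don't decide on tiny samples
        threshold float64 // survival ratio above which we pretenure
    }

    func newTracker() *pretenureTracker {
        return &pretenureTracker{sites: map[string]*siteStats{}, minSample: 1000, threshold: 0.8}
    }

    func (t *pretenureTracker) stats(site string) *siteStats {
        s := t.sites[site]
        if s == nil {
            s = &siteStats{}
            t.sites[site] = s
        }
        return s
    }

    func (t *pretenureTracker) recordAlloc(site string)    { t.stats(site).allocated++ }
    func (t *pretenureTracker) recordSurvivor(site string) { t.stats(site).survived++ }

    // afterMinorGC revisits every site and updates its decision. A real
    // collector would also need a way to undo pretenuring when tenured
    // objects from a site turn out to die quickly (the hard part noted above).
    func (t *pretenureTracker) afterMinorGC() {
        for name, s := range t.sites {
            if s.allocated >= t.minSample {
                rate := float64(s.survived) / float64(s.allocated)
                s.pretenure = rate >= t.threshold
                fmt.Printf("site %s: %.0f%% survive, pretenure=%v\n", name, 100*rate, s.pretenure)
            }
            s.allocated, s.survived = 0, 0
        }
    }

    func main() {
        t := newTracker()
        for i := 0; i < 2000; i++ {
            t.recordAlloc("parser.go:42")
            if i%10 != 0 { // pretend 90% of these survive the next minor GC
                t.recordSurvivor("parser.go:42")
            }
        }
        t.afterMinorGC() // site parser.go:42: 90% survive, pretenure=true
    }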
Just mentioning this in case someone wants to read up on a different GC and play around with benchmarks. There is a fair amount of inner workings info and real-world results to dive into.
Until he adds a proper GC to whippet I don't trust anything he says about GCs. He even has the luxury of a precise GC, because Scheme carries the types along with its values.
And he doesn't know about colored pointers, using 2-5 bits for the GC state, nor NaN-tagging. Probably not about forwarding pointers either.
He has posted about conservative GC (contrary to your implication he has only implemented precise GC).
Maybe "forwarding pointers" is overloaded but given he has a post about a semispace collectors which uses something he refers to as forwarding pointers I don't see how he can be said not to know about it.
He also has posts referring to NaN-tagging and pointer tagging unless my memory betrays me.
https://wingolog.org/tags/garbage%20collection
I initially thought about using whippet, but I will stay clear and rather use MPS https://memory-pool-system.readthedocs.io/en/latest/
I didn't attend the talk and don't know him at all so wouldn't want to speak for him but from personal experience one can find oneself saying some bizarre things under stress/pressure when giving a presentation in front of lots of people. And then one looks back and goes: "why did I say that??"
The JVM has a so-called Thread-Local Allocation Buffer (TLAB), which is basically a simple buffer with a pointer to the next free slot. It can simply be bumped up on each new object's allocation; later on, the still-live objects are evacuated to another generation and the whole thing can be cleared. This is faster than any malloc implementation C++/Rust/whatever might use, unless they are also doing arena allocations.
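A bump-allocator sketch, in Go purely to illustrate why TLAB-style allocation is cheap (the JVM's real TLAB machinery is of course more involved): reserving space is a bounds check plus a pointer increment, and discarding everything is resetting one index.

    package main

    import "fmt"

    type bumpArena struct {
        buf  []byte
        next int // index of the next free byte
    }

    func newBumpArena(size int) *bumpArena { return &bumpArena{buf: make([]byte, size)} }

    // alloc hands out n bytes, or nil when the arena is exhausted
    // (a real runtime would refill from the shared heap at that point).
    func (a *bumpArena) alloc(n int) []byte {
        if a.next+n > len(a.buf) {
            return nil
        }
        p := a.buf[a.next : a.next+n]
        a.next += n
        return p
    }

    // reset "frees" everything allocated so far in O(1).
    func (a *bumpArena) reset() { a.next = 0 }

    func main() {
        arena := newBumpArena(1 << 16)
        obj := arena.alloc(64)
        fmt.Println(len(obj), arena.next) // 64 64
        arena.reset()
        fmt.Println(arena.next) // 0
    }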
Haskell has a few tricks up its sleeve that can help with its kind of allocation patterns, but speaking specifically about Haskell, it's mostly the laziness that makes it quite different.
(Besides, Java has on-stack replacement, which can do the exact same thing as Rust does, but the JIT-compiler Gods have to be in the correct alignment for that.)
1. The size.
2. I don’t see that happening and I look at the output of the C2 compiler when I’m dealing with hotspots.
I'm aware of the cool shit the JVM can do, and that, done right, areas of code can be as fast as C or Rust. But in general, my experience is that Rust is much faster. Sometimes 10x faster where it matters to me (sustained server load).
I don't think that is a reasonable stat. What do you mean by faster? I'd be extremely surprised if this is representative of any real-world examples where someone has taken moderate care to write good Java code.
So however wide the gap was between stack and heap, it's even wider once you add concurrency.
Also, Minecraft was famously written inefficiently.
So unless those pools are full of primitive data types, you're going to cause more frequent full GC pauses, and longer ones due to all the back pointers.
If you have enough objects pooled to get low allocation rate then you never trigger full GC. That's what "low latency Java" aims for.
I think that the main downside of object pools in Java is that it's very easy to accidentally forget to return an object to a pool or return it twice. It turns out that it's hard to keep track of what "owns" the object and should return it if the code is not a simple try{}finally{}. This was a significant source of bugs at my previous job (algo trading).
However, I thought the people that really were serious about performance would allocate the pool in memory the GC doesn’t control (back in the day using internal APIs, but I think there are new official APIs for this) such that the object pool and every object in it would just be ignored by the GC.
This is of course a bit hand-wavy, I haven’t had the time to truly try and investigate this in a more rigorous way.
Further pushing this into a "continuum" rather than a binary flag is that programs can and do freely mix and match. Nothing stops a program from using two mallocs, some arenas, GC for certain values, and static allocations for yet others. The Memory Management Police do not come along and arrest you for having Impure Designs or anything. There are languages like (I believe) Zig where the ability to cleanly do this is more-or-less a first-class language feature.
> So, for this test with eight threads, on my 8-core Ryzen 7 7840U laptop, the nursery is 16MB including the copy reserve, which happens to be the same size as the L3 on this CPU.
Sounds way too small. The speed of a copying collection is proportional to the number of survivors, and with only 16 MB you risk having lots of false survivors.
I suspect it's a huge win in most modern manage application settings, because application images are huge with a lot of cruft in them. If your algorithm is churning away allocating lots of temporary objects, and it's sitting in the middle of a behemoth image, generational GC is a big win.
In dynamic languages, the functions themselves are objects that can be garbage collected, and the function bindings have to be traversed in a full GC. Yet functions rarely change; they quickly recede into the mature generation.
You can investigate generational vs. simple mark-and-sweep with my TXR Lisp. At compile time you can configure it to disable generational GC. This is not regularly tested, but I did check it for regressions fairly recently.
It is not a copying collector; objects stay where they are. The nursery is just a global array of pointers. There are two other arrays that help with old-pointing-to-young mutations. Resetting these arrays costs almost nothing: we just set their fill indices to zero. You can easily tune their sizes, and there's a small-memory configuration which changes several defaults at the same time.
So, for example, network code gets better p99.
This, along with other sensible tradeoffs, is why Go ate a large slice of network software market.
At least a couple of years ago Go had a very simplistic GC, but even today it is absolutely nowhere close to how good Java's GCs are (there is ZGC, for example, whose pause times are completely decoupled from heap size, so it can actually keep sub-millisecond pause times - the OS causes bigger pauses than that).
At most Go puts slightly less work on its GC due to having value types.
Sure, it's a "microbenchmark", but it might be eye-opening how big the difference is: https://benchmarksgame-team.pages.debian.net/benchmarksgame/...
If you use GOGC and GOMEMLIMIT to even the playing field (and note, use a Go program that isn't using sync.Pool), the difference in wall time is far less stark (though it's still there, maybe 5-15%; don't quote me, it's been a long time since I measured this and I don't remember exactly). (The difference in total CPU time is bigger.) The two knobs are sketched after the links below.
And finally, keep in mind this benchmark is hammering as hard as it can on the GC. How it impacts real applications depends on how much the application relies on the heap.
[1] https://go.dev/doc/gc-guide#Understanding_costs
[2] https://benchmarksgame-team.pages.debian.net/benchmarksgame/...
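A minimal sketch of those two knobs, set from inside the program rather than through the GOGC/GOMEMLIMIT environment variables (the specific values are arbitrary):

    package main

    import "runtime/debug"

    func main() {
        // Equivalent to GOGC=200: allow twice as much new allocation between
        // collections as the default, trading memory for fewer GC cycles.
        debug.SetGCPercent(200)

        // Equivalent to GOMEMLIMIT=4GiB: a soft cap on the runtime's total
        // memory; the collector works harder as usage approaches the limit.
        debug.SetMemoryLimit(4 << 30)

        // ... the rest of the program runs under these settings ...
    }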
? Go binary-trees programs shown together with Java programs here:
https://benchmarksgame-team.pages.debian.net/benchmarksgame/...
https://benchmarksgame-team.pages.debian.net/benchmarksgame/...
(But stuff on those pages will be moving around for a few days.)
On the page you referenced:
— secs is elapsed seconds aka wall time aka wall clock
— mem is memory use
— gz is source code size
— cpu secs is cpu seconds
secs, mem, cpu secs as reported by BenchExec
Did it ever used to only show wall time? Or am I just completely misremembering?
Don't forget that the JVM has to allocate memory for all its subsystems as well, like the JIT compiler, so that 3x memory is not entirely heap usable by the program.
And I deliberately linked this benchmark, as the topic at hand is the GC itself.
> Don't forget that the JVM has to allocate memory for all its subsystems as well, like the JIT compiler, so that 3x memory is not entirely heap usable by the program.
That's fair. I recall doing my due diligence here and confirming it is actually using mostly heap memory, but again it's been a while and I could be wrong. (Also if the actual heap size is only ~100s of MiB and the rest of the subsystems need north of a GiB, that's much more than I would have anticipated.)
> And I deliberately linked this benchmark, as the topic at hand is the GC itself.
Sure. Not trying to suggest the benchmark doesn't have any utility, just that even for just GC performance, it doesn't paint a complete picture, especially if you're only looking at wall time.
No, I think you are right - I'm not trying to claim that most of that 1.7 GB is used by the core JVM, just that that's another factor (besides simply the different base heuristics on how much space to claim).
The only fair thing would probably be to re-run the benchmark multiple times with different available RAM (via cgroup) and see the graph. Though I'm fairly confident that Java would beat Go in this particular benchmark at any heap size.
(* in all the number crunching ones we have every other GC language, including Java and Go, confidently beat <3 )