As someone who’s gone down the rust “native pointers vs pin vs …” rabbit hole many times now, I really recommend just using a Vec for the data and storing indexes into the vec when you need a pointer.
Pin adds a huge amount of weird incidental complexity to your code base - since you need to pin-project your struct fields (but which ones?). You can’t just take an &self or &mut self in functions if your value is pinned, and pin is just generally confusing, hard to use and hard to reason about.
The article ended up with Vec<Box<T>> - but that’s a huge code smell in my book. It’s much less performant than Vec<T> because every object needs to be individually allocated & deallocated. So you have orders of magnitude more calls to malloc & free, more memory fragmentation and way more cache misses while accessing your data. The impact this has on performance is insane.
Vec & indexes is a lovely middle ground. In my experience it’s often (remarkably) slightly more performant than using raw pointers. You don’t have to worry about vec reallocations (since the indexes don’t change). And it’s 100% safe rust. It feels weird at first - indexes are just pointers with more steps. But I find rust’s language affordances just work better if you write your code like that. Code is simple, safe, ergonomic and obvious.
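To make the suggestion concrete, here's a minimal sketch of the pattern (names are illustrative, not from the article): all nodes live in one flat `Vec`, and child links are `usize` indices into it rather than references or pointers.

```rust
// A tree stored in a flat arena. Children are indices, not pointers,
// so the Vec may reallocate and move freely without invalidating links.
struct Node {
    value: u32,
    children: Vec<usize>,
}

struct Tree {
    nodes: Vec<Node>,
}

impl Tree {
    fn new() -> Self {
        Tree { nodes: Vec::new() }
    }

    // Push a node and hand back its index; the index stays valid
    // even when `nodes` grows and its storage moves in memory.
    fn add(&mut self, value: u32, parent: Option<usize>) -> usize {
        let idx = self.nodes.len();
        self.nodes.push(Node { value, children: Vec::new() });
        if let Some(p) = parent {
            self.nodes[p].children.push(idx);
        }
        idx
    }

    // Recursively sum a subtree by following indices.
    fn sum(&self, idx: usize) -> u32 {
        let node = &self.nodes[idx];
        node.value + node.children.iter().map(|&c| self.sum(c)).sum::<u32>()
    }
}
```

No `unsafe`, no `Pin`, and one allocation pool for every node instead of one heap allocation per node.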
wging 25 days ago [-]
> Code is simple, safe, ergonomic and obvious.
Dunno about 'safe' -- or at least not in the more general sense that you seem to intend, rather than the more limited sense of rust's safe/unsafe distinction. If you store an index into a Vec<T> as a usize, rather than a &T, very little is stopping you from invalidating that pseudo-pointer without knowing it. (Or from using it as an index into the wrong vector, etc...)
These problems are manageable and I'm not saying 'never do this' -- I've done it myself on occasion. It's just that there are more pitfalls than you're indicating here, and it is actually a meaningful tradeoff of bug potential for ease-of-use.
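One common way to shrink those pitfalls (a sketch, not something from the thread): wrap the index in a newtype, so an index minted by one arena can't silently be used on a different vector.

```rust
// A typed index: a NodeId can only come from, and be used with,
// a NodeArena, turning "wrong vector" bugs into compile errors.
#[derive(Clone, Copy, PartialEq, Debug)]
struct NodeId(usize);

struct NodeArena {
    values: Vec<String>,
}

impl NodeArena {
    fn new() -> Self {
        NodeArena { values: Vec::new() }
    }

    fn insert(&mut self, v: String) -> NodeId {
        self.values.push(v);
        NodeId(self.values.len() - 1)
    }

    // Lookup takes a NodeId; passing a bare usize won't compile.
    fn get(&self, id: NodeId) -> &str {
        &self.values[id.0]
    }
}
```

Stale indices after removal are still possible with this alone; crates like slotmap add generation counters to catch that case too.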
josephg 25 days ago [-]
I mean safe in the narrow way that rust intends. It’s memory safe, but as you imply, we’re leaving the door open to logic bugs if we misuse those array indices.
But honestly, I think danger from that is wildly overstated. The author isn’t talking about implementing an ECS or b-tree here. They’re just populating an array from a file when the program launches, then freeing the whole thing when the program terminates. It’s really not rocket science.
The other big advantage of this approach is that you don’t have to deal with unsafe rust. So, no unsafe {} blocks. No wrangling with rust’s frankly awful syntax for following raw pointers. No stressing about whether or not a future version of rust will change some subtle invariant you’re accidentally depending on, or worrying about whether you need to use MaybeUninit or something like that. I think the chance of making a mistake while interacting with unsafe code is far higher than the chance of misusing an array index. And the impact is usually worse.
The author details running into exactly that problem while coding - since they assumed memory allocated by vec would be pinned (it isn’t). And the program they ended up with still doesn’t use pin, even though they depend on the memory being pinned. That’s cause for far more concern than a simple array index.
laladrik 23 days ago [-]
> The author isn’t talking about implementing an ECS or b-tree here.
Do you mean that a b-tree might work better here?
> They’re just populating an array from a file when the program launches, then freeing the whole thing when the program terminates. It’s really not rocket science.
That's exactly why I consider indices.
> since they assumed memory allocated by vec would be pinned (it isn’t)
Could you tell me, please, where you read in the article that I assume it? I wrote in the article "I realized that the problem is related to the fact that vectors of children move in the memory if they don't have enough space to extend." and even made an animation for clarity https://laladrik.xyz/VectorMove.webm. However, if you see the assumption in the article, please let me know. I'll correct it or elaborate.
josephg 23 days ago [-]
Yes, in your article you consider indexes then ultimately decide not to use them in favor of Vec<Box<T>> & pointers. I recommend that you use indexes instead. I think they’re the better choice.
> Could you tell me, please, where you read in the article that I assume it?
You assume it in your first attempt at solving this problem. You describe that attempt in detail. That’s what I’m referring to.
The code you ended up with is still dangerous code, because your boxes are still not guaranteed to remain pinned in memory.
laladrik 22 days ago [-]
Ah, clear. I must have missed it. I haven't tried the approach with indices, because... well, I was lazy to do it. However, I agree that this approach would be better than the current one.
> You describe that attempt in detail.
I'd appreciate it if you included a quote, because I fail to find a detailed description of that attempt. In fact, instead of assuming that a vector is pinned, I wrote this: "I realized that the problem is related to the fact that vectors of children move in the memory if they don't have enough space to extend."
> The code you ended up with is still dangerous code, because your boxes are still not guaranteed to remain pinned in memory.
You are right, the boxes are not pinned, but the data they point to is pinned, isn't it? My pointers point to that part of memory.
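For what it's worth, that intuition can be checked directly: the heap allocation behind a `Box` keeps its address even while the `Vec` that owns the `Box` reallocates and moves. (The remaining hazard is that the borrow checker knows nothing about your raw pointers, e.g. if the `Box` is dropped or moved out of the `Vec` while a pointer still refers to its data.)

```rust
// Demonstrates that the heap data behind a Box keeps its address
// even while the Vec that owns the Box reallocates and moves.
fn box_data_stays_put() -> bool {
    let mut v: Vec<Box<u32>> = Vec::with_capacity(1);
    v.push(Box::new(42));
    // Address of the u32 on the heap (not of the Box stored in the Vec).
    let before = &*v[0] as *const u32 as usize;
    for i in 0..1000 {
        v.push(Box::new(i)); // force several reallocations of `v`
    }
    let after = &*v[0] as *const u32 as usize;
    before == after
}
```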
IshKebab 26 days ago [-]
Damn I hate it when you write a whole project and someone comes along and says "this already exists" and you realise how much time you wasted (yeah even if some of it counts towards learning I'd still rather not needlessly repeat other people's work).
Anyway, pprof has a fantastic interactive flamegraph viewer that lets you narrow down to specific functions. It's really very good, I would use that.
https://github.com/google/pprof
Run `pprof -http=:` on a profile and you get a web interface with the flamegraph, call graph, line-based profiling etc. It's demonstrated in this video: https://youtu.be/v6skRrlXsjY
They only show a very simple example and no zooming, but it works very well with huge flamegraphs.
Gobd 25 days ago [-]
> I tried to find something fast and native. Saying "native" I mean something which doesn't require a browser.
Uses a browser which doesn't meet the requirements they set.
josephg 25 days ago [-]
Yep. Personally I love the Firefox profiler for interacting with perf - since it can show you flame graphs and let you explore a perf trace by dominators and whatnot.
But I applaud the effort to make small, native apps. I agree with the author - not everything should live in the browser.
IshKebab 25 days ago [-]
I think they were saying "fast and native" because web things usually aren't fast. In this case it is though, so I don't see why it would be a problem for it to be web based.
laladrik 23 days ago [-]
I said "fast and native" because every browser I tried made it impossible to inspect the flamegraph.
> In this case it is though, so I don't see why it would be a problem for it to be web based.
Are you saying that your browser is able to render the flamegraph https://laladrik.xyz/img/pic.svg and inspect it? As I wrote in the article (the first paragraph), it takes a couple of seconds to render the graph, and nothing happens when I press on a frame. Could you check it, please? If your browser renders it fast enough and allows opening a frame, please let me know how I could improve any of the browsers. I would really appreciate it.
It uses the Firefox profiler to view its recorded profiles. You can (don't have to, just can) even share them, I was looking at this profile just yesterday: https://share.firefox.dev/3PxfriB for my day job, for example.
I have a `profile` function I use. Then I just run that, or w/e, and when it completes (can be a one-time run or long-running / interactive) I have a great native experience exploring the profile.
All the normal bells and whistles are there and I can double click on something and see it inline in the source code with per-line (cumulative) timings.
Sesse__ 25 days ago [-]
> MacOS Instruments is really quite good.
It really isn't. It's probably the slowest profiler UI I've ever used (it loves to beachball…), it hardly exposes any hardware performance counters, and its actual profiling core (xctrace) is… just really buggy. More than once it told me “this function uses 5% CPU”, I optimized the function away, and absolutely nothing happened, because it was just another Instruments mirage. Or the time it told me opening a file on iOS took 1000+ ms, when that was just because its end timestamps were pure fabrications.
Maybe it's better if you have toy examples, but for large applications, it's among the worst profilers I've ever seen along almost every axis. I'll give you that gprof is worse, though…
laladrik 23 days ago [-]
I love pprof. I used it so many times to profile my Go applications. However, as you wrote, the visualization is in a browser, which I found incapable of rendering the flamegraphs I had to work with.
audidude 25 days ago [-]
As someone who went down this path many years ago, I think the GTK numbers in the article are a bit misleading. You wouldn't create 1000 buttons to do a flamegraph properly in GTK.
Sysprof uses a single widget for the flamegraph, which means in less than 150 MB resident I can browse recordings in the GB size range. It really comes down to how much data gets symbolized at load time, as the captures themselves are mmap'able. In nominal cases, Sysprof even calculates the symbols and appends them after the capture phase stops so they can be mmap'd too.
That just leaves the augmented n-ary tree key'd by instruction pointer converted to string key, which naturally deduplicates/compresses.
The biggest chunk of memory consumed is GPU shaders.
adolph 25 days ago [-]
The article linked as “W3C specifications are bigger than POSIX.” is also worth reading.
The total word count of the W3C specification catalogue is 114 million words at the time of writing. If you added the combined word counts of the C11, C++17, UEFI, USB 3.2, and POSIX specifications, all 8,754 published RFCs, and the combined word counts of everything on Wikipedia’s list of longest novels, you would be 12 million words short of the W3C specifications.
https://drewdevault.com/2020/03/18/Reckless-limitless-scope....
Sorry, but that analysis is too sloppy to allow any such comparisons.
If you look at the scraped document list [1]:
* Most of these are not normative! They're not specifications, they're guides, recommendations, terminology explainers, and so on.
* A lot of documents are irrelevant to implementing a web browser (XSLT, XPath, RDF, XHTML, ITS, etc.).
* A lot are obsolete (e.g. SMIL, OWL).
* There are tons of duplicate versions (all of CSS 1-3 are included; multiple versions of HTML, MathML, and of course the irrelevant XML-based standards).
* Many standards are scraped both as individual section files, and as a single complete.html file. He didn't notice this, and counted both.
As a particularly egregious example, he includes every version of the Web Content Accessibility Guidelines (WCAG) standard, going back to 1999, each of which is large.
I have not done any kind of analysis myself (which would need to be thorough to actually be fair), but if you prune it down to the core technologies (HTML5, CSS, ECMAScript, PNG/GIF/WebP, etc.), I'll wager it's probably less than a million words, or at the very least less than 2 million. The ECMAScript spec is just 356,000 words.
[1] https://paste.sr.ht/~sircmpwn/475ad10f9ff9f63cd0a03a3f998370...
Something that’s been on my mind recently is that there’s a need for a high-performance flame graph library for the web. Unfortunately the most popular flame graph libraries / components (basically the React and d3 ones) work fine, but the authors don’t actively maintain them anymore and their performance with large profiles is quite poor.
Most people that care about performance either hard-fork the Firefox profiler / speedscope flame graph component or create their own.
Would be nice to have a reusable, high performance flame graph for web platforms.
tdullien 25 days ago [-]
For the prodfiler.com flamegraph viewer we ended up building it in PixiJS, which allowed us to have nice GPU acceleration and render massive flamegraphs quickly. Skipping blocks of less than half a pixel in width is also a good idea, as is using a monospace font.
Scene_Cast2 25 days ago [-]
I recently went through trying to profile Rust code. I realized that the profiling toolchain is underdeveloped across the board: "perf", the recommended profiler, isn't cross-platform (and I didn't find any profilers that "just work"); visualizing traces from a multi-threaded app is not fun; there isn't an IDE plugin to highlight the problematic lines; etc.
lubsch 25 days ago [-]
A very enjoyable and inspiring read! I wonder if self-rolling a native application similar to this is feasible on Wayland.
tantalor 25 days ago [-]
It's very funny they would call out poor performance of KDAB's Hotspot, a performance analysis app.
laladrik 28 days ago [-]
Hello, I found it difficult to visualize the flamegraph from the huge amount of data I got while profiling Rust Analyzer. Viewing the flamegraph in a browser (Firefox and Chrome) was impossible; in fact, the page simply froze. I made this visualizer to solve my problem. Maybe it will help someone else. I'm leaving the link to my article about it, but you can find the link to the project right in the first paragraph.
atq2119 25 days ago [-]
Props to you for making a cool little project, but as somebody who's been involved in Linux graphics a bit: please just let Xlib die. It's an outdated API, even if you ignore the existence of Wayland. For something like fast visualizations, you should really go with something that does offscreen rendering and then blits the result. As long as you're just drawing a bunch of rectangles, even CPU software rendering may be the better solution, though obviously modern tools should use GPU rendering.
I see your journey and how you ended up with Xlib. But I think that's really more of an indictment of the sorry state of GUI in Rust.
I know that's not your job, I just couldn't let this use of Xlib stand uncommented because it's really bad for the larger ecosystem.
laladrik 23 days ago [-]
> please just let Xlib die
I don't have arguments against the point about Xlib. However, I struggle to use its alternative, XCB. XCB doesn't have enough documentation to understand how to use it. In fact, I even looked at the source code of Qt and GTK, but their usage doesn't explain the XCB API. I'd really appreciate it if you shared any resources you have. The only thing I found recently is the wrapper from System76: https://pop-os.github.io/libcosmic/tiny_xlib/index.html. However, it's still not documentation. I just hope to find some usages of the wrapper and map them to the original API.
> if you ignore the existence of Wayland
How did you conclude that? I even mentioned Wayland in the article. It's true that I don't use it, but I'd really like to. I've been trying for a couple of years now; regrettably, I run into various technical difficulties every time. As a result, I still use my i3.
> For something like fast visualizations, you should really go with something that does offscreen rendering and then blits the result.
Do you mean double buffering?
> though obviously modern tools should use GPU rendering
Would you mind elaborating on it?
atq2119 21 days ago [-]
Re Wayland: Regardless of whether you use Xlib or XCB, your application will use the X protocol, which means it won't run natively on a Wayland desktop but only via Xwayland. Real GUI toolkits nowadays have separate backends for X and Wayland, defaulting to the Wayland backend when run on a Wayland desktop. So while XCB is better than Xlib, you really shouldn't use either.
Re offscreen rendering: This is orthogonal to double buffering but refers to the way modern compositing desktops (including Wayland) work. The application renders the window image into an offscreen surface which is then handed off to the compositor, which blits it to the screen.
Re GPU rendering: You can draw rectangles on a surface on the CPU by simply setting pixel color values in a loop, or you can draw them on the GPU by sending a list of rectangles and have the GPU do the rasterization of the rectangles. Toolkits like Qt or Gtk can do both, depending on the backend that is selected, and will typically default to GPU rendering on modern desktops.
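The CPU path described here really is just a nested loop over pixels. A minimal sketch (the buffer layout and names are assumptions, not from any toolkit):

```rust
// Fill an axis-aligned rectangle in an RGBA framebuffer on the CPU.
// `buf` holds buf_width * height pixels, row-major, 4 bytes per pixel.
fn fill_rect(
    buf: &mut [u8],
    buf_width: usize,
    x: usize,
    y: usize,
    w: usize,
    h: usize,
    rgba: [u8; 4],
) {
    for row in y..y + h {
        for col in x..x + w {
            // Byte offset of this pixel in the flat buffer.
            let off = (row * buf_width + col) * 4;
            buf[off..off + 4].copy_from_slice(&rgba);
        }
    }
}
```

A GPU backend instead uploads the list of rectangles and lets the rasterizer perform this loop in parallel, which is what makes it so much faster at large resolutions.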
laladrik 10 days ago [-]
I fail to see the reference, to be honest. I remember the term from OpenGL, when I rendered something to a framebuffer (actually to its attachments) and then applied it to the current framebuffer. It helped me do effects like night vision. Does using offscreen rendering imply using OpenGL?
Does GPU rendering mean that I have to involve OpenGL/Vulkan?
guipsp 25 days ago [-]
I think going for xlib is somewhat missing the forest for the trees. Does it take less memory? Yeah, but you lose out on any gpu assistance you might get for free otherwise.
This only really matters as you get to bigger resolutions tho, as you avoid redrawing.
janice1999 26 days ago [-]
I was surprised to hear that Hotspot isn't fast. I had assumed it would be, since it's written in C++.
mandarax8 25 days ago [-]
I've never had Hotspot not be fast enough. Even on 20 GB traces, everything is instant.
The only thing that ever takes some time is the initial load of the perf file and filtering (but still really fast).
laladrik 23 days ago [-]
Unfortunately, I can't find the original perf file that the flamegraph https://laladrik.xyz/img/pic.svg was generated from. However, I can create one from a similar bunch of data and provide it soon.
laladrik 23 days ago [-]
Ok, now I remember what the deal with Hotspot was. It does make it possible to work with my flamegraph. However, it takes almost half a minute to load my perf.data. That said, I totally recommend Hotspot over my hack when you need a comprehensive view of the data. In particular, I love seeing the off-CPU load, which my FlameGraphViewer doesn't show.
bombela 25 days ago [-]
This article reads like it was AI padded generously.
creatonez 25 days ago [-]
Except it very obviously was not. Is this accusation going to come up in every single HN thread?