Why would normalization change http:// to https:// ?
> For example, given the base string http://example.org/foo/bar, the relative string http://example.com/ leads to the final URL http://example.org/example.com/.
That’s just… no. I do not believe I have ever encountered any software which would parse it in that way, and I refuse to believe such software ever existed. It would be <http://example.com/>.
But the PDF matches the HTML. I dunno, something weird is going on. Look at the hyperlinks there, too, “http://xn--ivg but not the rest of the URL that follows, and how the -- has been changed to –. Something went wrong somewhere in the editing or publication.
silvestrov 27 days ago [-]
My guess is that the html formatter changed the text "example.com" into "http://example.com" to make it a valid absolute URL.
chrismorgan 26 days ago [-]
Anything that turns </example.com/> into <http://example.com/> should be shot.
I dislike automatic linkifiers, especially in technical contexts, because they get things wrong so often, as regards what is a link at all (and certainly never linkify if there’s no protocol! “example.com/foo” should not be turned into <http://example.com/foo>), and as regards what can be part of the link (largely around trailing punctuation). Just require explicit delimitation, like <…>, or else it’s text.
(Markdown’s […](…) is bad because ) is part of URL code points, meaning parentheses in URLs won’t be percent-encoded by a normal serialiser, so then its parser gets messy trying to compensate, assuming that parentheses will normally be paired in URLs. Your delimiter needs to not be part of the set of URL code points.)
HN’s auto-linkifier is, most of the time, one of the better ones (it was bad ten years ago, but got fixed around punctuation inclusion a few years ago), but it still has problems. I noticed too late that it mangled something in my comment: where you get http://xn--ivg, that xn--ivg is ”, because what I actually wrote was
… too, “http://” but not …
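To make the trailing-punctuation and parenthesis guesswork concrete, here is a minimal sketch of a naive linkifier in C++ (hypothetical code, not HN's or anyone else's actual linkifier): it grabs everything up to whitespace, then tries to compensate by trimming trailing punctuation, and trims a ')' only when the parentheses inside the match look unpaired.

  #include <algorithm>
  #include <iostream>
  #include <regex>
  #include <string>

  int main() {
      std::string text = "See https://en.wikipedia.org/wiki/URL_(disambiguation), "
                         "or (https://example.com/foo).";
      std::regex url_re(R"(https?://\S+)");
      for (auto it = std::sregex_iterator(text.begin(), text.end(), url_re);
           it != std::sregex_iterator(); ++it) {
          std::string url = it->str();
          // Guesswork begins here: trim trailing punctuation, and trim a ')'
          // only if the match contains more ')' than '(' characters.
          while (!url.empty()) {
              char c = url.back();
              if (c == ',' || c == '.' || c == ';') { url.pop_back(); continue; }
              if (c == ')' && std::count(url.begin(), url.end(), '(') <
                              std::count(url.begin(), url.end(), ')')) {
                  url.pop_back();
                  continue;
              }
              break;
          }
          std::cout << url << '\n';
      }
  }

It happens to do the right thing for both inputs here, but only because it assumes parentheses come in pairs, which is exactly the guesswork being objected to.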
ramon156 27 days ago [-]
Because it's 2024
marginalia_nu 27 days ago [-]
http:// is not a typo for https://. There's still a fairly large amount of web servers that do not talk https, and you simply cannot assume that they do. That will leave you with a lot of dead links. Besides, most that accept both will auto-renegotiate to https.
TacticalCoder 27 days ago [-]
> There's still a fairly large amount of web servers that do not talk https, and you simply cannot assume that they do.
OTOH I've been browsing for years forcing HTTPS only and life goes on fine. If the absolute worst comes to worst, I can use archive.is or archive.org, but it's very rare that I need that.
Basically: if a link is HTTP to me it's not worth opening.
The one exception would be Debian packages URLs: but these are signed and the signatures are verified.
User _apt is the only one allowed to emit HTTP traffic.
This prevents my ISP or anyone else injecting nasty stuff.
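One way to enforce that kind of policy is iptables' owner match; this is a guess at the general shape of such a rule, not the commenter's actual configuration:

  # Allow outgoing plain HTTP only for the _apt user; reject it for everyone else.
  iptables -A OUTPUT -p tcp --dport 80 -m owner --uid-owner _apt -j ACCEPT
  iptables -A OUTPUT -p tcp --dport 80 -j REJECT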
forgotmypw17 27 days ago [-]
Just because it is accessible to you does not mean it is accessible to everyone else. HTTPS has many failure modes which make it unreliable for essential access, such as time mismatches, certificate expirations, ssl version mismatches, etc. Security and privacy are important, and they are also not absolute. Sometimes the risk is outweighed by the importance of being able to access essential resources and reading material.
Analemma_ 27 days ago [-]
User preferences should not be encoded into parser behavior, that’s nuts. You wouldn’t just arbitrarily change an ftp:// link to an imap:// link, so why would you accept it here? That exists at a whole other layer of the stack.
dmd 27 days ago [-]
They would arbitrarily change an ftp:// link to an sftp:// link and then complain that it didn't work.
Cicero22 27 days ago [-]
This sort of work is something I wouldn't be able to do, but I can't help but point out at least one potential issue with the paper. It's a lot easier to find problems than solutions I guess.
Are the benchmarks comparing node versions valid to conclude a real world performance increase?
One possible confounder is the version of V8.
https://github.com/nodejs/node/blob/v18.x/deps/v8/include/v8... https://github.com/nodejs/node/blob/v20.x/deps/v8/include/v8...
Ideally, they would've patched Node 18.15 with their changes directly and tested their patch against 18.15.
https://daniel.haxx.se/blog/2023/11/21/url-parser-performanc...
https://arxiv.org/abs/2311.10533
I had a lot of fun writing low latency parsers for various message standards in C++. There are a lot of fun things you can do when you take ownership of the read buffer and figure out how to parse in-situ (modifying the data in place as you move along).
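As a small illustration of the in-situ idea (a toy example, not from the paper or the commenter's code): the parser owns a mutable buffer and carves it into fields by overwriting delimiters with NUL bytes, so no field is ever copied.

  #include <cstdio>
  #include <cstring>

  struct Pair { const char* key; const char* value; };

  // Split "a=1&b=2&c=3" in place by replacing '&' and '=' with NULs.
  static int parse_query_in_place(char* buf, Pair* out, int max) {
      int n = 0;
      for (char* p = buf; p && *p && n < max;) {
          char* amp = std::strchr(p, '&');
          if (amp) *amp = '\0';   // terminate this pair inside the buffer
          char* eq = std::strchr(p, '=');
          if (eq) *eq = '\0';     // split key from value inside the buffer
          out[n].key = p;
          out[n].value = eq ? eq + 1 : "";
          ++n;
          p = amp ? amp + 1 : nullptr;
      }
      return n;
  }

  int main() {
      char buf[] = "user=alice&lang=en&debug=1";  // we own this read buffer
      Pair pairs[8];
      int n = parse_query_in_place(buf, pairs, 8);
      for (int i = 0; i < n; ++i)
          std::printf("%s -> %s\n", pairs[i].key, pairs[i].value);
  }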
youngtaff 27 days ago [-]
Lemire’s blog is well worth a read if you’re interested in this sort of thing https://lemire.me/blog/
notamy 27 days ago [-]
The title seems to have a few words missing. Original title:
> Parsing millions of URLs per second
HL33tibCe7 27 days ago [-]
HN’s stupid/arrogant automatic title rewriter strikes again
kristianp 27 days ago [-]
I've never noticed a title being rewritten automatically when posting an article. Are you sure that's really a thing?
pests 27 days ago [-]
There are some auto rewrite rules. Off the top of my head: numbers at the beginning are stripped, [pdf] or [video] can be added to the end, and one more I can't remember that gets stripped off the beginning and can cause confusion.
A pdf link to "5 Reasons To Do Things" will be "Reasons To Do Things [pdf]" for example.
Tomte 27 days ago [-]
„How“ at the beginning is stripped, leading to all these strange sounding „I <verb>“ submissions.
TRiG_Ireland 18 days ago [-]
There was also an interesting article on assistive technology called "How Disabled People Use the Web", or something similar, which looked very silly with the "How" stripped.
Tomte 27 days ago [-]
Yes. And the algorithm is really incredibly stupid, but dang is opposed to even small improvements (like showing the changed title on submission beforehand, like the „x characters too long“ message).
PaulHoule 27 days ago [-]
Fixed
ignoramous 27 days ago [-]
So, surpassing 80k karma, one gets title edit rights?
PaulHoule 27 days ago [-]
I think anybody can edit a title within a short time of posting something. Or if there is a karma threshold it is way less than 80k.
I caught that one manually but YOShInOn's tail end needs some love and could be updated so that it fixes up titles that get mashed automatically or adds a comment sometimes to editorialize or provide an archive link.
trung123f 26 days ago [-]
Do you have a similar repo? Thanks.
TZubiri 27 days ago [-]
[flagged]
wiseowise 27 days ago [-]
Maybe you should’ve spent 2 minutes reading the article instead of arrogantly dismissing it with layman knowledge.
fabrice_d 27 days ago [-]
The article explains optimizations to spend fewer cycles parsing URLs than other libraries. Very interesting work; there's no reason not to do things efficiently when it's possible.
Also, good luck using regex to write a RFC or WHATWG conformant URL parser.
TZubiri 27 days ago [-]
2 minutes reading an RFC about URIs and I find a regex literally used in the specs:
https://www.rfc-editor.org/rfc/rfc3986#page-7
>> The following line is the regular expression for breaking-down a well-formed URI reference into its components.
>> ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
Did Chomsky die for nothing?
I guess there are reasons to do things efficiently when possible, but a million URLs is not it; the adage about the root of all evil comes to mind. A billion URLs per second and it's almost interesting, but not really.
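As a quick illustration (a sketch using std::regex, not anything from the thread), the Appendix B expression quoted above really does split a reference into components in a few lines; but it only splits, with no validation, percent-decoding, host normalization or relative-reference resolution, which is where most of a WHATWG parser's work goes.

  #include <iostream>
  #include <regex>
  #include <string>

  int main() {
      // RFC 3986, Appendix B: group 2 = scheme, 4 = authority, 5 = path,
      // 7 = query, 9 = fragment.
      std::regex uri_re(R"(^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?)");
      std::smatch m;
      std::string s = "http://example.org/foo/bar?x=1#frag";
      if (std::regex_match(s, m, uri_re))
          std::cout << "scheme=" << m[2] << " authority=" << m[4]
                    << " path=" << m[5] << " query=" << m[7]
                    << " fragment=" << m[9] << '\n';
  }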
yagiznizipli 27 days ago [-]
RFC 3986 is a lot simpler than the WHATWG spec. You can literally write a zero copy 3986 parser whereas you can’t with WHATWG. (And Ada is still faster than 3986 parsers)
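A rough sketch of what "zero copy" can mean here (illustrative only, not Ada's implementation, and it skips many RFC details): every component is a std::string_view into the caller's input, so parsing allocates and copies nothing. A WHATWG parser can't work this way because it must emit normalized output (lower-cased host, IDNA, percent-encoding fix-ups) into new buffers.

  #include <iostream>
  #include <string_view>

  struct UriView {
      std::string_view scheme, authority, path, query, fragment;
  };

  // Split a URI reference into views over the original input; nothing is copied.
  static UriView split_uri(std::string_view in) {
      UriView u;
      if (auto h = in.find('#'); h != in.npos) { u.fragment = in.substr(h + 1); in = in.substr(0, h); }
      if (auto q = in.find('?'); q != in.npos) { u.query = in.substr(q + 1); in = in.substr(0, q); }
      if (auto c = in.find(':'); c != in.npos && c < in.find('/')) { u.scheme = in.substr(0, c); in = in.substr(c + 1); }
      if (in.substr(0, 2) == "//") {
          in = in.substr(2);
          auto slash = in.find('/');
          u.authority = in.substr(0, slash);
          in = (slash == in.npos) ? std::string_view{} : in.substr(slash);
      }
      u.path = in;
      return u;
  }

  int main() {
      auto u = split_uri("http://example.org/foo/bar?x=1#frag");
      std::cout << u.scheme << " | " << u.authority << " | " << u.path << " | "
                << u.query << " | " << u.fragment << '\n';
  }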
switchbak 27 days ago [-]
Last I heard Noam Chomsky was still alive, and a quick Google doesn’t contradict that. Or is this some kind of high brow joke that went over my head?
Damn. I got fake newsd. It was probably a close call.
My commiseration email must have looked so silly.
And depending on the length of your URL and the output format, 4,000 cycles will not be trivial.
Today, Chrome supports URLs of up to 32,768 characters. Good luck processing that in 4,000 cycles! It'd require SIMD or some other fanciness.
But per core, right?
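For context on where these figures come from (assuming a roughly 4 GHz core, which is an assumption here, not a number from the paper):

  4,000,000,000 cycles/s ÷ 1,000,000 URLs/s ≈ 4,000 cycles per URL
  a 32,768-character URL, at even one character per cycle, costs ~32,768 cycles on its own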