* All white space in HTML and XML is preserved verbatim.
* HTML has a default presentation scheme that varies by interpreter. For everything else use CSS.
* The default presentation of white space in HTML and XML is what is called tokenized space, which is that all consecutive white space characters are displayed as a single space character. Again, you can control this with CSS.
* White space does not determine the behavior or display of other HTML tags.
* White space is a text node in the DOM. If it is adjacent to other text that text and white space are one text node, otherwise the white space is its own text node.
That should be all there is to it. JavaScript has absolutely no bearing on this subject.
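A rough sketch of the "tokenized space" presentation from the third bullet, in Python. The helper name `collapse_white_space` is made up for illustration, and this is a simplification: real CSS white space processing also trims at line edges and depends on the `white-space` property.

```python
import re

# ASCII white space as HTML defines it: space, tab, line feed, form feed,
# carriage return. Deliberately not r"\s": in Python that also matches
# U+00A0 (no-break space), which is not collapsed.
_WHITE_SPACE_RUN = re.compile(r"[ \t\n\f\r]+")


def collapse_white_space(text: str) -> str:
    """Hypothetical helper: present each run of white space as one space."""
    return _WHITE_SPACE_RUN.sub(" ", text)


source = "Hello,\n\t   world!"
print(repr(source))                        # the stored text is untouched
print(repr(collapse_white_space(source)))  # only the presentation changes
```

The source string is never modified; the collapsing is purely a display step applied on top of the preserved text.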
matt_kantor 35 days ago [-]
> White space does not determine the behavior or display of other HTML tags.
As mentioned in the article, the collapsing behavior of leading/trailing white space can affect other elements.
The only difference between the two cases in the following example is that the latter has a space at the end of the first element's content. That causes the second element's leading space to not be rendered: https://codepen.io/mkantor/pen/RNbXVJM?editors=1000
MrJohz 35 days ago [-]
In addition, the two cases can be toggled on and off using CSS alone, because `display:inline`, `display:block`, and `display:inline-block` all do different things to the two pieces of whitespace:
* `display:inline` allows the trailing space to be rendered (in red). There is now whitespace between the two elements, therefore all other whitespace up until the next non-whitespace character will be ignored.
* `display:inline-block` will remove the trailing red whitespace. Therefore when the blue whitespace is found, this is rendered.
* `display:block` will render the element as a block. Therefore trailing whitespace in that element and leading whitespace in the next element will both be ignored.
This is surprisingly weird and funky. I thought I had a fairly good handle on how odd whitespace in HTML could be, but I didn't realise how much it was controlled by different CSS display settings.
lmm 35 days ago [-]
> All white space in HTML and XML is preserved verbatim.
Do you mean "any change to whitespace in HTML or XML results in a semantically inequivalent document"? That's a perspective you can take, but it has pretty undesirable results: it means you can't ever reformat HTML or XML source. And it very clearly contradicts how most people understand and expect to work with HTML.
> The default presentation of white space in HTML and XML is what is called tokenized space, which is that all consecutive white space characters are displayed as a single space character. Again, you can control this with CSS.
Where is the CSS option to display multiple consecutive spaces, but line break as normal? That's something very much wanted in a lot of cases (and what the author's &ncsp; idea is getting at).
Also, this doesn't make it clear what happens (or should happen) with consecutive white space characters that have different markup.
> White space is a text node in the DOM. If it is adjacent to other text that text and white space are one text node, otherwise the white space is its own text node.
What does "adjacent" mean here? And what happens if the white space is adjacent to text on both sides?
MaulingMonkey 35 days ago [-]
> Do you mean "any change to whitespace in HTML or XML results in a semantically inequivalent document"?
In the strictest sense? Yes.
Of course, we can build tools that make... call it "unsound assumptions"... and I'll happily use them and encourage their use, because you can make the correct judgement call that those assumptions should hold in your context (and that the one causing the assumptions to be broken, if they ever are, is the one "at fault" rather than the tools.)
On the other hand, if those same tools are then automatically applied beyond your control, there's a good chance those unsound assumptions will be broken, and become a source of pain and suffering for whatever strange - or not so strange - edge cases your own context comes with.
Whitespace isn't the only source of this problem - and it's one of the problems I have with WYSIWYG editors in general. Often, they don't clean up after themselves and leave behind a bunch of editor shrapnel, in part because they can't remove stuff that might technically be semantically inequivalent. Those same editors might also remove stuff I wanted to keep!
lmm 35 days ago [-]
> Of course, we can build tools that make... call it "unsound assumptions"... and I'll happily use them and encourage their use, because you can make the correct judgement call that those assumptions should hold in your context (and that the one causing the assumptions to be broken, if they ever are, is the one "at fault" rather than the tools.)
That's a pretty bad way to standardise a data format IMO. If readers, writers, and tools all want these representations to be equivalent, far better to make that equivalence part of the standard - the point of the standard is to support the use cases, and being able to sensibly reformat HTML is far more valuable than being able to preserve a distinction that doesn't show up in any browser and most writers would never intend anyway.
austin-cheney 35 days ago [-]
The need for equivalence, if any, is in the parsing and not the visual presentation. HTML does not consider itself, according to its maintainers, to be a presentation format.
lmm 35 days ago [-]
Which makes it decidedly unfortunate that you cannot determine whether a sequence of spaces is collapsible or not without consulting the presentation layer, even though this should be a semantic/parsing question.
austin-cheney 35 days ago [-]
As a semantic/parsing consideration the white space is preserved.
tsimionescu 35 days ago [-]
The concept "collapsible spaces" is not a part of the HTML format, it is a decision that certain renderers apply, and others may not.
lmm 35 days ago [-]
Are there actual renderers that don't, and are there real users who consider that reasonable behaviour? I mean maybe someone somewhere has a spacebar heating workflow that relies on it, but file formats and standards should not add ways to shoot yourself in the foot if they can help it.
tsimionescu 35 days ago [-]
As this article shows, browsers actually collapse spaces differently based on the specific CSS applied - and this is in fact intended behavior, not some corner case.
Also, the output of HTML parsers is the HTML structure, and changing that to collapse spaces would break numerous tools. So while probably all HTML renderers do some kind of space collapsing, there are many other uses of HTML parsing that don't. Most likely the syntax highlighting in your HTML editor of choice in fact relies on a space-preserving HTML parser, just for one example.
lmm 35 days ago [-]
> browsers actually collapse spaces differently based on the specific CSS applied
Sure. But they do all collapse spaces. I don't think anyone wants their browser to always preserve all the spaces that are in the source.
> and this is in fact intended behavior, not some corner case.
Eh maybe. They collapse the spaces of block elements like block elements and the spaces of inline elements like inline elements; that seems like the obvious thing that your renderer would do if you didn't make any deliberate design decision.
> So while probably all HTML renderers do some kind of space collapsing, there are many other uses of HTML parsing that don't. Most likely the syntax highlighting in your HTML editor of choice in fact relies on a space-preserving HTML parser, just for one example.
I very much doubt it. And even if it did, that would be an incredibly backwards reason to keep that behaviour - "we've spent all this effort working around our bad standard, that would be wasted if we fixed the standard".
tsimionescu 35 days ago [-]
Creating parsers which entirely ignore parts of the input is generally a bad idea, because you lose the ability to round-trip. That is, it's often a desirable property to have a way to go text1 -> DOM -> text2, and have text2 be identical to text1, or at least very close to it. This is particularly true for markup languages, which intermix text and tags.
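A minimal sketch of that round-trip property, assuming Python's `html.parser` as a stand-in tokenizer (it is not a browser parser). The `EchoParser` class and `round_trip` function are hypothetical, and comments, doctypes and character references are left out to keep it short.

```python
from html.parser import HTMLParser


class EchoParser(HTMLParser):
    """Hypothetical round-tripper: re-emits the tokens it is handed."""

    def __init__(self, keep_blank_text: bool = True):
        super().__init__()
        self.keep_blank_text = keep_blank_text
        self.out = []

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the start tag exactly as written.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # A parser that "ignores" white space would drop blank runs here.
        if self.keep_blank_text or data.strip():
            self.out.append(data)


def round_trip(text: str, keep_blank_text: bool = True) -> str:
    parser = EchoParser(keep_blank_text)
    parser.feed(text)
    parser.close()
    return "".join(parser.out)


text1 = "<ul>\n  <li>one</li>\n  <li>two</li>\n</ul>"
print(round_trip(text1) == text1)                         # space-preserving
print(round_trip(text1, keep_blank_text=False) == text1)  # space-dropping
```

With white space kept, text2 is identical to text1; once the parser drops white-space-only text, the original layout cannot be reconstructed.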
lmm 34 days ago [-]
But somehow almost every programming language and data format manages to define these equivalences and have it not ruin their editors. JSON is whitespace-insensitive but syntax highlighting it in my editor works fine; I don't know or care what the parser implementation that accomplishes that is, but it's never caused any problems I've heard of.
tsimionescu 34 days ago [-]
I really don't get what you mean. HTML and JSON behave essentially the same way in relation to spaces. It's you who seems to be asking for the HTML parsers to apply display logic in the parsing step. And sure, JSON parsers discard whitespace information outside of JSON strings, but that only works because JSON has an explicit string type. In HTML everything is a user-visible string unless it's a tag, so the same logic fundamentally can't be applied.
In fact JSON is the perfect example - if you have multiple spaces or \n in a JSON string and load that into some DOM element with JS at runtime, those spaces will be eaten up just as much by the browser renderer as any spaces that were part of the original HTML. Because, again, HTML and even the DOM don't do any kind of space collapsing; only the browser render step does that, as instructed by CSS.
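To make the JSON half of that concrete, here is a small Python sketch using the standard `json` module (the sample documents are invented for illustration): white space outside strings is discarded, white space inside strings survives parsing.

```python
import json

# Same data, different formatting outside the string value.
compact = '{"greeting":"hello   world\\n"}'
spaced = '{\n    "greeting" :   "hello   world\\n"\n}'

# White space between tokens carries no information in JSON.
assert json.loads(compact) == json.loads(spaced)

# White space inside the string is data and is kept as-is.
value = json.loads(spaced)["greeting"]
print(repr(value))
```

Whether those three spaces and the newline are later shown collapsed is up to whatever renders the string, not the JSON parser.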
lmm 34 days ago [-]
> JSON parsers discard whitespace information outside of JSON strings, but that only works because JSON has an explicit string type. In HTML everything is a user-visible string unless it's a tag, so the same logic fundamentally can't be applied.
Well, sure. The point is that's an unfortunate design.
tsimionescu 34 days ago [-]
But it's a core part of the concept, the whole idea behind a markup language. Basically the whole point of HTML, and even of SGML before it, is that you are adding annotations in-line in a text, not representing a text as a tree-like data structure, at least for much of it.
> > All white space in HTML and XML is preserved verbatim.
> Do you mean "any change to whitespace in HTML or XML results in a semantically inequivalent document"? That's a perspective you can take, but it has pretty undesirable results: it means you can't ever reformat HTML or XML source. And it very clearly contradicts how most people understand and expect to work with HTML.
I don't think that's what grandparent meant. I read that HTML and XML do not impose any coalescing of whitespace. Whatever whitespace is read by a parser is accepted as such. Whether the whitespace has semantic value or not is not a concern for HTML or XML as data formats.
On the other hand, coalescing whitespace is a feature of HTML and XML renderers.
And you are correct: a tool that reformats whitespace inside a <verbatim> tag will output semantically wrong results (e.g. if the contents are Python code). Which supports the point: the semantics of whitespace are not determined by the HTML or XML data formats, but by the tools generating and consuming the data.
tsimionescu 35 days ago [-]
> What does "adjacent" mean here? And what happens if the white space is adjacent to text on both sides?
I think it is the simplest sense of adjacent: if the character at the previous or next position in the bytestream is considered text, then the whitespace character is part of the same text node; if it is considered anything else, it's a separate text node. This applies recursively, since whitespace is text itself. If you wanted to specify it very formally, you probably need to include some extra verbiage for escape sequences which represent text characters, but that's the only ambiguity I can think of.
DivNode
TextNode (starts with "\n A", ends with "<h\n")
AddressNode
TextNode (3 spaces)
BrNode
TextNode (space)
TextNode ("cde\n")
Any collapsing, un-collapsing, ignoring, and so on are handled by manipulating these nodes further. But this is the semantics of the HTML itself.
lmm 35 days ago [-]
> Any collapsing, un-collapsing, ignoring, and so on are handled by manipulating these nodes further. But this is the semantics of the HTML itself.
Well the post I replied to talked about the DOM, and as a description of the behaviour of the DOM I don't think your description is accurate - when you write `<a href="ABC">foo </a> bar` you only end up with one space character contained in a DOM text node, not two.
So yes, the DOM itself contains what I and the poster before mentioned. It's the presentation layer that decides how to do space collapse.
As a bonus, you can also see that the DOM of the page has four children of `body`: `<div>`, then a text node with the content "\n\n", `<script>`, and another text node containing "\n".
Tested all this with a simple HTML I saved on disk and opened in Firefox.
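For anyone who wants to check the same thing without a browser, here is a rough Python sketch. It assumes `html.parser` is an acceptable stand-in for the parsing step (it is not a browser DOM), and the `TextChunks` class is made up for this example.

```python
from html.parser import HTMLParser


class TextChunks(HTMLParser):
    """Hypothetical helper: records every run of text between tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


parser = TextChunks()
parser.feed('<p><a href="ABC">foo </a> bar</p>')
parser.close()

# Both spaces are still there after parsing: one inside the link text,
# one at the start of the text that follows it.
print(parser.chunks)
```

Any merging of those two spaces into one visible space happens later, at render time.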
afiori 35 days ago [-]
> Do you mean "any change to whitespace in HTML or XML results in a semantically inequivalent document"?
In a very limited sense it is like this for most languages, at least ones where error traces include line:cols numbers or that give magic macros like php's __LINE__.
If you were to use that data in your logic, reformatting could break your code. Similarly, if you read the text content of your HTML and use its whitespace in your logic, reformatting could break your code. Aside from that, in most cases 1 whitespace character or 1000 are generally equivalent in HTML.
lmm 35 days ago [-]
> In a very limited sense it is like this for most languages, at least ones where error traces include line:cols numbers or that give magic macros like php's __LINE__.
But generally people take the view that code that throws error traces with different line numbers can still be semantically equivalent, and that changing your code's behaviour depending on __LINE__ is unreasonable. Ultimately which files are considered equivalent will be a social convention, but it should be a social convention that fits the use cases and makes the file format easier to work with.
ximm 35 days ago [-]
I expected an uninformed rant and actually got a really well informed, balanced rant. I don't agree that this is a major issue, but every time I thought of a counter argument it was immediately addressed in the article. Well done!
jraph 35 days ago [-]
Yep. I don't agree with the premise, and I'm never reaching for the pre tag as they seem to do, but the content is informative and goes into a lot of details. The topic is very well researched.
I think HTML whitespace handling is a good compromise. It has drawbacks but they are workable. I wouldn't want the quoting solution (suddenly you'd need an additional escaping mechanism, which complexifies things, and I do believe it would make authoring HTML harder). And I'm not sure how HTML could do better without such a quoting solution.
I'm particularly not convinced by the CMS argument:
- A CMS can let users write stuff in HTML with a wysiwyg editor
- A CMS can trim printed strings, and replace new lines with br elements. If you need people to be able to break lines while writing, you can use something like markdown or whatever HN does.
As for prettifying HTML with automated tools in the editor, I never bothered. That scares me exactly because I'm afraid they will break my careful handling of whitespaces or do ugly stuff, I just prefer doing it by hand.
XML and SGML ought to have a deindent syntax that would allow indenting the code without indenting the content in the pre tag, though.
shiomiru 35 days ago [-]
It seems well-intentioned, but not particularly well informed...
* It mixes concerns of HTML and CSS - e.g. <pre> eating newlines is a result of the HTML parser, but whitespace collapsing is specified in CSS.
* It suggests turning HTML from a markup language into... whatever incompatible thing the author came up with. Arguably it's not much worse than XHTML, but I don't expect better adoption.
* The final suggestion is to add a character reference to CSS - the sole issue being that CSS does not see character references, those are turned by HTML into Unicode codepoints. Also, the set of character references is closed, for good reasons.[0]
* (Nit, but "block formatting context" does not mean what the author thinks it means. Flex items behave like blocks because they become blockified - BFCs solve a separate, similarly hairy issue (floats & margin collapsing.))
The part about <aside> is worded in a way that can be confusing. It implies that a <div> is always going to be displayed as a block, but an <aside> is some kind of special element that could be either a block or inline.
Whether it's "inline", "block", or "inline-block" is determined by the CSS for every visual HTML element. A <div> is not a block, nor an <a> an inline, those are just the defaults and can be easily overridden.
vintagedave 35 days ago [-]
On whitespace: what about sentences? HTML has paragraph-level, and word-level, spacing. (This article is about word-level.)
By the way, for such a spacious and space-conscious blog no space between em dashes is out of place, they glue words together
"paragraph—but"
vintagedave 27 days ago [-]
Good point! :)
eqvinox 35 days ago [-]
The author seems entirely unaware that HTML (SGML/XML) entities are essentially text replacement. &#32; is the same as a literal space, and breaking this (or adding a new &lf; that isn't a literal line break) would create an even worse mess than these purported whitespace problems.
(Also the zero-width space, U+200B, should probably have been discussed in this article.)
nayuki 35 days ago [-]
Be careful with what you mean when you say "essentially text replacement" and "same as a literal space".
In HTML, ampersand entities can only be used outside of tags or inside attribute values, for example: <tag key="&value;">&abc;</tag>. Ampersand entities cannot be used as a substitute for tags; for example, &lt;tag&gt; is literal renderable text and not a tag.
This distinction is important because of odd languages like Java, where \uXXXX processing happens before tokenization. So this is a legal program: class Foo \u007B }. (U+7B is left brace.) One consequence is that to put a literal quotation mark inside a string, you can't use \u0022, but you must use \".
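The HTML half of this can be sketched in Python, assuming the standard `html.parser` module as a stand-in for an HTML tokenizer (the `EventLog` class is hypothetical). Character references are decoded only after the markup has been split into tags and text, so they can never produce a tag.

```python
from html.parser import HTMLParser


class EventLog(HTMLParser):
    """Hypothetical helper: records the tokens the parser reports."""

    def __init__(self):
        # convert_charrefs defaults to True: references are decoded
        # inside text, after tags have already been recognised.
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("text", data))


log = EventLog()
log.feed('<p title="&lt;b&gt;">&lt;b&gt;bold?&lt;/b&gt;</p>')
log.close()

for event in log.events:
    print(event)
```

Only the `p` element shows up as a tag; the escaped `<b>` appears as plain text and as an attribute value.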
eqvinox 35 days ago [-]
Yes, this is why I added the "essentially". HTML does it after tokenization, but before whitespace processing (which implies before a lot of other things that by necessity happen between tokenization and whitespace processing, these steps are nowhere close to each other). But the author seems to vaguely believe/suggest to either process entities at a much later stage, or carry down some "shadow effect" in addition to expanding the entity to its text content. Both of these are complete non-starters IMHO.
(btw, you/anyone have other examples for that Java-style ordering of escapes vs. tokenization? I can think of C preprocessor trigraphs [which were removed in C23] but nothing else…)
nayuki 35 days ago [-]
C trigraphs are a great example that I forgot, because no one uses them and everyone hates them; glad to hear that C23 is finally removing that wart. I'm not aware of other languages with Java-style escapes before tokenization. I only learned the Java example due to the book Java Puzzlers; it has little impact on real-world code because it's considered obfuscation. I don't know the nitty-gritty details of other programming languages that I come in contact with; I just know Java especially well.
radium3d 35 days ago [-]
I do the “you’re probably not going to do” mentioned all the time. Maybe that’s why I haven’t had many instances of needing to adjust for white space in html over the last almost 30 years of web development haha
peter-m80 35 days ago [-]
Broken = works as expected
lexicality 35 days ago [-]
it works as specified, but in many cases that's not what someone who hasn't read the spec would expect
seba_dos1 35 days ago [-]
I dunno, given all the constraints it actually seems to be specified in a pretty reasonable way to work in an approachable manner even (or especially) if you had never read a spec in your life.
jfk13 35 days ago [-]
Does the article's "Example 31" look correct to anyone? I see no "small space between the two boxes" (as claimed) in any of the browsers I tried.
matt_kantor 35 days ago [-]
If you view the source it looks like they goofed and don't actually have any white space between the elements in the rendered example. It's just `<ul class="inline-block-list"><li></li><li></li></ul>`.
When I try the code they show on the page myself the space is displayed.
65 35 days ago [-]
One thing I've always found incredibly annoying is that the &nbsp; character is not the same width as a space. It's slightly thinner.
wesammikhail 35 days ago [-]
The number of hours I've wasted over the years trying to resolve whitespace problems is... not okay, to say the least.
Great writeup though!
malaise 35 days ago [-]
White space being broken was one of my first gripes with HTML, and that was about 25 years ago. It’s amazing to see a whole post about it finally, but also so obvious that I’m surprised to see one.
worksonmine 35 days ago [-]
This is one of the reasons I need at least a minify step on the HTML of any site I build. Sometimes I'd prefer not to but it's the easiest solution to have both indented HTML in the source and consistent spacing in the result.
I don't think it's broken though, imagine if every whitespace was rendered. And how do we know what should be collapsed? I don't see a better solution that satisfies every situation.
Use <pre> to preserve the whitespace in code, for everything else use CSS.
ximm 35 days ago [-]
Concerning text-to-speech and the missing separation of HTML and CSS: There are several open issues about this in the spec that defines how accessible names and descriptions are computed from HTML elements: https://github.com/w3c/accname/issues?q=state%3Aopen%20label...
vintagedave 35 days ago [-]
This bothers me too:
> Newlines and tabs are also treated identically and collapsed into spaces.
This means that simply formatting your HTML -- as you might when hand-editing -- can add spaces. Possibly, something else is going on too: I'm not an expert here, I've just observed rendering differences when you have a newline between tags. Maybe it's inline vs block tags as the author also discusses?
This has bitten me a few times and my hand-written sites or SSG-generated sites that I want to look nice have several ugly, long lines that should be broken up to be readable, but are not.
ximm 35 days ago [-]
The one thing I thought was IMHO missing from this article was JavaScript.
In HTML, it is pretty natural to add white space (i.e. text nodes containing white space) between all elements. You basically only have to worry if you want to avoid that.
In JavaScript, the opposite is true. If you want to create a text node, you have to do so explicitly. If you just create elements and append them to the same parent, they will be added without whitespace.
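The same effect can be sketched outside the browser. This assumes Python's `xml.dom.minidom` as a stand-in, since it exposes the same DOM method names (`createElement`, `createTextNode`, `appendChild`); it is not the browser DOM, and the list content is invented for the example.

```python
from xml.dom.minidom import Document

doc = Document()
ul = doc.createElement("ul")
doc.appendChild(ul)

# Elements appended one after another get no white space between them.
for label in ("one", "two"):
    li = doc.createElement("li")
    li.appendChild(doc.createTextNode(label))
    ul.appendChild(li)

without_space = ul.toxml()

# White space only exists if a text node is created for it explicitly.
ul.insertBefore(doc.createTextNode(" "), ul.lastChild)
with_space = ul.toxml()

print(without_space)
print(with_space)
```

The first serialization has the two items touching; the second has exactly the one space that was added by hand.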
I am not sure how JSX behaves in this regard. Last time I checked it was more like JavaScript than HTML, which was of course very confusing for people.
CSSer 35 days ago [-]
iirc, in JSX whitespace between elements/components is ignored unless it’s part of a string inside a JSX expression, e.g. {" "}.
P-Nuts 35 days ago [-]
It has always slightly nagged me that web browsers don’t do as good a job as TeX at line breaking and sentence spacing.
nilslindemann 35 days ago [-]
It should remove all whitespace before and after line breaks ("\n"), including the line break itself. And otherwise replace inline whitespace with one space. In pre's keep the whitespace. Well, at least we have display:flex.
The issue probably comes from the fact that web browsers try to render HTML even if it’s not perfect. HTML isn’t super strict, so browsers will still display pages with small mistakes. There was a push to make HTML stricter with XHTML, which enforced rules like case-sensitive elements and closing tags, kind of like XML. But it didn’t really stick. Browsers had a hard time with those stricter rules, so HTML’s more relaxed approach stuck around. For some time I really tried to use XHTML when creating web pages, but then I asked myself why all of the trouble when browsers don't follow the standards.
nayuki 35 days ago [-]
Half of what you said is correct, but:
> There was a push to make HTML stricter with XHTML / But it didn’t really stick.
> which enforced rules like case-sensitive elements
That's a good thing. I don't miss the old days (~2000) of <HTML><BODY><P> etc. It's ugly to my eyes. Moreover, even today, it's legal to write <dIv></DIv>.
> Browsers had a hard time with those stricter rules
Nonsense; it's developers and amateur web designers who couldn't cope with XHTML. Browsers parse XHTML perfectly these days, because it's just an application of XML. Also, existing tools like Macromedia Dreamweaver didn't add XHTML export support; it only outputted HTML.
> I really tried to use XHTML when creating web pages, but then I asked myself why all of the trouble when browsers don't follow the standards
Browsers do follow the standards in XHTML mode! My page documents the behaviors and the yellow screen of death. For me, it's worth the trouble because it helps me detect errors like invalid syntax and unclosed tags.
TheChaplain 35 days ago [-]
It's hard to take any article seriously with a header like "X is broken" when it obviously isn't for a ridiculously large majority.
perching_aix 35 days ago [-]
What a puzzling reply, HTML parsers are famously mistake-tolerant and convention driven. Its standard is almost a suggestion, there can be plenty wrong with it without causing practical everyday issues, as the modern web is very established.
shiomiru 35 days ago [-]
Throwing any codepoint sequence at two separate compliant HTML5 parsers will result in the same DOM, and to my knowledge, all major browsers use compliant parsers.
agos 35 days ago [-]
I find more puzzling that the author, after declaring it "broken", offers a solution that is as broken but also laughable.
> If the next token is a U+000A LINE FEED (LF) character token, then ignore that token and move on to the next one. (Newlines at the start of pre blocks are ignored as an authoring convenience.)
This is also applied to <textarea>.
Personally, I think it was a mistake, because it complicates things and doesn’t do enough to justify itself. If it also did leading whitespace trimming across all lines, it’d be interesting enough to maybe justify itself as an authoring convenience (… though honestly I suspect that’d end up worse), but as it is it’s just an extra complication. I’ve needed to deal with the nuance of its special behaviour more than once or twice, and I’ve seen others stumble over it too. It’s also part of the fairly small pile of HTML features that make it not-round-trippable: it’s only done in parsing; the serialiser doesn’t insert an extra ␊ if it would emit `<pre>␊`.
This is one of the many cases that tempts me in the direction of the XML syntax (which, to many people’s surprise, is absolutely still a thing—save a local file with extension .xhtml, or serve over HTTP with MIME type application/xhtml+xml). The fact that XML doesn’t have a parser that guesses what you meant is generally a nice feature.
(XML also has whitespace collapsing, xml:space. Honestly it’s interesting in this context, conveying whitespace-handling intent, but I’ll ignore it. Because it’s never coming to HTML.)
But we’re stuck with this behaviour, because it would break compatibility.
And that’s where a lot of the rest of the article baffles me, because I get the general sense, from the way he presents information, that this guy doesn’t understand a lot of HTML’s history and philosophy, things I’d expect to be understood by a member of the Angular team. The suggestions made are generally just obviously not suitable for HTML, not just because of compatibility, but also because of philosophy.
You think &#32; should be different from SPACE? Sorry, I think we’re up to about forty years since that ship sailed; entities/character references are strictly shorthand, and numbered entities are strictly code points. And do you know how confusing it would be if it worked differently? It would be a one-off special case.
You think you can add a new entity to handle this? In XML I think you might be able to do that (way too long since I’ve written a DTD to remember clearly), but in HTML they’re called character references, because that’s all they can be, and your non-collapsing space would need to be either something entirely new in the document model, or shorthand for something like <span style=white-space:pre-wrap> </span>.
> You'd think the CMS should be able to solve this problem, but it really can't.
Uh, yes it can, and they all do, where they accept plain text, by either chunking the text into HTML paragraphs (e.g. "<p>" + s/\n\n/<\/p><p>/ + "</p>"), or by turning your text line breaks into HTML line breaks (e.g. s/\n/<br>/). CMSes do a lot of dodgy stuff like this. If you want to have nightmares, look at WordPress’s wpautop function, and think through the implications of it all. It’s a radioactive wasteland of bad ideas.
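The two approaches look roughly like this in Python. This is a sketch with made-up helper names (`paragraphs`, `line_breaks`), not what any particular CMS does, and it escapes the text first so user input can't inject markup.

```python
import html
import re


def paragraphs(text: str) -> str:
    """Hypothetical helper: chunk plain text into <p> elements at blank lines."""
    blocks = [block for block in re.split(r"\n{2,}", text.strip()) if block]
    return "".join(f"<p>{html.escape(block)}</p>" for block in blocks)


def line_breaks(text: str) -> str:
    """Hypothetical helper: turn every text line break into <br>."""
    return html.escape(text).replace("\n", "<br>")


print(paragraphs("A\n\nB"))
print(line_breaks("A\nB & C"))
```

Real implementations pile many special cases on top of this, which is where the dodgy behaviour tends to come from.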
It’s also rather important to remember that two line breaks in HTML (e.g. <p>A<br><br>B</p>) is not the same as a paragraph break (e.g. <p>A</p><p>B</p>). Consider margins and text-indent, for a start.
> How Could we Fix This?
The offered solution, “quote your strings”, is what almost all programming languages tend to do. Document languages practically never quote their strings (I can’t immediately think of any even vaguely popular ones that do). Document languages consistently default to text mode, with only markup elements requiring special syntax.
As is later noted, there is, of course, absolutely no chance of HTML ever doing anything even vaguely like this. And honestly, if such a breaking change were on the cards, you’d be making far more invasive changes to HTML’s syntax.
> 3. HTML already breaks the rules of common text formatting.
> • The idea that you can write HTML today by just typing the text you want is a lie.
No it isn’t: no one ever suggested that was a feature; there was no dishonesty. HTML is a markup language.
—⁂—
The remark on template language whitespace control is incorrect:
Say hello to
{%- username -%}
and welcome them to the team!
You’ll actually get “Say hello toDeveland welcome them to the team!” which is clearly not what’s wanted.
—⁂—
For my own part, I have at times seriously considered producing HTML with only the whitespace I mean, and applying something along the lines of `:root { white-space: pre-wrap }`.
But then I remember that there’s a lot more that’s dodgy around segmentation, both in the directions of extraneous and missing breaks. For example, this URL and its rendering:
data:text/html,<body style=font-family:monospace;width:5ch>Look at C++!<br>X </a>
Look
at C+
+!
X </
a>
Viewing on my phone (which, due to narrower column width, is more likely to demonstrate such problems), I think I’ve come across three articles on HN in the last week or so exhibiting this sort of problem. If I were writing much that referred to C++, I would genuinely make something to change it to <nobr>C++</nobr>, and I do sometimes tweak breaking behaviour inside <code> elements to control where breaks can occur. (I’m also the kind of guy who types actual no-break spaces in Bible references where the book has an ordinal, e.g. “1 John 2:3” will have one NBSP and one SPACE.)
And in the end… HTML collapsing whitespace has done a lot to quell the two-spaces-between-sentences convention some hold, so it’s not all bad. ;-)
nayuki 35 days ago [-]
Awesome comment, thanks. I agree with pretty much everything you said.
> tempts me in the direction of the XML syntax (which, to many people’s surprise, is absolutely still a thing—save a local file with extension .xhtml, or serve over HTTP with MIME type application/xhtml+xml)
> The offered solution, “quote your strings”, is what almost all programming languages tend to do.
Correct, except for shell scripts. In shell, by default everything is a literal string passed into the program, unless it has designators such as a prefix dollar sign. As you can imagine, this causes all sorts of escaping nightmares. And don't get me started about array handling. That's why I find myself reaching for Python instead of Bash, because it's so much easier to build sane, composable, debuggable programs in the former.
> I would genuinely make something to change it to <nobr>C++</nobr>
> I’m also the kind of guy who types actual no-break spaces in Bible references where the book has an ordinal, e.g. “1 John 2:3” will have one NBSP and one SPACE.
Similarly, I am one of the few people who uses NBSP between a number and a unit symbol: 10 kg.
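If typing the character by hand gets tedious, something like the following Python sketch could do it as a post-processing step. The unit list and the `bind_units` name are made up for the example, and the narrow variant is just an optional parameter.

```python
import re

NBSP = "\u00a0"    # NO-BREAK SPACE
NNBSP = "\u202f"   # NARROW NO-BREAK SPACE

# Hypothetical, deliberately short list of unit symbols for the example.
UNITS = ("kg", "g", "km", "m", "cm", "mm", "ms", "s")

# A digit, one ordinary space, then a whole unit symbol.
_NUMBER_UNIT = re.compile(r"(\d) (?=(?:%s)\b)" % "|".join(UNITS))

def bind_units(text, space=NBSP):
    """Replace the plain space between a number and a unit with `space`."""
    return _NUMBER_UNIT.sub(lambda m: m.group(1) + space, text)

print(repr(bind_units("10 kg of flour")))
```

The `\b` in the lookahead keeps ordinary words that merely start with a unit letter (as in "10 makes") untouched.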
chrismorgan 34 days ago [-]
A few comments on your practical guide to XHTML, which is otherwise an entirely excellent document:
Firstly, you don’t make it clear that XHTML is dead, dead, dead. The HTML Standard superseded it, and although it originally contained a thing still called XHTML, it eventually retired the XHTML name due in part to confusion with XHTML 1 <https://github.com/whatwg/html/pull/2062>. And so what we have instead is XML syntax for HTML (arguably also a tad misleading due to processing differences like document.write’s absence, tagName case differences and missing features like <noscript>, which feel a bit beyond syntax differences). Honestly, browsers never truly even implemented XHTML; what they implemented was strictly more accurately an XML syntax for HTML. They only validated the XML part, never the XHTML part, and allowed HTML features absent from XHTML. By now, I’d call even XHTML 1.0 or 1.1 validation fairly worthless, given the further divergences in the HTML Standard—though I don’t know what there may be in the way of XML-syntax HTML validation to replace it.
Anyway, I’d suggest talking about HTML and XML syntax instead, since that’s been the accepted terminology for the last eight years; or at the very least a note to some such effect at the beginning of the article.
Secondly, about boolean attributes there’s another option you don’t mention: setting the attribute to an empty string, e.g. checked="", which is what empty attribute syntax is equivalent to these days. The HTML 4 and XHTML 1 DTDs specified `checked (checked) #IMPLIED`, meaning `checked="checked"` was the only valid value, with the minimised form `checked` as equivalent¹, and `checked=""` invalid, I believe. But in practice, I believe literally no one implemented it the SGML way, and so `.getAttribute("checked")` gave you an empty string on <input checked>… though serialising the element (.innerHTML of parent) in at least Firefox 4.0² turns checked="" into checked="checked".
Thirdly, it’s possibly worth noting that <noscript> doesn’t work in XML syntax.
¹ SGML is weird. It’s not omitting the attribute value like any sane person would expect, but rather the attribute value. Given <!ATTLIST alpha beta (gamma) #IMPLIED> (“element alpha has an optional attribute named beta which has only one legal value, gamma”), the minimised form of <alpha beta="gamma"> is <alpha gamma>; <alpha beta> would be invalid. So I guess XHTML was mildly backed into a corner on the syntax. I doubt even a single user agent ever actually implemented it the SGML way, rather special-casing it; it’s documented that many HTML user agents didn’t actually support <input checked="checked">, only <input checked>.
² Downloaded from https://ftp.mozilla.org/pub/firefox/releases/ and it Just Worked. Unfortunately, I’m on x86_64 Linux without multiarch i686 set up, and 4.0 was the oldest with x86_64 builds. Shouldn’t be hard to get 1.0 running, but my curiosity wasn’t quite great enough for the effort. I was worried 4.0 would be too new, but this "" → "checked" serialisation suggests it was still weird, so ancient browsers probably do much the same.
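For the boolean attribute point, a quick way to see what the XML side accepts is to run the variants through a generic XML parser. This is a sketch using Python's `xml.etree.ElementTree`, which only checks well-formedness and knows nothing about HTML or any DTD; `is_well_formed` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

def is_well_formed(source):
    """Return True if `source` parses as a well-formed XML document."""
    try:
        ET.fromstring(source)
    except ET.ParseError:
        return False
    return True

for snippet in ('<input checked="checked" />',
                '<input checked="" />',
                '<input checked />'):
    print(snippet, is_well_formed(snippet))
```

The first two forms parse; the minimised form is rejected because XML requires every attribute to have a quoted value.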
—⁂—
Responses to other parts of your comment:
• Shell scripting was the main exception I had in mind, for unquoted strings; and they’re something you have to be careful with. Perl has barewords; and they’re typically mostly disabled as a footgun.
• I had completely forgotten <nobr> was deemed non-conforming! One of these days I should investigate more thoroughly why, and probably appeal for it to be revived and formalised. It’s a useful and semantically-valuable thing, exactly as much as <wbr>, which I think had exactly the same “this wasn’t even in any W3C HTML spec” problem as <nobr>. In the mean time, I will continue to use it from time to time, as appropriate.
• I like to type NARROW NO-BREAK SPACE between a number and a unit, if I reckon a little spacing is called for (normally I don’t).
nayuki 33 days ago [-]
Thanks for the great feedback. I agree with what you said, though there are details I'd like to respond to.
> XHTML is dead / retired the XHTML name / XML syntax for HTML
It's complicated. XHTML 1.0 certainly existed, I coded to that standard, and I used W3C's validator. XHTML 1.0 is basically HTML 4.01 modified to fit XML syntax. XHTML 2 was drafted but never implemented as far as I knew. XHTML 1.0 code is still forward compatible with the current so-called XHTML5. In that sense, XHTML is not dead in name or in implementation, and it's just the trunk of history whereas XHTML 2 is a dead branch. I would prefer to continue using the XHTML name, despite what you legitimately pointed out regarding "XML syntax for HTML".
> boolean attributes
I never knew about that complicated behavior. Thanks for the introduction; I'll have to look further into it.
> <noscript> doesn’t work in XML syntax
It works perfectly fine and some of my pages have examples. I'm not even using CSS to hide the element; I'm solely relying on the browser's default behavior. https://www.nayuki.io/page/calculate-gcd-javascript
> <template> tag needs special-casing in the parser
I didn't know this and it saddens me that the <template> DocumentFragment behavior breaks the uniformity of the XML DOM.
> SGML is weird
It is indeed, just from my cursory study of it. I only know enough SGML to point out that it's weird and then give a few links to more information. I only mentioned SGML in my article to explain where the weirdness of HTML came from.
> It’s not omitting the attribute value like any sane person would expect, but rather the attribute value.
Looks like you misspelled something there, having repeated the same phrase.
> the minimised form of <alpha beta="gamma"> is <alpha gamma>; <alpha beta> would be invalid
I did not expect that, and it doesn't seem to match the behavior of HTML.
> why <nobr> was non-conforming
To me it seems to have a presentation quality like <b>, <i>, <u>, <tt>, <center>, etc.
> exactly as much as <wbr>
I think there's no way to design <wbr/> to be expressed in CSS because it really is a single point in the text rather than a span.
> The noscript element is only effective in the HTML syntax, it has no effect in the XML syntax. This is because the way it works is by essentially "turning off" the parser when scripts are enabled, so that the contents of the element are treated as pure text and not as real elements. XML does not define a mechanism by which to do this.
That’s the sense in which I say it doesn’t work. In the way people most commonly use it, to add a “JS disabled so this won’t work” message in the body, you’re saved by the user agent stylesheet including `@media (scripting) { noscript { display: none !important; } }`. But if you try doing things like varying styles via <noscript>, that won’t work. And that’s why <noscript> is (nominally, as ever) disallowed in XML syntax.
• As for <nobr> seeming presentational: it can be used presentationally, but so can <strong>, and often is. If I think to places where I’ve used it, most of the time perception of the content would have been materially harmed by line breaks. And that’s the real test of whether something is presentation; framed another way, in a Reader Mode, should it remain or disappear? I expect inline styles and spans may be stripped by a Reader Mode, including <span style="white-space:nowrap">, but I say <nobr> should remain. Drawing connections to Unicode may be helpful too; most things there are Unicode characters for, if there are HTML things for they should be preserved as likely content semantics: I wouldn’t expect a Reader Mode to mangle my Unicode NO-BREAK SPACE to SPACE or my NARROW NO-BREAK SPACE to THIN SPACE or SPACE—they can be presentational (or even stenographic!), but they’re used for content semantics. And <nobr> is content semantics.
• <wbr> can be expressed in CSS easily: <span style="display:inline-block"></span>.
masfuerte 35 days ago [-]
Another option is word joiners:
C⁠+⁠+
You can make this easier to type by defining an entity in XHTML:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"DTD/xhtml1-strict.dtd" [
<!ENTITY cpp "C⁠+⁠+">
]>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<body style="font-family:monospace;width:5ch"><p>Look at &cpp;!</p></body>
</html>
chrismorgan 34 days ago [-]
Yeah, that’s also a reasonable solution. For short things, you can often use U+200B ZERO-WIDTH SPACE instead of <wbr>, and this is U+2060 WORD JOINER as an alternative to wrapping in <nobr>.
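Both characters are easy to insert programmatically. A minimal Python sketch, with hypothetical helper names, assuming you want no break anywhere inside a short token, or a break opportunity after a chosen marker:

```python
WORD_JOINER = "\u2060"        # U+2060, forbids a line break at its position
ZERO_WIDTH_SPACE = "\u200b"   # U+200B, allows a line break at its position

def forbid_breaks(token):
    """Put a WORD JOINER between every pair of characters in `token`."""
    return WORD_JOINER.join(token)

def allow_break_after(text, marker):
    """Add a ZERO WIDTH SPACE after every occurrence of `marker`."""
    return text.replace(marker, marker + ZERO_WIDTH_SPACE)

print(ascii(forbid_breaks("C++")))             # 'C\u2060+\u2060+'
print(ascii(allow_break_after("a/b/c", "/")))  # 'a/\u200bb/\u200bc'
```

Whether a given renderer honours these code points at every position is up to its line-breaking implementation, so treat this as a starting point.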
BoujidStack 35 days ago [-]
Whitespace in HTML always seems to do its own thing, never quite what you expect!
bloak 35 days ago [-]
U+0020 is perhaps the weirdest character in Unicode. Most of the time it's used to separate words, which is arguably a kind of mark-up. But sometimes it's used for other kinds of formatting. Also, if you were to use explicit mark-up for words I have no idea how you'd handle punctuation. Perhaps writing should be redesigned from the ground up?
But meanwhile, though we'd all love to see the plan, let's stick with the mess we're used to.
Do you mean "any change to whitespace in HTML or XML results in a semantically inequivalent document"? That's a perspective you can take, but it has pretty undesirable results: it means you can't ever reformat HTML or XML source. And it very clearly contradicts how most people understand and expect to work with HTML.
> The default presentation of white space in HTML and XML is what is called tokenized space, which is that all consecutive white space characters are displayed as a single space character. Again, you can control this with CSS.
Where is the CSS option to display multiple consecutive spaces, but line break as normal? That's something very much wanted in a lot of cases (and what the author's &ncsp; idea is getting at).
Also, this doesn't make it clear what happens (or should happen) with consecutive white space characters that have different markup.
> White space is a text node in the DOM. If it is adjacent to other text that text and white space are one text node, otherwise the white space is its own text node.
What does "adjacent" mean here? And what happens if the white space is adjacent to text on both sides?
In the strictest sense? Yes.
Of course, we can build tools that make... call it "unsound assumptions"... and I'll happily use them and encourage their use, because you can make the correct judgement call that those assumptions should hold in your context (and that the one causing the assumptions to be broken, if they ever are, is the one "at fault" rather than the tools.)
On the other hand, if those same tools are then automatically applied beyond your control, there's a good chance those unsound assumptions will be broken, and become a source of pain and suffering for whatever strange - or not so strange - edge cases your own context comes with.
Whitespace isn't the only source of this problem - and it's one of the problems I have with WYSIWYG editors in general. Often, they don't clean up after themselves and leave behind a bunch of editor shrapnel, in part because they can't remove stuff that might technically be semantically inequivalent. Those same editors might also remove stuff I wanted to keep!
That's a pretty bad way to standardise a data format IMO. If readers, writers, and tools all want these representations to be equivalent, far better to make that equivalence part of the standard - the point of the standard is to support the use cases, and being able to sensibly reformat HTML is far more valuable than being able to preserve a distinction that doesn't show up in any browser and most writers would never intend anyway.
Also, the output of HTML parsers is the HTML structure, and changing that to collapse spaces would break numerous tools. So while probably all HTML renderers do some kind of space collapsing, there are many other uses of HTML parsing that don't. Most likely the syntax highlighting in your HTML editor of choice in fact relies on a space-preserving HTML parser, just for one example.
Sure. But they do all collapse spaces. I don't think anyone wants their browser to always preserve all the spaces that are in the source.
> and this is in fact intended behavior, not some corner case.
Eh maybe. They collapse the spaces of block elements like block elements and the spaces of inline elements like inline elements; that seems like the obvious thing that your renderer would do if you didn't make any deliberate design decision.
> So while probably all HTML renderers do some kind of space collapsing, there are many other uses of HTML parsing that don't. Most likely the syntax highlighting in your HTML editor of choice in fact relies on a space-preserving HTML parser, just for one example.
I very much doubt it. And even if it did, that would be an incredibly backwards reason to keep that behaviour - "we've spent all this effort working around our bad standard, that would be wasted if we fixed the standard".
In fact JSON is the perfect example - if you have multiple spaces or \n in a JSON string and load that into some DOM element with JS at runtime, those spaces will be eaten up just as much by the browser renderer as any spaces that were part of the original HTML. Because, again, HTML and even the DOM don't do any kind of space collapsing; only the browser render step does that, as instructed by CSS.
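The split between data and rendering can be sketched in Python. `collapse_like_renderer` is a hypothetical and very rough stand-in for what `white-space: normal` does; the real CSS rules also cover line starts and ends and runs that span inline elements.

```python
import json
import re

def collapse_like_renderer(text):
    """Rough approximation of default rendering: runs of white space become one space."""
    return re.sub(r"[ \t\n\r\f]+", " ", text)

payload = '{"msg": "two  spaces\\nand a newline"}'
msg = json.loads(payload)["msg"]

print(repr(msg))                          # 'two  spaces\nand a newline'
print(repr(collapse_like_renderer(msg)))  # 'two spaces and a newline'
```

The parsed string keeps every character; only the display step throws the extra white space away.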
Well, sure. The point is that's an unfortunate design.
Loads of HTML has to do with presentation.
b, colspan, map, etc
The colspan attribute is not even a tag and is an artifact of organization, not presentation.
I suggest you read more about what these things are in HTML, the history of HTML, accessibility, CSS, and on and on and on...
Colspan is an HTML attribute, yes. And column-span is a CSS property. HTML has many such duplications.
Might I suggest a healthy dose of your recommendation.
[1] https://html.spec.whatwg.org/multipage/text-level-semantics....
> Do you mean "any change to whitespace in HTML or XML results in a semantically inequivalent document"? That's a perspective you can take, but it has pretty undesirable results: it means you can't ever reformat HTML or XML source. And it very clearly contradicts how most people understand and expect to work with HTML.
I don't think that's what grandparent meant. I read that HTML and XML do not impose any coalescing of whitespace. Whatever whitespace is read by a parser is accepted as such. Whether the whitespace has semantic value or not is not a concern for HTML or XML as data formats.
On the other hand, coalescing whitespace is a feature of HTML and XML renderers.
And you are correct: a tool that reformats whitespace inside a <verbatim> tag will output semantically wrong results (e.g. if the contents are Python code). Which supports the point: the semantics of whitespace are not determined by the HTML or XML data formats, but by the tools generating and consuming the data.
I think it is the simplest sense of adjacent: if the character at the previous or next position in the bytestream is considered text, then the whitespace character is part of the same text node; if it is considered anything else, it's a separate text node. This applies recursively, since whitespace is text itself. If you wanted to specify it very formally, you would probably need to include some extra verbiage for escape sequences which represent text characters, but that's the only ambiguity I can think of.
So for example the following HTML:
Has the following structure:
Any collapsing, uncollapsing, ignoring, and so on are handled by manipulating these nodes further. But this is the semantics of the HTML itself.
Well the post I replied to talked about the DOM, and as a description of the behaviour of the DOM I don't think your description is accurate - when you write "<a href="ABC">foo </a> bar" you only end up with one space character contained in a DOM text node, not two.
I created a page like this:
And you'll see in the console:
So yes, the DOM itself contains what I and the poster before mentioned. It's the presentation layer that decides how to do space collapse.
As a bonus, you can also see that the DOM of the page has four children of `body`: `<div>`, then a text node with the content "\n\n", `<script>`, and another text node containing "\n".
Tested all this with a simple HTML I saved on disk and opened in Firefox.
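The same check can be approximated outside a browser. This is a sketch with Python's `html.parser`, which is not a full HTML5 parser and builds no DOM, so it only illustrates that the parsing step hands over white space untouched; `parse_events` is a hypothetical helper.

```python
from html.parser import HTMLParser

class EventCollector(HTMLParser):
    """Record start tags, end tags and text runs in document order."""
    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("text", data))

def parse_events(source):
    parser = EventCollector()
    parser.feed(source)
    parser.close()
    return parser.events

for event in parse_events('<a href="ABC">foo </a> bar'):
    print(event)
```

The text runs come out as "foo " and " bar", each with its own space; nothing at this stage merges them.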
In a very limited sense it is like this for most languages, at least ones where error traces include line:cols numbers or that give magic macros like php's __LINE__.
If you were to use that data in your logic, reformatting could break your code. Similarly, if you read the text content of your HTML and use its whitespace in your logic, reformatting could break your code. Aside from that, in most cases 1 whitespace character or 1000 are equivalent in HTML.
But generally people take the view that code that throws error traces with different line numbers can still be semantically equivalent, and that changing your code's behaviour depending on __LINE__ is unreasonable. Ultimately which files are considered equivalent will be a social convention, but it should be a social convention that fits the use cases and makes the file format easier to work with.
I think HTML whitespace handling is a good compromise. It has drawbacks but they are workable. I wouldn't want the quoting solution (suddenly you'd need an additional escaping mechanism, which complicates things, and I do believe it would make authoring HTML harder). And I'm not sure how HTML could do better without such a quoting solution.
I'm particularly not convinced by the CMS argument:
- A CMS can let users write stuff in HTML with a wysiwyg editor
- A CMS can trim printed strings, and replace new lines with br elements. If you need people to be able to break lines while writing, you can use something like markdown or whatever HN does.
As for prettifying HTML with automated tools in the editor, I never bothered. That scares me, precisely because I'm afraid they will break my careful handling of whitespace or do ugly stuff; I just prefer doing it by hand.
XML and SGML ought to have a de-indent syntax that would allow indenting the code without indenting the content in the pre tag, though.
* It mixes concerns of HTML and CSS - e.g. <pre> eating newlines is a result of the HTML parser, but whitespace collapsing is specified in CSS.
* It suggests turning HTML from a markup language into... whatever incompatible thing the author came up with. Arguably it's not much worse than XHTML, but I don't expect better adoption.
* The final suggestion is to add a character reference to CSS - the sole issue being that CSS does not see character references, those are turned by HTML into Unicode codepoints. Also, the set of character references is closed, for good reasons.[0]
* (Nit, but "block formatting context" does not mean what the author thinks it means. Flex items behave like blocks because they become blockified - BFCs solve a separate, similarly hairy issue (floats & margin collapsing.))
[0]: https://github.com/whatwg/html/blob/main/FAQ.md#html-should-...
Whether it's "inline", "block", or "inline-block" is determined by the CSS for every visual HTML element. A <div> is not a block, nor an <a> an inline, those are just the defaults and can be easily overridden.
There's no concept of sentences as a semantic element. I need to write a thoughtful rant on this sometime. You can read a mini-thesis here: https://daveon.design/about-dave-on-design.html#typography-&...
"paragraph—but"
(Also the zero-width space, U+200B, should probably have been discussed in this article.)
In HTML, ampersand entities can only be used outside of tags or inside attribute values, for example: <tag key="&value;">&abc;</tag>. Ampersand entities cannot be used as a substitute for tags; for example, &lt;tag&gt; is literal renderable text and not a tag.
This distinction is important because of odd languages like Java, where \uXXXX processing happens before tokenization. So this is a legal program: class Foo \u007B }. (U+7B is left brace.) One consequence is that to put a literal quotation mark inside a string, you can't use \u0022, but you must use \".
(btw, you/anyone have other examples for that Java-style ordering of escapes vs. tokenization? I can think of C preprocessor trigraphs [which are deprecated since C23] but nothing else…)
When I try the code they show on the page myself the space is displayed.
Great writeup though!
I don't think it's broken though, imagine if every whitespace was rendered. And how do we know what should be collapsed? I don't see a better solution that satisfies every situation.
Use <pre> to preserve the whitespace in code, for everything else use CSS.
> Newlines and tabs are also treated identically and collapsed into spaces.
This means that simply formatting your HTML -- as you might when hand-editing -- can add spaces. Possibly, something else is going on too: I'm not an expert here, I've just observed rendering differences when you have a newline between tags. Maybe it's inline vs block tags as the author also discusses?
This has bitten me a few times and my hand-written sites or SSG-generated sites that I want to look nice have several ugly, long lines that should be broken up to be readable, but are not.
In HTML, it is pretty natural to add white space (i.e. text nodes containing white space) between all elements. You basically only have to worry if you want to avoid that.
In JavaScript, the opposite is true. If you want to create a text node, you have to do so explicitly. If you just create elements and append them to the same parent, they will be added without whitespace.
I am not sure how JSX behaves in this regard. Last time I checked it was more like JavaScript than HTML, which was of course very confusing for people.
> There was a push to make HTML stricter with XHTML / But it didn’t really stick.
Yup! And XHTML still works in practice: https://www.nayuki.io/page/practical-guide-to-xhtml
> which enforced rules like case-sensitive elements
That's a good thing. I don't miss the old days (~2000) of <HTML><BODY><P> etc. It's ugly to my eyes. Moreover, even today, it's legal to write <dIv></DIv>.
> Browsers had a hard time with those stricter rules
Nonsense; it's developers and amateur web designers who couldn't cope with XHTML. Browsers parse XHTML perfectly these days, because it's just an application of XML. Also, existing tools like Macromedia Dreamweaver didn't add XHTML export support; it only outputted HTML.
> I really tried to use XHTML when creating web pages, but then I asked myself why all of the trouble when browsers don't follow the standards
Browsers do follow the standards in XHTML mode! My page documents the behaviors and the yellow screen of death. For me, it's worth the trouble because it helps me detect errors like invalid syntax and unclosed tags.
> If the next token is a U+000A LINE FEED (LF) character token, then ignore that token and move on to the next one. (Newlines at the start of pre blocks are ignored as an authoring convenience.)
This is also applied to <textarea>.
Personally, I think it was a mistake, because it complicates things and doesn’t do enough to justify itself. If it also did leading whitespace trimming across all lines, it’d be interesting enough to maybe justify itself as an authoring convenience (… though honestly I suspect that’d end up worse), but as it is it’s just an extra complication. I’ve needed to deal with the nuance of its special behaviour more than once or twice, and I’ve seen others stumble over it too. It’s also part of the fairly small pile of HTML features that make it not-round-trippable: it’s only done in parsing; the serialiser doesn’t insert an extra ␊ if it would emit `<pre>␊`.
This is one of the many cases that tempts me in the direction of the XML syntax (which, to many people’s surprise, is absolutely still a thing—save a local file with extension .xhtml, or serve over HTTP with MIME type application/xhtml+xml). The fact that XML doesn’t have a parser that guesses what you meant is generally a nice feature.
(XML also has whitespace collapsing, xml:space. Honestly it’s interesting in this context, conveying whitespace-handling intent, but I’ll ignore it. Because it’s never coming to HTML.)
But we’re stuck with this behaviour, because it would break compatibility.
And that’s where a lot of the rest of the article baffles me, because I get the general sense, from the way he presents information, that this guy doesn’t understand a lot of HTML’s history and philosophy, things I’d expect to be understood by a member of the Angular team. The suggestions made are generally just obviously not suitable for HTML, not just because of compatibility, but also because of philosophy.
You think   should be different from SPACE? Sorry, I think we’re up to about forty years since that ship sailed; entities/character references are strictly shorthand, and numbered entities are strictly code points. And do you know how confusing it would be if it worked differently? It would be a one-off special case.
You think you can add a new entity to handle this? In XML I think you might be able to do that (way too long since I’ve written a DTD to remember clearly), but in HTML they’re called character references, because that’s all they can be, and your non-collapsing space would need to be either something entirely new in the document model, or shorthand for something like <span style=white-space:pre-wrap> </span>.
> You'd think the CMS should be able to solve this problem, but it really can't.
Uh, yes it can, and they all do, where they accept plain text, by either chunking the text into HTML paragraphs (e.g. "<p>" + s/\n\n/<\/p><p>/ + "</p>"), or by turning your text line breaks into HTML line breaks (e.g. s/\n/<br>/). CMSes do a lot of dodgy stuff like this. If you want to have nightmares, look at WordPress’s wpautop function, and think through the implications of it all. It’s a radioactive wasteland of bad ideas.
It’s also rather important to remember that two line breaks in HTML (e.g. <p>A<br><br>B</p>) is not the same as a paragraph break (e.g. <p>A</p><p>B</p>). Consider margins and text-indent, for a start.
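The two conversions look roughly like this in Python. This is a minimal sketch with hypothetical function names, not what WordPress or any particular CMS actually does, and it escapes the text first so user input cannot inject markup.

```python
import html
import re

def text_to_paragraphs(text):
    """Treat blank-line-separated blocks as paragraphs."""
    blocks = re.split(r"\n{2,}", text.strip())
    return "".join("<p>%s</p>" % html.escape(block) for block in blocks)

def text_to_line_breaks(text):
    """Turn every text line break into an HTML line break."""
    return html.escape(text.strip()).replace("\n", "<br>")

print(text_to_paragraphs("A\n\nB"))    # <p>A</p><p>B</p>
print(text_to_line_breaks("A\nB & C"))  # A<br>B &amp; C
```

The first produces real paragraph elements, the second only forced breaks inside whatever element the text ends up in, which is exactly the distinction that matters for margins and text-indent.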
> How Could we Fix This?
The offered solution, “quote your strings”, is what almost all programming languages tend to do. Document languages practically never quote their strings (I can’t immediately think of any even vaguely popular ones that do). Document languages consistently default to text mode, with only markup elements requiring special syntax.
> tempts me in the direction of the XML syntax (which, to many people’s surprise, is absolutely still a thing—save a local file with extension .xhtml, or serve over HTTP with MIME type application/xhtml+xml)
I am one of the very few people who serves content in XHTML mode in practice, and also wrote a detailed guide on what it entails: https://www.nayuki.io/page/practical-guide-to-xhtml
> The offered solution, “quote your strings”, is what almost all programming languages tend to do.
Correct, except for shell scripts. In shell, by default everything is a literal string passed into the program, unless it has designators such as a prefix dollar sign. As you can imagine, this causes all sorts of escaping nightmares. And don't get me started about array handling. That's why I find myself reaching for Python instead of Bash, because it's so much easier to build sane, composable, debuggable programs in the former.
> I would genuinely make something to change it to <nobr>C++</nobr>
I didn't know about the <nobr> element. I looked it up and it's deprecated. However, it can be simulated with CSS `white-space: nowrap`. https://developer.mozilla.org/en-US/docs/Web/HTML/Element/no...
> I’m also the kind of guy who types actual no-break spaces in Bible references where the book has an ordinal, e.g. “1 John 2:3” will have one NBSP and one SPACE.
Similarly, I am one of the few people who uses NBSP between a number and a unit symbol: 10 kg.
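For anyone who wants this automated rather than typed by hand, a rough sketch: the unit list and the function name `bind_units` are assumptions for illustration, not a complete treatment of unit symbols. It replaces the ordinary space between a digit and a known unit symbol with U+00A0 NO-BREAK SPACE, and the separator can be swapped for U+202F NARROW NO-BREAK SPACE.

```python
import re

NBSP = "\u00a0"          # NO-BREAK SPACE
NARROW_NBSP = "\u202f"   # NARROW NO-BREAK SPACE

# Hypothetical, deliberately short list of unit symbols.
UNITS = ("kg", "g", "km", "m", "cm", "mm", "s", "ms", "Hz")

# A digit, one ordinary space, then a unit symbol that ends at a word boundary.
_NUMBER_UNIT = re.compile(r"(\d) (" + "|".join(UNITS) + r")\b")


def bind_units(text, sep=NBSP):
    """Replace the space between a number and a unit symbol with `sep`."""
    return _NUMBER_UNIT.sub(lambda m: m.group(1) + sep + m.group(2), text)


if __name__ == "__main__":
    print(repr(bind_units("The parcel weighs 10 kg and took 5 metres of tape.")))
```

Words that merely start with a unit letter, such as "metres", are left alone because of the trailing word boundary.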
Firstly, you don’t make it clear that XHTML is dead, dead, dead. The HTML Standard superseded it, and although it originally contained a thing still called XHTML, it eventually retired the XHTML name due in part to confusion with XHTML 1 <https://github.com/whatwg/html/pull/2062>. And so what we have instead is XML syntax for HTML (arguably also a tad misleading due to processing differences like document.write’s absence, tagName case differences and missing features like <noscript>, which feel a bit beyond syntax differences). Honestly, browsers never truly even implemented XHTML; what they implemented was strictly more accurately an XML syntax for HTML. They only validated the XML part, never the XHTML part, and allowed HTML features absent from XHTML. By now, I’d call even XHTML 1.0 or 1.1 validation fairly worthless, given the further divergences in the HTML Standard—though I don’t know what there may be in the way of XML-syntax HTML validation to replace it.
Anyway, I’d suggest talking about HTML and XML syntax instead, since that’s been the accepted terminology for the last eight years; or at the very least a note to some such effect at the beginning of the article.
Secondly, about boolean attributes there’s another option you don’t mention: setting the attribute to an empty string, e.g. checked="", which is what empty attribute syntax is equivalent to these days. The HTML 4 and XHTML 1 DTDs specified `checked (checked) #IMPLIED`, meaning `checked="checked"` was the only valid value, with the minimised form `checked` as equivalent¹, and `checked=""` invalid, I believe. But in practice, I believe literally no one implemented it the SGML way, and so `.getAttribute("checked")` gave you an empty string on <input checked>… though serialising the element (.innerHTML of parent) in at least Firefox 4.0² turns checked="" into checked="checked".
Thirdly, it’s possibly worth noting that <noscript> doesn’t work in XML syntax.
Fourthly, maybe not really worth noting, but XML syntax isn’t actually proper XML any more: the <template> tag needs special-casing in the parser <https://html.spec.whatwg.org/multipage/xhtml.html#templateTa...>.
¹ SGML is weird. It’s not omitting the attribute value like any sane person would expect, but rather the attribute value. Given <!ATTLIST alpha beta (gamma) #IMPLIED> (“element alpha has an optional attribute named beta which has only one legal value, gamma”), the minimised form of <alpha beta="gamma"> is <alpha gamma>; <alpha beta> would be invalid. So I guess XHTML was mildly backed into a corner on the syntax. I doubt even a single user agent ever actually implemented it the SGML way, rather special-casing it; it’s documented that many HTML user agents didn’t actually support <input checked="checked">, only <input checked>.
² Downloaded from https://ftp.mozilla.org/pub/firefox/releases/ and it Just Worked. Unfortunately, I’m on x86_64 Linux without multiarch i686 set up, and 4.0 was the oldest with x86_64 builds. Shouldn’t be hard to get 1.0 running, but my curiosity wasn’t quite great enough for the effort. I was worried 4.0 would be too new, but this "" → "checked" serialisation suggests it was still weird, so ancient browsers probably do much the same.
—⁂—
Responses to other parts of your comment:
• Shell scripting was the main exception I had in mind, for unquoted strings; and they’re something you have to be careful with. Perl has barewords; and they’re typically mostly disabled as a footgun.
• I had completely forgotten <nobr> was deemed non-conforming! One of these days I should investigate more thoroughly why, and probably appeal for it to be revived and formalised. It’s a useful and semantically-valuable thing, exactly as much as <wbr>, which I think had exactly the same “this wasn’t even in any W3C HTML spec” problem as <nobr>. In the mean time, I will continue to use it from time to time, as appropriate.
• I like to type NARROW NO-BREAK SPACE between a number and a unit, if I reckon a little spacing is called for (normally I don’t).
> XHTML is dead / retired the XHTML name / XML syntax for HTML
It's complicated. XHTML 1.0 certainly existed, I coded to that standard, and I used W3C's validator. XHTML 1.0 is basically HTML 4.01 modified to fit XML syntax. XHTML 2 was drafted but never implemented as far as I knew. XHTML 1.0 code is still forward compatible with the current so-called XHTML5. In that sense, XHTML is not dead in name or in implementation, and it's just the trunk of history whereas XHTML 2 is a dead branch. I would prefer to continue using the XHTML name, despite what you legitimately pointed out regarding "XML syntax for HTML".
> boolean attributes
I never knew about that complicated behavior. Thanks for the introduction; I'll have to look further into it.
> <noscript> doesn’t work in XML syntax
It works perfectly fine and some of my pages have examples. I'm not even using CSS to hide the element; I'm solely relying on the browser's default behavior. https://www.nayuki.io/page/calculate-gcd-javascript
> <template> tag needs special-casing in the parser
I didn't know this and it saddens me that the <template> DocumentFragment behavior breaks the uniformity of the XML DOM.
> SGML is weird
It is indeed, just from my cursory study of it. I only know enough SGML to point out that it's weird and then give a few links to more information. I only mentioned SGML in my article to explain where the weirdness of HTML came from.
> It’s not omitting the attribute value like any sane person would expect, but rather the attribute value.
Looks like you misspelled something there, having repeated the same phrase.
> the minimised form of <alpha beta="gamma"> is <alpha gamma>; <alpha beta> would be invalid
I did not expect that, and it doesn't seem to match the behavior of HTML.
> why <nobr> was non-conforming
To me it seems to have a presentation quality like <b>, <i>, <u>, <tt>, <center>, etc.
> exactly as much as <wbr>
I think there's no way to design <wbr/> to be expressed in CSS because it really is a single point in the text rather than a span.
> The noscript element is only effective in the HTML syntax, it has no effect in the XML syntax. This is because the way it works is by essentially "turning off" the parser when scripts are enabled, so that the contents of the element are treated as pure text and not as real elements. XML does not define a mechanism by which to do this.
That’s the sense in which I say it doesn’t work. In the way people most commonly use it, to add a “JS disabled so this won’t work” message in the body, you’re saved by the user agent stylesheet including `@media (scripting) { noscript { display: none !important; } }`. But if you try doing things like varying styles via <noscript>, that won’t work. And that’s why <noscript> is (nominally, as ever) disallowed in XML syntax.
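The parsing half of this can be seen with any generic XML parser. A small sketch using Python's standard library (not a browser, so it only illustrates the XML side; the stylesheet name `nojs.css` and the helper name are made up): the contents of <noscript> come out as real child elements, because XML has no way to switch the parser into a "treat this as text" mode.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: a stylesheet meant to apply only when scripting is off.
SNIPPET = '<noscript><link rel="stylesheet" href="nojs.css"/></noscript>'


def noscript_children(markup):
    """Parse `markup` as XML and return the tag names of the root's children."""
    root = ET.fromstring(markup)
    return [child.tag for child in root]


if __name__ == "__main__":
    # The <link> is a parsed element in the tree, not inert text.
    print(noscript_children(SNIPPET))
```

Hiding the element with `display: none` does not undo that: the <link> has already been parsed as an element.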
• As for <nobr> seeming presentational: it can be used presentationally, but so can <strong>, and often is. If I think to places where I’ve used it, most of the time perception of the content would have been materially harmed by line breaks. And that’s the real test of whether something is presentation; framed another way, in a Reader Mode, should it remain or disappear? I expect inline styles and spans may be stripped by a Reader Mode, including <span style="white-space:nowrap">, but I say <nobr> should remain. Drawing connections to Unicode may be helpful too; most things there are Unicode characters for, if there are HTML things for they should be preserved as likely content semantics: I wouldn’t expect a Reader Mode to mangle my Unicode NO-BREAK SPACE to SPACE or my NARROW NO-BREAK SPACE to THIN SPACE or SPACE—they can be presentational (or even stenographic!), but they’re used for content semantics. And <nobr> is content semantics.
• <wbr> can be expressed in CSS easily: <span style="display:inline-block"></span>.
C⁠+⁠+
You can make this easier to type by defining an entity in XHTML:
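The original example is not preserved here, so the following is only a guess at the general shape, not the commenter's actual markup. It assumes the joining character is U+2060 WORD JOINER and uses a made-up entity name, `cpp`, declared in the document's internal DTD subset. Python's standard XML parser stands in for a browser to show the entity expanding:

```python
import xml.etree.ElementTree as ET

WORD_JOINER = "\u2060"  # assumed to be the invisible character in "C++" above
XHTML_NS = "http://www.w3.org/1999/xhtml"


def unbreakable(token):
    """Insert WORD JOINER between every pair of characters in `token`."""
    return WORD_JOINER.join(token)


# Hypothetical entity `cpp`, declared in the internal DTD subset.
DOC = """<!DOCTYPE html [
  <!ENTITY cpp "C&#x2060;+&#x2060;+">
]>
<html xmlns="http://www.w3.org/1999/xhtml"><body><p>I write &cpp; daily.</p></body></html>"""


def first_paragraph_text(document):
    """Parse `document` as XML and return the text of its first XHTML <p>."""
    root = ET.fromstring(document)
    return root.find(".//{%s}p" % XHTML_NS).text


if __name__ == "__main__":
    print(repr(first_paragraph_text(DOC)))
    print(repr(unbreakable("C++")))
```

This only works in XML syntax; the HTML parser ignores entity declarations in the doctype.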
But meanwhile, though we'd all love to see the plan, let's stick with the mess we're used to.