The purported easiness of well-formedness

When I started learning HTML around 2002 I was told that HTML was on the way out and would eventually be replaced by a new version known as XHTML:

If you want your site to work well in today’s browsers and non–traditional devices, and to continue to work well in tomorrow’s, it’s a good idea to author new sites in XHTML…

Jeffrey Zeldman

XHTML is HTML reformulated to adhere to the XML standard. It is the foundation language for the future of the Web.

Musciano and Kennedy

It turns out this never really came to pass. XHTML was quite popular for a time but has lately fallen out of favour. For example, W3Techs offers data showing that XHTML usage peaked in 2012 when it was used in over 65% of the websites that they surveyed. Today their data shows that it is used in under 7% of websites.

XHTML has the interesting feature that it is based on XML, and XML processors are required to handle syntax errors incredibly strictly:

…if your document contains a parse error, the entire document is invalid. That means if you bank on XHTML and make a single typo somewhere, nothing at all renders. Just an error.

This sucked. It sounds okay on the face of things, but consider: […] generating it dynamically and risking that particular edge cases might replace your entire site with an unintelligible browser error? That sucks.

Eevee

Back when XML was being standardized a few people thought this was not a good idea, and the issue was hotly contested. In this post I want to consider one of the claims put forward by those in favour of XML’s lack of error-handling: that it is very easy to make well-formed XML documents. This was repeatedly stated as a point in favour of requiring no error recovery in XML:

Well-formedness should be easy for a document to attain.

Tim Bray

We went to a lot of work to make well-formedness easy. It is a very low bar to get over […] the standard required to achieve reliable interoperability is so easy to explain and to achieve.

Tim Bray

well-formedness is so easy that it isn’t a significant burden on anyone

Tim Bray

No information provider who does even the most cursory checking will publish non-WF docs […] no user will ever be in the position that he can’t see an “interesting” doc just because it’s non-WF, because there won’t be any

Tim Bray

Anyone who can’t make a syndication feed that’s well-formed XML is an incompetent fool.

Tim Bray

Even back then not everyone was convinced about the easiness of producing well-formed XML:

the argument seems to be, don’t worry. Since most if not all XML documents will be machine generated they will all be well formed. I don’t buy it! Programmers are human to and make as many errors as prose authors.

Dave Hollander

Anyone who has a single error in his document is a bozo? Ahem. I don’t buy any of this.

Terry Allen

I like the concept of WF very much, but I’m by no means confident that what goes towards WF in XML really meets my intuitive notion. Indeed, I believe that WF in XML may not be quite as easy to achieve as it’s made out.

Arjun Ray

These kinds of arguments seem hard to settle one way or the other. Bray says that XML well-formedness is easy to obtain. Others disagree. No hard evidence is offered either way, though Bray has four rules that he claims are easy and enforce XML well-formedness. Even if we assume the rules are easy I’ve learned that “easy” is not an intrinsic property; what is easy to one person is not at all easy to another. It would seem as if there is no way to resolve the issue of how easy well-formedness really is.

However, this discussion took place in 1997. We no longer have to speculate about how easy it is to author XML: we have the benefit of being able to look back at history and see what actually happened. The XML specification was published as a W3C recommendation in 1998. XHTML reformulated HTML in XML and was published as a W3C recommendation in 2000. With over two decades of XHTML documents published, we don’t need to argue about how hard or easy it is to obtain well-formedness; we can determine how hard it is in practice.

Now, in most cases XHTML documents are likely parsed by a browser as HTML and not as XML. This means that well-formedness errors would not be shown on the page because HTML parsers are lenient. Nevertheless, an XHTML document is—by virtue of being XHTML—also an XML document, regardless of how it is parsed. Those authoring XHTML are therefore also subject to the well-formedness rules of XML. Of course, that shouldn’t be a burden for most documents if well-formedness is such a low bar.

Since most modern websites no longer use XHTML, I collected data from the Wayback Machine archive. I took the list of the top websites in the world published by Alexa at the end of 2009 and downloaded the homepage of each of the top 200 sites as it appeared on January 1, 2010 (or the closest date available in the Wayback Machine). I was able to retrieve data for 195 of the 200 sites. Of those, 103 used some form of XHTML: 81 used XHTML 1.0 Transitional, 21 used XHTML 1.0 Strict, and 1 (bet9ja.com) used XHTML+RDFa 1.0.

I then checked each document for well-formedness using xmlwf. 11 of the 81 XHTML 1.0 Transitional sites were well-formed, as were 7 of the 21 XHTML 1.0 Strict sites and the single XHTML+RDFa 1.0 document. That’s right: 84 of the 103 XHTML sites, or 82%, were ill-formed.
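For readers who want to try a similar check, here is a minimal sketch in Python. It is not the script used for the survey above: it assumes Python's `xml.parsers.expat` module (a wrapper around Expat, the parser that xmlwf is built on) is an acceptable stand-in for running xmlwf itself. The names `is_well_formed`, `RESULTS` and `ill_formed_percentage` are made up for illustration. The counts are copied from the paragraph above.

```python
import xml.parsers.expat


def is_well_formed(document: bytes) -> bool:
    """Return True if expat parses the whole document without an error."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(document, True)  # True marks this as the final chunk
    except xml.parsers.expat.ExpatError:
        return False
    return True


# (well-formed, total) per doctype, as reported in the text above.
RESULTS = {
    "XHTML 1.0 Transitional": (11, 81),
    "XHTML 1.0 Strict": (7, 21),
    "XHTML+RDFa 1.0": (1, 1),
}


def ill_formed_percentage(results: dict[str, tuple[int, int]]) -> float:
    """Percentage of surveyed documents that failed the well-formedness check."""
    well_formed = sum(ok for ok, _ in results.values())
    total = sum(n for _, n in results.values())
    return 100 * (total - well_formed) / total


if __name__ == "__main__":
    print(is_well_formed(b"<p>fish &amp; chips</p>"))  # escaped ampersand
    print(is_well_formed(b"<p>fish & chips</p>"))      # bare ampersand
    print(f"{ill_formed_percentage(RESULTS):.1f}% ill-formed")
```

Plugging in the counts gives 84 ill-formed documents out of 103, which rounds to the 82% quoted above.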

Regardless of how low a bar you consider well-formedness, the fact of the matter is that most webpages didn’t meet that bar in 2010, when XML had already been out for over a decade. And these weren’t pages designed by clueless developers, either. Just imagine how much effort goes into the development of a top-200 webpage!

I wasn’t able to find much previous work on well-formedness in practice, but a 2005 post on the Google Reader blog did something similar for RSS and Atom feeds, which are typically sent to browsers as XML. That post tabulated the top 22 separate errors that prevented feeds from being well-formed and estimated that 7% of all feeds had at least one of those errors.

Hence, Bray’s prediction that “there won’t be any” non-well-formed XML documents was unrealistically optimistic. On his blog, Bray also goes as far as calling anyone who doesn’t produce well-formed XML an “incompetent fool”. Note that Bray is the co-editor of the XML specification and one of the foremost XML experts in the world. He is, perhaps more than anyone else in the world, in a position to publish well-formed XML documents and, naturally enough, his blog is published in XHTML 1.1. Amusingly, the post claiming those who don’t create well-formed XML are incompetent contains an unescaped & and as a result is itself not well-formed.

<p>By the way, it doesn’t make any difference whether the ill-formedness is grossly-missing tags as above or a single unescaped <code>&</code>; in these kinds of apps, if it isn’t XML, this is evidence of serious breakage.</p>

Irony: forgetting to escape the & in a blog post claiming that a single unescaped & is incompetence.
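To see why a lone & is fatal, here is a small Python sketch that feeds a shortened, ASCII-only version of the sentence above to Expat (via `xml.parsers.expat`) and reports where parsing stops. The helper `first_error` and the `BROKEN` and `FIXED` strings are mine, for illustration only. The repaired version escapes only the text content, using `xml.sax.saxutils.escape`.

```python
import xml.parsers.expat
from xml.sax.saxutils import escape


def first_error(markup: str):
    """Return (line, column, message) for the first parse error, or None."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(markup, True)  # True marks this as the final chunk
    except xml.parsers.expat.ExpatError as err:
        message = xml.parsers.expat.ErrorString(err.code)
        return (err.lineno, err.offset, message)
    return None


# Shortened version of the quoted sentence, with the bare ampersand intact.
BROKEN = "<p>a single unescaped <code>&</code>; in these kinds of apps</p>"

# Same sentence, but the text inside <code> is escaped before being inserted.
FIXED = (
    f"<p>a single unescaped <code>{escape('&')}</code>;"
    " in these kinds of apps</p>"
)

if __name__ == "__main__":
    print(first_error(BROKEN))  # a (line, column, message) tuple
    print(first_error(FIXED))   # None: the escaped version parses cleanly
```

The difference between the two strings is four characters (`amp;`), which is exactly the kind of detail that is easy to lose in a template change or a software upgrade.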

In the blog post Bray claims to have run the post through an XML checker and in fact uses this very example to argue why well-formedness is so easy to achieve in practice. But his blog post was written nearly twenty years ago, and in that time there have undoubtedly been many server upgrades, edits to posts, software updates to his blogging scripts, etc. Any number of subtle and nearly imperceptible changes could introduce a well-formedness error, and indeed it seems that the page has been ill-formed for years.

I think this underscores just how poor humans are at technical details like well-formedness. Yes, the rules might seem simple in the abstract, but what about a minor typo you make two decades from now as a part of a routine update? What about the thousands of edge cases you didn’t consider at first? The unfortunate lesson that we can take from the history of computing is that bugs fester in even the very simplest programs. Yes, the rules may be easy—but that makes them deceptive if you need to get them exactly right.