BLUG: SGML Geekery


The odd bits of HTML you never knew existed, and why you should only use XHTML 1.0 if you have to.

There are several current versions of HTML, including

There is a school of thought that says the XHTML 1.0 flavours are better than their HTML 4.01 counterparts. This isn't really true; there is nothing that can be expressed in XHTML 1.0 that cannot be expressed in the corresponding flavour of HTML 4.01. In particular, XHTML 1.0 Strict is no more and no less reliable a data source than HTML 2.0, assuming both are valid. The theoretical advantage of XHTML over HTML here is that XHTML should not be processed if it is blatantly wrong (that is, not well-formed), whereas HTML should be processed anyway. This is only a theoretical advantage because most XHTML is currently served as text/html, which means it is treated as HTML by browsers. Or at least it should be treated as HTML. What browsers actually do with it is as unpredictable as what any browser does with anything.

Unless you are doing something that requires XML, it is hard to make a case for using XHTML 1.0 over HTML 4.01. As it happens, I prefer the XHTML empty element syntax over the HTML way, but that is purely personal preference. It can even be argued that serving XHTML as text/html is harmful because that means it is processed as old-style HTML and you lose the benefits of XML's lack of error tolerance.
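
For anyone who hasn't met the two syntaxes side by side, here is a minimal sketch of the difference. The file name logo.png and the text are made up for illustration; only the way the empty elements are written matters.

```html
<!-- HTML 4.01: an empty element is a start tag and nothing else -->
<p>First line<br>
second line <img src="logo.png" alt="Logo"></p>

<!-- XHTML 1.0: the same elements, written with the XML empty element syntax -->
<p>First line<br />
second line <img src="logo.png" alt="Logo" /></p>
```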

HTML allows various forms of SGML markup minimization, such as certain end tags being optional. Using these techniques can cause problems for some browsers. XHTML, being XML, doesn't have any markup minimization. This is something in XHTML's favour, but any HTML document on the web should be fully normalised anyway. More of that in a moment.

CSS works better - or at least more predictably - with XHTML 1.0 than HTML 4.01. CSS works best when there is a clear parse tree [1]. XHTML, being XML, doesn't have any of the fun markup minimization options that HTML does. This makes the parse tree much clearer to the author, and hence the CSS is easier to write. XHTML 1.0 Strict and HTML 4.01 Strict work better with CSS than their transitional counterparts because the transitional counterparts have all sorts of presentational guff in them. These presentational things interact strangely with CSS.

The real fun in CSS starts when you are applying CSS to invalid HTML. The browser has to guess what was meant, and hence the parse tree is unpredictable.

Just as with an HTML 4.01 document, you can put divs and spans more or less anywhere in the body of an XHTML document, and every element can be styled using classes, ids, or (shudder) the style attribute [2]. What you should always do with HTML is mark up the document according to its logical structure, regardless of how it looks. You should then style the page with CSS. If you get the structure right - which may require some divs and spans [3] - you may even be able to get away with a stylesheet that uses only contextual selectors. Which is so cool nitrogen condenses on it.
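
As a rough sketch of what a stylesheet built on contextual selectors might look like, here is a small HTML 4.01 Strict page with an embedded stylesheet. The content and the rules are invented for illustration; the point is only that nothing in the body needs a class, an id or a style attribute.

```html
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<title>Contextual selectors</title>
<style type="text/css">
/* Illustrative rules: no class or id selectors anywhere */
body { font-family: sans-serif; }
/* Matches only paragraphs that sit inside a blockquote */
blockquote p { font-style: italic; }
/* Matches only emphasis inside a list item inside an unordered list */
ul li em { font-weight: bold; }
</style>
</head>
<body>
<h1>Notes</h1>
<p>An ordinary paragraph, untouched by the blockquote rule.</p>
<blockquote><p>A quoted paragraph, styled because of where it sits.</p></blockquote>
<ul>
<li>An item with <em>emphasis</em>.</li>
</ul>
</body>
</html>
```

If the logical structure changes, the styling follows it without anyone having to hunt down class names.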

I've mentioned markup minimization and normalization. They deserve some further explanation. In SGML it is possible to specify that some start tags and some end tags are optional; they can be inferred unambiguously from the context. The classic example is:

<html>
<head><title></title>
<body>

There is no closing head tag, but the head element is closed, because the DTD makes it clear that the body element cannot be a child of the head. There are plenty of other similar things as well, including my personal favourite, the NET-enabling start tag and the null end tag (or NET). Basically,

<foo/bar/

is exactly equivalent to

<foo>bar</foo>

An SGML document is normalized if all the markup minimization has been removed. Normalizing a document can be done completely automatically, using the sgmlnorm utility in the OpenSP suite.
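
To make that concrete, here is the head and body fragment from earlier with the omitted end tags written back in. This is normalized by hand, not sgmlnorm output, so the tool's exact formatting may differ. The body is left empty to match the original fragment, which means this is still a fragment and not a complete valid document.

```html
<html>
<head><title></title></head>
<body></body>
</html>
```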

Combining these, we can make a really short HTML 4.01 document:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN"><title//<p//

There might be a small prize for the first person to describe what elements are present in that document and why.

The more extreme forms of minimization - like the above - confuse all mainstream browsers. The more normal forms of markup minimization - such as the 'missing' </head> above - probably don't confuse modern browsers, but they should be avoided anyway because most browsers are very bad at coping with anything even slightly complex.

[1] Except in Netscape 4, where CSS works best when it is turned off.

[2] There is only one time when it is acceptable to use the style attribute in production code: when the building is on fire and you must finish the page before evacuating but you can't open a second editor to add an id selector to the stylesheet and the flames are lapping round your keyboard. Even then you must remove the style attribute as soon as you reach a place of safety. The style attribute is more evil than the font element and the center element combined, and using it makes maintenance a nightmare, and don't get me started on debugging pages with style attributes and stylesheets. In fact, if you are the kind of person who uses the style attribute in production code, it was probably me who set the building on fire when you were inside.

[3] The average number of spans per page should be much less than one. Whereas there is no way other than the div element to group block elements, grouping inline elements (or splitting them) can normally be accomplished by using a more appropriate element.

By Andrew McFarland
Created: Fri, 8 Nov 2002 14:42:22 +0000
Amended: Fri, 8 Nov 2002 14:56:17 +0000