HTML vs XHTML: Rumble in the Jungle

A recent post on the MODx forums got me thinking a bit about some of the controversy surrounding web standards. The main question is this: Which specification should most developers use? HTML or XHTML? The answer you’ll get is mixed because, quite frankly, neither specification is being used as it was intended.

Proponents of each specification make many claims as to why each specification is flawed. Some will claim that XHTML isn’t fully supported by any version of Internet Explorer and thus should not be used. Others will claim that HTML is stale and isn’t as semantically correct as XHTML. For each developer you talk to you’re going to get a different set of complaints on each side of the fence.

I’m a web standards advocate and I’ll plainly admit it. One link provided by a poster on the MODx forums was to an article called HOWTO Spot a Wannabe Web Standards Advocate. I found this article to be quite humorous because, as the poster said, “there’s a lot of ignorance and hype” around this topic, which is certainly the case in this article. Its author seems to be blindly attacking people who support web standards rather than addressing the real issues. So before I go into what the real problems are with web standards, let’s take the Pepsi Challenge and see if I’m a match, a “Wannabe Web Standards Advocate” if you will:

Talks about the importance of the alt tag.

What can I say…you’re right, there is no alt tag. It’s an attribute. Move on.

Claims <b> and <i> are deprecated.

If you look at the HTML 4.01 spec it states that although they are not all deprecated, their use is discouraged in favor of style sheets. The HTML 5 spec goes on to say that the i element should be used as a last resort when no other element is more appropriate and that style sheets can be used to format i elements, just like any other element can be restyled. Thus, it is not the case that content in i elements will necessarily be italicized. So, yeah, to claim that these are deprecated isn’t entirely correct. The word “discouraged” is more accurate.

And spells it “depreciated”.

We’re human. We make mistakes. Even spellcheck doesn’t always work. Have you ever misspelled a word?

Uses <span style="font-style: italic;">, because <i> is presentational.

That’s because the use of <i> is presentational. <i> means italic, which describes not the intention behind the markup but rather how the word or set of words is to be displayed.

Wants software to use <em> and <strong> when the UI says italic and bold.

The use of italic and bold is a convention used in user interfaces, NOT markup code. The intended purpose of HTML was to describe data, NOT what it is supposed to look like. Using UI conventions to describe data isn’t a step in the right direction and isn’t proper semantics.

<em> and <strong>, on the other hand, are better because they describe the data and thus are far more semantically correct than the bold and italic equivalents. <em> denotes emphasis and <strong> denotes stronger emphasis. The default presentation of these elements ended up being italic and bold respectively.

Even Tim Berners-Lee, the father of the World Wide Web, has talked at great length about what he calls the semantic web and how HTML documents and the like are supposed to describe the data. Who am I to argue with him?

Marks up quoted text as <cite>.

Yeah, that would be incorrect, wouldn’t it? A <cite> is supposed to contain a citation or a reference to other sources. On the other hand, both <q> and <blockquote> can be used for quotes, depending on whether the quote is inline or a block of its own.

Complains about upper-case tags in HTML.

HTML may not be case-sensitive but XHTML is. Personally, when it comes to coding markup, JavaScript, and server-side code, I think it’s better to stick with one coding standard. Having a mix of upper-case and lower-case tags in your markup is sloppy at best. So, yeah, it’s better to stick with lower-case tags to promote consistency and avoid errors in JavaScript code simply because someone didn’t write proper markup.
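To make the case-sensitivity point concrete, here is a minimal sketch using two parsers from Python’s standard library as stand-ins for a browser’s HTML and XML parsers (that substitution is my assumption; browsers use their own parsers). The helper names html_start_tags and xml_tags are made up for this illustration.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class TagCollector(HTMLParser):
    """Record the start-tag names the HTML tokenizer reports."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        # html.parser hands us the tag name already lower-cased.
        self.start_tags.append(tag)


def html_start_tags(markup):
    """Hypothetical helper: tag names as an HTML parser sees them."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.start_tags


def xml_tags(markup):
    """Hypothetical helper: tag names as an XML parser sees them."""
    return [element.tag for element in ET.fromstring(markup).iter()]


mixed_case = "<DIV><P>Hello</P><p>World</p></DIV>"

print(html_start_tags(mixed_case))  # ['div', 'p', 'p']
print(xml_tags(mixed_case))         # ['DIV', 'P', 'p']
```

The HTML side folds everything to lower case, while the XML side treats P and p as two different element names, which is why mixed-case markup is harmless in HTML but a real problem once a document is parsed as XML.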

Claims XHTML 1.0 is more semantic than HTML 4.01.

Claims XHTML 1.0 is more structured than HTML 4.01.

Claims XHTML 1.0 is less presentational than HTML 4.01.

Both HTML and XHTML are virtually identical when it comes to the HTML specific tags available. One isn’t necessarily more semantic or less presentational over the other. How a coder decides to use the tags is what makes code semantic or not.

However, the claim that XHTML 1.0 is more structured than HTML 4.01 is actually true. Since XHTML follows the same syntax rules as XML, XHTML documents are required to be well-formed and thus are more structured than HTML. More on that in a sec.
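As a rough illustration of what “well-formed” buys you, the sketch below feeds two fragments to Python’s standard XML parser. The is_well_formed helper is a name I made up, and a generic XML parser only checks well-formedness; it is not a substitute for validating against the XHTML DTD.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup):
    """Hypothetical helper: True if an XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


# Legal in HTML 4.01, where the </p> end tag is optional.
unclosed = "<body><p>This is a test paragraph<p>This is another one</body>"
# The same content with every element explicitly closed.
closed = "<body><p>This is a test paragraph</p><p>This is another one</p></body>"

print(is_well_formed(unclosed))  # False
print(is_well_formed(closed))    # True
```

An XML parser stops at the first unclosed element instead of guessing what the author meant, which is the stricter behavior XHTML inherits.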

Claims browsers parse XHTML served as text/html faster than they parse HTML.

I don’t believe this is true provided that an equivalent HTML document is written to be as well-formed as an XHTML document. HTML is pretty forgiving and thus if your document is riddled with unclosed tags then that could potentially cause a browser to load the page a bit slower than the XHTML equivalent due to the extra processing needed to interpret the HTML code properly. However, the same could be said about XHTML documents.

Refers to “the benefits of XHTML” without specifying what the benefits are.

There’s a good SitePoint forum post that covers some of the frequently asked questions about XHTML vs HTML. There are some differences, but there is one key difference that I think makes XHTML better. Since XHTML requires that documents be well-formed, validation has to be more thorough and thus code errors are spotted much more easily. I’ll illustrate this towards the end of this post.

Uses large XHTML 1.0 Transitional documents with table layouts while claiming enhanced compatibility with handheld devices thanks to XHTML.

“Future proofs” a site by migrating from HTML 4.01 Transitional to XHTML 1.0 Transitional and keeps serving it as text/html with all the old JavaScript scripts in place.

These are hybrid approaches and, honestly, probably shouldn’t be used anymore. A hybrid approach to using a transitional doctype with some table-based layout elements was used primarily with sites that needed to be transitioned over but the underlying code couldn’t be completely rewritten yet. For sites done from scratch, I wouldn’t even consider a hybrid approach. Browser support is much, much better these days and thus a hybrid approach is no longer valid.

Uses the XML empty element notation on pages that are supposed to be HTML pages.

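I’m not sure I follow you here. Both the HTML and XHTML specifications define the same empty elements: area, base, basefont, br, col, frame, hr, img, input, isindex, link, meta, param. The difference is in how they are written: HTML 4.01 simply omits the end tag (<br>), while XHTML requires the XML empty-element notation (<br />).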

Now if you’re talking about the practice of having an element with nothing in it (i.e. <div></div>), I can sort of understand it. Why would you need to have an empty element in your code when you can simply insert it on the fly with JavaScript?

Complains about doctypeless application/xhtml+xml or SVG documents and smugly points to validator.w3.org.

The W3C clearly states that every XHTML document needs both a doctype declaration ahead of the root element and an xmlns declaration on the html element for it to be considered valid. Leaving out the doctype means that you run the risk of a browser working in quirks mode, which may not render the page as it was intended.

The W3C Validator is nothing more than a tool that allows people to validate their code to ensure there are no errors. A simple check can and will reveal problems like a missing doctype or xmlns declaration, which is why everyone smugly points people to this site.

Claims all tables are evil.

Tables aren’t evil if they are used properly. Tables are meant for tabular data, NOT presentation purposes. Even the specification says this about tables:

“Tables should not be used purely as a means to layout document content as this may present problems when rendering to non-visual media.”

So, yeah, if you’re using tables for layout purposes then you’re doing evil.

Advocates pixel-based absolute CSS positioning as the righteous replacement for evil tables.

The use of CSS doesn’t automatically mean pixel-based absolute positioning. You can use floats and static widths and heights instead of absolutely positioned layers. CSS is for the presentation layer of a properly written HTML/XHTML document. A raw, unstyled HTML document will look OK on any device in any browser. Adding CSS allows the look of your document to degrade gracefully in any browser. Plus, you can easily change the look and presentation of a site simply by switching out the CSS on the fly, something you can’t do easily with table-based layouts. If you’re still using tables for design and presentation then good luck getting it to work on a variety of browsers and devices.

Changes //EN at the end of the public identifier in the doctype to the language code of the language the page is written in.

Omits the namespace declaration in XHTML or SVG and claims it is OK, because it validates.

Man, that is pretty stupid, isn’t it, considering doctypes and namespaces are explicit? After all, if you don’t declare the doctype and namespace properly then the browser won’t recognize the document for what it is.
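Here is a small sketch of why the namespace matters once a document is treated as XML. It uses Python’s standard XML parser, which, like any namespace-aware XML parser, identifies an element by its namespace plus its local name. The root_is_xhtml helper is a made-up name for this example, and what a given browser does with a namespace-less document is outside what this sketch shows.

```python
import xml.etree.ElementTree as ET

# The namespace URI the XHTML 1.0 specification assigns to XHTML elements.
XHTML_NS = "http://www.w3.org/1999/xhtml"


def root_is_xhtml(markup):
    """Hypothetical helper: is the root element html in the XHTML namespace?"""
    root = ET.fromstring(markup)
    # ElementTree spells namespaced names as "{namespace-uri}localname".
    return root.tag == "{%s}html" % XHTML_NS


with_ns = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<head><title>Test</title></head><body/></html>"
)
without_ns = "<html><head><title>Test</title></head><body/></html>"

print(root_is_xhtml(with_ns))     # True
print(root_is_xhtml(without_ns))  # False
```

Without the xmlns declaration the root is just an element that happens to be called html, with nothing tying it to XHTML.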

Serves documents written using a home-grown XML vocabulary along with an XSLT transformation to HTML to browsers instead of serving HTML, because XML is more semantic.

Uh, what? If you’re serving XML with an XSLT transformation then it isn’t HTML anymore. That’s the whole point of using a strict XML doctype and namespace with XHTML. Now, if you require your document to be parsed as an XML-based document then, yeah, that would be OK provided you don’t need Internet Explorer to parse it. People who use strict XHTML documents usually do it for very specific, valid reasons. But for someone to do it simply because they think it makes it more semantic, yeah, that’s stupid indeed. Using an XHTML 1.0 Strict doctype isn’t such a bad thing, though, even when you’re using it with a “text/html” MIME type.

So, am I a “Wannabe Web Standards Advocate”? Nope, and neither is the guy who wrote this article. I’ve taken the time to understand what web standards are really all about and what the specifications for each standard really say. To not educate yourself on the issues and claim yourself to be any kind of advocate is a great disservice to yourself and others.

And, I’ll be honest, I still have a lot to learn. There is much to learn about the current HTML 5 drafts as well as what all of the current HTML and XHTML specs have to say about certain practices.

Here’s the real problem with HTML: it’s too forgiving. It doesn’t require that you explicitly close tags, and most browsers will render HTML just fine even when everything isn’t perfect with the code. I’m not 100% sure, but I would assume that HTML code that isn’t well-formed could potentially cause issues in scripts that rely on the DOM to function properly. This is one area that I would really like to test.

As with most script and programming languages, it’s good practice to use well-written, elegant code. The main reason is that it helps eliminate typical mistakes made that cause errors. It also makes it easier to validate the code. The problem with HTML doctypes being too forgiving is that most of the validators out there allow for simple mistakes in the code. For example, consider the following code:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" 
    "http://www.w3.org/TR/html4/strict.dtd">
<html>
    <head>
        <meta http-equiv="Content-Type"
            content="text/html; charset=iso-8859-1">
        <title>Test</title>
    </head>
    <body>
        <p>This is a test paragraph
        <p>This is another one
    </body>
</html>

This is perfectly legitimate HTML code and if you run it through the W3C Markup Validator (http://validator.w3.org/) you’ll see that it passes with flying colors. But what about those paragraph tags? Shouldn’t they be properly closed? Would code like this cause issues with being able to properly parse the DOM for the paragraphs? What about search engines, spiders, document readers for the blind? To me, this is a sloppy way to code and doesn’t promote the kind of standardized code that’s possible with XHTML. Again, I’d really like to run this through the grinder with various script libraries to see if it can cause potential issues with parsing the DOM.
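As a first, very rough pass at that test, the sketch below runs the markup above (minus the doctype and meta lines) through the event-based HTML parser in Python’s standard library and counts what it reports. The TagCounter class and count_tags helper are names I made up, and this parser is only a tokenizer: it does not build a DOM the way a browser does, so it says nothing definitive about how script libraries would behave.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count the start and end tags that actually appear in the markup."""

    def __init__(self):
        super().__init__()
        self.starts = Counter()
        self.ends = Counter()

    def handle_starttag(self, tag, attrs):
        self.starts[tag] += 1

    def handle_endtag(self, tag):
        self.ends[tag] += 1


def count_tags(markup):
    """Hypothetical helper: return (start-tag counts, end-tag counts)."""
    counter = TagCounter()
    counter.feed(markup)
    counter.close()
    return counter.starts, counter.ends


sample = """<html>
    <head>
        <title>Test</title>
    </head>
    <body>
        <p>This is a test paragraph
        <p>This is another one
    </body>
</html>"""

starts, ends = count_tags(sample)
print(starts["p"], ends["p"])  # 2 0
```

Two paragraphs open and none ever closes, so anything that consumes this markup has to infer where each paragraph ends. HTML 4.01 does define the </p> end tag as optional, so a conforming parser can work it out, but the markup itself never says so.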

Although the use of XHTML with a “text/html” MIME type might seem like a bad use of the format, it’s so ingrained now that going back to HTML 4 would be a bit of a step back. The use of XHTML as HTML isn’t a documented standard per se; it’s more of a standard that came to be out of necessity. No one can dispute that HTML 4 is a well-documented standard. The problem, though, is that it’s also a stale standard. The whole idea behind the “X” in XHTML was that it was “eXtensible” HTML that could be parsed either as strict XML or as HTML. As such, developers flocked to it because of the promise it had. Yes, it’s true that XHTML isn’t supported in Internet Explorer…but that only applies to the XML MIME types, not “text/html”.

Keep in mind that XHTML with a “text/html” MIME type is still just HTML. From a browser point of view, one is not better than the other when it comes to parsing the HTML and CSS. The tags are the same and the rendering is the same if both are written properly. Aside from a few subtle differences, the main difference is in the syntax and validation. Arguments that one is better than the other are pretty moot at best. I think it boils down to personal preference, along with the tools you use, which end up dictating which standard to pick.

I think XHTML remains a necessity because it’s unclear exactly when the HTML 5 standard will be finalized. No one seems to know what is going on. The web standards community is entrenched over the direction HTML 5 should take. The W3C is saying one thing, WHATWG is saying another, and the Web Standards Project is putting its two cents in as well. The end result is that we probably won’t see HTML 5 reach a release candidate state until 2012 (which, ironically enough, is when the Mayan calendar supposedly predicted the end of the world).

Based on all this, I’m leaning more on the continued use of XHTML. I use script libraries like MooTools and jQuery pretty heavily and I just don’t like the idea of getting something really screwed up simply because the HTML I’m writing doesn’t get validated properly. For me, it’s all about well-formed code and the ability to properly parse the DOM. Not that that isn’t possible with a strict HTML doctype, but I think most of the tools we use are more geared for XHTML validation.

If it takes four more years to get a final specification for HTML 5 drafted then it could potentially take another four years before we see widespread browser support. For that reason alone, I don’t see any reason why we shouldn’t continue to use a strict XHTML 1.0 doctype even if it’s not a 100% documented standard. It’s standard enough and that works for me.
