Tony MacDonell - Teknision: XHTML, future proofing our crappy markup.

Friday, February 03, 2006

XHTML, future proofing our crappy markup.

XHTML was probably one of the best innovations of the web I can think of. XHTML was designed to allow content to be marked up and delivered to a browser (including old ones), and have it render correctly as HTML, but also to be passed to an XML user-agent and be consumed there as well.

This was achieved by taking HTML and making it comply with the parsing rules of XML. Tags must always be closed, attributes must always be in quotes, etc.
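To make those two rules concrete, here is a minimal sketch in Python (my choice for illustration, nothing to do with the spec itself). It feeds two made-up fragments to a plain XML parser: one written the loose old HTML way, one written the XHTML way. The fragments and the helper name is_well_formed are assumptions invented for this example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if a generic XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Old-style HTML: unquoted attribute value and an unclosed <br>.
html_style = "<p class=note>First line<br>Second line</p>"

# The same fragment following XML rules: quoted attribute, closed <br />.
xhtml_style = '<p class="note">First line<br />Second line</p>'

print(is_well_formed(html_style))
print(is_well_formed(xhtml_style))
```

A browser will happily render both fragments, but only the second one survives a trip through an XML parser, which is the whole point of the stricter rules.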

Here is the W3C link to the spec.

The concepts that drive XHTML really play nice with the concepts that drive CSS, which in a standards oriented world is the natural companion of XHTML.

The goal was to give the markup a minimal role in presentation, mainly providing the document structure, and then let CSS take over and pretty it up. This all makes a lot of sense and I think it was a great idea.

Most of our blogs today are driven by this technology, obviously. Well, to web professionals reading this that is obvious, but to the average user the change in technology behind the scenes really has no impact on their web experience. So really this innovation was focused on simplifying the code we had to write, and making it easier to read from a human's point of view.

More than that, standards evangelists claim that the real point of using XHTML is to future proof your markup, and this is really the concept that resonates in my mind.

In my eyes XHTML offers the opportunity of the ultimate syndication format. One in which any data in any format can be shared across the web. This is much more useful than RSS which is limited to distributing lists only. XHTML is approachable by anyone, and easily rendered within any client that supports HTML.

Unfortunately though, while this is an awesome idea and worth pursuing, the way we are going forward with it now needs a little more focus on the future. I would argue that there is little "future proofing" left to do. The future is now, and we are missing a huge opportunity.

Let me back up that point, and make a further claim that:

Most of the XHTML that we are producing is just future proofing badly structured documents, that will end up being very badly formatted data in the future.

Last night I blogged about my experience of trying to consume an XHTML document using Flash and render the contents of the blog post in a Flash text field. What I saved for today was my frustration in interpreting the markup using Flash.

Now to set this off on the right foot, refer to the following claim in the XHTML specification:

3.2. User Agent Conformance

A conforming user agent must meet all of the following criteria:

3. When a user agent processes an XHTML document as generic XML, it shall only recognize attributes of type ID (i.e. the id attribute on most XHTML elements) as fragment identifiers.

This is the critical part of the spec for me. It tells me that as a developer of XML user agents, when I parse an XHTML document I am expected to only pay attention to the tags that are flagged with an ID. Cool, that makes a lot of sense, and should be easy to handle.

So I extend XML in Flash, and add some methods that easily allow me to find the tags I am supposed to be able to consume. Done, but then I discover something that just isn't right. When I look for the tag that is built to house my post content, I come across a div tag called:

id="content"

within that there are more div tags:

id="main"
id="main2"

An h2 tag wraps the date, which is in here apparently as a child of content. Note the date has no id attribute. I then get to another div, which is unnamed, that contains the actual post contents.

So my issue becomes the fact that there is no rhyme or reason to what the contents of the tag named 'content' actually are. Its children are a combination of presentation data mixed with the actual data I need.
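Here is a rough sketch of what my parser runs into, written in Python rather than Flash so it is easy to try. The markup is my reconstruction from the description above, not the real blog template: the exact nesting of main and main2, the sample text, and the helper name find_by_id are all assumptions, and I left out the XHTML namespace to keep it short.

```python
import xml.etree.ElementTree as ET

# Hypothetical reconstruction of the page structure described above.
PAGE = """
<div id="content">
  <div id="main">
    <div id="main2">
      <h2>Friday, February 03, 2006</h2>
      <div>
        <p>The actual post text.</p>
      </div>
    </div>
  </div>
</div>
"""


def find_by_id(root, wanted):
    """Return the first element whose id attribute equals `wanted`, else None."""
    for element in root.iter():
        if element.get("id") == wanted:
            return element
    return None


root = ET.fromstring(PAGE)
content = find_by_id(root, "content")

# Elements a machine can address by id versus those it cannot.
labeled = [el.get("id") for el in content.iter() if el.get("id")]
unlabeled = [el.tag for el in content.iter() if el.get("id") is None]

print(labeled)    # ['content', 'main', 'main2']
print(unlabeled)  # ['h2', 'div', 'p']
```

Everything I can reach by id is a layout wrapper. The date and the post body, the two things I actually want, sit in the unlabeled list.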

The fix would be to actually label the tags that contain specific relevant data. They could keep their div called 'content', but as an XML user-agent the rest of that presentation data within that tag is useless to me, unless those children have ids as well.

Label the specific tag that wraps the node I am looking for with an id that makes it easy and predictable for a machine to consume.
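As a sketch of what that could look like (again in Python, with made-up markup), suppose the template labeled the date and the post body directly. The ids post-date and post-body, the sample text, and the helper text_by_id are hypothetical names I picked for illustration; no standard defines them.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: same layout wrappers, but the data nodes carry ids.
LABELED_PAGE = """
<div id="content">
  <div id="main">
    <h2 id="post-date">Friday, February 03, 2006</h2>
    <div id="post-body">
      <p>The actual post text.</p>
      <p>A second paragraph.</p>
    </div>
  </div>
</div>
"""


def text_by_id(root, wanted):
    """Return the whitespace-normalized text of the element with this id, else None."""
    for element in root.iter():
        if element.get("id") == wanted:
            # Gather all nested text, then collapse runs of whitespace.
            return " ".join("".join(element.itertext()).split())
    return None


root = ET.fromstring(LABELED_PAGE)
print(text_by_id(root, "post-date"))  # Friday, February 03, 2006
print(text_by_id(root, "post-body"))  # The actual post text. A second paragraph.
```

The consumer never has to know that the date happens to live in an h2 or how many wrapper divs the designer needed. It asks for an id and gets the data.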

Label it so that the XML user agent does not need a high-level understanding of HTML to be able to make sense of the contents. If the contents contain a lot of HTML then in the future we will be holding back technology by forcing it to understand very old deprecated techniques of organizing content.

If I did want that date out of the 'content' tag, I would have trouble finding it. It is included as a child tag with the name 'h2' which does not describe its content at all, and it has no id! The machine cannot make concrete decisions on what this content might or might not be.

If content creators actually focused on being meticulous about identifying data within XHTML, you would eventually see a wave of best practices emerge that could really take the concept of syndication and remixing to a whole new level with the masses.

I fear that XHTML has not been explained correctly to most people, and that the term "future proof" needs clarification. I would say we want to future proof the information, not future proof the layout of the document.

The focus of XHTML should really be to describe the contents of a document so a browser can render it, but also to design it so that a machine can consume it without having to treat it as a document, and can instead treat it as structured data. Good XHTML should contain indicators that allow the machine to easily drop all the crappy "HTML carry over" and just strip out the good stuff.

9 Comments:

Anonymous said...

But surely h2 for your date just describes the heading order.
What if you had a list of multiple entries, how could you give them unique IDs?... Many questions, really interesting post to read.

1:06 PM

Tony said...

"What if you had a list of multiple entries, how could you give them unique IDs?"

Well that is a great question, but one that should be addressed to the people that write these standards. I think this is a great discussion to have though, because instead of saying we are "future proofing" our content and then continuing to generate crap, we are starting to discuss the possible future requirements.

1:21 PM

Anonymous said...

Best innovations of the web? wtf are you smoking?

1:46 PM

Tony said...

Care to back up why you feel that way? I am prepared to explain why I think it is a great innovation.

1:48 PM

Anonymous said...

And suddenly, you made it all click. XHTML is about letting the browser do its job (render html), while allowing another type of tool (xml parser in your flash example) come along and make sense of the content as well. The fog has lifted for me!!!

3:44 PM

Anonymous said...

No one wants to use XHTML with flash.
Instead, you should have A)+B.n):
A) XML with data markup only
B.1)XSLT that renders XHTML for various devices (browser, mobile...)
B.2)Flash
B.3)Java, for instance
B.4, 5, 6...)Something that can parse XML

It is true that repeated data has no pre-determined naming convention, but I suggest nesting:
Inside id="titles" you can nest id="1", id="2" etc.
Regards,
Darko

2:50 AM

Anonymous said...