Valid XHTML Document: The Basics
To lay the foundations for a project that adheres to Web Standards, you need to make sure that your HTML document is properly marked up and, what is more, valid. In this article, I explain the DOCTYPE, the namespace and the XML prolog, and add a word or two about character encoding.
If you want to write a valid XHTML document, it has to start with a proper DOCTYPE. But what is a DOCTYPE? Generally speaking, a DOCTYPE names a defined set of rules that you, the author, declare your document adheres to. The validator can then check whether this is actually true, and it will point out errors if your document does not stick to the rules, i.e. is invalid.
XHTML 1.0 Transitional
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
The Transitional DOCTYPE is derived from the HTML 4.01 Transitional DOCTYPE. Its purpose is to give designers and developers a way to convert their old-school web sites (those which are full of presentational markup, table-based layout etc.) into sites that are more up-to-date and adhere to current rules. You can use some presentational markup in your code, as well as some deprecated elements, and the validator will not mark them as errors.
XHTML 1.0 Strict
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
As the name implies, if you declare that your document is marked up according to the Strict DOCTYPE, it needs to be free of presentational markup, and it must not use any deprecated elements or attributes.
XHTML 1.0 Frameset
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Frameset//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-frameset.dtd">
If you decide to use frames for your site, you have to use the Frameset DOCTYPE. Whether using frames is a good idea or not…well, I leave this up to you.
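To make the idea of a DOCTYPE as a declared set of rules more concrete, here is a minimal Python sketch that reads the public identifier from a document and reports which of the three XHTML 1.0 DOCTYPEs it names. The names `XHTML_FLAVORS`, `DOCTYPE_RE` and `detect_flavor` are my own and purely illustrative; a real validator does far more than this.

```python
import re

# Public identifiers of the three XHTML 1.0 DOCTYPEs discussed above.
XHTML_FLAVORS = {
    "-//W3C//DTD XHTML 1.0 Strict//EN": "Strict",
    "-//W3C//DTD XHTML 1.0 Transitional//EN": "Transitional",
    "-//W3C//DTD XHTML 1.0 Frameset//EN": "Frameset",
}

# Matches <!DOCTYPE html PUBLIC "public-id" and captures the public id.
DOCTYPE_RE = re.compile(r'<!DOCTYPE\s+html\s+PUBLIC\s+"([^"]+)"')


def detect_flavor(markup):
    """Return 'Strict', 'Transitional', 'Frameset', or None if unrecognised."""
    match = DOCTYPE_RE.search(markup)
    if match is None:
        return None
    return XHTML_FLAVORS.get(match.group(1))


page = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">'
    "<html></html>"
)
print(detect_flavor(page))  # prints: Strict
```

Note that this only tells you what the author claims; whether the document actually follows those rules is for the validator to decide.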
After the DOCTYPE, the first element in an XHTML document is always <html>. For your document to be valid, you need to add a namespace declaration to it. A namespace simply states which pool of elements and attributes is available for use in your document. If you add the namespace to the <html> element, it will look similar to this:
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
The value of the xmlns attribute is a URI that identifies the namespace. The xml:lang attribute tells XML processors which natural language the content is written in (English in this case), and lang="en" says the same thing to HTML processors.
The W3C recommends that every XML document, including XHTML documents, should start with a so-called XML prolog (formally, the XML declaration), which therefore precedes the DOCTYPE declaration. It looks like this:
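If you are curious how an XML parser sees these attributes, the following sketch uses Python's standard xml.etree.ElementTree module. The sample markup and the variable names are made up for illustration.

```python
import xml.etree.ElementTree as ET

markup = (
    '<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">'
    "<head><title>Example</title></head><body></body></html>"
)

root = ET.fromstring(markup)

# ElementTree writes qualified names as {namespace-uri}local-name.
print(root.tag)  # {http://www.w3.org/1999/xhtml}html

# xml:lang lives in the reserved XML namespace; plain lang has no namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"
print(root.get(XML_LANG), root.get("lang"))  # en en
```

The parser attaches the namespace to every element name, which is how it tells an XHTML <title> apart from a <title> element of some other XML vocabulary.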
<?xml version="1.0" encoding="UTF-8"?>
Unfortunately, Internet Explorer on Windows has problems with it: the XML prolog puts IE/Windows into Quirks Mode (even though the IEBlog says that this has been fixed). If you want to make sure that Microsoft’s browser renders your pages correctly, you need to leave the prolog out and set the encoding elsewhere. This can be done in the <head> element, using a meta element, as described in the next section.
The <head> element is the right place to set the character encoding of your choice. Until recently, I used ISO-8859-1, which can represent the characters used in Western European languages. Not so long ago, I decided to switch to the Unicode encoding UTF-8, because it can encode every character defined in Unicode. So, inside your <head>, you would add something similar to the following:
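To show what the encoding part of the XML prolog does for an XML parser (as opposed to a browser treating the page as HTML), here is a small sketch with Python's standard xml.etree.ElementTree module. The sample paragraph is invented for the example.

```python
import xml.etree.ElementTree as ET

# A paragraph stored as ISO-8859-1 bytes, with a prolog that says so.
source = '<?xml version="1.0" encoding="ISO-8859-1"?><p>© Example</p>'
latin1_bytes = source.encode("iso-8859-1")

# The parser reads the encoding from the prolog and decodes accordingly.
paragraph = ET.fromstring(latin1_bytes)
print(paragraph.text)  # © Example
```

If the prolog is left out, an XML parser assumes UTF-8 (or UTF-16), so any other encoding has to be announced in some other way.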
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8" />
Because you are serving the page as UTF-8, it is also not necessary to put encoded versions of special characters like the copyright symbol (©) on the page. You can just use the actual character itself. The exceptions are the ampersand (&) and the less-than sign (<), which still need to be encoded in your content as &amp; and &lt;.
When creating an (X)HTML document, you basically have the choice between serving your pages as text/html or application/xhtml+xml. Because you have chosen an XHTML DOCTYPE, you might think it makes sense to use application/xhtml+xml as your MIME type of choice to make use of the advantages of XHTML, such as:
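As a quick illustration of the point above, this Python sketch escapes a piece of sample text with the standard html module and shows how the copyright sign is stored in the two encodings mentioned. The sample text is made up.

```python
import html

text = "Fish & Chips © Example Ltd"

# html.escape replaces &, < and > (and quotes) but leaves © untouched.
escaped = html.escape(text)
print(escaped)  # Fish &amp; Chips © Example Ltd

# In UTF-8 the copyright sign takes two bytes; in ISO-8859-1 it takes one.
print("©".encode("utf-8"))       # b'\xc2\xa9'
print("©".encode("iso-8859-1"))  # b'\xa9'
```

The byte difference is the reason the encoding you declare has to match the encoding your editor actually saves the file in.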
- Other XML-based languages, such as MathML, can be integrated in your document
- User agents report errors in the document immediately and refuse to display it
- The document displayed is guaranteed to be “well-formed”
However, doing so can cause a lot of problems: Internet Explorer refuses pages served as application/xhtml+xml, and even the slightest flaw in your markup leads to an XML error being displayed in the browser window. Under certain circumstances it might make sense to use application/xhtml+xml, but my preferred way is to use text/html. Additionally, the MIME type of your document can only really be set on the web server, not in the document itself. So it does not matter whether you put text/html or application/xhtml+xml in the meta element. What is crucial is how your web server is set up to serve (X)HTML to user agents. Paul Haine goes into a little more detail in the first chapter of his book, HTML Mastery.
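Since the decision is made on the server, one common approach is content negotiation: look at the Accept header the browser sends and pick the MIME type accordingly. The following Python sketch shows the idea only. The function name `choose_mime_type` is hypothetical, and the logic is deliberately simplified (it ignores q-values and wildcards).

```python
def choose_mime_type(accept_header):
    """Pick a MIME type for an XHTML page from a request's Accept header.

    Simplified on purpose: q-values and wildcards are ignored.
    """
    # Split "type/subtype;q=0.9, ..." into bare, lower-cased media types.
    offered = [
        part.split(";")[0].strip().lower()
        for part in accept_header.split(",")
    ]
    if "application/xhtml+xml" in offered:
        return "application/xhtml+xml"
    # Fall back to text/html for browsers that do not announce XHTML support.
    return "text/html"


print(choose_mime_type("text/html,application/xhtml+xml,*/*;q=0.8"))
print(choose_mime_type("image/gif, image/jpeg, */*"))
```

The first call returns application/xhtml+xml and the second falls back to text/html, which is the behaviour you want for a browser that never lists application/xhtml+xml in its Accept header.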
The title Element
As per the specifications of the W3C, every HTML document must have a title element in the head section. This should be used to identify the contents of the document, so for example:
<title>Flights delayed for hours by Northeast storms - CNN.com</title>
So the basic structure of your XHTML document, which is the foundation for a proper and valid page, should look like this:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en" dir="ltr">
<head>
  <meta http-equiv="Content-Type" content="text/html;charset=UTF-8" />
  <title>Title of your document</title>
</head>
<body>
</body>
</html>
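If you want a quick sanity check before running the validator, you can feed the skeleton to an XML parser and see whether it is at least well-formed. This is a minimal sketch using Python's standard xml.etree.ElementTree module; `SKELETON` and `is_well_formed` are names I made up, and well-formed is a weaker guarantee than valid, so this does not replace the validator.

```python
import xml.etree.ElementTree as ET

SKELETON = """<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en" dir="ltr">
<head>
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8" />
<title>Title of your document</title>
</head>
<body>
</body>
</html>"""


def is_well_formed(markup):
    """Return True if an XML parser accepts the markup, False otherwise."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


print(is_well_formed(SKELETON))                          # True
print(is_well_formed(SKELETON.replace("</title>", "")))  # False
```

The second call removes the closing title tag, which is exactly the kind of small flaw that makes a browser show an XML error when the page is served as application/xhtml+xml.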
Further reading and sources
- The Web Standards Project
- Recommended DTDs to use in your Web document (W3C QA)
- XML Namespace (Wikipedia)
- Extensible Markup Language (XML) 1.0 – XML Prolog
- Quirks mode and strict mode
- The <?xml> prolog, strict mode, and XHTML in IE
- Character Encoding (Wikipedia)
- Serving different MIME types to different browsers – Content Negotiation
- The dir attribute
- The lang attribute