XHTML
Steven Pemberton
CWI, Amsterdam
Chair, W3C HTML Working Group
Overview
- History
- Philosophy
- XML and related technologies
- XHTML 1.0
- Modularisation
- XHTML Basic
- XHTML 1.1
- The Future
HTML 1
- The original HTML was designed in the early 1990s for scientific reports
- Each document was a single resource (not even <IMG>)
- (This explains much about HTTP by the way)
- It is amazing how much we have been able to do with a language with such beginnings
- It was described using SGML
HTML as an SGML Application
- SGML: an international standard since 1986
- It is a meta-language that describes data formats, using DTDs (Document Type Definitions)
- Describes structure, not presentation
<H1>HTML as SGML Application</H1>
Example of a DTD fragment
<!ELEMENT table
(caption?, (col*|colgroup*), thead?,
tfoot?, (tbody+|tr+))>
<!ELEMENT caption %Inline;>
<!ELEMENT thead (tr)+>
...
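A content model such as the one for table reads much like a regular expression over the sequence of child elements. The following is only a sketch of that idea in Python, not how an SGML or XML validator is actually built; the function name matches_table_model is an assumption made for illustration.

```python
import re

# Content model from the DTD fragment above:
#   (caption?, (col*|colgroup*), thead?, tfoot?, (tbody+|tr+))
# Each child element name is written followed by a space, so the model
# becomes an ordinary regular expression over the list of children.
TABLE_MODEL = re.compile(
    r"(caption )?((col )*|(colgroup )*)(thead )?(tfoot )?((tbody )+|(tr )+)"
)

def matches_table_model(children):
    """Return True if the list of child element names fits the content model."""
    sequence = "".join(name + " " for name in children)
    return TABLE_MODEL.fullmatch(sequence) is not None

print(matches_table_model(["caption", "col", "col", "thead", "tbody"]))  # True
print(matches_table_model(["thead", "caption", "tr"]))                   # False: caption must come first
print(matches_table_model(["caption"]))                                  # False: needs tbody or tr
```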
Attributes
<!ATTLIST TABLE
%attrs; -- %coreattrs, %i18n, %events --
summary %Text; #IMPLIED
width %Length; #IMPLIED
border %Pixels; #IMPLIED
...
>
Entities
<!ENTITY % fontstyle
"TT | I | B | BIG | SMALL">
<!ENTITY % inline "#PCDATA | %fontstyle; | %phrase; | %special; | %formctrl;">
<!ENTITY % Length "CDATA" -- nn for pixels or nn% for percentage length -->
Problems with SGML
- Arcane syntax
- Very difficult to implement fully
- No support for types
Changes to HTML
- Netscape and Microsoft started adding to HTML: mostly presentation-oriented tags (like <BLINK>, <CENTER>), and frames
- The World Wide Web Consortium (W3C) started an effort to:
  - Keep HTML pure
  - Do presentation via Style Sheets
Separating content and presentation
- HTML was designed as a data-structuring language, but the later changes undermined this.
- Separating content from presentation has distinct advantages
For the author
- Easier to write your documents
- Easier to change your documents
- Easy to change the look of your documents
- Access to professional designs
- Your documents are smaller
- Visible on more devices
- Visible to more people
For the webmaster
- Separation of concerns
- Simpler HTML, less training
- Cheaper to produce, easier to manage
- Easy to change house style
- Reach more people
- Search engines find your stuff more easily
- Visible on more devices
For the reader
- Faster download (one of the top 4 reasons for liking a site)
- Easier to find information
- You can actually read the information if you are sight-impaired
- Information more accessible
- You can use more devices
For the implementor
- Improves the implementation (separation of concerns)
- Can produce smaller browsers
Changes to HTML (2)
- Another change that Netscape made, with insufficient thought, was Frames
- Frames create significant problems with web pages
The problems with frames
- Can't bookmark framesets
- [Back] does odd things
- [Page up] and [page down] work oddly
- [Reload] often doesn't work right
- Security is compromised
- Nested frames are hard to deal with (how do you get out?)
What frames can do
- Search and show interfaces
- Keeping script variables in a hidden frame
Style languages
- W3C's first action was to start an activity on Style Sheets (Nov 1995)
- This produced CSS1 initially (Dec 1996), then CSS2 (May 1998); CSS3 is in preparation
- It later produced XSL, an XML-based language, as a complement to CSS
CSS
- CSS is a separate language from HTML that allows you to specify how an HTML document, or set of documents, should look
- Separates content from presentation
- HTML can be a structure language again
Examples of CSS
h1 { font-weight: bold; font-size: 2em }
h2 { font-weight: bold; font-size: 1.5em }
em {background-color: yellow}
body {margin-left: 20%}
Using CSS
Advantages of CSS
- Makes HTML easier to write (and read)
- You can define a house style
- Compatible: you can still see the content on non-CSS browsers
- Pages are much smaller
- Accessible to sight-impaired
- ...
By the way...
- Check your logs: more than 95% of people browsing now use a CSS-enabled browser
- The current generation of browsers (IE 5, NS 6, Opera 4) have excellent support for CSS.
- You never need to use the <FONT> element (or its FACE attribute) again!
Documents
- As mentioned, HTML was designed for just one sort of document (scientific reports), but is now being used for all sorts of different documents
- You could use SGML to define other sorts of document, but SGML is notoriously hard to fully implement
- Enter XML
Enter XML
- XML is a W3C effort to simplify SGML
- It is a meta-language: a language for defining languages
- It is a subset of SGML
- One of the aims is to allow everyone to invent their own tags
- DTD is optional: a DTD can be inferred from a document
Consequences
- The requirement of being able to infer a DTD from a document has an effect on the languages you can define:
  - Closing tags are now required:
    <LI>....</LI> <P>....</P>
  - Empty tags are marked specially:
    <IMG SRC="pic.gif"/> <BR/> <HR/> (or <HR></HR> etc.)
Consequences 2
By the way:
<P> is not like <BR>
Not like this:
<H1>XML</H1>
An underlying problem with HTML is that ...
<P>
You could use SGML to define ...
But like this:
<H1>XML</H1>
<P>
An underlying problem with HTML is that ... </P>
<P>
You could use SGML to define ...</P>
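These consequences are easy to observe with any XML parser. The sketch below uses Python's standard xml.etree.ElementTree module; the helper name is_well_formed and the two sample strings are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(markup):
    """Return True if an XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False

# HTML style: <P> used as a separator, closing tags left out
html_style = "<BODY><H1>XML</H1>An underlying problem ...<P>You could use SGML ...</BODY>"

# XML style: every paragraph is closed, empty elements end in "/>"
xml_style = ("<BODY><H1>XML</H1><P>An underlying problem ...</P>"
             "<P>You could use SGML ...</P><HR/></BODY>")

print(is_well_formed(html_style))  # False
print(is_well_formed(xml_style))   # True
```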
Consequence of XML
- Anyone can now design their own (Web-delivered) languages
- CSS makes them viewable
<address>
<name>Steven Pemberton</name>
<company>CWI</company>
<street>Kruislaan 413</street>
<postcode>1098 SJ</postcode>
<city>Amsterdam</city>
<speaker/>
</address>
So do we still need HTML?
- Workshop in May 1998
- XML is still a meta-language
- There is still a perceived need for a base-line mark-up
- HTML has some useful semantics, both implied and explicit (search engines gladly use it, for instance)
HTML as XML application
- Clean up (get rid of historical flotsam)
- Modularise: split into separate parts
  - Allows other XML applications to use parts
  - Allows special purpose devices to use subset
- Add any required new functionality (forms, better event handling, Ruby)
The HTML Working group
- International membership, around 20 members
- Many major players (IBM, Microsoft, Netscape, etc)
- Meets weekly by phone, quarterly face-to-face
Group experience
- There was more to be worked out than we anticipated
- XHTML is the first major application of XML, so the world's eyes are on us
- XML still needs the wrinkles ironed out
Philosophy of XHTML
- Transition from 'old world' to XML
- Clean up the language
- Return to structure only
- Use generic XML as much as possible
- Modularise
- Address wider needs (International, Accessibility)
- Add new functionality
Plan of action
- HTML 4.01: corrected version
- XHTML 1.0: transitional version of HTML 4.01 in 3 flavours
- Modularisation: agreement on split and methodology
- XHTML Basic: Small devices
- XHTML 1.1: clean version of 1.0 strict
(plan of action)
- Events: accessible and device-independent
- Ruby: needed Asian markup
- Forms: more control
- XHTML 2.0: Putting it all together
Differences HTML:XHTML
- Because of the difference between SGML and XML, there are some necessary differences, for instance:
- Use lower case: <p> not <P>
- Attributes are always quoted:
<th colspan="2">
- Anchors use the id attribute, not name (and not just on <a>, by the way):
<a id="index"> <p id="top">
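A rough way to check a fragment against these three differences is sketched below in Python. It is not a validator: the function name xhtml_issues and the wording of its messages are assumptions made for illustration, and it only looks at the rules listed on this slide.

```python
import xml.etree.ElementTree as ET

def xhtml_issues(markup):
    """List departures from the rules above (a rough check, not a validator)."""
    try:
        root = ET.fromstring(markup)
    except ET.ParseError as err:
        # Unquoted attribute values are rejected by the XML parser itself.
        return ["not well-formed: %s" % err]
    issues = []
    for el in root.iter():
        if el.tag != el.tag.lower():
            issues.append("element name not lower case: " + el.tag)
        if el.tag.lower() == "a" and "name" in el.attrib and "id" not in el.attrib:
            issues.append("anchor uses name without id: " + el.tag)
    return issues

print(xhtml_issues('<th colspan=2>x</th>'))                 # one "not well-formed" message
print(xhtml_issues('<P><A name="top">x</A></P>'))           # three issues
print(xhtml_issues('<p id="top"><a id="index">x</a></p>'))  # []
```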
Example XHTML 1.0
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head><title>Virtual Library</title></head>
<body>
<p>Moved to <a href="http://vlib.org/">vlib.org</a>. </p>
</body>
</html>
Namespaces
- Namespaces have been added to XML to allow you to mix fragments from different languages (e.g. HTML + Maths)
- In the same way that object-oriented languages allow you to identify which function you are using, namespaces allow you to identify which tags you are using.
Example of nesting
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>A Math Example</title></head>
<body>
<p>The following is MathML markup:</p>
<math xmlns="http://www.w3.org/TR/REC-MathML">
<apply><log/><logbase><cn> 3 </cn> </logbase>
<ci> x </ci>
</apply>
</math>
</body>
</html>
Example of colonising
<math xmlns="http://www.w3.org/TR/REC-MathML"
xmlns:html="http://www.w3.org/1999/xhtml">
<apply><log/><logbase><cn> 3 </cn> </logbase>
<ci> x </ci>
</apply>
<html:p>This is a paragraph</html:p>
</math>
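To a namespace-aware parser, a prefixed name and a default-namespace name both resolve to a (namespace, local name) pair. The sketch below shows this with Python's standard xml.etree.ElementTree, which reports names as {namespace}local; the helper split_name is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET

doc = """
<math xmlns="http://www.w3.org/TR/REC-MathML"
      xmlns:html="http://www.w3.org/1999/xhtml">
  <apply><log/><logbase><cn> 3 </cn></logbase><ci> x </ci></apply>
  <html:p>This is a paragraph</html:p>
</math>
"""

def split_name(tag):
    """Split ElementTree's '{namespace}local' form into (namespace, local)."""
    if tag.startswith("{"):
        uri, local = tag[1:].split("}", 1)
        return uri, local
    return "", tag

root = ET.fromstring(doc)
for el in root.iter():
    # Every element carries its namespace, whether it was written
    # with a prefix (html:p) or picked up the default one (math, apply, ...).
    print(split_name(el.tag))
```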
Namespaced attributes
XML 'namespace'
- XML also has its own pseudo-namespace for reserved attributes:
<para xml:lang="en">
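The xml prefix needs no declaration: it is bound by definition to a fixed namespace. A minimal Python sketch using the standard xml.etree.ElementTree module follows; the element content "Hello" is an assumption added for illustration.

```python
import xml.etree.ElementTree as ET

# The namespace that the reserved prefix "xml" is always bound to.
XML_NS = "http://www.w3.org/XML/1998/namespace"

para = ET.fromstring('<para xml:lang="en">Hello</para>')

# The parser reports the attribute under its namespace, not under the prefix.
print(para.get("{%s}lang" % XML_NS))  # en
```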
Using 'generic' XML
- Presentation: use CSS
- Links: use XLink or Schemas
- Forms: use CSS?
- Images etc.: use XLink or Schemas
- (Natural) language of elements: use xml:lang attribute
XLink?
- HTML has several 'built-in' hyperlinks: <a>, <img>, <object>, <link>, etc.
- Since XML allows you to define your own elements, a browser doesn't know which are links
- XLink was started to solve this problem.
XLink
- XLink started as a method of describing which attributes of an element were a link
- It later changed into a language of links, so it could no longer be used to describe XHTML
- The current plan is now to introduce types into Schemas to describe links
Example of XLink
<crossReference
xmlns:xlink="http://www.w3.org/1999/xlink"
xlink:type="simple"
xlink:href="students.xml"
xlink:role="studentlist"
xlink:title="Student List"
xlink:show="new"
xlink:actuate="onRequest">
Current List of Students
</crossReference>
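Because the link is identified by namespaced attributes rather than by the element name, a generic tool can find links in a vocabulary it has never seen. The Python sketch below illustrates this; the wrapper elements course and title and the function name find_links are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

XLINK = "http://www.w3.org/1999/xlink"

doc = """
<course>
  <title>Markup</title>
  <crossReference xmlns:xlink="http://www.w3.org/1999/xlink"
      xlink:type="simple" xlink:href="students.xml"
      xlink:title="Student List" xlink:show="new"
      xlink:actuate="onRequest">Current List of Students</crossReference>
</course>
"""

def find_links(root):
    """Return (element name, href, title) for every simple link in the tree."""
    links = []
    for el in root.iter():
        if el.get("{%s}type" % XLINK) == "simple":
            links.append((el.tag,
                          el.get("{%s}href" % XLINK),
                          el.get("{%s}title" % XLINK)))
    return links

print(find_links(ET.fromstring(doc)))
# [('crossReference', 'students.xml', 'Student List')]
```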
Schemas
- Schemas are a new technology intended to replace much of what DTDs do.
- Schemas are expressed in XML
- They have support for data types
- Much easier to parse and implement than DTDs
Schemas: but
- They don't support the definition of entities (such as &eacute;)
- Not easy to read (or write)
Schema fragment
<elementType name='table'>
<refines>
<archetypeRef name='common'/>
<archetypeRef name='simpleBlockDisplay'/>
</refines>
<sequence>
<elementTypeRef name='caption' minOccur='0' maxOccur='1'/>
<choice>
<elementTypeRef name='col' minOccur='0' maxOccur='*'/>
<elementTypeRef name='colgroup' minOccur='0' maxOccur='*'/>
</choice>
<choice>
<sequence>
<elementTypeRef name='thead' minOccur='0' maxOccur='1'/>
<elementTypeRef name='tfoot' minOccur='0' maxOccur='1'/>
<elementTypeRef name='tbody' minOccur='1' maxOccur='*'/>
</sequence>
<elementTypeRef name='tr' minOccur='1' maxOccur='*'/>
</choice>
</sequence>
</elementType>
(equivalent DTD)
<!ELEMENT table
(caption?, (col*|colgroup*), thead?,
tfoot?, (tbody+|tr+))>
XHTML 1.0
- XHTML 1.0 is an XML-ised version of HTML 4.01
- Just like HTML 4.01, there are 3 versions: 'strict', 'transitional', and 'frameset'
Transitional version
- XHTML 1.0 has been carefully designed to make use of 'quirks' in existing HTML browsers
- Use of a small number of guidelines allows XHTML to be served to HTML user agents as well as XML user agents
Examples of Guidelines
Serving XHTML 1.0
- An XHTML 1.0 document that follows the guidelines can be served up either as HTML, or as XML
- But beware: CSS has slightly different rules for HTML and XML
- Similarly, the DOM has differences for HTML and XML
Modularisation
- XHTML has been divided into a number of modules.
- A module is a collection of elements and/or attributes that can be used as building blocks to build a DTD.
- A language can be built by using just XHTML modules, or adding your own
- We had originally defined Modularisation just for our own use, but it has turned out useful for other groups as well
XHTML modules
- Structure: html, head, title, body
- Text: abbr, acronym, address, blockquote, br, cite, code, dfn, div, em, h1, h2, h3, h4, h5, h6, kbd, p, pre, q, samp, span, strong, var
- Hypertext: a
- List: ol, ul, dl, li, dt, dd
(modules)
- Applet (deprecated): applet, param
- Presentation: b, i, hr, big, small, sub, sup, tt
- Edit: del, ins
- Bi-directional Text: bdo
(modules)
- Basic Forms: simple forms
- Forms: full forms
- Basic Tables: simple tables
- Tables: full tables
(modules)
- Image: img
- Client-side Image Map: map, area, plus attributes on other elements
- Server-side Image Map: adds the ismap attribute to img
- Object: object, param
- Frames
- Target: target attribute
- Iframe
(modules)
- Intrinsic Events: adds event attributes
- Metainformation: meta
- Scripting: script
- Stylesheet: style
- Style Attribute
- Link: link
(modules)
- Base: base
- Name Identification: name attribute
- Legacy: basefont, center, font, s, strike, u, plus loads of attributes (e.g. align)
- Ruby: Asian markup
Note on modules
- Note that some modules consist of a single element, or just add some attributes to existing elements
- Not all modules are independent: if you use some modules, they bring other modules with them, or change other modules
- Future modules are planned (e.g. extended forms, events)
The XHTML family
- To be an XHTML host language you must use the Structure, Hypertext, Text, and List modules
- To be an XHTML integration language you may define your own Structure module
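The host-language requirement amounts to a subset test on module names. The Python sketch below uses the module names as they appear in the module lists of these slides; the function name missing_host_modules and the second, incomplete module set are assumptions made for illustration.

```python
# Module names as used in the module lists in these slides.
REQUIRED_FOR_HOST = {"Structure", "Text", "Hypertext", "List"}

# The modules listed for XHTML Basic.
XHTML_BASIC = {"Structure", "Text", "Hypertext", "List", "Basic Forms",
               "Basic Tables", "Image", "Object", "Metainformation",
               "Link", "Base"}

def missing_host_modules(modules):
    """Return the required modules that a candidate language leaves out."""
    return sorted(REQUIRED_FOR_HOST - set(modules))

print(missing_host_modules(XHTML_BASIC))               # []
print(missing_host_modules({"Text", "Image", "Ruby"})) # ['Hypertext', 'List', 'Structure']
```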
Example integration languages
- SMIL is planning a module to integrate SMIL and HTML
- Likewise for MathML
Creating a DTD
- It is not expected that creating XHTML-based languages will be a daily activity
- This is not the place to describe the method: it depends on understanding DTDs.
- The Modularisation document has extensive examples
- Future versions will also use Schemas (we hope...)
XHTML Basic
- XHTML Basic is the first XHTML family-member to be defined using Modularisation
- It is designed for small devices, typically mobile telephones
XHTML Basic Modules
- Structure*: body, head, html, title
- Text*: abbr, acronym, address, blockquote, br, cite, code, dfn, div, em, h1, h2, h3, h4, h5, h6, kbd, p, pre, q, samp, span, strong, var
- Hypertext*: a
- List*: dl, dt, dd, ol, ul, li
- Basic Forms: form, input, label, select, option, textarea
- Basic Tables: caption, table, td, th, tr
- Image: img
- Object: object, param
- Metainformation: meta
- Link: link
- Base: base
XHTML Basic usage
<!DOCTYPE html PUBLIC
"-//W3C//DTD XHTML Basic 1.0//EN" "http://www.w3.org/TR/xhtml-basic/xhtml-basic10.dtd">
XHTML 1.1
- XHTML 1.1 is the second family member to be defined using Modularisation
- Its main aim is to present a cleaned-up, non-transitional version of XHTML 1.0 strict (no frames)
- It also adds Ruby markup
- Otherwise: no new functionality
XHTML 1.1 Modules
- Structure, Text, Hypertext, List, Object, Presentation, Edit, Bi-directional Text, Forms, Tables, Image, Client-side Image Map, Server-side Image Map, Intrinsic Events, Metainformation, Scripting, Stylesheet, Style Attribute (deprecated), Link, Base, Ruby.
Example XHTML 1.1
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" >
<head> <title>Virtual Library</title> </head>
<body>
<p>Moved to <a href="http://vlib.org/">vlib.org</a>.</p>
</body>
</html>
Ruby
Example Ruby markup
<ruby>
<rb>WWW</rb>
<rp>(</rp><rt>World Wide Web</rt><rp>)</rp>
</ruby>
Use CSS to describe presentation
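The rp elements exist for user agents that do not know about ruby: such an agent just shows the text content, and the parentheses keep the annotation readable. The Python sketch below imitates both kinds of user agent on the markup above; the variable names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

ruby = ET.fromstring(
    "<ruby><rb>WWW</rb><rp>(</rp><rt>World Wide Web</rt><rp>)</rp></ruby>"
)

# A user agent that knows nothing about ruby shows all the text in order,
# so the <rp> parentheses end up around the annotation.
fallback = "".join(ruby.itertext())
print(fallback)  # WWW(World Wide Web)

# A ruby-aware user agent ignores <rp> and pairs the base text with its annotation.
base = ruby.findtext("rb")
annotation = ruby.findtext("rt")
print(base, "/", annotation)  # WWW / World Wide Web
```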
XHTML 2.0
- XHTML 2.0 is still in preparation
- New forms
- New events
- More accessibility
Forms
- Being produced by a separate group
- Consists of three parts:
  - data model
  - instances
  - user interface
- Will include much more client-side checking
- Form data will be sent to the server as XML
- Separates content from presentation (e.g. a radio button and a select box both allow you to select one from many, and you may want to use different choices on different devices)
Events
- Current events are almost all in terms of the mouse: onclick, onmouseover, onfocus, etc.
- Future event model will be device independent, and allow you to define your own new events
- Uses the DOM event model
The DOM
- Document Object Model: how you access a document via scripting
- Currently only an XML DOM
- An XHTML DOM is being investigated
Accessibility and Internationalisation
- W3C has an accessibility group that checks that new recommendations address people with accessibility needs
- There is also an internationalisation group that does the same for cultural issues (which produced <ruby>)
Accessibility problems
- A sighted person can work out the structure from the visual presentation
- A non-sighted person cannot: the structure must be present in the markup
- That is why new features were added to forms and tables in HTML 4, like <caption>
Structure
- Text would also benefit from such a treatment: not h1, h2 etc (which are subject to misuse) but nested sections with their own headings
Example of structure
<section>
<h>XHTML</h>
...
<section>
<h>Structure</h>
...
</section>
</section>
CSS can still handle it
section h { how an h1 should look }
section section h { h2 }
section section section h { h3 }
etc.
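With nested sections the heading level is no longer written in the tag name; it follows from how deeply the heading is nested, which is what the selectors above rely on. The Python sketch below computes that level; the body wrapper element and the function name heading_levels are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

doc = """
<body>
  <section>
    <h>XHTML</h>
    <section>
      <h>Structure</h>
    </section>
  </section>
</body>
"""

def heading_levels(el, depth=0):
    """Yield (level, text) for each <h>; level is the number of enclosing sections."""
    for child in el:
        if child.tag == "h":
            yield depth, child.text
        elif child.tag == "section":
            yield from heading_levels(child, depth + 1)
        else:
            yield from heading_levels(child, depth)

print(list(heading_levels(ET.fromstring(doc))))
# [(1, 'XHTML'), (2, 'Structure')]
```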
Conclusions
- XML with related technologies gives you the freedom to define and deliver your own document types
- HTML is still needed as a base-line markup
- The new HTML gives a transition path to the future
The State of Things
- New generation of XML+CSS browsers emerging
- Many XML applications appearing
- Major companies planning XML as output (Adobe PDF, MS Office 2000)
- Now: HTML4, XHTML 1.0, Modularisation, Basic, 1.1
To Find Out More
- All XHTML developments are made public at www.w3.org/Markup
- Members of W3C can also look at www.w3.org/Markup/Group