On the Representation of Abstractions

Steven Pemberton, CWI, Amsterdam

The Author



The principle aim of ixml is to extract structured abstractions out of representations.

Even though the main use at the moment is to serialise those abstractions as XML, once you have that abstraction, there are many things you could do with it.

For instance, my implementation of ixml serialises ixml directly into memory in the form expected by the Earley parser.

XML processors could do the same, and have ixml as a pre-processor that serialises (non-XML) documents directly into an internal memory form, such as XDM.


Three examplesNumbers are abstractions: you can't point to the number three, just three bicycles, or three sheep, or three self-referential examples.

Three is what those bicycles and sheep and examples have in common.


You can represent a number in different ways:

3, III, 0011, ㆔, ३, ፫, ૩, ੩, 〣, ೩, ៣, ໓, Ⅲ, ൩, ၃, ႓, trois, drie.

You can concretise numbers as a length, a weight, a speed, a temperature.

But in the end, they all represent the same three.

Data is an abstraction too!

We are often obliged for different reasons to represent data in some way or another.

But in the end those representations are all of the same abstraction; there is no essential difference between the JSON

{"temperature": {"scale": "C"; "value": 21}}

and an equivalent XML

<temperature scale="C" value="21"/>



or indeed

temperature: 21°C

since the underlying abstractions being represented are the same.


However, I want to talk about the external representation of abstractions, and in particular in XML, which is currently the best available standardised notation available for such a purpose.

To represent abstractions, we need notations.

And notations need to be designed.


I wasn't involved with the design of XML: I just watched it from the sidelines.

Since HTML had become so successful, attention had focussed on SGML, which the first version of HTML had mostly used, and later versions completely used.

SGML is a complex language (it is rumoured that it has never been completely implemented).

It was decided to simplify SGML to produce a subset, and call it XML.

I was tasked with leading the group who were required to produce the XML representation of HTML, and so that design was going to be important for us.

And they did a good job. It is hard to avoid the second system effect.


$ clocks represented differently in ViewsHowever, I had just come from a project designing an "environment" called Views.

This was a system that amongst other things had:

This data language had no 'standard' external representation: you used the stylesheets to display the data in any way you wanted.

In other words, we started with the abstractions -- the design of the data model -- and representations followed.

The Design of XML

So I did know about the requirements for data representation.

As I watched the development of XML, I used to ask myself "I wonder why they did that?"

Years later I came on this quote from Tim Bray, one of the developers of XML:

"You know, the people who invented XML were a bunch of publishing technology geeks, and we really thought we were doing the smart document format for the future. Little did we know that it was going to be used for syndicated news feeds and purchase orders."

When I read that, a light went on. That explained a lot.


There have been many proposals to redesign XML, or design alternatives.

Tim Bray himself has said

Just drop the <!DOCTYPE>. Immense numbers of problems vanish. Mind you, the MathML crowd will come over to your house and strangle your kittens. But really, in the whole world they're the only ones who care; if I were XML dictator, I'd drop it in a microsecond.

That was in a comment to a proposal called XML5 to redesign XML (apparently only in order to 'improve' its error recovery!)

Microxml: Proposed by James Clark as smaller and 'dramatically' simpler; allow error-recovery.

Ftanml: Michael Kay's exercise at designing something better. Good at data and structured documents; concise; human-readable; data-model applicable to programming languages. Deal with the whitespace problem.

Binary XML: Several versions; "compacter".

XML 1.1: Changed the definition of names, and allowable characters.

Error recovery

A couple of the redesigns mentioned specifically error recovery.

Information representation is a kind of contract. You want the receiver to receive exactly what you sent.

If the receiver silently tries to guess what you really meant, that contract has been broken.

Error recovery is an example of the robustness principle

Robustness Principle

The Robustness Principle (also known as Postel's Law) states:

"Be conservative in what you produce,
be liberal in what you accept from others"

The Robustness Principle was proposed in order to improve interoperability between programs, which has been expressed by one wag as:

"you’ll have to interoperate with the implementations of others with lower standards. Do you really want to deal with those fools? Better to silently fix up their mistakes and move on."

The robustness principle can be useful, but it's not a law! You should be careful where you apply it.

Robustness and programming languages

Luckily, in general, the robustness principle is not applied to programming languages.

Imagine if the compiler tried to guess what you really meant when it came across an error!

Actually, it was tried in PL/I, and it was terrible. The only way you could find mistakes was to run tests, and as Dijkstra famously observed:

“Program testing can be used to show the presence of bugs, but never to show their absence!”

It is much better for the compiler to tell you about the problems before they occur.

Studies have shown that 90% of the cost of software comes from debugging, so reducing the need for debugging is really important.


Unfortunately the robustness principle has also been applied in Javascript. Did you know that

++[[]][+[]]+[+[]] evaluates to the string "10" ?

Javascript is for that reason very hard to debug, since it silently accepts certain classes of errors, that then don't show up until much later.


I literally spent an afternoon debugging this problem in someone else's code.

The Javascript was calculating the number of seconds from a number of hours and minutes. The essence of the code was this:

seconds = 60 * (hours * 60 + minutes)

I knew that hours was 1, and minutes was 0. So the answer should be 3600, right?

I was getting 10 times that: 36000.

Can you work out why?


I literally spent an afternoon debugging this problem in someone else's code.

The Javascript was calculating the number of seconds from a number of hours and minutes. The essence of the code was this:

seconds = 60 * (hours * 60 + minutes)

I knew that hours was 1, and minutes was 0. So the answer should be 3600, right?

I was getting 10 times that: 36000.

Can you work out why?

It turned out that hours and minutes (which were set very far away from this bit of code), were actually strings, not numbers.

So hours*60 converted hours to a number, and multiplied it by 60, giving 60.

That 60 was then converted to a string, and concatenated with minutes, which was "0", giving "600". This was then converted to a number, and multiplied by 60, giving 36000.

Error correction in data notations

We never had the need for error correction in the Views system.

How programmers work: they recompile until the program compiles, then they run it.

Just like with programming, if you displayed a document that didn't match its style-sheet, it displayed as much as it could, and a big red blob where it went wrong.

With data representations too, you fix them until they display correctly; then you send them to others.

Without error correction, all documents are correct, and you know when one isn't.

With error correction, if it looks right it gets sent on, and everyone has to fix it up.


A widely overlooked aspect of the design of notations is usability. A generally accepted definition of usability is

Some definitions add learnability to this.


As an example of how usability can affect notations, consider Roman numerals,


These are mostly fine for representing numbers, just about acceptable for addition, but terrible for multiplication, which was a university subject until the introduction of indo-arabic numerals, and disastrous for division.

Indo-arabic numerals reduce the time needed for performing these operations, reduce the errors made while doing them, and consequently are more pleasurable to use.

Nowadays even schoolchildren can do multiplication and division.


Another example is the two-letter abbreviations for state names in postal addresses in the USA.

For example, is "NE" the code for Nevada or Nebraska? It is Nebraska, but if NB had been used instead, there would be no need to ask.

Similarly for MI: Mississippi, Missouri, Michigan, or Minnesota? It's Michigan, but MG would have been a better choice.

But this only solves the problem for reading. To write an abbreviation, you would still have to look it up, or memorise it.

On the other hand, a straightforward rule like

"For states with a two word name, use the initials of those words, otherwise use the first and fourth letter of the state name, unless there is a repeated letter in the first four, and then use the first and the fifth"

generates a unique set of two-letter abbreviations that you don't have to memorise, and are easy to back-translate to the original state name.

Usability in XML

An example of good usability in XML is the requirement that attribute values should always be quoted:

<doom end="nigh"/>

HTML for instance doesn't have this as a requirement: the quotes are optional in certain circumstances. The problem is that you have to know and memorise the conditions under which they can be omitted.

To make matters worse, HTML processors don't tell you when you get it wrong, but try to guess what you meant and fix it up.

Needless to say this causes other errors, and slows you down trying to find the source of the error. To make matters even worse, CSS and Javascript, that can be embedded in HTML, have different rules for quoted tokens.


Another example of good usability in XML is the requirement that the closing tag of an element be the same as the starting tag. This helps you narrow down far faster and more directly to the problem when there is an error in the bracketing, whilst languages that use a single generic closing bracket only reveal the error very far away from its source.


What I now want to talk about is where XML gets in the way of properly representing abstractions, and how we could fix that.

One aim: existing documents remain valid with the same meaning.


Part of the desiderata for the Views system was that there should be a simple algebra of documents.

For instance, if you have two documents A and B, then


is also a document.

Furthermore there is a 'zero' document ∅ such that

A+∅ = ∅+A = A

Also we wanted composabilty, so that a documents could be embedded in others.

Text has these properties.

XML has none of them.


XML bans certain characters; from a print documents perspective, this is reasonable.

From a content perspective, it is not, and means there are documents you would like to represent, that you can't.

For instance, for a table of keybindings, you are forced to use an encoding for control characters, which you then have to process separately

<bind char="^A" name="SOH" rep="␁" funct="beginning-of-line"/>
<bind char="^B" name="STX" rep="␂" funct="back"/>
<bind char="^G" name="Bell" rep="␇" funct="cancel"/>

A solution, also used by XML 1.1, would be to allow character entities for any character, though not the character itself:

<bind char="&#1;" name="SOH" rep="␁" funct="beginning-of-line"/>
<bind char="&#2;" name="STX" rep="␂" funct="back"/>
<bind char="&#7;" name="Bell" rep="␇" funct="cancel"/>

There would still remain the problem of &#0;


What are attributes for? I could find no conclusive answer to this question. Are they metadata about an element?

They seem to be special cases of element children, but with differences:

It is difficult to understand the functional reason for the unorderedness, but the last restriction is surprising. You can have


but not

<p class="block" class="important" class="frame"> ... </p>

The result is that you get languages where attributes have (hidden) internal structure in order to overcome this restriction:

<p class="block important frame"> ... </p>


Whitespace is a recurring problem with XML, the main problem being identifying whether whitespace in content is really content or just for readability purposes.

Weirdly, the XML spec only allows for two values of whitespace treatment: preserve, and default.

In XHTML, we were lucky that we only wanted preserve: CSS provides values for how spaces whould be displayed.

In ixml (the language) we solved it by putting all content into attributes, so the question doesn't arise.

In XForms we can assign a value a whitespace property, and had use cases for the following values

Seems like reasonable values for the xml:space attribute.


XML has a rule that a document must have a single root element.

This would be like a filestore that only allows a single directory at the top-level.

XPath lets you select a group of elements, such as


which would give you all the top-level elements in an HTML document.

In XForms, I often use


to give me the topmost element, whatever it's called.

What this shows is, just like filestores, XML already has a top-level root, namely "/", and it doesn't really need to restrict it further.

An XML Server

I wrote an XML server using REST protocol, where every 'resource' was an XML document.

What I wanted was a POST to append the incoming document at the end of another one:


Posting another <member> element would just add it to the end.


This is unnecessarily hard. There is no reason why


shouldn't be an acceptable XML document, and then I could use the XPath


to get the set of top-level elements.

Similarly the empty document should be a valid XML document.

This would give you the simple algebra I talked about earlier.

If you still want a single-rooted document, that's fine.


XML was designed as "the smart document format for the future".

We are now using it for purposes outside of those design criteria.

So the need for change shouldn't be seen as a criticism of the original design, just a recognition that the design closely enough matches the new purposes that it can be adapted for them.

But those new purposes really do need those changes.

The advantage of the changes suggested here are that existing XML documents remain XML documents, with the same meaning, in the new definition, while allowing greater expressibility for the representation of abstractions.