The author

Invisible XML

Steven Pemberton, CWI, Amsterdam

XML World Domination!

XML and Authoring

XML is a popular format.

Advantages: toolchain and pipeline available for generic XML processing. You can easily use new formats within the generic framework.

But...

However, for authoring purposes XML is seldom preferred over a notation more directly suited to the purpose.

People prefer the direct

body {color: blue}

to

<rule><simple-selector name="body"/>
<block><property name="color" value="blue"/>
</block></rule>

Or

if (max<a) then max=a;

to

<statement><if><condition><comparison name="&lt;"><var name="max"><var name="a"></comparison></condition>
<then><statement><assign><var name="max"/>
<expression><var name="a"/>
</expression></assign></statement></then></if>
</statement>

Or

Similarly, authoring languages designed to make HTML input easier, such as for Wikipedia, or Markdown do not use a nested SGML/HTML style.

And doesn't RELAX NG have both an XML syntax and a 'compact' syntax?

Even XML Formats...

<a href="http://www.w3.org/TR/1999/xhtml">XHTML</a>

does not surface the real structure of the underlying data.

To make the relevant structure explicit, we should really write something like:

<a><href><method type="href"/>
<domain name="org"/><site name="w3"/><sub name="www"/><path><root><sub name="TR">
<sub name="1999">
<sub name="xhtml">
</sub></sub></sub></root></path></href>
<text>XHTML</text></a>

The Approach

Add a step to the XML processing chain.

Standard

XML parser

Option 1

ixml pipeline

Option 2

Alternative ixml pipeline

Example

In other words, the input document might be

body {color: blue}

but the result will be the same as if an XML parser had been presented with the XML document

<css>
   <rule>
      <simple-selector name="body"/>
      <block>
         <property name="color" value="blue"/>
      </block>
   </rule>
</css>

Syntax definition

We use a variant of VWG format. This looks like:

css: rules.
rules: rule; rules, rule.
rule: selector, block.
block: "{", properties, "}".
properties:  property; property, ";", properties.
property:  name, ":", value; empty.
selector: name.
name: ...
value: ...
empty: .

(Simplified CSS grammar for the sake of this talk).

Parsing

body {color: blue}

Parsing this with the grammar gives us:

<css>
   <rules>
      <rule>
         <selector>body</selector>
         <block>
            <properties>
               <property>
                  <name>color</name>
                  <value>blue</value>
               </property>
            </properties>
         </block>
      </rule>
   </rules>
</css>

Terminal symbols such as the brackets, colons and semicolons don't appear in the parse tree. (More later.)

Problems

There are certain elements in the tree (rules, properties) that we really aren't interested in:

<css>
   <rules>
      <rule>
         <selector>body</selector>
         <block>
            <properties>
               <property>
                  <name>color</name>
                  <value>blue</value>
               </property>
            </properties>
         </block>
      </rule>
   </rules>
</css>

Problems

The problem becomes even more apparent with a CSS snippet like

body {color: blue; font-weight: bold}

since the <block> element then becomes even more unwieldly:

<block>
   <properties>
      <property>
         <name>color</name>
         <value>blue</value>
      </property>
      <properties>
         <property>
            <name>font-weight</name>
            <value>bold</value>
         </property>
      </properties>
   </properties>
</block>

We would prefer

<block>
   <property>
      <name>color</name>
      <value>blue</value>
   </property>
   <property>
      <name>font-weight</name>
      <value>bold</value>
   </property>
</block>

Repetition

The problem comes from using recursion to deal with repetition. So we shall use a specific notation for repetition. Zero or more repetitions:

(rule)*

and one or more repetitions:

(rule)+

We extend these postfix operators to also act as infix operators, for a commonly occurring case:

(property)*";"
(property)+";"

which respectively mean "zero or more, separated by semicolon" and "one or more, separated by semicolon" (the separator may also be a nonterminal).

Revised syntax

Now we can specify our syntax as:

css: (rule)*.
rule: selector, block.
selector: name.
block: "{", (property)*";", "}".
property:  name, ":", value; .
name: ...
value: ...

and the parsetree will now look like this:

<css>
   <rule>
      <selector>body</selector>
      <block>
         <property>
            <name>color</name>
            <value>blue</value>
         </property>
         <property>
            <name>font-weight</name>
            <value>bold</value>
         </property>
      </block>
   </rule>
</css>

Terminals

In general, terminal symbols do not appear in the parse-tree.

If you want them to show up, you add an explicit rule for them:

property:  name, colon, value; .
colon: ":".

which will cause them to show up in the tree like this:

 <property>
     <name>color</name>
     <colon/>
     <value>blue</value>
 </property>

Meaningful terminals

However, there are places where terminals have semantic meaning, and you do want them to appear in the parse-tree, e.g. the names and values of the properties.

Terminals that are to be copied to the parse tree are marked specially:

name: (+"a"; +"b"; ...etc...; +"9"; +"-")+.

In other words, terminals are discarded, unless they are preceded with a +, when they are copied to the parse-tree.

Refinements

A refinement is when a syntax rule doesn't represent anything of semantic importance, but there to shorten definitions

Suppose we wanted to define a series of properties in a separate rule:

properties: (property)*";".

and use it:

block: "{", properties, "}".

but not want <properties> to appear in the final parse tree.

Rule: If we preceed the use of a rule by a "-", it does not appear in the parse tree:

properties: (property)*";".
block: "{", -properties, "}".

This would result in the same parse-tree as above.

Note that this still allows a rule to be used in other places and appear in the parse tree if needed.

It is only the rule name that doesn't appear in the tree; its content still appears.

Spaces

For simplicity we have ignored treating spaces in the syntax description. Fixing:

property:  name, -colon, value; .
colon: -spaces, ":", -spaces.
spaces: " "*.

Similarly, we can use it to make empty alternatives more explicit:

property: name, -colon, value; -empty.
empty: .

Extensions

Strictly speaking, this is sufficient.

However, there are possible extensions that give you a little more control over the result.

The most obvious is allowing the specification of attributes. This is simply done by marking the use of rules with at signs:

css: (rule)*.
rule: selector, block.
block: "{", (property)*";", "}".
property: @name, -colon, value.

A rule used like this may clearly not contain any structural elements (though it may contain terminals and refinements), since attributes are not structured, but this is an easy condition to check for.

Parsetree with attributes

The parsetree will now look like this:

<css>
   <rule>
      <selector>body</selector>
      <block>
         <property name="color">
            <value>blue</value>
         </property>
         <property name="font-weight">
            <value>bold</value>
         </property>
      </block>
   </rule>
</css>

Parsetree with attributes

If we changed the rule for property to look like this:

property: @name, -colon, @value.

then the resultant parse-tree would look like

<css>
   <rule>
      <selector>body</selector>
      <block>
         <property name="color" value="blue"/>
         <property name="font-weight" value="bold"/>
      </block>
   </rule>
</css>

Note that by marking the use of a syntax rule in this way, and not the definition, it allows the syntax rule to be used for structural elements (<name>color</name>) as well as for attributes (name="color").

Parsing Algorithms

You could restrict the languages to LL(1) or LR(1) to ensure fast parsing. But:

Parsing algorithms that can parse any context-free grammar are fast enough (e.g. Earley, CYK, or GLR).

The only remaining problem is if the syntax author describes an ambiguous language: output one of the parses, and leave it at that.

Example ambiguous grammar

If expression were defined as:

expr: i; expr, div, expr.
i: "i".
div: "÷".

then a string such as

i÷i÷i

could be parsed as both

<expr><i/></expr>
<div/>
<expr>
   <expr><i/></expr>
   <div/>
   <expr><i/></expr>
</expr>

and as

<expr>
   <expr><i/></expr>
   <div/>
   <expr><i/></expr>
</expr>
<div/>
<expr><i/></expr>

(i.e as both i÷(i÷i) and (i÷i)÷i ).

Delivery

To deliver a source document to be parsed, we can use a media type that supplies a reference to the required syntax description. For instance:

application/xml-invisible; syntax=http://example.com/syntax/css

Clearly a system can cache well-known syntax descriptions.

Defining Invisible XML with Invisible XML

It should go without saying that the syntax descriptions themselves are in Invisible XML (though in their case the syntax description must be cached to prevent an infinite loop of processing.)

Definition

The definition might look like this:

ixml: (rule)+.
rule: @name, -colon, -definition, -stop.
definition: (alternative)*-semicolon.
alternative: (-term)*-comma.
term: -symbol; -repetition.
repetition: one-or-more; zero-or-more.
one-or-more: -open, -definition, -close, -plus, separator.
zero-or-more: -open, -definition, -close, -star, separator.
separator: -symbol; -empty.
empty: .
symbol: -terminal; nonterminal; refinement.
terminal: explicit-terminal; implicit-terminal.
explicit-terminal: -plus, @string.
implicit-terminal: @string.
nonterminal: @name.
refinement: -minus, @name.
attribute: -at, @name.

Terminals

string: -openquote, (-character)*, -closequote.
name: (-letter)+.
letter: +"a"; +"b"; ...
character: ...

colon: -S, ":", -S.
stop: -S, ".", -S.
semicolon: -S, ";", -S.
comma:  -S, ",", -S.
plus:  -S, "+", -S.
minus:  -S, "-", -S.
star:  -S, "*", -S.
open:  -S, "(", -S.
close:  -S, ")", -S.
at:  -S, "@", -S.
openquote: -S, """".
closequote: """", -S.
S: " "*.

Parse tree

This would then parse to the XML form:

<ixml>
   <rule name="ixml">
      <alternative>
         <one-or-more>
             <alternative>
                <nonterminal name="rule"/>
             </alternative><separator/>
         </one-or-more>
      </alternative>
   </rule>
   <rule name="rule">
      <alternative>
         <attribute name="name"/>
         <refinement name="definition"/>
      </alternative
   </rule>
   <rule name="definition">
      <alternative>
         <zero-or-more>
            <alternative>
               <nonterminal name="alternative"/>
            </alternative>
            <separator>
               <refinement name="semicolon"/>
            </separator>
         </zero-or-more>
      </alternative
   </rule>
   ... etc ...
   <rule name="separator">
      <alternative>
         <refinement name="symbol"/>
      </alternative>
      <alternative>
         <refinement name="empty"/>
      </alternative>
   </rule>
   ...
   <rule name="colon">
      <alternative>
         <refinement name="S"/>
         <implicit-terminal string=":"/>
         <refinement name="S"/>
      </alternative>
   </rule>
   ... etc ...
</ixml>

Cleaning up

Thanks to the context-free parsing algorithm, we can remove the <alternative> elements when there is only one alternative in a rule, by redefining definition:

definition: -alternative;
             alternative, -semicolon, (alternative)+-semicolon.

Note how we have used the "-" character to prevent it being copied in the first case (when there is only one).

You wouldn't be able to use such a rule as this if there were a requirement on the syntax to be LL(1) or LR(1), since the two parts of the rule start with the same symbols.

Cleaning up

Similarly, we can get rid of empty <separators/> thusly:

one-or-more: -open, -definition, -close, -plus;
             -open, -definition, -close, -plus, separator.
zero-or-more: -open, -definition, -close, -star;
              -open, -definition, -close, -star, separator.
separator: -symbol.

We can move the value of the separator into an attribute with:

separator: @explicit; @implicit; @nonterminal; @refinement.
explicit: -plus, -string.
implicit: -string.

Parse tree

This would then generate:

<ixml>
   <rule name="ixml">
      <one-or-more>
         <nonterminal name="rule"/>
      </one-or-more>
   </rule>
   <rule name="rule">
      <attribute name="name"/>
      <refinement name="definition"/>
   </rule>
   <rule name="definition">
      <alternative>
         <refinement name="alternative"/>
      </alternative>
      <alternative>
         <nonterminal name="alternative"/>
         <one-or-more>
            <nonterminal name="alternative"/>
            <separator refinement="semicolon"/>
         </one-or-more>
      </alternative>
   </rule>
   ... etc ...
   <rule name="separator">
      <alternative><refinement name="symbol"/></alternative>
      <alternative><refinement name="empty"/></alternative>
   </rule>
   ... etc ...
</ixml>

(An observant reader will have spotted that we have allowed attributes to be defined by attributes here -- for instance with @refinement -- that is we treat an attribute within an attribute definition as if it were a refinement).

Separator

As yet another possibility, we can move the separator into an attribute of the one-or-more or zero-or-more elements:

one-or-more: -open, -definition, -close, -plus;
             -open, -definition, -close, -plus, -separator.
zero-or-more: -open, -definition, -close, -star;
              -open, -definition, -close, -star, -separator.
separator: @explicit; @implicit; @nonterminal; @refinement.
explicit: -plus, -string.
implicit: -string.

Alternative Representation

Too many "-"s!

Instead: nothing is copied to the syntax tree unless specifically marked with "^" :

ixml: (^rule)+.
rule: @name, colon, definition, stop.
definition: alternative; ^alternative, semicolon, (^alternative)+semicolon.
alternative: (term)*comma.
term: symbol; repetition.
repetition: ^one-or-more; ^zero-or-more.
one-or-more: open, definition, close, plus;
             open, definition, close, plus, ^separator.
zero-or-more: open, definition, close, star;
              open, definition, close, star, ^separator.
separator: terminal; @nonterminal; 
refinement.
symbol: terminal; ^nonterminal; ^refinement.
terminal: ^explicit-terminal; ^implicit-terminal.
explicit-terminal: up, @string.
implicit-terminal: @string.
nonterminal: up, @name.
refinement: @name.
attribute: at, @name.

Terminals

string: openquote, (character)*, closequote.
name: (letter)+.
letter: ^"a"; ^"b"; ...
character: ...

colon: S, ":", S.
stop: S, ".", S.
semicolon: S, ";", S.
comma:  S, ",", S.
plus:  S, "+", S.
up:  S, "^", S.
star:  S, "*", S.
open:  S, "(", S.
close:  S, ")", S.
at:  S, "@", S.
openquote: S, """".
closequote: """", S.
S: " "*.

An ambiguity

This grammar is actually ambiguous in a trivial way, since with any two adjacent terminal symbols with spaces between them, the spaces can be associated with either of the terminals.

This ambiguity is trivial in the sense that the resulting parse trees would be identical, and so has no effect on the output.

However, the ambiguity would increase parse time.

Fix by removing all trailing uses of S in the rules, and adding an S to the rules for name and value:

name: S, (letter)+.

(or alternatively, removing all leading S's, and add a trailing S for name and value).

Usability

A remark that should be made is that the design reason for the introduction of the repetition operators "+" and "*" is no longer there. That is, for instance,

ixml: (^rule)+.

could be written as

ixml: ^rule; ixml, ^rule.

and

alternative: (term)*comma.

could be written

alternative: empty; terms.
terms: term; term, comma, terms.

and would give the same output. Nevertheless, the constructs are worth keeping for supporting these often-occurring idioms.

Identifiers

Modern programming languages don't allow spaces in identifiers: one-or-more, one_or_more, oneOrMore.

But programming languages like Fortran and the Algols did allow spaces.

There is no reason per se to restrict rule names in Invisible XML to be spaceless. There is no ambiguity in:

repetition: ^one or more; ^zero or more.
one or more: open, definition, close, plus; 
             open, definition, close, plus, ^separator.

we just have to be careful to remove the spaces in the generated XML:

name: word+spaces.
spaces: " "+.
word: letter+.
letter: ^"a"; ^"b"; ^"c"; ...

Extras

There are obvious extra odds and ends that need adding, such as sets of characters, to make terminal specification easier, for instance:

letter: ^["a"-"z", "A"-"Z", "-"].
S: [" ", "\t", "\n", ...]*.

but these are just details.

Restrictions on the XML Produced

You can turn any textual document into an equivalent XML document.

However, it is not in general possible to turn a textual document into a particular XML form without more work.

For instance, you could turn Wikipedia markup into an XML document, but not into XHTML in particular. And

{"a": 1, "b": 2}

can't be transformed into

<j><a>1</a><b>2</b></j>

Roundtripping

The input document and the generated XML are (in principle) isomorphic.

Returning the generated XML to its original format is just a process of presentation.

Use XSLT or even CSS in simple cases.

Straightforward to automatically generate the required piece of XSLT directly from the Invisible XML definition.

Alternatively, you could see the syntax definition as a style language for the trees.

Conclusion

Everything is now XML!

World domination is finally ours!