Not a talk about ixml, but a keynote I gave on the design of notations.
Where I said, in passing:
"Parsing is quite easy
It would be fairly easy to add a generalised part to the XML pipeline that parsed unmarked-up text, and produced XML as a parse tree: it's just a different sort of transform.
We could have our cake and eat it!"
Two best reviews ever!
"This is clearly a submission that needs to be shredded, burned, and the ashes buried in multiple locations"
"I think the audience will eat him alive. But I want to be there to hear it."
Two best reviews ever!
"This is clearly a submission that needs to be shredded, burned, and the ashes buried in multiple locations"
"I think the audience will eat him alive. But I want to be there to hear it."
That notwithstanding, it did get accepted:
We will also ask you to:
- revise your paper, taking into consideration the comments and suggestions of the peer review panel
2016, 2018: Iterations, based on user experience
2020: Working group formed
2022: Formal specification published
2024: 5 or 6 implementations, with others in development
Most textual data has an implicit structure.
Dates:
8/11/2024
URLs:
http://cwi.nl/~steven/Talks/2024/11-08-banking/
Bibliographies:
Steven Pemberton, Banking with ixml and XForms, Declarative Amsterdam 2024.
ixml turns data with implicit structure into data with explicit structure.
<date> <day>8</day> <month>11</month> <year>2024</year> </date>
You describe the format.
date: day, -"/", month, -"/", year. day: digit, digit?. month: digit, digit?. year: digit, digit, digit, digit. -digit: ["0"-"9"].
Feed the format and the data through the ixml processor.
Get structured data!
Feed the format and the data through the ixml processor:
Get more data into XForms:
XForms expects data in XML, and many sources of data aren't in XML
Define XForms programs in text:
It can be faster, easier, and more human-oriented to write programs in text, and let the computer convert it to XML.
My bank has stopped sending paper statements.
It offers online statements instead, either in PDF for printing yourself, or CSV.
It does offer some online search facilities, but is slow, and inconvenient.
I wanted something similar to how Quicken used to work.
So I wrote an application in XForms.
XForms is a declarative, XML-based, programming language
It is a declarative rather than a procedural language: this makes life much easier for the programmer. It is like spreadsheets but generalised.
Large projects have shown huge productivity gains:
To write the app in XForms I first had to get the data into a usable form.
As I said, the bank supplied PDF, or CSV, so CSV it had to be.
Use sed?
"Date","Name / Description","Account","Counterparty","Code","Debit/credit","Amount (EUR)","Transaction type","Notifications" "20161230","The Movies Art House AMSTERDAM","NL80INGB1234567890","","BA","Debit","11,00","Payment terminal","Card sequence no.: 009 29/12/2016 18:21 Transaction: 90B7V1 Term: FWG77T Value date: 30/12/2016" "20161229","DIJKMAN B.V. MUZIEK AMSTERDAM","NL80INGB1234567890","","BA","Debit","99,00","Payment terminal","Card sequence no.: 009 28/12/2016 15:59 Transaction: 48X1S3 Term: 465101 Value date: 29/12/2016" "20161227","CCV*FOODHALLEN AMSTERD AMSTERDAM","NL80INGB1234567890","","BA","Debit","6,25","Payment terminal","Card sequence no.: 009 24/12/2016 17:56 Transaction: 84P101 Term: CT449077 Value date: 27/12/2016"
The sed was largely opaque:
head -n 2 $1 | tail -1 | sed 's/"\(....\)[^,]*,[^,]*,"NL..INGB\([^"]*\)".*/<bank year="\1" acct="\2">/' tail --lines=+2 $1 | sed ' s/^/<entry>/ s/&/&/g s/"\(....\)\(..\)\(..\)",/<date>\1-\2-\3<\/date>/ s/"\([^"]*\)",/<type>other<\/type><name>\1<\/name>/ s/"\([^"]*\)",/<from>\1<\/from>/ s/"\([^"]*\)",/<to>\1<\/to>/ s/"\([^"]*\)",/<code>\1<\/code>/ s/"Af","\([^,]*\),\([^"]*\)",/<amount>-\1.\2<\/amount>/ s/"Bij","\([^,]*\),\([^"]*\)",/<amount>\1.\2<\/amount>/ s/"\([^"]*\)",/<sort>\1<\/sort>/ s/"\([^"]*\)"/<description>\1<\/description>/ s/$/<\/entry>/ ' echo '</bank>'
The ixml version at least is more readable.
bank: -head, entry*. head: -'"Date","Name / Description","Account","Counterparty","Code","Debit/credit","Amount (EUR)","Transaction type","Notifications"', nl. entry: date, type, name, from, to, code, amount, sort, description, nl. type: +"other". date: -'"', y, +"-", m, +"-", d, -'"', -",". -y: digit, digit, digit, digit. -m: digit, digit. -d: digit, digit. name: field, -",". from: field, -",". to: field, -",". code: field, -",". sort: field, -",". description: field. -field: -'"', c*, -'"'. -c: ~['"'; #a; #d]. amount: neg; pos. -neg: -'"Af",', +"-", number, -","; -'"Debit",', +"-", number, -",". -pos: -'"Bij",', number, -","; -'"Credit",', number, -",". -number: -'"', euros, -",", +".", cents, -'"'. -euros: digit+. -cents: digit, digit. -digit: ["0"-"9"]. -nl: (-#a; -#d)+.
A bank statement consists of a head (the labels) that is completely ignored, followed by any number of entries:
bank: -head, entry*. head: -'"Date","Name / Description","Account","Counterparty","Code","Debit/credit","Amount (EUR)","Transaction type","Notifications"', nl.
By including the labels literally, if the bank ever changes the format, I will get an immediate error.
Each entry has a number of fields:
entry: date, type, name, from, to, code, amount, sort, description, nl.
I have added an extra field, type
, that my application will
use. In the input it is empty, and will appear in the output as a preset
field.
type: +"other".
will always appear as
<type>other</type>
which will later get changed in the application.
A date is a field with a string of numbers in the input, like
"20161230"
so I add hyphens at the right places:
date: -'"', y, +"-", m, +"-", d, -'"', -",". -y: digit, digit, digit, digit. -m: digit, digit. -d: digit, digit.
which would give
<date>2016-12-30</date>
I don't need date structures to be any more refined than that.
A number of the fields just contain unstructured character data:
name: field, -",". from: field, -",". to: field, -",". code: field, -",". sort: field, -",". description: field. -field: -'"', c*, -'"'. -c: ~['"'; #a; #d].
A field is just zero or more characters surrounded by quotes, where a character is anything except a quote or an end-of-line character.
<name>The Movies Art House AMSTERDAM</name>
Finally amounts are recorded in an odd way, using two fields. Banks don't believe in negative numbers, just positive amounts of debit or credit:
"Credit", "11,00" "Debit", "6,25"
Note the European style of number representation.
amount: neg; pos. -neg: -'"Debit",', +"-", number, -",". -pos: -'"Credit",', number, -",". -number: -'"', euros, -",", +".", cents, -'"'. -euros: digit+. -cents: digit, digit. -digit: ["0"-"9"].
Which gives
<amount>11.00</amount>
or
<amount>-6.25</amount>
<bank> <entry> <date>2016-12-30</date> <type>other</type> <name>The Movies Art House AMSTERDAM</name> <from>NL80INGB1234567890</from> <to/> <code>BA</code> <amount>-11.00</amount> <sort>Payment terminal</sort> <description>Card sequence no.: 009 29/12/2016 18:21 Transaction: 90B7V1 Term: FWG77T Value date: 30/12/2016</description> </entry> <entry> <date>2016-12-29</date> <type>other</type> <name>DIJKMAN B.V. MUZIEK AMSTERDAM</name> <from>NL80INGB1234567890</from> <to/> <code>BA</code> <amount>-99.00</amount> <sort>Payment terminal</sort> <description>Card sequence no.: 009 28/12/2016 15:59 Transaction: 48X1S3 Term: 465101 Value date: 29/12/2016</description> </entry> <entry> <date>2016-12-27</date> <type>other</type> <name>CCV*FOODHALLEN AMSTERD AMSTERDAM</name> <from>NL80INGB1234567890</from> <to/> <code>BA</code> <amount>-6.25</amount> <sort>Payment terminal</sort> <description>Card sequence no.: 009 24/12/2016 17:56 Transaction: 84P101 Term: CT449077 Value date: 27/12/2016</description> </entry> </bank>
While my bank will gladly give me downloadable statements for my current account, for some reason it doesn't for my credit card, even though it doesn't send paper statements for that either.
All it does is display my transactions on the screen.
I can select all this text from the screen, and copy it as text to a file.
Then I can use ixml to transform it to structured data.
Transactions current period 2024 Today EUR Kobo Software Ireland D02T380 IE −5.49 22 October EUR THEATRE ROYAL HAYMARKE LONDON GBR −214.62 10 October EUR SP THEPHONESHOPBE JETTE BEL −1,199.00 Period of 5 Sept 2024 to 4 Oct 2024 Opening balance for this period: 0.00 Transaction total: −289.33 Monthly repayment: 289.33 Closing balance for this period: 0.00 2024 4 October EUR AFLOSSING 289.33 3 October EUR OTT* NT AT HOME LONDON GBR −11.39 2 October EUR Google Payment IE LTD Dublin IRL −14.99 24 September EUR TEAPIGS BRENTFORD GBR −140.56 23 September EUR The Pilgrim Hotel London GBR −4.38 22 September EUR The Pilgrim Hotel London GBR −9.18 18 September EUR TRIPIT REDMOND USA −44.99
So let's write some ixml for this. The top level is the current period, followed by a number of earlier periods.
Both contain a number of day
s.
cc: current, period*. current: -"Transactions current period", -#a, day*. period: -"Period of ", from, -" to ", to, -#a, opening, total, repayment, closing, day*. from: -date. to: -date.
This is mostly uninteresting stuff, since the opening and closing balances are always zero, and the repayment amount is in the transactions anyway.
opening: -"Opening balance for this period:", -#a, -amount, -#a. closing: -"Closing balance for this period:", -#a, -amount, -#a. repayment: -"Monthly repayment:", -#a, -amount, -#a. total: -"Transaction total:", -#a, -amount, -#a.
The interesting stuff is in the days. Each day has the date and a number of transactions (the currency is always EUR):
day: date, -#a, (-"EUR", -#a)?, transaction+. transaction: -#a, payee, -#a, amount, -#a. payee: ~[#a]*.
About the only impressive thing about this mess is that they do acknowledge that negative numbers exist, and (correctly) use #2212 as minus sign, which I replace with hyphen.
They use commas to separate thousands, and point for the decimal separator:
amount: (-#2212, +"-")?, (digit; -",")+, ".", digit, digit.
Ugh. Sometimes the year is before, sometimes after, sometimes not at all. Sometimes it's the year and the word "Today":
date: y, -#a, (d, -" ", m; "Today"); d, -" ", m, (-" ", y)?. d: digit, digit?.
Month names are either written in full, or as 3 letters, with the exception of Sept...
m: "January"; "February"; "March"; "April"; "May"; "June"; "July"; "August"; "September"; "October"; "November"; "December"; "Jan"; "Feb"; "Mar"; "Apr"; "May"; "Jun"; "Jul"; "Aug"; "Sept"; "Oct"; "Nov"; "Dec". y: digit, digit, digit, digit. -digit: ["0"-"9"].
<cc> <current> <day> <date> <y>2024</y>Today</date> <transaction> <payee>Kobo Software Ireland D02T380 IE</payee> <amount>-5.49</amount> </transaction> </day> <day> <date> <d>22</d> <m>October</m> </date> <transaction> <payee>THEATRE ROYAL HAYMARKE LONDON GBR</payee> <amount>-214.62</amount> </transaction> </day> <day> <date> <d>10</d> <m>October</m> </date> <transaction> <payee>SP THEPHONESHOPBE JETTE BEL</payee> <amount>-1199.00</amount> </transaction> </day> </current> <period> <from> <d>5</d> <m>Sept</m> <y>2024</y> </from> <to> <d>4</d> <m>Oct</m> <y>2024</y> </to> <opening>0.00</opening> <total>-289.33</total> <repayment>289.33</repayment> <closing>0.00</closing> <day> <date> <y>2024</y> <d>4</d> <m>October</m> </date> <transaction> <payee>AFLOSSING</payee> <amount>289.33</amount> </transaction> </day> <day> <date> <d>3</d> <m>October</m> </date> <transaction> <payee>OTT* NT AT HOME LONDON GBR</payee> <amount>-11.39</amount> </transaction> </day> <day> <date> <d>2</d> <m>October</m> </date> <transaction> <payee>Google Payment IE LTD Dublin IRL</payee> <amount>-14.99</amount> </transaction> </day> <day> <date> <d>24</d> <m>September</m> </date> <transaction> <payee>TEAPIGS BRENTFORD GBR</payee> <amount>-140.56</amount> </transaction> </day> <day> <date> <d>23</d> <m>September</m> </date> <transaction> <payee>The Pilgrim Hotel London GBR</payee> <amount>-4.38</amount> </transaction> </day> <day> <date> <d>22</d> <m>September</m> </date> <transaction> <payee>The Pilgrim Hotel London GBR</payee> <amount>-9.18</amount> </transaction> </day> <day> <date> <d>18</d> <m>September</m> </date> <transaction> <payee>TRIPIT REDMOND USA</payee> <amount>-44.99</amount> </transaction> </day> </period> </cc>
The code is simple: we repeat over the top level elements (either
current
or period
), provide a heading, then repeat
over the days, and in the days, repeat over the transactions:
<repeat ref="*"> <label class="period"> <output value="if(from, concat(from/y, '-', from/m, '-', from/d, ' to ', to/y, '-', to/m, '-', to/d), 'Current')"/> </label> <repeat ref="day"> <output ref="d"/> <output ref="m"/> <repeat ref="transaction"> <output class="amount" ref="amount"/> <output class="payee" ref="payee"/> </repeat> </repeat> </repeat>
ixml makes textual data available to the XML pipeline as structured data
ixml is being added to XPath as a function
Which means that ixml will shortly automatically be available to XForms