Freedom of Information Conference 2000
B. Tommie Usdin, Mulberry Technologies, Inc.
What is XML and why should you care?
This presentation
will be on my web site should anybody want to look at it. I think I'll
start with asking a couple of questions, talk for a few minutes, and then
just do questions on XML. So a few of you know a little bit about XML.
And a lot of you have heard of XML but haven't seen it. Let me do a couple
of minutes on what XML really is.
XML is a
data format, that's all it is. It doesn't do anything, it doesn't make
anything happen, it doesn't solve any problems. It's a data format that
was designed to be multiply usable and platform independent. And what
that means is that for any one of the things you can do with XML there's
a faster, cheaper, better way to do it. Anytime anyone looks at XML and
says, "That's really bulky and cumbersome, I know a better way to
do that," they're absolutely right. If you only want to do one thing
with your data don't mess with XML because it's overkill.
HTML, pointy brackets, and dumb stuff with style sheets
But if you
want to do two, three, or a hundred things, or even if you don't know
what you're going to want to do with it next year or the year after, it's
a data format that was designed to be long-living, platform independent,
and multi-use, which also makes it bulky and a little difficult to create
and very powerful. You've all seen HTML; you know what the pointy brackets
stuff looks like? You've got the stuff in the pointy brackets and stuff
between the pointy brackets. The stuff in the pointy brackets in
HTML is the tags and the stuff between the pointy brackets is the data
content. XML is in the same syntax, except with HTML you go to the bookstore,
you buy a little pamphlet on what the HTML tags are, and that's all they
are. There isn't an HTML tag for, for example, solution or question or
answer. Even if you want to make your questions look different from your
answers in HTML you have to do it by doing dumb stuff with the style sheet.
If questions and answers are really important to you, in XML you make
a question tag and an answer tag. You just make the tags. Or in the case
of a whole bunch of people who want to share information, you agree on
what the tags are, you agree on what's important, and you tag it with
this dumb syntax which has an even dumber syntax called a DTD, which says
what are the tags and what are the relationships among them, and that's
all there is to it. Make sense? I say DTDs are a dumb syntax, but I can teach
any intelligent human being to read them in 20 minutes. It's not hard,
it's just ugly.
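As a minimal sketch of the "just make the tags" idea, the Python below parses a tiny document with its own question and answer tags and a small DTD saying how they nest. The tag names (faq, qa, question, answer) and the sample text are invented for illustration and are not from the talk. Python's standard parser checks only that the markup is well-formed; it does not validate against the DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the tag names faq, qa, question, and answer are
# invented for illustration. The DOCTYPE holds a small DTD stating which
# tags exist and how they relate to each other.
DOCUMENT = """<!DOCTYPE faq [
  <!ELEMENT faq (qa+)>
  <!ELEMENT qa (question, answer)>
  <!ELEMENT question (#PCDATA)>
  <!ELEMENT answer (#PCDATA)>
]>
<faq>
  <qa>
    <question>What is XML?</question>
    <answer>A data format.</answer>
  </qa>
</faq>"""

# ElementTree checks well-formedness only; it does not validate the
# document against the DTD.
root = ET.fromstring(DOCUMENT)

# Because questions and answers are tagged as such, pulling them out
# needs no guessing based on how they look on the page.
pairs = [(qa.findtext("question"), qa.findtext("answer")) for qa in root.iter("qa")]
print(pairs)
```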
So that's what XML is, so why should you care? Because a whole lot of you are going
to be working with data that you expect to live a long time, to be used
by people who aren't the people who create it. You don't know what all
of the future uses are and you don't want to depend on some company's
data format. XML is the data format that exists, that works, and is created
by a vendor-independent group. I won't call them a legitimate standards
body because the W3C is weirder than that, but at least it isn't a vendor
that can go away or change the rules on you. It was designed for long
life and multi-use, so that's why you should care. Questions?
Questions from the floor
XSLT - what is it and how is it used
Question 1: What is XSLT?
Tommie Usdin: XML is this data format that I've talked about. It sort
of lies there on the ground. Accompanying XML, and adding power to it,
are a whole lot of associated specifications. XSLT is probably the most
widely used. It's a transformation language designed for XML to take XML
structured in one fashion and with one set of tags and turn it into, usually
but not necessarily, some other data format - not necessarily in the same
sequence, not necessarily with the same structure. You can write an XSLT
programme and run it in dozens of different pieces of software that are
XSLT engines. So you can take your XML and make it into something else.
You want to structure, you want to make authoring as easy as you can,
so make it easy to author and then transform it into something that is
easy to do whatever the next thing is you want to do with it and hopefully
transform it into something that's easy to archive. Good archival data
formats are not the same as convenient authoring formats and XSLT is a
transformation tool that takes this and makes it into that - it's a really
good way of making one into the other.
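The "takes this and makes it into that" step can be sketched in a few lines. Note that this is not the transformation language itself: Python's standard library has no engine for it, so the sketch does the same kind of job by hand with ElementTree. The input tags (faq, qa, question, answer) and the function name faq_to_html are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical authoring-friendly input; the tag names are invented.
SOURCE = (
    "<faq><qa><question>What is XML?</question>"
    "<answer>A data format.</answer></qa></faq>"
)

def faq_to_html(xml_text: str) -> str:
    """Turn the <faq> markup into an HTML definition list (hand-written
    stand-in for a stylesheet-driven transformation)."""
    faq = ET.fromstring(xml_text)
    dl = ET.Element("dl")
    for qa in faq.iter("qa"):
        # One set of tags in, a different set of tags out.
        ET.SubElement(dl, "dt").text = qa.findtext("question")
        ET.SubElement(dl, "dd").text = qa.findtext("answer")
    return ET.tostring(dl, encoding="unicode")

html = faq_to_html(SOURCE)
print(html)
```

The same source could be fed to a second function that produces an archival form instead, which is the multi-use point made above.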
I know half
a dozen commercial tools and maybe fifty free tools. One of the goals
of the XML language, explicitly stated, was to create a language that
any computer science graduate student could write a full parser in a weekend
and could write further tools for in the next weekend. This is a really
good thing. Its predecessor, SGML, which XML is a sub-set of, typically
takes four to five man years to write a full parser for, so a weekend
is a whole lot better than four to five years. Because it's so easy to
write tools for XML and the XML-associated specs like XSLT, there are
huge numbers of people writing tools for it. Usually when you start a
project you say, "I'm going to go and use other people's tools"
and most of the time for most of the things you can find tools that do
what you need. But you may discover that the tools that produce exactly
the format that you like don't produce error messages that you like. So
you write another tool and you're really proud of it and you post it on
the web and make it available. So there are huge numbers of free tools
in the XML space.
As for authoring
tools, there are actually relatively few free authoring tools in XML.
There are some pretty good UNIX ones and some okay PC ones and one not
very good free Mac tool. There are a lot more and a lot better commercial
tools. One of the reasons, in my opinion, that XML took off with a bang
is because there were a bunch of people with SGML tools who did real fast
global changes on their documentation and changed a couple of their menu
items and ignored a lot of features that are in SGML but not in XML. Actually
some of them raised their prices and some of them lowered their prices,
which was interesting to watch, and said, "Look at our brand new
XML authoring tools".
Validity with XML
Question 2: What about validity?
Tommie Usdin: If it doesn't validate it's not an XML tool. There are two
flavours of XML, XML called well-formed and XML called valid. From the
point of view of people receiving XML, well-formed is pretty good stuff.
It's a whole lot better than the junk you've been getting. From the point
of view of people creating XML, it is not responsible to create anything
that is not valid. Valid means you knew what the rules were and you followed
them so you weren't sending somebody a journal article where you forgot
to mention the name of the author. If the document type definition for
a journal article says there has to be at least one author, you check
it. If you send it out without an author it's not valid. From the point
of view of people looking at content, you want to be able to look at it
even if it's missing something important like the author. From the point
of view of people creating it, create it with all the pieces there for
goodness sake!
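The difference between the two flavours can be sketched as two separate checks. Python's standard library does not validate against a DTD, so the second function below is a hand-written stand-in for the single rule "an article has at least one author". The tag names and sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text: str) -> bool:
    """True if the markup parses at all (tags properly nested and closed)."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

def has_required_author(text: str) -> bool:
    """Stand-in for one DTD rule: an article needs at least one author.
    A real validator would check every rule in the DTD, not just this one."""
    article = ET.fromstring(text)
    return article.tag == "article" and article.find("author") is not None

WITH_AUTHOR = "<article><title>On Tags</title><author>A. Writer</author></article>"
NO_AUTHOR = "<article><title>On Tags</title></article>"
BROKEN = "<article><title>On Tags</article>"  # <title> is never closed

for name, doc in [("with author", WITH_AUTHOR), ("no author", NO_AUTHOR), ("broken", BROKEN)]:
    well_formed = is_well_formed(doc)
    follows_rule = well_formed and has_required_author(doc)
    print(name, "| well-formed:", well_formed, "| has author:", follows_rule)
```

The document without an author can still be read and displayed, which is what a receiver wants; it just should not have been sent out that way.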
Self validation
Question 3: Our experience of looking at SGML is that loads of it doesn't validate
against itself, so is that just an odd thing or would you say that that's
pretty widespread?
Tommie Usdin: Historically people have been calling anything that's salted and
peppered with pointy brackets SGML. It isn't. You especially see that
from commercial type-setters who will take whatever their standard type-setting
code is, make the pages the way they always made them, and then say "If
you want this other thing we will tack another expensive process onto
the tail end of our composition process - which we have known since 1880
- and give you whatever that funny acronym is that you said that you wanted."
And they'll add time, it'll add work, and they'll probably do it badly.
If you don't catch them real fast that's what they're going to do. Now,
there is no reason if they reengineer their process that they can't make
pretty pages. SGML or XML can both do it and they cost the same, but if
they can get away with not learning something new and charging you more
they're going to do that!
Is SGML obsolete?
Question 4: Is XML going to replace SGML and how fast?
Tommie Usdin: In the hype, absolutely positively, PC World says SGML is dead
and XML is everything. In real life XML is really just SGML minus a bunch
of features nobody used. Some people genuinely use the features in SGML.
People writing aircraft maintenance manuals, for example, use SGML because
they have incredibly complicated data. They use a whole bunch of the features
that got left out of XML. They're not going to budge from SGML and they
will only filter to XML for people who use XML display tools. But for
the topic du jour, journal publishing, I have never seen a journal article
that needed all of the features of SGML.
Format for archiving?
Question 5: You mentioned that XML is something one might use for the long
haul. Would that be the way to build digital archives or would SGML be
appropriate?
Tommie Usdin:
The question is what is the difference between SGML and XML for a digital
archive. It seems to me that archives are one of those places where you
want as simple an electronic format as possible, so on the whole I would
say XML probably has all the features you need for a digital archive.
You want to be really careful when you're archiving things like SGML or
XML. The SGML and the XML that's optimised for authoring, for example,
will leave out a lot of the words that you'll see on display because they'll
be generated on display. You want to be careful that those are in there
before you archive it, so there are issues with archiving XML but they're
not so much SGML or XML issues as getting all the data in there issues.
Other options to XML?
Question 6: What are the alternatives to XML?
Tommie Usdin: Well, there's PDF. It's easy to make, it's cheap, you can pass
around electronic pages, and you can't do a darn thing else with it. There's
HTML which is easy to make and it's compact and we've all met it before
and it's enormously popular and it supports seriously dumb searching and
lousy retrieval and not much else and incidentally you can't make decent
pages from it either. There are a bunch of proprietary data formats that
are being created within one set of tools and if you want to live within
that vendor's tools those work pretty nicely. In my opinion that is not
an alternative because vendors go away and change their data formats so
I don't like that option at all. There are things like TeX, which doesn't
support searching as well as XML but it's a well established, well understood
data format. People really heavily into math really like it because it
does great math and it's pretty stable, but it's not as strong on supporting
searching. Now, for all the things we've talked about in this room, nobody's
really talked about any of the real search support you can do with XML,
but it's my opinion you are going to get there. In XML you can for example
search for a word or phrase when it is in the recommendation section but
not when it is in the background section of a document. But first you
have to know what the recommendations are and what's the background and
have a search engine that knows how to do that. Now, those search engines
exist but, in the world we're talking today, the data doesn't.
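The section-aware search described here can be sketched as follows, assuming the data has already been tagged with hypothetical recommendation and background elements. The sample report and the function name found_in_section are invented for illustration; a real search engine would index the structure rather than scan it.

```python
import xml.etree.ElementTree as ET

# Hypothetical report whose sections are tagged by what they are.
REPORT = """<report>
  <background>Earlier budget cuts delayed the archive.</background>
  <recommendation>Restore the archive budget next year.</recommendation>
</report>"""

def found_in_section(xml_text: str, section: str, phrase: str) -> bool:
    """True if phrase occurs (case-insensitively) inside any <section> element."""
    root = ET.fromstring(xml_text)
    needle = phrase.lower()
    return any(
        needle in "".join(element.itertext()).lower()
        for element in root.iter(section)
    )

# "restore" counts only when it appears in a recommendation.
print(found_in_section(REPORT, "recommendation", "restore"))
print(found_in_section(REPORT, "background", "restore"))
```

None of this works unless somebody marked up which part is the recommendation and which is the background, which is the point about the data not existing yet.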
To read all about it
Question 7: Can you give us your website?
Tommie Usdin:
Yes, it's www.mulberrytech.com and this presentation is on it. I wrote
it in XML, I made it into HTML to display, and made it into PDF, assuming
we were going to do handouts. Good luck folks.