B Tommie Usdin
Mulberry Technologies, Inc


BMC Freedom of Information Conference 2000

What is XML and why should you care?

This presentation will be on my web site should anybody want to look at it. I think I'll start with asking a couple of questions, talk for a few minutes, and then just do questions on XML. So a few of you know a little bit about XML. And a lot of you have heard of XML but haven't seen it. Let me do a couple of minutes on what XML really is.

XML is a data format, that's all it is. It doesn't do anything, it doesn't make anything happen, it doesn't solve any problems. It's a data format that was designed to be multiply usable and platform independent. And what that means is that for any one of the things you can do with XML there's a faster, cheaper, better way to do it. Anytime anyone looks at XML and says, "That's really bulky and cumbersome, I know a better way to do that," they're absolutely right. If you only want to do one thing with your data don't mess with XML because it's overkill.

HTML, pointy brackets, and dumb stuff with style sheets

But if you want to do two, three, or a hundred things, or even if you don't know what you're going to want to do with it next year or the year after, it's a data format that was designed to be long-living, platform independent, and multi-use, which also makes it bulky and a little difficult to create and very powerful. You've all seen HTML; you know what the pointy brackets stuff looks like? You've got the stuff in the pointy brackets and stuff between the pointy brackets. The stuff in the pointy brackets in HTML is the tags and the stuff between the pointy brackets is the data content. XML is in the same syntax, except with HTML you go to the bookstore, you buy a little pamphlet on what the HTML tags are, and that's all they are. There isn't an HTML tag, for example, for solution or question or answer. Even if you want to make your questions look different from your answers in HTML you have to do it by doing dumb stuff with the style sheet. If questions and answers are really important to you, in XML you make a question tag and an answer tag. You just make the tags. Or in the case of a whole bunch of people who want to share information, you agree on what the tags are, you agree on what's important, and you tag it with this dumb syntax which has an even dumber syntax called a DTD, which says what the tags are and what the relationships among them are, and that's all there is to it. Make sense? I say DTDs are a dumb syntax, but I can teach any intelligent human being to read them in 20 minutes. It's not hard, it's just ugly.
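As a rough illustration of making your own tags, here is a minimal sketch in Python using the standard library's `xml.etree.ElementTree`. The tag names (`faq`, `entry`, `question`, `answer`), the sample text, and the small DTD are all invented for this example and do not come from the talk. The DTD is shown only so you can see what one looks like; the standard-library parser does not check documents against it.

```python
import xml.etree.ElementTree as ET

# Hypothetical DTD for the invented tag set below. It says which tags exist
# and how they nest. It is illustrative only: ElementTree does not enforce it.
FAQ_DTD = """
<!ELEMENT faq (entry+)>
<!ELEMENT entry (question, answer)>
<!ELEMENT question (#PCDATA)>
<!ELEMENT answer (#PCDATA)>
"""

# Hypothetical document that uses the invented tags.
FAQ_XML = """
<faq>
  <entry>
    <question>What is XML?</question>
    <answer>A data format.</answer>
  </entry>
  <entry>
    <question>Does XML do anything by itself?</question>
    <answer>No.</answer>
  </entry>
</faq>
""".strip()


def questions_and_answers(xml_text):
    """Return a list of (question, answer) text pairs found in the markup."""
    root = ET.fromstring(xml_text)
    pairs = []
    for entry in root.findall("entry"):
        pairs.append((entry.findtext("question"), entry.findtext("answer")))
    return pairs


if __name__ == "__main__":
    for question, answer in questions_and_answers(FAQ_XML):
        print("Q:", question)
        print("A:", answer)
```

Because the questions and answers are tagged as what they are, a program can pull them out by name instead of guessing from how they were styled.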

So that's what XML is, so why should you care? Because a whole lot of you are going to be working with data that you expect to live a long time, to be used by people who aren't the people who create it. You don't know what all of the future uses are and you don't want to depend on some company's data format. XML is the data format that exists, that works, and is created by a vendor-independent group. I won't call them a legitimate standards body because the W3C is weirder than that, but at least it isn't a vendor that can go away or change the rules on you. It was designed for long life and multi-use, so that's why you should care. Questions?

Questions from the floor

XSLT - what it is and how it is used

Question 1: What is XSLT?

Tommie Usdin: XML is this data format that I've talked about. It sort of lies there on the ground. Accompanying XML, and adding power to it, are a whole lot of associated specifications. XSLT is probably the most widely used. It's a transformation language designed for XML to take XML structured in one fashion and with one set of tags and turn it into, usually but not necessarily, some other data format - not necessarily in the same sequence, not necessarily with the same structure. You can write an XSLT programme and run it in dozens of different pieces of software that are XSLT engines. So you can take your XML and make it into something else. You want to make authoring as easy as you can, so make it easy to author and then transform it into something that is easy to do whatever the next thing is you want to do with it, and hopefully transform it into something that's easy to archive. Good archival data formats are not the same as convenient authoring formats and XSLT is a transformation tool that takes this and makes it into that - it's a really good way of making one into the other.
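To make the idea of "take this and make it into that" concrete, here is a small sketch. It is not XSLT: it is a hand-written transformation in Python using the standard library, standing in for what an XSLT stylesheet and engine would do. The input tags (`faq`, `entry`, `question`, `answer`) and the choice of an HTML definition list as the output are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical authoring-friendly input with an invented tag set.
AUTHORING_XML = (
    "<faq>"
    "<entry><question>What is XSLT?</question>"
    "<answer>A transformation language.</answer></entry>"
    "</faq>"
)


def faq_to_html(xml_text):
    """Rebuild the FAQ as an HTML definition list, i.e. a different tag set.

    This mimics, by hand, the kind of restructuring an XSLT stylesheet
    would describe declaratively.
    """
    source = ET.fromstring(xml_text)
    dl = ET.Element("dl")
    for entry in source.findall("entry"):
        dt = ET.SubElement(dl, "dt")
        dt.text = entry.findtext("question")
        dd = ET.SubElement(dl, "dd")
        dd.text = entry.findtext("answer")
    return ET.tostring(dl, encoding="unicode")


if __name__ == "__main__":
    print(faq_to_html(AUTHORING_XML))
```

The same source could be run through a second function that produces an archival form, which is the point being made: one input, several outputs.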

I know half a dozen commercial tools and maybe fifty free tools. One of the goals of the XML language, explicitly stated, was to create a language that any computer science graduate student could write a full parser for in a weekend and could write further tools for in the next weekend. This is a really good thing. Its predecessor, SGML, which XML is a subset of, typically takes four to five man years to write a full parser for, so a weekend is a whole lot better than four to five years. Because it's so easy to write tools for XML and the XML-associated specs like XSLT, there are huge numbers of people writing tools for it. Usually when you start a project you say, "I'm going to go and use other people's tools" and most of the time for most of the things you can find tools that do what you need. But you may discover that the tools that produce exactly the format that you like don't produce error messages that you like. So you write another tool and you're really proud of it and you post it on the web and make it available. So there are huge numbers of free tools in the XML space.

As for authoring tools, there are actually relatively few free authoring tools in XML. There are some pretty good UNIX ones and some okay PC ones and one not very good free Mac tool. There are a lot more and a lot better commercial tools. One of the reasons, in my opinion, that XML took off with a bang is because there were a bunch of people with SGML tools who did real fast global changes on their documentation and changed a couple of their menu items and ignored a lot of features that are in SGML but not in XML. Actually some of them raised their prices and some of them lowered their prices, which was interesting to watch, and said, "Look at our brand new XML authoring tools".

Validity with XML

Question 2: What about validity?

Tommie Usdin: If it doesn't validate it's not an XML tool. There are two flavours of XML, XML called well-formed and XML called valid. From the point of view of people receiving XML, well-formed is pretty good stuff. It's a whole lot better than the junk you've been getting. From the point of view of people creating XML, it is not responsible to create anything that is not valid. Valid means you knew what the rules were and you followed them, so you weren't sending somebody a journal article where you forgot to mention the name of the author. If the document type definition for a journal article says there has to be at least one author, you check it. If you send it out without an author it's not valid. From the point of view of people looking at content, you want to be able to look at it even if it's missing something important like the author. From the point of view of people creating it, create it with all the pieces there for goodness sake!
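The difference between the two flavours can be sketched roughly as follows. Python's standard-library parser checks well-formedness only, so the "at least one author" rule below is a hand-written check standing in for real DTD validation, not a validator. The tag names (`article`, `title`, `author`) and the sample documents are assumptions made for this example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """True if the text parses as XML at all (tags balanced, one root, etc.)."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


def has_required_author(xml_text):
    """Stand-in for one DTD rule: an article must contain at least one author.

    A real validating parser would read this rule from the DTD; here it is
    hard-coded for illustration.
    """
    root = ET.fromstring(xml_text)
    return root.tag == "article" and len(root.findall("author")) >= 1


# Hypothetical sample documents.
NOT_WELL_FORMED = "<article><title>XML</title>"  # closing tag is missing
WELL_FORMED_ONLY = "<article><title>XML</title></article>"  # no author
FOLLOWS_RULE = (
    "<article><title>XML</title><author>B Tommie Usdin</author></article>"
)

if __name__ == "__main__":
    for name, doc in [
        ("not well-formed", NOT_WELL_FORMED),
        ("well-formed only", WELL_FORMED_ONLY),
        ("follows the author rule", FOLLOWS_RULE),
    ]:
        ok = is_well_formed(doc)
        print(name, "| well-formed:", ok,
              "| has author:", ok and has_required_author(doc))
```

A receiver can still read and display the second document, which is why well-formed is useful to them; a creator who sends it out has skipped the check.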

Self validation

Question 3: Our experience of looking at SGML is that loads of it doesn't validate against itself, so is that just an odd thing or would you say that that's pretty widespread?

Tommie Usdin: Historically people have been calling anything that is salt-and-peppered with pointy brackets SGML. It isn't. You especially see that from commercial typesetters who will take whatever their standard typesetting code is, make the pages the way they always made them, and then say "If you want this other thing we will tack another expensive process onto the tail end of our composition process - which we have known since 1880 - and give you whatever that funny acronym is that you said that you wanted." And it'll add time, it'll add work, and they'll probably do it badly. If you don't catch them real fast that's what they're going to do. Now, there is no reason, if they reengineer their process, that they can't make pretty pages. SGML or XML can both do it and they cost the same, but if they can get away with not learning something new and charging you more they're going to do that!

Is SGML obsolete?

Question 4: Is XML going to replace SGML and how fast?

Tommie Usdin: In the hype, absolutely positively, PC World says SGML is dead and XML is everything. In real life XML is really just SGML minus a bunch of features nobody used. Some people genuinely use the features in SGML. People writing aircraft maintenance manuals, for example, use SGML because they have incredibly complicated data. They use a whole bunch of the features that got left out of XML. They're not going to budge from SGML and they will only filter to XML for people who use XML display tools. But for the topic du jour, journal publishing, I have never seen a journal article that needed all of the features of SGML.

Format for archiving?

Question 5: You mentioned that XML is something one might use for the long haul. Would that be the way to build digital archives or would SGML be appropriate?

Tommie Usdin: The question is what is the difference between SGML and XML for a digital archive. It seems to me that archives are one of those places where you want as simple as possible electronic format, so on the whole I would say XML probably has all the features you need for a digital archive. You want to be really careful when you're archiving things like SGML or XML. The SGML and the XML that's optimised for authoring, for example, will leave out a lot of the words that you'll see on display because they'll be generated on display. You want to be careful that those are in there before you archive it, so there are issues with archiving XML but they're not so much SGML or XML issues as getting all the data in there issues.

Other options to XML?

Question 6: What are the alternatives to XML?

Tommie Usdin: Well, there's PDF. It's easy to make, it's cheap, you can pass around electronic pages, and you can't do a darn thing else with it. There's HTML, which is easy to make and it's compact and we've all met it before and it's enormously popular and it supports seriously dumb searching and lousy retrieval and not much else, and incidentally you can't make decent pages from it either. There are a bunch of proprietary data formats that are being created within one set of tools, and if you want to live within that vendor's tools those work pretty nicely. In my opinion that is not an alternative because vendors go away and change their data formats, so I don't like that option at all. There are things like TeX, which doesn't support searching as well as XML but it's a well established, well understood data format. People really heavily into math really like it because it does great math and it's pretty stable, but it's not as strong on supporting searching. Now, for all the things we've talked about in this room, nobody's really talked about any of the real search support you can do with XML, but it's my opinion you are going to get there. In XML you can, for example, search for a word or phrase when it is in the recommendation section but not when it is in the background section of a document. But first you have to know what the recommendations are and what's the background, and have a search engine that knows how to do that. Now, those search engines exist but, in the world we're talking about today, the data doesn't.
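The section-aware search described above can be sketched in a few lines, assuming the data has already been tagged. The tag names (`report`, `background`, `recommendation`), the sample text, and the helper `find_in_section` are all hypothetical; a real search engine would do this at scale with an index rather than by scanning one document.

```python
import xml.etree.ElementTree as ET

# Hypothetical report in which the sections are tagged by what they are.
REPORT_XML = (
    "<report>"
    "<background>Earlier studies relied on proprietary formats.</background>"
    "<recommendation>Archive the articles in XML, not in proprietary formats."
    "</recommendation>"
    "<recommendation>Validate every article before release.</recommendation>"
    "</report>"
)


def find_in_section(xml_text, section_tag, phrase):
    """Return the text of every section_tag element that contains phrase.

    The match is case-insensitive and only looks inside elements with the
    requested tag, so the same word elsewhere in the document is ignored.
    """
    root = ET.fromstring(xml_text)
    hits = []
    for section in root.iter(section_tag):
        text = "".join(section.itertext())
        if phrase.lower() in text.lower():
            hits.append(text)
    return hits


if __name__ == "__main__":
    # "proprietary" appears in both kinds of section, but only the
    # recommendation is returned here.
    for hit in find_in_section(REPORT_XML, "recommendation", "proprietary"):
        print(hit)
```

None of this works on untagged text, which is the closing point of the answer: the tools are there, the tagged data is what is missing.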

To read all about it

Question 7: Can you give us your website?

Tommie Usdin: Yes, it's www.mulberrytech.com and this presentation is on it. I wrote it in XML, I made it into HTML to display, and made it into PDF, assuming we were going to do handouts. Good luck folks.

© 1999-2008 BioMed Central Ltd unless otherwise stated