Erik Naggum on XML


From: Erik Naggum
Newsgroups: comp.lang.lisp
Subject: Re: S-exp vs XML, HTML, LaTeX (was: Why lisp is growing)
Date: 28 Dec 2002 03:08:55 +0000
Organization: Naggum Software, Oslo, Norway

* (thelifter)
| I don't understand your criticism of XML.

I sometimes regret that human memory is such a great tool for one's personal life that coming to rely on the wider context it provides in one's communication with others is so fragile. I have explained this dozens of times, but I guess each repetition adds something.

| Basically XML is just another way of writing S-expr or Trees or whatever you want to call it.

They are not identical. The aspects you are willing to ignore are more important than the aspects you are willing to accept. Robbery is not just another way of making a living, rape is not just another way of satisfying basic human needs, torture is not just another way of interrogation. And XML is not just another way of writing S-exps. There are some things in life that you do not do if you want to be a moral being and feel proud of what you have accomplished.

SGML was a major improvement on the markup languages that preceded it (including GML), which helped create better publishing systems and helped people think about information in much improved ways, but when the zealots forgot the publishing heritage and took the notion that information can be separated from presentation out of the world of publishing into general data representation because SGML had had some success in "database publishing", something went awry, not only superficially, but fundamentally. It is not unlike when a baby, whose mother satisfies its every need before it is even aware that it has been expressed, grows up to believe that the world in general is both influenced by and obliged to satisfy its whims. Even though nobody in their right mind would argue that babies should fend for themselves and earn their own living, at some point in the child's life, it must begin a progression towards independence, which is not merely a quantitative difference from having every need satisfied by crying, but a qualitative difference of enormous consequence. Many an idea or concept not only looks, but /is/ good in its infancy, yet turns destructive later in life. Scaling and maturation are not the obvious processes they appear to be because they take so much time that the accumulated effort is easy to overlook. To be successful, they must also be very carefully guided by people who can envision the end result, but that makes it appear to many as if it merely "happens". Take a good idea out of its infancy, let it age without guidance so it does not mature, and it generally goes bad. If GML was an infant, SGML is the bright youngster who far exceeded expectations and made its parents too proud, but XML is the drug-addicted gang member who had committed his first murder before he had sex, which was rape.

SGML is a good idea when the markup overhead is less than 2%. Even attributes is a good idea when the textual element contents is the "real meat" of the document and attributes only aid processing, so that the printed version of a fully marked-up document has the same characters as the document sans tags. Explicit end-tags is a good idea when the distance between start- and end-tag is more than the 20-line terminal the document is typed on. Minimization is a good idea in an already sparsely tagged document, both because tags are hard to keep track of and because clusters of tags are so intrusive. Character entities is a good idea when your entire character set is EBCDIC or ASCII. Validating the input prior to processing is a good idea when processing would take minutes, if not hours, and consume costly resources, only to abend. SGML had an important potential in its ability to let the information survive changes in processing equipment or software where its predecessors clearly failed. But, to continue the baby metaphor, you have to go into fetishism to keep using diapers as you age but fail to mature. (I note in passing that the stereotypical American male longs for much larger than natural female breasts, presumably to maintain the proportion to his own size from his infancy, which has caused the stereotypical American female to feel a need for breasts that will give the next generation a demand for even more disproportionally large breasts.) When the markup overhead exceeds 200%, when attribute values and element contents compete for the information, when the distance between 99% of the "tags" is /zero/, when the character set is Unicode, and when validation takes more time than processing, not to mention the sorry fact that information longevity is more /threatened/ by XML than by any other data representation in the history of computing, then SGML has gone from good kid, via bad teenager, to malfunctioning, evil adult as XML.

SGML was in many ways smarter than necessary at the time it was a bright idea; it was evidence of too much intelligence applied to the problems it solved. A problem mankind has not often had to deal with is that of excessive intelligence; more often than not, technological solutions are barely intelligent enough to solve the problem at hand. If a solution is much smarter than the problem and really stupid people notice it, they believe they have got their hands on something /great/, and so they destroy it, not unlike how giving stupid people too much power can threaten world peace and unravel legal concepts like due process and presumption of innocence.
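
The markup-overhead thresholds mentioned above (under 2% for sparsely tagged text, over 200% for XML used as a data format) can be made concrete with a minimal Python sketch. It is not from the original post. It assumes that overhead means tag characters divided by non-whitespace content characters and that everything between "<" and ">" counts as markup, attributes included; the function name and both sample documents are made up for illustration.

```python
import re

# Assumption: anything between "<" and ">" counts as markup, attributes included.
TAG = re.compile(r"<[^>]*>")

def markup_overhead(document: str) -> float:
    """Return tag characters divided by non-whitespace content characters."""
    markup_chars = sum(len(tag) for tag in TAG.findall(document))
    content_chars = len("".join(TAG.sub("", document).split()))
    if content_chars == 0:
        return float("inf")  # nothing but markup
    return markup_chars / content_chars

# Hypothetical samples: a long paragraph with one pair of tags,
# and a small data record where every value has its own element.
prose = "<p>" + "word " * 200 + "</p>"
record = "<point><x>1</x><y>2</y></point>"

print(f"prose:  {markup_overhead(prose):.1%}")
print(f"record: {markup_overhead(record):.1%}")
```

The prose sample stays well under the 2% mark, while the data record lands far above 200%, which is the contrast the paragraph draws.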

I once believed that it would be very beneficial for our long-term information needs to adorn the text with as much meta-information as possible. I still believe that the world would be far better off if it had evolved standardized syntactic notations for time, location, proper names, language, etc, and that even prose text would be written in such a way that precision in these matters would not be sacrificed, but most people are so obsessively concerned with their immediate personal needs that anything that could be beneficial on a much larger scale has no chance of surviving. Look at the United States of America, with its depressingly moronic units instead of going metric, with its inability to write dates in either ascending or descending order of unit size, and with its insistence upon the 12-hour clock, clearly evidencing the importance of the short-term pain threshold and resistance to doing anyone else's bidding. And now the one-time freest nation of the world has turned dictatorship with a dangerous moron in charge, set to attack Iraq to revenge his father's loss. Those who laughed when I said that stupidity is the worst threat to mankind laugh no more; they wait with bated breath to see if the world's most powerful incoherent moron will launch the world into a world war simply because he is too fucking stupid. But what really pisses me off is the spineless American people who fail to stop this madness. Presidents have been shot and killed before. I seem to be digressing -- the focal point is that the masses, those who exert no effort to better themselves, cannot be expected to help solve any problems larger than their own, and so they must be forced by various means, such as compulsory education, spelling checkers, newspaper editors who do /not/ publish their letters to the editor, and not least by the courts that restrain the will to revenge, in order to keep a modicum of sanity in the frail structure that is human society.

We are clearly not at the stage of human development where writers are willing to accept the burden of communicating to the machine what they are thinking. One has to marvel at the wide acceptance of our existing punctuation marks and the sociology of their acceptance. "Tagging" text for semantic constructs that the human mind is able to discern from context must be millennia off.

In many ways, the current American presidency and XML have much in common. Both have clear lineages back to very intelligent people. Both demonstrate what happens when you give retards the tools of the intelligent. Some Americans obsess over gun control, to limit the number of handguns in the hands of their civilians, but support the most out-of-control nutcase in the young history of the nation and rally behind his world-threatening abuse of guns. The once noble concern over validation to curb excessive costs of too powerful a tool for the people who used it, has turned into an equally insane abuse of power in the XML world. How could such staggering idiots as have become "leaders" of the XML world and the free world come to their power? Clearly, they gain support from the masses who have no concerns but their immediate needs, no ability to look for long-term solutions and stability, no desire to think further ahead than that each individual decision they make be the best for them. Lethargy and pessimism, lack of long-term goals, apathy towards consequences, they are all symptoms of depressed people, and it is perhaps no coincidence that the world economy is now in a depression. My take on it is that it is because too much growth also rewarded people of such minuscule intellectual prowess that they turned to fraud rather than tackle the coming negative trends intelligently. Whether Enron or W3C or the GOP, everyone knows that fraud does pay in the short term and that bad money drives out good. When even the staggering morons are rewarded, the honest and intelligent must lose, and even the best character will have a problem when being honest means that he forfeits a chance to receive a hundred million dollars. In both the Bush administration and the W3C standards administration, we see evidence that large groups of people did not believe that it would matter who assumed power.

I am quite certain that just as Bush is supposed to be a thoroughly /likable/ person, the people who work up the most demented "standards" in the W3C lack that personality trait that is both abrasive and exhibits leadership potential. When the overall growth of something is so rapid that an idiotic decision no longer causes any immediate losses, the number of such decisions will grow without bounds until the losses materialize, such as in an economic depression. When the losses are so diffused as to not even affect the idiots behind the decisions, they can stay in power for a very long time until they are blamed for a large number of ills they had no power to predict, but that is precisely what caused them.

| I use XML on a daily basis and think it is a simple and intelligent way to represent data.

A comment on this statement is by now entirely superfluous.

| I would like to hear why you think it is so bad, can you be more specific please?

If you really need more information, search the Net, please.

| And how would you improve on it?

A brief summary, then: Remove the syntactic mess that is attributes. (You will then find that you do not need them at all.) Enclose the /element/ in matching delimiters, not the tag. These simple things make people think differently about how they use the language. Contrary to the foolish notion that syntax is immaterial, people optimize the way they express themselves, and so express themselves differently with different syntaxes. Next, introduce macros that look exactly like elements, but that are expanded in place between the reader and the "object model". Then, remove the obnoxious character entities and escape special characters with a single character, like \, and name other entities with letters following the same character. If you need a rich set of publishing symbols, discover Unicode. Finally, introduce a language for micro-parsers that can take more convenient syntaxes for commonly used elements with complex structure and make them /return/ element structures more suitable for processing on the receiving end, and which would also make validation something useful. The overly simple regular expression look-alike was a good idea when processing was expensive and made all decisions at the start-tag, but with a DOM and less stream-like processing, a much better language should be specified that could also do serious computation before validating a document -- so that once again processing could become cheaper because of the "markup", not more expensive because of it.
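
The first few points of this summary (no attributes, delimiters around the whole element, one escape character instead of character entities) can be illustrated with a small Python sketch. It is not from the original post and does not claim to be the notation the author had in mind: the parenthesized output format, the choice to turn each attribute into a child element, and the names `to_tree`, `escape`, and `write` are all assumptions made for illustration. Macros and micro-parsers are not covered.

```python
import xml.etree.ElementTree as ET

def to_tree(elem):
    """Turn an XML element into a nested list: [name, child, child, ...]."""
    node = [elem.tag]
    # Assumption: attributes become ordinary child elements, so there is
    # only one mechanism for attaching information to an element.
    for key, value in elem.attrib.items():
        node.append([key, value])
    if elem.text and elem.text.strip():
        node.append(elem.text.strip())
    for child in elem:
        node.append(to_tree(child))
        if child.tail and child.tail.strip():
            node.append(child.tail.strip())
    return node

def escape(text):
    """Escape the backslash and the double quote with a single backslash."""
    return text.replace("\\", "\\\\").replace('"', '\\"')

def write(node):
    """Write a nested list with one pair of delimiters around each element."""
    if isinstance(node, str):
        return '"' + escape(node) + '"'
    name, *children = node
    return "(" + " ".join([name] + [write(child) for child in children]) + ")"

# Hypothetical input document.
source = '<book id="42"><title>Lisp &amp; "friends"</title></book>'
tree = to_tree(ET.fromstring(source))
print(write(tree))
```

In the output the element name appears once, the former attribute `id` is just another child, and the ampersand needs no entity because only the backslash and the quote are special.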

But the one thing I would change the most from a markup language suitable for marking up the incidental instruction to a type-setter to the data representation language suitable for the "market" that XML wants, is to go for a binary representation. The reasons for /not/ going binary when SGML competed with ODA have been reversed: When information should survive changes in the software, it was an important decision to make the data format verbose enough that it was easy to implement a processor for it and that processors could liberally accept what other processors conservatively produced, but now that the data formats that employ XML are so easily changed that the software can no longer keep up with it, we need to slam on the brakes and tell the redefiners to curb their enthusiasm, get it right before they share their experiments with the world, and show some respect for their users. One way to do that is to increase the cost of changes to implementations without sacrificing readability and without making the data format more "brittle", by going binary. Our information infrastructure has become so much better that the nature of optimization for survivability has changed qualitatively. The question of what we humans need to read and write no longer has any bearing on what the computers need to work with. One of the most heinous crimes against computing machinery is therefore to force them to parse XML when all they want is the binary data. As an example, think of the Internet Protocol and Transmission Control Protocol in XML terms. Implementors of SNMP regularly complained that parsing the ASN.1 encodings took a disproportionate amount of processing time, but they also acknowledged that properly done, it mapped directly to the values they needed to exchange. Now, think of what would have happened had it not been a Simple, but instead some moronic excuse for an eXtensible Network Management Protocol.
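
To get a feel for "TCP in XML terms", here is a minimal Python sketch, not from the original post. It encodes only the first three fields of a TCP header (source port, destination port, sequence number) once as fixed-width binary and once as XML. The XML element names are invented for illustration, and the example says nothing about ASN.1 or about real parsing costs; it only shows the size difference and that the binary form maps directly to the values.

```python
import struct
import xml.etree.ElementTree as ET

# Network byte order: two 16-bit ports followed by a 32-bit sequence number.
FIELDS = struct.Struct("!HHI")

def encode_binary(src, dst, seq):
    """Pack the three values into 8 bytes."""
    return FIELDS.pack(src, dst, seq)

def decode_binary(data):
    """Unpack 8 bytes straight into a tuple of integers."""
    return FIELDS.unpack(data)

def encode_xml(src, dst, seq):
    """Write the same values as XML; the element names are hypothetical."""
    root = ET.Element("segment")
    for name, value in (("source-port", src),
                        ("destination-port", dst),
                        ("sequence-number", seq)):
        ET.SubElement(root, name).text = str(value)
    return ET.tostring(root)

def decode_xml(data):
    """Parse the XML and convert each child's text back to an integer."""
    return tuple(int(child.text) for child in ET.fromstring(data))

values = (80, 8080, 12345)
binary = encode_binary(*values)
xml = encode_xml(*values)

print(len(binary), "bytes as binary")
print(len(xml), "bytes as XML")
```

Both forms round-trip to the same tuple, but the XML version has to build a tree and convert strings to numbers to get there.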

Another thing is that we have long had amazingly rich standards for such "display attributes" as many now use HTML and the like for. The choice to use SGML for web publication was not entirely braindead, but it should have been obvious from the outset that page display would become important, if not immediately, then after watching what people were trying to do with HTML. The Web provided me with a much needed realization that information cannot be /fully/ separated from its presentation, and showed me something I knew without verbalizing explicitly, that the presentation form we choose communicates real information. Encoding all of it via markup would require a very fine level of detail, not to mention /awareness/ of issues so widely dispersed in the population that only a handful of people per million grasp them. Therefore, to be successful, there must be an upper limit to the complexity of the language defined with SGML, and one must go on to solve the next problem, not sit idle with a set of great tools and think "I ought to use these tools for something". Stultifying as the language of content models may be, it amazes me that people do not grasp that they need to use something else when it becomes too painful to express with SGML, but I am in the highly privileged position of knowing a lot more than SGML when I pronounce my judgment on XML. For one thing, I knew Lisp before I saw SGML, so I know what brilliant minds can do under optimal conditions and when they ensure that the problem is still bigger than the solution.

Erik Naggum, Oslo, Norway

Act from reason, and failure makes you rethink and study harder.
Act from faith, and failure makes you blame someone and push harder.