A survey of XML standards: Part 1


The core standards -- a foundation for the wide world of XML


Level: Introductory

Uche Ogbuji (uche@ogbuji.net), Principal Consultant, Fourthought, Inc.

20 Jan 2004

The world of XML is vast and growing, with a huge variety of standards and technologies that interact in complex ways. It can be difficult for beginners to navigate the most important aspects of XML, and for users to keep track of new entries and changes in the space. In this series of articles, Uche Ogbuji provides a guide to XML standards, including a wide range of recommended resources for further information.


XML started strong and has grown quite rapidly. It has proven itself a very valuable technology, but it can be an intimidating one, when one considers all the moving parts that fall under the term "XML." In this series of articles, I provide a summary of what I see as the most important XML technologies, and discuss how they each fit into the greater scope of things in the XML world. I also recommend tutorials and other useful resources for evaluating and learning to use each technology.

All the technologies I present here are standards, although that word is itself a bit slippery. Standards come in all forms, and multiple standards often compete in the same space. I follow the practical approach of defining a standard as any specification that is significantly adopted by a diversity of vendors, or is recommended by a respectable, vendor-neutral organization.

In this first article, I focus on what I consider the core XML technologies. These are the technologies that form the basis of what is expressed in an XML document. In subsequent articles I will cover standards relating to XML processing by developers, and a selection of the most important XML applications (that is, vocabularies).

XML

XML 1.0 (Second Edition) [W3C Recommendation] is, of course, the trunk of the sprawling XML technology tree. It builds on Unicode [Unicode Consortium technical report and ISO standard] to define strict rules for text format as well as the Document Type Definition (DTD) validation language. The current (second) edition incorporates accumulated corrections to the original text. The specification has been widely translated, although the English version is the only normative one, meaning the only one that is intended to carry the force of standardization.

XML 1.1 [W3C Recommendation] is the first revision that changes the definition of a well-formed XML document. The primary change is to revise the treatment of characters in the XML specification to make it adapt more naturally to changes in the Unicode specification, and to provide for the normalization of characters across Unicode versions by referencing the Character Model for the World Wide Web 1.0 [in development]. XML 1.1 also adds to the list of line-end characters, adding NEL, a character used for end of line (EOL) in IBM mainframe systems. This change is controversial because some feel that the modest benefit to mainframe users is not worth such a fundamental change. There is additional controversy because some observers find all the changes too modest to introduce all the likely interoperability problems of an XML version change.

XML is based on Standard Generalized Markup Language (SGML), defined in ISO 8879:1986 [ISO Standard]. It represents a significant simplification of SGML, and includes adjustments that make it better suited to the Web environment.
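To make the "strict rules for text format" concrete, here is a minimal sketch that asks a parser whether a piece of text is well-formed. It assumes Python's standard xml.etree.ElementTree module, which parses XML 1.0; the function name and the sample documents are illustrative only.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the text parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        # Any violation of the well-formedness rules is a fatal error.
        return False


# Properly nested tags: well-formed.
GOOD = "<memo><to>Reader</to><body>Tags nest properly.</body></memo>"
# Overlapping tags: not well-formed, so the parser must reject it.
BAD = "<memo><to>Reader</body></to></memo>"

print(is_well_formed(GOOD))
print(is_well_formed(BAD))
```

Unlike a typical HTML browser, a conforming XML parser does not try to recover from such errors, which is why the check reduces to "did parsing succeed."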

Recommended introductions and tutorials

  • Start with Doug Tidwell's "Introduction to XML" (developerWorks, August 2002).
  • ZVON's XML tutorial and DTD tutorial are available in multiple languages.
  • Excerpts from Ken Sall's book XML Family of Specifications: A Practical Guide provide a simple introduction.
  • W3Schools, which has no affiliation with the W3C, offers a comprehensive XML tutorial.
  • Mike Brown's "skew.org XML Tutorial" is a reintroduction to XML with emphasis on encoding. It highlights topics that are too often glossed over in other treatments.

References and other resources

  • In "The Annotated XML Specification", Tim Bray provides very useful, in-line commentary and clarifications on the text of XML 1.0.
  • "The XML FAQ" is edited by Peter Flynn.
  • Markus Kuhn's "UTF-8 and Unicode FAQ for Unix/Linux" is actually an excellent reference for users on all platforms. UTF-8 is a very common encoding of Unicode.
  • "Unicode in XML and other Markup Languages" is a formal technical report for people (probably implementers) who need very rigorous discussion of the intersection of Unicode and XML.
  • IBM's "Introduction to Unicode" site covers Unicode basics in depth.
  • The open internationalization resources directory is an excellent reference site for all aspects of managing internationalized data, which is the core goal of XML in building on Unicode.



Catalogs

XML Catalogs [OASIS Committee Specification] defines a format for instructions on how an XML processor resolves XML entity identifiers into actual documents. For example, an entity catalog can be used to specify the location from which an XML processor loads a DTD, given the system and public identifiers for that DTD. System identifiers are usually given by Uniform Resource Identifiers (URIs), which are governed by RFC 2396: Uniform Resource Identifiers [IETF RFC]. URIs are a generalization of the URLs familiar from Web browsers and the like. All URLs are also URIs, but URIs also include URNs, governed by RFC 2141: Uniform Resource Names [IETF RFC], which are a way to identify Web resources by name rather than location (see also "The URN Charter"). Public identifiers are usually specified as Formal Public Identifiers (FPIs), defined in SGML. Catalogs might be used in situations where the machine in use does not have network access to resources specified by a URL, or where an organization wants to substitute a local version of an external resource.

An XML catalog is itself an XML document, but an older format for SGML as well as XML defines a catalog format in simpler text: Entity Management, OASIS Technical Resolution 9401:1997 [OASIS Standard]. This format is often called OASIS Open Catalog.
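As a rough illustration of what a catalog does, the sketch below reads a small XML catalog and maps a public or system identifier to a local URI. The catalog namespace and the public and system entries come from the OASIS format; the identifiers, local paths, and the resolve function are hypothetical, and the sketch ignores the rest of the format (the prefer attribute, delegation, rewriting, and so on).

```python
import xml.etree.ElementTree as ET

CATALOG_NS = "urn:oasis:names:tc:entity:xmlns:xml:catalog"

# Hypothetical catalog: the identifiers and local paths are made up.
CATALOG = """\
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <public publicId="-//Example//DTD Memo 1.0//EN"
          uri="file:///usr/local/share/dtd/memo.dtd"/>
  <system systemId="http://example.org/dtd/memo.dtd"
          uri="file:///usr/local/share/dtd/memo.dtd"/>
</catalog>
"""


def resolve(catalog_xml, public_id=None, system_id=None):
    """Return the local URI mapped to an identifier, or None if unmapped."""
    root = ET.fromstring(catalog_xml)
    # Try an exact system identifier match first.
    if system_id is not None:
        for entry in root.iter(f"{{{CATALOG_NS}}}system"):
            if entry.get("systemId") == system_id:
                return entry.get("uri")
    # Then fall back to the public identifier.
    if public_id is not None:
        for entry in root.iter(f"{{{CATALOG_NS}}}public"):
            if entry.get("publicId") == public_id:
                return entry.get("uri")
    return None


print(resolve(CATALOG, system_id="http://example.org/dtd/memo.dtd"))
```

A processor with no network access could use a lookup like this to load the DTD from local disk instead of fetching it from example.org.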

Recommended introductions and tutorials

Catalog processing is often provided as an integral part of the XML parser, but some introductory resources focus on entity resolution using catalogs:

  • Norman Walsh covers both sorts of catalogs in his article "XML Entity and URI Resolvers."
  • Bob Stayton covers only XML Catalogs in "Chapter 4. XML catalogs" of his electronic book DocBook XSL: The Complete Guide.



XML Namespaces

The many flavors of standards

Several organizations and informal groups of people have been involved in the process of making standards for XML users. I provide links to most in Resources, but here I explain some of the terms you'll find used to qualify standards in this article.

The W3C formally issues Recommendations, which are technically just suggestions for further standardization, but tend to become de facto standards in their own right. Specifications gain this status after a Working Draft becomes a Candidate Recommendation (a final form presented for developers to test through implementation) and then a Proposed Recommendation (ready for recommendation pending W3C vote).

International Organization for Standardization (ISO) is probably the most authoritative standards body in the world. Many of its standards carry some force of law in relevant industries.

Organization for the Advancement of Structured Information Standards (OASIS) has evolved in structure somewhat from the SGML days, but the work product is similar. The highest level of approval at OASIS is an OASIS Standard, representing approval after voting by the entire membership of OASIS. This is similar to a W3C Recommendation. The prior step is called a Committee Draft, which is approval of the spec by one technical committee (formerly called a Technical Resolution).

The Internet Engineering Task Force (IETF) is a model for an organization that thrives on the energy of the grass roots while trying to impose some measure of formal organization. Almost anyone with Internet access can submit an Internet Draft and propose it as a possible standard. A steering group reviews it and can recommend that it be published as a Request for Comment (RFC). RFCs can be marked as Standards Track RFCs or as outright Standard RFCs, but most publications that become RFCs are well regarded and often well implemented.

Finally, the XML community is celebrated for its activity in creating informal but important standards to fill gaps left by the big organizations. SAX, RDDL, and EXSLT are some notable examples. OASIS has worked to be attractive as a venue for such standards, but there is still no shortage of people willing to start a mailing list thread with the goal of hammering out a de-facto standard.

Namespaces in XML 1.0 [W3C Recommendation] provides a mechanism for universal naming of elements and attributes in XML documents. Here is a simple example that explains the motivation behind XML Namespaces: Imagine that you have an XML vocabulary in which elements named "head" and "body" are marked as anatomical descriptions, but you wish to embed XHTML (discussed later) snippets in the document. XHTML also defines "head" and "body" elements. How do you distinguish the XHTML elements from the host vocabulary elements of the same name? Using XML Namespaces, you would assign to each a vocabulary marker. In XML namespaces each vocabulary is called a namespace and there is a special syntax for expressing vocabulary markers. Each element or attribute name can be connected to one namespace, and in this way you could distinguish the anatomical "head" from the XHTML "head". Among XML experts, XML namespaces have been controversial because they add quite a bit of complexity to the XML processing model and some people think the gain does not warrant the problems. Nevertheless, XML namespaces have become almost universally accepted among XML users and they are addressed in almost all XML processing technologies.
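The "head" and "body" scenario can be sketched in a few lines. The example assumes Python's standard xml.etree.ElementTree module, which represents each name as a namespace URI plus a local name; the anatomy namespace URI and the sample document are hypothetical, while the XHTML namespace URI is the real one.

```python
import xml.etree.ElementTree as ET

ANATOMY_NS = "http://example.org/anatomy"  # hypothetical namespace URI
XHTML_NS = "http://www.w3.org/1999/xhtml"

DOC = """\
<a:patient xmlns:a="http://example.org/anatomy"
           xmlns:h="http://www.w3.org/1999/xhtml">
  <a:head>No injuries noted</a:head>
  <a:body>Bruising on left arm</a:body>
  <h:html>
    <h:head><h:title>Visit notes</h:title></h:head>
    <h:body><h:p>Follow up in two weeks.</h:p></h:body>
  </h:html>
</a:patient>
"""

root = ET.fromstring(DOC)

# The same local name, "head", is looked up in two different namespaces.
anatomical_heads = root.findall(f".//{{{ANATOMY_NS}}}head")
xhtml_heads = root.findall(f".//{{{XHTML_NS}}}head")

print(anatomical_heads[0].text)
print(xhtml_heads[0].tag)
```

The prefixes a and h are only local shorthand. What distinguishes the two "head" elements is the namespace URI each prefix is bound to.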

Namespaces in XML 1.1 [W3C Recommendation] is an update that incorporates errata and adds, among other things, support for internationalized URIs.

One issue that often comes up in association with XML namespaces is what sorts of resources namespace URIs should identify. The XML expert community, led by Jonathan Borden and Tim Bray, came up with Resource Directory Description Language (RDDL) as a standard for packaging information on a namespace. RDDL uses XHTML to provide prose descriptions of the vocabulary with embedded XLink (covered in this article) to provide pointers to key resources for helping understand or process the namespace. RDDL 2.0 [in development] is an update that seeks to replace XLink with two options: Resource Description Framework (RDF) (covered later) and alternative XML linking suggestions developed on the mailing list for the W3C Technical Architecture Group (TAG).

Recommended introductions and tutorials

Some of the XML 1.0 tutorials above cover XML namespaces. In addition:

  • ZVON offers an XML namespace tutorial.
  • "XML Namespaces by Example", by Tim Bray, gives a simple illustration of namespaces.
  • "XML Namespaces, XInclude, and XML Base" by Anders Møller and Michael I. Schwartzbach starts with a gentle introduction to XML namespaces.

References and other resources

  • Ronald Bourret maintains the XML Namespaces FAQ.
  • James Clark offers a close examination of namespaces and introduces a popular notation for describing namespaces in his essay "XML Namespaces."
  • Elliotte Rusty Harold introduces RDDL in his article "RDDL Me This: What Does a Namespace URL Locate?"



XML Base

XML Base [W3C Recommendation] provides a means of associating XML elements with URIs in order to more precisely specify how relative URIs are resolved in relevant XML processing actions. As an example, if an XML element contains a link that uses a relative URL, the absolute URL to be linked will be determined by referring to the base URI of the element. Most XML processors assume a base URI for each XML entity that makes up the document. You can override this default using XML Base.
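The effect of overriding the base URI can be sketched as follows. This is an illustration rather than a full implementation: it assumes a hypothetical vocabulary in which links are carried in an href attribute, and it uses Python's standard urljoin for relative URI resolution. The xml:base attribute name is the one defined by XML Base.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# ElementTree exposes xml:base under the reserved XML namespace.
XML_BASE = "{http://www.w3.org/XML/1998/namespace}base"

DOC = """\
<doc>
  <link href="intro.xml"/>
  <appendix xml:base="http://example.org/appendices/">
    <link href="a.xml"/>
  </appendix>
</doc>
"""


def resolve_links(elem, base, out):
    """Collect absolute link targets, honoring xml:base on ancestors."""
    # An xml:base attribute (itself possibly relative) replaces the
    # inherited base URI for this element and its descendants.
    base = urljoin(base, elem.get(XML_BASE, ""))
    href = elem.get("href")
    if href is not None:
        out.append(urljoin(base, href))
    for child in elem:
        resolve_links(child, base, out)
    return out


root = ET.fromstring(DOC)
# The document's own URI serves as the default base.
links = resolve_links(root, "http://example.org/docs/main.xml", [])
print(links)
```

The first link resolves against the document's own location, while the link inside appendix resolves against the overriding base.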

Recommended introductions and tutorials

  • ZVON offers an XML Base tutorial.
  • My IBM developerWorks tutorial "Develop Python/XML with 4Suite, Part 4: Composition and updates" (October 2002) introduces XML Base as well as XPointer, XInclude (see below), and XUpdate (covered in this series).



XInclude

XML Inclusions (XInclude) 1.0 [in development] provides a system for merging XML documents. XInclude is generally used when you wish to split XML documents into manageable chunks. You can split the documents up as you like and then use XInclude to merge the documents back together. External parsed entities, XML 1.0 constructs that allow you to load portions of the document from a separate file, can be used similarly, and some contend that XInclude is an unnecessary specification. XInclude offers some special facilities, including the ability to select portions of documents for inclusion.
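Here is a small sketch of merging split documents, assuming the xml.etree.ElementInclude module from Python's standard library. To keep the example self-contained, the chapter "files" are hypothetical in-memory strings served by a custom loader; a real application would read them from disk or the network.

```python
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude

# Hypothetical in-memory "files" standing in for documents on disk.
FILES = {
    "chapter1.xml": "<chapter><title>Getting started</title></chapter>",
    "chapter2.xml": "<chapter><title>Going further</title></chapter>",
}

BOOK = """\
<book xmlns:xi="http://www.w3.org/2001/XInclude">
  <xi:include href="chapter1.xml"/>
  <xi:include href="chapter2.xml"/>
</book>
"""


def load(href, parse, encoding=None):
    """Loader callback: return an element for parse="xml", else raw text."""
    if parse == "xml":
        return ET.fromstring(FILES[href])
    return FILES[href]


book = ET.fromstring(BOOK)
# Replace each xi:include element, in place, with the document it names.
ElementInclude.include(book, loader=load)

titles = [t.text for t in book.iter("title")]
print(titles)
```

After the call, the book element contains the two chapter elements directly, as if they had been written inline.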

Recommended introductions and tutorials

  • Elliotte Rusty Harold's "Using XInclude" is a strong introduction.
  • ZVON offers an XInclude tutorial.



XML Infoset

XML Information Set [W3C Recommendation], also known as the XML Infoset, defines an abstract way of describing an XML document as a series of objects, called information items, with specialized properties. This abstract data set incorporates aspects of XML documents defined in XML 1.0, XML Namespaces, and XML Base. The XML Infoset is used as the foundation of several other specifications that try to break down XML documents into some collection of constituent objects.

Recommended introductions and tutorials

  • Ken Sall's article "Exploring the XML Infoset" is an excerpt from his book XML Family of Specifications: A Practical Guide.



Canonical XML ("c14n")

Canonical XML Version 1.0 [W3C Recommendation] is a standard method for generating a physical representation of an XML document, called the canonical form, that accounts for the variations allowed in XML syntax without changing meaning. For example, attribute order in XML is insignificant, so if one document has all its attributes sorted in alphabetical order and another is the same except that its attributes appear in some other order, then both documents are identical as far as XML 1.0 is concerned, despite the difference in the physical representation. This does present some practical problems. For example, if you want a digital signature of a document to ensure that it isn't tampered with, a rearrangement of the attributes would break the signature, even though as far as XML 1.0 is concerned, the document has not really changed. The solution is to convert documents to canonical form (a process called "canonicalization (c14n)") before signature, text comparison, or any other such operation. This ensures that changes insignificant in XML 1.0 will be correctly accommodated.
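The attribute-order problem is easy to reproduce. The sketch below assumes Python 3.8 or later, whose standard library provides xml.etree.ElementTree.canonicalize. Note that this function implements the later Canonical XML 2.0 rather than the 1.0 version discussed here; for the attribute-order and quoting differences shown, the idea is the same. The sample documents and the digest helper are illustrative, and a hash stands in for a full digital signature.

```python
import hashlib
from xml.etree.ElementTree import canonicalize

# Two physical representations of the same XML content: different
# attribute order, quote style, spacing, and empty-element syntax.
DOC_A = '<item id="42" name="widget" color="red"/>'
DOC_B = "<item color='red'   name='widget' id='42'></item>"


def digest(xml_text):
    """Hash the canonical form, as a signing step would."""
    canonical = canonicalize(xml_text)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hashing the raw text treats the documents as different...
raw_a = hashlib.sha256(DOC_A.encode("utf-8")).hexdigest()
raw_b = hashlib.sha256(DOC_B.encode("utf-8")).hexdigest()
print(raw_a == raw_b)

# ...while hashing the canonical form treats them as the same.
print(digest(DOC_A) == digest(DOC_B))
```

A signature computed over the canonical form therefore survives a rearrangement of the attributes.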

Sometimes the XML that needs to be compared or signed is actually just a portion of a bigger document. Even then, c14n generally has to take the surrounding document into account in order to handle details such as namespace declarations. If you require c14n to be strictly limited to a document subset, then you must use the related algorithm Exclusive XML Canonicalization Version 1.0 [W3C Recommendation].




XPath

XML Path Language (XPath) 1.0 [W3C Recommendation] is a syntax and a data model for addressing parts of an XML document. It includes some features of a general-purpose expression language and is designed to be a little language that can be used for application-neutral processing within XML systems. As an example, one could use XPath to locate all the section-title elements in a document.
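The section-title example might look like this in practice. The sketch assumes Python's standard xml.etree.ElementTree module, which supports only a limited subset of XPath 1.0 (enough for simple location paths like these); the sample document is hypothetical.

```python
import xml.etree.ElementTree as ET

DOC = """\
<article>
  <section><section-title>Overview</section-title><para>...</para></section>
  <section>
    <section-title>Details</section-title>
    <section><section-title>Edge cases</section-title></section>
  </section>
</article>
"""

root = ET.fromstring(DOC)

# All section-title elements at any depth, in document order.
all_titles = [t.text for t in root.findall(".//section-title")]

# Only the titles of top-level sections.
top_titles = [t.text for t in root.findall("./section/section-title")]

print(all_titles)
print(top_titles)
```

The path expressions, not the surrounding Python, carry the addressing logic, which is what lets the same expressions be reused across XSLT and other XML tools.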

XPath is probably the most successful XML technology, besides XML 1.0 itself. It is the core of XSLT (covered later in this series), the very successful XML transformation language, and it is provided for in almost every platform for XML processing. XPath 2.0 [in development] adds many new features, including support for W3C XML Schema (to be covered) and many new core functions. It is a very controversial specification because of its enormous added complexity; many users and implementors (including me) say they will avoid XPath 2.0 unless it is greatly simplified.

Recommended introductions and tutorials

Almost every introduction to XSLT covers XPath as well. Here I list tutorials that focus on XPath alone:

  • The ZVON XPath tutorial is example-driven.
  • W3Schools' XPath tutorial provides explanations of the various sections of the spec.
  • Chapter 9: XPath from XML in a Nutshell, by Elliotte Rusty Harold and W. Scott Means, is a more prosaic introduction.



XPointer

The XPointer Framework [W3C Recommendation] defines a language that can be used to refer to fragments of an XML document. You are perhaps already familiar with how you can use URLs with hashes ("#") in them to link to a particular section of an HTML document. XPointer brings similar but much broader capabilities when linking or referring to XML documents. The framework can be used with the xpointer() scheme [in development], element() scheme [W3C Recommendation], and xmlns() scheme [W3C Recommendation], which define specific instructions for expressing the document fragments of interest within the XPointer framework.
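To give a feel for fragment resolution, here is a simplified sketch that handles two forms of pointer: a bare name (shorthand pointer) and the element() scheme's child sequences, in which each number selects the Nth child element. It makes loud assumptions: an attribute literally named id is treated as the element's ID (properly, ID-ness comes from a DTD or schema), and the helper names and sample document are hypothetical. It does not handle the xpointer() or xmlns() schemes.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urldefrag

DOC = """\
<doc>
  <section id="intro"><title>Introduction</title><para>Hello</para></section>
  <section id="usage"><title>Usage</title></section>
</doc>
"""


def find_by_id(root, name):
    """Assumption: an attribute named "id" acts as the element's ID."""
    for node in root.iter():
        if node.get("id") == name:
            return node
    return None


def resolve_fragment(root, fragment):
    """Resolve a shorthand pointer or an element() pointer to an element."""
    if not (fragment.startswith("element(") and fragment.endswith(")")):
        return find_by_id(root, fragment)  # shorthand pointer, e.g. "usage"
    start, *steps = fragment[len("element("):-1].split("/")
    if start:
        node = find_by_id(root, start)  # e.g. element(intro/1)
    elif steps and steps[0] == "1":
        node, steps = root, steps[1:]  # "/1" selects the document element
    else:
        return None
    for step in steps:
        if node is None or not step.isdigit():
            return None
        children = list(node)  # child elements only
        index = int(step) - 1  # child sequences count from 1
        if not 0 <= index < len(children):
            return None
        node = children[index]
    return node


root = ET.fromstring(DOC)
url, fragment = urldefrag("http://example.org/doc.xml#element(/1/2)")
target = resolve_fragment(root, fragment)
print(target.get("id"))
```

Here element(/1/2) walks to the document element and then to its second child, which is the same section that the shorthand pointer usage identifies by name.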

XPointer has had a rather chaotic road with a lot of dissenting activity. Members of the XPointer working group themselves developed a counter-proposal, FIXptr [Community Standard]. Alternative XPointer schemes have also been proposed, including the xpath1() scheme [IETF Internet Draft].

Recommended introductions and tutorials

XPointer changed quite significantly just before it became a recommendation, so be very careful of the many tutorials out there that cover older versions.

  • ZVON offers an XPointer tutorial.



XLink

XML Linking Language (XLink) 1.0 [W3C Recommendation] provides a generic framework for expressing links in XML documents. Hypertext, which requires linking, is the foundation of the Web, and adding sophisticated linking abilities has always been expected to be a cornerstone of XML. In fact, XLink was originally called "XML part 2." Unfortunately, defining a linking system for XML has proven to be far more complex than doing so for a static vocabulary such as HTML. XLink was developed through a long process that was charged with discord. For example, the developers of XHTML (covered in this series) decided not to use XLink and instead created their own system called HLink [in development]. Even now, a couple of years after its completion, adoption of XLink has been slow.

Nevertheless, XLink is important for being at the center of many important XML-related projects and it allows for much richer linking than basic, one-way HTML links. XLink offers such links (simple links), as well as more complex links that can have multiple end-points (extended links), and even links that are not expressed in the linked documents, but rather in special hub documents (called linkbases).
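Simple links are the easiest kind to illustrate. In the sketch below, any element can become a link by carrying attributes from the XLink namespace; the host vocabulary (reading-list, entry) and the simple_links helper are hypothetical, while the namespace URI and the type, href, and title attributes come from XLink 1.0. Extended links and linkbases are not covered.

```python
import xml.etree.ElementTree as ET

XLINK_NS = "http://www.w3.org/1999/xlink"

DOC = """\
<reading-list xmlns:xlink="http://www.w3.org/1999/xlink">
  <entry xlink:type="simple" xlink:href="http://www.w3.org/TR/xlink/"
         xlink:title="XLink specification">XLink</entry>
  <entry xlink:type="simple" xlink:href="notes.xml">Local notes</entry>
  <entry>No link here</entry>
</reading-list>
"""


def simple_links(root):
    """Return (href, title) pairs for every XLink simple link."""
    links = []
    for elem in root.iter():
        # The element name is irrelevant; the xlink:type attribute is
        # what marks an element as a simple link.
        if elem.get(f"{{{XLINK_NS}}}type") == "simple":
            href = elem.get(f"{{{XLINK_NS}}}href")
            title = elem.get(f"{{{XLINK_NS}}}title")
            links.append((href, title))
    return links


links = simple_links(ET.fromstring(DOC))
print(links)
```

Because the linking information lives in namespaced attributes rather than in a fixed element such as HTML's a, a generic tool can find links in a vocabulary it has never seen before.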

Recommended introductions and tutorials

You can find XLink tutorials that cover older, obsolete drafts of the language. The following are up to date:

  • ZVON provides separate tutorials on XLink simple links and extended links.

References and other resources

  • ZVON also offers an "XLink Reference."
  • Bob DuCharme discusses the history of XLink and offers a survey of implementations in his article "XLink: Who Cares?"



RELAX NG

RELAX NG [OASIS Committee Specification and ISO Draft Standard] is an XML schema language, meaning it is a language that can be used to define and limit XML vocabularies. The original XML schema language is the Document Type Definition (DTD), defined in XML 1.0 itself. However, some people dislike DTD for its awkward syntax, limitations in the text and markup constructs it can express, and the difficulty of handling XML Namespaces. Several new XML schema languages have emerged to supplant or augment DTDs, including RELAX NG, which is renowned for its simplicity and expressiveness. The core specification of RELAX NG defines an XML syntax for schemata; also a RELAX NG Compact Syntax [OASIS Committee Specification] defines a simple text syntax for RELAX NG schemata. The text syntax is expected to be incorporated into the ISO standard as a later addendum. RELAX NG is part of an overall ISO initiative for XML schema processing systems called Document Schema Definition Languages (DSDL).

Recommended introductions and tutorials

  • Read Nicholas Chase's introductory tutorial "Understanding RELAX NG", which gets you up to speed on both RELAX NG's simplicity and its power, including both its full XML-based and its compact syntax (developerWorks, December 2003).
  • David Mertz's XML Matters column on developerWorks focuses on RELAX NG in his series "Kicking back with RELAX NG":
    • Part 1 looks at the general semantics of RELAX NG, and touches on datatyping (February 2003).
    • Part 2 continues the discussion by addressing a few additional semantic issues and looking at tools for working with RELAX NG (March 2003).
    • Part 3 explores the RELAX NG compact syntax in detail, and explains the exact correspondences between compact syntax and XML syntax (May 2003).
  • Official tutorials cover RELAX NG's core and its compact syntax.
  • ZVON offers a combined tutorial for RELAX NG and W3C XML Schema language (covered in this series).

References and other resources

  • Many resources are linked from the RELAX NG home page.
  • ZVON offers a "RELAX NG Reference".



W3C XML Schema

XML Schema Part 1: Structures and XML Schema Part 2: Datatypes [W3C Recommendations] define another schema language for XML. The first part allows one to constrain the structure of the document, and the second part allows one to constrain the contents of simple elements and attributes. W3C XML Schema (WXS) has faced criticism for complexity and a lack of expressiveness; the result is competition from other languages such as RELAX NG. Increasingly, people are just using whatever schema language suits them best and turning to an impressive crop of emerging tools to convert from one form to another according to need. Many other specifications have used the WXS Datatypes specification, although there have been calls to develop alternative data type systems. The working group has started work on WXS 1.1.

Recommended introductions and tutorials

  • Nicholas Chase's developerWorks tutorial "Validating XML" covers both DTD and WXS (August 2003).
  • W3Schools has a WXS tutorial.
  • The W3C XML Schema working group offers a very thorough and prosaic introduction to the technology in XML Schema Part 0: Primer.

References and other resources

  • ZVON offers a WXS reference.
  • W3Schools has a WXS Elements Reference.



Schematron

The Schematron Assertion Language 1.5 [Community standard and draft ISO standard] is a schema language that uses a different approach from DTD, RELAX NG, or WXS. In Schematron, you register a collection of rules against which the XML document is to be checked, rather than mapping out the entire tree structure of the XML format you're trying to express from root node to the leaves. This makes Schematron very useful not only as a standalone schema language, but also as a complement to other schema languages. Schematron can express constraints that cannot be expressed in the other languages I've covered, so it is often used in tandem with the others.
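The rule-based approach can be imitated in a few lines, though the sketch below is emphatically not Schematron syntax. In real Schematron, rules are written in XML, with XPath expressions for both the context and the test. Here, as a stand-in, each hypothetical rule pairs a context path with a Python function and a message; the sample order vocabulary is also made up.

```python
import xml.etree.ElementTree as ET

DOC = """\
<orders>
  <order><customer>Ada</customer><item quantity="2"/></order>
  <order><item quantity="0"/></order>
</orders>
"""

# Each rule: (context path, test on each matching element, message).
# Python callables stand in for Schematron's XPath test expressions.
RULES = [
    (".//order",
     lambda e: e.find("customer") is not None,
     "An order must name a customer."),
    (".//item",
     lambda e: int(e.get("quantity", "0")) > 0,
     "An item quantity must be positive."),
]


def check(root, rules):
    """Return the message of every assertion that fails."""
    failures = []
    for context, test, message in rules:
        for elem in root.findall(context):
            if not test(elem):
                failures.append(message)
    return failures


failures = check(ET.fromstring(DOC), RULES)
print(failures)
```

Nothing in the rules describes the overall tree structure. Each rule checks one constraint wherever its context occurs, which is why this style combines well with a grammar-based schema that handles the structure.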

Recommended introductions and tutorials

  • ZVON offers a Schematron tutorial.
  • Chimezie Thomas-Ogbuji wrote an introduction, "Validating XML with Schematron".

References and other resources

  • The Schematron home page and resource directory provide many useful links.
  • ZVON also offers a Schematron reference.



More to come

In this article I have surveyed the most important core XML standards. In Part 2, I shall survey standards important to those using XML in applications processing.



Resources

  • Read the second installment of this series on XML standards, in which Uche Ogbuji focuses on XML processing technologies (developerWorks, February 2004). In Part 3, the author looks at the most important XML vocabularies (developerWorks, February 2004). Part 4 is a detailed cross-reference of all the standards covered in the series (developerWorks, March 2004).

  • Read The XML Bible, 2nd Edition, by Elliotte Rusty Harold (John Wiley & Sons, 2001), if you need to gain as solid a foundation in XML as possible, but are only willing to buy one book.

  • Visit Web sites of the most significant organizations where XML standards are developed:
    • W3C (World Wide Web Consortium)
    • OASIS (Organization for the Advancement of Structured Information Standards)
    • The ISO (International Organization for Standardization), especially through the project ISO/IEC 19757 - Document Schema Definition Languages (DSDL)

  • Simon St. Laurent's Outsider's Guide to the W3C is a FAQ that clarifies many aspects of the organization that brought you HTML and XML.

  • Look up nearly any aspect of XML technology in Robin Cover's The Cover Pages, an XML resource guide of staggering comprehensiveness.

  • Visit the xmlhack news site for XML developers, which Uche helps to edit.

  • Find more XML resources on the developerWorks XML content area, including Uche Ogbuji's Thinking XML column.

  • Find out how you can become an IBM Certified Developer in XML 1.1 and related technologies.


About the author

Uche Ogbuji is a consultant and co-founder of Fourthought Inc., a software vendor and consultancy specializing in XML solutions for enterprise knowledge management. Fourthought develops 4Suite, an open source platform for XML, RDF, and knowledge-management applications. Mr. Ogbuji is a computer engineer and writer born in Nigeria, living and working in Boulder, Colorado, USA. You can contact Mr. Ogbuji at uche@ogbuji.net.