An XML schema is a description of a type of XML document, typically expressed in terms of constraints on the structure and content of documents of that type, above and beyond the basic syntactical constraints imposed by XML itself. These constraints are generally expressed using some combination of grammatical rules governing the order of elements, Boolean predicates that the content must satisfy, data types governing the content of elements and attributes, and more specialized rules such as uniqueness and referential integrity constraints.
Several languages have been developed specifically to express XML schemas. The document type definition (DTD) language, which is native to the XML specification, is a schema language of relatively limited capability, but it also has other uses in XML aside from the expression of schemas. Two more expressive XML schema languages in widespread use are XML Schema (with a capital S) and RELAX NG.
The mechanism for associating an XML document with a schema varies according to the schema language. The association may be achieved via markup within the XML document itself, or via some external means.
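As an illustration of association through markup in the document itself, the two fragments below sketch the usual mechanisms for a DTD and for W3C XML Schema. The <code>note</code> vocabulary and the file names <code>note.dtd</code> and <code>note.xsd</code> are hypothetical.

<syntaxhighlight lang="xml">
<?xml version="1.0"?>
<!-- DTD: the DOCTYPE declaration names an external DTD (hypothetical file) -->
<!DOCTYPE note SYSTEM "note.dtd">
<note>
  <body>Hello</body>
</note>
</syntaxhighlight>

<syntaxhighlight lang="xml">
<?xml version="1.0"?>
<!-- W3C XML Schema: a location hint from the XMLSchema-instance namespace -->
<note xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:noNamespaceSchemaLocation="note.xsd">
  <body>Hello</body>
</note>
</syntaxhighlight>

RELAX NG itself defines no in-document mechanism, so a RELAX NG schema is typically associated by external means, for example by passing it to the validator directly.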
The XML Schema Definition is commonly referred to as XSD.
Validation
The process of checking whether an XML document conforms to a schema is called validation, which is separate from XML's core concept of syntactic well-formedness. All XML documents must be well-formed, but a document need not be valid unless the XML parser is "validating", in which case the document is also checked for conformance with its associated schema. DTD-validating parsers are the most common, but some parsers support XML Schema or RELAX NG as well.
Validation of an instance document against a schema can be regarded as a conceptually separate operation from XML parsing. In practice, however, many schema validators are integrated with an XML parser.
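The difference between well-formedness and validity can be sketched with a small document that carries its DTD in the internal subset. The <code>note</code> vocabulary is a hypothetical example.

<syntaxhighlight lang="xml">
<?xml version="1.0"?>
<!DOCTYPE note [
  <!-- A note must contain a "to" element followed by a "body" element -->
  <!ELEMENT note (to, body)>
  <!ELEMENT to   (#PCDATA)>
  <!ELEMENT body (#PCDATA)>
]>
<!-- Well-formed: every tag is properly nested and closed.
     Not valid: the required "to" element is missing. -->
<note>
  <body>Hello</body>
</note>
</syntaxhighlight>

A non-validating parser would accept this document, whereas a validating parser would report that <code>note</code> does not match its declared content model.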
Languages
There are several different languages available for specifying an XML schema. Each language has its strengths and weaknesses.
The primary purpose of a schema language is to specify what the structure of an XML document can be: which elements may appear inside which other elements, which attributes are legal on a particular element, and so forth. A schema is analogous to a grammar for a language; it defines what the vocabulary of the language may be and what a valid "sentence" is.
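To make the grammar analogy concrete, here is a minimal sketch of a schema written in W3C XML Schema, followed by one "sentence" that it accepts. The <code>book</code> vocabulary and the sample values are assumptions made for illustration only.

<syntaxhighlight lang="xml">
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="book">
    <xs:complexType>
      <!-- Structure: one title, then one or more authors, in that order -->
      <xs:sequence>
        <xs:element name="title" type="xs:string"/>
        <xs:element name="author" type="xs:string" maxOccurs="unbounded"/>
      </xs:sequence>
      <!-- The only attribute that is legal on book, and it is required -->
      <xs:attribute name="isbn" type="xs:string" use="required"/>
    </xs:complexType>
  </xs:element>
</xs:schema>
</syntaxhighlight>

<syntaxhighlight lang="xml">
<book isbn="0-00-000000-0">
  <title>An Example Title</title>
  <author>First Author</author>
  <author>Second Author</author>
</book>
</syntaxhighlight>

A document that placed <code>author</code> before <code>title</code>, or that added an undeclared attribute to <code>book</code>, would be well-formed but would not be a valid "sentence" under this schema.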
Historic and current XML schema languages include:
{|class="wikitable"
!Language
!Abbrev.
!Versions
!Authority
|-
| Constraint Language in XML
| CLiX
| 2005
|Independent
|-
| Document Content Description facility for XML, an RDF framework
|DCD
| v1.0 (1998)
| W3C (Note)
|-
| Document Definition Markup Language
| DDML
| v0 (1999)
| W3C (Note)
|-
| Document Structure Description
| DSD
| 2002, 2005
| BRICS (defunct)
|-
|rowspan="2"| Document Type Definition
|rowspan="2"| DTD
| 1986 (SGML)
| ISO
|-
| 2008 (XML)
| ISO/IEC
|-
| Namespace-based Validation Dispatching Language
| NVDL
| 2006
| ISO/IEC
|-
| Content Assembly Mechanism
| CAM
| 2007
| OASIS
|-
|rowspan="2"| REgular LAnguage for XML Next Generation
|rowspan="2"| RELAX NG, RelaxNG
| 2001, Compact Syntax (2002)
| OASIS
|-
| v1 (2003), v1 Compact Syntax (2006), v2 (2008)
| ISO/IEC
|}
W3C XML Schema has a rich built-in "simple type" system (xs:decimal, xs:date, etc., plus derivation of custom types), while RELAX NG has an extremely simple one because it is meant to use type libraries developed independently of RELAX NG, rather than grow its own. This is seen by some as a disadvantage. In practice it is common for a RELAX NG schema to use the predefined "simple types" and "restrictions" (pattern, maxLength, etc.) of W3C XML Schema.
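The following sketch shows a RELAX NG pattern that borrows simple types and restrictions from the W3C XML Schema datatype library, as described above. The <code>entry</code>, <code>date</code> and <code>code</code> element names are hypothetical.

<syntaxhighlight lang="xml">
<element name="entry" xmlns="http://relaxng.org/ns/structure/1.0"
         datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
  <element name="date">
    <!-- xs:date, taken from the external datatype library -->
    <data type="date"/>
  </element>
  <element name="code">
    <!-- xs:string restricted by the pattern and maxLength facets -->
    <data type="string">
      <param name="pattern">[A-Z]{3}</param>
      <param name="maxLength">3</param>
    </data>
  </element>
</element>
</syntaxhighlight>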
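W3C XML Schema can express a specific number or range of repetitions of a pattern (through its minOccurs and maxOccurs attributes), whereas RELAX NG has no counting construct: it offers only <oneOrMore> and <zeroOrMore>, so any other count must be spelled out by repeating the pattern, which quickly becomes impractical.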
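As a sketch of this difference, the two fragments below both require a <code>list</code> element (a hypothetical name) to contain two or three <code>item</code> children. W3C XML Schema states the range directly, while RELAX NG has to repeat the pattern.

<syntaxhighlight lang="xml">
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="list">
    <xs:complexType>
      <xs:sequence>
        <!-- The range is stated once, as occurrence bounds -->
        <xs:element name="item" type="xs:string" minOccurs="2" maxOccurs="3"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>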
Disadvantages
W3C XML Schema is complex and hard to learn, although that is partially because it tries to do more than mere validation (see PSVI).
Although being written in XML is an advantage, it is also a disadvantage in some ways. The W3C XML Schema language, in particular, can be quite verbose, while a DTD can be terse and relatively easily editable.
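The difference in verbosity can be seen by writing roughly the same content model in both languages. The <code>note</code> vocabulary is a hypothetical example, and the two schemas are only approximately equivalent. First the DTD:

<syntaxhighlight lang="xml">
<!ELEMENT note (to, body)>
<!ELEMENT to   (#PCDATA)>
<!ELEMENT body (#PCDATA)>
</syntaxhighlight>

and then a W3C XML Schema counterpart:

<syntaxhighlight lang="xml">
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="note">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="to" type="xs:string"/>
        <xs:element name="body" type="xs:string"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
</syntaxhighlight>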
Likewise, the formal mechanism by which W3C XML Schema (WXS) associates a document with a schema can pose a security problem. A WXS validator that follows a URI to an arbitrary online location may retrieve malicious content from the remote end.
W3C XML Schema does not implement most of the DTD's ability to provide data elements to a document.
Although W3C XML Schema's ability to add default attributes to elements is an advantage, it is a disadvantage in some ways as well. It means that an XML file may not be usable in the absence of its schema, even if the document would validate against that schema. In effect, all users of such an XML document must also implement the W3C XML Schema specification, thus ruling out minimalist or older XML parsers. It can also slow down the processing of the document, as the processor must potentially download and process a second XML file (the schema); however, a schema would normally then be cached, so the cost comes only on the first use.
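A minimal sketch of the default-attribute issue follows. The <code>price</code> element and its <code>currency</code> attribute are hypothetical names chosen for illustration.

<syntaxhighlight lang="xml">
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="price">
    <xs:complexType>
      <xs:simpleContent>
        <xs:extension base="xs:decimal">
          <!-- Supplied by a schema-aware processor when the attribute is absent -->
          <xs:attribute name="currency" type="xs:string" default="USD"/>
        </xs:extension>
      </xs:simpleContent>
    </xs:complexType>
  </xs:element>
</xs:schema>
</syntaxhighlight>

<syntaxhighlight lang="xml">
<price>9.99</price>
</syntaxhighlight>

A processor that applies the schema would report <code>currency="USD"</code> for this instance, while a parser that never reads the schema sees no <code>currency</code> attribute at all, so the two consumers end up with different data.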
Tool support
WXS support exists in a number of large XML parsing packages. Xerces and the .NET Framework's Base Class Library both provide support for WXS validation.
RELAX NG
RELAX NG provides most of the advantages over DTDs that W3C XML Schema does.
Advantages over W3C XML Schema
While RELAX NG schemas can be written in XML, the language also has an equivalent form that is much more like a DTD, but with greater specifying power. This form is known as the compact syntax. Tools can easily convert between these forms with no loss of features, or even of comments. Even arbitrary elements specified between RELAX NG XML elements can be converted into the compact form.
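To give a sense of the two forms, the sketch below shows a small pattern in the XML syntax, with its compact-syntax equivalent reproduced in a trailing comment. The <code>card</code> vocabulary is hypothetical.

<syntaxhighlight lang="xml">
<element name="card" xmlns="http://relaxng.org/ns/structure/1.0">
  <element name="name"><text/></element>
  <optional>
    <element name="email"><text/></element>
  </optional>
</element>
<!-- Equivalent compact syntax:
     element card {
       element name { text },
       element email { text }?
     }
-->
</syntaxhighlight>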
RELAX NG provides very strong support for unordered content. That is, it allows the schema to state that a sequence of patterns may appear in any order.
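Unordered content of this kind is written with the <code>interleave</code> pattern. In the following sketch, which uses a hypothetical <code>person</code> vocabulary, the children may appear in any order and <code>phone</code> may be omitted.

<syntaxhighlight lang="xml">
<element name="person" xmlns="http://relaxng.org/ns/structure/1.0">
  <interleave>
    <element name="name"><text/></element>
    <element name="email"><text/></element>
    <optional>
      <element name="phone"><text/></element>
    </optional>
  </interleave>
</element>
</syntaxhighlight>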
RELAX NG also allows for non-deterministic content models. That is, RELAX NG allows a sequence like the following to be specified:
<syntaxhighlight lang="xml">
<zeroOrMore>
and
DeRose (1997).
;Consistency: One obvious consideration is that tags and attribute names should use consistent conventions. For example, it would be unusual to create a schema where some element names are camelCase but others use underscores to separate parts of names, or other conventions.
;Clear and mnemonic names: As in other formal languages, good choices of names can help understanding, even though the names per se have no formal significance. Naming the appropriate tag "chapter" rather than "tag37" is a help to the reader. At the same time, this brings in issues of the choice of natural language. A schema to be used for Irish Gaelic documents will probably use the same language for element and attribute names, since that will be the language common to editors and readers.
;Tag vs attribute choice: Some information can "fit" readily in either an element or an attribute. Because attributes cannot contain elements in XML, this question only arises for components that have no further sub-structure that XML needs to be aware of (attributes do support multiple tokens, such as multiple IDREF values, which can be considered a slight exception). Attributes typically represent information associated with the entirety of the element on which they occur, while sub-elements introduce a new scope of their own.
;Text content: Some XML schemas, particularly ones that represent various kinds of documents, ensure that all "text content" (roughly, any part that one would speak if reading the document aloud) occurs as text, and never in attributes. However, there are many edge cases where this does not hold: First, there are XML documents which do not involve "natural language" at all, or only minimally, such as for telemetry, creation of vector graphics or mathematical formulae, and so on. Second, information like stage directions in plays, verse numbers in Classical and Scriptural works, and correction or normalization of spelling in transcribed works, all pose issues of interpretation that schema designers for such genres must consider.
;Schema reuse: A new XML Schema can be developed from scratch, or can reuse some fragments of other XML Schemas. All schema languages offer some tools (for example, <code>include</code> and modularization control over namespaces) and recommend reuse where practical. Various parts of the extensive and sophisticated Text Encoding Initiative schemas are also re-used in an extraordinary variety of other schemas.
;Semantic vs syntactic: Except for an RDF-related one, no schema language formally expresses semantics, only structure and data types. Although expressing semantics would be the ideal, support for RDF assumptions is very poor, and schema development frameworks do not recommend including them.
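The tag-versus-attribute choice discussed above can be illustrated with a chapter number, which fits either form. Both fragments are hypothetical and neither is prescribed by any particular schema.

<syntaxhighlight lang="xml">
<!-- As an attribute: one atomic value describing the whole element -->
<chapter number="3">
  <title>Validation</title>
</chapter>
</syntaxhighlight>

<syntaxhighlight lang="xml">
<!-- As a sub-element: a new scope that could later gain sub-structure -->
<chapter>
  <number>3</number>
  <title>Validation</title>
</chapter>
</syntaxhighlight>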
See also
- Data structure
- List of XML schemas
- Schema (disambiguation) (for other uses of the term)
- Structuring information
- XML Information Set
- XML log
- JSON Schema
Languages:
- Document Structure Description
- Document Type Definition
- Namespace Routing Language
- Namespace-based Validation Dispatching Language
- OASIS CAM
- RELAX NG
- Schematron
- W3C XML Schema
External links
- Comparing XML Schema Languages by Eric van der Vlist (2001)
- Application of XML Schema in Web Services Security by Sridhar Guthula, W3C Schema Experience Report, May 2005
- March 2009 DEVX article "Taking XML Validation to the Next Level: Introducing CAM" by Michael Sorens
