https://github.com/TEIC/TEI
Tip revision: 347c64fac3fa1a64ade0d8d1842813d4b8f7acec authored by Hugh Cayless on 12 May 2017, 16:57:59 UTC
Updates.
SG.html

<!DOCTYPE html
  SYSTEM "about:legacy-compat">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8" /><!--THIS FILE IS GENERATED FROM AN XML MASTER. DO NOT EDIT (4)--><title>v. A Gentle Introduction to XML - The TEI Guidelines</title><meta property="Language" content="en" /><meta property="DC.Title" content="v. A Gentle Introduction to XML - The TEI Guidelines" /><meta property="DC.Language" content="SCHEME=iso639 en" /><meta property="DC.Creator.Address" content="tei@oucs.ox.ac.uk" /><meta charset="utf-8" /><link href="guidelines.css" rel="stylesheet" type="text/css" /><link href="odd.css" rel="stylesheet" type="text/css" /><link rel="stylesheet" media="print" type="text/css" href="guidelines-print.css" /><script type="text/javascript" src="jquery-1.2.6.min.js"></script><script type="text/javascript" src="columnlist.js"></script><script type="text/javascript" src="popupFootnotes.js"></script><script type="text/javascript">
        $(function() {
         $('ul.attrefs-class').columnizeList({cols:3,width:30,unit:'%'});
         $('ul.attrefs-element').columnizeList({cols:3,width:30,unit:'%'});
         $(".displayRelaxButton").click(function() {
           $(this).parent().find('.RNG_XML').toggle();
           $(this).parent().find('.RNG_Compact').toggle();
         });
         $(".tocTree .showhide").click(function() {
          $(this).find(".tocShow,.tocHide").toggle();
          $(this).parent().find("ul.continuedtoc").toggle();
	  });
        })
    </script><script type="text/javascript"><!--
var displayXML=0;
// ids of the elements shown or hidden by the functions below
var states = [
  "element-a", "element-b", "element-c", "element-d", "element-e",
  "element-f", "element-g", "element-h", "element-i", "element-j",
  "element-k", "element-l", "element-m", "element-n", "element-o",
  "element-p", "element-q", "element-r", "element-s", "element-t",
  "element-u", "element-v", "element-w", "element-x", "element-y",
  "element-z"
];

function startUp() {

}

function hideallExcept(elm) {
for (var i = 0; i < states.length; i++) {
 var layer;
 if (layer = document.getElementById(states[i]) ) {
  if (states[i] != elm) {
    layer.style.display = "none";
  }
  else {
   layer.style.display = "block";
      }
  }
 }
 var mod;
 if ( mod = document.getElementById('byMod') ) {
     mod.style.display = "none";
 }
}

function showall() {
 for (var i = 0; i < states.length; i++) {
   var layer;
   if (layer = document.getElementById(states[i]) ) {
      layer.style.display = "block";
      }
  }
}

function showByMod() {
  hideallExcept('');
  var mod;
  if (mod = document.getElementById('byMod') ) {
     mod.style.display = "block";
     }
}

	--></script></head><body><div id="container"><div id="banner"><img src="Images/banner.jpg" alt="Text Encoding Initiative logo and banner" /></div></div><div class="mainhead"><h1>P5: 
    Guidelines for Electronic Text Encoding and Interchange</h1><p>Version 3.1.1a. Last updated on
	10th May 2017, revision bd8dda3</p></div><div id="onecol" class="main-content"><h2><span class="headingNumber">v. </span>A Gentle Introduction to XML</h2><div class="div1" id="SG"><div class="miniTOC miniTOC_left"><p><span class="subtochead">Table of contents</span></p><div class="subtoc"><ul class="subtoc"><li class="subtoc"><a class="subtoc" href="SG.html#SG11" title="Whats Special about XML">v.1. What's Special about XML?</a></li><li class="subtoc"><a class="subtoc" href="SG.html#SG12" title="Textual Structures">v.2. Textual Structures</a></li><li class="subtoc"><a class="subtoc" href="SG.html#SG13" title="XML Structures">v.3. XML Structures</a></li><li class="subtoc"><a class="subtoc" href="SG.html#SG152" title="Complicating the Issue">v.4. Complicating the Issue</a></li><li class="subtoc"><a class="subtoc" href="SG.html#SG16" title="Attributes">v.5. Attributes</a></li><li class="subtoc"><a class="subtoc" href="SG.html#SG-oth" title="Other Components of an XML Document">v.6. Other Components of an XML Document</a></li><li class="subtoc"><a class="subtoc" href="SG.html#SG18" title="Putting It All Together">v.7. Putting It All Together</a></li></ul></div><ul class="subtoc"><li class="subtoc"><span class="previousLink"> « </span><a class="navigation" href="AB.html"><span class="headingNumber">iv. </span>About These Guidelines</a></li><li class="subtoc"><span class="nextLink"> » </span><a class="navigation" href="CH.html"><span class="headingNumber">vi. </span>Languages and Character Sets</a></li><li class="subtoc"><a class="navigation" href="index.html">Home</a></li></ul></div><p>The encoding scheme defined by these Guidelines is formulated as an application of the Extensible Markup Language (XML) (<a class="link_ptr" href="BIB.html#XMLREC" title="Tim Bray Jean Paoli C. M. SperbergMcQueen Eve Maler François Yergau Extensible Markup Language (XML) Version 1.0 (Fourth editio...">Bray et al. (eds.) (2006)</a>). 
XML is widely used for the definition of device-independent, system-independent methods of storing and processing texts in electronic form. It is now also the interchange and communication format used by many applications on the World Wide Web. In the present chapter we informally introduce some of its basic concepts and attempt to explain to the reader encountering them for the first time how and why they are used in the TEI scheme. More detailed technical accounts of TEI practice in this respect are provided in chapters <a class="link_ptr" href="USE.html" title="Using the TEI"><span class="headingNumber">23 </span>Using the TEI</a>, <a class="link_ptr" href="ST.html" title="3"><span class="headingNumber">1 </span>The TEI Infrastructure</a>, and <a class="link_ptr" href="TD.html" title="27"><span class="headingNumber">22 </span>Documentation Elements</a> of these Guidelines.</p><p>Strictly speaking, XML is a <span class="term">metalanguage</span>, that is, a language used to describe other languages, in this case, <span class="term">markup</span> languages. Historically, the word <span class="term">markup</span> has been used to describe annotation or other marks within a text intended to instruct a compositor or typist how a particular passage should be printed or laid out. Examples include wavy underlining to indicate boldface, special symbols for passages to be omitted or printed in a particular font, and so forth. As the formatting and printing of texts was automated, the term was extended to cover all sorts of special codes inserted into electronic texts to govern formatting, printing, or other processing.</p><p>Generalizing from that sense, we define <span class="term">markup</span>, or (synonymously) <span class="term">encoding</span>, as any means of making explicit an interpretation of a text. 
Of course, all printed texts are implicitly encoded (or marked up) in this sense: punctuation marks, capitalization, disposition of letters around the page, even the spaces between words all might be regarded as a kind of markup, the purpose of which is to help the human reader determine where one word ends and another begins, or how to identify gross structural features such as headings or simple syntactic units such as dependent clauses or sentences. Encoding a text for computer processing is, in principle, like transcribing a manuscript from <span lang="la" class="foreign">scriptio continua</span><span id="Note5_return"><a class="notelink" title="In the continuous writing characteristic of manuscripts from the early classical period, words are written continuously with no intervening spaces or …" href="#Note5"><sup>3</sup></a></span>; it is a process of making explicit what is conjectural or implicit, a process of directing the user as to how the content of the text should be (or has been) interpreted.</p><p>By <span class="term">markup language</span> we mean a set of markup conventions used together for encoding texts. A markup language must specify how markup is to be distinguished from text, what markup is allowed, what markup is required, and what the markup means. XML provides the means for doing the first three; documentation such as these Guidelines is required for the last.</p><p>The present chapter attempts to give an informal introduction to those parts of XML of which a proper understanding is necessary to make best use of these Guidelines. The interested reader should also consult one or more of the many excellent introductory textbooks and web sites now available on the subject.<span id="Note6_return"><a class="notelink" title="New textbooks and websites about XML appear at regular intervals and to select any one of them would be invidious. 
some recommended online courses in…" href="#Note6"><sup>4</sup></a></span></p><div class="div2" id="SG11"><div class="miniTOC miniTOC_right"><ul class="subtoc"><li class="subtoc"></li><li class="subtoc"><span class="nextLink"> » </span><a class="navigation" href="SG.html#SG12"><span class="headingNumber">v.2. </span>Textual Structures</a></li><li class="subtoc"><a class="navigation" href="index.html">Home</a></li></ul></div><h3><span class="bookmarklink"><a class="bookmarklink" href="#SG11" title="link to this section "><span class="invisible">TEI: What's Special about XML?</span><span class="pilcrow">¶</span></a></span><span class="headingNumber">v.1. </span><span class="head">What's Special about XML?</span></h3><p>XML has three highly distinctive advantages: </p><ol class="numbered"><li class="item">it places emphasis on descriptive rather than procedural markup;</li><li class="item">it distinguishes the concepts of syntactic correctness and of <span class="term">validity</span> with respect to a <span class="term">document type definition</span>;</li><li class="item">it is independent of any one hardware or software system.</li></ol><p> These three aspects are discussed briefly below, and then in more depth in the remainder of this chapter.</p><p>XML is frequently compared with HTML, the language in which web pages have generally been written, which shares some of the above characteristics. 
Compared with HTML, however, XML has some other important features: </p><ul class="bulleted"><li class="item">XML is <span class="bulleted">extensible</span>: it does not consist of a fixed set of tags;</li><li class="item">XML documents must be <span class="bulleted">well-formed</span> according to a defined syntax;</li><li class="item">an XML document can be formally <span class="bulleted">validated</span> against a schema of some kind;</li><li class="item">XML is more interested in the meaning of data than in its presentation.</li></ul><div class="div3" id="SG111"><div class="miniTOC miniTOC_right"><ul class="subtoc"><li class="subtoc"></li><li class="subtoc"><span class="nextLink"> » </span><a class="navigation" href="SG.html#SG112"><span class="headingNumber"></span>Types of Document</a></li><li class="subtoc"><a class="navigation" href="index.html">Home</a></li></ul></div><h4><span class="bookmarklink"><a class="bookmarklink" href="#SG111" title="link to this section "><span class="invisible">TEI: Descriptive Markup</span><span class="pilcrow">¶</span></a></span><span class="headingNumber"></span><span class="head">Descriptive Markup</span></h4><p>In a descriptive markup system, the markup codes used do little more than categorize parts of a document. Markup codes such as <span class="tag">&lt;para&gt;</span> or <code>\end{list}</code> simply identify a portion of a document and assert of it that <span class="q">‘the following item is a paragraph’</span>, or <span class="q">‘this is the end of the most recently begun list’</span>, etc. By contrast, a procedural markup system defines what processing is to be carried out at particular points in a document: <span class="q">‘call procedure PARA with parameters 42, b, and x here’</span> or <span class="q">‘move the left margin 2 quads left, move the right margin 2 quads right, skip down one line, and go to the new left margin,’</span> etc. 
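</p><p>As a purely illustrative sketch of the descriptive approach (the element names <span class="gi">&lt;para&gt;</span>, <span class="gi">&lt;list&gt;</span>, and <span class="gi">&lt;item&gt;</span> and the sample text are invented here, and are not TEI names), a fragment might be marked up as follows. Nothing in it says how a paragraph or a list is to be printed; that decision is left to whatever program processes it: </p><pre class="pre_eg cdata">&lt;para&gt;Three things are needed:&lt;/para&gt;
&lt;list&gt;
 &lt;item&gt;a text&lt;/item&gt;
 &lt;item&gt;a set of tags&lt;/item&gt;
 &lt;item&gt;a reason to use them&lt;/item&gt;
&lt;/list&gt;</pre><p>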
In XML, the instructions needed to process a document for some particular purpose (for example, to format it) are sharply distinguished from the markup used to describe it.</p><p>Usually, the markup or other information needed to process a document will be maintained separately from the document itself, typically in a distinct document called a <span class="term">stylesheet</span>, though it may do much more than simply define the rendition or visual appearance of a document.<span id="Note7_return"><a class="notelink" title="We do not here discuss in any detail the ways that a stylesheet can be used or defined, nor do we discuss the popular W3C Stylesheet Languages XSLT an…" href="#Note7"><sup>5</sup></a></span></p><p>When descriptive markup is used, the same document can readily be processed in many different ways, using only those parts of it which are considered relevant. For example, a content analysis program might disregard entirely the footnotes embedded in an annotated text, while a formatting program might extract and collect them all together for printing at the end of each chapter. Different kinds of processing can be carried out with the same part of a file. 
For example, one program might extract names of persons and places from a document to create an index or database, while another, operating on the same text, but using a different stylesheet, might print names of persons and places in a distinctive typeface.</p></div><div class="div3" id="SG112"><div class="miniTOC miniTOC_right"><ul class="subtoc"><li class="subtoc"><span class="previousLink"> « </span><a class="navigation" href="SG.html#SG111"><span class="headingNumber"></span>Descriptive Markup</a></li><li class="subtoc"><span class="nextLink"> » </span><a class="navigation" href="SG.html#SG113"><span class="headingNumber"></span>Data Independence</a></li><li class="subtoc"><a class="navigation" href="index.html">Home</a></li></ul></div><h4><span class="bookmarklink"><a class="bookmarklink" href="#SG112" title="link to this section "><span class="invisible">TEI: Types of Document</span><span class="pilcrow">¶</span></a></span><span class="headingNumber"></span><span class="head">Types of Document</span></h4><p>A second key aspect of XML is its notion of a <span class="term">document type</span>: documents are regarded as having types, just as other objects processed by computers do. The type of a document is formally defined by its constituent parts and their structure. The definition of a ‘report’, for example, might be that it consisted of a ‘title’ and possibly an ‘author’, followed by an ‘abstract’ and a sequence of one or more ‘paragraphs’. Anything lacking a title, according to this formal definition, would not formally be a report, and neither would a sequence of paragraphs followed by an abstract, whatever other report-like characteristics these might have for the human reader.</p><p>If documents are of known types, a special-purpose program (called a <span class="term">parser</span>), once provided with an unambiguous definition of a document type, can check that any document claiming to be of that type does in fact conform to the specification. 
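</p><p>The <span class="q">‘report’</span> type described above can be sketched using XML's own document type declaration syntax, which is only one of several notations in which such a definition might be written. The declarations and the sample content below are assumptions made for illustration; they are not part of the TEI scheme: </p><pre class="pre_eg cdata">&lt;!DOCTYPE report [
 &lt;!-- a title, an optional author, an abstract, then one or more paragraphs --&gt;
 &lt;!ELEMENT report (title, author?, abstract, paragraph+)&gt;
 &lt;!ELEMENT title (#PCDATA)&gt;
 &lt;!ELEMENT author (#PCDATA)&gt;
 &lt;!ELEMENT abstract (#PCDATA)&gt;
 &lt;!ELEMENT paragraph (#PCDATA)&gt;
]&gt;
&lt;report&gt;
 &lt;title&gt;Annual Survey&lt;/title&gt;
 &lt;abstract&gt;A one-sentence summary.&lt;/abstract&gt;
 &lt;paragraph&gt;First paragraph of the report.&lt;/paragraph&gt;
&lt;/report&gt;</pre><p>A validating parser given these declarations would accept this document, in which the optional author is omitted, but would reject one lacking a title, or one in which the abstract followed the paragraphs. 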
A parser can check that all elements specified for a particular document type are present and no others, that they are combined in appropriate ways, correctly ordered, and so forth. More significantly, different documents of the same type can be processed in a uniform way. Programs can be written which take advantage of the knowledge encapsulated in the document type information, and which can thus behave in a more ‘intelligent’ fashion.</p></div><div class="div3" id="SG113"><div class="miniTOC miniTOC_right"><ul class="subtoc"><li class="subtoc"><span class="previousLink"> « </span><a class="navigation" href="SG.html#SG112"><span class="headingNumber"></span>Types of Document</a></li><li class="subtoc"></li><li class="subtoc"><a class="navigation" href="index.html">Home</a></li></ul></div><h4><span class="bookmarklink"><a class="bookmarklink" href="#SG113" title="link to this section "><span class="invisible">TEI: Data Independence</span><span class="pilcrow">¶</span></a></span><span class="headingNumber"></span><span class="head">Data Independence</span></h4><p>A basic design goal of XML is to ensure that documents encoded according to its provisions can move from one hardware and software environment to another without loss of information. The two features discussed so far both address this requirement at an abstract level; the third feature addresses it at the level of the strings of data characters that make up a document. All XML documents, whatever languages or writing systems they employ, use the same underlying character encoding (that is, the same method of representing as binary data those graphic forms making up a particular writing system).<span id="Note8_return"><a class="notelink" title="See Extensible Markup Language (XML) 1.0, available from , Section 2.2 Characters." 
href="#Note8"><sup>6</sup></a></span> This encoding is defined by an international standard,<span id="Note9_return"><a class="notelink" title="ISO/IEC 10646-1993 Information Technology — Universal Multiple-Octet Coded Character Set (UCS)" href="#Note9"><sup>7</sup></a></span> which is implemented by a universal character set maintained by an industry group called the Unicode Consortium, and known as Unicode.<span id="Note10_return"><a class="notelink" title="See" href="#Note10"><sup>8</sup></a></span> Unicode provides a standardized way of representing any of the many thousands of discrete symbols making up the world's writing systems, past and present.</p><p>Most modern computing systems now support Unicode directly; for those which do not, XML provides a mechanism for the indirect representation of single characters by means of their character number, known as <span class="term">character references</span>; see further <a class="link_ptr" href="SG.html#SG-er" title="Character References"><span class="headingNumber"></span>Character References</a>.</p></div></div><div class="div2" id="SG12"><div class="miniTOC miniTOC_right"><ul class="subtoc"><li class="subtoc"><span class="previousLink"> « </span><a class="navigation" href="SG.html#SG11"><span class="headingNumber">v.1. </span>What's Special about XML?</a></li><li class="subtoc"><span class="nextLink"> » </span><a class="navigation" href="SG.html#SG13"><span class="headingNumber">v.3. </span>XML Structures</a></li><li class="subtoc"><a class="navigation" href="index.html">Home</a></li></ul></div><h3><span class="bookmarklink"><a class="bookmarklink" href="#SG12" title="link to this section "><span class="invisible">TEI: Textual Structures</span><span class="pilcrow">¶</span></a></span><span class="headingNumber">v.2. </span><span class="head">Textual Structures</span></h3><p>A text is not an undifferentiated sequence of words, much less of bytes. 
For different purposes, it may be divided into many different units, of different types or sizes. A prose text such as this one might be divided into sections, chapters, paragraphs, and sentences. A verse text might be divided into cantos, stanzas, and lines. Once printed, sequences of prose and verse might be divided into volumes, gatherings, and pages.</p><p>Structural units of this kind are most often used to identify specific locations or refer to points within a text (<span class="q">‘the third sentence of the second paragraph in chapter ten’</span>; <span class="q">‘canto 10, line 1234’</span>; <span class="q">‘page 412’</span>, etc.) but they may also be used to subdivide a text into meaningful fragments for analytic purposes (<span class="q">‘is the average sentence length of section 2 different from that of section 5?’</span> <span class="q">‘how many paragraphs separate each occurrence of the word <span class="mentioned">nature</span>? how many pages?’</span>). Other structural units are more clearly analytic, in that they characterize a section of a text. A dramatic text might regard each speech by a different character as a unit of one kind, and stage directions or pieces of action as units of another kind. Such an analysis is less useful for locating parts of the text (<span class="q">‘the 93rd speech by Horatio in Act 2’</span>) than for facilitating comparisons between the words used by one character and those of another, or those used by the same character at different points of the play.</p><p>In a prose text one might similarly wish to regard as units of different types passages in direct or indirect speech, passages employing different stylistic registers (narrative, polemic, commentary, argument, etc.), passages of different authorship and so forth. 
And for certain types of analysis (most notably textual criticism) the physical appearance of one particular printed or manuscript source may be of importance: paradoxically, one may wish to use descriptive markup to describe presentational features such as typeface, line breaks, use of whitespace and so forth.</p><p>These textual structures overlap with one another in complex and unpredictable ways. Particularly when dealing with texts as instantiated by paper technology, the reader needs to be aware of both the physical organization of the book and the logical structure of the work it contains. Many great works (Sterne's <span class="titlem">Tristram Shandy</span> for example) cannot be fully appreciated without an awareness of the interplay between narrative units (such as chapters or paragraphs) and presentational ones (such as page divisions). For many types of research, the interplay among different levels of analysis is crucial: the extent to which syntactic structure and narrative structure mesh, or fail to mesh, for example, or the extent to which phonological structures reflect morphology.</p></div><div class="div2" id="SG13"><div class="miniTOC miniTOC_right"><ul class="subtoc"><li class="subtoc"><span class="previousLink"> « </span><a class="navigation" href="SG.html#SG12"><span class="headingNumber">v.2. </span>Textual Structures</a></li><li class="subtoc"><span class="nextLink"> » </span><a class="navigation" href="SG.html#SG152"><span class="headingNumber">v.4. </span>Complicating the Issue</a></li><li class="subtoc"><a class="navigation" href="index.html">Home</a></li></ul></div><h3><span class="bookmarklink"><a class="bookmarklink" href="#SG13" title="link to this section "><span class="invisible">TEI: XML Structures</span><span class="pilcrow">¶</span></a></span><span class="headingNumber">v.3. 
</span><span class="head">XML Structures</span></h3><p>This section describes the simple and consistent mechanism for the markup or identification of textual structure provided by XML. It also describes the methods XML provides for the expression of rules defining how units of textual structure can meaningfully be combined in a text.</p><div class="div3" id="SG131"><div class="miniTOC miniTOC_right"><ul class="subtoc"><li class="subtoc"></li><li class="subtoc"><span class="nextLink"> » </span><a class="navigation" href="SG.html#SG132"><span class="headingNumber"></span>Content Models: an Example</a></li><li class="subtoc"><a class="navigation" href="index.html">Home</a></li></ul></div><h4><span class="bookmarklink"><a class="bookmarklink" href="#SG131" title="link to this section "><span class="invisible">TEI: Elements</span><span class="pilcrow">¶</span></a></span><span class="headingNumber"></span><span class="head">Elements</span></h4><p>The technical term used in XML for a textual unit, viewed as a structural component, is <span class="term">element</span>. Different types of elements are given different names, but XML provides no way of expressing the meaning of a particular type of element, other than its relationship to other element types. That is, all one can say about an element called (say) <span class="gi">&lt;blort&gt;</span> is that instances of it may (or may not) occur within elements of type <span class="gi">&lt;farble&gt;</span>, and that it may (or may not) be decomposed into elements of type <span class="gi">&lt;blortette&gt;</span>. It should be stressed that XML is entirely unconcerned with the <em>semantics</em> of textual elements, because these are considered to be application dependent. It is up to the creators of XML vocabularies (such as these Guidelines) to choose intelligible element names and to define their intended use in text markup. That is the chief purpose of documents such as the TEI Guidelines. 
From the need to choose element names indicative of function comes the technical term for the name of an element type, which is <span class="term">generic identifier</span>, or GI.</p><div class="p">Within a marked-up text (a <span class="term">document instance</span>), each element must be explicitly marked or tagged in some way. This is done by inserting a tag at the beginning of the element (a <span class="term">start-tag</span>) and another at its end (an <span class="term">end-tag</span>). The start- and end-tag pair are used to bracket off element occurrences within the running text, in rather the same way as different types of parentheses or quotation marks are used in conventional punctuation. For example, a quotation element in a text might be tagged as follows: <div id="index-egXML-d52e1423" class="pre egXML_valid">... Rosalind's<br /> remarks <span class="element">&lt;quote&gt;</span>This is the silliest stuff that ere I heard<br />   of!<span class="element">&lt;/quote&gt;</span> clearly indicate ...</div> As this example shows, a start-tag takes the form <span class="tag">&lt;quote&gt;</span>, where the opening angle bracket indicates the start of the start-tag, <span class="q">‘quote’</span> is the generic identifier of the element that is being delimited, and the closing angle bracket indicates the end of the start-tag. 
An end-tag takes an identical form, except that the opening angle bracket is followed by a solidus (slash) character, so that the corresponding end-tag is <span class="tag">&lt;/quote&gt;</span>.<span id="Note11_return"><a class="notelink" title="Because the opening angle bracket has this special function in an XML document, special steps must be taken to use that character for other purposes (…" href="#Note11"><sup>9</sup></a></span> The material between the start-tag and the end-tag (the string of words <span class="q">‘This is the silliest stuff that ere I heard of’</span> in the example above) is known as the <span class="term">content</span> of the element. Sometimes there may be nothing between the start and the end-tag; in this case the two may optionally be merged together into a single composite tag with the solidus at the end, like this: <span class="tag">&lt;quote/&gt;</span>.</div></div><div class="div3" id="SG132"><div class="miniTOC miniTOC_right"><ul class="subtoc"><li class="subtoc"><span class="previousLink"> « </span><a class="navigation" href="SG.html#SG131"><span class="headingNumber"></span>Elements</a></li><li class="subtoc"><span class="nextLink"> » </span><a class="navigation" href="SG.html#SG14"><span class="headingNumber"></span>Validating a Document's Structure</a></li><li class="subtoc"><a class="navigation" href="index.html">Home</a></li></ul></div><h4><span class="bookmarklink"><a class="bookmarklink" href="#SG132" title="link to this section "><span class="invisible">TEI: Content Models: an Example</span><span class="pilcrow">¶</span></a></span><span class="headingNumber"></span><span class="head">Content Models: an Example</span></h4><p>An element may be <span class="term">empty</span>, that is, it may have no content at all, or it may contain just a sequence of characters with no other elements. 
Often, however, elements of one type will be <span class="term">embedded</span> (contained entirely) within elements of a different type.</p><div class="p">To illustrate this, we will consider a very simple structural model. Let us assume that we wish to identify within an anthology only poems, their headings, and the stanzas and lines of which they are composed. In XML terms, our <span class="term">document type</span> is the <span class="term">anthology</span>, and it consists of a series of <span class="term">poem</span>s. Each poem has embedded within it one element, a <span class="term">heading</span>, and several occurrences of another, a <span class="term">stanza</span>, each stanza having embedded within it a number of <span class="term">line</span> elements. Fully marked up, a text conforming to this model might appear as follows: <div id="index-egXML-d52e1485" class="pre egXML_invalid"><span class="element">&lt;anthology&gt;</span><br /> <span class="element">&lt;poem&gt;</span><br />  <span class="element">&lt;heading&gt;</span>The SICK ROSE<span class="element">&lt;/heading&gt;</span><br />  <span class="element">&lt;stanza&gt;</span><br />   <span class="element">&lt;line&gt;</span>O Rose thou art sick.<span class="element">&lt;/line&gt;</span><br />   <span class="element">&lt;line&gt;</span>The invisible worm,<span class="element">&lt;/line&gt;</span><br />   <span class="element">&lt;line&gt;</span>That flies in the night<span class="element">&lt;/line&gt;</span><br />   <span class="element">&lt;line&gt;</span>In the howling storm:<span class="element">&lt;/line&gt;</span><br />  <span class="element">&lt;/stanza&gt;</span><br />  <span class="element">&lt;stanza&gt;</span><br />   <span class="element">&lt;line&gt;</span>Has found out thy bed<span class="element">&lt;/line&gt;</span><br />   <span class="element">&lt;line&gt;</span>Of crimson joy:<span class="element">&lt;/line&gt;</span><br />   <span class="element">&lt;line&gt;</span>And his 
dark secret love<span class="element">&lt;/line&gt;</span><br />   <span class="element">&lt;line&gt;</span>Does thy life destroy.<span class="element">&lt;/line&gt;</span><br />  <span class="element">&lt;/stanza&gt;</span><br /> <span class="element">&lt;/poem&gt;</span><br /><span class="comment">&lt;!-- more poems go here --&gt;</span><br /><span class="element">&lt;/anthology&gt;</span><div style="float: right;"><a href="BIB.html#SG132-neg-1">bibliography</a> </div></div></div><p>It should be stressed that this example does <em>not</em> use the names proposed for corresponding elements elsewhere in these Guidelines: the above is thus <em>not</em> a valid TEI document.<span id="Note12_return"><a class="notelink" title="The element names here have been chosen for clarity of exposition; there is, however, a TEI element corresponding to each." href="#Note12"><sup>10</sup></a></span> It will, however, serve as an introduction to the basic notions of XML. Whitespace and line breaks have been added to the example for the sake of visual clarity only; they have no particular significance in the XML encoding itself. Also, the line </p><pre class="pre_eg cdata">&lt;!-- more poems go here --&gt;</pre><p> is an XML <span class="term">comment</span> and is not treated as part of the text.</p><p>As it stands, the above example is what is known as a <span class="term">well-formed</span> XML document because it obeys the following simple rules: </p><ol class="numbered"><li class="item">there is a single element enclosing the whole document: this is known as the <span class="term">root element</span> (<span class="gi">&lt;anthology&gt;</span> in our case);</li><li class="item">each element is completely contained by the root element, or by an element that is so contained; elements do not partially overlap one another;</li><li class="item">a tag explicitly marks the start and end of each element.</li></ol><p>A well-formed XML document can be processed in a number of useful ways. 
A simple indexing program could extract only the relevant text elements in order to make a list of headings, first lines, or words used in the poem text; a simple formatting program could insert blank lines between stanzas, perhaps indenting the first line of each, or inserting a stanza number. Different parts of each poem could be typeset in different ways. A more ambitious analytic program could relate the use of punctuation marks to stanzaic and metrical divisions.<span id="Note13_return"><a class="notelink" title="Note that this simple example has not addressed the problem of marking elements such as sentences explicitly; the implications of this are discussed i…" href="#Note13"><sup>11</sup></a></span> Scholars wishing to see the implications of changing the stanza or line divisions chosen by the editor of this poem can do so simply by altering the position of the tags. And of course, the text as presented above can be transported from one computer to another and processed by any program (or person) capable of making sense of the tags embedded within it with no need for the sort of transformations and translations needed for files which have been saved in one or other of the proprietary formats preferred by most word-processing programs.</p><p>As we noted above, one of the attractions of XML is that it enables us to make up our own names for the elements rather than requiring us always to use names predefined by other agencies. Clearly, however, if we wish to exchange our poems with others, or to include poems others have marked up in our anthology, we will need to know a bit more about the names used for the tags. The means that XML provides for this is called a <span class="term">namespace</span>. In our simple example, the tags just contain a simple name. As we shall see, it is also possible to use tags that include a <span class="term">qualified name</span>, that is, a name with an optional prefix identifying the set of names to which it belongs. 
For example, we have defined an element <span class="gi">&lt;line&gt;</span> for the purpose of marking lines of verse. Another person might, however, define an element called <span class="gi">&lt;line&gt;</span> for the purpose of marking typographic lines, or drawn lines. Because of these different meanings, if we wish to share data it will be necessary to distinguish the two <span class="q">‘line’</span> components in our marked-up texts. This is achieved by including a <span class="term">namespace prefix</span> within the markup, for example like this: </p><pre class="pre_eg cdata"> &lt;my:line&gt;This is one of my lines&lt;/my:line&gt; 
&lt;!-- ... --&gt; 
&lt;yr:line&gt;This is one of your lines&lt;/yr:line&gt;</pre><p> This feature is particularly important if we have different definitions of what a ‘line’ is, of course, but there are many occasions when it is useful to distinguish groups of tags belonging to different ‘markup vocabularies’; we discuss this further below (<a class="link_ptr" href="SG.html#SGname" title="Namespaces"><span class="headingNumber"></span>Namespaces</a>). One particularly useful namespace prefix is predefined for XML: it is <code>xml</code> and we will see examples of its use below.</p><p>Namespaces allow us to represent the fact that a name belongs to a group of names, but don't allow us to do much more by way of checking the integrity or accuracy of our tagging. Simple well-formedness alone is not enough for the full range of what might be useful in marking up a document. It might well be useful if, in the process of preparing our digital anthology, a computer system could check some basic rules about how stanzas, lines, and headings can sensibly co-occur in a document. It would be even more useful if the system could check that stanzas are always tagged <span class="tag">&lt;stanza&gt;</span> and not occasionally <span class="tag">&lt;canto&gt;</span> or <span class="tag">&lt;Stanza&gt;</span>. An XML document in which such rules have been checked is technically known as a <span class="term">valid</span> document, and the ability to perform such validation is one of the key advantages of using XML. To carry this out, some way of formally stating the criteria for successful validation is necessary: in XML this formal statement is provided by an additional document known as a <span class="term">schema</span>.<span id="Note14_return"><a class="notelink" title="The older terms Document Type Declaration and Document Type Definition, both abbreviated as DTD, may also be encountered. 
Throughout these Guidelines …" href="#Note14"><sup>12</sup></a></span></p></div><div class="div3" id="SG14"><div class="miniTOC miniTOC_right"><ul class="subtoc"><li class="subtoc"><span class="previousLink"> « </span><a class="navigation" href="SG.html#SG132"><span class="headingNumber"></span>Content Models: an Example</a></li><li class="subtoc"><span class="nextLink"> » </span><a class="navigation" href="SG.html#SG141bis"><span class="headingNumber"></span>An Example Schema</a></li><li class="subtoc"><a class="navigation" href="index.html">Home</a></li></ul></div><h4><span class="bookmarklink"><a class="bookmarklink" href="#SG14" title="link to this section "><span class="invisible">TEI: Validating a Document's Structure</span><span class="pilcrow">¶</span></a></span><span class="headingNumber"></span><span class="head">Validating a Document's Structure</span></h4><p>The design of a schema may be as lax or as restrictive as the occasion warrants. A balance must be struck between the convenience of following simple rules and the complexity of handling real texts. This is particularly the case when the rules being defined relate to texts that already exist: the designer may have only the haziest of notions as to an ancient text's original purpose or meaning and hence find it very difficult to specify consistent rules about its structure. On the other hand, where a new text is being prepared to an exact specification, for entry into a textual database of some kind for example, the more precisely stated the rules, the better they can be enforced. Even in the case where an existing text is being marked up, it may be beneficial to define a restrictive set of rules relating to one particular view or hypothesis about the text—if only as a means of testing the usefulness of that view or hypothesis. 
A schema designed for use by a small project or team is likely to take a different position on such issues than one intended for use by a large and possibly fragmented community. It is important to remember that every schema results from an interpretation of a text. There is no single schema encompassing the absolute truth about any text, although it may be convenient to privilege some schemas above others for particular types of analysis.</p><p>XML is widely used in environments where uniformity of document structure is a major desideratum. In the production of technical documentation, for example, it is of major importance that sections and subsections should be properly nested, that cross-references should be properly resolved and so forth. In such situations, documents are seen as raw material to match against predefined sets of rules. As discussed above, however, the use of simple rules can also greatly simplify the task of tagging accurately elements of less rigidly constrained texts. 
By making these rules explicit, the scholar reduces his or her own burdens in marking up and verifying the electronic text, while also being forced to make explicit an interpretation of the structure and significant particularities of the text being encoded.</p></div><div class="div3" id="SG141bis"><div class="miniTOC miniTOC_right"><ul class="subtoc"><li class="subtoc"><span class="previousLink"> « </span><a class="navigation" href="SG.html#SG14"><span class="headingNumber"></span>Validating a Document's Structure</a></li><li class="subtoc"></li><li class="subtoc"><a class="navigation" href="index.html">Home</a></li></ul></div><h4><span class="bookmarklink"><a class="bookmarklink" href="#SG141bis" title="link to this section "><span class="invisible">TEI: An Example Schema</span><span class="pilcrow">¶</span></a></span><span class="headingNumber"></span><span class="head">An Example Schema</span></h4><p>A schema can be expressed in a number of different ways; frequently-encountered methods include the Document Type Definition (DTD) language which XML inherited from SGML; the XML Schema language (<a class="link_ptr" href="http://www.w3.org/XML/Schema"><span>http://www.w3.org/XML/Schema</span></a>) defined by the W3C; and the RELAX NG language (<a class="link_ptr" href="http://relaxng.org/"><span>http://relaxng.org/</span></a>) originally developed within the OASIS Technical Committee and now an ISO standard<span id="Note15_return"><a class="notelink" title="ISO/IEC FDIS 19757-2 Document Schema Definition Language (DSDL) -- Part 2: Regular-grammar-based validation -- RELAX NG" href="#Note15"><sup>13</sup></a></span>. In this chapter, and throughout these Guidelines, we give examples using the ‘compact syntax’ of RELAX NG, but the specifications within these Guidelines are expressed in a way that is largely independent of the specific language in which a schema generated from them is expressed.<span id="Note16_return"><a class="notelink" title="See further and . 
In practice, the only part of a TEI element specification not expressed using TEI-defined syntax is the content model for an elemen…" href="#Note16"><sup>14</sup></a></span> Although we will use the RELAX NG compact syntax for illustration in what follows, the reader should bear in mind that analogous concepts are expressed differently in other schema languages.</p><p>The following schema might be used to validate our example poem: </p><pre class="pre_eg cdata">
anthology_p = element anthology { poem_p+ }
poem_p = element poem { heading_p?, stanza_p+ }
stanza_p = element stanza {line_p+}
heading_p = element heading { text }
line_p = element line { text }
start = anthology_p
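# Illustration only, written as comments because it is XML rather than RELAX NG
# (the content "some text" is a placeholder, not part of the example poem).
# The smallest document these patterns accept has one poem, one stanza, one line:
#   &lt;anthology&gt;&lt;poem&gt;&lt;stanza&gt;&lt;line&gt;some text&lt;/line&gt;&lt;/stanza&gt;&lt;/poem&gt;&lt;/anthology&gt;
# An empty &lt;anthology/&gt; would be rejected, since poem_p+ requires at least one poem.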
</pre><p>Note that this is not the only way in which a RELAX NG schema might be written;<span id="Note17_return"><a class="notelink" title="For a good tutorial introduction to RELAX NG, see ." href="#Note17"><sup>15</sup></a></span> we have adopted this idiom, however, because it matches that used throughout the rest of these Guidelines.</p><p>A RELAX NG schema expresses rules about the possible structure of a document in terms of <span class="term">patterns</span>; that is, it defines a number of named patterns, each of which acts as a kind of template against which an input document can be matched. The meaning of a pattern is expressed in a schema by reference to other patterns, or to a small number of built-in fundamental concepts, as we shall see. In the example above, the word to the left of the equals sign is the pattern's name, and the material following it declares a meaning for the pattern. Patterns may also be of particular types; the ones that interest us here are called <span class="term">element patterns</span> and <span class="term">attribute patterns</span>. In this example we see definitions for five element patterns. Note that we have used similar names for the pattern and the element which the pattern describes: so, for example, the line <code>anthology_p = element anthology {poem_p+}</code> defines an element pattern called <code>anthology_p</code>, the value of which defines an element called <code>anthology</code>. These naming conventions are arbitrary; we could use the same name for the pattern as for the element, since the two are syntactically quite distinct. The name, or <span class="term">generic identifier</span>, of the element follows the word <span class="q">‘element’</span>, and the <span class="term">content model</span> for the element is given within the curly braces following that. 
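</p><p>As a reading aid only, one of the declarations from the schema above can be laid out with each part labelled. The column labels are explanatory comments added here for illustration, and the second declaration, which reuses the name <code>line</code> for both the pattern and the element, is a hypothetical alternative rather than the idiom adopted in these Guidelines: </p><pre class="pre_eg cdata">
# pattern name   keyword    generic identifier   content model
stanza_p       = element    stanza               { line_p+ }

# hypothetical alternative: the same name used for pattern and element
line           = element    line                 { text }
</pre><p>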
Each of these parts is discussed further below.</p><p>The last line of the schema above tells a RELAX NG validator which element (or elements) in a document can be used as the root element: in our case only <span class="gi">&lt;anthology&gt;</span>. This enables the validator to detect whether a particular document is well-formed but incomplete; it also simplifies the processing task by providing an <span class="q">‘entry point’</span>.</p><div class="div4" id="SG141x"><div class="miniTOC miniTOC_right"><ul class="subtoc"><li class="subtoc"></li><li class="subtoc"><span class="nextLink"> » </span><a class="navigation" href="SG.html#SG143"><span class="headingNumber"></span>Content Model</a></li><li class="subtoc"><a class="navigation" href="index.html">Home</a></li></ul></div><h5><span class="bookmarklink"><a class="bookmarklink" href="#SG141x" title="link to this section "><span class="invisible">TEI: Generic Identifier</span><span class="pilcrow">¶</span></a></span><span class="headingNumber"></span><span class="head">Generic Identifier</span></h5><p>Following the word <span class="q">‘element’</span> each pattern declaration gives the generic identifier (often abbreviated to GI) of the element being defined, for example <span class="mentioned">poem</span>, <span class="mentioned">heading</span>, etc. 
A GI may contain letters, digits, hyphens, underscore characters, or full stops, but must begin with a letter or an underscore.<span id="Note18_return"><a class="notelink" title="In XML, a single colon may also appear in a GI, where it has a special significance related to the use of namespaces, as further discussed in section …" href="#Note18"><sup>16</sup></a></span> Uppercase and lowercase letters are quite distinct: an element with the GI <span class="gi">&lt;foo&gt;</span> is <em>not</em> the same as an element with the GI <span class="gi">&lt;Foo&gt;</span>; the root element of a TEI-conformant document is <a class="gi" title="(TEI document) contains a single TEI-conformant document, combining a single TEI header with one or more members of the model.resourceLike class. Multiple &lt;TEI&gt; elements may be combined to form a &lt;teiCorpus&gt; element." href="ref-TEI.html">TEI</a>, <em>not</em> <span class="gi">&lt;tei&gt;</span>.</p></div><div class="div4" id="SG143"><div class="miniTOC miniTOC_right"><ul class="subtoc"><li class="subtoc"><span class="previousLink"> « </span><a class="navigation" href="SG.html#SG141x"><span class="headingNumber"></span>Generic Identifier</a></li><li class="subtoc"><span class="nextLink"> » </span><a class="navigation" href="SG.html#SG144"><span class="headingNumber"></span>Occurrence Indicators</a></li><li class="subtoc"><a class="navigation" href="index.html">Home</a></li></ul></div><h5><span class="bookmarklink"><a class="bookmarklink" href="#SG143" title="link to this section "><span class="invisible">TEI: Content Model</span><span class="pilcrow">¶</span></a></span><span class="headingNumber"></span><span class="head">Content Model</span></h5><p>The second part of each declaration, enclosed in curly braces, is called the <span class="term">content model</span> of the element being defined, because it specifies what may legitimately be contained within it. 
In RELAX NG, the content model is defined in terms of other patterns, either by embedding them, or (as in our examples above) by naming or referring to them. The RELAX NG compact syntax also uses a small number of reserved words to identify other possible contents for an element, of which by far the most commonly encountered is <code>text</code>, as in this example: it means that the element being defined may contain any valid character data, but no elements. If an XML document is thought of as a structure like a family tree, with a single ancestor at the top (in our case, this would be <span class="tag">&lt;anthology&gt;</span>), then almost always, following the branches of the tree downwards (for example, from <span class="gi">&lt;anthology&gt;</span> to <span class="gi">&lt;poem&gt;</span> to <span class="gi">&lt;stanza&gt;</span> to <span class="gi">&lt;line&gt;</span> and <span class="gi">&lt;heading&gt;</span>) will lead eventually to <code>text</code>. In our example, <span class="gi">&lt;heading&gt;</span> and <span class="gi">&lt;line&gt;</span> are so defined, since their content models say <code>text</code> only and name no embedded elements.</p></div><div class="div4" id="SG144"><div class="miniTOC miniTOC_right"><ul class="subtoc"><li class="subtoc"><span class="previousLink"> « </span><a class="navigation" href="SG.html#SG143"><span class="headingNumber"></span>Content Model</a></li><li class="subtoc"><span class="nextLink"> » </span><a class="navigation" href="SG.html#SG145"><span class="headingNumber"></span>Connectors</a></li><li class="subtoc"><a class="navigation" href="index.html">Home</a></li></ul></div><h5><span class="bookmarklink"><a class="bookmarklink" href="#SG144" title="link to this section "><span class="invisible">TEI: Occurrence Indicators</span><span class="pilcrow">¶</span></a></span><span class="headingNumber"></span><span class="head">Occurrence Indicators</span></h5><p>The declaration for <span class="gi">&lt;stanza&gt;</span> 
in the example above states that a stanza consists of one or more lines. It uses an <span class="term">occurrence indicator</span> (the plus sign) to indicate how many times something matching the pattern <code>line_p</code> may be repeated. There are three occurrence indicators: the plus sign, the question mark, and the asterisk or star. The plus sign means that the pattern must match at least once and may match repeatedly; the question mark means that it may match once or not at all; the star means that it may match any number of times, including none. Thus, if the content model for <span class="gi">&lt;stanza&gt;</span> were <code>{line_p*}</code>, stanzas with no lines would be possible as well as those with one or more lines. If it were <code>{line_p?}</code>, again empty stanzas would be countenanced, but no stanza could have more than a single line. The declaration for <span class="gi">&lt;poem&gt;</span> in the example above thus states that a <span class="gi">&lt;poem&gt;</span> cannot have more than one heading, but may have none, and that it must have at least one <span class="gi">&lt;stanza&gt;</span> and may have several.</p></div><div class="div4" id="SG145"><div class="miniTOC miniTOC_right"><ul class="subtoc"><li class="subtoc"><span class="previousLink"> « </span><a class="navigation" href="SG.html#SG144"><span class="headingNumber"></span>Occurrence Indicators</a></li><li class="subtoc"><span class="nextLink"> » </span><a class="navigation" href="SG.html#SG146"><span class="headingNumber"></span>Groups</a></li><li class="subtoc"><a class="navigation" href="index.html">Home</a></li></ul></div><h5><span class="bookmarklink"><a class="bookmarklink" href="#SG145" title="link to this section "><span class="invisible">TEI: Connectors</span><span class="pilcrow">¶</span></a></span><span class="headingNumber"></span><span class="head">Connectors</span></h5><p>The content model <code>{heading_p?, stanza_p+}</code> contains more than one 
component, and thus needs additionally to specify the order in which these patterns (<code>heading_p</code> and <code>stanza_p</code>) may appear. This ordering is determined by the <span class="term">connector</span> (the comma) used between its components. The comma connector indicates that the patterns concerned must appear in the sequence given. Another commonly encountered connector is the vertical bar, representing alternation. If the comma in this example were replaced by a vertical bar, then a <span class="gi">&lt;poem&gt;</span> would consist of either a heading or just stanzas, but not both!</p></div><div class="div4" id="SG146"><div class="miniTOC miniTOC_right"><ul class="subtoc"><li class="subtoc"><span class="previousLink"> « </span><a class="navigation" href="SG.html#SG145"><span class="headingNumber"></span>Connectors</a></li><li class="subtoc"></li><li class="subtoc"><a class="navigation" href="index.html">Home</a></li></ul></div><h5><span class="bookmarklink"><a class="bookmarklink" href="#SG146" title="link to this section "><span class="invisible">TEI: Groups</span><span class="pilcrow">¶</span></a></span><span class="headingNumber"></span><span class="head">Groups</span></h5><p>In our example so far, the components of each content model have been either single patterns or <code>text</code>. It is quite permissible, however, to define content models in which the components are lists of patterns, combined by connectors. Such lists may also be modified by occurrence indicators and themselves combined by connectors. To demonstrate these facilities, let us expand our example to include non-stanzaic types of verse. For the sake of demonstration, we will categorize poems as one of the following: <span class="term">stanzaic</span>, <span class="term">couplets</span>, or <span class="term">blank</span> (or <span class="term">stichic</span>). 
A blank-verse poem consists simply of lines (we ignore the possibility of verse paragraphs for the moment),<span id="Note19_return"><a class="notelink" title="It will not have escaped the astute reader that the fact that verse paragraphs need not start on a line boundary seriously complicates the issue; see …" href="#Note19"><sup>17</sup></a></span> so no additional elements need be defined for it. A couplet is defined as a <span class="gi">&lt;firstLine&gt;</span> followed by a <span class="gi">&lt;secondLine&gt;</span>. </p><pre class="pre_eg cdata">couplet_p = element couplet {firstLine_p, secondLine_p}</pre><p>The patterns <code>firstLine_p</code> and <code>secondLine_p</code> define elements <span class="gi">&lt;firstLine&gt;</span> and <span class="gi">&lt;secondLine&gt;</span> (which are distinguished to enable studies of rhyme scheme, for example<span id="Note20_return"><a class="notelink" title="This is however a rather artificial example; XPath, for example, provides ways of distinguishing elements in an XML structure by their position withou…" href="#Note20"><sup>18</sup></a></span>); these will have exactly the same content model as the existing <span class="gi">&lt;line&gt;</span> element. We will therefore add the following two lines to our example schema: </p><pre class="pre_eg cdata">firstLine_p = element firstLine {text}
secondLine_p = element secondLine {text}</pre><p> Next, we can change the declaration for the <span class="gi">&lt;poem&gt;</span> element to include all three possibilities: </p><pre class="pre_eg cdata"> poem_p = element poem
{ heading_p?, (stanza_p+ | couplet_p+ | line_p+) }</pre><p> That is, a poem consists of an optional heading, followed by one or several stanzas, or one or several couplets, or one or several lines. Note the difference between this declaration and the following: </p><pre class="pre_eg cdata">poem_p = element poem
{heading_p?, (stanza_p | couplet_p | line_p)+ }</pre><p> The second version, by applying the occurrence indicator to the group rather than to each element within it, would allow a single poem to contain a mixture of stanzas, couplets, and lines.</p><p>A group of this kind can contain <code>text</code> as well as named elements: this combination, known as <span class="term">mixed content</span>, allows for elements in which the sub-components appear with intervening stretches of character data. For example, if we wished to mark place names wherever they appear inside our verse lines, then, assuming we have also added a pattern for the <span class="gi">&lt;name&gt;</span> element, we could change the definition for <span class="gi">&lt;line&gt;</span> to </p><pre class="pre_eg cdata">line_p = element
line { (text | name_p )* }</pre><p>Some XML schema languages place no constraints on the way that mixed content models may be defined, but in the XML DTD language, when <code>text</code> appears with other elements in a content model, it must always appear as the first option in an alternation; it may appear once only, and in the outermost model group; and if the group containing it is repeated, the star operator must be used. Although these constraints do not apply to (for example) schemas expressed in the RELAX NG language, all TEI content models currently obey them.</p><p>Quite complex models can easily be built up in this way, to match the structural complexity of many types of text. As a further example, consider the case of stanzaic verse in which a refrain or chorus appears. Like a stanza, a refrain consists of repetitions of the line element. A refrain can appear at the start of a poem only, or as an optional addition following each stanza. This could be expressed by a pattern such as the following: </p><pre class="pre_eg cdata">refrain_p = element refrain {line_p+}
poem_p = element poem   {heading_p?, ( line_p+ | (refrain_p?, (stanza_p,
refrain_p?)+ )) }</pre><p> That is, a poem consists of an optional heading, followed by either a sequence of lines or an unnamed group, which starts with an optional refrain and is followed by one or more occurrences of another group, each member of which is composed of a stanza followed by an optional refrain. A sequence such as <span class="mentioned">refrain - stanza - stanza - refrain</span> follows this pattern, as does the sequence <span class="mentioned">stanza - refrain - stanza - refrain</span>. The sequence <span class="mentioned">refrain - refrain - stanza - stanza</span> does not, however, and neither does the sequence <span class="mentioned">stanza - refrain - refrain - stanza.</span> Among other conditions made explicit by this content model are the requirements that at least one stanza must appear in a poem, if it is not composed simply of lines, and that if there is both a heading and a stanza they must appear in that order.</p><p>Note that the apparent complexity of this model derives from the constraints expressed informally above. A simpler model, such as </p><pre class="pre_eg cdata">poem_p =
element poem {heading_p?, (line_p | refrain_p | stanza_p)+ }</pre><p> would not enforce any of them, and would therefore permit such anomalies as a poem consisting only of refrains, or an arbitrary mixture of lines and refrains.</p></div></div></div><div class="div2" id="SG152"><div class="miniTOC miniTOC_right"><ul class="subtoc"><li class="subtoc"><span class="previousLink"> « </span><a class="navigation" href="SG.html#SG13"><span class="headingNumber">v.3. </span>XML Structures</a></li><li class="subtoc"><span class="nextLink"> » </span><a class="navigation" href="SG.html#SG16"><span class="headingNumber">v.5. </span>Attributes</a></li><li class="subtoc"><a class="navigation" href="index.html">Home</a></li></ul></div><h3><span class="bookmarklink"><a class="bookmarklink" href="#SG152" title="link to this section "><span class="invisible">TEI: Complicating the Issue</span><span class="pilcrow">¶</span></a></span><span class="headingNumber">v.4. </span><span class="head">Complicating the Issue</span></h3><p>In the simple cases described so far, we have assumed that one can identify the immediate constituents of every element in a textual structure. A poem consists of stanzas, and an anthology consists of poems. Stanzas do not float around unattached to poems or combined into some other unrelated element; a poem cannot contain an anthology. All the elements of a given document type may be arranged into a hierarchic structure like a family tree, with a single ancestor at one end and many children (mostly the elements containing simple text) at the other. 
For example, we could represent an anthology containing two poems, the first of which contains two four-line stanzas and the second a single stanza, by a tree structure like the following figure:</p><figure class="figure"><img src="Images/xmlpic.png" alt="" class="graphic" style=" width:80%;" /></figure><p>This graphic representation of the structure of an XML document is close to the abstract model implicit in most XML processing systems. Most such systems now use a standardized way of accessing parts of an XML document called <span class="term">XPath</span>.<span id="Note21_return"><a class="notelink" title="The official specification is at ; many introductory tutorials are available in the XML references cited above and elsewhere on the Web: good beginner…" href="#Note21"><sup>19</sup></a></span> XPath gives us a non-graphical way of referring to any part of an XML document: for example, we might refer to the last line of Blake's poem as <code>/anthology/poem[1]/stanza[2]/line[4]</code>. The square brackets here indicate a numerical selection: we are talking about the fourth line in the second stanza of the first poem in the anthology. If we left out all the square-bracketed selections, the corresponding XPath expression would refer to all lines contained by stanzas contained by poems contained by the anthology. An XPath expression can refer to any collection of elements: for example, the expression <code>/anthology/poem</code> refers to all poems in an anthology and the expression <code>/anthology/poem/heading</code> refers to all their headings.</p><p>The solidus within an XPath expression behaves in much the same way as the solidus or backslash in a filename specification: it indicates that the item to the left directly contains the item to the right of it. In XPath it is also possible to indicate that any number of other items may intervene by repeating the solidus. 
For example, the XPath expression <code>/anthology/poem//line[1]</code> will refer to every line that is the first <span class="gi">&lt;line&gt;</span> within its parent element, at any depth inside a poem: the first line of each stanza, for instance, as well as the first line of a poem whose lines are not grouped into stanzas.</p><p>Clearly, there are many such trees that might be drawn to describe the structure of this or other anthologies. Some of them might be representable as further subdivisions of this tree: for example, we might subdivide the lines into individual words, since in our simple example no word crosses a line boundary. Surprisingly perhaps, this grossly simplified view of what text is (memorably termed an <span class="term">ordered hierarchy of content objects</span> (OHCO) view of text by Renear <span class="foreign">et al.</span><span id="Note22_return"><a class="notelink" title="See ." href="#Note22"><sup>20</sup></a></span>) turns out to be very effective for a large number of purposes. It is not, however, adequate for the full complexity of real textual structures, for which more complex mechanisms need to be employed. There are many other trees that might be drawn which do <em>not</em> fit within the anthology model which we have presented so far. We might, for example, be interested in syntactic structures or other linguistic constructs, which rarely respect the formal boundaries of verse. Or, to take a simpler example, we might want to represent the pagination of different editions of the same text.</p><p>In the OHCO model of text, representation of cases where different elements overlap so that several different trees may be identified in the same document is generally problematic. All the elements marked up in a document, no matter what namespace they belong to, must fit within a single hierarchy. To represent overlapping structures, therefore, a single hierarchy must be chosen, and the points at which other hierarchies intersect with it marked. 
For example, we might choose the verse structure as our primary hierarchy, and then mark the pagination by means of empty elements inserted at the boundary points between one page and the next. Or we could represent alternative hierarchies by means of the pointing and linking mechanisms described in chapter <a class="link_ptr" href="SA.html" title="14"><span class="headingNumber">16 </span>Linking, Segmentation, and Alignment</a> of these Guidelines. These mechanisms all depend on the use of <span class="term">attributes</span>, which may be used both to identify particular elements within a document and to point to, link, or align them into arbitrary structures.</p></div><div class="div2" id="SG16"><div class="miniTOC miniTOC_right"><ul class="subtoc"><li class="subtoc"><span class="previousLink"> « </span><a class="navigation" href="SG.html#SG152"><span class="headingNumber">v.4. </span>Complicating the Issue</a></li><li class="subtoc"><span class="nextLink"> » </span><a class="navigation" href="SG.html#SG-oth"><span class="headingNumber">v.6. </span>Other Components of an XML Document</a></li><li class="subtoc"><a class="navigation" href="index.html">Home</a></li></ul></div><h3><span class="bookmarklink"><a class="bookmarklink" href="#SG16" title="link to this section "><span class="invisible">TEI: Attributes</span><span class="pilcrow">¶</span></a></span><span class="headingNumber">v.5. </span><span class="head">Attributes</span></h3><p>In the XML context, the word <span class="mentioned">attribute</span>, like some other words, has a specific technical sense. It is used to describe information that is in some sense descriptive of a specific element occurrence but not regarded as part of its content. 
For example, you might wish to add a <span class="att">status</span> attribute to occurrences of some elements in a document to indicate their degree of reliability, or to add an <span class="att">identifier</span> attribute so that you could refer to particular element occurrences from elsewhere within a document. Attributes are useful in precisely such circumstances.</p><p>Although different elements may have attributes with the same name (for example, in the TEI scheme, every element is defined as having an attribute named <span class="att">n</span>), they are always regarded as different, and may have different values assigned to them. If an element has been defined as having attributes, the attribute values are supplied in the document instance as <span class="term">attribute-value pairs</span> inside the start-tag for the element occurrence. An end-tag cannot contain an attribute-value specification, since it would be redundant.</p><p>The order in which attribute-value pairs are supplied inside a tag has no significance; they must, however, be separated by at least one whitespace (blank, newline, or tab) character. The value part must always be given inside matching quotation marks, either single or double<span id="Note23_return"><a class="notelink" title="In the unlikely event that both kinds of quotation marks are needed within the quoted string, either or both can also be presented in escaped form, us…" href="#Note23"><sup>21</sup></a></span>.</p><div class="p">For example: <div id="index-egXML-d52e2062" class="pre egXML_valid"><span class="element">&lt;poem <span class="attribute">xml:id</span>="<span class="attributevalue">Poem1</span>" <span class="attribute">status</span>="<span class="attributevalue">draft</span>"&gt;</span> ... 
<span class="element">&lt;/poem&gt;</span></div> Here attribute values are being specified for two attributes previously declared for the <span class="gi">&lt;poem&gt;</span> element: <span class="att">xml:id</span> and <span class="att">status</span>. For the instance of a <span class="gi">&lt;poem&gt;</span> in this example, whose content is represented here by an ellipsis, the <span class="att">xml:id</span> attribute has the value <span class="val">Poem1</span> and the <span class="att">status</span> attribute has the value <span class="val">draft</span>. An XML processor can use the values of the attributes in any way it chooses; for example, a <span class="gi">&lt;poem&gt;</span> in which the <span class="att">status</span> attribute has the value <span class="val">draft</span> might be formatted differently from one in which the same attribute has the value <span class="val">revised</span>; another processor might use the same attribute to determine whether or not poem elements are to be processed at all. 
The <span class="att">xml:id</span> attribute is a slightly special case in that, by convention, it is always used to supply a unique value to identify a particular element occurrence, which may be used for cross-reference purposes, as discussed further below (<a class="link_ptr" href="SG.html#SG-id" title="Identifiers and Indicators"><span class="headingNumber"></span>Identifiers and Indicators</a>).</div><div class="div3" id="SG-att"><div class="miniTOC miniTOC_right"><ul class="subtoc"><li class="subtoc"></li><li class="subtoc"><span class="nextLink"> » </span><a class="navigation" href="SG.html#SG-id"><span class="headingNumber"></span>Identifiers and Indicators</a></li><li class="subtoc"><a class="navigation" href="index.html">Home</a></li></ul></div><h4><span class="bookmarklink"><a class="bookmarklink" href="#SG-att" title="link to this section "><span class="invisible">TEI: Declaring Attributes</span><span class="pilcrow">¶</span></a></span><span class="headingNumber"></span><span class="head">Declaring Attributes</span></h4><p>Attributes are declared in a schema in the same way as elements. As well as specifying an attribute's name and the element to which it is to be attached, it is possible to specify (within limits) what kind of value is acceptable for an attribute.</p><p>In the compact syntax of RELAX NG, an attribute is defined by means of an attribute pattern, like the following: </p><pre class="pre_eg cdata">att.status = attribute status {"draft" | "revised" | "published"}</pre><p> This defines a new pattern, called <code>att.status</code>, whose value is an attribute pattern defining an attribute named <span class="att">status</span>. 
Attribute names are subject to the same restrictions as other names in XML; they need not be unique across the whole schema, however, but only within the list of attributes for a given element.</p><p>A pattern defining the possible values for this attribute is given within the curly braces, in just the same way as a content model is given for an element pattern. In this case, the attribute's value must be one of the strings presented explicitly above. </p><p>The attribute pattern definition must be included or referenced within the definition for every element to which the attribute is attached. We therefore modify the definition for the <code>poem_p</code> pattern given above as follows: </p><pre class="pre_eg cdata">poem_p = element poem {att.status?, heading_p?, stanza_p+}
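# Illustrative sketch only: "note_p" and its "note" element are hypothetical
# and not part of the anthology schema. Any other element to which the
# attribute is attached would reference att.status in the same way.
note_p = element note {att.status?, text}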
</pre><p> In RELAX NG, an element pattern simply includes any attribute patterns applicable to it along with its other constituents, as shown above. Attribute patterns can also be grouped and alternated in the same way as element patterns, though this particular feature is not widely used in the TEI scheme, since it is not available to the same extent in all schema languages. Because a question mark follows the reference to the <code>att.status</code> pattern in our example, a document in which the <span class="att">status</span> attribute is not specified will still be valid; without this occurrence indicator the <span class="att">status</span> attribute would be required.</p><p>Instead of supplying a list of explicit values, an attribute pattern can specify that the attribute must have a value of a particular type, for example a text string, a numeric value, a normalized date, etc. This is accomplished by supplying a pattern that refers to a <span class="term">datatype</span>. In the example above, because a list of acceptable values is predefined, a parser can check that no <span class="gi">&lt;poem&gt;</span> is defined for which the <span class="att">status</span> attribute does not have one of <span class="val">draft</span>, <span class="val">revised</span>, or <span class="val">published</span> as its value. By contrast, with a definition such as </p><pre class="pre_eg cdata">att.status =
attribute status {text}</pre><p> a parser would accept any string of characters (<code>status="awful"</code>, <code>status="awe-ful"</code>, or <code>status="12345678"</code>) as valid for this attribute. Sometimes, of course, the set of possible values cannot be predefined. Where it can, as in this case, it is generally better to do so.</p><p>Schema languages vary widely in the extent to which they support validation of attribute values. Some languages predefine a small set of possibilities. Others allow the schema designer to use values from a predefined ‘library’ of possible datatypes, or to add their own definitions, possibly of great complexity. A ‘datatype’ might be something fairly general (any positive integer), something very specific or idiosyncratic (any four-character string ending with "T"), or somewhere between the two. In the RELAX NG schemas used by the TEI, general patterns have been defined for about half a dozen datatypes (using the W3C Schema <a class="link_ref" href="BIB.html#XSD2" title="Paul V. Biron Ashok Malhotra XML Schema Part 2 Datatypes Second EditionW3C28 October 2004">Datatype Library</a>, and discussed further in <a class="link_ptr" href="ST.html#DTYPES" title="Datatype Specifications"><span class="headingNumber">1.4.2 </span>Datatype Specifications</a>). In addition to the two possibilities already mentioned (plain text or an explicit list of possible strings), other datatypes likely to be encountered include the following: </p><dl><dt><span>boolean</span></dt><dd>values must be either true or false</dd><dt><span>numeric</span></dt><dd>values must represent a numeric quantity of some kind</dd><dt><span>date</span></dt><dd>values must represent a possible date and time in some calendar</dd></dl><p>Two further datatypes of particular usefulness in managing XML documents are commonly known as <code>ID</code> (for identifier) and <code>URI</code> (for Uniform Resource Identifier, or pointer for short). 
These are discussed in the next section.</p></div><div class="teidiv2" id="SG-id"><div class="miniTOC miniTOC_right"><ul class="subtoc"><li class="subtoc"><span class="previousLink"> « </span><a class="navigation" href="SG.html#SG-att"><span class="headingNumber"></span>Declaring Attributes</a></li><li class="subtoc"></li><li class="subtoc"><a class="navigation" href="index.html">Home</a></li></ul></div><h4><span class="bookmarklink"><a class="bookmarklink" href="#SG-id" title="link to this section "><span class="invisible">TEI: Identifiers and Indicators</span><span class="pilcrow">¶</span></a></span><span class="headingNumber"></span><span class="head">Identifiers and Indicators</span></h4><p>It is often necessary to refer to an occurrence of one textual element from within another, an obvious example being phrases such as <span class="q">‘see note 6’</span> or <span class="q">‘as discussed in chapter 5’</span>. When a text is being produced the actual numbers associated with the notes or chapters may not be certain. If we are using descriptive markup, such things as page or chapter numbers, being entirely matters of presentation, will not in any case be present in the marked-up text: they will be assigned by whatever processor is operating on the text (and may indeed differ in different applications). XML therefore predefines an attribute that may be used to provide any element occurrence with a special identifier, a kind of label, which may be used to refer to it from anywhere else: since it is defined in the XML namespace, the name of this attribute is <span class="att">xml:id</span> and it is used throughout the TEI schema. Because it is intended to act as an identifier, its values must be unique within a given document. 
The cross-reference itself will be supplied by an element bearing an attribute of a specific kind, which must also be declared in the schema.</p><div class="p">Suppose, for example, we wish to include a reference within the notes on one poem that refers to another poem. We will first need to provide some way of attaching a label to each poem: this is easily done using the <span class="att">xml:id</span> attribute. Note that not every poem need carry an <span class="att">xml:id</span> attribute and the parser may safely ignore the lack of one in those that do not. Only poems to which we intend to refer need use this attribute; for each such poem we should now include in its start-tag some unique identifier, for example: <div id="index-egXML-d52e2234" class="pre egXML_invalid"><span class="element">&lt;poem <span class="attribute">xml:id</span>="<span class="attributevalue">Rose</span>"&gt;</span> ... <span class="element">&lt;/poem&gt;</span><br /><span class="element">&lt;poem <span class="attribute">xml:id</span>="<span class="attributevalue">P40</span>"&gt;</span> ... <span class="element">&lt;/poem&gt;</span><br /><span class="element">&lt;poem&gt;</span> ... <span class="element">&lt;/poem&gt;</span></div></div><p>Next we need to define a new element for the cross-reference itself. This will not have any content—it is only a pointer—but it has an attribute, the value of which will be the identifier of the element pointed at. This is achieved by the following definition: </p><pre class="pre_eg cdata">poemRef_p = element poemRef {attribute target {anyURI}, empty}</pre><p>The <span class="gi">&lt;poemRef&gt;</span> element has no content, but a single attribute called <span class="att">target</span>. 
The value of this attribute must be a pointer or web reference of type <code>anyURI</code>;<span id="Note24_return"><a class="notelink" title="The word anyURI is a predefined name, used in schema languages to mean that any Uniform Resource Identifier (URI) may be supplied here. The accepted s…" href="#Note24"><sup>22</sup></a></span> furthermore, because there is no indication of optionality on the attribute pattern, it must be supplied on each occurrence—a <span class="gi">&lt;poemRef&gt;</span> with no referent is an impossibility.</p><div class="p">With these declarations in force, we can now encode a reference to the poem whose <span class="att">xml:id</span> attribute specifies that its identifier is <span class="val">Rose</span> as follows: <div id="index-egXML-d52e2286" class="pre egXML_invalid">Blake's poem on the sick rose<br /><span class="element">&lt;poemRef <span class="attribute">target</span>="<span class="attributevalue">#Rose</span>"/&gt;</span> ...</div> </div><p>A processor may take any number of actions when it encounters a link encoded in this way: a formatter might construct an exact page and line reference for the location of the poem in the current document and insert it, or just quote the poem's title or first lines. A hypertext style processor might use this element as a signal to activate a link to the poem being referred to, for example by displaying it in a new window. Note, however, that the purpose of the XML markup is simply to indicate that a cross-reference exists: it does not necessarily determine what the processor is to do with it.</p><div class="p">The target of a URI can be located anywhere: it may not necessarily be part of the same document, nor even located on the same computer system. Equally, it can be a resource of any kind, not necessarily an XML document or document fragment. It is thus a very convenient way of including references to non-XML data such as image files within a document. 
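As a purely illustrative sketch, a <span class="gi">&lt;poemRef&gt;</span> pointing at a poem held in a different document might look like the following; the address and the identifier <span class="val">Tyger</span> are hypothetical and stand for whatever external resource is actually intended: <pre class="pre_eg cdata">&lt;poemRef target="http://www.example.org/anthology2.xml#Tyger"/&gt;</pre> The part before the <code>#</code> names the other document, and the part after it names the identifier of an element within that document. 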
If, for example, we wished to include an illustration containing a reproduction of Blake's original in our anthology, the most appropriate method would probably be to define a new element called (for the sake of argument) <span class="gi">&lt;graphic&gt;</span> with a <span class="att">url</span> attribute of datatype <code>anyURI</code>: <pre class="pre_eg cdata">graphic_p = element graphic {att.url, empty} att.url =
attribute url {anyURI}</pre> With these additions to the schema, we can now represent the location of the illustration within our text like this: <div id="index-egXML-d52e2306" class="pre egXML_invalid"><span class="element">&lt;poem&gt;</span><br /> <span class="element">&lt;graphic <span class="attribute">url</span>="<span class="attributevalue">http://en.wikisource.org/wiki/Image:Blake_sick_rose.jpg</span>"/&gt;</span><br /><span class="element">&lt;/poem&gt;</span></div> By providing a location from which a reproduction of the required image can be downloaded, this encoding makes it possible for appropriate software to display the image as well as record its existence.</div><p>Attributes form part of the structure of an XML document in the same way as elements, and can therefore be accessed using XPath. For example, to refer to all the poems in our anthology whose <span class="att">status</span> attribute has the value <span class="val">draft</span>, we might use an XPath such as <code>/anthology/poem[@status='draft']</code>. To find the headings of all such poems, we would use the XPath <code>/anthology/poem[@status='draft']/heading</code>.</p></div></div><div class="div2" id="SG-oth"><div class="miniTOC miniTOC_right"><ul class="subtoc"><li class="subtoc"><span class="previousLink"> « </span><a class="navigation" href="SG.html#SG16"><span class="headingNumber">v.5. </span>Attributes</a></li><li class="subtoc"><span class="nextLink"> » </span><a class="navigation" href="SG.html#SG18"><span class="headingNumber">v.7. </span>Putting It All Together</a></li><li class="subtoc"><a class="navigation" href="index.html">Home</a></li></ul></div><h3><span class="bookmarklink"><a class="bookmarklink" href="#SG-oth" title="link to this section "><span class="invisible">TEI: Other Components of an XML Document</span><span class="pilcrow">¶</span></a></span><span class="headingNumber">v.6. 
</span><span class="head">Other Components of an XML Document</span></h3><p>In addition to the elements and attributes so far discussed, an XML document can contain a few other formally distinct things. An XML document may contain references to predefined strings of data that a validator must resolve before attempting to validate the document's structure; these are called <span class="term">entity references</span>. They may be useful as a means of providing ‘boilerplate’ text or representing character data which cannot easily be keyboarded. An XML document may also contain arbitrary signals or flags for use when the document is processed in a particular way by some class of processor (a common example in document production is the need to force a formatter to start a new page at some specific point in a document); such flags are called <span class="term">processing instructions</span>. And, as noted earlier, an XML document may also contain instances of elements taken from some other <span class="term">namespace</span>. We discuss each of these three cases in the rest of this section.</p><div class="div3" id="SG-er"><div class="miniTOC miniTOC_right"><ul class="subtoc"><li class="subtoc"></li><li class="subtoc"><span class="nextLink"> » </span><a class="navigation" href="SG.html#SG-pi"><span class="headingNumber"></span>Processing Instructions</a></li><li class="subtoc"><a class="navigation" href="index.html">Home</a></li></ul></div><h4><span class="bookmarklink"><a class="bookmarklink" href="#SG-er" title="link to this section "><span class="invisible">TEI: Character References</span><span class="pilcrow">¶</span></a></span><span class="headingNumber"></span><span class="head">Character References</span></h4><p>As mentioned above, all XML documents use the same internal character encoding. 
Since not all computer systems currently support this encoding directly, a special syntax is defined that can be used to represent individual characters from the Unicode character set in a portable way by providing their numeric value, in decimal or hexadecimal notation.</p><p>For example, the character <span class="mentioned">é</span> is represented within an XML document as the Unicode character with hexadecimal value 00E9. If such a document is being prepared on (or exported to) a system using a different character set in which this character is not available, it may instead be represented by the character reference <code>&amp;#x00E9;</code> (the <code>x</code> indicating that what follows is a hexadecimal value) or <code>&amp;#0233;</code> (its decimal equivalent). References of this type do not need to be predefined, since the underlying character encoding for XML is always the same.</p><p>To aid legibility, however, it is also possible to use a mnemonic name (such as <code>eacute</code>) for such character references, provided that each such name is mapped to the required Unicode value by means of a construct known as an <span class="term">entity declaration</span>. A reference to a named character entity always takes the form of an ampersand, followed by the name, followed by a semicolon. For example an XML document containing the string <span class="q">‘T&amp;C’</span> might be encoded as <code>T&amp;amp;C</code>.</p><p>There is a small set of such character entity references that do not have to be declared because they form part of the definition of XML. These include the names used for characters such as the ampersand (<code>amp</code>) and the open angle bracket or less-than sign (<code>lt</code>), which could not easily otherwise be included in an XML document without ambiguity. 
Other predeclared entity names are those for quotation marks (<code>quot</code> and <code>apos</code> for double and single respectively), and for completeness the closing angle bracket or greater-than sign (<code>gt</code>).</p><p>For all other named character entities, a set of entity declarations must be provided to an XML processor before the document referring to them can be validated. The declaration itself uses a non-XML syntax inherited from SGML; for example, to define an entity named <span class="ident-ge">eacute</span> with the replacement value é, the declaration could have any of the following forms: </p><pre class="pre_eg cdata">&lt;!ENTITY eacute "é"&gt;</pre><p> or, using hexadecimal notation: </p><pre class="pre_eg cdata">&lt;!ENTITY eacute "&amp;#xe9;"&gt;</pre><p> or, using decimal notation: </p><pre class="pre_eg cdata">&lt;!ENTITY eacute "&amp;#233;"&gt;</pre><p>Entities of this kind are useful also for <span class="term">string substitution</span> purposes, where the same text needs to be repeated uniformly throughout a text. 
For example, if a declaration such as </p><pre class="pre_eg cdata">&lt;!ENTITY TEI "Text Encoding Initiative"&gt;</pre><p> is included with a document, then references such as <code>&amp;TEI;</code> may be used within it, each of which will be expanded in the same way and replaced by the string <span class="q">‘Text Encoding Initiative’</span> before the text is validated.</p></div><div class="div3" id="SG-pi"><div class="miniTOC miniTOC_right"><ul class="subtoc"><li class="subtoc"><span class="previousLink"> « </span><a class="navigation" href="SG.html#SG-er"><span class="headingNumber"></span>Character References</a></li><li class="subtoc"><span class="nextLink"> » </span><a class="navigation" href="SG.html#SGname"><span class="headingNumber"></span>Namespaces</a></li><li class="subtoc"><a class="navigation" href="index.html">Home</a></li></ul></div><h4><span class="bookmarklink"><a class="bookmarklink" href="#SG-pi" title="link to this section "><span class="invisible">TEI: Processing Instructions</span><span class="pilcrow">¶</span></a></span><span class="headingNumber"></span><span class="head">Processing Instructions</span></h4><p>Although one of the aims of using XML is to remove any information specific to the processing of a document from the document itself, it is occasionally very convenient to be able to include such information—if only so that it can be clearly distinguished from the structure of the document. As suggested above, one common example is the need, when processing an XML document for printed output, to include a suggestion that the formatting processor might use to determine where to begin a new page of output. Page-breaking decisions are usually best made by the formatting engine alone, but there will always be occasions when it may be necessary to override these. 
An XML processing instruction inserted into the document is one very simple and effective way of doing this without interfering with other aspects of the markup.</p><p>Here is an example XML processing instruction: </p><pre class="pre_eg cdata">&lt;?tex \newpage ?&gt;</pre><p> It begins with <code>&lt;?</code> and ends with <code>?&gt;</code>. In between are two space-separated strings: by convention, the first is the name of some processor (<code>tex</code> in the above example) and the second is some data intended for the use of that processor (in this case, the instruction to start a new page). The only constraint placed by XML on the strings is that the first one must be a valid XML name; the other can be any arbitrary sequence of characters, not including the closing character-sequence <code>?&gt;</code>.</p><p>A construct which looks like a processing instruction (but is not) is the <span class="term">XML declaration</span> which can be supplied at the beginning of an XML document, for example: </p><pre class="pre_eg cdata">&lt;?xml version="1.0" encoding="iso-8859-1"?&gt;</pre><p> The XML declaration specifies the version number of the XML Recommendation applicable to the document it introduces (in this case, version 1.0), and optionally also the character encoding used to represent the Unicode characters within it. By default an XML document uses the character encoding UTF-8 or UTF-16; in this case, the 16-bit characters of Unicode have been mapped to the 8-bit character set known as ISO 8859-1; any characters present in the document but not available in the target character set will therefore need to be represented as character references (<a class="link_ptr" href="SG.html#SG-er" title="Character References"><span class="headingNumber"></span>Character References</a>). 
The XML declaration is purely documentary, but if it is wrong many XML-aware processors will be unable to process the associated text.</p></div><div class="div3" id="SGname"><div class="miniTOC miniTOC_right"><ul class="subtoc"><li class="subtoc"><span class="previousLink"> « </span><a class="navigation" href="SG.html#SG-pi"><span class="headingNumber"></span>Processing Instructions</a></li><li class="subtoc"></li><li class="subtoc"><a class="navigation" href="index.html">Home</a></li></ul></div><h4><span class="bookmarklink"><a class="bookmarklink" href="#SGname" title="link to this section "><span class="invisible">TEI: Namespaces</span><span class="pilcrow">¶</span></a></span><span class="headingNumber"></span><span class="head">Namespaces</span></h4><p>A valid XML document necessarily specifies the schema in which its constituent elements are defined. However, a well-formed XML document is not required to specify its schema (indeed, it may not even have a schema). It would still be useful to indicate that the element names used in it have some defined provenance. Furthermore, it might be desirable to include in a document elements that are defined (possibly differently) in different schemas. A cabinet-maker's schema might well define an element called <span class="gi">&lt;table&gt;</span> with very different characteristics from those of a documentalist's.</p><p>The concept of <span class="term">namespace</span> was introduced into the XML language as a means of addressing these and related problems. If the markup of an XML document is thought of as an expression in some language, then a namespace may be thought of as analogous to the lexicon of that language. Just as a document can contain words taken from different languages, so a well-formed XML document can include elements taken from different namespaces. A namespace resembles a schema in that we may say that a given set of elements ‘belongs to’ a given namespace, or are ‘defined by’ a given schema. 
However, a schema is a set of element definitions, whereas a namespace is really only a property of a collection of elements: the only tangible form it takes in an XML document is its distinctive <span class="term">prefix</span> and the identifying <span class="term">name</span> associated with it.</p><p>Suppose for example that we wish to extend our anthology to include a complex diagram. We might start by considering whether or not to extend our simple schema to include XML markup for such features as arcs, polygons, and other graphical elements. XML can be used to represent any kind of structure, not simply text, and there are clear advantages to having our text and our diagrams all expressed in the same way.</p><p>Fortunately we do not need to invent a schema for the representation of graphical components such as diagrams; it already exists in the shape of the Scalable Vector Graphics (SVG) language defined by the W3C.<span id="Note25_return"><a class="notelink" title="The W3C Recommendation is defined at ." href="#Note25"><sup>23</sup></a></span> SVG is a widely used and rich XML vocabulary for representing all kinds of two-dimensional graphics; it is also well supported by existing software. Using an SVG-aware drawing package, we can easily draw our diagram and save it in XML format for inclusion within our anthology. When we do so, we need to indicate that this part of the document contains elements taken from the SVG namespace, if only to ensure that processing software does not confuse our <span class="gi">&lt;line&gt;</span> element with the SVG <span class="gi">&lt;line&gt;</span>, which means something quite different.</p><div class="p">An XML document need not specify any namespace: it is then said to use the ‘null’ namespace. Alternatively, the root element of a document may supply a default namespace, understood to apply to all elements which have no namespace prefix. 
This is the function of the <span class="att">xmlns</span> attribute which provides a unique name for the default namespace, in the form of a URI: <div id="index-egXML-d52e2504" class="pre egXML_invalid"><span class="element">&lt;anthology&gt;</span><br /><span class="comment">&lt;!-- anthology markup elements here --&gt;</span><br /><span class="element">&lt;/anthology&gt;</span></div> In exactly the same way, on the root element for each part of our document which uses the SVG language, we might introduce the SVG namespace name: <div id="index-egXML-d52e2510" class="pre egXML_invalid"><span class="element">&lt;anthology&gt;</span><br /> <br /><span class="comment">&lt;!-- anthology markup elements here --&gt;</span><br /> <span class="element">&lt;svg xmlns="http://www.w3.org/2000/svg"&gt;</span><br /><span class="comment">&lt;!-- SVG markup elements here --&gt;</span><br /> <span class="element">&lt;/svg&gt;</span><br /><span class="comment">&lt;!-- more anthology markup elements here --&gt;</span><br /><span class="element">&lt;/anthology&gt;</span></div> Although a namespace name usually uses the URI (Uniform Resource Identifier) syntax, it is not treated as an online address and an XML processor regards it just as a string, providing a longer name for the namespace.</div><div class="p">The <span class="att">xmlns</span> attribute can also be used to associate a short prefix name with the namespace it defines. 
This is very useful if we want to mingle elements from different namespaces within the same document, since the prefix can be attached to any element, overriding the implicit namespace for itself (but not its children): <div id="index-egXML-d52e2528" class="pre egXML_invalid"><span class="element">&lt;anthology xmlns:svg="http://www.w3.org/2000/svg"&gt;</span><br /> <br /><span class="comment">&lt;!-- anthology markup elements here --&gt;</span><br /> <span class="element">&lt;svg:svg&gt;</span><br /><span class="comment">&lt;!-- SVG markup elements here --&gt;</span><br /> <span class="element">&lt;/svg:svg&gt;</span><br /><span class="comment">&lt;!-- more anthology markup elements here --&gt;</span><br /><span class="element">&lt;/anthology&gt;</span></div></div><div class="p">There is no limit on the number of namespaces that a document can use. Provided that each is uniquely identified, an XML processor can identify those that are relevant, and validate them appropriately. To extend our example further, we might decide to add a linguistic analysis to each of the poems, using a set of elements such as <span class="gi">&lt;aux&gt;</span>, <span class="gi">&lt;adj&gt;</span>, etc., derived from some pre-existing XML vocabulary for linguistic analysis. 
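For a prefix to be usable, it must be bound to a namespace name by a declaration that is in scope for the element carrying it. As a minimal sketch, the start-tag of the root element might carry declarations like the following, in which the two <code>example.org</code> namespace names are hypothetical placeholders and only the SVG namespace name is the real one: <pre class="pre_eg cdata">&lt;anthology xmlns="http://www.example.org/ns/anthology"
  xmlns:svg="http://www.w3.org/2000/svg"
  xmlns:gram="http://www.example.org/ns/gram"&gt;
 &lt;!-- unprefixed, svg:, and gram: elements may all be used here --&gt;
&lt;/anthology&gt;</pre> Declaring every prefix once on the root element is only one possible design; a declaration may equally be placed on the element where the prefix is first needed. The body of such a document might then look like this: 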
<div id="index-egXML-d52e2549" class="pre egXML_invalid"><span class="element">&lt;anthology&gt;</span><br /> <br /><span class="comment">&lt;!-- anthology markup elements here --&gt;</span><br /> <span class="element">&lt;svg:svg&gt;</span><br /><span class="comment">&lt;!-- SVG markup elements here --&gt;</span><br /> <span class="element">&lt;/svg:svg&gt;</span><br /> <span class="element">&lt;line&gt;</span><br />  <span class="element">&lt;gram:itj&gt;</span>O<span class="element">&lt;/gram:itj&gt;</span><br />  <span class="element">&lt;gram:nom&gt;</span>Rose<span class="element">&lt;/gram:nom&gt;</span><br />  <span class="element">&lt;gram:pron&gt;</span>thou<span class="element">&lt;/gram:pron&gt;</span><br />  <span class="element">&lt;gram:aux&gt;</span>art<span class="element">&lt;/gram:aux&gt;</span><br />  <span class="element">&lt;gram:adj&gt;</span>sick<span class="element">&lt;/gram:adj&gt;</span><br /> <span class="element">&lt;/line&gt;</span><br /><span class="element">&lt;/anthology&gt;</span></div></div><div class="div3" id="SG-ms"><div class="miniTOC miniTOC_right"><ul class="subtoc"><li class="subtoc"></li><li class="subtoc"></li><li class="subtoc"><a class="navigation" href="index.html">Home</a></li></ul></div><h5><span class="bookmarklink"><a class="bookmarklink" href="#SG-ms" title="link to this section "><span class="invisible">TEI: Marked Sections</span><span class="pilcrow">¶</span></a></span><span class="headingNumber"></span><span class="head">Marked Sections</span></h5><p>We mentioned above that the syntax of XML requires the encoder to take special action if characters with a syntactic meaning in XML (such as the left angle bracket or ampersand) are to be used in a document to stand for themselves, rather than to signal the start of a tag or an entity reference respectively. 
The predefined entities <code>&amp;amp;</code>, <code>&amp;lt;</code>, and <code>&amp;gt;</code> provide one method of dealing with this problem, if the number of occurrences of such things is small. Other methods may be considered when the number is large, as in an XML document like the present Guidelines, which contains hundreds of examples of XML markup. One is to label the XML examples as belonging to a different namespace from that of the document itself, which is the approach taken in the present Guidelines. Another and simpler approach is provided by one of the features inherited by XML from its parent SGML: the ‘marked section’.</p><p>A marked section is a block of text within an XML document introduced by the characters <code>&lt;![CDATA[</code> and terminated by the characters <code>]]&gt;</code>. Between these rather strange brackets, markup recognition is turned off, and any tags or entity references encountered are therefore treated as if they were plain text. For example, when we come to write the users' manual for our anthology, we may find ourselves often producing text like the following: </p><pre class="pre_eg cdata">
Here is an example of the use of the &lt;gi&gt;line&lt;/gi&gt; element:
&lt;![CDATA[&lt;line&gt;[...]&lt;/line&gt;]]&gt;
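&lt;!-- For comparison (an illustrative equivalent, not part of the original example):
     the same content written with predefined entities instead of a marked section --&gt;
&amp;lt;line&amp;gt;[...]&amp;lt;/line&amp;gt;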
</pre></div></div></div><div class="div2" id="SG18"><div class="miniTOC miniTOC_right"><ul class="subtoc"><li class="subtoc"><span class="previousLink"> « </span><a class="navigation" href="SG.html#SG-oth"><span class="headingNumber">v.6. </span>Other Components of an XML Document</a></li><li class="subtoc"></li><li class="subtoc"><a class="navigation" href="index.html">Home</a></li></ul></div><h3><span class="bookmarklink"><a class="bookmarklink" href="#SG18" title="link to this section "><span class="invisible">TEI: Putting It All Together</span><span class="pilcrow">¶</span></a></span><span class="headingNumber">v.7. </span><span class="head">Putting It All Together</span></h3><p>In this chapter we have discussed most of the components of an XML document and its associated schema. We have described informally how an XML document is represented, and also introduced one way of representing the rules a RELAX NG validator might use to validate it. In a working system, the following issues will also need to be addressed: </p><ul class="bulleted"><li class="item">how does a processor determine the schema (or schemas) that should be used to validate a given XML document instance?</li><li class="item">if a document contains entity references that must be processed before the document can be validated, where are those entities defined?</li><li class="item">an XML document instance may be stored in a number of different operating system files; how should they be assembled together?</li><li class="item">how does a processor determine which stylesheets it should use when processing an XML document, or how to interpret any processing instructions it contains?</li><li class="item">how does a processor enforce more exact validation than simple datatypes permit (for example of element content)?</li></ul><p>Different schema languages and different XML processing systems take very different positions on all of these topics, since none of them is explicitly addressed in the XML 
specification itself. Consequently, the best answer is likely to be specific to a particular software environment and schema language. Since this chapter is concerned with XML considered independently of its processing environment, we only address them in summary detail here.</p><div class="div3" id="SG-ass1"><div class="miniTOC miniTOC_right"><ul class="subtoc"><li class="subtoc"></li><li class="subtoc"><span class="nextLink"> » </span><a class="navigation" href="SG.html#SG-assoc"><span class="headingNumber"></span>Associating a Document Instance with Its Schema</a></li><li class="subtoc"><a class="navigation" href="index.html">Home</a></li></ul></div><h4><span class="bookmarklink"><a class="bookmarklink" href="#SG-ass1" title="link to this section "><span class="invisible">TEI: Associating Entity Definitions with a Document
Instance</span><span class="pilcrow">¶</span></a></span><span class="headingNumber"></span><span class="head">Associating Entity Definitions with a Document Instance</span></h4><p>In <a class="link_ptr" href="SG.html#SG-er" title="Character References"><span class="headingNumber"></span>Character References</a> we introduced the syntax used for the definition of named character entities such as <code>eacute</code>, which XML inherited from SGML. Different schema languages vary in the ways they make a collection of such definitions available to an XML processor, but fortunately there is one method that all current schema languages support.</p><p>An XML document instance may also be prefixed with a special <code>DOCTYPE</code> statement, which follows the XML declaration (<a class="link_ptr" href="SG.html#SG-pi" title="Processing Instructions"><span class="headingNumber"></span>Processing Instructions</a>). This declarative statement has been inherited by XML from SGML; in its full form it provides a large number of facilities, but we are here concerned only with the small subset of those facilities recognized by all schema languages.</p><p>Here is an example DOCTYPE statement which we might consider prefixing to the final version of our anthology: </p><pre class="pre_eg cdata">&lt;!DOCTYPE anthology [
&lt;!ENTITY mdash "&amp;#x2014;"&gt;
&lt;!ENTITY legalese "This document is available under a Creative Commons
Share and Enjoy Licence"&gt;
]&gt;</pre><p> Any XML processor encountering this statement will use it to add the two named entities it defines to those already predefined for XML. Before the document instance itself is validated, any references to these entities will be expanded to the character string given. Thus, wherever in the document instance the string <code>&amp;legalese;</code> appears, it will be replaced by the formulation above. This makes life a little easier for those keyboarding our anthology.<span id="Note26_return"><a class="notelink" title="And, indeed, for those responsible for deciding the licensing conditions if they change their minds later." href="#Note26"><sup>24</sup></a></span> The word <code>anthology</code> following the string DOCTYPE in this example is, of course, the name of the root element of the document to which this declaration is prefixed; however, only an XML DTD processor will take note of this fact.</p></div><div class="div3" id="SG-assoc"><div class="miniTOC miniTOC_right"><ul class="subtoc"><li class="subtoc"><span class="previousLink"> « </span><a class="navigation" href="SG.html#SG-ass1"><span class="headingNumber"></span>Associating Entity Definitions with a Document Instance</a></li><li class="subtoc"><span class="nextLink"> » </span><a class="navigation" href="SG.html#SG-mult"><span class="headingNumber"></span>Assembling Multiple Resources into a Single Document</a></li><li class="subtoc"><a class="navigation" href="index.html">Home</a></li></ul></div><h4><span class="bookmarklink"><a class="bookmarklink" href="#SG-assoc" title="link to this section "><span class="invisible">TEI: Associating a Document Instance with Its Schema</span><span class="pilcrow">¶</span></a></span><span class="headingNumber"></span><span class="head">Associating a Document Instance with Its Schema</span></h4><p>In the past, different schema languages adopted entirely different attitudes to this question, leading to a variety of different methods of associating schemas 
with document instances. However, a W3C Working Group Note, <span class="titlem">Associating Schemas with XML documents</span>, (<a class="link_ptr" href="http://www.w3.org/TR/xml-model/"><span>http://www.w3.org/TR/xml-model/</span></a>) now provides a standardized method of doing this through the use of a processing instruction: </p><pre class="pre_eg cdata">&lt;?xml-model href="http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng"?&gt;</pre><p> The <span class="att">href</span> <span class="term">pseudo-attribute</span> points to the location of the schema. This is the only mandatory pseudo-attribute, but others can be added to give more information about the schema: </p><pre class="pre_eg cdata">&lt;?xml-model href="burgess.rng" 
                       title="Anthony Burgess Project Schema" 
                       schematypens="http://relaxng.org/ns/structure/1.0" 
                       type="application/xml"
                     ?&gt;</pre><p> See the XML Model WG Note for more information on the pseudo-attributes available and how to use them.</p><p>A document instance may be valid according to many different schemas, each appropriate to a different processing task. All of these may be expressed in the same way: </p><pre class="pre_eg cdata">&lt;?xml-model href="tei_tite.xsd" type="application/xml" ?&gt;
&lt;?xml-model href="checkNames.sch" type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron" ?&gt;</pre><p> This example includes a standard schema in XML Schema format, along with a Schematron schema which might be used for checking the format and linking of names.</p><p>Any modern XML processing software tool will provide convenient methods of validating documents which are appropriate to the particular schema language chosen. In the interests of maximizing portability of document instances, they should contain as little processing-specific information as possible.</p></div><div class="div3" id="SG-mult"><div class="miniTOC miniTOC_right"><ul class="subtoc"><li class="subtoc"><span class="previousLink"> « </span><a class="navigation" href="SG.html#SG-assoc"><span class="headingNumber"></span>Associating a Document Instance with Its Schema</a></li><li class="subtoc"><span class="nextLink"> » </span><a class="navigation" href="SG.html#SG-style"><span class="headingNumber"></span>Stylesheet Association and Processing</a></li><li class="subtoc"><a class="navigation" href="index.html">Home</a></li></ul></div><h4><span class="bookmarklink"><a class="bookmarklink" href="#SG-mult" title="link to this section "><span class="invisible">TEI: Assembling Multiple Resources into a Single Document</span><span class="pilcrow">¶</span></a></span><span class="headingNumber"></span><span class="head">Assembling Multiple Resources into a Single Document</span></h4><p>As we have already indicated, a single XML document may be made up of several different operating system files that need to be pulled together by a processor before the whole document can be validated. 
The XML DTD language defines a special kind of entity (a <span class="term">system entity</span>) that can be used to embed references to whole files into a document for this purpose, in much the same way as the character or string entities discussed in <a class="link_ptr" href="SG.html#SG-er" title="Character References"><span class="headingNumber"></span>Character References</a>. Neither RELAX NG nor W3C Schema directly supports this mechanism, however, and we do not discuss it further here.</p><p>An alternative way of achieving the same effect is to use a special kind of pointer element to refer to the resources that need to be assembled, in exactly the same way as we proposed for the illustration in our anthology. The W3C Recommendation <span class="titlem">XML Inclusions </span> (XInclude)<span id="Note27_return"><a class="notelink" title="." href="#Note27"><sup>25</sup></a></span> defines a generic mechanism for this purpose, which is supported by an increasing number of XML processors.</p></div><div class="div3" id="SG-style"><div class="miniTOC miniTOC_right"><ul class="subtoc"><li class="subtoc"><span class="previousLink"> « </span><a class="navigation" href="SG.html#SG-mult"><span class="headingNumber"></span>Assembling Multiple Resources into a Single Document</a></li><li class="subtoc"></li><li class="subtoc"><a class="navigation" href="index.html">Home</a></li></ul></div><h4><span class="bookmarklink"><a class="bookmarklink" href="#SG-style" title="link to this section "><span class="invisible">TEI: Stylesheet Association and Processing</span><span class="pilcrow">¶</span></a></span><span class="headingNumber"></span><span class="head">Stylesheet Association and Processing</span></h4><p>As mentioned above, the processing of an XML document will usually involve the use of one or more stylesheets, often but not exclusively to provide specific details of how the document should be displayed or rendered. 
In general, there is no reason to associate a document instance with any specific stylesheet and the schema languages we have discussed so far do not therefore make any special provision for such association. The association is made when the stylesheet processor is invoked, and is thus entirely application-specific.</p><p>However, since one very common application for XML documents is to serve them as browsable documents over the Web, the W3C has defined a procedure and a syntax for associating a document instance with its stylesheet (see <a class="link_ptr" href="http://www.w3.org/TR/xml-stylesheet/"><span>http://www.w3.org/TR/xml-stylesheet/</span></a>). This Recommendation allows a document to supply a link to a default stylesheet and also to categorize the stylesheet according to its <span class="term">MIME type</span>, for example to indicate whether the stylesheet is written in CSS or XSLT, using a specialized form of processing instruction.</p><p>Assuming therefore that we have made a CSS-conformant stylesheet for our anthology and stored it in a file called <code>anthology.css</code> which is available from the same location as the anthology itself, we could make it available over the Web simply by adding a processing instruction like the following to the anthology: </p><pre class="pre_eg cdata">&lt;?xml-stylesheet href="anthology.css" type="text/css"?&gt;</pre><p>Multiple stylesheets can be defined for the same document, and options are available to specify how a web browser should select amongst them. 
For example, if the document also contained a directive: </p><pre class="pre_eg cdata">&lt;?xml-stylesheet href="anthology_m.css" type="text/css" media="handheld"?&gt;</pre><p>a different stylesheet called <code>anthology_m.css</code> could be used when rendering the document on a handheld device such as a mobile phone.</p><p>Most modern web browsers support CSS (although the extent of their implementation varies), and some of them support XSLT.</p><div class="div3" id="SG-val"><div class="miniTOC miniTOC_right"><ul class="subtoc"><li class="subtoc"></li><li class="subtoc"></li><li class="subtoc"><a class="navigation" href="index.html">Home</a></li></ul></div><h5><span class="bookmarklink"><a class="bookmarklink" href="#SG-val" title="link to this section "><span class="invisible">TEI: Content Validation</span><span class="pilcrow">¶</span></a></span><span class="headingNumber"></span><span class="head">Content Validation</span></h5><p>As we noted above, most schema languages provide some degree of datatype validation for attribute values (<a class="link_ptr" href="SG.html#SG-att" title="Declaring Attributes"><span class="headingNumber"></span>Declaring Attributes</a>). They vary greatly in the validation facilities they offer for the content of elements, other than the syntactic constraints already discussed. Thus, while we may very easily check that our <span class="gi">&lt;stanza&gt;</span> elements contain only <span class="gi">&lt;line&gt;</span> elements, we cannot easily check that <span class="gi">&lt;line&gt;</span> elements contain between five and 500 correctly-spelled English words, should we wish to constrain our poetry in such a way. 
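</p><p>The structural half of this contrast can be sketched in RELAX NG. The fragment below is an illustrative pattern only, not the schema developed earlier in this chapter, and it assumes for simplicity that a <span class="gi">&lt;line&gt;</span> contains nothing but text: </p><pre class="pre_eg cdata">&lt;element name="stanza" xmlns="http://relaxng.org/ns/structure/1.0"&gt;
  &lt;!-- a stanza is one or more lines, and nothing else --&gt;
  &lt;oneOrMore&gt;
    &lt;element name="line"&gt;
      &lt;!-- plain text only: the grammar cannot count or spell-check the words --&gt;
      &lt;text/&gt;
    &lt;/element&gt;
  &lt;/oneOrMore&gt;
&lt;/element&gt;</pre><p>A grammar of this kind constrains which elements may appear where, but says nothing about how many words a line holds or whether they are correctly spelled. 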
Also, because attributes and elements are treated differently, it is difficult or impossible to express co-occurrence constraints: for example, if the <span class="att">status</span> of a poem is <span class="val">draft</span> we might wish to permit elements such as <span class="gi">&lt;editorialQuery&gt;</span> within its content, but not otherwise.</p><p>The XML DTD language offers very little beyond syntactic checking of element content. By contrast, a major impetus behind the design and development of the W3C schema language was the addition of a much more general and powerful constraint language to the existing structural constraints of XML DTDs. In RELAX NG the opposite approach was taken, in that all datatype validation, whether of attributes or element content, is regarded as external to the schema language. For attributes, as we have seen, RELAX NG makes use of the W3C Schema Datatype Library (but permits use of others). Because RELAX NG treats both elements and attributes as special cases of patterns, the same datatype validation facilities are available for element content as for attribute values; it is unlike other schema languages in this respect. In addition, for content validation, a different component of DSDL known as Schematron can be used. Schematron is a pattern matching (rather than a grammar-based) language, which allows us to test the components of a document against templates that express constraints such as those mentioned above.</p><p>Like other XML processors, Schematron uses XPath to identify parts of an XML document; in addition, it provides elements that describe assertions to be tested and conditions which must be validated, as well as elements to report the results of the test. </p></div></div></div></div><nav class="left"><span class="upLink"> ↑ </span><a class="navigation" href="index.html">TEI P5 Guidelines</a><span class="previousLink"> « </span><a class="navigation" href="AB.html"><span class="headingNumber">iv. 
</span>About These Guidelines</a><span class="nextLink"> » </span><a class="navigation" href="CH.html"><span class="headingNumber">vi. </span>Languages and Character Sets</a></nav><!--Notes in [div]--><div class="notes"><div class="noteHeading">Notes</div><div class="note" id="Note5"><span class="noteLabel">3 </span><div class="noteBody">In the <span class="q">‘continuous writing’</span> characteristic of manuscripts from the early classical period, words are written continuously with no intervening spaces or punctuation.</div> <a class="link_return" title="Go back to text" href="#Note5_return">↵</a></div><div class="note" id="Note6"><span class="noteLabel">4 </span><div class="noteBody">New textbooks and websites about XML appear at regular intervals and to select any one of them would be invidious. Some recommended online courses include <a class="link_ptr" href="http://www.w3schools.com/xml/default.asp"><span>http://www.w3schools.com/xml/default.asp</span></a> and <a class="link_ptr" href="http://www.ibm.com/developerworks/training/ondemandskills/index.html"><span>http://www.ibm.com/developerworks/training/ondemandskills/index.html</span></a>.</div> <a class="link_return" title="Go back to text" href="#Note6_return">↵</a></div><div class="note" id="Note7"><span class="noteLabel">5 </span><div class="noteBody">We do not here discuss in any detail the ways that a stylesheet can be used or defined, nor do we discuss the popular W3C Stylesheet Languages XSLT and CSS. See further <a class="link_ptr" href="BIB.html#XSL11" title="Anders Berglund Extensible Stylesheet Language (XSL) Version 1.1W3C5 December 2006">Berglund (ed.) (2006)</a>, <a class="link_ptr" href="BIB.html#XSLT" title="James Clark XSL Transformations (XSLT) Version 1.0W3C16 November 1999">Clark (ed.) (1999)</a>, and <a class="link_ptr" href="BIB.html#CSS1" title="Håkon Wium Lie Bert Bos Cascading Style Sheets Level 1W3C11 January 1999">Lie and Bos (eds.) 
(1999)</a>.</div> <a class="link_return" title="Go back to text" href="#Note7_return">↵</a></div><div class="note" id="Note8"><span class="noteLabel">6 </span><div class="noteBody">See <span class="titlem">Extensible Markup Language (XML) 1.0</span>, available from <a class="link_ptr" href="http://www.w3.org/TR/REC-xml/"><span>http://www.w3.org/TR/REC-xml/</span></a>, Section 2.2 Characters.</div> <a class="link_return" title="Go back to text" href="#Note8_return">↵</a></div><div class="note" id="Note9"><span class="noteLabel">7 </span><div class="noteBody">ISO/IEC 10646-1993 <span class="titlem">Information Technology — Universal Multiple-Octet Coded Character Set</span> (UCS)</div> <a class="link_return" title="Go back to text" href="#Note9_return">↵</a></div><div class="note" id="Note10"><span class="noteLabel">8 </span><div class="noteBody">See <a class="link_ptr" href="http://www.unicode.org/"><span>http://www.unicode.org/</span></a></div> <a class="link_return" title="Go back to text" href="#Note10_return">↵</a></div><div class="note" id="Note11"><span class="noteLabel">9 </span><div class="noteBody">Because the opening angle bracket has this special function in an XML document, special steps must be taken to use that character for other purposes (for example, as the mathematical less-than operator); see further section <a class="link_ptr" href="SG.html#SG-er" title="Character References"><span class="headingNumber"></span>Character References</a>.</div> <a class="link_return" title="Go back to text" href="#Note11_return">↵</a></div><div class="note" id="Note12"><span class="noteLabel">10 </span><div class="noteBody">The element names here have been chosen for clarity of exposition; there is, however, a TEI element corresponding to each.</div> <a class="link_return" title="Go back to text" href="#Note12_return">↵</a></div><div class="note" id="Note13"><span class="noteLabel">11 </span><div class="noteBody">Note that this simple example has not addressed the 
problem of marking elements such as sentences explicitly; the implications of this are discussed in section <a class="link_ptr" href="SG.html#SG152" title="Complicating the Issue"><span class="headingNumber">v.4. </span>Complicating the Issue</a>.</div> <a class="link_return" title="Go back to text" href="#Note13_return">↵</a></div><div class="note" id="Note14"><span class="noteLabel">12 </span><div class="noteBody">The older terms <span class="term">Document Type Declaration</span> and <span class="term">Document Type Definition</span>, both abbreviated as DTD, may also be encountered. Throughout these Guidelines we use the term <span class="term">schema</span> for any kind of formal document grammar.</div> <a class="link_return" title="Go back to text" href="#Note14_return">↵</a></div><div class="note" id="Note15"><span class="noteLabel">13 </span><div class="noteBody">ISO/IEC FDIS 19757-2 Document Schema Definition Language (DSDL) -- Part 2: Regular-grammar-based validation -- RELAX NG</div> <a class="link_return" title="Go back to text" href="#Note15_return">↵</a></div><div class="note" id="Note16"><span class="noteLabel">14 </span><div class="noteBody">See further <a class="link_ptr" href="TD.html" title="27"><span class="headingNumber">22 </span>Documentation Elements</a> and <a class="link_ptr" href="USE.html#IM" title="Implementation of an ODD System"><span class="headingNumber">23.5 </span>Implementation of an ODD System</a>. In practice, the only part of a TEI element specification not expressed using TEI-defined syntax is the content model for an element, which is expressed using the RELAX NG schema language for reasons of processing convenience. 
RELAX NG uses its own XML vocabulary to define content models, which is adopted by the TEI for the same purpose.</div> <a class="link_return" title="Go back to text" href="#Note16_return">↵</a></div><div class="note" id="Note17"><span class="noteLabel">15 </span><div class="noteBody">For a good tutorial introduction to RELAX NG, see <a class="link_ptr" href="BIB.html#SG-BIBL-1" title="Eric van der Vlist RELAX NGOReilly2004">van der Vlist (2004)</a>.</div> <a class="link_return" title="Go back to text" href="#Note17_return">↵</a></div><div class="note" id="Note18"><span class="noteLabel">16 </span><div class="noteBody">In XML, a single colon may also appear in a GI, where it has a special significance related to the use of <span class="term">namespaces</span>, as further discussed in section <a class="link_ptr" href="SG.html#SGname" title="Namespaces"><span class="headingNumber"></span>Namespaces</a>. The characters defined by Unicode as <span class="term">combining characters</span> and as <span class="term">extenders</span> are also permitted, as are logograms such as Chinese characters.</div> <a class="link_return" title="Go back to text" href="#Note18_return">↵</a></div><div class="note" id="Note19"><span class="noteLabel">17 </span><div class="noteBody">It will not have escaped the astute reader that the fact that verse paragraphs need not start on a line boundary seriously complicates the issue; see further section <a class="link_ptr" href="SG.html#SG152" title="Complicating the Issue"><span class="headingNumber">v.4. </span>Complicating the Issue</a>.</div> <a class="link_return" title="Go back to text" href="#Note19_return">↵</a></div><div class="note" id="Note20"><span class="noteLabel">18 </span><div class="noteBody">This is however a rather artificial example; XPath, for example, provides ways of distinguishing elements in an XML structure by their position without the need to give them distinct names. 
</div> <a class="link_return" title="Go back to text" href="#Note20_return">↵</a></div><div class="note" id="Note21"><span class="noteLabel">19 </span><div class="noteBody">The official specification is at <a class="link_ptr" href="BIB.html#XPATH" title="James Clark Steve DeRose XML Path Language (XPath) Version 1.0W3C16 November 1999">Clark and DeRose (eds.) (1999)</a>; many introductory tutorials are available in the XML references cited above and elsewhere on the Web: good beginners' tutorials include <a class="link_ptr" href="http://www.w3schools.com/xml/xpath_intro.asp"><span>http://www.w3schools.com/xml/xpath_intro.asp</span></a> and <a class="link_ptr" href="http://www.zvon.org/xxl/XPathTutorial/"><span>http://www.zvon.org/xxl/XPathTutorial/</span></a>, the latter being available in several languages.</div> <a class="link_return" title="Go back to text" href="#Note21_return">↵</a></div><div class="note" id="Note22"><span class="noteLabel">20 </span><div class="noteBody">See <a class="link_ptr" href="BIB.html#SG-BIBL-2" title="Renear A. Mylonas E. Durand D. Refining our notion of what text really is the problem of overlapping hierarchies Nancy Ide Susa...">Renear et al. (1996)</a>.</div> <a class="link_return" title="Go back to text" href="#Note22_return">↵</a></div><div class="note" id="Note23"><span class="noteLabel">21 </span><div class="noteBody">In the unlikely event that both kinds of quotation marks are needed within the quoted string, either or both can also be presented in escaped form, using the predefined character entities <code>&amp;apos;</code> or <code>&amp;quot;</code></div> <a class="link_return" title="Go back to text" href="#Note23_return">↵</a></div><div class="note" id="Note24"><span class="noteLabel">22 </span><div class="noteBody">The word <span class="q">‘anyURI’</span> is a predefined name, used in schema languages to mean that any <span class="term">Uniform Resource Identifier</span> (URI) may be supplied here. 
The accepted syntax for URIs is an Internet Standard, defined in <a class="link_ptr" href="http://tools.ietf.org/html/rfc3986"><span>http://tools.ietf.org/html/rfc3986</span></a>. <code>anyURI</code> is one of the <span class="term">datatypes</span> defined by the W3C Schema datatype library.</div> <a class="link_return" title="Go back to text" href="#Note24_return">↵</a></div><div class="note" id="Note25"><span class="noteLabel">23 </span><div class="noteBody">The W3C Recommendation is defined at <a class="link_ptr" href="http://www.w3.org/Graphics/SVG/"><span>http://www.w3.org/Graphics/SVG/</span></a>.</div> <a class="link_return" title="Go back to text" href="#Note25_return">↵</a></div><div class="note" id="Note26"><span class="noteLabel">24 </span><div class="noteBody">And, indeed, for those responsible for deciding the licensing conditions if they change their minds later.</div> <a class="link_return" title="Go back to text" href="#Note26_return">↵</a></div><div class="note" id="Note27"><span class="noteLabel">25 </span><div class="noteBody"><a class="link_ptr" href="http://www.w3.org/TR/xinclude/"><span>http://www.w3.org/TR/xinclude/</span></a>.</div> <a class="link_return" title="Go back to text" href="#Note27_return">↵</a></div></div><div class="stdfooter autogenerated"><p>
    [<a href="../../en/html/SG.html">English</a>]
    [<a href="../../de/html/SG.html">Deutsch</a>]
    [<a href="../../es/html/SG.html">Español</a>]
    [<a href="../../it/html/SG.html">Italiano</a>]
    [<a href="../../fr/html/SG.html">Français</a>]
    [<a href="../../ja/html/SG.html">日本語</a>]
    [<a href="../../ko/html/SG.html">한국어</a>]
    [<a href="../../zh-TW/html/SG.html">中文</a>]
    </p><hr /><div class="footer"><a class="plain" href="http://www.tei-c.org/About/">TEI Consortium</a> | <a class="plain" href="http://www.tei-c.org/About/contact.xml">Feedback</a></div><hr /><address><br />TEI Guidelines <a class="link_ref" href="AB.html#ABTEI4">Version</a> <a class="link_ref" href="../../readme-3.1.1.html">3.1.1a</a>. Last updated on <span class="date">10th May 2017</span>, revision <a class="link_ref" href="https://github.com/TEIC/TEI/commit/bd8dda3">bd8dda3</a>. This page generated on 2017-05-12T12:30:09Z.</address></div></div></body></html>