30 Rules for Interchange
This chapter discusses issues related almost
exclusively to the use of SGML-encoded TEI documents in
interchange. XML-encoded TEI documents may be safely interchanged
without formality over current networks, largely without concern for
any of the issues discussed here. This chapter has therefore not been
revised, and will probably be
withdrawn or substantially modified at the next release.
This chapter describes how interested parties can determine and agree
on the proper format for the successful interchange of TEI-conformant
documents over a given communications link, and how to translate from
normal TEI form to the transmission format and back. It also includes
recommendations for formats to be used when private arrangements cannot
be made, in non-negotiated or 'blind' interchange.
30.1 Negotiated Interchange
When the sender and receiver of a given text know each other's
identity and can make appropriate special arrangements for the
interchange of the text, the following procedures may be used to
ensure the successful interchange of the text without information loss.
The sender and receiver together must first:
- agree on whether to exchange the document in
TEI interchange form or in some other form (e.g.
TEI local-processing form)
- identify the communications link: the method to be
used to transmit and receive the document (transmission over a given
network, physical transmission of a disk, tape, or other medium, etc.)
- identify (by experimentation if necessary) the set of characters
which can be transmitted successfully, without corruption, over the
communications link; this is the
transmission character set.
- identify the set of characters present in the document which lie
outside the transmission character set; this is the set of
non-transmissible characters. For each character in this
set, the local writing system declaration should identify an entity name
or a transliteration into the transmission character set, which is to be
used to transmit the character. The set of entities needed for this
purpose is the transmission entity set.
The transmission character set is defined as the set of characters
in the sender's system character set(s) which survive transmission and
are properly recognized in the recipient's system character set. It is
therefore by definition a subset of both the sender's and the
recipient's system character sets. The bit patterns used to represent
the characters may differ in the two systems (e.g. one may use ASCII,
the other EBCDIC) if the communications link performs the proper
translations.
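The experiment described above can be sketched in Python. This is a minimal illustration rather than a prescribed procedure: the function names, the probe string, and the echoed result are all assumptions made for the example. It also presumes that the link neither drops nor inserts characters, so that the probe and its echo can be compared position by position.

```python
def transmission_character_set(sent, received):
    """Return the characters of `sent` that always arrived unchanged."""
    if len(sent) != len(received):
        raise ValueError("probe was truncated or expanded in transit")
    survived = {s for s, r in zip(sent, received) if s == r}
    corrupted = {s for s, r in zip(sent, received) if s != r}
    # A character counts as transmissible only if it never arrived corrupted.
    return survived - corrupted


def non_transmissible(document, transmission_set):
    """Return the characters of the document outside the transmission set."""
    return set(document) - transmission_set


# Hypothetical probe and the text that came back over the link.
probe = "abcdef 123[]éü"
echo = "abcdef 123????"

safe = transmission_character_set(probe, echo)
print(sorted(non_transmissible("café [1]", safe)))  # ['[', ']', 'é']
```

In practice the probe would have to contain every character of the sender's system character set that occurs in the documents to be exchanged.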
Current network standards allow — indeed, require — gateway nodes
to translate material passing through the gateway from one coded
character set into another, when the networks joined by the gateway use
different coded character sets. Since there is no universally
satisfactory translation among all coded character sets in common use,
the transmission character set will normally be the subset which is
satisfactorily translated by the gateways encountered in transit between
the sender and the receiver of the data.
When material is transmitted on a physical storage medium
(e.g. disk or tape), then those exchanging documents have far greater
control over the data. If both partners use compatible systems, the
transmission character set may be equivalent to their system character
sets; otherwise, the transmission character set will include those
characters which the recipient can successfully read into the local
system character set from the media provided by the sender. For
example, if a diskette created by an MS-DOS machine can be mailed to a
Macintosh user and read directly by the recipient's system, then the
transmission character set is likely to be ISO 646 IRV (equivalent to
ANSI X3.4, or ASCII), which both machines have in common; if the
recipient's disk-reading utilities are more sophisticated, however,
then it may be possible to include some or all of the two machines'
non-standard extended characters as well.
The mapping from non-transmissible characters to the transmission
entity set may be derived from the writing system declarations in use by
the sender and receiver.
After the transmission character set and the transmission entity set
have been defined, the sender must prepare and transmit the document:
- if the document is to be exchanged in TEI interchange format,
translate it from the local-processing form into the TEI interchange
form (e.g. by passing SGML through an SGML normalizer to supply omitted
SGML tags and interpret all short-reference sequences; XML documents are
required to obey these same constraints, and
are considered to be in the interchange format)
- in the case of SGML, pack the document for transmission by replacing every
non-transmissible character with the appropriate transliteration or
entity reference. At this point, every character in the document is
represented either by a character in the transmission character set or
by a reference to an entity in the transmission entity set; the document
is in TEI packed format. If the transmission character set
includes all the required delimiter characters, the document will be a
conforming SGML document; otherwise, it may not be SGML-conformant but
will be easily mappable to a conforming document. In the case of XML,
which is always in Unicode, the document may be packed to UTF-8 for
transmission.
- transmit the document over the agreed link
The translation of non-transmissible characters into the transmission
entity set may be accomplished under automatic control, by software
which reads the appropriate local writing system declarations, creates
the necessary mapping tables, and packs the document for transmission.
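The packing step might look like the following Python sketch. The dictionary WSD_ENTITIES stands in for a mapping table derived from a writing system declaration; its contents, and the names pack and ISO_646, are assumptions made for this illustration, and no existing packing software is being described.

```python
# Hypothetical mapping table, as software might derive it from a
# writing system declaration: character -> entity name.
WSD_ENTITIES = {"é": "eacute", "è": "egrave", "ü": "uuml", "ß": "szlig"}

# Transmission character set assumed for the example: 7-bit ISO 646.
ISO_646 = {chr(code) for code in range(128)}


def pack(text, transmission_set, entities):
    """Replace every non-transmissible character by an entity reference."""
    out = []
    for ch in text:
        if ch in transmission_set:
            out.append(ch)
        elif ch in entities:
            out.append("&" + entities[ch] + ";")
        else:
            # Refuse to guess: losing the character silently would
            # defeat the purpose of the interchange rules.
            raise ValueError("no entity declared for %r" % ch)
    return "".join(out)


print(pack("Straße über", ISO_646, WSD_ENTITIES))  # Stra&szlig;e &uuml;ber
```

Raising an error for an undeclared character, rather than dropping or approximating it, keeps the packed document free of information loss.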
Upon receipt, the receiver must:
- unpack the document by expanding those entities which correspond
to characters in the local system character set(s); at this point, the
document should once more be conformant, and should probably be
validated to detect any problems in transmission
- optionally, translate material that was transliterated for
transmission into the local system character set(s) for more
convenient display, or convert it from the transmission entity set into a local
transliteration scheme; in XML this should not typically be necessary
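The receiver's unpacking step can be sketched in the same spirit. In this Python illustration the table ENTITY_CHARS and the function names are hypothetical, and a Python codec name is used as a stand-in for the local system character set. Only entities whose character exists locally are expanded; all other references are left in place.

```python
import re

# Hypothetical table from the accompanying writing system declarations:
# entity name -> character.
ENTITY_CHARS = {"eacute": "é", "uuml": "ü", "alpha": "α"}

ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9.-]*);")


def representable(ch, codec):
    """True if the local character set (a Python codec here) has `ch`."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


def unpack(text, codec, entity_chars=ENTITY_CHARS):
    """Expand only those entities that the local character set can hold."""
    def expand(match):
        ch = entity_chars.get(match.group(1))
        if ch is not None and representable(ch, codec):
            return ch
        return match.group(0)  # leave the reference untouched
    return ENTITY_REF.sub(expand, text)


print(unpack("caf&eacute; &alpha; &amp;", "latin-1"))  # café &alpha; &amp;
```

Validation of the unpacked document would follow as a separate step, using whatever parser the receiver has available.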
It is strongly recommended that when documents are interchanged they
be accompanied by any writing system declarations and feature system
declarations which are applicable. In TEI-conformant interchange,
it is required that documents be accompanied by any applicable tag-set
documentation files.
30.2 Some Simple Examples
As a first simple example, consider an SGML document containing English,
French, and German, to be transmitted from an IBM-compatible personal
computer to a Macintosh, over a long-distance network connection.
Uploading test files from the PC to the sender's local network node,
sending the file via the network, and downloading the document to the
Macintosh, reveal (let us assume) that while all the characters of ISO
646 IRV survive intact, the accented characters of French and German do not
survive transmission. In this case, the transmission character set is
composed of all the characters of 7-bit ISO 646. The entity set
required to handle the non-transmissible characters will include the
following (assuming they actually occur in the document):
- agrave
- ccedil
- eacute
- egrave
- ocirc
- oelig
- auml, Auml
- ouml, Ouml
- uuml, Uuml
- szlig
When 'packing' the file for transmission, the sender
must replace the non-transmissible characters in the document with
references to these entities. After this substitution, the document is a
conforming document written entirely in the transmission character
set, which can be sent over the communications link without any garbling
or loss of information. Upon receipt, the recipient can replace the
entity references with the specific coded characters used on the
Macintosh to write French and German.
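Which entities are actually required depends on the characters that occur in the document. The Python sketch below derives that list for the scenario just described. The table uses the entity names of the ISO 8879 Latin sets; the function name and the sample sentence are assumptions made for the example.

```python
# Characters of the example and their ISO 8879 entity names.
LATIN_ENTITIES = {
    "à": "agrave", "ç": "ccedil", "é": "eacute", "è": "egrave",
    "ô": "ocirc", "œ": "oelig", "ä": "auml", "Ä": "Auml",
    "ö": "ouml", "Ö": "Ouml", "ü": "uuml", "Ü": "Uuml", "ß": "szlig",
}


def required_entities(document):
    """List the entities needed to send `document` over a 7-bit link."""
    needed = set()
    for ch in document:
        if ord(ch) > 127:  # outside 7-bit ISO 646
            # A KeyError here signals a character with no declared entity.
            needed.add(LATIN_ENTITIES[ch])
    return sorted(needed)


sample = "Le garçon est à l'hôtel; die Straße ist schön."
print(required_entities(sample))
# ['agrave', 'ccedil', 'ocirc', 'ouml', 'szlig']
```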
As a second example, consider the same document being transmitted
from a VAX running VMS to an IBM mainframe running VM/CMS. Here, the
accented characters might be represented on the VAX using the coded
character set ISO 8859-1, but since ISO 8859-1 is not always
supported, it may be more likely that entity references will be used
instead. Let us assume that the network path between the two machines
accepts Latin characters and digits, and most punctuation, but garbles
square brackets, braces, the hash mark, and the pounds-sterling
symbol. In this case, the accented characters require no special work
by the sender, since they are already in a network-safe form. Square
brackets, etc., must however be replaced by entity references to
lbr, rbr, etc. After
this is done, the document is no longer conformant SGML, since the
square brackets used in certain markup declarations will not be
recognized. (It is important, therefore, for validation to be
performed before the square brackets are replaced by entity
references.)
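The effect on markup declarations can be seen in a small Python sketch. It uses the entity names lbr and rbr from the text; the function names and the sample declaration are assumptions. No validator is invoked here: validation would have to take place before pack_delims is applied and again after unpack_delims.

```python
# Packing table for the characters the link is assumed to garble.
PACK = {"[": "&lbr;", "]": "&rbr;"}


def pack_delims(text):
    """Replace square brackets by entity references, markup included."""
    return "".join(PACK.get(ch, ch) for ch in text)


def unpack_delims(text):
    """Restore the square brackets on receipt."""
    for ch, ref in PACK.items():
        text = text.replace(ref, ch)
    return text


source = '<!DOCTYPE TEI.2 SYSTEM "tei2.dtd" [ <!ENTITY % TEI.prose "INCLUDE"> ]>'
packed = pack_delims(source)

print(packed)
# The brackets that open and close the declaration subset are gone, so an
# SGML parser would reject the packed form; the round trip restores them.
print(unpack_delims(packed) == source)  # True
```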
Upon receipt, the document may be translated into a valid SGML
document by replacing all references to lbr
and rbr with the appropriate square brackets,
etc. If the local system supports one of the IBM code pages with
support for French and German characters, then the entity references
to those characters may be replaced by the characters in the system
character set. More commonly, the entities for French and German
characters will be left in place.
As a third example, consider a document containing Greek, as well
as Latin, German, French, and English. If the sender's system has a
full Greek character set, but the recipient's does not, then the Greek
characters must all be replaced either by references to entities
or by transliterations into Latin characters (e.g. using the beta code
transliteration developed for the Thesaurus Linguae
Graecae). If the text is later transmitted to another
system which does have a full Greek character set, the transliterated
text or entity references may be translated, under control of the
relevant writing system declarations, into the local Greek character
set.
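A transliteration of this kind can be sketched in Python. The table below covers only the unaccented Greek letters, with capitals marked by a preceding asterisk as in beta code; breathings, accents, and punctuation are deliberately left out, so this is a simplified subset and not a full beta code implementation. The function name is an assumption.

```python
# Unaccented Greek letters and their beta code equivalents.
BETA = dict(zip("αβγδεζηθικλμνξοπρστυφχψω",
                "ABGDEZHQIKLMNCOPRSTUFXYW"))
BETA["ς"] = "S"  # final sigma is written like medial sigma here


def to_beta(greek):
    """Transliterate unaccented Greek into a beta code subset."""
    out = []
    for ch in greek:
        lower = ch.lower()
        if lower in BETA:
            # Capitals are marked with a preceding asterisk.
            out.append(("*" if ch != lower else "") + BETA[lower])
        elif ord(ch) < 128:
            out.append(ch)  # spaces, digits, and the like pass through
        else:
            raise ValueError("no transliteration for %r" % ch)
    return "".join(out)


print(to_beta("λογος"))  # LOGOS
print(to_beta("Λογος"))  # *LOGOS
```

Because the transliterated text consists of Latin letters, it is unambiguous only where the markup identifies the passage as Greek.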
As a final example, consider a document written in Japanese, to be
transmitted over a network within Japan, or over an international
network to a recipient in Europe. Since networks within Japan transmit
Japanese text without information loss, and common utilities may be used
to recognize any of the existing coded character sets and translate into
another, the transmission character set for interchange within Japan may
be the same as the system character set. (If user-defined extensions
are defined, in order to allow the encoding of kanji not present in the
standard character sets, then these non-standard kanji may need to be
replaced either by entity references or by
'transliterations' into the standard character sets,
and the description of the kanji themselves should accompany the
documents in which they are used; the writing system declaration may be
used for this purpose.)
When transmitting Japanese text outside Japan, the limitations of the
networks at the time of transmission must be taken into account; it may
be necessary to transliterate the text, or to replace non-transmissible
characters with entity references.
30.3 Non-Negotiated Interchange
In some cases, no negotiation between interchange partners is
possible, because they do not know each other's identities. Since it is
impossible to discover an appropriate transmission character set by
experiment, such interchange requires the use of extremely conservative
assumptions about the frailties of network gateways.
Specifically, it is recommended that for non-negotiated interchange
the following practices be adopted:
- The transmission character set should be the International
Reference Version of ISO 646 (commonly known as ASCII), or Unicode.
- The transmission entity set should include only entities
documented in ISO standard entity sets, published TEI writing system
declarations, or writing system declarations made available with the
material.
- All applicable writing system declarations should be distributed
together with the material they describe; in non-negotiated interchange,
all writing system declarations should assume ISO 646 (IRV) or Unicode as
the system character set.
- For SGML, the SGML declaration distributed with the material should assume
ISO 646 (IRV) as the system character set.
- If the receiver is not using ISO 646 IRV or Unicode as the system character
set, then the receiver (or some intervening network node) must translate
from ISO 646 into the receiver's system character set. For SGML, the receiver or
the receiver's unpacking software is responsible for rewriting those
parts of the SGML declaration which are dependent on the coded character
set, and ensuring that they work properly on the receiving system.
- As transmitted, the document should be valid.
By imposing these restrictions,
the recommendations ensure that documents interchanged in this way
will be directly usable on a great variety of systems; moreover, since
the allowed character sets are widely known and well documented, users of other systems will
normally be able to adjust the documentation and data stream to their
local systems without difficulty.
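Two of these recommendations lend themselves to a mechanical check before a document is released for blind interchange. The Python sketch below reports characters outside ISO 646 and entity references that are not in a given set of documented entities. The function name, the message wording, and the sample input are assumptions; the check covers neither validity nor the SGML declaration.

```python
import re

ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9.-]*);")


def blind_interchange_problems(text, documented_entities):
    """Report characters and entities unsuited to blind interchange."""
    problems = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for ch in line:
            if ord(ch) > 127:  # not in the 7-bit ISO 646 repertoire
                problems.append(
                    "line %d: non-ISO 646 character %r" % (lineno, ch))
        for name in ENTITY_REF.findall(line):
            if name not in documented_entities:
                problems.append(
                    "line %d: undocumented entity &%s;" % (lineno, name))
    return problems


report = blind_interchange_problems("caf&eacute;\nnaïve &foo;", {"eacute"})
for problem in report:
    print(problem)
```

An empty report means only that these two conditions hold; the document must still be validated separately.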
It should be noted that ISO 646 imposes a very
restrictive and cumbersome encoding for researchers whose character
sets have a large repertoire; it is strongly recommended, therefore,
that such materials use XML and Unicode, or that arrangements for
their interchange involve explicitly
negotiated interchange formats wherever possible.
The rules given here for non-negotiated interchange are not
guaranteed to succeed, and negotiation of interchange formats is
therefore required, if any of the following apply:
- The communications link between the interchange partners garbles
or corrupts any characters of ISO 646 (IRV) (i.e., ISO 646 (IRV) is
not a subset of the transmission character set).
- The communications link maps characters not in the repertoire of
ISO 646 (IRV) onto characters in that repertoire.
- The document uses a non-standard SGML concrete syntax (XML
does not permit variant concrete syntaxes, so this case does not
apply when using XML).
30.4 Notes for Implementors
The descriptions of document interchange in this chapter from time to
time refer to software used to pack documents for interchange, or to
unpack documents upon receipt. The descriptions do not characterize any
specific existing software, but attempt to make clear how such software
must work, in a general way. It is hoped that the descriptions will be
useful to implementors of packing and unpacking software, but the full
specification of such packing and unpacking software is beyond the scope
of this chapter. All that can be attempted here is to describe some
complications which may arise in the packing and unpacking of documents
for interchange, of which implementors of such software should be aware.
Most of these difficulties do not arise with XML, because XML requires
that the character set be Unicode.
- if the sender and receiver use different SGML syntaxes, various
incompatibilities may be encountered which will require one partner or
the other to modify the SGML syntax used for the document, or the
document itself, or both. In general, unless other arrangements are
made, the responsibility for such modifications falls upon the receiver
of the document.
- in particular, if the SGML syntax has been modified to expand the
set of legal name characters (e.g. to allow characters with diacritic
marks to occur in SGML names), then either the recipient must similarly
modify the local SGML declaration, or the SGML names must be modified
(by sender or receiver) to make them legal under the recipient's SGML
declaration; name collisions must be carefully avoided when this is
done.
- if the transmission character set does not include all characters
with special meaning to the parser (name characters, delimiters, etc.), then
although the packed document will be in a one-to-one relationship with a
conforming document, it will not itself be conforming.
In this case, validation must be done by the
sender before packing, and the recipient cannot validate the received
document before unpacking it.
- the proper packing of characters for transport may vary from
language to language: in English text, a left square bracket may need
to be packed with an entity reference to lbr,
while the same bracket may have a different meaning, and thus a
different entity replacement, in Greek text. In general, this means
the packing should be done by an application able to detect element boundaries,
read the value of the
lang attribute from the start-tag or infer it from context,
and adjust its actions appropriately. If only one WSD is in use in a
given document and no language shifts are present, then both packing
and unpacking may be done by a much simpler string-replacement
algorithm.
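Language-sensitive packing, as described in the last point, might be sketched in Python as follows. The sketch assumes normalized markup in which every start-tag has a matching end-tag and no attribute value contains a right angle bracket. The language identifiers, the packing tables, and the entity name grc.lbr are hypothetical. The current language is tracked on a stack and inherited by elements that carry no lang attribute.

```python
import re

TAG = re.compile(r"(<[^>]*>)")
LANG = re.compile(r"""\blang\s*=\s*["']([^"']*)["']""")

# Hypothetical per-language packing tables (one per WSD in use).
TABLES = {
    "en": {"[": "&lbr;", "]": "&rbr;"},
    "grc": {"[": "&grc.lbr;", "]": "&grc.rbr;"},
}


def pack_by_language(doc, tables, default_lang="en"):
    """Pack character data using the table of the enclosing language."""
    stack = [default_lang]
    out = []
    for token in TAG.split(doc):
        if token.startswith("</"):
            if len(stack) > 1:
                stack.pop()  # leave the element's language scope
            out.append(token)
        elif token.startswith(("<!", "<?")):
            out.append(token)  # declarations, comments, PIs: untouched
        elif token.startswith("<"):
            if not token.endswith("/>"):
                match = LANG.search(token)
                # Inherit the parent's language when none is given.
                stack.append(match.group(1) if match else stack[-1])
            out.append(token)
        else:
            table = tables.get(stack[-1], {})
            out.append("".join(table.get(ch, ch) for ch in token))
    return "".join(out)


doc = '<p lang="en">see [1] <foreign lang="grc">[abc]</foreign> and [2]</p>'
print(pack_by_language(doc, TABLES))
```

When only one table is in use and no language shifts occur, the stack never changes and the function reduces to the simple string replacement mentioned above.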