defines one unit in a writing system, supplementing or overriding
information provided in the base coded character sets, writing system
declarations, and entity sets.
Attributes
(In addition to global attributes)
class
describes the function of the character using a prescribed
classification.
Datatype: (lexical | punc | lexpunc | digit
| space | DL | LD | dia | joiner | other)
Legal values are:
lexical
character is used in writing words (lexical items) of
the language (includes members of syllabaries and ideographic systems,
as well as composite letter-plus-diacritic combinations)
punc
character is a punctuation mark which does not appear
within lexical items
lexpunc
character can appear as a normal punctuation
mark, but can also appear within a lexical item (and should usually,
when occurring between two lexical characters, be treated as
lexical—in English, hyphen and apostrophe are typically treated as
members of this class)
digit
character is an Arabic decimal numeral (0, 1, ... 9) (does not
include superscript numbers, circled numbers, numeric dingbats, etc.)
space
character represents some form of white space
(space character, horizontal or vertical tab, newline, etc.)
DL
character is a diacritic applying to the following
lexical character
LD
character is a diacritic applying to the preceding
lexical character
dia
character is a diacritic which is explicitly joined to
a lexical character by a joiner character
joiner
character is used to join a diacritic to the lexical
character to which it applies (in some encoding schemes, the
backspace control character may be used as a joiner; in others, a
graphic character is used for the same function)
other
character does not fall into any of the other classes
(dingbats and other unusual characters fall here)
Default: lexical
Example:
Note
The classification of characters provided by this
attribute serves both informative and normative purposes: it helps
identify the character being described, and the classification is used
to define the meaning of the special character-class codes in the TEI
extended pointer syntax described in chapter 14 Linking, Segmentation, and Alignment.
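The classification above, and in particular the lexpunc rule, can be sketched in a few lines of Python. This is only an illustrative sketch, not part of the TEI tag set: the names CharClass, CLASS_TABLE, char_class, and effective_classes are hypothetical, the table covers a handful of English characters only, and treating a lexpunc character as punc whenever it is not between two lexical characters is an assumption made for the example.

```python
from enum import Enum


class CharClass(Enum):
    """The ten legal values of the class attribute."""
    LEXICAL = "lexical"
    PUNC = "punc"
    LEXPUNC = "lexpunc"
    DIGIT = "digit"
    SPACE = "space"
    DL = "DL"
    LD = "LD"
    DIA = "dia"
    JOINER = "joiner"
    OTHER = "other"


# Hypothetical classification of a few English characters.
CLASS_TABLE = {
    "-": CharClass.LEXPUNC,
    "'": CharClass.LEXPUNC,
    ",": CharClass.PUNC,
    ".": CharClass.PUNC,
    " ": CharClass.SPACE,
    "\t": CharClass.SPACE,
    "\n": CharClass.SPACE,
}


def char_class(ch):
    """Return the declared class of one character."""
    if ch in CLASS_TABLE:
        return CLASS_TABLE[ch]
    # Only the ten Arabic decimal numerals count as digits; str.isdigit()
    # would also accept superscripts, which the definition excludes.
    if ch in "0123456789":
        return CharClass.DIGIT
    # The attribute defaults to "lexical" when nothing else is declared.
    return CharClass.LEXICAL


def effective_classes(text):
    """Resolve each lexpunc character to lexical or punc by its neighbours."""
    raw = [char_class(ch) for ch in text]
    resolved = list(raw)
    for i, cls in enumerate(raw):
        if cls is CharClass.LEXPUNC:
            between_lexical = (
                0 < i < len(raw) - 1
                and raw[i - 1] is CharClass.LEXICAL
                and raw[i + 1] is CharClass.LEXICAL
            )
            # Assumption: outside a word, lexpunc behaves as ordinary punctuation.
            resolved[i] = CharClass.LEXICAL if between_lexical else CharClass.PUNC
    return resolved
```

With this table, the apostrophe in "don't" resolves to lexical, while a hyphen standing between two spaces resolves to punc.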
Example
Note
The notion of 'characters' as units in a
writing system is widespread, but not consistently defined; the
<character> element should be used to identify whatever units the
encoder wishes to distinguish as the meaningfully distinct graphic units
of the writing system. In most cases, these will correspond to the
units of coded character sets, but this is not a requirement:
a-umlaut, for example, may be treated as one character or two, depending
on the user's preference, regardless of how the coded character set in
use treats it. In most cases, also, the units distinguished by the
<character> element will be the 'graphemic'
units of the writing system in question; however, since experts disagree
on whether items like umlaut (let alone a given set of Chinese
characters with regional variations in China, Korea, and Japan) are best
treated as distinct graphemes or not, the association of
<character> elements with the graphemes of a writing system
provides at most a heuristic device for making reasonable decisions,
rather than a definitive unambiguous test.
Different forms of the same 'character' may be
distinguished for whatever reason, as in the three-R example of chapter 4 Languages and Character Sets.
In this case the different letter forms are
distinguished by documenting them in different <form> elements;
the fact that the different letter shapes do not make a lexical
difference in the text may be expressed by grouping all three letter
forms under the same <character> element. (Alternatively, the
three forms may be treated as three distinct characters, for convenience
or for whatever reason, by defining a distinct <character>
element for each.)
Module
Declared in file teiwsd2; Auxiliary tag set for Writing System Declarations
Data Description
May contain an optional series of <desc> elements, followed by a
series of one or more <form> elements identifying different forms
of the character, and an optional series of <note> elements.
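As a rough illustration of this content model, which the declaration writes as (desc*, form+, note*), the following Python sketch checks whether a sequence of child element names is acceptable. The function name children_are_valid and the regular-expression encoding are assumptions made for the example; an SGML parser validating against the DTD is the authoritative check.

```python
import re

# (desc*, form+, note*): any number of desc, at least one form, any number
# of note, in that order. Each name is followed by a space so that the
# pattern matches whole element names only.
CONTENT_MODEL = re.compile(r"(desc )*(form )+(note )*")


def children_are_valid(child_names):
    """Return True if the child element names satisfy the content model."""
    joined = "".join(name + " " for name in child_names)
    return CONTENT_MODEL.fullmatch(joined) is not None
```

For instance, the sequence desc, form, form, note is accepted, whereas an empty sequence or one that places desc after form is rejected, because at least one form is required and the order is fixed.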
May contain
desc form note
May occur within
Declaration
<!ELEMENT character %om.RO; (desc*, form+, note*)>
<!ATTLIST character
%a.global;
class (lexical | punc | lexpunc | digit | space | DL | LD | dia | joiner | other)
"lexical">