Many types of uncertainty may be distinguished. The
<certainty> element is designed to encode the following sorts:
17.1.2 Structured Indications of Uncertainty
To record uncertainty in a more structured way, susceptible of at
least simple automatic processing, the <certainty> element may be
used:
<certainty> indicates the degree of certainty or uncertainty associated with
some aspect of the text markup. Attributes:
  target
    points at the elements whose markup is uncertain.
  locus
    indicates the precise location of the uncertainty in the
    markup: applicability of the element, precise position of the
    start- or end-tag, value of a specific attribute, etc.
  given
    indicates conditions assumed in the assignment of a
    degree of confidence.
  degree
    indicates the degree of confidence assigned to the aspect
    of the markup named by the locus attribute.
  assertedValue
    provides an alternative value for the aspect of the markup in
    question: an alternative generic identifier, transcription,
    or attribute value, or the identifier of an <anchor> element (to
    indicate an alternative starting or ending location). If an
    assertedValue is given, the confidence level specified by
    degree applies to the alternative markup specified by
    assertedValue; if none is given, it applies to the markup
    in the text.
  desc
    further describes the uncertainty in prose, perhaps
    indicating its nature, cause, or the justification for the
    degree of confidence asserted.
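As an illustration of the "simple automatic processing" mentioned above, the following Python sketch reads <certainty> elements into plain records. The names Certainty and read_certainty are hypothetical and not part of the TEI scheme; the sketch also assumes that target and given hold whitespace-separated identifiers and that the input is a well-formed XML string.

```python
from dataclasses import dataclass, field
from typing import List, Optional
import xml.etree.ElementTree as ET


@dataclass
class Certainty:
    """One <certainty> element, reduced to the attributes listed above."""
    target: List[str]                     # IDs of the elements whose markup is uncertain
    locus: str                            # aspect of the markup in doubt, e.g. "#gi"
    given: List[str] = field(default_factory=list)  # IDs of conditioning <certainty> elements
    degree: Optional[float] = None        # confidence, usually between 0 and 1
    asserted_value: Optional[str] = None  # alternative value, if one is proposed
    desc: Optional[str] = None            # prose description of the doubt


def read_certainty(xml_text: str) -> List[Certainty]:
    """Collect every <certainty> element found in a well-formed XML string."""
    records = []
    for el in ET.fromstring(xml_text).iter("certainty"):
        degree = el.get("degree")
        records.append(Certainty(
            target=el.get("target", "").split(),
            locus=el.get("locus", ""),
            given=el.get("given", "").split(),
            degree=float(degree) if degree is not None else None,
            asserted_value=el.get("assertedValue"),
            desc=el.get("desc"),
        ))
    return records


records = read_certainty(
    '<doc><placeName id="p1">Essex</placeName>'
    '<certainty target="p1" locus="#gi" desc="possibly not a placename"/></doc>'
)
print(records[0].target, records[0].locus, records[0].degree)
```

An absent degree is kept as None rather than replaced by a default, since an unqualified indication of uncertainty carries no numeric estimate.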
Returning to the example, the <certainty> element may be used to record doubts about
the proper encoding of ‘Essex’ in several ways of varying
precision. To record merely that we are not certain that ‘Essex’
is in fact a place name, as it is tagged, we use the target
attribute to identify the element in question, and the locus
attribute to indicate what aspect of the markup we are uncertain about
(in this case, whether we have used the correct element type):
Elizabeth went to
<placeName id="p1">Essex</placeName>.
<!-- ... elsewhere in the document ... -->
<certainty target="p1" locus="#gi" desc="possibly not a placename"/>
Because it is linked to the location of the uncertainty by a reference, the
<certainty> element will typically be included in the same
document as its target. It may be placed adjacent to the target
element, or elsewhere in the document.
To record the further information that we estimate, subjectively,
that there is a 60 percent chance of ‘Essex’ being a place name here, we
can add a value for our degree of confidence (usually a
number between 0 and 1, representing the estimated probability):
Elizabeth went to
<placeName id="p1">Essex</placeName>.
<!-- ... -->
<certainty target="p1" locus="#gi" desc="possibly not a placename" degree="0.6"/>
According to one expert, there is a 60 percent chance of ‘Essex’
being a place name here, and a 40 percent chance of its being a
personal name. We use two <certainty> elements to indicate the
two probabilities independently. Both elements indicate the same location in the
text, but the second provides an alternative choice of generic
identifier (in this case <persName>), given as the value of the
assertedValue attribute:
Elizabeth went to
<placeName id="p1">Essex</placeName>.
<!-- ... -->
<certainty target="p1" locus="#gi"
desc="probably a placename, but possibly not" degree="0.6"/>
<certainty target="p1" locus="#gi" assertedValue="persName"
desc="may refer to the Earl of Essex" degree="0.4"/>
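When several alternatives are recorded for the same target and locus, a processor might check how much of the probability they account for. The following Python sketch is one possible way to do so; the function name total_confidence and the wrapping <doc> element are assumptions made for the illustration, and conditional or unquantified <certainty> elements are deliberately skipped.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# The two alternatives from the example above, wrapped in a hypothetical
# <doc> element so that the string is well-formed XML.
SAMPLE = """<doc>
<certainty target="p1" locus="#gi"
  desc="probably a placename, but possibly not" degree="0.6"/>
<certainty target="p1" locus="#gi" assertedValue="persName"
  desc="may refer to the Earl of Essex" degree="0.4"/>
</doc>"""


def total_confidence(xml_text):
    """Sum the unconditional degree values for each (target, locus) pair."""
    totals = defaultdict(float)
    for el in ET.fromstring(xml_text).iter("certainty"):
        if el.get("given") or el.get("degree") is None:
            continue  # skip conditional or unquantified statements
        for target in el.get("target", "").split():
            totals[(target, el.get("locus"))] += float(el.get("degree"))
    return dict(totals)


totals = total_confidence(SAMPLE)
for (target, locus), total in totals.items():
    print(f"{target} {locus}: {total:.2f}")
```

A total below 1 is not necessarily an error: it may simply mean that the encoder has not named every alternative.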
Finally, we may wish to make our probability estimates contingent
on some condition. In the passage ‘Elizabeth went to Essex; she had
always liked Essex,’ for example, we may feel there is a 60 percent chance
that the county is meant, and a 40 percent chance that the earl is meant. But
the two occurrences of the word are not independent: there is (we may
feel) no chance at all that one occurrence refers to the county and one
to the earl. We can express this by using the given
attribute to list the identifiers of the <certainty> elements on which an estimate is conditional:
Elizabeth went to <placeName id="p1">Essex</placeName>.
She had always liked <placeName id="p2">Essex</placeName>.
<!-- ... -->
<!-- 60% chance that P1 is a placename,
40% chance a personal name. -->
<certainty id="cert-1" target="p1" locus="#gi"
desc="probably a placename, but possibly not" degree="0.6"/>
<certainty id="cert-2" target="p1" locus="#gi"
desc="may refer to the Earl of Essex" assertedValue="persName" degree="0.4"/>
<!-- 60% chance that P2 is a placename,
40% chance a personal name.
100% chance that it agrees with P1. -->
<certainty target="p2" locus="#gi" given="cert-1"
desc="if P1 is a placename, P2 certainly is" degree="1.0"/>
<certainty target="p2" locus="#gi" assertedValue="persName" given="cert-2"
desc="if p1 refers to the Earl of Essex, so does P2" degree="1.0"/>
When conditions are listed in the given attribute, the <certainty>
element is interpreted as claiming the stated degree of confidence in a
particular markup on the assumption that the markup described in the
indicated <certainty> elements is correct.
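The conditional reading described above can be turned into an overall estimate by multiplying each degree by the degrees of the statements it is conditional on. The Python sketch below does this for the Essex example (desc attributes omitted for brevity). The function name joint_degrees and the <doc> wrapper are hypothetical; the sketch assumes that given holds whitespace-separated identifiers, that a missing degree counts as 1, and that the given references contain no cycles.

```python
import xml.etree.ElementTree as ET

ESSEX = """<doc>
<certainty id="cert-1" target="p1" locus="#gi" degree="0.6"/>
<certainty id="cert-2" target="p1" locus="#gi" assertedValue="persName" degree="0.4"/>
<certainty target="p2" locus="#gi" given="cert-1" degree="1.0"/>
<certainty target="p2" locus="#gi" assertedValue="persName" given="cert-2" degree="1.0"/>
</doc>"""


def joint_degrees(xml_text):
    """Return (target, assertedValue, overall degree) for every <certainty>."""
    elements = list(ET.fromstring(xml_text).iter("certainty"))
    by_id = {el.get("id"): el for el in elements if el.get("id")}

    def joint(el):
        # Assumption: an element without a degree contributes a factor of 1.
        p = float(el.get("degree", "1"))
        for ref in el.get("given", "").split():
            p *= joint(by_id[ref])  # multiply in each condition (no cycle check)
        return p

    return [(el.get("target"), el.get("assertedValue"), joint(el))
            for el in elements]


results = joint_degrees(ESSEX)
for target, asserted, p in results:
    print(target, asserted or "(markup as given)", f"{p:.2f}")
```

Under these assumptions the second occurrence inherits the 60/40 split of the first, which is what the prose intends: the two occurrences always agree.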
Conditional confidence may be less than 100 percent: given the sentence
‘Ernest went to old Saybrook’, we may interpret ‘Saybrook’ as
a personal name or a place name, assigning a 60 percent probability to the
former. If it is a place name, there may be a 50 percent chance that the
place name actually in question is ‘Old Saybrook’ rather than
‘Saybrook’, while if it is correctly tagged as a personal name, it
is much more likely (say, 90 percent certain) that the name is ‘Saybrook’.
This state of affairs can be expressed using the
<certainty> element thus:
Ernest went to <anchor id="a1"/> old <persName id="p1">Saybrook</persName>.
<certainty id="c1" target="p1" locus="#gi" degree="0.6"/>
<certainty target="p1" locus="#startloc" given="c1" degree="0.9"/>
<certainty id="c2" target="p1" locus="#gi" assertedValue="placeName" degree="0.4"/>
<certainty target="p1" locus="#startloc" given="c2" degree="0.5"/>
<certainty id="c3" target="p1" locus="#startloc" assertedValue="a1" given="c2" degree="0.5"/>
In this case, the assertedValue on <certainty> element c3 is a
reference to an <anchor> element at the alternative starting
point for the element.
Multiplying the numeric values out, this markup may be interpreted as
assigning specific probabilities to three different ways of
marking up the sentence:
Ernest went to old <persName>Saybrook</persName>. (0.6 × 0.9, or 0.54)
Ernest went to old <placeName>Saybrook</placeName>. (0.4 × 0.5, or 0.20)
Ernest went to <placeName>old Saybrook</placeName>. (0.4 × 0.5, or 0.20)
The probabilities do not add up to 1.00 because the markup indicates
that if ‘Saybrook’ is (part of) a personal name, there is a
10 percent likelihood that the element should start somewhere other than the
place indicated, without however giving an alternative location; there
is thus a 6 percent chance (0.1 × 0.6) that none of the alternatives given is
correct.
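The arithmetic behind these figures can be checked with a few lines of Python. This is only a worked calculation using the numbers given in the prose above; the variable names are illustrative.

```python
# Confidence that "Saybrook" is (part of) a personal name or a place name.
p_pers_name = 0.6
p_place_name = 0.4

# Each reading: confidence in the element type multiplied by the
# conditional confidence in where the element starts.
readings = {
    "<persName>Saybrook</persName>": p_pers_name * 0.9,
    "<placeName>Saybrook</placeName>": p_place_name * 0.5,
    "<placeName>old Saybrook</placeName>": p_place_name * 0.5,
}

covered = sum(readings.values())
unaccounted = 1.0 - covered  # probability that no listed reading is correct

for markup, p in readings.items():
    print(f"{p:.2f}  {markup}")
print(f"{unaccounted:.2f}  none of the listed alternatives")
```

The remainder equals 0.6 × 0.1: the personal-name reading with a start location other than the one marked, for which no alternative is recorded.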
If an attribute value is uncertain, the locus attribute
takes as its value the name of the attribute in question. In this
example, there is only a 50 percent chance that the question was spoken by
participant A:
<u id="u1" who="a">Have you heard the election results?</u>
<!-- ... -->
<certainty target="u1" locus="who" degree="0.5"/>
Doubts about whether the transcription is correct may be expressed
by assigning to locus the value
‘#transcribedContent’. For example, if the source is
hard to read and so the transcription is uncertain:
I have a <emph id="p1">gub</emph>.
<certainty target="p1" locus="#transcribedContent" degree="0.5"/>
Degrees of confidence in the proper expansion of abbreviations may
also be expressed, by using the value ‘#suppliedContent’:
You will want to use <expan id="e1" abbr="SGML">Standard
Generalized Markup Language</expan> ...
<!-- ... -->
<certainty target="e1" locus="#suppliedContent" degree="0.9"/>
The assertedValue attribute should be used to provide an
alternative value for whatever aspect of the markup is in doubt: an
alternative generic identifier or the identifier of an alternative starting or
ending point (both shown above), an alternative attribute value, or
alternative element content, as in this example:
I have a <emph id="p1">gub</emph>.
<certainty target="p1" locus="#transcribedContent" assertedValue="gun" desc="a gun makes more sense in a holdup" degree="0.8"/>
Since attribute values have no internal substructure, the
assertedValue attribute is useful for specifying alternative
transcriptions only in relatively restricted circumstances
(specifically, when the alternative reading has no elements nested within
it). More robust methods of handling uncertainties of transcription are
the <unclear> element and the <app> and <rdg> elements described in
chapter 19 Critical Apparatus.
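The different kinds of locus value discussed above can be resolved mechanically against the target element. The Python sketch below reports, for each <certainty> element, the value currently in the markup alongside any asserted alternative. It assumes, as the examples in this section suggest, that a locus beginning with ‘#’ names an aspect of the markup and that any other value names an attribute; the function name doubted_values and the <doc> wrapper are hypothetical, and only the ‘#’ values handled in the code are interpreted.

```python
import xml.etree.ElementTree as ET

DOC = """<doc>
<u id="u1" who="a">Have you heard the election results?</u>
I have a <emph id="p1">gub</emph>.
<certainty target="u1" locus="who" degree="0.5"/>
<certainty target="p1" locus="#transcribedContent" assertedValue="gun" degree="0.8"/>
</doc>"""


def doubted_values(xml_text):
    """List (target id, locus, current value, asserted alternative)."""
    root = ET.fromstring(xml_text)
    by_id = {el.get("id"): el for el in root.iter() if el.get("id")}
    report = []
    for cert in root.iter("certainty"):
        locus = cert.get("locus", "")
        for ref in cert.get("target", "").split():
            target = by_id[ref]
            if locus == "#gi":
                current = target.tag                   # the generic identifier
            elif locus in ("#transcribedContent", "#suppliedContent"):
                current = "".join(target.itertext())   # the element content
            elif locus.startswith("#"):
                current = None                         # aspect not handled here
            else:
                current = target.get(locus)            # a named attribute
            report.append((ref, locus, current, cert.get("assertedValue")))
    return report


report = doubted_values(DOC)
for row in report:
    print(row)
```

Because assertedValue is an attribute, the alternative is always a flat string, which is the restriction noted above for alternative transcriptions.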
The <certainty> element allows for indications of uncertainty to
be structured with at least as much detail and clarity as appears to be
currently required in most ongoing text projects.
It is expected that in the future more adequate systems for expressing
uncertainty will be developed. These may extend the <certainty>
element or they may make use of the feature-structure encoding
mechanisms described in chapter 16 Feature Structures.
The <certainty> element and the other TEI mechanisms for
indicating uncertainty provide a range of methods of graduated
complexity. Simple expressions of uncertainty may be made by using the
<note> element. This is simple and convenient, and can
accommodate either a discursive and unstructured indication of uncertainty, or
a complex and structured but probably project-specific expression of uncertainty. In
general, however, unless special steps are taken, the <note>
element does not provide as much expressive power as the
<certainty> element, and in cases where highly structured
certainty information must be given, it is recommended that the
<certainty> element be used.
The <certainty> element may be used for simple unqualified
indications of uncertainty, in which case only the locus
and target attributes need be specified.
In more complex cases, the
other attributes may be used to provide fuller information. While
these attributes may take any string of characters as value, the recommended
values should be used wherever possible; if they are not appropriate
in a given situation, encoders should provide their own controlled
vocabulary and document it in the <encodingDesc> or
<tagUsage> elements of the TEI header.
The <certainty> element has the following formal declaration:
<!-- 17.1.2: Certainty and uncertainty-->
<!--Text Encoding Initiative Consortium:
Guidelines for Electronic Text Encoding and Interchange.
Document TEI P4, 2002.
Copyright (c) 2002 TEI Consortium. Permission to copy in any form
is granted, provided this notice is included in all copies.
These materials may not be altered; modifications to these DTDs should
be performed only as specified by the Guidelines, for example in the
chapter entitled 'Modifying the TEI DTD'
These materials are subject to revision by the TEI Consortium. Current versions
are available from the Consortium website at http://www.tei-c.org-->
<!ELEMENT certainty %om.RO; EMPTY>
<!ATTLIST certainty
%a.global;
target IDREFS #REQUIRED
locus CDATA #REQUIRED
assertedValue CDATA #IMPLIED
desc CDATA #IMPLIED
given CDATA #IMPLIED
degree CDATA #IMPLIED
TEIform CDATA 'certainty' >
[declarations from 17.2: Responsibility for markup inserted here ]
<!-- end of 17.1.2-->