Leech, G., Garside, R., and Bryant, M. (1994). CLAWS4: The tagging of the British National Corpus. In Proceedings of the 15th International Conference on Computational Linguistics (COLING 94), Kyoto, Japan, pp. 622-628.
CLAWS4: THE TAGGING OF THE BRITISH NATIONAL CORPUS
 
Geoffrey Leech, Roger Garside and Michael Bryant
UCREL
Lancaster University
United Kingdom
 
1.  INTRODUCTION
The main purpose of this paper is to describe the 
CLAWS4 general-purpose grammatical tagger, used 
for the tagging of the 100-million-word British 
National Corpus, a task completed in July 1994 [Footnote 1].
We will emphasise the goals of (a) general-
purpose adaptability, (b) incorporation of 
linguistic knowledge to improve quality and 
consistency, and (c) accuracy, measured 
consistently and in a linguistically informed 
way. 
 
The British National Corpus (BNC) consists of 
c.100 million words of English written texts and 
spoken transcriptions, sampled from a 
comprehensive range of text types.  The BNC 
includes 10 million words of spoken language, 
c.45% of which is impromptu conversation (see 
Crowdy, forthcoming). It also includes an immense 
variety of written texts, including unpublished 
materials. The grammatical tagging of the corpus 
has therefore required the "super-robustness" of 
a tagger which can adapt well to virtually all 
kinds of text. The tagger has also had to be versatile in dealing with different tagsets (sets of grammatical category labels - see 3 below) and in accepting text in varied input formats.  For the
purposes of the BNC, the tagger has been required 
both to accept and to output text in a corpus-
oriented TEI-conformant mark-up format known as 
CDIF (Corpus Document Interchange Format), but 
within this format many variant formats 
(affecting, for example, segmentation into words 
and sentences) can be readily accepted. In 
addition, CLAWS allows variable output formats: 
for the current tagger, these include (a) a 
vertically-presented format suitable for manual 
editing, and (b) a more compact horizontally-
presented format often more suitable for end-
users. Alternative output formats are also 
allowed with (c) so-called "portmanteau tags", 
i.e. combinations of two alternative tags, where 
the tagger calculates there is insufficient 
evidence for safe disambiguation, and (d) with 
simplified "plain text" mark-up for the human 
reader.  
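To illustrate the difference between output formats (a) and (b), the following minimal Python sketch prints the same word-tag pairs in a vertical and in a horizontal layout. The exact CLAWS4/CDIF field layout is not reproduced here: the tab-separated and underscore-joined formats below are assumptions made purely for illustration.

# Illustrative sketch only: the exact CLAWS4/CDIF field layouts are not
# reproduced here; the simple formats below are assumed for illustration.

tagged = [("The", "AT0"), ("dog", "NN1"), ("barked", "VVD"), (".", "PUN")]

def vertical(pairs):
    """One word-tag pair per line, convenient for manual post-editing."""
    return "\n".join(f"{word}\t{tag}" for word, tag in pairs)

def horizontal(pairs):
    """Compact inline form, word and tag joined by an underscore."""
    return " ".join(f"{word}_{tag}" for word, tag in pairs)

print(vertical(tagged))
print(horizontal(tagged))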
 
CLAWS4, the BNC tagger[Footnote 2],
incorporates many features of adaptability such as the above.  It 
also incorporates many refinements of linguistic 
analysis which have built up over 14 years: 
particularly in the construction and content of 
the idiom-tagging component (see 2 below).  At 
the same time, there are still many improvements 
to be made: the claim that "you can put together 
a tagger from scratch in a couple of months" 
(recently heard at a research conference) is, in 
our view, absurdly optimistic. 
 
2.   THE DESIGN OF THE GRAMMATICAL TAGGER (CLAWS4) 
The CLAWS4 tagger is a successor of the CLAWS1 
tagger described in outline in Marshall (1983), 
and more fully in Garside et al (1987), and has 
the same basic architecture. The system (if we 
include input and output procedures) has five 
major sections: 
 
- Section (a): segmentation of text into word and sentence units
- Section (b): initial (non-contextual) part-of-speech assignment [using a lexicon, word-ending list, and various sets of rules for tagging unknown items]
- Section (c): rule-driven contextual part-of-speech assignment
- Section (d): probabilistic tag disambiguation [Markov process]
- Section (c'): second pass of Section (c)
- Section (e): output in intermediate form
The intermediate form of text output is the form 
suitable for post-editing (see 1 above), which 
can then be converted into other formats 
according to particular output needs, as already 
noted.
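The following Python sketch shows, in much-simplified form, how Sections (a)-(e) fit together as a pipeline. The toy lexicon, the data structures and the stub functions are our own illustrative assumptions; they do not reproduce the CLAWS4 implementation itself.

# A purely illustrative sketch of the five-section CLAWS4 pipeline.
# The toy lexicon and all data structures below are assumptions made for
# illustration; they do not reproduce the real CLAWS4 components.

TOY_LEXICON = {"the": ["AT0"], "dog": ["NN1"], "barks": ["VVZ", "NN2"]}

def segment(text):                       # Section (a): segmentation into word units
    return text.lower().split()

def assign_initial_tags(tokens):         # Section (b): lexicon lookup (word-ending rules omitted)
    return [(t, TOY_LEXICON.get(t, ["NN1"])) for t in tokens]

def apply_idiom_rules(tagged, pass_no):  # Sections (c)/(c'): rule-driven assignment (rules omitted)
    return tagged

def disambiguate(tagged):                # Section (d): stand-in for probabilistic disambiguation
    return [(w, tags[0]) for w, tags in tagged]

def tag_text(text):
    tagged = assign_initial_tags(segment(text))
    tagged = apply_idiom_rules(tagged, pass_no=1)
    best = disambiguate(tagged)
    best = apply_idiom_rules(best, pass_no=2)
    return best                          # Section (e): intermediate form, ready for post-editing

print(tag_text("The dog barks"))         # [('the', 'AT0'), ('dog', 'NN1'), ('barks', 'VVZ')]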
 
The pre-processing section (a) is not trivial, 
since, in any large and varied corpus, there is a 
need to handle unusual text structures (such as 
those of many popular and technical magazines), 
less usual graphic features (e.g. non-roman 
alphabetic characters, mathematical symbols), and 
features of conversation transcriptions: e.g. 
false starts, incomplete words and utterances, 
unusual expletives, unplanned repetitions, and 
(sometimes multiple) overlapping speech.   
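As a small illustration of the kind of decision Section (a) must take for spoken material, the sketch below splits an utterance into word units and flags incomplete words. The trailing-hyphen convention used for incomplete words is an assumption adopted purely for this example, not a CDIF convention.

import re

# Illustrative only: a fragment of the kind of pre-processing needed for
# conversation transcriptions. The trailing-hyphen convention for incomplete
# words is an assumption made for this example, not a CDIF convention.

def pre_process(utterance):
    """Split an utterance into word units, flagging incomplete words."""
    tokens = []
    for tok in re.findall(r"\S+", utterance):
        incomplete = tok.endswith("-") and len(tok) > 1
        tokens.append({"word": tok.rstrip("-") if incomplete else tok,
                       "incomplete": incomplete})
    return tokens

print(pre_process("I was go- going to the er shop"))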
 
Sections (b) and (d) apply essentially a Hidden 
Markov Model (HMM) to the assignment and 
disambiguation of tags.  But the intervening section (c) has become increasingly important as CLAWS4 has developed to meet the need for versatility across a range of text types.  This task of rule-
driven contextual part-of-speech assignment began 
in 1981 as an "idiom-tagging" program for 
dealing, in the main, with parts of speech 
extending over more than one orthographic word 
(e.g. complex prepositions such as according to 
and complex conjunctions such as so that).  In 
the more fully developed form it now has, this 
section utilises several different idiom lexicons 
dealing, for example, with (i) general idioms 
such as as much as (which is on one analysis a 
single coordinator, and on another analysis, a 
word sequence), (ii) complex names such as Dodge 
City and Mrs Charlotte Green (where the capital 
letter alone would not be enough to show that 
Dodge and Green are proper nouns), (iii) foreign 
expressions such as annus horribilis.
   
 
These idiom lexicons (over 3000 entries in all) 
can match on both tags and word-tokens, employing 
a regular expression formalism at the level of 
the individual item and the sequence of items. 
Recognition of unspecified words with initial 
capitals is also incorporated. Conceptually, each 
entry has two parts: (a) a regular-expression-
based "template" specifying a set of conditions 
on sequences of word-tag pairs, and (b) a set of 
tag assignments or substitutions to be performed 
on any sequence matching the set of conditions in 
(a).  Examples of entries from each of the above 
kinds of idiom lexicon entry are: 
 
- as AV0, ([ ])5, as CJS, possible AJ0 
 
 
- Monte/Mount/Mt NP0, ([WIC])2 NP0, [WIC] NP0 
 
 
- ad AV021 AJ021, hoc AV022 AJ022 
 
 
An explanatory note on the formalism used in the above idiom rule examples is given at the end of this paper.
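As a rough indication of how such templates can be applied, the following Python sketch matches a much-simplified encoding of the first entry above against a sequence of word-tag pairs. The (word, tag, optional-gap) tuple encoding and the matching procedure are our own assumptions for illustration; they do not reproduce the CLAWS4 regular-expression formalism described in the Explanatory Note.

# A much-simplified sketch of idiom-template matching over word-tag pairs.
# The template encoding (word, tag, max optional skip) is an illustrative
# assumption of ours, not the actual CLAWS4 formalism.

# Loosely corresponds to the entry: as AV0, ([ ])5, as CJS, possible AJ0
TEMPLATE = [
    ("as", "AV0", 0),
    (None, None, 5),          # up to 5 unspecified intervening words
    ("as", "CJS", 0),
    ("possible", "AJ0", 0),
]

def matches(template, pairs, i=0):
    """Return True if the template matches the word-tag pairs starting at position i."""
    if not template:
        return True
    word, tag, max_skip = template[0]
    if word is None:                      # optional gap of up to max_skip words
        return any(matches(template[1:], pairs, i + k)
                   for k in range(max_skip + 1))
    if i < len(pairs) and pairs[i][0].lower() == word and pairs[i][1] == tag:
        return matches(template[1:], pairs, i + 1)
    return False

sentence = [("as", "AV0"), ("much", "DT0"), ("as", "CJS"), ("possible", "AJ0")]
print(matches(TEMPLATE, sentence))        # True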
 
We have also now moved to a more complex, two-
pass application of these idiomlist entries. It 
is possible, on the first pass, to specify an 
ambiguous output of an idiom assignment (as is 
necessary, e.g., for as much as, mentioned 
earlier), so that this can then be input to the 
probabilistic disambiguation process (d). On the 
second pass, however, after probabilistic 
disambiguation, the idiom entry is deterministic 
in both its input and output conditions, 
replacing one or more tags by others.  In effect, 
this last kind of idiom application is used to 
correct a tagging error arising from earlier
procedures.  For example, a not uncommon result 
from Sections (a)-(d) is that the base form of 
the verb (e.g. carry) is wrongly tagged as a 
finite present tense form, rather than an 
infinitive. This can be retrospectively corrected 
by replacing VVB (= finite base form) by VVI (= 
infinitive). 
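A minimal sketch of this second-pass, deterministic kind of correction is given below. It assumes TO0 as the tag of infinitival to (as in the C5 tagset) and expresses as a Python function what CLAWS4 states declaratively as an idiomlist entry; it illustrates the principle rather than the actual rule.

# A sketch of a second-pass, deterministic correction of the kind described
# above. The single rule here (assuming TO0 as the tag of infinitival "to")
# is ours for illustration; CLAWS4 expresses such rules as idiomlist entries.

def correct_base_forms(tagged):
    """Replace VVB (finite base form) by VVI (infinitive) after infinitival 'to'."""
    out = list(tagged)
    for i in range(1, len(out)):
        word, tag = out[i]
        if tag == "VVB" and out[i - 1][1] == "TO0":
            out[i] = (word, "VVI")
    return out

print(correct_base_forms([("to", "TO0"), ("carry", "VVB"), ("on", "AVP")]))
# [('to', 'TO0'), ('carry', 'VVI'), ('on', 'AVP')]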
 
While the HMM-type process employed in Sections 
(b) and (d) affirms our faith in probabilistic 
methods, the growing importance of  the 
contextual part-of-speech assignment in (c) and 
(c') demonstrates the extent to which it is 
important to transcend the limitations of the 
orthographic word, as the basic unit of 
grammatical tagging, and also to selectively 
adopt non-probabilistic solutions.  The term 
"idiom-tagging" was never particularly 
appropriate for these sections, which now handle 
more generally the interdependence between 
grammatical and lexical processing which NLP 
systems ultimately have to cope with, and are 
also able to incorporate parsing information 
beyond the range of the one-step Markov process 
(based on tag bigram frequencies) employed in
(d)[Footnote 3].  Perhaps
the term "phraseological component" 
would be more appropriate here.  The need to 
combine probabilistic and non-probabilistic 
methods in tagging has been widely noted (see, 
e.g., Voutilainen et al. 1992). 
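For readers unfamiliar with the one-step Markov approach used in Section (d), the following minimal Viterbi sketch shows how tag-bigram transition frequencies and word-tag frequencies combine to select a tag sequence. The toy probability tables are invented for illustration only; CLAWS4 estimates its frequencies from tagged corpus data and, as argued above, supplements the probabilistic model with the rule-driven components of (c) and (c').

import math

# A minimal Viterbi sketch of one-step (tag-bigram) disambiguation.
# The toy transition and lexical probabilities below are invented purely for
# illustration and are not drawn from the BNC or from CLAWS4.

TRANS = {("AT0", "NN1"): 0.6, ("AT0", "VVB"): 0.05,
         ("NN1", "VVZ"): 0.3, ("NN1", "NN2"): 0.1}
EMIT = {("the", "AT0"): 0.5, ("dog", "NN1"): 0.01,
        ("barks", "VVZ"): 0.005, ("barks", "NN2"): 0.002}

def viterbi(words, candidates):
    """Pick the most probable tag sequence; candidates gives one tag list per word."""
    # paths maps each current tag to (log probability, best sequence ending in it)
    paths = {t: (math.log(EMIT.get((words[0], t), 1e-6)), [t]) for t in candidates[0]}
    for word, tags in zip(words[1:], candidates[1:]):
        new_paths = {}
        for t in tags:
            def score_from(prev):
                return paths[prev][0] + math.log(TRANS.get((prev, t), 1e-6))
            best_prev = max(paths, key=score_from)
            score = score_from(best_prev) + math.log(EMIT.get((word, t), 1e-6))
            new_paths[t] = (score, paths[best_prev][1] + [t])
        paths = new_paths
    return max(paths.values())[1]

print(viterbi(["the", "dog", "barks"],
              [["AT0"], ["NN1", "VVB"], ["VVZ", "NN2"]]))
# ['AT0', 'NN1', 'VVZ']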
 
3.   EXTENDING ADAPTABILITY: SPOKEN DATA AND TAGSETS 
The tagging of 10 million words of spoken data 
(including c.4.6 million words of conversation) 
presents particular challenges to the versatility 
of the system: renderings of spoken 
pronunciations such as 'avin'  (for having) cause 
difficulties, as do unplanned repetitions such as  
I er, mean, I mean, I mean to go.  Our solution 
to the latter problem has been to recognize 
repetitions by a special procedure, and to 
disregard the second and subsequent occurrences 
of the same word or phrase for the purposes of 
tagging.  It has become clear that the CLAWS4 
datastructures (lexicon, idiomlists, and tag 
transition matrix), developed for written 
English, need to be adapted if certain frequent 
and rather consistent errors in the tagging of 
spoken data are to be avoided (words such as I, 
well, and right are often wrongly tagged, because 
their distribution in conversation differs 
markedly from that in written texts).  We have 
moved in this direction by allowing CLAWS4 to 
"slot in" different datastructures according to 
the text type being processed, by e.g. providing 
a separate lexicon and idiomlist for the spoken 
material.  Eventually, probabilistic analysis of 
the tagged BNC will provide the necessary 
information for adapting datastructures at run 
time to the special demands of particular types 
of data, but there is much work to be done before 
this potential benefit of having tagged a large 
corpus is realised. 
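The sketch below illustrates the kind of repetition filtering described above: immediately repeated words or short phrases are flagged so that the second and subsequent occurrences can be disregarded during tagging. The algorithm is our own simplified illustration, not the CLAWS4 procedure, which also has to cope with intervening fillers such as er.

# Illustrative only: flag immediate repetitions of a word or short phrase so
# that the repeated copies can be ignored during tagging. This simple
# algorithm is an assumption of ours, not the CLAWS4 procedure.

def mark_repetitions(tokens, max_len=3):
    """Return (token, is_repeat) pairs, flagging repeats of the preceding phrase."""
    flags = [False] * len(tokens)
    i = 0
    while i < len(tokens):
        for n in range(max_len, 0, -1):          # try longer phrases first
            if i + 2 * n <= len(tokens) and tokens[i:i + n] == tokens[i + n:i + 2 * n]:
                flags[i + n:i + 2 * n] = [True] * n
                i += n                           # re-check from the start of the copy
                break
        else:
            i += 1
    return list(zip(tokens, flags))

print(mark_repetitions("I mean I mean I mean to go".split()))
# only the first 'I mean' is left unflagged for tagging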
 
The BNC tagging takes place within the context of 
a larger project, in which a major task 
(undertaken by OUCS at Oxford) is to encode the 
texts in a TEI-conformant mark-up (CDIF). Two 
tagsets have been employed: one, more detailed 
than the other, is used for tagging a 2-million-
word Core Corpus (an epitome of the whole BNC), 
which is being post-edited for maximum accuracy. 
Thus tagsets, like text formats and 
datastructures, are among the features which are 
task-definable in CLAWS4.  In general, the system 
has been revised to allow many adaptive decisions 
to be made at run time, and to render it suitable 
for non-specialist researchers to use. 
 
4.  ERROR RATES AND WHAT THEY MEAN 
Currently, judged in terms of major categories 
[Footnote 4], the system has an error-rate of approximately 
1.5%, and leaves c.3.3% ambiguities unresolved 
(as portmanteau tags) in the output. However, it 
is all too easy to quote error rates, without 
giving enough information to enable them to be 
properly assessed. We believe that any evaluation 
of the accuracy of automatic grammatical tagging 
should take account of a number of factors, some 
of which are extremely difficult to measure: 
 
- (1) Consistency. It is necessary to measure 
tagging practice against some standard of  what 
is an appropriate tag for a given word in a 
given context.  For example, is horrifying in a 
horrifying adventure, or washing in  a washing 
machine an adjective, a noun, or a verb 
participle?  Only if this is specified 
independently, by an annotation scheme, can we 
feel confident in judging where the tagger is 
"correct" or "incorrect".  For the tagging of 
the LOB Corpus by the earliest version of 
CLAWS, the annotation scheme was published in 
some detail (Johansson et al 1986). We are 
working on a similar annotation scheme document 
(at present a growing in-house document) for 
the tagging of the BNC. 
 
 
- (2) Size of Tagset. It might be supposed that 
tagging with a more fine-grained tagset which 
contains more tags is more likely to produce 
error than tagging with a smaller and cruder 
tagset.  In the BNC project, we have used a 
tagset of 58 tags (the C5 tagset) for the whole 
corpus, and in addition we have used a larger 
tagset of 138 tags (the C6 tagset) 
[Footnote 5] for the Core Corpus of 2 million words.
The evidence so far is that this makes little difference to 
the error rate. But size of tagset must, in the 
absence of more conclusive evidence, remain a 
factor to be considered. 
 
 
- (3) Discriminative Value of Tags. The difficulty 
of grammatical tagging is directly related to 
the number of words for which a given tag 
distinction is made.  This measure may be 
called "discriminative value". For example, in 
the C5 tagset, one tag (VDI) is used for the 
infinitive of just one verb - to do - whereas 
another tag (VVI) is used for the infinitive of 
all lexical verbs.  On the other hand, VDB is 
used for finite base forms of to do (including 
the present tense, imperative, and 
subjunctive), whereas VVB is used for finite base forms of all lexical verbs.  It is clear that the tags VDI and VDB have a low discriminative 
value, whereas VVI and VVB have a high one - 
since there are thousands of lexical verbs in 
English.  It is also clear that a tagset of the 
lowest possible discriminative value - one 
which assigned a single tag to each word and a 
single word to each tag - would be utterly 
valueless. 
 
 
- (4) Linguistic Quality.  This is a very elusive, 
but crucial concept.  How far are the tags in a 
particular tagset valuable, by criteria either 
of linguistic theory/description, or of 
usefulness in NLP?  For example, the tag VDI, 
mentioned in (3) above, appears trivial, but it 
can be argued that this is nevertheless a 
useful category for English grammar, where the 
verb do (unlike its equivalent in most other 
European languages) has a very special 
function, e.g. in forming questions and 
negatives. On the other hand, if we had decided 
to assign a special tag to the verb become, 
this would have been more questionable.  
Linguistic quality is, on the face of it, 
determined only in a judgemental manner.  
Arguably, in the long term, it can be 
determined only by the contribution a 
particular tag distinction makes to success in 
particular applications, such as speech 
recognition or machine-aided translation.  At 
present, this issue of linguistic quality is 
the Achilles' heel of grammatical tagging 
evaluation, and we must note that without 
judgement on linguistic quality, evaluation in 
terms of (2) and (3) is insecurely anchored. 
 
It seems reasonable, therefore, to group criteria (2)-(4) together as "quality criteria", and to say that evaluation of tagging accuracy must be undertaken in conjunction with (i) consistency [How far has the annotation scheme been consistently applied?], and (ii) quality of tagging [How good is the annotation scheme?] [Footnote 6].  Error rates are useful
interim indications of success, but they have to
be corroborated by checking, if only impressionistically,
in terms of qualitative criteria.  Our work, since 1980, 
has been based on the assumption that qualitative 
criteria count, and that it is worth building 
"consensual" linguistic knowledge into the 
datastructures used by the tagger, to make sure 
that the tagger's decisions are fully informed by 
qualitative considerations.
 
References
Crowdy, S. (forthcoming). The BNC Spoken Corpus. In Leech, G., G. Myers and J. Thomas (eds), Spoken English on Computer. London: Longman.
Garside, R., G. Leech and G. Sampson (eds) (1987). The Computational Analysis of English: A Corpus-based Approach. London: Longman.
Johansson, S., E. Atwell, R. Garside and G. Leech (1986). The Tagged LOB Corpus: User's Manual. Bergen: Norwegian Computing Centre for the Humanities.
Marshall, I. (1983). Choice of grammatical word-class without global syntactic analysis: tagging words in the LOB Corpus. Computers and the Humanities, 17, 139-150.
Voutilainen, A., J. Heikkilä and A. Anttila (1992). Constraint Grammar of English: A Performance-Oriented Introduction. University of Helsinki: Department of General Linguistics.
Explanatory Note
Let TT be any tag, let ww be any word, and let n, m be arbitrary integers. Then:

- ww TT : represents a word and its associated tag
- , : separates a word from its predecessor
- [TT] : represents an already assigned tag
- [WIC] : represents an unspecified word with a Word Initial Capital
- TT/TT : means "either TT or TT"
- ww TT TT : represents an unresolved ambiguity between TT and TT
- [ ] : represents an unspecified word
- TT* : represents a tag with * marking the location of unspecified characters
- ([TT])n : represents the number of words (up to n) which may optionally intervene at a given point in the template
- TTnm : represents the "ditto tag" attached to an orthographic word to indicate it is part of a complex sequence (e.g. so that is tagged so CS021, that CS022). The variable n indicates the number of orthographic words in the sequence, and m indicates that the current word is in the mth position in that sequence.
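A small sketch of how ditto tags might be generated for a recognised multiword sequence is given below, following the so CS021 / that CS022 example. The helper function is our own illustration; CLAWS4 assigns these tags through the idiomlist mechanism described in Section 2.

# Illustrative helper: attach "ditto tags" TTnm to each word of an n-word
# sequence tagged TT, where m is the word's position in the sequence.
# Follows the paper's 'so CS021, that CS022' example; not CLAWS4 code.

def ditto_tags(words, tag):
    n = len(words)
    return [(w, f"{tag}{n}{m}") for m, w in enumerate(words, start=1)]

print(ditto_tags(["so", "that"], "CS0"))        # [('so', 'CS021'), ('that', 'CS022')]
print(ditto_tags(["according", "to"], "PRP"))   # [('according', 'PRP21'), ('to', 'PRP22')]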
Notes:
1. The BNC is the result of a collaboration, supported by the Science and Engineering Research Council (SERC Grant No. GR/F99847) and the UK Department of Trade and Industry, between Oxford University Press (lead partner), Longman Group Ltd., Chambers Harrap, Oxford University Computer Services, the British Library and Lancaster University. We thank Elizabeth Eyes, Nick Smith, and Andrew Wilson for their help in the preparation of this paper.
2. CLAWS4 has been written by Roger Garside, with CLAWS adjunct software programmed by Michael Bryant.
3. We have experimented with a two-step Markov process model (using tag trigrams), and found little benefit over the one-step model.
4. The error rate and ambiguity rate are less favourable if we take account of errors and ambiguities which occur within major categories. E.g. the portmanteau tag NP0-NN1 records confidently that a word is a noun, but not whether it is a proper or common noun. If such cases are added to the count, the estimated error rate rises to 1.78%, and the estimated ambiguity rate to 4.60%.
5. The tagset figures exclude punctuation tags and portmanteau tags.
6. An example of a consistency issue is: How is "Time" [the name of a magazine] tagged in the corpus? Is it always tagged NP0 (as a proper noun), or always NN1 (as a common noun), or sometimes NP0 and sometimes NN1? An example of a quality issue is: Is it worth distinguishing between proper nouns and common nouns, anyway?
 