The following guidelines should be followed to obtain the best-quality (and least error-prone) results from the system.
The input to the system is a plain text (ASCII) file. It consists of one or more lines, separated Unix-fashion by 'newline' characters and terminated by the standard 'EOF' (end of file) marker. Each line consists of the text in normal orthography, with each 'word' separated from the next by one or more spaces; a 'word' consists of a word proper, perhaps preceded and/or followed by punctuation without intervening spaces.
If the text originates on a PC or Macintosh, the correct format is best obtained by using the 'Save as' option and specifying the 'Text Only w/line breaks' format, which is available in most word processing packages.
CLAWS determines sentence breaks according to normal orthography. Therefore a sentence-terminating full stop (or other character) should be followed by at least one space, and the first word in a sentence should be capitalised.
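The input conventions above can be checked mechanically before a file is submitted. The following is a minimal sketch in Python, not part of CLAWS itself: the function name check_input and the missing-space heuristic are assumptions made for illustration. The heuristic will raise false alarms on abbreviations such as "U.S.A", so its reports are prompts for manual inspection rather than definite errors.

```python
import re

# Heuristic (an assumption, not a CLAWS rule): a sentence terminator
# directly followed by a capital letter suggests a missing space.
MISSING_SPACE = re.compile(r"[.!?][A-Z]")


def check_input(text):
    """Return a list of human-readable warnings about the input text."""
    problems = []
    for lineno, line in enumerate(text.split("\n"), start=1):
        if not line.isascii():
            problems.append(f"line {lineno}: non-ASCII character")
        if "\r" in line:
            problems.append(f"line {lineno}: carriage return (non-Unix line ending)")
        if MISSING_SPACE.search(line):
            problems.append(f"line {lineno}: no space after sentence terminator")
    return problems


if __name__ == "__main__":
    sample = "The fox ran.The dog slept.\nA second line.\n"
    for warning in check_input(sample):
        print(warning)
```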
The restrictions which apply to the character set of a text are as follows:
SGML entities should be used to encode extended character set symbols and accented letters. Lists of these can be found on the web in the HTML 4 specification: http://www.w3.org/TR/REC-html40/sgml/entities.html#iso-88591
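As an illustration of the entity encoding, the sketch below uses the table of HTML 4 entity names shipped with Python's standard library (html.entities.codepoint2name) to replace accented letters and other non-ASCII characters. The function name to_sgml_entities is hypothetical, and the fallback to a numeric character reference for characters without a named entity is an assumption; this document does not say whether CLAWS accepts numeric references.

```python
from html.entities import codepoint2name


def to_sgml_entities(text):
    """Replace each non-ASCII character with a named entity where one exists."""
    out = []
    for ch in text:
        code = ord(ch)
        if code < 128:
            # Plain ASCII passes through unchanged.
            out.append(ch)
        elif code in codepoint2name:
            out.append(f"&{codepoint2name[code]};")
        else:
            # Assumption: no named entity is available, so fall back to a
            # numeric character reference. Check such cases by hand.
            out.append(f"&#{code};")
    return "".join(out)


if __name__ == "__main__":
    print(to_sgml_entities("caf\u00e9 d\u00e9j\u00e0 vu"))
```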
The standard entities are:
SGML tags may be included in the text to mark up various features or to act as a header giving information about the text. The tag names should be declared in a configuration file (see section on running CLAWS).
By default, a start and end marker for the text encoded as an SGML tag should be included in every file to be run through CLAWS. For example,
<text> The quick brown fox jumps over the lazy dog. </text>
CLAWS outputs a verticalised form of the text where each word has a list of possible POS tags. The most likely tag is the first in the list.
0000001 001 **6;0;START 01 NULL
0000001 002 ----------------------------------------------------
0000003 010 The 93 AT
0000003 020 quick 93 [JJ/99] RR@/1 NN1%/0
0000003 030 brown 93 [JJ/93] NN1@/7 VV0%/0
0000003 040 fox 93 [NN1/100] VV0@/0
0000003 050 jumps 93 [VVZ/97] NN2@/3
0000003 060 over 93 [II/59] RP/41 NN1%/0 JJ%/0
0000003 070 the 93 AT
0000003 080 lazy 93 JJ
0000003 090 dog 93 [NN1/100] VV0%/0
0000003 091 . 03 .
0000004 001 **7;7;text 01 NULL
The reference number at the start of each line shows which line of the input file a word comes from. Sentence breaks are identified by lines of hyphens. The two-digit number to the left of the POS tags is a decision code produced by CLAWS to aid manual postediting. Each POS tag on an ambiguous word is followed by a slash and a likelihood value, expressed as a percentage.
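The vertical output format described above can be read back into a program for further processing. The following Python sketch assumes that fields are separated by whitespace and that each output line holds a reference number, a position within the line, the word, the decision code and then one or more tags. The names Token, parse_tag and parse_vertical are hypothetical. Markers attached to tags, such as '@' and '%', are kept verbatim because this section does not define them.

```python
import re
from typing import List, NamedTuple, Optional, Tuple


class Token(NamedTuple):
    line_ref: int        # line of the input file the word comes from
    position: int        # position number within that line
    word: str
    decision_code: str   # two-digit code, kept as text
    tags: List[Tuple[str, Optional[int]]]  # (tag, likelihood %) pairs


# A tag field looks like "[JJ/99]", "RR@/1" or a bare "AT".
TAG = re.compile(r"^\[?([^/\]]+)(?:/(\d+))?\]?$")


def parse_tag(field):
    """Split one tag field into (tag, likelihood); likelihood may be None."""
    m = TAG.match(field)
    if m is None:
        return (field, None)
    likelihood = int(m.group(2)) if m.group(2) else None
    return (m.group(1), likelihood)


def parse_vertical(output):
    """Parse vertical output into Tokens, skipping sentence-break lines."""
    tokens = []
    for line in output.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue  # blank line or a line of hyphens
        tokens.append(Token(int(fields[0]), int(fields[1]), fields[2],
                            fields[3], [parse_tag(f) for f in fields[4:]]))
    return tokens


if __name__ == "__main__":
    sample = ("0000001 002 ----------\n"
              "0000003 010 The 93 AT\n"
              "0000003 020 quick 93 [JJ/99] RR@/1 NN1%/0\n")
    for tok in parse_vertical(sample):
        # The most likely tag is the first in the list.
        print(tok.word, tok.tags[0][0])
```

Note that supp-file references such as **6;0;START are returned as ordinary tokens by this sketch; callers that want running text only need to filter them out.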
The first line of this example (**6;0;START) contains a reference to a supplementary (supp) file produced by CLAWS. The supp file contains words in the input text which are longer than 25 characters and SGML tags which contain a space. The start text symbol is always copied to the supp file along with any text in the file preceding it. The two numbers in the supp file reference give the number of characters transferred (six in this case) and the starting point in the supp file to which this reference points.
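A supp file reference of this form can be split into its parts as sketched below in Python. The names SuppRef and parse_supp_ref are hypothetical, and treating the text after the second semicolon as a free-form label is an assumption, since this section describes only the two numbers.

```python
import re
from typing import NamedTuple, Optional


class SuppRef(NamedTuple):
    length: int   # number of characters transferred to the supp file
    offset: int   # starting point within the supp file
    label: str    # trailing text, e.g. "START" (meaning assumed)


SUPP_REF = re.compile(r"^\*\*(\d+);(\d+);(.*)$")


def parse_supp_ref(word) -> Optional[SuppRef]:
    """Return a SuppRef if the word is a supp reference, otherwise None."""
    m = SUPP_REF.match(word)
    if m is None:
        return None
    return SuppRef(int(m.group(1)), int(m.group(2)), m.group(3))


if __name__ == "__main__":
    print(parse_supp_ref("**6;0;START"))
    print(parse_supp_ref("fox"))
```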