element token, attribute skip is used
in two situations:
1. Simulate a simple chunker for languages with flexible word order,
e.g., for matching errors of rection (government); a rule could, for example,
skip possible adverbs. skip="1" works exactly like a pair of rules, i.e.
<token skip="1">A</token>
<token>B</token>
is equivalent to the pair of rules:
<token>A</token>
<token/>
<token>B</token>
<token>A</token>
<token>B</token>
Using a negative value (skip="-1"), we can match until B is found, no matter
how many tokens are skipped. This cannot easily be encoded using empty
tokens as above, because the sentence could be of any length.
2. Match coordinated words, for example to match
"both... as well" we could write:
<token skip="-1">both<exception scope="next">and</exception></token>
<token>as</token>
<token>well</token>
Here the exception is applied only to the skipped tokens.
The scope attribute of the exception makes the exception valid only for
the token in which it is specified (scope="current") or only for the
skipped tokens (scope="next"). The default is scope="current".
Scopes are useful where several different exceptions should be
applied to avoid false alarms. In some cases, it is useful to use
scope="previous" in rules that already have skip="-1".
This way, you can set an exception against the single token that immediately
precedes the matched token. For example, we want to match "tak" after "jak"
when it is not preceded by a comma:
<token>tak</token>
<token skip="-1">jak</token>
<token>tak<exception scope="previous">,</exception></token>
In this case, the rule excludes all sentences in which there is a comma
immediately before "tak". Note that it is very hard to express such an exclusion otherwise.
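As a minimal sketch of the negative skip value from situation 1, using the same placeholder words A and B as above (the comments describe the behavior stated in this section, not tested output; "C" is a hypothetical placeholder):
<!-- Matches "A B", "A x B", "A x y z B", ...: any number of tokens
     between A and B may be skipped. -->
<token skip="-1">A</token>
<token>B</token>
<!-- Same pattern, but the exception is checked against the skipped
     tokens, so a "C" between A and B prevents the match. -->
<token skip="-1">A<exception scope="next">C</exception></token>
<token>B</token>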
3. Using variables in rules
In XML rules, you can refer to previously matched tokens in the pattern. For example:
<pattern mark_from="2">
<token regexp="yes" skip="-1">ani|ni|i|lub|albo|czy|oraz<exception scope="next">,</exception></token>
<token><match no="0"/></token>
</pattern>
This rule matches sequences like ani... ani, ni... ni, i... i but you don't have to
write all these cases explicitly. The first match (matches are numbered from zero, so it's
<match no="0"/>) is automatically inserted into the second token. Note
that this rule will match sentences like:
Nie kupiłem ani gruszek ani jabłek. Kupię to lub to lub tamto.
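To see what the back-reference saves, here is a sketch of the explicit pattern that would otherwise be needed for the "ani... ani" case alone, with one such pattern repeated for every conjunction in the list (this expansion is only an illustration and is not taken from any rule file):
<pattern>
<token skip="-1">ani<exception scope="next">,</exception></token>
<token>ani</token>
</pattern>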
A similar mechanism can be used in suggestions. There, however, more features are available, and tokens are
numbered from 1 (for compatibility with the older notation \1 for the first matched token). For example:
<suggestion><match no="1"/></suggestion>
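As a sketch of the numbering in suggestions, the two messages below are meant to refer to the same tokens, once in the older backslash notation and once with <match/> elements. The message text is hypothetical, and the example assumes a pattern with at least two tokens:
<message>Did you mean <suggestion>\1 \2</suggestion>?</message>
<message>Did you mean <suggestion><match no="1"/> <match no="2"/></suggestion>?</message>
Note that <match no="1"/> in a suggestion refers to the first token of the pattern, whereas inside a <token> the first token is <match no="0"/>.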
A more complicated example:
<pattern>
<token regexp="yes">^(\p{Lu}{2}+[i]*\p{Lu}+[\p{L}&amp;&amp;[^\p{Lu}]]{1,4}+)</token>
</pattern>
<message>Prawdopodobny błąd zapisu odmiany;
skrótowce odmieniamy z dywizem:
<suggestion><match no="1" regexp_match="^(\p{Lu}{2}+[i]*\p{Lu}+)([\p{L}&amp;&amp;[^\p{Lu}]]{1,4}+)" regexp_replace="$1-$2"/></suggestion></message>
This rule matches Polish inflected acronyms such as "SMSem" that should be written with
a hyphen: "SMS-em". The acronym is matched with a complicated regular expression, and the
<match> element rewrites the matched token using Java regular expression notation: regexp_match
captures two groups (the uppercase acronym and its lowercase ending), and regexp_replace inserts a
hyphen between them. For "SMSem", $1 is "SMS" and $2 is "em", which gives "SMS-em".
For some languages (currently Polish and English), element <match/> can be used to
insert an inflected matched token (or another word with a specified part of speech
tag). For example:
<pattern mark_from="1" mark_to="-1">
<token regexp="yes">has|have</token>
<token postag="VBD|VBP|VB" postag_regexp="yes"><exception postag="VBN|NN:U.*|JJ.*|RB" postag_regexp="yes"/></token>
<token><exception postag="VBG"/></token>
</pattern>
<message>Possible agreement error -- use past participle here: <suggestion><match no="2" postag="VBN"/></suggestion>.</message>
The above rule takes the second verb with a POS tag "VBD", "VBP" or "VB" and displays its
form with the POS tag "VBN" in the suggestion. You can also specify POS tags using
regular expressions (postag_regexp="yes") and replace POS tags, just as the
acronym example above replaces text. This is useful for large and complicated
tagsets (for many examples, see the Polish rule file: rules/pl/grammar.xml).
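A minimal sketch of tag replacement in a suggestion. The tag layout (a "sg"/"pl" number field between colons) is a made-up illustration, and the attribute name postag_replace is an assumption modeled on regexp_replace above; check it against the rule schema before relying on it:
<suggestion><match no="1" postag="(.*):sg:(.*)" postag_regexp="yes" postag_replace="$1:pl:$2"/></suggestion>
If the assumption holds, the suggestion would show the first matched token inflected according to the rewritten (plural) tag.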
Sometimes the rule should change the case of the matched word. For this purpose,
you can use the case_conversion attribute with one of the values startlower,
startupper, allupper or alllower.
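For instance, a suggestion that repeats the first matched token with an initial capital could be sketched as follows (the message text is hypothetical):
<message>This word should be capitalized: <suggestion><match no="1" case_conversion="startupper"/></suggestion>.</message>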
Another useful thing is that <match> can refer to a token, but apply its POS
to another word. This is useful for suggesting another word with the same part
of speech. There is a special abbreviated syntax used for this purpose:
<match no="1" postag="verb:.*perf">kierować</match>
This syntax means: take the POS tag of the first matched token that matches the regular expression specified
in the postag attribute, and then apply this POS tag to the verb "kierować". This way the verb
will be inflected just the way the matched verb was originally inflected. The reason why you
need to specify the POS tag is that the matched token can have several POS tags (several readings).
Note that by default the <match> element inside the <token> element inserts only a string,
so it matches a string and not part-of-speech tags. Even if it refers to
a token with a POS tag, it copies the matched token and not its POS tag. However,
you can use all of the above attributes to change the form of the token.
You can, however, use the <match> element to copy POS tags alone; to do so,
you must use the attribute setpos="yes". All other attributes can still be applied, so that
the POS can be converted appropriately. This can be useful for creating rules that specify grammatical
agreement. Currently, such rules must be quite wordy; a somewhat terser syntax is in
development.
4. Turning the rule off
Some rules can be optional, useful only in specific registers,
or very sensitive. You can turn them off by default by using the
attribute default="off". The user can turn the rule on in the
Options dialog box, and this setting is saved in the configuration
file.
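A sketch of where the attribute goes, assuming it is placed on the rule element itself; the id and name values are hypothetical placeholders:
<rule id="EXAMPLE_OPTIONAL_RULE" name="Hypothetical optional rule" default="off">
<!-- pattern, message and examples go here, as in any other rule -->
</rule>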