LanguageTool

Development

This is a collection of the developer documentation available for LanguageTool. It's intended for people who want to understand LanguageTool so they can write their own rules or even add support for a new language. Software developers might also be interested in LanguageTool's API.

Help wanted!
We're looking for people to help us write new rules so LanguageTool can detect more errors. The languages that LanguageTool already supports but whose support needs to be improved are: English, German, Polish, Spanish, French, Italian, Dutch, Czech, Lithuanian, Ukrainian, and Slovenian.

How can you help?

  1. Read this page
  2. If you want to write rules in Java or add support for another language, check out LanguageTool from CVS.
  3. Subscribe to the mailing list.
  4. Try writing rules. For English and German, see the lists of errors on the Links page. Many of those errors are not yet detected.
  5. See the wiki for more tips and tricks

Installation and usage
Please see the README file that comes with LanguageTool and the Usage page.

Language checking process

  1. The text to be checked is split into sentences
  2. Each sentence is split into words
  3. Each word is assigned its part-of-speech tag(s) (e.g. cars = plural noun, talked = simple past verb)
  4. The analyzed text is then matched against the built-in rules and against the rules loaded from the grammar.xml file
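
The four steps above can be illustrated with a toy pipeline. This is only a sketch under simplifying assumptions: the class name, the two-word lexicon, the naive splitting and the single hard-coded rule are all hypothetical and are not LanguageTool's actual classes or algorithms.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy illustration of the checking process; not LanguageTool's real code.
class CheckPipelineSketch {

    // Hypothetical mini-lexicon; a real tagger would read a dictionary file.
    static final Map<String, String> LEXICON = new HashMap<>();
    static {
        LEXICON.put("cars", "NNS");   // plural noun
        LEXICON.put("talked", "VBD"); // simple past verb
    }

    // Step 1: naive sentence splitting after '.', '!' or '?'.
    static List<String> splitSentences(String text) {
        List<String> sentences = new ArrayList<>();
        for (String s : text.split("(?<=[.!?])\\s+")) {
            if (!s.isEmpty()) {
                sentences.add(s);
            }
        }
        return sentences;
    }

    // Step 2: naive word splitting on anything that is not a letter.
    static List<String> splitWords(String sentence) {
        List<String> words = new ArrayList<>();
        for (String w : sentence.split("[^\\p{L}]+")) {
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        return words;
    }

    // Step 3: look up a part-of-speech tag, or null if the word is unknown.
    static String tag(String word) {
        return LEXICON.get(word.toLowerCase());
    }

    // Step 4: apply one hard-coded rule (the same word twice in a row).
    // This particular rule looks only at words, not at their tags.
    static List<String> check(String text) {
        List<String> messages = new ArrayList<>();
        for (String sentence : splitSentences(text)) {
            List<String> words = splitWords(sentence);
            for (int i = 1; i < words.size(); i++) {
                if (words.get(i).equalsIgnoreCase(words.get(i - 1))) {
                    messages.add("Repeated word: " + words.get(i));
                }
            }
        }
        return messages;
    }

    public static void main(String[] args) {
        String text = "We talked. The cars cars stopped.";
        System.out.println(splitSentences(text));
        System.out.println(tag("talked"));
        System.out.println(check(text));
    }
}
```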

Adding new XML rules
Many rules are contained in rules/xx/grammar.xml, where xx is a language code like en or de. A rule is basically a pattern plus an error message that is shown to the user if the pattern matches. A pattern can address words or part-of-speech tags. Here are some examples of patterns that can be used in that file:

  • <token>think</token>
    matches the word think
  • <token regexp="yes">think|say</token>
    matches the regular expression think|say, i.e. the word think or say
  • <token postag="VB" /> <token>house</token>
    matches a base form verb followed by the word house. See resource/en/tagset.txt for a list of possible part-of-speech tags.
  • <token>cause</token> <token regexp="yes" negate="yes">and|to</token>
    matches the word cause followed by any word that is not and or to
  • <token postag="SENT_START" /> <token>foobar</token>
    matches the word foobar only at the beginning of a sentence

A pattern's terms are matched case-insensitively by default; this can be changed by setting the case_sensitive attribute to yes.
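
As a rough model of how a single token element from the examples above could be evaluated, consider the following sketch. The class TokenSketch and its fields are hypothetical. It also assumes that a regular expression has to match the whole word, which is one reading of "the word think or say" above rather than something this page states explicitly.

```java
import java.util.regex.Pattern;

// Toy model of one <token> element; names are hypothetical, not LanguageTool's.
class TokenSketch {
    final String text;
    final boolean regexp;        // corresponds to regexp="yes"
    final boolean negate;        // corresponds to negate="yes"
    final boolean caseSensitive; // corresponds to case_sensitive="yes"

    TokenSketch(String text, boolean regexp, boolean negate, boolean caseSensitive) {
        this.text = text;
        this.regexp = regexp;
        this.negate = negate;
        this.caseSensitive = caseSensitive;
    }

    boolean matches(String word) {
        boolean hit;
        if (regexp) {
            // Assumption: the expression must cover the complete word.
            int flags = caseSensitive ? 0 : Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE;
            hit = Pattern.compile(text, flags).matcher(word).matches();
        } else {
            hit = caseSensitive ? text.equals(word) : text.equalsIgnoreCase(word);
        }
        // negate="yes" inverts the result.
        return negate ? !hit : hit;
    }

    public static void main(String[] args) {
        TokenSketch thinkOrSay = new TokenSketch("think|say", true, false, false);
        TokenSketch notAndOrTo = new TokenSketch("and|to", true, true, false);
        System.out.println(thinkOrSay.matches("Say"));     // true: case-insensitive by default
        System.out.println(notAndOrTo.matches("to"));      // false: negated match
        System.out.println(notAndOrTo.matches("trouble")); // true: neither "and" nor "to"
    }
}
```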

Here's an example of a complete rule that marks "bed English", "bat attitude" etc. as errors:

<rule id="BED_ENGLISH" name="Possible typo &apos;bed/bat(bad) English/...&apos;">
    <pattern mark_from="0" mark_to="-1">
      <token regexp="yes">bed|bat</token>
      <token regexp="yes">[Ee]nglish|attitude</token>
    </pattern>
    <message>Did you mean
      <suggestion>bad</suggestion>?
    </message>
    <example type="correct">
      Sorry for my <marker>bad</marker> English.
    </example>
    <example type="incorrect">
      Sorry for my <marker>bed</marker> English.
    </example>
</rule>

A short description of the elements and their attributes:

  • element rule, attribute id: an internal identifier used to address this rule
  • element rule, attribute name: the text displayed in the configuration
  • element pattern, attributes mark_from and mark_to: which part of the matched text should be marked. The default, mark_from="0" and mark_to="0", marks the complete match. For example, if the pattern contains three token elements that match the input text, those three matching words will be marked in the text. mark_to="-1" in the example above means that the last token of the match will not be marked.
  • element token, attribute regexp: interpret the given token as a regular expression
  • element message: The text displayed to the user if this rule matches. Use sub-element suggestion to suggest a possible replacement that corrects the error.
  • element example: At least two examples, one with a correct and one with an incorrect sentence. The incorrect sentence is supposed to be matched by this rule. The position of the error must be marked up with the sub-element marker. This is used by the automatic test cases that can be run using ant test.
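
The effect of mark_from and mark_to can be sketched as two offsets, one added to the first matched token and one added to the last. This is an interpretation of the description above, and the helper below is hypothetical rather than taken from LanguageTool.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper showing which tokens end up marked.
class MarkSketch {

    // firstMatch and lastMatch are the indexes of the first and last token
    // matched by the pattern; markFrom and markTo are the attribute values.
    static List<String> marked(List<String> tokens, int firstMatch, int lastMatch,
                               int markFrom, int markTo) {
        int from = firstMatch + markFrom;
        int to = lastMatch + markTo; // inclusive
        return tokens.subList(from, to + 1);
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("Sorry", "for", "my", "bed", "English");
        // The BED_ENGLISH pattern matches tokens 3 and 4 ("bed English").
        System.out.println(marked(tokens, 3, 4, 0, 0));  // [bed, English]
        System.out.println(marked(tokens, 3, 4, 0, -1)); // [bed]
    }
}
```

With mark_to="-1" only "bed" is marked, which agrees with the marker element in the incorrect example sentence of the rule.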

There are more features not used in the example above:

  • element token, attribute skip is used in two situations:

    1. Simulate a simple chunker for languages with flexible word order, e.g., for matching errors of rection; we could for example skip possible adverbs in some rule. skip="1" works exactly as two rules, i.e.

    <token skip="1">A</token>
    <token>B</token>

    is equivalent to the pair of rules:

    <token>A</token>
    <token/>
    <token>B</token>

    <token>A</token>
    <token>B</token>

    Using a negative value, we can match until B is found, no matter how many tokens are skipped. This cannot be easily encoded using empty tokens as above because the sentence could be of any length.


    2. Match coordinated words, for example to match "both... as well" we could write:

    <token skip="-1">both<exception scope="next">and</exception></token>
    <token>as</token>
    <token>well</token>

    Here the exception is applied only to the skipped tokens.

    The scope attribute of the exception makes the exception valid only for the token for which it is specified (scope="current") or for the skipped tokens (scope="next"). The default is scope="current". Scopes are useful where several different exceptions should be applied to avoid false alarms. In some cases, it's useful to use scope="previous" in rules that already have skip="-1". This way, you can set an exception against a single token that immediately precedes the matched token. For example, we want to match "tak" after "jak" which is not preceded by a comma:

    <token>tak</token>
    <token skip="-1">jak</token>
    <token>tak<exception scope="previous">,</exception></token>

    In this case, the rule excludes all sentences where there is a comma before "tak". Note that it's very hard to make such an exclusion otherwise.
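
The skip behaviour described in the two situations above can be sketched as a small search loop. The helper is hypothetical and simplified: it looks for a single following word, and it models a scope="next" exception as one word that must not occur among the skipped tokens.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of skip semantics; not LanguageTool's implementation.
class SkipSketch {

    // The first token has already matched at index start. Returns the index of
    // `second` if it follows with at most `skip` tokens in between (any number
    // if skip is negative), or -1 if it is not found or a skipped token equals
    // `blocked` (pass null for no exception).
    static int findAfterSkip(List<String> tokens, int start, String second,
                             int skip, String blocked) {
        int last = tokens.size() - 1;
        int limit = skip < 0 ? last : Math.min(last, start + 1 + skip);
        for (int i = start + 1; i <= limit; i++) {
            String word = tokens.get(i);
            if (word.equalsIgnoreCase(second)) {
                return i;
            }
            if (blocked != null && word.equalsIgnoreCase(blocked)) {
                return -1; // the exception applies to a skipped token
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        List<String> axb = Arrays.asList("A", "x", "B");
        System.out.println(findAfterSkip(axb, 0, "B", 1, null)); // 2
        System.out.println(findAfterSkip(axb, 0, "B", 0, null)); // -1

        List<String> both = Arrays.asList("both", "cats", "and", "dogs", "as", "well");
        System.out.println(findAfterSkip(both, 0, "as", -1, null));  // 4
        System.out.println(findAfterSkip(both, 0, "as", -1, "and")); // -1
    }
}
```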

    3. Using variables in rules

    In XML rules, you can refer to previously matched tokens in the pattern. For example:

    <pattern mark_from="2">
     <token regexp="yes" skip="-1">ani|ni|i|lub|albo|czy|oraz<exception scope="next">,</exception></token>
     <token><match no="0"/></token>
    </pattern>

    This rule matches sequences like ani... ani, ni... ni, i... i but you don't have to write all these cases explicitly. The first match (matches are numbered from zero, so it's <match no="0"/>) is automatically inserted into the second token. Note that this rule will match sentences like: Nie kupiłem ani gruszek ani jabłek. Kupię to lub to lub tamto.

    A similar mechanism can be used in suggestions. There it offers more features, and tokens are numbered from 1 (for compatibility with the older notation \1 for the first matched token). For example:

    <suggestion><match no="1"/></suggestion>

    A more complicated example:

    <pattern>
    <token regexp="yes">^(\p{Lu}{2}+[i]*\p{Lu}+[\p{L}&amp;&amp;[^\p{Lu}]]{1,4}+)</token>
    </pattern>
    <message>Prawdopodobny błąd zapisu odmiany;
      skrótowce odmieniamy z dywizem:
      <suggestion><match no="1" regexp_match="^(\p{Lu}{2}+[i]*\p{Lu}+)([\p{L}&amp;&amp;[^\p{Lu}]]{1,4}+)"
    regexp_replace="$1-$2"/>
    </suggestion></message>

    This rule matches Polish inflected acronyms such as "SMSem" that should be written with a hyphen: "SMS-em". The acronym is matched with a complicated regular expression, and the match element rewrites the matched text using Java regular expression notation: regexp_match splits the word into two groups, and regexp_replace inserts a hyphen between them.
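
Because regexp_match and regexp_replace use Java regular expression notation, the replacement in this rule can be tried out directly with String.replaceAll. The class and method names below are made up for this illustration; the two expressions are copied from the rule, with &amp;&amp; unescaped to && and backslashes doubled for the Java string literal.

```java
// Stand-alone illustration of the regexp_match / regexp_replace pair above.
class AcronymHyphenSketch {

    static final String REGEXP_MATCH =
        "^(\\p{Lu}{2}+[i]*\\p{Lu}+)([\\p{L}&&[^\\p{Lu}]]{1,4}+)";
    static final String REGEXP_REPLACE = "$1-$2";

    // Group 1 is the upper-case acronym, group 2 the lower-case ending.
    static String suggest(String word) {
        return word.replaceAll(REGEXP_MATCH, REGEXP_REPLACE);
    }

    public static void main(String[] args) {
        System.out.println(suggest("SMSem")); // SMS-em
        System.out.println(suggest("SMS"));   // SMS (no ending, so unchanged)
    }
}
```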

    For some languages (currently Polish and English), element <match/> can be used to insert an inflected matched token (or another word with a specified part of speech tag). For example:

    <pattern mark_from="1" mark_to="-1">
     <token regexp="yes">has|have</token>
     <token postag="VBD|VBP|VB" postag_regexp="yes"><exception postag="VBN|NN:U.*|JJ.*|RB" postag_regexp="yes"/></token>
     <token><exception postag="VBG"/></token>
    </pattern>
    <message>Possible agreement error -- use past participle here: <suggestion><match no="2" postag="VBN"/></suggestion>.</message>

    The above rule takes the second verb with a POS tag "VBD", "VBP" or "VB" and displays its form with a POS tag "VBN" in the suggestion. You can also specify POS tags using regular expressions (postag_regexp="yes") and replace POS tags – just like in the above example with acronyms. This is useful for large and complicated tagsets (for many examples, see the Polish rule file: rules/pl/grammar.xml).

    Sometimes the rule should change the case of the matched word. For this purpose, you can set the case_conversion attribute to one of these values: startlower, startupper, allupper and alllower.
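
The four case_conversion values are not described further on this page. The following sketch shows one plausible reading of their names, assuming that the start variants change only the first character. The helper is hypothetical.

```java
import java.util.Locale;

// Hypothetical interpretation of the case_conversion values.
class CaseConversionSketch {

    static String convert(String word, String mode) {
        if (word.isEmpty()) {
            return word;
        }
        switch (mode) {
            case "startlower": // assumption: only the first character changes
                return word.substring(0, 1).toLowerCase(Locale.ROOT) + word.substring(1);
            case "startupper": // assumption: only the first character changes
                return word.substring(0, 1).toUpperCase(Locale.ROOT) + word.substring(1);
            case "allupper":
                return word.toUpperCase(Locale.ROOT);
            case "alllower":
                return word.toLowerCase(Locale.ROOT);
            default:
                return word; // unknown value: leave the word unchanged
        }
    }

    public static void main(String[] args) {
        System.out.println(convert("house", "startupper")); // House
        System.out.println(convert("House", "allupper"));   // HOUSE
    }
}
```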

    Another useful thing is that <match> can refer to a token, but apply its POS to another word. This is useful for suggesting another word with the same part of speech. There is a special abbreviated syntax used for this purpose:

    <match no="1" postag="verb:.*perf">kierować</match>

    This syntax means: take the POS tag of the first matched token that matches the regular expression specified in the postag attribute, and then apply this POS tag to the verb "kierować". This way the verb will be inflected just the way the matched verb was originally inflected. The reason why you need to specify the POS tag is that the matched token can have several POS tags (several readings).

    Note that by default the <match> element inside the <token> element inserts only a string, so it matches a string and not part-of-speech tags. Even if it refers to a token with a POS tag, it copies the matched token and not its POS tag. However, you can use all of the above attributes to change the form of the token.

    You can, however, use the <match> element to copy POS tags alone; to do so, you must use the attribute setpos="yes". All other attributes can be applied so that the POS can be converted appropriately. This can be useful for creating rules specifying grammatical agreement. Currently, such rules must be quite wordy; a somewhat more terse syntax is in development.

    4. Turning the rule off

    Some rules can be optional, useful only in specific registers, or very sensitive. You can turn them off by default by using the attribute default="off". The user can turn the rule on in the Options dialog box, and this setting is saved in the configuration file.

Adding new Java rules
Rules that cannot be expressed with a simple pattern in grammar.xml can be developed as a Java class. See rules/WordRepeatRule.java for a simple example which you can use to develop your own rules. You will also need to add your rule to JLanguageTool.java to activate it.

Translating the user interface
To translate the user interface, just copy MessagesBundle_en.properties to MessagesBundle_xx.properties (where xx is the code of your language) and translate the text. Note that hot keys for menu items are specified with the & character (for example, &File). The next time you start LanguageTool, it should show your translation (assuming your computer is configured to use your language -- if that's not the case, start LanguageTool with java -Duser.language=xx -jar LanguageToolGUI.jar).

Adding support for a new language
Adding a new language requires some programming. You should check out the "JLanguageTool" module from CVS (see the SourceForge help). Because some files are too large to be kept in CVS, you also need files from the LanguageTool ZIP file:

  1. Unzip standalone-libs.zip and then copy all *.jar files to the subdirectory libs in your checkout directory.
  2. Create a directory libs/build and put junit.jar in there.
  3. Create a directory libs/ooo and copy these files from your OpenOffice.org installation to that directory (they are in program/classes): juh.jar, jurt.jar, ridl.jar, and unoil.jar
  4. Call ant and copy the other missing files from the ZIP, if the compiler complains.

Language.java contains the information about supported languages. You can add a new language by creating a new Language object in this class and providing a part-of-speech tagger for it, similar to de/danielnaber/languagetool/tagging/en/EnglishTagger.java. The tagger must implement the Tagger interface; any implementation details (i.e. how to actually assign tags to words) are up to you -- the easiest thing is probably to just copy the English tagger.

A trivial tagger that only assigns null tags to words is DemoTagger. This is enough for rules that refer to words but not to part-of-speech tags. You can add those rules to a file rules/xy/grammar.xml, where xy is the short name for your language. You will also need to add the short name of your language to rules.dtd.

The test cases run by "ant test" will automatically include your new language and its rules, based on the "example" elements of each rule.

To add part-of-speech tags, please have a look at resource/en/make-dict-en.sh (note: this file is only in CVS, not in the released ZIP). First try to make it work for English. You need the fsa package. Install it and add its installation directory to your PATH. Once it works for English, create your own version of manually_added.txt and use that to create a .dict file, then adapt your tagger to use it (e.g. copy EnglishTagger.java and change the RESOURCE_FILENAME constant).

Remember that you will also need to adapt build.xml. Just search for "/en/" in that file and copy those lines, adapting them to your language.

Background
For background information, my diploma thesis about LanguageTool is available (note that this refers to an earlier version of LanguageTool which was written in Python):
PDF, 650 KB
Postscript (.ps.gz), 630 KB

Last modified: 2008-06-08