LanguageTool Development
This page has everything you need to know to teach LanguageTool new error detection rules, plus more. You don't even have to be a programmer for that.

The three-minute introduction
This section tells you in a nutshell how to write your own LanguageTool rules for detecting errors:
That's it! You have just added a new rule. Keep on reading to get a grasp of what the elements of a rule mean and how to build more complex rules.

Help wanted!
We're looking for people to help us write new rules so LanguageTool can detect more errors. Also see "How can you help?".
If your language isn't supported yet, you can add it by following the documentation in our wiki.

Source code checkout (Java developers only)
If you are a Java developer and want to extend LanguageTool or use the latest development version, check out LanguageTool with Subversion:
svn checkout http://svn.code.sf.net/p/languagetool/code/trunk/languagetool languagetool
Alternatively, you can get the code from GitHub, where it is mirrored (sorry, as of May 2013 the mirror is not up to date):
git clone https://github.com/danielnaber/languagetool-mirror.git
You can then build the code with mvn clean package or just run the tests with mvn clean test. Maven's default memory settings are often too low, so you will probably need to set your environment variable MAVEN_OPTS to:
-Xmx512m -XX:MaxPermSize=256m
After the build, the LibreOffice/OpenOffice extension can be found in languagetool-office-extension/target,
the stand-alone version in languagetool-standalone/target.
Please also see the

Language checking process
This is what LanguageTool does when it analyzes a text for errors:
The most important thing to keep in mind is that LanguageTool's rules describe what errors look like, not what correct sentences look like (this is the opposite of how you learn a new language).

Adding new XML rules
Most rules are contained in rules/xx/grammar.xml, where xx is a language code like en or de. A rule is basically a pattern that shows an error message to the user if the pattern matches. A pattern can address words or part-of-speech tags. Here are some examples of patterns that can be used in that file:
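The following fragments sketch what such patterns can look like. They are illustrative rather than taken from this page: the words are arbitrary, and the tag VB assumes the English tagset used in the examples further down.

```xml
<!-- Matches the single word "think" -->
<pattern>
  <token>think</token>
</pattern>

<!-- Matches either "think" or "say", using a regular expression -->
<pattern>
  <token regexp="yes">think|say</token>
</pattern>

<!-- Matches a word tagged VB followed by the word "house" -->
<pattern>
  <token postag="VB"/>
  <token>house</token>
</pattern>
```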
A pattern's tokens are matched case-insensitively by default. This can be changed for the whole pattern by setting the pattern's case_sensitive attribute to yes. Alternatively, case-sensitive matching can be turned on for single tokens by using (?-i) in regular expressions (for example, <token regexp="yes">(?-i)Bill</token> will match "Bill" but not "bill").

A simple example
Here's an example of a complete rule that marks "bed English", "bat attitude" etc. as errors:

<rule id="BED_ENGLISH" name="Possible typo 'bed/bat(bad) English/...'">
  <pattern>
    <marker>
      <token regexp="yes">bed|bat</token>
    </marker>
    <token regexp="yes">English|attitude</token>
  </pattern>
  <message>Did you mean <suggestion>bad</suggestion>?</message>
  <url>http://some-server.org/the-bed-bad-error</url>
  <example type="correct">Sorry for my <marker>bad</marker> English.</example>
  <example correction="bad" type="incorrect">Sorry for my <marker>bed</marker> English.</example>
</rule>

The basic elements of a rule
A short description of the elements and their attributes:

Testing rules
The LanguageTool user interface (languagetool-standalone.jar) needs to be restarted whenever you change the grammar.xml file. Testing rules is faster with our embedded test case feature: just call sh testrules.sh en on Linux or testrules.bat en on Windows, using your language code instead of en. This tests your rule against its example sentences: the incorrect sentence is supposed to be detected by your rule, while the correct sentence is not supposed to produce an error. If that is not the case, you will get a message, which means that either your rule or your example sentences are not quite right yet. Using testrules.sh/bat is not only much faster than manually restarting the user interface over and over again, it also always tests all rules, so we recommend using it during rule development.

Inflection
The inflected attribute of the token element is used to match not only the given form but also all of its inflected forms. For example, <token inflected="yes">bicycle</token> will match bicycle, bicycles, bicycling etc.

Grouping rules
Sometimes more than one rule is needed to find all occurrences of an error. You can put all those rules in one rulegroup element. The rulegroup's id and name attributes will be used for all the rules of that group. Starting with LanguageTool 1.8, overlapping matches for rules in the same rulegroup are filtered out to avoid duplicate matches for the same error.

Categories
Rules are best put into categories that describe their purpose and make it possible to enable or disable a number of rules at the same time. When creating a category, you can use the type attribute to describe the type of the error according to the Quality Issue Type from the W3C Internationalization Tag Set. This makes integration of LanguageTool with other tools easier.

Turning rules off by default
Some rules can be optional, useful only in specific registers, or very sensitive. You can turn them off by default by using the attribute default="off".
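As a minimal sketch of what this can look like, the rule below is switched off by default. The rule id, name, tokens, and message are hypothetical and only serve as illustration; the default attribute is the part taken from the text above.

```xml
<!-- Hypothetical optional rule: disabled until the user enables it -->
<rule id="EXAMPLE_OPTIONAL_RULE" name="Example of an optional style rule" default="off">
  <pattern>
    <token>very</token>
    <token>unique</token>
  </pattern>
  <message>Consider using <suggestion>unique</suggestion> on its own.</message>
  <example type="correct">This design is unique.</example>
  <example correction="unique" type="incorrect">This design is <marker>very unique</marker>.</example>
</rule>
```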
The user can turn the rule on or off in the Options dialog box, and this setting is saved in the configuration file.

Skip
The skip attribute of the token element is used in two situations:

Variables
In XML rules, you can refer to previously matched tokens in the pattern. For example:

<pattern>
  <token regexp="yes" skip="-1">ani|ni|i|lub|albo|czy|oraz<exception scope="next">,</exception></token>
  <token><match no="0"/></token>
</pattern>

This rule matches sequences like ani... ani, ni... ni, i... i without you having to write out all these cases explicitly. The first match (matches are numbered from zero, so it's <match no="0"/>) is automatically inserted into the second token. Note that this rule will match sentences like:

Nie kupiłem ani gruszek ani jabłek.
Kupię to lub to lub tamto.

A similar mechanism can be used in suggestions; however, more features are available there, and tokens are numbered from 1 (for compatibility with the older notation \1 for the first matched token). For example:

<suggestion><match no="1"/></suggestion>
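To make the two numbering schemes concrete, here is a sketch of a complete rule that uses both. It is not taken from the LanguageTool rule files: the id, name, word list, and message are made up for illustration, and the rule is untested. Inside the pattern, <match no="0"/> refers to the first token, while in the suggestion the same token is <match no="1"/>.

```xml
<!-- Hypothetical rule: flags a doubled article such as "the the" -->
<rule id="EXAMPLE_DOUBLED_ARTICLE" name="Example: doubled article">
  <pattern>
    <!-- token 0 in pattern numbering, token 1 in suggestion numbering -->
    <token regexp="yes">the|a|an</token>
    <!-- must be the same string as the first matched token -->
    <token><match no="0"/></token>
  </pattern>
  <message>Repeated word. Did you mean <suggestion><match no="1"/></suggestion>?</message>
  <example type="correct">This is the house.</example>
  <example correction="the" type="incorrect">This is <marker>the the</marker> house.</example>
</rule>
```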

A more complicated example:

<pattern>
  <token regexp="yes">^(\p{Lu}{2}+[i]*\p{Lu}+[\p{L}&amp;&amp;[^\p{Lu}]]{1,4}+)</token>
</pattern>
<message>Prawdopodobny błąd zapisu odmiany; skrótowce odmieniamy z dywizem: <suggestion><match no="1" regexp_match="^(\p{Lu}{2}+[i]*\p{Lu}+)([\p{L}&amp;&amp;[^\p{Lu}]]{1,4}+)" regexp_replace="$1-$2"/></suggestion></message>

This rule matches Polish inflected acronyms such as "SMSem" that should be written with a hyphen: "SMS-em". The acronym is matched with a complicated regular expression, and the <match> element rewrites the matched token using Java regular expression notation: regexp_match splits the token into two groups, and regexp_replace inserts a hyphen between them.

For some languages (currently Polish, English, Catalan, Spanish, Galician, Dutch, Romanian, Slovak and Russian), the <match/> element can be used to insert an inflected form of the matched token (or another word with a specified part-of-speech tag). For example:

<pattern>
  <token regexp="yes">has|have</token>
  <marker>
    <token postag="VBD|VBP|VB" postag_regexp="yes">
      <exception postag="VBN|NN:U.*|JJ.*|RB" postag_regexp="yes"/>
    </token>
  </marker>
  <token><exception postag="VBG"/></token>
</pattern>
<message>Possible agreement error -- use past participle here: <suggestion><match no="2" postag="VBN"/></suggestion>.</message>

The above rule takes the second token, a verb with the POS tag "VBD", "VBP" or "VB", and displays its form with the POS tag "VBN" in the suggestion. You can also specify POS tags using regular expressions (postag_regexp="yes") and replace POS tags, just as in the acronym example above. This is useful for large and complicated tagsets (for many examples, see the Polish rule file rules/pl/grammar.xml).

Sometimes the rule should change the case of the matched word. For this purpose, you can use the case_conversion attribute with one of the values startlower, startupper, allupper and alllower.

Another useful feature is that <match> can refer to a token but apply its POS tag to another word. This is useful for suggesting another word with the same part of speech. There is a special abbreviated syntax for this purpose:

<match no="1" postag="verb:.*perf">kierować</match>

This syntax means: take the POS tag of the first matched token that matches the regular expression specified in the postag attribute, and then apply this POS tag to the verb "kierować". This way, the verb will be inflected just the way the matched verb was originally inflected. You need to specify the POS tag because the matched token can have several POS tags (several readings).

Note that by default, a <match> element inside a <token> element inserts only a string, so it matches a string and not part-of-speech tags. Even if it refers to a token with a POS tag, it copies the matched token and not its POS tag. However, you can use all of the above attributes to change the form of the token. You can also use the <match> element to copy POS tags alone, but to do so you must set the attribute setpos="yes". All other attributes can still be applied so that the POS tag is converted appropriately. This can be useful for creating rules that specify grammatical agreement. Currently, such rules must be quite wordy; a somewhat terser syntax is in development.

Adding new Java rules
Rules that cannot be expressed with a simple pattern in grammar.xml can be developed as a Java class. As a developer, extend LanguageTool's Rule class and implement the match(AnalyzedSentence text) method. See rules/WordRepeatRule.java for a simple example which you can use to develop your own rules. You will also need to add your rule's class to the getRelevantRules() method in <YourLanguage>.java to activate it.

Translating the user interface
We use Transifex to translate our property files. Updated translations are only copied to the LanguageTool source before a release, so if you need an early preview, say so on the LanguageTool mailing list and we'll update the files accordingly.

Background information
For some background information, Daniel Naber's diploma thesis about the original version of LanguageTool is available. Please note that it refers to an earlier version of LanguageTool, which was written in Python:

Page last modified: 2013-05-18