org.foray.hyphen.util
Class NaturalLanguage

java.lang.Object
  extended by org.foray.hyphen.util.NaturalLanguage

public final class NaturalLanguage
extends Object

Manages various aspects of a natural language, specifically which grapheme clusters are valid in that language.

NOTE: There may be a better way to do this, but I have not found it yet. Java has the "Locale" class, which gives access to certain resources. However, this seems to be JVM-specific and cannot be extended by adding new locales. The ICU4J libraries from IBM (parts of which are included in Java 5, parts in Java 6) provide some similar capabilities, but do not seem to be documented well enough for us to use. Writing this class seems easier than working out how to use either of those alternatives.


Constructor Summary
NaturalLanguage()
          Private Constructor.
 
Method Summary
 void addCluster(int[] codepoints)
          Add a new Grapheme Cluster to this language.
 void addRange(int start, int end)
          Add a range of Unicode codepoints to this language.
 boolean isIncluded(int codepoint)
          Indicates whether a specific Unicode codepoint is valid as a grapheme in this language.
 boolean isIncluded(int[] codepoints, int start, int end)
          Indicates whether a given sequence of characters is a valid grapheme cluster in this language.
 int validateText(CharSequence theChars)
          Validates the content of a sequence of chars to determine whether they are valid in this language.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

NaturalLanguage

public NaturalLanguage()
Private Constructor.

Method Detail

addRange

public void addRange(int start,
                     int end)
Add a range of Unicode codepoints to this language.

Parameters:
start - The first codepoint in the range to be added.
end - The last codepoint in the range to be added.
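
As an illustration of the inclusive range semantics described above, the following is a minimal sketch and not the FOray implementation. The class name CodepointRangeSketch and its list-based storage are assumptions made for the example; only the method names and the inclusive start and end bounds come from this page.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the range bookkeeping; not the FOray code. */
final class CodepointRangeSketch {

    /** Each entry is a {start, end} pair; both bounds are inclusive. */
    private final List<int[]> ranges = new ArrayList<>();

    /** Records the inclusive range [start, end]. */
    void addRange(int start, int end) {
        if (start > end) {
            throw new IllegalArgumentException("start must not exceed end");
        }
        ranges.add(new int[] {start, end});
    }

    /** Returns true if the codepoint falls inside any recorded range. */
    boolean isIncluded(int codepoint) {
        for (int[] range : ranges) {
            if (codepoint >= range[0] && codepoint <= range[1]) {
                return true;
            }
        }
        return false;
    }
}
```

With this sketch, calling addRange('a', 'z') makes both 'a' and 'z' valid, because the last codepoint of the range is itself included.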

addCluster

public void addCluster(int[] codepoints)
Add a new Grapheme Cluster to this language.

Parameters:
codepoints - The sequence of Unicode codepoints that defines the Grapheme Cluster.
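
One way a caller might build the codepoints array is sketched below. This page does not say which normalization form addCluster expects; because isIncluded(int[], int, int) requires canonically decomposed input, the sketch assumes clusters are supplied in the same form. The helper class ClusterCodepointsSketch and its method are hypothetical. Only java.text.Normalizer and String.codePoints() are standard library calls.

```java
import java.text.Normalizer;

/** Hypothetical helper for preparing addCluster arguments; not part of FOray. */
final class ClusterCodepointsSketch {

    private ClusterCodepointsSketch() {
    }

    /**
     * Converts a cluster written as a String into canonically decomposed
     * (NFD) codepoints. Assumption: addCluster wants decomposed input.
     */
    static int[] toDecomposedCodepoints(String cluster) {
        String decomposed = Normalizer.normalize(cluster, Normalizer.Form.NFD);
        return decomposed.codePoints().toArray();
    }
}
```

For example, toDecomposedCodepoints("\u00E9") yields the two codepoints U+0065 and U+0301, which could then be passed to addCluster.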

isIncluded

public boolean isIncluded(int codepoint)
Indicates whether a specific Unicode codepoint is valid as a grapheme in this language.

Parameters:
codepoint - The Unicode codepoint to be tested.
Returns:
True iff codepoint is valid in this language.

isIncluded

public boolean isIncluded(int[] codepoints,
                          int start,
                          int end)
Indicates whether a given sequence of characters is a valid grapheme cluster in this language.

Parameters:
codepoints - The sequence of codepoints to be tested. This sequence must be already normalized to the canonical decomposed sequence and order.
start - The index in codepoints of the first codepoint being tested.
end - The index in codepoints of the last codepoint being tested (inclusive).
Returns:
True iff the sequence of characters matches a valid Grapheme Cluster in this language.
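
The sketch below illustrates how the start and end indexes select a slice of the array, with end treated as the last tested element. It is an assumption-laden example, not the FOray implementation: the class ClusterMatchSketch, its list storage, and the linear search are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for cluster lookup; not the FOray code. */
final class ClusterMatchSketch {

    /** Known clusters, each stored as decomposed codepoints. */
    private final List<int[]> clusters = new ArrayList<>();

    /** Stores a defensive copy of the cluster. */
    void addCluster(int[] codepoints) {
        clusters.add(codepoints.clone());
    }

    /** Tests codepoints[start..end], where end is the last index tested. */
    boolean isIncluded(int[] codepoints, int start, int end) {
        if (start < 0 || end < start || end >= codepoints.length) {
            throw new IndexOutOfBoundsException("invalid start or end index");
        }
        // copyOfRange takes an exclusive upper bound, hence end + 1.
        int[] candidate = Arrays.copyOfRange(codepoints, start, end + 1);
        for (int[] cluster : clusters) {
            if (Arrays.equals(cluster, candidate)) {
                return true;
            }
        }
        return false;
    }
}
```

Given a registered cluster {0x65, 0x301} and the text {0x63, 0x61, 0x66, 0x65, 0x301}, the call isIncluded(text, 3, 4) matches, while isIncluded(text, 3, 3) does not, because the slice then holds only the base letter.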

validateText

public int validateText(CharSequence theChars)
Validates the content of a sequence of chars to determine whether they are valid in this language. Here "valid" means that every grapheme cluster contained in the text is a valid grapheme cluster in this language.

Parameters:
theChars - The String or other CharSequence that contains the text to be validated. This text does not need to already be normalized.
Returns:
The index of the first codepoint of the first grapheme cluster in theChars that is not valid in this language, or -1 if all clusters are valid.
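
The validation flow described above can be sketched roughly as follows. This is not the FOray implementation and rests on several assumptions: clusters are segmented with java.text.BreakIterator, each cluster is normalized to NFD before testing, the per-cluster check is supplied as a hypothetical Predicate in place of the language's own lookup, and the returned index is a char offset into the original, unnormalized text.

```java
import java.text.BreakIterator;
import java.text.Normalizer;
import java.util.function.Predicate;

/** Hypothetical sketch of the validation loop; not the FOray code. */
final class ValidateTextSketch {

    private ValidateTextSketch() {
    }

    /**
     * Returns the char offset of the first cluster rejected by
     * isValidCluster, or -1 if every cluster is accepted.
     */
    static int validateText(CharSequence theChars, Predicate<int[]> isValidCluster) {
        String text = theChars.toString();
        BreakIterator boundaries = BreakIterator.getCharacterInstance();
        boundaries.setText(text);
        int start = boundaries.first();
        for (int end = boundaries.next(); end != BreakIterator.DONE;
                start = end, end = boundaries.next()) {
            // Normalize each cluster on its own so that offsets still
            // refer to the original, unnormalized text.
            String decomposed = Normalizer.normalize(
                    text.substring(start, end), Normalizer.Form.NFD);
            if (!isValidCluster.test(decomposed.codePoints().toArray())) {
                return start;
            }
        }
        return -1;
    }
}
```

A caller could pass a predicate that delegates to isIncluded(int[], int, int) over the whole array. Normalizing cluster by cluster is a design choice of this sketch; it keeps the reported index meaningful for input that was not normalized beforehand.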


Copyright © 2017. All rights reserved.