java.lang.Object
  org.foray.hyphen.util.WordList
public class WordList
Parses an input file, returning a sorted list of the words in that file. Eliminates non-word characters and words containing numbers. This is useful for building word lists from existing documents. Such a list can serve as the starting point for manual entry of hyphenation points, which can in turn be used as input to the pattern generation logic.
Much of the work in this class can be done with sed, awk, sort, and uniq. However, I do not know how to get these to support 21-bit or even 16-bit characters.
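The processing described above can be sketched with the standard library alone. This is a minimal illustration, not the FOray source: the class name `WordListSketch` is hypothetical, and it assumes that tokens are separated by whitespace, that a "word character" is any Unicode letter, and that case is preserved. Java regular expressions and `String.codePoints()` work on whole code points, so characters beyond 16 bits are handled.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.TreeSet;

/** Hypothetical stand-in for illustration; not the FOray implementation. */
class WordListSketch {

    /** Decodes the input and returns its unique words in sorted order. */
    static String[] parse(InputStream input, String inputEncoding) throws IOException {
        // A TreeSet keeps the words unique and in natural String order.
        TreeSet<String> words = new TreeSet<>();
        try (BufferedReader reader =
                new BufferedReader(new InputStreamReader(input, inputEncoding))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Assumption: tokens are separated by whitespace.
                for (String token : line.trim().split("\\s+")) {
                    // Drop any token that contains a digit.
                    if (token.codePoints().anyMatch(Character::isDigit)) {
                        continue;
                    }
                    // Assumption: everything that is not a Unicode letter is a non-word character.
                    String word = token.replaceAll("[^\\p{L}]", "");
                    if (!word.isEmpty()) {
                        words.add(word);
                    }
                }
            }
        }
        return words.toArray(new String[0]);
    }
}
```

The sort here is the natural `String` order; a locale-aware `java.text.Collator` could be passed to the `TreeSet` if dictionary order is preferred.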
Constructor Summary | |
---|---|
WordList(InputStream input, String inputEncoding) | Constructor. |
Method Summary | | |
---|---|---|
static boolean | containsNonWord(String input) | Indicates whether a given word contains any non-word characters. |
static void | main(String[] args) | Command-line interface to the word list processing. |
String[] | parse() | Parses the input, returning an array of unique, sorted words from the input. |
static String | removeNonWordChars(String token) | Removes the non-word characters from one word. |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Constructor Detail |
---|
public WordList(InputStream input, String inputEncoding)
input - The input for the word list.
inputEncoding - The name of the encoding to be used for decoding the input.
Method Detail |
---|
public String[] parse() throws IOException
IOException - For errors reading the input stream.
public static String removeNonWordChars(String token)
token - The token (word) from which the non-word characters should be removed.
public static boolean containsNonWord(String input)
input - The input "word".
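One way the two static helpers could be written so that they handle characters outside the 16-bit range is to iterate by code point rather than by `char`. The sketch below is illustrative only: `NonWordSketch` and `isWordChar` are hypothetical names, and the rule that only Unicode letters count as word characters is an assumption, not something this documentation states.

```java
/** Hypothetical stand-in for illustration; not the FOray implementation. */
class NonWordSketch {

    /** Assumption: a word character is any Unicode letter. */
    static boolean isWordChar(int codePoint) {
        return Character.isLetter(codePoint);
    }

    /** Returns the token with every non-word code point removed. */
    static String removeNonWordChars(String token) {
        StringBuilder out = new StringBuilder(token.length());
        for (int i = 0; i < token.length(); ) {
            // codePointAt combines a surrogate pair into one 21-bit value.
            int codePoint = token.codePointAt(i);
            if (isWordChar(codePoint)) {
                out.appendCodePoint(codePoint);
            }
            // Advance by one or two chars, depending on the code point.
            i += Character.charCount(codePoint);
        }
        return out.toString();
    }

    /** Returns true if any code point in the input is not a word character. */
    static boolean containsNonWord(String input) {
        return input.codePoints().anyMatch(cp -> !isWordChar(cp));
    }
}
```

Testing each `char` separately would classify the two halves of a surrogate pair as non-letters, which is why the loop steps by `Character.charCount`.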
public static void main(String[] args)
args - The command-line arguments. There should be exactly four: 1) the input file, 2) the name of the input encoding, for example "US-ASCII" or "UTF-8", 3) the output file, and 4) the name of the output encoding.
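The four-argument contract can be illustrated with a small stand-alone program. Everything here is a hypothetical sketch: `WordListCliSketch` and `run` are made-up names, the word processing is reduced to a whitespace split as a placeholder, and writing one word per line is an assumption about the output format.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.TreeSet;

/** Hypothetical stand-in for illustration; not the FOray implementation. */
class WordListCliSketch {

    static void run(String[] args) throws IOException {
        if (args.length != 4) {
            throw new IllegalArgumentException(
                    "Expected: <input file> <input encoding> <output file> <output encoding>");
        }
        Path inputFile = Paths.get(args[0]);
        Charset inputEncoding = Charset.forName(args[1]);   // for example "US-ASCII" or "UTF-8"
        Path outputFile = Paths.get(args[2]);
        Charset outputEncoding = Charset.forName(args[3]);

        // Placeholder processing: unique, sorted, whitespace-separated tokens.
        String text = new String(Files.readAllBytes(inputFile), inputEncoding);
        TreeSet<String> words = new TreeSet<>(Arrays.asList(text.trim().split("\\s+")));
        words.remove("");

        // Assumption: the output holds one word per line in the output encoding.
        Files.write(outputFile, words, outputEncoding);
    }

    public static void main(String[] args) throws IOException {
        run(args);
    }
}
```

A call such as `java WordListCliSketch words.txt UTF-8 sorted.txt ISO-8859-1` (file names hypothetical) would read UTF-8 text and write the sorted list in Latin-1.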