org.foray.hyphen.util
Class WordList

java.lang.Object
  extended by org.foray.hyphen.util.WordList

public class WordList
extends Object

Parses an input file and returns a sorted list of the words in that file, eliminating non-word characters and discarding words that contain numbers. This is useful for building word lists from existing documents. Such a list can serve as the starting point for manual entry of hyphenation points, which can in turn be used as input to the pattern generation logic.

Much of the work in this class can be done with sed, awk, sort, and uniq. However, I do not know how to get these tools to support 21-bit or even 16-bit characters.


Constructor Summary
WordList(InputStream input, String inputEncoding)
          Creates a WordList that reads words from the given input, decoded with the named encoding.
 
Method Summary
static boolean containsNonWord(String input)
          Indicates whether a given word contains any non-word characters.
static void main(String[] args)
          Command-line interface to the word list processing.
 String[] parse()
          Parses the input, returning an array of unique, sorted words from the input.
static String removeNonWordChars(String token)
          Removes the non-word characters from one word.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

WordList

public WordList(InputStream input,
                String inputEncoding)
Creates a WordList that reads words from the given input, decoded with the named encoding.

Parameters:
input - The input for the word list.
inputEncoding - The name of the encoding to be used for decoding the input.
Method Detail

parse

public String[] parse()
               throws IOException
Parses the input, returning an array of unique, sorted words from the input. Words containing numerals are discarded. Non-word characters are also discarded, so that words in quotation marks (for example) will not have the quotation marks included.

Returns:
The sorted unique array of words from the input.
Throws:
IOException - For errors reading the input stream.
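
The behavior described for parse can be approximated with a short, self-contained sketch. This is not the FOray implementation: the class name ParseSketch is hypothetical, tokens are assumed to be separated by whitespace, "numeral" is assumed to mean a Unicode decimal digit, and every non-letter is assumed to be a non-word character.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.TreeSet;
import java.util.regex.Pattern;

class ParseSketch {

    // Assumption: a "numeral" is any Unicode decimal digit.
    private static final Pattern DIGIT = Pattern.compile("\\p{Nd}");
    // Assumption: every character that is not a letter is a non-word character.
    private static final Pattern NON_LETTERS = Pattern.compile("\\P{L}+");

    /** Returns the unique words found in the input, in sorted order. */
    static String[] parse(InputStream input, String encoding) throws IOException {
        TreeSet<String> words = new TreeSet<>(); // keeps entries sorted and unique
        try (BufferedReader reader =
                new BufferedReader(new InputStreamReader(input, encoding))) {
            String line;
            while ((line = reader.readLine()) != null) {
                for (String token : line.split("\\s+")) {
                    if (DIGIT.matcher(token).find()) {
                        continue; // discard the whole token if it contains a digit
                    }
                    String word = NON_LETTERS.matcher(token).replaceAll("");
                    if (!word.isEmpty()) {
                        words.add(word);
                    }
                }
            }
        }
        return words.toArray(new String[0]);
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = "\"hello\" world, hello abc123 4th".getBytes(StandardCharsets.UTF_8);
        String[] words = parse(new ByteArrayInputStream(bytes), "UTF-8");
        System.out.println(String.join(" ", words)); // prints: hello world
    }
}
```

The TreeSet orders words by their UTF-16 code units rather than by any locale-specific collation. Whether the real class sorts the same way is not stated in this documentation.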

removeNonWordChars

public static String removeNonWordChars(String token)
Removes the non-word characters from one word.

Parameters:
token - The token (word) from which the non-word characters should be removed.
Returns:
The cleaned-up word.

containsNonWord

public static boolean containsNonWord(String input)
Indicates whether a given word contains any non-word characters.

Parameters:
input - The input "word".
Returns:
True if and only if the word contains one or more non-word characters.
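
The two static helpers, removeNonWordChars and containsNonWord, might look roughly like the following sketch. The documentation does not define "non-word character", so this version assumes that only Unicode letters are word characters (digits and punctuation therefore count as non-word). The class name NonWordSketch is hypothetical. Iterating over code points rather than char values lets the helpers handle characters outside the 16-bit range.

```java
class NonWordSketch {

    /** Assumption: anything that is not a Unicode letter is a non-word character. */
    static boolean containsNonWord(String input) {
        return input.codePoints().anyMatch(cp -> !Character.isLetter(cp));
    }

    /** Keeps only the letters of the token, preserving their order. */
    static String removeNonWordChars(String token) {
        StringBuilder cleaned = new StringBuilder(token.length());
        token.codePoints()
             .filter(Character::isLetter)
             .forEach(cleaned::appendCodePoint);
        return cleaned.toString();
    }

    public static void main(String[] args) {
        // U+10400, a letter that needs two 16-bit code units in a Java String.
        String deseret = "\uD801\uDC00";
        System.out.println(containsNonWord("\"quoted\""));    // prints: true
        System.out.println(removeNonWordChars("\"quoted\"")); // prints: quoted
        System.out.println(containsNonWord(deseret));         // prints: false
    }
}
```

Under this assumption the two helpers are consistent with each other: containsNonWord returns false exactly when removeNonWordChars leaves the input unchanged.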

main

public static void main(String[] args)
Command-line interface to the word list processing.

Parameters:
args - The command-line arguments. There should be exactly four: 1) the input file, 2) the name of the input encoding, for example "US-ASCII" or "UTF-8", 3) the output file, and 4) the name of the output encoding.
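
How main reacts to a wrong argument count or an unknown encoding name is not documented here. A caller that builds the argument array programmatically may want to check it first. The following sketch shows one way to do that, using only java.nio.charset.Charset. The class name WordListArgsSketch and the messages are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

class WordListArgsSketch {

    /** Returns null when the four arguments look usable, otherwise a message. */
    static String check(String[] args) {
        if (args.length != 4) {
            return "expected: <input file> <input encoding> <output file> <output encoding>";
        }
        // Positions 1 and 3 hold the input and output encoding names.
        for (int i : new int[] {1, 3}) {
            if (!isKnownEncoding(args[i])) {
                return "unsupported encoding: " + args[i];
            }
        }
        return null;
    }

    /** True if this JVM can decode and encode the named charset. */
    static boolean isKnownEncoding(String name) {
        try {
            return Charset.isSupported(name);
        } catch (IllegalCharsetNameException e) {
            return false; // the name itself is malformed
        }
    }

    public static void main(String[] args) {
        String problem = check(args);
        System.out.println(problem == null ? "arguments look fine" : problem);
    }
}
```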


Copyright © 2017. All rights reserved.