FOray

FOrayFont: Font Encoding

Introduction

This document is oriented toward Developers.

The purpose of this document is to outline the basic concepts of how Unicode characters in a client document are tied to glyph outlines in a font and to "characters" in an output document. It is currently a work in progress and is incomplete.

Background Concepts

It is important to distinguish between the concepts of character and glyph, and to remember that there is not necessarily a one-to-one relationship between them.

There are at least three different ways that a character/glyph is identified, and these three ways must be tied together:

  • The Unicode code point. This is also referred to as a character code.
  • The "name" of the glyph (for Type 1 fonts only).
  • The index into the array of glyphs. For PDF, this is the value that is written to the PDF file. Throughout FOray documentation and source code, this value is referred to as the glyph index. In FOray's processing, it is also the key to obtaining glyph metric information.

Single-Byte Type 1 Fonts

A Type 1 font stores glyph outlines in a PostScript dictionary. A PostScript dictionary is a map: each entry pairs a key with a value. In a PostScript font dictionary, the key is a glyph name and the value is the data structure containing the actual instructions to draw the glyph. The standard glyph names of Latin alphabetic characters are the characters themselves, like "a" (U+0061) or "T" (U+0054). Standard names of other glyphs are longer, like "exclam" (an exclamation point, U+0021). Note that in the pre-Unicode days when PostScript was developed, using multi-character "names" was an effective way to uniquely identify and store more than 256 glyphs in a single-byte font.

It would be awkward and wasteful to embed the name of each glyph in an output document. Instead, a one-byte array index is embedded. The array itself is called an encoding vector, and it contains a string element for each name in the encoding. So, for example, the AdobeStandardEncoding encoding vector contains at index 0x21 (decimal 33) the name "exclam".

An encoding vector is limited to 256 entries (the standard encodings use fewer because they skip the ASCII control codes), but the font itself has no such limit. Different encoding vectors can be used to access different combinations of glyphs in the font.

So, if the encoding vector is known, it is easy to map a glyph index to a glyph name, or to map a glyph name to a glyph index. But we still need a way to map a Unicode code point to one or the other. Also, font metrics, at least in the AFM format, are tied directly to glyph names. This makes sense, because the metrics are the same for the glyphs regardless of what encoding is used.
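
The two-way mapping between glyph index and glyph name can be sketched as follows. This is a minimal illustration only: the class EncodingVectorDemo and its methods are hypothetical names, not FOray's actual API, and the three entries are a small excerpt of AdobeStandardEncoding.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of a single-byte encoding vector; not FOray's actual class. */
class EncodingVectorDemo {
    /** Glyph name for each of the 256 possible byte values; null where unassigned. */
    private final String[] names = new String[256];
    /** Reverse lookup from glyph name to glyph index, filled as entries are assigned. */
    private final Map<String, Integer> indexByName = new HashMap<>();

    void assign(int index, String glyphName) {
        names[index] = glyphName;
        indexByName.put(glyphName, index);
    }

    /** Glyph index to glyph name, or null if the slot is empty. */
    String nameAt(int index) {
        return names[index];
    }

    /** Glyph name to glyph index, or -1 if the name is not in this encoding. */
    int indexOf(String glyphName) {
        Integer index = indexByName.get(glyphName);
        return index == null ? -1 : index;
    }

    /** A three-entry excerpt of AdobeStandardEncoding, for illustration only. */
    static EncodingVectorDemo standardExcerpt() {
        EncodingVectorDemo vector = new EncodingVectorDemo();
        vector.assign(0x21, "exclam");
        vector.assign(0x54, "T");
        vector.assign(0x61, "a");
        return vector;
    }
}
```

Neither direction involves a Unicode code point, which is the gap the rest of this document addresses.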

TrueType Fonts

TrueType fonts (TTF) use a cmap table to map character codes to glyph indexes. The cmap table can have subtables that support different encodings. Currently, FOray supports only TrueType fonts that have a Unicode cmap subtable. Support for other encodings seems possible by creating an intermediate map (or computation) that would map, for example, from Shift JIS to Unicode.

Overall Design

Regardless of which format is used, we must parse either a metric file or a font file and make the information in that file available to the client application.

Optimize for Unicode input

Font formats are, in general, organized for view- or print-time efficiency. That is, they are optimized to map efficiently from an index presented to them to the instructions needed to draw the glyph at that index. FOray, on the other hand, has a different set of tasks, generally going in the opposite direction from the font's internal optimizations. Instead of being presented with glyph indexes, it is presented with Unicode code points, and it attempts to optimize for that case. So, the data structures used in FOray for storing and retrieving font data may seem "scrambled" when compared with the structures in the native font or metrics files. Just remember that the two are optimized for different tasks.

Optimize for Processing Epoch

Font-related tasks in FOray can be broken down into three general epochs:

  • Pre-processing. These are tasks, such as parsing metrics data, that are necessary to make the font usable for processing. These tend to be one-time events.
  • Processing. These are tasks, usually very repetitive, that occur as the document is processed. These include finding glyph widths and kerning information.
  • Post-processing. These are usually tasks related to writing font information into the output document. These tend to be one-time events.

Because of the repetitive nature of the Processing epoch, we wish to optimize for efficiency in that phase. In other words, we are willing to spend extra time in Pre-processing and Post-processing to arrange our data structures so that we can spend less time during Processing.

Optimize for Type of Data

The following is a list of tasks that are currently handled during the Processing epoch:

  • Encode a character.
  • Obtain the width of a character.
  • Kern a pair of characters.

FOray uses an Encoding class to manage the encoding and decoding of Unicode code points.

Regardless of the underlying font format, the glyph width information for the font has a one-to-one relationship with that font's character set. That is, each glyph in the font has exactly one glyph width associated with it. Therefore FOray stores width information for all fonts in an array parallel to the character set.

Kerning information, on the other hand, is completely variable in its relationship with the glyphs in the font. Some glyphs are not kerned at all; others are part of multiple kerning pairs. Therefore, it is more efficient to store kerning information by Unicode code point instead of by glyph index (as is done with widths). The structure for kerning information is therefore three parallel arrays: the first contains the first Unicode code point in the pair, the second contains the second code point in the pair, and the third contains the kerning value. After these arrays are created, they are sorted, primarily on the contents of the first array and secondarily on the contents of the second array. When kerning information is requested, the first two arrays are searched using a binary search. This logic is entirely encapsulated in the Kerning class.
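
The three-parallel-array layout and its two-key binary search might look roughly like the sketch below. KerningSketch is a hypothetical name, not the actual Kerning class, and the kerning values used with it are made up. The sketch assumes that a pair with no entry kerns by 0.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Hypothetical sketch of kerning storage in three parallel arrays. */
class KerningSketch {
    private final int[] first;   // first code point of each pair
    private final int[] second;  // second code point of each pair
    private final int[] value;   // kerning adjustment for the pair

    /** Each input row is {firstCodePoint, secondCodePoint, kerningValue}. */
    KerningSketch(int[][] pairs) {
        int[][] sorted = pairs.clone();
        // Sort primarily on the first code point, secondarily on the second.
        Arrays.sort(sorted, Comparator
                .comparingInt((int[] p) -> p[0])
                .thenComparingInt(p -> p[1]));
        first = new int[sorted.length];
        second = new int[sorted.length];
        value = new int[sorted.length];
        for (int i = 0; i < sorted.length; i++) {
            first[i] = sorted[i][0];
            second[i] = sorted[i][1];
            value[i] = sorted[i][2];
        }
    }

    /** Binary search on the (first, second) key; returns 0 when the pair is not kerned. */
    int kern(int a, int b) {
        int lo = 0;
        int hi = first.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int cmp = Integer.compare(first[mid], a);
            if (cmp == 0) {
                cmp = Integer.compare(second[mid], b);
            }
            if (cmp < 0) {
                lo = mid + 1;
            } else if (cmp > 0) {
                hi = mid - 1;
            } else {
                return value[mid];
            }
        }
        return 0;
    }
}
```

The sorting cost is paid once during Pre-processing, so each lookup during Processing is a single binary search over the pair table.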

Encoding

FOray uses the class org.foray.ps.encode.EncodingVector to encapsulate the encoding vectors in the style used by Type 1 fonts, and the class org.foray.ps.encode.CMap4 to encapsulate the Unicode cmap used by TrueType fonts. These classes are each subclasses of an abstract org.foray.ps.encode.Encoding class, which implements an aXSL interface exposed to client applications. (The encoding classes are in the PostScript module because they are really PostScript concepts.)

An EncodingVector technically maps a font index to a glyph name. It is the main mechanism whereby information about non-contiguous characters (Unicode code points) can be stored in (contiguous) arrays. However, for efficiency reasons, FOray does not store the glyph name, instead storing the glyph name's Unicode code point as the key. These code points are stored in an array which has a parallel array containing the glyph index which corresponds to the code point. These two arrays are then sorted in parallel on the contents of the code point array. Encoding a character requires first a binary search of the code point array, then simply retrieving the corresponding value in the glyph index array. Decoding a character currently requires searching (potentially) each element of the (out-of-order) glyph index array. Decoding should not be needed often, but, if a faster approach is needed, the first two arrays mentioned could be copied to another set of arrays, which would then be sorted in parallel by the glyph index array.
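
The encode and decode paths just described can be sketched as follows. SortedEncodingSketch and its methods are hypothetical names rather than the real EncodingVector API. The input is assumed to be a table giving the code point for each glyph index, with -1 marking an unassigned slot, and -1 is also the assumed "not found" result.

```java
import java.util.Arrays;

/** Hypothetical sketch of an encoding stored as two arrays sorted by code point. */
class SortedEncodingSketch {
    private final int[] codePoints;    // sorted ascending
    private final int[] glyphIndexes;  // parallel to codePoints

    /** codePointByIndex[i] is the code point at glyph index i, or -1 if unassigned. */
    SortedEncodingSketch(int[] codePointByIndex) {
        int count = 0;
        for (int cp : codePointByIndex) {
            if (cp >= 0) {
                count++;
            }
        }
        // Pack each (code point, glyph index) pair into one long so that a
        // single sort orders both arrays in parallel by code point.
        long[] packed = new long[count];
        int n = 0;
        for (int i = 0; i < codePointByIndex.length; i++) {
            if (codePointByIndex[i] >= 0) {
                packed[n++] = ((long) codePointByIndex[i] << 32) | i;
            }
        }
        Arrays.sort(packed);
        codePoints = new int[count];
        glyphIndexes = new int[count];
        for (int k = 0; k < count; k++) {
            codePoints[k] = (int) (packed[k] >>> 32);
            glyphIndexes[k] = (int) packed[k];
        }
    }

    /** Code point to glyph index by binary search; -1 if not encodable. */
    int encode(int codePoint) {
        int pos = Arrays.binarySearch(codePoints, codePoint);
        return pos < 0 ? -1 : glyphIndexes[pos];
    }

    /** Glyph index to code point by linear scan of the unsorted index array; -1 if absent. */
    int decode(int glyphIndex) {
        for (int k = 0; k < glyphIndexes.length; k++) {
            if (glyphIndexes[k] == glyphIndex) {
                return codePoints[k];
            }
        }
        return -1;
    }
}
```

The asymmetry is deliberate in this sketch: encode is logarithmic because it runs for every character processed, while decode is linear because it is expected to be rare.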

Implicit in FOray's approach of using code points in EncodingVectors is the ability to map glyph names into Unicode code points. This requires a glyph list.

The internal data of CMap4 instances very closely follows the data in a Format 4 TrueType cmap table. Where a contiguous range of code points maps to a contiguous range of glyph indexes, the data is stored as a range. Where it does not, the data is stored as an array of glyph indexes, the index into which can be readily computed by subtracting the starting value of the range from the code point.
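
A simplified sketch of this range-based lookup is shown below. Cmap4Sketch and its field names are hypothetical, and the layout is an assumption made for illustration: it omits details of the real Format 4 table, such as the modulo-65536 arithmetic on deltas. Glyph index 0 is returned for unmapped code points, following the TrueType convention that glyph 0 is the missing-glyph glyph.

```java
/** Hypothetical, simplified sketch of a Format 4 style segment lookup. */
class Cmap4Sketch {
    private final int[] startCode;   // first code point of each segment, ascending
    private final int[] endCode;     // last code point of each segment, ascending
    private final int[] firstGlyph;  // glyph index of startCode for contiguous segments
    private final int[][] glyphs;    // explicit glyph indexes, or null if contiguous

    Cmap4Sketch(int[] startCode, int[] endCode, int[] firstGlyph, int[][] glyphs) {
        this.startCode = startCode;
        this.endCode = endCode;
        this.firstGlyph = firstGlyph;
        this.glyphs = glyphs;
    }

    /** Returns the glyph index for the code point, or 0 if no segment covers it. */
    int encode(int codePoint) {
        // Binary search for the first segment whose endCode is >= codePoint.
        int lo = 0;
        int hi = endCode.length - 1;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (endCode[mid] < codePoint) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        if (endCode.length == 0 || codePoint < startCode[lo] || codePoint > endCode[lo]) {
            return 0;
        }
        int offset = codePoint - startCode[lo];
        // Contiguous segment: compute the index. Otherwise: look it up in the array.
        return glyphs[lo] == null ? firstGlyph[lo] + offset : glyphs[lo][offset];
    }
}
```

For example, a segment covering U+0041 through U+005A with a first glyph index of 36 needs no per-character storage, while a three-character segment whose glyphs are scattered carries its own three-element array. The glyph numbers here are invented.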

Character Sets

One potentially confusing issue is the difference between a font's encoding and its character set. For TrueType fonts with a Unicode cmap (the only kind currently supported by FOray), FOray treats the two as equivalent. In other words, the character set is the characters that the encoding can encode. However, the two are not equivalent with Type 1 fonts, because a Type 1 font may contain more glyphs than any one encoding can include. In other words, we treat the Unicode cmap of a TrueType font as a comprehensive internal encoding, but no such concept exists within a Type 1 font.

FOray uses the class org.foray.font.charset.CharSet to artificially create a comprehensive character set for fonts that need one. This allows all of the character information to be recorded for the font, independent of any encoding used. AFM files contain a CharacterSet entry, but we have found instances of fonts that contain characters outside of the character set described in this entry. The entry appears to be more of a general description than a comprehensive list of all available characters.

Note that while an Encoding needs to be able to convert Unicode code points to glyph indexes and vice versa, a FOray CharSet is really just an array of Unicode code points, sorted in code point order. This array can be searched using a binary search, and the array index for the code point is then used as an index into other (conceptually parallel) arrays that contain such information as glyph widths.
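
A minimal sketch of a character set with a parallel width array follows. CharSetSketch is a hypothetical name, not the real CharSet class. The caller is assumed to supply the code points already sorted, and -1 is an assumed "not in the character set" result.

```java
import java.util.Arrays;

/** Hypothetical sketch of a sorted character set with a parallel width array. */
class CharSetSketch {
    private final int[] codePoints;  // sorted ascending
    private final int[] widths;      // widths[i] belongs to codePoints[i]

    CharSetSketch(int[] sortedCodePoints, int[] widths) {
        if (sortedCodePoints.length != widths.length) {
            throw new IllegalArgumentException("arrays must be parallel");
        }
        this.codePoints = sortedCodePoints;
        this.widths = widths;
    }

    /** Position of the code point in the character set, or -1 if it is absent. */
    int indexOf(int codePoint) {
        int pos = Arrays.binarySearch(codePoints, codePoint);
        return pos < 0 ? -1 : pos;
    }

    /** Width of the glyph for the code point, or -1 if it is not in the character set. */
    int widthOf(int codePoint) {
        int pos = indexOf(codePoint);
        return pos < 0 ? -1 : widths[pos];
    }
}
```

The same position returned by indexOf could index any number of further parallel arrays, so adding another per-character property does not require another search structure.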

Glyph Lists

Since FOray does not store a glyph name in an EncodingVector, some mechanism is needed to convert the glyph name into a Unicode code point. This mechanism is called a glyph list. Adobe has created a document entitled "Unicode and Glyph Names", which partially addresses this issue. The first paragraph of this document says: "The purpose of the Adobe Glyph Naming convention is to support the computation of a Unicode character string from a sequence of glyphs. This is achieved by specifying a mapping from glyph names to character strings." As you can see, for our purposes, this is really going the wrong direction (from glyph name to Unicode code point instead of from Unicode code point to glyph name). However, it is useful anyway. The other interesting thing is that the Adobe Glyph List (AGL), which provides the actual map, maps some glyph names to more than one Unicode code point (e.g. for ligatures). This means that mapping the other way requires us to consider more than one character at a time. We know that we need to be able to do this anyway, to handle other context-sensitive issues that will come up in OpenType fonts.

The AGL contains 4,281 entries. It is sorted alphabetically, but even with binary searching, we don't want to look for each glyph if we can help it. The standard encodings can easily be mapped directly from Unicode code point to array index. For font-specific encodings, such a map can be built from the font metrics file, using the AGL.
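
One way such a map might be built for a font-specific encoding is sketched below, working from AFM character-metric lines of the form "C 33 ; WX 278 ; N exclam ; ...". AfmEncodingSketch and its method are hypothetical names, the glyph list is passed in as a plain map standing in for the AGL, and error handling is omitted. Glyphs with a character code of -1 (unencoded in the AFM file) and glyph names missing from the glyph list are skipped.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: build a code point to glyph index map from AFM lines. */
class AfmEncodingSketch {
    /**
     * @param charMetricLines AFM character-metric lines, one glyph per line
     * @param glyphList stand-in for the AGL: glyph name to Unicode code point
     * @return map from Unicode code point to the font-specific glyph index
     */
    static Map<Integer, Integer> build(String[] charMetricLines,
            Map<String, Integer> glyphList) {
        Map<Integer, Integer> indexByCodePoint = new TreeMap<>();
        for (String line : charMetricLines) {
            int code = -1;
            String name = null;
            // Fields are separated by semicolons; each starts with a key token.
            for (String field : line.split(";")) {
                String[] tokens = field.trim().split("\\s+");
                if (tokens.length < 2) {
                    continue;
                }
                if (tokens[0].equals("C")) {
                    code = Integer.parseInt(tokens[1]);
                } else if (tokens[0].equals("N")) {
                    name = tokens[1];
                }
            }
            Integer codePoint = name == null ? null : glyphList.get(name);
            if (code >= 0 && codePoint != null) {
                indexByCodePoint.put(codePoint, code);
            }
        }
        return indexByCodePoint;
    }
}
```

Each glyph name is looked up in the glyph list exactly once, during Pre-processing, so the cost of searching the AGL never reaches the Processing epoch.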

The class org.foray.ps.encode.GlyphList and its subclasses manage glyph lists within FOray. (The glyph list classes are in the PostScript module because they are really PostScript concepts.) Glyph lists are generally needed only during the Pre-processing epoch, when creating encodings or character sets for Type 1 fonts. A FOray GlyphList consists of an alphabetically sorted String array that contains the glyph names, and a parallel char array that contains the Unicode code point to which each glyph name maps. To map from a glyph name to a Unicode code point, a binary search is performed on the String array, and the corresponding char in the char array is returned.
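
The lookup just described can be sketched as follows. GlyphListSketch is a hypothetical name, not the real GlyphList API. The sketch assumes the names are sorted in Java's natural String order, in which uppercase letters sort before lowercase ones, and it returns -1 for an unknown name.

```java
import java.util.Arrays;

/** Hypothetical sketch of a glyph list: sorted names with parallel code points. */
class GlyphListSketch {
    private final String[] names;     // sorted in natural String order
    private final char[] codePoints;  // codePoints[i] belongs to names[i]

    GlyphListSketch(String[] sortedNames, char[] codePoints) {
        if (sortedNames.length != codePoints.length) {
            throw new IllegalArgumentException("arrays must be parallel");
        }
        this.names = sortedNames;
        this.codePoints = codePoints;
    }

    /** Glyph name to Unicode code point by binary search, or -1 if unknown. */
    int codePointFor(String glyphName) {
        int pos = Arrays.binarySearch(names, glyphName);
        return pos < 0 ? -1 : codePoints[pos];
    }

    /** A four-entry excerpt of the AGL, for illustration only. */
    static GlyphListSketch excerpt() {
        return new GlyphListSketch(
                new String[] {"A", "a", "exclam", "space"},
                new char[] {'\u0041', '\u0061', '\u0021', '\u0020'});
    }
}
```

Note that a single char per name, as in this sketch, cannot represent glyph names that map to more than one code point, such as the ligature entries mentioned above.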