

4 High-Level Description of GNU gperf

The perfect hash function generator gperf reads a set of “keywords” from an input file (or from the standard input by default). It attempts to derive a perfect hashing function that recognizes a member of the static keyword set with at most a single probe into the lookup table. If gperf succeeds in generating such a function it produces a pair of C source code routines that perform hashing and table lookup recognition. All generated C code is directed to the standard output. Command-line options described below allow you to modify the input and output format to gperf.

By default, gperf attempts to produce time-efficient code, with less emphasis on efficient space utilization. However, several options let you trade execution time for storage space and vice versa. In particular, expanding the generated table size produces a sparse search structure, generally yielding faster searches. Conversely, you can direct gperf to use a C switch statement scheme that minimizes data space storage size. Furthermore, using a C switch may actually speed up keyword retrieval somewhat. Actual results depend on your C compiler, of course.

In general, gperf assigns values to the bytes it is using for hashing until some set of values gives each keyword a unique value. A helpful heuristic is that the larger the hash value range, the easier it is for gperf to find and generate a perfect hash function. Experimentation is the key to getting the most from gperf.
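The search described above can be illustrated with a deliberately naive sketch. It is not gperf's actual algorithm: the four keywords, the choice of the first byte as the only hashed position, and the names toy_hash, all_unique and find_assignment are assumptions made for this illustration. The sketch tries every assignment of small values to the bytes 'd', 'f' and 'i' until each keyword hashes to a distinct value.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NKEYS 4
static const char *const keys[NKEYS] = { "do", "for", "if", "int" };

/* Hypothetical hash: keyword length plus the value assigned to its first byte. */
static unsigned toy_hash(const char *s, const unsigned char *asso)
{
    return (unsigned)strlen(s) + asso[(unsigned char)s[0]];
}

/* Return 1 if no two keywords share a hash value under asso[]. */
static int all_unique(const unsigned char *asso)
{
    for (size_t i = 0; i < NKEYS; i++)
        for (size_t j = i + 1; j < NKEYS; j++)
            if (toy_hash(keys[i], asso) == toy_hash(keys[j], asso))
                return 0;
    return 1;
}

/* Try every assignment of values 0..limit-1 to the bytes 'd', 'f', 'i'.
   On success, leave the assignment in asso[] and return 1; else return 0. */
int find_assignment(unsigned char asso[256], unsigned limit)
{
    assert(limit >= 1 && limit <= 256);
    memset(asso, 0, 256);
    for (unsigned d = 0; d < limit; d++)
        for (unsigned f = 0; f < limit; f++)
            for (unsigned i = 0; i < limit; i++) {
                asso['d'] = (unsigned char)d;
                asso['f'] = (unsigned char)f;
                asso['i'] = (unsigned char)i;
                if (all_unique(asso))
                    return 1;
            }
    return 0;
}
```

With values restricted to 0 and 1 the search fails for these keywords, while allowing the value 2 succeeds. This matches the heuristic that a larger hash value range makes a perfect hash function easier to find.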

4.1 Input Format to gperf

You can control the input file format by varying certain command-line arguments, in particular the ‘-t’ option. The input's appearance is similar to GNU utilities flex and bison (or UNIX utilities lex and yacc). Here's an outline of the general format:

declarations
%%
keywords
%%
functions

Unlike flex or bison, the declarations section and the functions section are optional. The following sections describe the input format for each section.

It is possible to omit the declaration section entirely, if the ‘-t’ option is not given. In this case the input file begins directly with the first keyword line, e.g.:

january
february
march
april
...

4.1.1 Declarations

The keyword input file optionally contains a section for including arbitrary C declarations and definitions, gperf declarations that act like command-line options, as well as for providing a user-supplied struct.

4.1.1.1 User-supplied struct

If the ‘-t’ option (or, equivalently, the ‘%struct-type’ declaration) is enabled, you must provide a C struct as the last component in the declaration section from the input file. The first field in this struct must be of type char * or const char * if the ‘-P’ option is not given, or of type int if the option ‘-P’ (or, equivalently, the ‘%pic’ declaration) is enabled. This first field must be called ‘name’, although it is possible to modify its name with the ‘-K’ option (or, equivalently, the ‘%define slot-name’ declaration) described below.

Here is a simple example, using months of the year and their attributes as input:

struct month { char *name; int number; int days; int leap_days; };
%%
january,   1, 31, 31
february,  2, 28, 29
march,     3, 31, 31
april,     4, 30, 30
may,       5, 31, 31
june,      6, 30, 30
july,      7, 31, 31
august,    8, 31, 31
september, 9, 30, 30
october,  10, 31, 31
november, 11, 30, 30
december, 12, 31, 31

A pair of consecutive percent signs, ‘%%’, appearing left justified in the first column, separates the struct declaration from the list of keywords and other fields, as in the UNIX utility lex.

If the struct has already been declared in an include file, it can be mentioned in an abbreviated form, like this:

struct month;
%%
january,   1, 31, 31
...

4.1.1.2 Gperf Declarations

The declaration section can contain gperf declarations. They influence the way gperf works, like command line options do. In fact, every such declaration is equivalent to a command line option. There are three forms of declarations:

  1. Declarations without argument, like ‘%compare-lengths’.
  2. Declarations with an argument, like ‘%switch=count’.
  3. Declarations of names of entities in the output file, like ‘%define lookup-function-name name’.

When a declaration is given both in the input file and as a command line option, the command-line option's value prevails.

The following gperf declarations are available.

‘%delimiters=delimiter-list’
Allows you to provide a string containing delimiters used to separate keywords from their attributes. The default is ",". This option is essential if you want to use keywords that have embedded commas or newlines.
‘%struct-type’
Allows you to include a struct type declaration for generated code; see above for an example.
‘%ignore-case’
Consider upper and lower case ASCII characters as equivalent. The string comparison will use a case-insensitive character comparison. Note that locale dependent case mappings are ignored.
‘%language=language-name’
Instructs gperf to generate code in the language specified by the option's argument. Languages handled are currently:
  ‘KR-C’
  Old-style K&R C. This language is understood by old-style C compilers and ANSI C compilers, but ANSI C compilers may flag warnings (or even errors) because of lacking ‘const’.
  ‘C’
  Common C. This language is understood by ANSI C compilers, and also by old-style C compilers, provided that you #define const to empty for compilers which don't know about this keyword.
  ‘ANSI-C’
  ANSI C. This language is understood by ANSI C (C89, ISO C90) compilers, ISO C99 compilers, and C++ compilers.
  ‘C++’
  C++. This language is understood by C++ compilers.
The default is ANSI-C.
‘%define slot-name name’
This declaration is only useful when option ‘-t’ (or, equivalently, the ‘%struct-type’ declaration) has been given. By default, the program assumes the structure component identifier for the keyword is ‘name’. This option allows an arbitrary choice of identifier for this component, although it still must occur as the first field in your supplied struct.
‘%define initializer-suffix initializers’
This declaration is only useful when option ‘-t’ (or, equivalently, the ‘%struct-type’ declaration) has been given. It lets you specify initializers for the structure members following slot-name in empty hash table entries. The list of initializers should start with a comma. By default, the emitted code will zero-initialize structure members following slot-name.
‘%define hash-function-name name’
Allows you to specify the name for the generated hash function. Default name is ‘hash’. This option permits the use of two hash tables in the same file.
‘%define lookup-function-name name’
Allows you to specify the name for the generated lookup function. Default name is ‘in_word_set’. This option permits multiple generated hash functions to be used in the same application.
‘%define class-name name’
This option is only useful when option ‘-L C++’ (or, equivalently, the ‘%language=C++’ declaration) has been given. It allows you to specify the name of the generated C++ class. Default name is Perfect_Hash.
‘%7bit’
This option specifies that all strings that will be passed as arguments to the generated hash function and the generated lookup function will solely consist of 7-bit ASCII characters (bytes in the range 0..127). (Note that the ANSI C functions isalnum and isgraph do not guarantee that a byte is in this range. Only an explicit test like ‘c >= 'A' && c <= 'Z'’ guarantees this.)
‘%compare-lengths’
Compare keyword lengths before trying a string comparison. This option is mandatory for binary comparisons (see section 4.3 Use of NUL bytes). It also might cut down on the number of string comparisons made during the lookup, since keywords with different lengths are never compared via strcmp. However, using ‘%compare-lengths’ might greatly increase the size of the generated C code if the lookup table range is large (which implies that the switch option ‘-S’ or ‘%switch’ is not enabled), since the length table contains as many elements as there are entries in the lookup table.
‘%compare-strncmp’
Generates C code that uses the strncmp function to perform string comparisons. The default action is to use strcmp.
‘%readonly-tables’
Makes the contents of all generated lookup tables constant, i.e., “readonly”. Many compilers can generate more efficient code for this by putting the tables in readonly memory.
‘%enum’
Define constant values using an enum local to the lookup function rather than with #defines. This also means that different lookup functions can reside in the same file. Thanks to James Clark <jjc@ai.mit.edu>.
‘%includes’
Include the necessary system include file, <string.h>, at the beginning of the code. By default, this is not done; you must include this header file yourself for the generated code to compile.
‘%global-table’
Generate the static table of keywords as a static global variable, rather than hiding it inside of the lookup function (which is the default behavior).
‘%pic’
Optimize the generated table for inclusion in shared libraries. This reduces the startup time of programs using a shared library containing the generated code. If the ‘%struct-type’ declaration (or, equivalently, the option ‘-t’) is also given, the first field of the user-defined struct must be of type ‘int’, not ‘char *’, because it will contain offsets into the string pool instead of actual strings. To convert such an offset to a string, you can use the expression ‘stringpool + o’, where o is the offset. The string pool name can be changed through the ‘%define string-pool-name’ declaration.
‘%define string-pool-name name’
Allows you to specify the name of the generated string pool created by the declaration ‘%pic’ (or, equivalently, the option ‘-P’). The default name is ‘stringpool’. This declaration permits the use of two hash tables in the same file, with ‘%pic’ and even when the ‘%global-table’ declaration (or, equivalently, the option ‘-G’) is given.
‘%null-strings’
Use NULL strings instead of empty strings for empty keyword table entries. This reduces the startup time of programs using a shared library containing the generated code (but not as much as the declaration ‘%pic’), at the expense of one more test-and-branch instruction at run time.
‘%define constants-prefix prefix’
Allows you to specify a prefix for the constants TOTAL_KEYWORDS, MIN_WORD_LENGTH, MAX_WORD_LENGTH, and so on. This option permits the use of two hash tables in the same file, even when the option ‘-E’ (or, equivalently, the ‘%enum’ declaration) is not given or the option ‘-G’ (or, equivalently, the ‘%global-table’ declaration) is given.
‘%define word-array-name name’
Allows you to specify the name for the generated array containing the hash table. Default name is ‘wordlist’. This option permits the use of two hash tables in the same file, even when the option ‘-G’ (or, equivalently, the ‘%global-table’ declaration) is given.
‘%define length-table-name name’
Allows you to specify the name for the generated array containing the length table. Default name is ‘lengthtable’. This option permits the use of two length tables in the same file, even when the option ‘-G’ (or, equivalently, the ‘%global-table’ declaration) is given.
‘%switch=count’
Causes the generated C code to use a switch statement scheme, rather than an array lookup table. This can lead to a reduction in both time and space requirements for some input files. The argument to this option determines how many switch statements are generated. A value of 1 generates 1 switch containing all the elements, a value of 2 generates 2 switches with 1/2 the elements in each switch, etc. This is useful since many C compilers cannot correctly generate code for large switch statements. This option was inspired in part by Keith Bostic's original C program.
‘%omit-struct-type’
Prevents the transfer of the type declaration to the output file. Use this option if the type is already defined elsewhere.
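The offset scheme described under ‘%pic’ above can be sketched by hand as follows. This is a simplified illustration, not the code gperf emits: the pool contents, the table pool_months and the helper month_name are assumptions, and only three keywords are shown. The point is that the first struct field holds an int offset, and ‘stringpool + o’ turns it back into a string.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical pool: keywords stored back to back, each NUL terminated. */
static const char stringpool[] = "january\0february\0march";

/* With offsets, the first field is an int rather than a char pointer. */
struct month { int name; int number; int days; int leap_days; };

static const struct month pool_months[] = {
    {  0, 1, 31, 31 },  /* "january" starts at offset 0 */
    {  8, 2, 28, 29 },  /* "february" starts at offset 8 */
    { 17, 3, 31, 31 },  /* "march" starts at offset 17 */
};

/* Convert the stored offset back to a string: stringpool + o. */
const char *month_name(const struct month *m)
{
    assert(m != NULL);
    return stringpool + m->name;
}
```

Because the table contains only integers, it holds no pointers that the dynamic loader would have to relocate when the shared library is loaded.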

4.1.1.3 C Code Inclusion

Using a syntax similar to GNU utilities flex and bison, it is possible to directly include C source text and comments verbatim into the generated output file. This is accomplished by enclosing the region inside left-justified surrounding ‘%{’, ‘%}’ pairs. Here is an input fragment based on the previous example that illustrates this feature:

%{
#include <assert.h>
/* This section of code is inserted directly into the output. */
int return_month_days (struct month *months, int is_leap_year);
%}
struct month { char *name; int number; int days; int leap_days; };
%%
january,   1, 31, 31
february,  2, 28, 29
march,     3, 31, 31
...

4.1.2 Format for Keyword Entries

The second input file format section contains lines of keywords and any associated attributes you might supply. A line beginning with ‘#’ in the first column is considered a comment. Everything following the ‘#’ is ignored, up to and including the following newline. A line beginning with ‘%’ in the first column is an option declaration and must not occur within the keywords section.

The first field of each non-comment line is always the keyword itself. It can be given in two ways: as a simple name, i.e., without surrounding string quotation marks, or as a string enclosed in double-quotes, in C syntax, possibly with backslash escapes like \" or \234 or \xa8. In either case, it must start right at the beginning of the line, without leading whitespace. In this context, a “field” is considered to extend up to, but not include, the first blank, comma, or newline. Here is a simple example taken from a partial list of C reserved words:

# These are a few C reserved words, see the c.gperf file 
# for a complete list of ANSI C reserved words.
unsigned
sizeof
switch
signed
if
default
for
while
return

Note that unlike flex or bison the first ‘%%’ marker may be elided if the declaration section is empty.

Additional fields may optionally follow the leading keyword. Fields should be separated by commas, and terminate at the end of line. What these fields mean is entirely up to you; they are used to initialize the elements of the user-defined struct provided by you in the declaration section. If the ‘-t’ option (or, equivalently, the ‘%struct-type’ declaration) is not enabled these fields are simply ignored. All previous examples except the last one contain keyword attributes.
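As a rough picture of how the additional fields are used, the hand-written sketch below turns a few keyword lines from the months example into struct initializers. The table name month_table and the functions find_month and days_in are hypothetical, and the linear search merely stands in for the lookup function that gperf would generate.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct month { const char *name; int number; int days; int leap_days; };

/* Each keyword line "name, f1, f2, f3" supplies one initializer. */
static const struct month month_table[] = {
    { "january",  1, 31, 31 },
    { "february", 2, 28, 29 },
    { "march",    3, 31, 31 },
    { "april",    4, 30, 30 },
};

/* Linear search stand-in for the generated lookup function. */
const struct month *find_month(const char *name)
{
    assert(name != NULL);
    for (size_t i = 0; i < sizeof month_table / sizeof month_table[0]; i++)
        if (strcmp(name, month_table[i].name) == 0)
            return &month_table[i];
    return NULL;
}

/* Pick the attribute that applies to the given kind of year. */
int days_in(const struct month *m, int is_leap_year)
{
    assert(m != NULL);
    return is_leap_year ? m->leap_days : m->days;
}
```

A caller would look up a keyword once and then read whichever attributes it needs from the returned struct.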

4.1.3 Including Additional C Functions

The optional third section also corresponds closely with conventions found in flex and bison. All text in this section, starting at the final ‘%%’ and extending to the end of the input file, is included verbatim into the generated output file. Naturally, it is your responsibility to ensure that the code contained in this section is valid C.

4.1.4 Where to place directives for GNU indent.

If you want to invoke GNU indent on a gperf input file, you will see that GNU indent doesn't understand the ‘%%’, ‘%{’ and ‘%}’ directives that control gperf's interpretation of the input file. Therefore you have to insert some directives for GNU indent. More precisely, assuming the most general input file structure

declarations part 1
%{
verbatim code
%}
declarations part 2
%%
keywords
%%
functions

you would insert ‘*INDENT-OFF*’ and ‘*INDENT-ON*’ comments as follows:

/* *INDENT-OFF* */
declarations part 1
%{
/* *INDENT-ON* */
verbatim code
/* *INDENT-OFF* */
%}
declarations part 2
%%
keywords
%%
/* *INDENT-ON* */
functions

4.2 Output Format for Generated C Code with gperf

Several options control how the generated C code appears on the standard output. Two C functions are generated. They are called hash and in_word_set, although you may modify their names with a command-line option. Both functions require two arguments: a string, const char * str, and a length parameter, size_t len. Their default function prototypes are as follows:

Function: unsigned int hash (const char * str, size_t len)
By default, the generated hash function returns an integer value created by adding len to several user-specified str byte positions indexed into an associated values table stored in a local static array. The associated values table is constructed internally by gperf and later output as a static local C array called ‘asso_values’. The relevant selected positions (i.e. indices into str) are specified via the ‘-k’ option when running gperf, as detailed in the Options section below (see section 5 Invoking gperf).

Function: in_word_set (const char * str, size_t len)
If str is in the keyword set, returns a pointer to that keyword. More exactly, if the option ‘-t’ (or, equivalently, the ‘%struct-type’ declaration) was given, it returns a pointer to the matching keyword's structure. If str is not in the keyword set, it returns NULL.

If the option ‘-c’ (or, equivalently, the ‘%compare-strncmp’ declaration) is not used, str must be a NUL terminated string of exactly length len. If ‘-c’ (or, equivalently, the ‘%compare-strncmp’ declaration) is used, str must simply be an array of len bytes and does not need to be NUL terminated.
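To make the two-function interface concrete, here is a hand-written sketch for the keywords "do", "for", "if" and "int". It is not gperf output: the names sketch_hash and sketch_in_word_set, the hand-picked value table, and the decision to hash only the first byte are assumptions made for this example. It follows the default convention above, so str must be NUL terminated and exactly len bytes long.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { MIN_WORD_LENGTH = 2, MAX_WORD_LENGTH = 3, MAX_HASH_VALUE = 5 };

/* Hash: len plus the value associated with the first byte of str. */
unsigned int sketch_hash(const char *str, size_t len)
{
    /* Hand-picked values: every byte maps to 0 except 'i', which maps to 2. */
    static const unsigned char values[256] = { ['i'] = 2 };
    return (unsigned int)len + values[(unsigned char)str[0]];
}

/* Return the matching keyword, or NULL if str is not a keyword. */
const char *sketch_in_word_set(const char *str, size_t len)
{
    /* Slots 0 and 1 are padding: no keyword hashes there. */
    static const char *const wordlist[] = { "", "", "do", "for", "if", "int" };

    assert(str != NULL);
    if (len < MIN_WORD_LENGTH || len > MAX_WORD_LENGTH)
        return NULL;

    unsigned int key = sketch_hash(str, len);
    if (key > MAX_HASH_VALUE)
        return NULL;

    /* Single probe: one table slot, at most one string comparison. */
    const char *candidate = wordlist[key];
    if (*str == *candidate && strcmp(str + 1, candidate + 1) == 0)
        return candidate;
    return NULL;
}
```

A non-keyword such as "in" hashes to the slot of "if" and is rejected by the final comparison, which is why the lookup still needs one string comparison after hashing.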

The code generated for these two functions is affected by the following options:

‘-t’
‘--struct-type’
Make use of the user-defined struct.
‘-S total-switch-statements’
‘--switch=total-switch-statements’
Generate one or more C switch statements rather than using a large (and potentially sparse) static array. Although the exact time and space savings of this approach vary according to your C compiler's degree of optimization, this method often results in smaller and faster code.

If the ‘-t’ and ‘-S’ options (or, equivalently, the ‘%struct-type’ and ‘%switch’ declarations) are omitted, the default action is to generate a char * array containing the keywords, together with additional empty strings used for padding the array. By experimenting with the various input and output options, and timing the resulting C code, you can determine the best option choices for different keyword set characteristics.
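The difference between the array scheme and the switch scheme can be sketched by hand for the keywords "do", "for", "if" and "int". The code below is an illustration only, not what gperf generates for ‘-S’; the names switch_hash and switch_lookup and the hash itself are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical hash: length, plus 2 when the first byte is 'i'. */
static unsigned int switch_hash(const char *str, size_t len)
{
    return (unsigned int)len + ((unsigned char)str[0] == 'i' ? 2u : 0u);
}

/* Return the matching keyword, or NULL if str is not a keyword. */
const char *switch_lookup(const char *str, size_t len)
{
    const char *candidate;

    assert(str != NULL);
    if (len < 2 || len > 3)
        return NULL;

    /* Each case stands in for one occupied slot of the array version;
       empty slots need no storage, they fall into the default case. */
    switch (switch_hash(str, len)) {
    case 2: candidate = "do";  break;
    case 3: candidate = "for"; break;
    case 4: candidate = "if";  break;
    case 5: candidate = "int"; break;
    default: return NULL;
    }
    return strcmp(str, candidate) == 0 ? candidate : NULL;
}
```

Whether this form is smaller or faster than a padded array depends on how the compiler translates the switch, so timing both variants remains the reliable way to choose.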

4.3 Use of NUL bytes

By default, the code generated by gperf operates on zero terminated strings, the usual representation of strings in C. This means that the keywords in the input file must not contain NUL bytes, and the str argument passed to hash or in_word_set must be NUL terminated and have exactly length len.

If option ‘-c’ (or, equivalently, the ‘%compare-strncmp’ declaration) is used, then the str argument does not need to be NUL terminated. The code generated by gperf will only access the first len, not len+1, bytes starting at str. However, the keywords in the input file still must not contain NUL bytes.

If option ‘-l’ (or, equivalently, the ‘%compare-lengths’ declaration) is used, then the hash table performs binary comparison. The keywords in the input file may contain NUL bytes, written in string syntax as \000 or \x00, and the code generated by gperf will treat NUL like any other byte. Also, in this case the ‘-c’ option (or, equivalently, the ‘%compare-strncmp’ declaration) is ignored.
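The reason a length comparison makes NUL bytes usable can be shown with a small sketch. The helper binary_keyword_equal is hypothetical and is not part of the generated code; it only demonstrates comparing lengths first and then comparing bytes with memcmp, which does not stop at a NUL byte the way strcmp does.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return 1 if the len bytes at str equal the keyword_len bytes at keyword. */
int binary_keyword_equal(const char *str, size_t len,
                         const char *keyword, size_t keyword_len)
{
    assert(str != NULL && keyword != NULL);
    if (len != keyword_len)
        return 0;                         /* lengths differ: no byte comparison needed */
    return memcmp(str, keyword, len) == 0; /* NUL is compared like any other byte */
}
```

For the 3-byte inputs "a\0b" and "a\0c", strcmp sees two equal one-character strings, whereas the length-plus-memcmp comparison correctly reports them as different.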

4.4 Controlling Identifiers

The identifiers of the functions, tables, and constants defined by the code generated by gperf can be controlled through gperf declarations or the equivalent command-line options. This is useful for three purposes:

4.5 The Copyright of the Output

gperf is under GPL, but that does not cause the output produced by gperf to be under GPL. The reason is that the output contains only small pieces of text that come directly from gperf's source code (only about 7 lines long, too small to be significant), and therefore the output is not a “work based on gperf” (in the sense of the GPL version 3).

On the other hand, the output produced by gperf contains essentially all of the input file. Therefore the output is a “derivative work” of the input (in the sense of U.S. copyright law); and its copyright status depends on the copyright of the input. For most software licenses, the result is that the output is under the same license, with the same copyright holder, as the input that was passed to gperf.

