lex generates a file of C code called lex.yy.c. This
file must be compiled by the C compiler and linked with
a main routine. The program should be linked with the
lex library, using the -ll option to
cc or ld. This library supplies a main routine.
The lexical analyzer
routine produced is called yylex. This routine reads
its input and, when a token is recognized,
executes the code associated with the token class.
The default action is to write the token to the standard output.
The string matched by the regular expression defining the token class
is placed in yytext, a character array.
The variable yyleng gives the length of the matched string.
This value of yytext may be copied into an external
array to make it available to other routines.
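Copying is needed because yytext is overwritten by the next match. The following is a minimal sketch of such a copy, not part of lex itself. It assumes yytext and yyleng behave as described above; both are defined locally here as stand-ins so that the fragment compiles without lex.yy.c, and the names save_token, saved_token and TOKEN_MAX are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define TOKEN_MAX 256            /* hypothetical size of the external array */

/* Stand-ins for the variables that lex.yy.c would normally provide. */
char yytext[TOKEN_MAX] = "count";
int  yyleng = 5;

char saved_token[TOKEN_MAX];     /* external array visible to other routines */

/* Copy the current match out of yytext; returns the number of bytes copied. */
int save_token(void)
{
    int n;

    assert(yyleng >= 0);
    /* Truncate rather than overflow if the match is longer than the array. */
    n = yyleng < TOKEN_MAX - 1 ? yyleng : TOKEN_MAX - 1;
    memcpy(saved_token, yytext, (size_t)n);
    saved_token[n] = '\0';
    return n;
}
```

In a real lex input file the two stand-in definitions would be dropped, and save_token() would be called from the action of the rule whose text is to be kept.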
The regular expressions understood by lex contain many of the usual operators and special characters. The following table summarizes these:
Pattern | Meaning |
---|---|
string | the literal string |
* | zero or more occurrences of the preceding pattern |
+ | one or more occurrences of the preceding pattern |
? | zero or one occurrences of the preceding pattern |
. | any single character |
\| | alternation |
( ) | used for grouping |
^ | beginning of an input line |
$ | end of an input line |
pattern{n,m} | n to m occurrences of pattern |
pattern{n} | n occurrences of pattern |
[string] | any character in string |
[^string] | any character not in string |
[char1-char2] | any character in the range char1-char2 |
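Most of these operators have the same meaning in POSIX extended regular expressions, so a pattern can be tried out in plain C before it is placed in a lex rule. The sketch below uses the POSIX regcomp() and regexec() functions for that purpose. It is not how lex itself matches input, and the helper name matches_whole is hypothetical.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>

/* Return 1 if text as a whole matches pattern, 0 if it does not,
   and -1 if the pattern is too long or cannot be compiled. */
int matches_whole(const char *pattern, const char *text)
{
    char anchored[256];
    regex_t re;
    int n, rc;

    assert(pattern != NULL && text != NULL);
    /* Anchor the pattern so that it must cover the whole of text. */
    n = snprintf(anchored, sizeof anchored, "^(%s)$", pattern);
    if (n < 0 || (size_t)n >= sizeof anchored)
        return -1;
    if (regcomp(&re, anchored, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

For instance, matches_whole("a{2,3}", "aaa") returns 1, while matches_whole("a{2,3}", "aaaa") returns 0.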
The declarations section of a lex input file may contain variable declarations, #include statements, and abbreviations for regular expressions. The subroutines section contains user-defined functions used by the lexical analyzer.
Any line beginning with a blank is assumed to contain only C text and is copied to the file lex.yy.c; if it is in the declarations section, it is copied into the external definition area of the lex.yy.c file. Variable declarations and #include statements should be placed in a section delimited by %{ and %}. Abbreviations consist of a symbol on the left of the line and its replacement text to the right. When abbreviations are used they are surrounded by curly braces, {}.
Three I/O routines are defined: input() reads a character; unput(c) returns a character to the input stream; output(c) outputs a character. These routines may be redefined by the user.
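The relationship between input() and unput(c) can be pictured as a pushback stack placed in front of the input stream. The following sketch shows that idea only and is not the code that lex generates. The names sketch_set_source, sketch_input, sketch_unput and PUSHBACK_MAX are hypothetical, and returning 0 at end of input is an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>

#define PUSHBACK_MAX 64          /* hypothetical pushback capacity */

static const char *src = "";     /* hypothetical input source: a C string */
static char pushback[PUSHBACK_MAX];
static size_t npushed = 0;

/* Select the string the sketch reads from and discard any pushback. */
void sketch_set_source(const char *s)
{
    src = s;
    npushed = 0;
}

/* Like input(): return the next character, or 0 at end of input. */
int sketch_input(void)
{
    if (npushed > 0)
        return (unsigned char)pushback[--npushed];
    if (*src == '\0')
        return 0;
    return (unsigned char)*src++;
}

/* Like unput(c): return c to the stream so that the next read sees it first. */
void sketch_unput(int c)
{
    assert(npushed < PUSHBACK_MAX);
    pushback[npushed++] = (char)c;
}
```

Characters pushed back are read again in reverse order of the calls: after sketch_unput('x') and sketch_unput('y'), the next two reads return 'y' and then 'x'.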
Other built-in routines include the following: REJECT, on the right side of the rule, causes the match to be rejected and the next suitable match executed; the function yymore() accumulates additional characters into yytext; the function yyless(p) pushes back the portion of the string matched beginning at position p.
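The effect of yyless(p) can be sketched with two buffers, one holding the current match and one holding the input not yet read. This is an illustration of the behaviour described above under simplifying assumptions, not the implementation used by lex. The names match, match_len, rest, sketch_yyless and BUF_MAX are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define BUF_MAX 128      /* hypothetical buffer size */

char match[BUF_MAX];     /* stand-in for yytext */
int  match_len;          /* stand-in for yyleng */
char rest[BUF_MAX];      /* stand-in for the input not yet read */

/* Like yyless(p): keep match[0..p-1] and return match[p..] to the input. */
void sketch_yyless(int p)
{
    size_t tail, rest_len;

    assert(p >= 0 && p <= match_len);
    tail = (size_t)(match_len - p);
    rest_len = strlen(rest);
    assert(tail + rest_len < BUF_MAX);

    memmove(rest + tail, rest, rest_len + 1);   /* make room, keep the NUL */
    memcpy(rest, match + p, tail);              /* pushed-back text is read first */
    match[p] = '\0';
    match_len = p;
}
```

With match holding "foobar" and rest holding " baz", a call of sketch_yyless(3) leaves "foo" in match and "bar baz" waiting in rest.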
The variable names generated by lex all begin with the prefix yy or YY. Users should avoid defining variables starting with these prefixes.
The lexical analyzer is implemented as a finite state machine whose table sizes can be configured in the declarations section. This is done with a declaration of the following form, where x is a key letter and n is an integer:
    %x n

The following parameters may be set in this way:
Key letter | Meaning | Default |
---|---|---|
p | number of positions | 2500 |
n | number of states | 500 |
e | number of parse tree nodes | 1000 |
a | number of transitions | 2000 |
k | number of packed character classes | 1000 |
o | size of output array | 3000 |
Multiple files on the command line are concatenated and treated as a single file. If no files are given, standard input is used.
    %{
    #include "global.h"
    int count;
    %}
    D    [0-9]
    %%
    if         { printf("IF statement\n"); count++; }
    [a-z]+     printf("tag, value %s\n", yytext);
    0{D}+      printf("octal number %s\n", yytext);
    {D}+       printf("decimal number %s\n", yytext);
    "++"       printf("unary op\n");
    "+"        printf("binary op\n");
    "/*"       skipcommnts();
    %%
    skipcommnts()
    {
        int c;

        for (;;) {
            /* skip to the next '*' */
            while (input() != '*')
                ;
            /* push back the character just read, so that a second '*' is examined again */
            if ((c = input()) != '/')
                unput(c);
            else
                return;
        }
    }
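The comment-skipping loop in the example can be exercised outside lex by supplying stand-ins for input() and unput(). The sketch below is one way to do this and makes several assumptions: the stand-ins read from a C string, hold at most one pushed-back character, and return 0 at end of input. It also stops at end of input instead of looping, and pushes back the character it has just read. All names beginning with demo_ are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

static const char *src = "";   /* hypothetical input: a C string */
static int held = -1;          /* one character of pushback, -1 if empty */

/* Select the string to read from and clear the pushback slot. */
void demo_set_source(const char *s)
{
    src = s;
    held = -1;
}

/* Stand-in for input(): next character, or 0 at end of input (assumed). */
int demo_input(void)
{
    if (held >= 0) {
        int c = held;
        held = -1;
        return c;
    }
    return *src ? (unsigned char)*src++ : 0;
}

/* Stand-in for unput(c), limited to one character of pushback. */
void demo_unput(int c)
{
    assert(held < 0);
    held = c;
}

/* Consume input up to and including the closing star-slash.
   Returns 1 if the comment was closed, 0 if input ended first. */
int demo_skip_comment(void)
{
    int c;

    for (;;) {
        while ((c = demo_input()) != '*')
            if (c == 0)
                return 0;        /* unterminated comment */
        c = demo_input();
        if (c == '/')
            return 1;
        if (c == 0)
            return 0;
        demo_unput(c);           /* examine c again: it may be another '*' */
    }
}
```

The function is called with the source positioned just after the opening delimiter, which is where the scanner stands when the "/*" rule fires. Pushing back the character just read lets the loop handle a run of asterisks before the closing slash, and keeps text such as a*b/ inside a comment from ending it early.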
X/Open Portability Guide, Issue 3, 1989.