|
|
If you work on a compiler project or develop a program to check the validity of an input language, you may want to use the UNIX system program tool yacc(CP). lex can conveniently be used with yacc to develop compilers. For a more complete discussion, see ``yacc''. Whether or not you plan to use lex with yacc, this part is useful because it covers information of interest to all lex programmers.
No matter what kind of parser you are using, the lexical analyzer generated by lex is invoked through a call to the function yylex(). yylex is an integer-valued function. This name is convenient because the built-in function call that yacc-generated parsers use to invoke the lexical analyzer is also yylex().
If your lexical analyzer is to be used along with a parser, each lex action should end with a return statement of the form:
return token;or
return(token);
Here, token is an integer-valued variable, literal, or macro, or a character enclosed in single quotes. The value returned indicates to the parser what kind of token the lexical analyzer has found. The parser then resumes control and later makes another call to the lexical analyzer (via a call to yylex()) when it needs another token.
The different return values indicate to the parser
whether the token is a reserved word,
identifier, constant, arithmetic operator, relational operator,
or other token type.
In some cases, such as reserved words, and sometimes operators,
the return value indicates which token was recognized: for example,
a value of 16 might indicate the reserved word else in
a C compiler.
In the other cases, especially identifiers and constants, there needs
to be another mechanism for telling the parser the exact value
of the token.
Consider the following portion of lex source for a lexical analyzer
for a hypothetical programming language:
1 %{ 2 include "token.defs" 3 extern int tokval; 4 %} 5 %% 6 begin return(BEGIN); 7 end return(END); 8 while return(WHILE); 9 if return(IF); 10 package return(PACKAGE); 11 reverse return(REVERSE); 12 loop return(LOOP); 13 "(" return('('); 14 ")" return(')'); 15 [a-zA-Z][a-zA-Z0-9]* { tokval = put_in_tabl(); 16 return(IDENTIFIER); } 17 [0-9]+ { tokval = put_in_tabl(); 18 return(INTEGER); } 19 \+ { tokval = PLUS; 20 return(ARITHOP); } 21 - { tokval = MINUS; 22 return(ARITHOP); } 23 > { tokval = GREATER; 24 return(RELOP); } 25 >= { tokval = GREATEREQL; 26 return(RELOP); } 27 %% 28 put_in_tabl() 29 { 30 . 31 . 32 . 33 }The identifiers BEGIN, END, and WHILE, used as return values would typically be integer-valued macros defined in some header file: in this case, token.defs (see line 2). The header file would look something like this:
#define BEGIN 1 #define END 2 #define WHILE 3 . . . #define RELOP 12If it becomes necessary to change the value of some token type, then a #define statement in the header file is changed. If using yacc to generate the parser, you may insert the following statement into the definitions section of your lex source:
#include "y.tab.h"
To indicate the reserved words in the example, the returned integer values suffice. If a token consists of a single character, like the ``('' and ``)'' tokens in lines 13 and 14 of the example, its literal value may be passed to the parser in the ``return'' statement. For the other token types (lines 15-26), the integer value identifying the token is stored in the programmer-defined variable tokval. This variable is globally defined so that the parser as well as the lexical analyzer can access it. If the parser has been generated by yacc, the variable yylval should be used for this purpose. Parsers generated by yacc look in that variable for the value of the token. yylval is defined in the file y.tab.h. It is an integer variable by default, but it can be redefined as a C union in the yacc specification.
This example illustrates the use of a symbol table. The symbol table is a data structure that is globally accessible (or at least accessible by the parser); it stores the text of most or all of the tokens found in a particular input and associates them with identifying ``indexes''. The function put_in_tabl() helps maintain the symbol table. This function may be defined in the subroutines section of the lex specification, or in an included file. The implementation details of this function are not important here, but it may be understood to carry out the following tasks:
The index is a unique identifying integer: no other identifier may have the same index. The index could be an array subscript, but the details of this are unimportant to us as long as the value is unique.
Note that the example shows two ways to assign a value to tokval. put_in_tabl() may assign a symbol table index to tokval. In the last few actions of the example (line 15-26), tokval is assigned a specific integer (in the form of a defined macro) indicating which operator the analyzer recognized. For example, in the last rule in the specification (lines 25 and 26), the lexical analyzer indicates the general class of operator by returning the value RELOP to the parser. This enables the parser to check the input for syntactic correctness. Further, the lexical analyzer indicates the specific operator by assigning the value GREATEREQL to tokval. The parser has some way of associating this value with the string >= . Quite possibly the symbol table will have a set of base entries that includes >= . The definition of GREATEREQL, and the other predefined token indexes will not necessarily appear in the same set of definitions as the token types.