Using MKS Lex & Yacc Effectively

"I've tried the other Yaccs and MKS Lex & Yacc is by far the best on the market, with the best documentation and the neatest technical support people."
Tom Campbell, Systems and Software, Inc.

Introduction

Many programming tools promise to make software development easier by assuming partial responsibility for producing source code. Often a new tool produces results that fall short of expectations, and effective use requires techniques that must be discovered through trial and error. The lexical analyzer generator lex is a powerful tool for developing applications, but it has a reputation for being difficult to learn and use well.

This article presents some simple tips for making the best use of lex. It will be of most interest to users who already have some knowledge of lex.

Remember this advice when writing your lex specifications:
"Poorly written lex expressions can require arbitrary lookahead; if you keep running out of table entries, you almost certainly need to rethink how you are writing your expressions! Remember to leave the parsing to yacc, and to keep your lex input simple and clean."

Unfortunately, it is not always obvious to new users which methods are expensive* and which are "simple and clean".

* In this article an "expensive" action is one which results in significantly larger state tables in the generated scanner, a somewhat slower scanner, or both.

The sample code in this article should be portable to most C compilers. Although the code is written in C, the ideas discussed are also applicable to C++ and other applications.

Matching Comments

The recognition of comments in input is one area where the combination of a simple lex pattern with a small amount of C can reduce the complexity of a scanner.

Consider the problem of recognizing C style comments. A comment begins with the two-character sequence "/*" and ends with the first subsequent instance of the sequence "*/". A lex pattern which will recognize all legal comments is quite complex (and nearly unreadable):

            "/*""/"*([^*/]|[^*]"/"|"*"[^/])*"*"*"*/"
            

It seems unreasonable to use so complex an expression to recognize a construct which will very likely be thrown away as soon as it is found!

Fortunately, MKS Lex & Yacc provides a support function which simplifies the handling of comments. The yycomment() function may be called when the character sequence that marks the beginning of a comment has been found. The function takes a string giving the character sequence that marks the end of a comment, and skips over characters in the input until that sequence is found. C style comments may be matched with the following:

            "/*"	yycomment( "*/" );
            

Matching Literal Strings

Scanning a quoted string is a common operation in most lexical analyzers. A pattern which recognizes C style string constants and correctly handles escaped newlines and double quotes would be:

            \\"([^"\\n]|\\\\["\\n])*\\"
            

There are several problems associated with this pattern: it is almost as unreadable as the comment pattern above, and it leaves any escape sequences in the matched text uninterpreted, so the action must process them in a second pass.

Again, this is a situation where judicious use of C can streamline the scanner. The idea is to use a lex pattern to recognize the start of a quoted string, and a C action to place the contents of the string into a buffer. A copy of the buffer is returned to the calling function:

            \\" {
	            static char buffer[ BUFFER_SIZE ];
	            char *ptr;
	            int c;

	            ptr = buffer;
	            while ( (c = yymapch('"', '\\\\')) != EOF ) *ptr++ = c;
	            *ptr = '\\0';
	            yylval.string = strdup( buffer );
	            return ( STRING );
            }
            

MKS Lex & Yacc provides the function yymapch(), which returns the next character in the input stream, or EOF when a specified delimiter is encountered. This function also interprets the usual C escape characters.

The static character buffer should be large enough to hold any reasonable string. However, as it is written, the code still suffers from the limits imposed by a fixed size buffer. (In fact, there is a bug in the code: no checking is done to ensure that the buffer will not overflow. This is strictly in the interests of clarity.) If the code is rewritten so that the static buffer is allocated using malloc(), then it will be possible to grow the buffer using realloc() when a very large string is encountered.
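
A minimal sketch of that rewrite follows. It is not from the original code: the doubling strategy is one choice among many, <stdlib.h> and <string.h> are assumed to be included in the definitions section, and allocation failures are still unchecked in the interests of clarity:

            \" {
	            static char *buffer = NULL;
	            static size_t size = 0;
	            size_t len = 0;
	            int c;

	            /* Allocate the initial buffer on first use. */
	            if ( buffer == NULL ) {
		            size = BUFFER_SIZE;
		            buffer = malloc( size );
	            }
	            while ( (c = yymapch('"', '\\')) != EOF ) {
		            /* Double the buffer whenever it fills up. */
		            if ( len + 1 >= size ) {
			            size *= 2;
			            buffer = realloc( buffer, size );
		            }
		            buffer[len++] = c;
	            }
	            buffer[len] = '\0';
	            yylval.string = strdup( buffer );
	            return ( STRING );
            }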

Note that another approach to this problem might be to use exclusive start conditions in lex, similar to the example described in the "Ambiguity and Look Ahead" section of the lex Programmer's Guide.

Recognizing Reserved Words

Most common languages contain a number of reserved words which are always recognized as such, regardless of context. A common approach to matching reserved words and identifiers is the following:

            "if"		return ( IF );
            "then"		return ( THEN );
            "else"		return ( ELSE );
             ...

            [a-zA-Z_][a-zA-Z_0-9]*	{
	            yylval.string = strdup( yytext );
	            return ( IDENTIFIER );
            }
            

However, a lex-generated scanner is not particularly efficient at matching a number of long, literal strings.

A better approach is to use a single pattern to match all identifiers and reserved words in the input. A user defined C function reserved() is used to check each string against a list of reserved words:

            [a-zA-Z_][a-zA-Z_0-9]*	{
	            int i;

	            if ( (i = reserved(yytext)) != 0 )
		            return ( i );
	            else
		            yylval.string = strdup( yytext );
	            return ( IDENTIFIER );
            }
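
The implementation of reserved() is left to the user. A minimal sketch is shown below; it is an assumption, not MKS code. The token names IF, THEN, and ELSE are taken to come from the yacc-generated header, <string.h> is assumed to be included, and a linear search is used for clarity (a binary search or hash table would scale better):

            /*
             * Map a reserved word to its token value,
             * or return 0 if the string is not reserved.
             */
            static struct {
	            char	*name;
	            int	token;
            } words[] = {
	            { "if",   IF },
	            { "then", THEN },
	            { "else", ELSE },
            };

            int reserved( char *s )
            {
	            int i;

	            for ( i = 0; i < sizeof(words) / sizeof(words[0]); i++ )
		            if ( strcmp(words[i].name, s) == 0 )
			            return ( words[i].token );
	            return ( 0 );
            }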
            

Case Insensitivity

In some applications, the case of the characters in the input language is not significant. That is, "while", "While", and "WHILE" are all equivalent.

It is common to see the following approach taken to matching these reserved words:

            [Ii][Ff]		return ( IF );
            [Tt][Hh][Ee][Nn]	return ( THEN );
            [Ee][Ll][Ss][Ee]	return ( ELSE );
             ...
            

However, this is a very inefficient use of lex. If there are more than a small number of reserved words specified in this manner, it is very likely that lex will run out of table space. In any event, the resulting scanner will be larger and slower than necessary.

There are two simple approaches that might be taken to solve this problem.

Characters may be forced to a single case as they are read by the scanner. The yylex() function uses the macro yygetc() to fetch characters from the input. This macro may be redefined:

            %{
            #undef	yygetc
            #define	yygetc()	tolower( getc(yyin) )
            %}
            

A word of warning: Some C libraries define tolower() as a preprocessor macro. If this is the case, then yygetc() should be redefined as a function, not as a macro:

            %{
            #undef	yygetc
            %}
            %%
             ...
            %%

            int yygetc()
            {
	            int i = getc( yyin );
	            return ( tolower(i) );
            }
            

This technique suffers from the fact that all characters in the input are forced to a single case. This would include literal text strings, probably an unintended side effect.

A better idea would be to use a reserved() function like that described previously to check strings against a table of reserved words. It is a simple matter to make the reserved() function case insensitive. A function such as strnicmp() (called strncasecmp() on POSIX systems, and not part of standard C) may be employed to perform case insensitive string comparisons.
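
As a sketch of this change (building on the hypothetical reserved() shown earlier, not MKS code), a small portable comparison routine can stand in for strnicmp() on libraries that lack it:

            #include <ctype.h>

            /* Case insensitive string equality test. */
            static int ci_equal( const char *a, const char *b )
            {
	            while ( *a && *b )
		            if ( tolower((unsigned char)*a++) !=
		                 tolower((unsigned char)*b++) )
			            return ( 0 );
	            return ( *a == *b );
            }

Replacing the strcmp() call in reserved() with ci_equal() then lets "WHILE", "While", and "while" find the same table entry, while quoted strings pass through the scanner untouched.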

Some General Tips

The alternative operator "|" is expensive, and may be avoided in many cases. When matching single characters, use a character class instead of alternatives. For example:

            [abc]
            

is always better than:

            (a|b|c)
            

Longer strings should be written as separate patterns. A special syntax may be used that causes lex to drop through to a single action:

            "abc"	|
            "def"	|
            "ghi"	{ /* action */ }
            

is less expensive than:

            "abc"|"def"|"ghi"	{ /* action */ }
            

lex is not very efficient at matching long strings. If possible, reduce the total number of literal strings to be searched for in the input. Given the following:

            "line1"		return ( 1 );
            "line2"		return ( 2 );
            "line3"		return ( 3 );
            

It is possible to rewrite this as:

            "line"[1-3]	return ( yytext[4] - '0' );
            

Differences Between MKS Lex & Yacc and AT&T/Berkeley

MKS Lex & Yacc is slightly different from the AT&T and Berkeley versions that you might be used to. This comparison highlights some of the differences. Should you require AT&T lex and yacc, that version is included in MKS Toolkit for Enterprise Developers.

The following comparison can be found in Appendix F of the book "lex & yacc" by John R. Levine, Tony Mason & Doug Brown, part of the UNIX Programming Series published by O'Reilly & Associates, Inc.

Differences
Most of the differences between MKS Lex & Yacc and AT&T Lex & Yacc are due to running on a PC rather than on UNIX.

Conclusion

The MKS Lex program is a powerful tool for creating applications that must perform lexical analysis on their input. Crafting an efficient lexical analyzer requires careful attention to both the lex patterns and their associated C actions. Knowing a few tricks, however, helps the tool fulfill its promise of making software development easier.