MKS Lex and Yacc

I've tried the other Yaccs and MKS Lex & Yacc is by far the best on the market, with the best documentation and the neatest technical support people.
Tom Campbell,
Systems and Software, Inc.

Using MKS Lex & Yacc Effectively


Introduction

Many programming tools promise to make software development easier by taking over part of the job of producing source code. Often, however, a new tool produces results that fall short of expectations, and getting effective results requires techniques that are discovered only through trial and error. The lexical analyzer generator lex is a powerful tool for developing applications, but it has a reputation for being difficult to learn and use well.

This article attempts to present some simple tips for making the best use of lex.
Its goals are to:

  • Illustrate some of the common approaches which result in inefficient scanners, paying particular attention to the pitfalls most frequently reported to MKS technical support.
  • Present alternative methods which achieve more streamlined results.
  • Illustrate the alternatives with concrete examples.
  • Show only techniques which are of practical use in a typical lex application.

This article will be of most interest to users who already have some knowledge of lex.

Remember this advice when writing your lex specifications:

    "Poorly written lex expressions can require arbitrary lookahead; if you keep running out of table entries, you almost certainly need to rethink how you are writing your expressions! Remember to leave the parsing to yacc, and to keep your lex input simple and clean."

Unfortunately, it is not always obvious to new users which methods are expensive* and which are "simple and clean".

* In this article an "expensive" action is one which results in significantly larger state tables in the generated scanner, a somewhat slower scanner, or both.
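
To make that advice concrete, consider recognizing a simple assignment statement. Rather than trying to match the whole statement with one elaborate lex pattern, the scanner returns individual tokens and a yacc rule assembles them. The fragment below is an illustrative sketch only: the token names IDENTIFIER and NUMBER and the assignment rule are hypothetical, and the yylval handling is omitted for brevity.

In the lex rules section:

[a-zA-Z_][a-zA-Z_0-9]*	return ( IDENTIFIER );
[0-9]+			return ( NUMBER );
"="			return ( '=' );

In the yacc grammar:

assignment
	: IDENTIFIER '=' NUMBER
	;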

The sample code in this article should be portable to most C compilers. Although the code is written in C, the ideas discussed apply equally to C++ and to other applications of lex.

Matching Comments

The recognition of comments in input is one area where the combination of a simple lex pattern with a small amount of C can reduce the complexity of a scanner.

Consider the problem of recognizing C style comments. A comment begins with the two-character sequence "/*" and ends with the first subsequent occurrence of the sequence "*/". A lex pattern which recognizes all legal comments is quite complex (and nearly unreadable):

"/*""/"*([^*/]|[^*]"/"|"*"[^/])*"*"*"*/"

It seems unreasonable to use so complex an expression to recognize a construct which will very likely be thrown away as soon as it is found!

Fortunately, MKS Lex & Yacc provides a support function which simplifies the handling of comments. The yycomment() function may be called when the sequence of characters that marks the beginning of a comment has been found. The function takes a string giving the character sequence that marks the end of the comment, and skips over characters in the input until that sequence is found. C style comments may be matched with the following:

"/*"	yycomment( "*/" );

Matching Literal Strings

Scanning a quoted string is a common operation in most lexical analyzers. A pattern which recognizes C style string constants and correctly handles escaped newlines and double quotes would be:

\\"([^"\\n]|\\\\["\\n])*\\"

There are several problems which are associated with this pattern:

  • The maximum length of a string which may be recognized using a lex scanner is defined by the macro YYLMAX. Strings which are longer than YYLMAX will cause the token buffer to overflow. Increasing the value of YYLMAX will reduce the frequency with which token buffer overflow occurs. However, this will not change the fact that lex is imposing an artificial limit on the length of strings in the input.
  • The lex pattern is not able to account for escape sequences in the string. For example, the character sequence \t is not interpreted as a tab character.
  • The pattern is not terribly efficient. Long patterns which use the alternative operator and negated character classes are expensive, and should be avoided if possible.

Again, this is a situation where judicious use of C can streamline the scanner. The idea is to use a lex pattern to recognize the start of a quoted string, and a C action to place the contents of the string into a buffer. A copy of the buffer is returned to the calling function:

\\" {
	static char buffer[ BUFFER_SIZE ];
	char *ptr;
	int c;

	ptr = buffer;
	while ( (c = yymapch('"', '\\\\')) != EOF ) *ptr++ = c;
	*ptr = '\\0';
	yylval.string = strdup( buffer );
	return ( STRING );
}

MKS Lex & Yacc provides the function yymapch(), which returns the next character in the input stream, or EOF when a specified delimiter is encountered. This function also interprets the usual C escape characters.

The static character buffer should be large enough to hold any reasonable string. However, as written, the code still suffers from the limits imposed by a fixed-size buffer. (In fact, there is a bug in the code: no check is made to ensure that the buffer does not overflow. This is omitted strictly in the interests of clarity.) If the buffer is instead allocated with malloc(), it becomes possible to grow it with realloc() when a very long string is encountered, as in the sketch below.
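
The following is one minimal way to write such a dynamically growing version. It is a sketch only: it assumes yymapch() behaves as described above, requires <stdlib.h> to be included in the definitions section, and, like the version above, omits checks for malloc() and realloc() failure in the interests of brevity.

\" {
	size_t size = 64;	/* initial allocation; doubled as needed */
	size_t len = 0;
	char *buf = malloc( size );
	int c;

	while ( (c = yymapch('"', '\\')) != EOF ) {
		if ( len + 1 >= size )		/* keep room for the '\0' */
			buf = realloc( buf, size *= 2 );
		buf[len++] = c;
	}
	buf[len] = '\0';
	yylval.string = buf;	/* the caller owns, and eventually frees, the buffer */
	return ( STRING );
}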

Note that another approach to this problem might be to use exclusive start conditions in lex, similar to the example described in the "Ambiguity and Look Ahead" section of the lex Programmer's Guide.

Recognizing Reserved Words

Most common languages contain a number of reserved words which are always recognized as such, regardless of context. A common approach to matching reserved words and identifiers is the following:

"if"		return ( IF );
"then"		return ( THEN );
"else"		return ( ELSE );
 ...

[a-zA-Z_][a-zA-Z_0-9]*	{
	yylval.string = strdup( yytext );
	return ( IDENTIFIER );
}

However, a lex-generated scanner is not particularly efficient at matching a large number of long literal strings.

A better approach is to use a single pattern to match all identifiers and reserved words in the input. A user-defined C function reserved() is used to check each string against a table of reserved words (a sketch of reserved() follows the rule below):

[a-zA-Z_][a-zA-Z_0-9]*	{
	int i;

	if ( (i = reserved(yytext)) != 0 )
		return ( i );
	else
		yylval.string = strdup( yytext );
	return ( IDENTIFIER );
}
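
The reserved() function itself is not part of the lex library; it is ordinary user-supplied C code, typically placed in the user subroutines section. A minimal sketch might look like the following, assuming <string.h> has been included; the table contents are illustrative, and the token values IF, THEN, and ELSE would normally come from the header generated by yacc:

static struct keyword {
	char	*name;
	int	token;
} keywords[] = {
	{ "else",	ELSE },
	{ "if",		IF },
	{ "then",	THEN },
	/* ... remaining reserved words ... */
};

/*
 * Return the token value for a reserved word, or 0 if the string is
 * not reserved.  A linear search is adequate for a small table; a
 * larger table could be kept sorted and searched with bsearch().
 */
int reserved( char *name )
{
	int i;

	for ( i = 0; i < (int)(sizeof(keywords) / sizeof(keywords[0])); i++ )
		if ( strcmp( name, keywords[i].name ) == 0 )
			return ( keywords[i].token );
	return ( 0 );
}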

Case Insensitivity

In some applications, the case of the characters in the input language is not significant. That is, "while", "While", and "WHILE" are all equivalent.

It is common to see the following approach taken to matching these reserved words:

[Ii][Ff]		return ( IF );
[Tt][Hh][Ee][Nn]	return ( THEN );
[Ee][Ll][Ss][Ee]	return ( ELSE );
 ...

However, this is a very inefficient use of lex. If more than a small number of reserved words are specified in this manner, it is very likely that lex will run out of table space. In any event, the resulting scanner will be larger and slower than necessary.

There are two simple approaches that might be taken to solve this problem.

Characters may be forced to a single case as they are read by the scanner. The yylex() function uses the macro yygetc() to fetch characters from the input. This macro may be redefined:

%{
#undef	yygetc
#define	yygetc()	tolower( getc(yyin) )
%}

A word of warning: some C libraries define tolower() as a preprocessor macro which may evaluate its argument more than once; since the argument here is getc(yyin), that would silently consume extra characters from the input. If this is the case, then yygetc() should be redefined as a function, not as a macro:

%{
#undef	yygetc
%}
%%
 ...
%%

int yygetc()
{
	int i = getc( yyin );
	return ( tolower(i) );
}

This technique suffers from the fact that all characters in the input are forced to a single case, including the contents of literal text strings, which is probably an unintended side effect.

A better idea is to use a reserved() function like the one described previously to check strings against a table of reserved words. It is a simple matter to make reserved() case insensitive by employing a case-insensitive string comparison: strnicmp() is common on Windows compilers, while POSIX systems provide strcasecmp() and strncasecmp(). A sketch follows.
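
Building on the earlier sketch of reserved(), only the comparison call needs to change. The version below uses strcasecmp(), which is declared in <strings.h> on POSIX systems rather than being standard C; on Windows compilers the usual equivalent is _stricmp(), and a fully portable program may prefer to supply its own comparison function.

/*
 * Case-insensitive variant of reserved(): keywords are stored in
 * lower case and compared without regard to case.
 */
int reserved( char *name )
{
	int i;

	for ( i = 0; i < (int)(sizeof(keywords) / sizeof(keywords[0])); i++ )
		if ( strcasecmp( name, keywords[i].name ) == 0 )
			return ( keywords[i].token );
	return ( 0 );
}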

Some General Tips

The alternative operator "|" is expensive, and may be avoided in many cases. When matching single characters, use a character class instead of alternatives. For example:

[abc]

is always better than:

(a|b|c)

Longer strings should be written as separate rules. The special "|" action tells lex to use the action of the following rule, so several patterns can share a single action:

"abc"	|
"def"	|
"ghi"	{ /* action */ }

is less expensive than:

"abc"|"def"|"ghi"	{ /* action */ }

lex is not very efficient at matching long strings. If possible, reduce the total number of literal strings to be searched for in the input. Given the following:

"line1"		return ( 1 );
"line2"		return ( 2 );
"line3"		return ( 3 );

It is possible to rewrite this as:

"line"[1-3]	return ( yytext[4] - '0' );

Differences Between MKS Lex & Yacc and AT&T/Berkeley

MKS Lex and Yacc is slightly different from the AT&T and Berkeley versions that you might be used to. This comparison highlights some of the differences. Should you require AT&T lex and yacc, that version is included in MKS Toolkit for Enterprise Developers.

The following comparison can be found in Appendix F of lex & yacc by John R. Levine, Tony Mason, and Doug Brown, published by O'Reilly & Associates, Inc. in its UNIX Programming Series.

Differences
Most of the differences between MKS Lex & Yacc and AT&T Lex & Yacc are due to running on a PC rather than on UNIX.

  • The output files have different names: lex_yy.c, ytab.c, ytab.h, and y.out, rather than lex.yy.c, y.tab.c, y.tab.h, and y.output.
  • MKS Lex has its own method for handling nested include files.
  • MKS Lex has its own method for resetting a lexer into its initial state.
  • MKS Lex uses the macro yygetc() to read input. You can redefine it to change the input source.
  • The standard lex token buffer is only 100 characters but can be enlarged by redefining some macros.
  • The internal yacc tables are generated differently. This makes error recovery slightly different; in general MKS yacc will perform fewer reductions than will UNIX yacc before noticing an error on a lookahead token.

New Features

  • Microsoft Visual Studio Add-in
  • MKS Lex and Yacc can generate scanners and parsers in C++ as well as in C.
  • MKS Lex and Yacc both let you change the prefix of symbols in the generated code from the default 'yy', so that you can include multiple lexers and parsers in one program.
  • MKS Yacc can allocate its stack dynamically, allowing recursive and reentrant parsers.
  • MKS Yacc has selection preferences which let you resolve reduce/reduce conflicts by specifying lookahead tokens that tell it to use one rule or the other.
  • The MKS Lex library contains routines that skip over C style comments and handle C strings including escapes.
  • MKS Yacc documents many more of its internal variables than do AT&T or Berkeley yacc. This lets error recovery routines access and change much more of the parser's internal state.
  • The package includes sample scanners and parsers for C, C++, Java, dBASE, Fortran, Hypertalk, Modula-2, Pascal, pic (the troff picture language), and SQL.

Conclusion

The MKS Lex program is a powerful tool for creating applications that must perform some sort of lexical analysis on their input. Crafting an efficient lexical analyzer requires careful attention to both the lex patterns and their associated C actions. Knowing a few tricks, however, helps the tool fulfill its promise of making software development easier.

