词法分析

概念

Lexical Analysis

将输⼊字符流分成多个词素，识别每个词素，根据模式⽣成相应的词法单元

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11


E = M * C ** 2 

to token stream:

< id , pointer to symbol-table entry for E >
< assign op >
< id , pointer to symbol-table entry for M >
< mult op >
< id , pointer to symbol-table entry for C >
< exp op >
< number , integer value 2 >

Lookahead

在从左向右的分析过程中，有时候需要向前看(Lookahead)

空格是没有含义的
关键字不是保留字（reserved word）

Specification of Tokens

字母表
符号串
- 前缀(prefix),后缀(suffix),⼦串(substring),⼦序列(subsequence),逆转(reverse),长度(length)
- 连接（concatenation）,指数（exponentiation）,Kleene闭包(closure):L *
语⾔:某个给定字母表上⼀个任意的可数的符号串集合

正则表达式,用来描述模式

1
2


//C语⾔标识符,字符或下划线开头，仅包含字符，下划线，数字
( A|B|...|Z|a|b|...|z|_) (( A|B|...|Z|a|b|...|z|_) | (0|1|...|9))*

正则定义（Regular Definition）：对正则表达式命名

1
2
3


letter_ -> A | B | · · · | Z | a| b| c · · · | z | _
digit -> 0 | 1 | ... | 9
id -> letter_ ( letter_ | digit ) *

正则表达式的扩展
- https://www.runoob.com/regexp/regexp-syntax.html

Recognition of Tokens

1
2
3
4
5
6
7


\tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;

构造⼀个正则表达式R来匹配所有的词法单元 R = R1 | R2 | … | Rk 

1.排在前面的正则表达式优先匹配
2.取最长匹配的模式
3.构造⼀个ERROR正则表达式在最后

识别⽆符号数字的状态转换图

Lex

https://pandolia.net/tinyc/ch8_flex.html

根据给定的正则表达式，转换(正则->NFA -> DFA -> DFA 最小化)，并自动⽣成相应的词法分析程序

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31


//声明（Declarations）
%{
#define T_WORD 1
int numChars = 0, numWords = 0, numLines = 0;
%}

//定义（Definitions）
WORD		([^ \t\n\r\a]+)
    
//规则（Rules）
%%
\n			{ numLines++; numChars++; }
{WORD}		{ numWords++; numChars += yyleng; return T_WORD; }
<<EOF>>		{ return 0; }
.			{ numChars++; }
%%
    
//用户定义过程（User subroutines）
int main() {
	int token_type;
	while (token_type = yylex()) {
		printf("WORD:\t%s\n", yytext);
	}
	printf("\nChars\tWords\tLines\n");
	printf("%d\t%d\t%d\n", numChars, numWords, numLines);
	return 0;
}

int yywrap() {
	return 1;
}