实现简易的C语言编译器（part 7）

紧接着上一部分抽象语法树的内容。在这一部分，我们将利用这些定义好的节点（砖块）和抽象语法描述（水泥）搭建起完整的抽象语法树。
同词法分析实现的方式一样，我们首先定义语法分析的类：

class Parser(object):
    def __init__(self, lexer):
        self.lexer = lexer
        # set current token to the first token taken from the input
        self.current_token = self.lexer.get_next_token()

    def eat(self, token_type):
        # compare the current token type with the passed token type
        # otherwise raise an exception.
        if self.current_token.type == token_type:
            self.current_token = self.lexer.get_next_token()
        else:
            raise Exception(f"token {token_type} is undesired in function")

通过eat这个函数，可能你已经大概猜到了我们将要实现的语法分析的原理：一个一个的获取token进行比较，要么选择吃掉它，然后获取下一个token；要么抛出错误，终止程序。这种方式也符合我们分析代码的过程，比如括号必须成对出现。通过设置期望的token，从而可以按照语法规则进行token的分析和抽象语法树的搭建。在抛出异常中，我们使用了Python语言的一个功能：F字符串。有兴趣的可以参考Python的官方指南，简单并且强大着。
其实，有了前面归纳的语法描述形式，我们完全可以写出一一对应的函数，使用上一部分定义好的节点来存储这些token，然后按照函数之间的逻辑关系再将这些节点连接起来形成新的节点，最终得到整个的抽象语法树。比如，在前面介绍四则运算表达式的描述的时候，我们已经给出了如何用节点存储这些token。接下来，我们按照声明、语句和表达式三个部分详细阐述抽象语法树构建的原理。

5.1 声明

为了使描述简明扼要，抓住问题核心，我们先暂时省略结构体和枚举类型，以及带有初始值的变量声明的描述，得到如下简化之后的声明的描述：

type_specifier : INT
               | CHAR 

declarator : direct_declarator
           | * declarator 

direct_declarator : ID
                  | direct_declarator ( parameter_type_list ) 
                  | direct_declarator ( )
                  | direct_declarator [ const_expression ] 
                  | direct_declarator [ ]

declarator_list : declarator
                | declarator_list , declarator 

declaration : type_specifier declarator_list ;

在Parser类里面，我们以函数的形式定义上面出现的所有描述：

    def type_specifier(self):
        """type_spec : INT
                     | CHAR
        """
        token = self.current_token
        if token.type in (INT, CHAR):
            self.eat(token.type)
            return BaseType(token.value）
        else:
            # no such type
            raise Exception(f"undefined type.")  

    def declarator(self):
        """declarator : * direct_declarator
                      | direct_declarator
        """
        if self.current_token.type == MUL:
            self.eat(MUL)
            node = self.declarator()
            node.set_type(PointerType())
            return node

        return self.direct_declarator()

    def direct_declarator(self):
        """direct_declarator : ID
                             | direct_declarator [ const_expr ]
                             | direct_declarator [ ]
                             | direct_declarator ( )
                             | direct_declarator ( parameter_type_list )
        """
        node = Declaration(self.variable())
        if self.current_token.type == LPAREN:
            # function type
            self.eat(LPAREN)
            node.add_type(FunctionType(self.parameter_type_list()))
            self.eat(RPAREN)
        elif self.current_token.type == LBRACKET:
            # array type
            self.eat(LBRACKET)
            node.add_type(ArrayType(self.const_expression()))
            self.eat(RBRACKET)

        return node

    def variable(self):
        """variable : ID"""
        name = self.current_token.value
        self.eat(ID)
        return name

    def declarator_list(self):
        """declarator_list : declarator
                           | declarator_list, declarator
        """
        nodes = VariableList()
        while self.current_token.type != SEMI:
            child_node = self.declarator()
            nodes.add(child_node)
            if self.current_token.type == SEMI:
                break

            self.eat(COMMA)

        return nodes

    def declaration(self):
        """declarations : type_specifier declarator_list SEMI
        """
        nodes = DeclarationList()
        type_node = self.type_specifier()
        for decl_node in self.declarator_list().nodes:
            decl_node.set_type(type_node)
            nodes.add(decl_node)

        self.eat(SEMI)
        return nodes

这里用到的两个list类其实就是前面介绍的NodeList的变体，如下：

class DeclarationList(NodeList):
    """A declaration list, derived for identification"""
    pass

class VariableList(NodeList):
    """A variable list, derived for identification"""
    pass

先不去细究parameter_type_list和const_expression的实现方式，单从简化过的声明的描述上来看，我们是完全按照描述照葫芦画瓢来实现对应的函数，这个过程中用到了前面定义的节点，以及如何组织它们。同样地，下面介绍地语句和表达式，其实也是类似的实现形式。

5.2 语句

还是从语句的描述入手：

statement : expression_statement
          | jump_statement
          | iteration_statement
          | selection_statement
          | compound_statement
          | empty_statement

compound_statement : { declaration_list }
                   | { declaration_list statement_list }

selection_statement : IF ( expression ) statement
                    | IF ( expression ) statement ELSE statement

...

对应的实现如下：

    def statement(self):
        """statement : expression_statement
                     | jump_statement
                     | iteration_statement
                     | selection_statement
                     | compound_statement
                     | empty_statement
        """
        if self.current_token.type == BEGIN:
            node = self.compound_statement()
        elif self.current_token.type == IF:
            node = self.selection_statement()
        elif self.current_token.type in (RETURN, BREAK, CONTINUE):
            node = self.jump_statement()
        elif self.current_token.type in (FOR, WHILE):
            node = self.iteration_statement()
        elif self.current_token.type == SEMI:
            node = None
        else:
            node = self.expression_statement()
        return node

    def compound_statement(self):
        """compound_statement : { declaration_list }
                              | { declaration_list statement_list }
        """
        self.eat(BEGIN)

        declaration_nodes = DeclarationList()
        statement_nodes = StatementsList()
        while True:
            while self.current_token.type in (INT, CHAR):
                for child in self.declaration().nodes:
                    declaration_nodes.add(child)

            if self.current_token.type == END:
                break

            statement_nodes.add(self.statement())

        self.eat(END)
        node = CompoundStatement(declaration_nodes, statement_nodes)

        return node

    def selection_statement(self):
        """if_statement : IF ( expression ) statement { ELSE statement }
        """
        self.eat(IF)
        self.eat(LPAREN)
        expr_node = self.expression()
        self.eat(RPAREN)

        then_node = self.statement()

        else_node = None
        if self.current_token.type == ELSE:
            self.eat(ELSE)
            else_node = self.statement()

        return IfStatement(expr_node, then_node, else_node)

    def iteration_statement(self):
        """for_statement : FOR ( expression_statement expression_statement expression ) statement
        """
        ...

    def jump_statement(self):
        """jump_statement : CONTINUE ;
                       | BREAK ;
                       | RETURN expression ;
        """
        ...

    def expression_statement(self):
        """expression_statement: expression ;
        """
        ...

因为函数的定义和实现完全按照描述来做的，没有什么难度。剩下的部分亦可按照这样的方式进行，这里就不一一介绍了。

5.3 表达式

前面介绍表达式的时候，已经给出了四则运算表达式的实现，这里主要分析逻辑和赋值运算。首先，重温一下它们的表述方式：

relation_expression : additive_expression
                    | relation_expression < additive_expression
                    | relation_expression <= additive_expression
                    | relation_expression > additive_expression
                    | relation_expression >= additive_expression

equality_expression : relation_expression
                    | equality_expression == relation_expression
                    | equality_expression != relation_expression

expression : equality_expression
           | equality_expression = expression

它们的实现亦可按照这样的表示方法进行：

    def relation_expression(self):
        """relation_expression : additive_expression
                               | relation_expression (< | >| <=| >=) additive_expression
        """
        node = self.additive_expression()

        while self.current_token.type in (RELT, REGT, RELEQ, REGEQ):
            token = self.current_token
            self.eat(token.type)

            node = BinOp(left=node, op=token.value, right=self.additive_expression())

        return node

    def equality_expression(self):
        """equality_expression : relation_expression
                               | equality_expression (== | !=) relation_expression
        """
        node = self.relation_expression()

        while self.current_token.type in (REEQ, RENEQ):
            token = self.current_token
            self.eat(token.type)

            node = BinOp(left=node, op=token.value, right=self.relation_expression())

        return node

    def assignment_expression(self):
        """assignment_expression : equality_expression
                                 | unary_expression = assignment_expression
        """
        node = self. equality_expression()
        token = self.current_token
        if token.type == ASSIGN:
            self.eat(ASSIGN)
            right_node = self.assignment_expression()
            node = BinOp(left=node, op=token.value, right=right_node)

        return node

这里有一个细节，不知道大家有没有注意到逻辑和赋值表达式实现的区别：我们都用了BinOp去存储二者之间的逻辑或者赋值关系，但是在赋值表达式中，我们先递归后存储；而逻辑表达式则先存储再递归。这样的区别就是按照前面我们介绍的左结合和右结合的概念实现上的区别，以便符合C语言语法的要求。

5.4 函数定义

再来看一看函数定义的抽象描述：

function_definition : type_specifier declarator compound_statement

对应地，实现如下：

def function_definition(self):
    """function_definition : type_specifier declarator compound_statement
    """
    type_node = self.type_specifier()
    decl_node.set_type(type_node)
    node = FunctionDefn(decl_node, self.compound_statement())

    return node

5.5 实例

有了这些描述对应的实现，我们定义下面的类：

class TranslationUnit(NodeList):
    """An unit list, derived for identification"""
    pass

作为一个根节点使用。为了对上面介绍的实现细节有一个直观的感受，对于下面的C代码：

int a, b;
int main()
{
    a = 1;
    if (a > 0)
        b = 2;
    return 0;
}

用Graphviz库简单绘制创建的节点，可以得到下面的抽象语法树图示：

图5-1 抽象语法树

至此，我们已经一步一步地分析和实现了语法分析的过程。这里面，还有很多上一部分定义的抽象描述没有具体实现。但是，相信经过这部分的分析，大家可以试着补充完整，从而更好地理解语法分析这部分的内容。下一部分，将开启新的篇章，敬请期待。

实现简易的C语言编译器（part 0）
实现简易的C语言编译器（part 1）
实现简易的C语言编译器（part 2）
实现简易的C语言编译器（part 3）
实现简易的C语言编译器（part 4）
实现简易的C语言编译器（part 5）
实现简易的C语言编译器（part 6）
实现简易的C语言编译器（part 8）
实现简易的C语言编译器（part 9）
实现简易的C语言编译器（part 10）
实现简易的C语言编译器（part 11）

最后编辑于：2019.10.09 13:51:41

人面猴
序言：七十年代末，一起剥皮案震惊了整个滨河市，随后出现的几起案子，更是在滨河造成了极大的恐慌，老刑警刘岩，带你破解...
沈念sama阅读 206,839评论 6赞 482
死咒
序言：滨河连续发生了三起死亡事件，死亡现场离奇诡异，居然都是意外死亡，警方通过查阅死者的电脑和手机，发现死者居然都...
沈念sama阅读 88,543评论 2赞 382
救了他两次的神仙让他今天三更去死
文/潘晓璐我一进店门，熙熙楼的掌柜王于贵愁眉苦脸地迎上来，“玉大人，你说我怎么就摊上这事。” “怎么了？”我有些...
开封第一讲书人阅读 153,116评论 0赞 344
道士缉凶录：失踪的卖姜人
文/不坏的土叔我叫张陵，是天一观的道长。经常有香客问我，道长，这世上最难降的妖魔是什么？我笑而不...
开封第一讲书人阅读 55,371评论 1赞 279
港岛之恋（遗憾婚礼）
正文为了忘掉前任，我火速办了婚礼，结果婚礼上，老公的妹妹穿的比我还像新娘。我一直安慰自己，他们只是感情好，可当我...
茶点故事阅读 64,384评论 5赞 374
恶毒庶女顶嫁案：这布局不是一般人想出来的
文/花漫我一把揭开白布。她就那样静静地躺着，像睡着了一般。火红的嫁衣衬着肌肤如雪。梳的纹丝不乱的头发上，一...
开封第一讲书人阅读 49,111评论 1赞 285
城市分裂传说
那天，我揣着相机与录音，去河边找鬼。笑死，一个胖子当着我的面吹牛，可吹牛的内容都是我干的。我是一名探鬼主播，决...
沈念sama阅读 38,416评论 3赞 400
双鸳鸯连环套：你想象不到人心有多黑
文/苍兰香墨我猛地睁开眼，长吁一口气：“原来是场噩梦啊……” “哼！你这毒妇竟也来了？” 一声冷哼从身侧响起，我...
开封第一讲书人阅读 37,053评论 0赞 259
万荣杀人案实录
序言：老挝万荣一对情侣失踪，失踪者是张志新（化名）和其女友刘颖，没想到半个月后，有当地人在树林里发现了一具尸体，经...
沈念sama阅读 43,558评论 1赞 300
护林员之死
正文独居荒郊野岭守林人离奇死亡，尸身上长有42处带血的脓包…… 初始之章·张勋以下内容为张勋视角年9月15日...
茶点故事阅读 36,007评论 2赞 325
白月光启示录
正文我和宋清朗相恋三年，在试婚纱的时候发现自己被绿了。大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
茶点故事阅读 38,117评论 1赞 334
活死人
序言：一个原本活蹦乱跳的男人离奇死亡，死状恐怖，灵堂内的尸体忽然破棺而出，到底是诈尸还是另有隐情，我是刑警宁泽，带...
沈念sama阅读 33,756评论 4赞 324
日本核电站爆炸内幕
正文年R本政府宣布，位于F岛的核电站，受9级特大地震影响，放射性物质发生泄漏。R本人自食恶果不足惜，却给世界环境...
茶点故事阅读 39,324评论 3赞 307
男人毒药：我在死后第九天来索命
文/蒙蒙一、第九天我趴在偏房一处隐蔽的房顶上张望。院中可真热闹，春花似锦、人声如沸。这庄子的主人今日做“春日...
开封第一讲书人阅读 30,315评论 0赞 19
一桩弑父案，背后竟有这般阴谋
文/苍兰香墨我抬头看了看天上的太阳。三九已至，却和暖如春，着一层夹袄步出监牢的瞬间，已是汗流浃背。一阵脚步声响...
开封第一讲书人阅读 31,539评论 1赞 262
情欲美人皮
我被黑心中介骗来泰国打工，没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留，地道东北人。一个月前我还...
沈念sama阅读 45,578评论 2赞 355
代替公主和亲
正文我出身青楼，却偏偏与公主长得像，于是被迫代替她去往敌国和亲。传闻我的和亲对象是个残疾皇子，可洞房花烛夜当晚...
茶点故事阅读 42,877评论 2赞 345