Compiler | 1 Introduction & 2 A Simple Syntax-Directed Translator

This note is partially copied from Dragon Book - Compilers Principals Techniques and Tools (2nd Edition) Chapter 1 Introduction and Chapter 2 A Simple Syntax-Directed Translator.

Chapter 1 Introduction

1.1 Language Processors

A language-processing system

1.2 The Structure of a Compiler

Phases of a compiler

1.6.3 Static Scope and Block Structure

A declaration D "belongs" to a block B if B is the most closely nested block containing D; that is, D is located within B, but not within any block that is nested within B

The static-scope rule for variable declarations in a block-structured languages is a follows. If declaration D of name x belongs to block B, then the scope of D is all of B, except for any blocks B' nested to any depth within B, in which x is redeclared. Here, x is redeclared in B' if some other declaration D' of the same name x belongs to B'.

1.6.5 Dynamic Scope

The term dynamic scope, refers to the following policy: a use of a name x refers to the declaration of x in the most recently called procedure with such a declaration.

Declarations tell us about the types of things, while definitions tell us about their values. Thus, int i is a declaration of i, while i = 1 is a definition of i. In C++, a method is declared in a class definition, by giving the types of arguments and result of the method (often called signature for the method). The method is then defined in another place.

Chapter 2 A Simple Syntax-Directed Translator

Translate Java code into three-address code

2.1 Introduction

The syntax of a programming language describes the proper form of its programs, while the semantics of the language defines what its programs mean, what each program does when it executes.

2.2 Syntax Definition

Using the variable expr to denote an expression and the variable stmt to denote a statement, the structuring rule of an if-else statement can be expressed as

stmt -> if ( expr ) stmt else stmt

in which the arrow may be read as "can have the form". Such a rule is called a prduction. In a production, lexical elements like the keyword if and the parentheses are called terminals. Variables like expr and stmt represent sequences of terminals are called nonterminals.

2.2.1 Definition of Grammars

A context-free grammar has four components:

A set of terminal symbols, sometimes referred to as "tokens". The terminals are the elementary symbols of the language defined by the grammar.
A set of nonterminals, sometimes called "syntactic variables". Each nonterminal represents a set of strings or terminals, in a manner we shall describe.
A set of productions, where each production consists of a nonterminal, called the head or left side of the production, an arrow, and a sequence of terminals and/or nonterminals, called the body or right side of the production. The intuitive intent of a production is to specify one of the written forms of a construct; if the head nonterminal represents a construct, then the body represents a written form of the construct.
A designation (指定) of one of the nonterminals as the start symbol.

We specify grammars by listing their productions, with the productions for the start symbol listed first. We assume that digits, signs such as < and <=, and boldface strings such as while are terminals. An italicized (斜体) name is a nonterminal, and any nonitalicized name or symbol may be assumed to be a terminal. For notational convenience, productions with the same nonterminal as the head can have their bodies grouped, with the alternative bodies separated by the symbol |. which we read as "or".

e.g. lists of digits separated by plus or minus

list -> list + digit | list - digit | digit
digit -> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

terminals: + - 0 1 2 3 4 5 6 7 8 9
nonterminals: list, digit
start symbol: list

We say a production is for a nonterminal if the nonterminal is the head of the production. A string of terminals is a sequence of zero or more terminals. The string of zero terminals, written as ε , is called the empty string.

2.2.2 Derivations

A grammar derives strings by beginning with the start symbol and repeatedly replacing a nonterminal by the body of a production for that nonterminal. The terminal strings that can be derived from the start symbol form the language defined by the grammar.

2.2.3 Parse Trees

Formally, given a context-free grammar, a parse tree according to the grammar is a tree with the following properties:

The root is labeled by the start symbol.
Each leaf is labeled by a terminal or by ε.
Each interior node is labeled by a nonterminal.
If A is the nonterminal labeling some interior node and X₁, X₂, ..., X_n are the labels of the children of that node from left to right, then there must be a production A -> X₁X₂···X_n. Here, X₁, X₂, ..., X_n each stand for a symbol that is either a terminal or a nonterminal. As a special case, if A -> ε is a production, then a node labeled A may have a single child labeled ε.

The process of finding a parse tree for a given string of terminals is called parsing that string.

2.2.5 Associativity of Operators

left-associative: a + b + c is equivalent to (a + b) + c. The operand b is taken by the left plus sign, so the plus operator is left-associative.
right-associative: a = b = c is equivalent to a = (b = c). The operand b is taken by the right assign operator, so assignment is right-associative.

2.2.6 Precedence of Operators

The associativity rules apply to occurrences of the same operator, so they do not resolve the ambiguity of more than one kind of operator. Rules defining the relative precedence of operators are needed when more than one kind of operator is present.

We say * has higher precedence than + if * takes its operands before + does.

2.3 Syntax-Directed Translation

Syntax-directed translation is done by attaching rules or program fragments to productions in a grammar. 语法制导翻译时通过向一个文法 (grammar) 的产生式 (production) 附加 (attach) 一些规则 (rule) 或程序片段 (program fragments) 而得到的。

Attributes. An attribute is any quantity associated with a programming construct. Examples of attributes are data types of expressions, the number of instructions in the generated code, or the location of the first instruction in the generated code for a construct, among many other posibilities. Since we use grammar symbols (nonterminals and terminals) to represent programming constructs, we extend the notion of attrubutes from constructs to the symbols that represent them.
(Syntax-directed) translation schemes. A translation scheme is a notation for attaching program fragments to the productions of a grammar. The program fragments are executed when the production is used during syntax analysis. The combined result of all these fragment executions, in the order induced by the syntax analysis, produces the translation of the program to which this analysis/ synthesis process is applied.

2.3.2 Synthesized Attributes

A syntax-directed definition associates

With each grammar symbol, a set of attributes and
With each production, a set of semantic rules for computing the values of the attributes associated with the symbols appearing in the production.

An attribute is said to be synthesized if its value at a parse-tree node N is determined from attribute values at the children of N and at N itself. Synthesized attributes have the desirable property that they can be evaluated during a single bottom-up traversal of a parse tree.

2.3.5 Translation Schemes

A syntax-directed translation scheme is a notation for specifying a translation by attaching program fragments to productions in a grammar. A translation scheme is like a syntax-directed definition, except that the order of evaluation of the semantic rule is explicitly specified.

Program fragments embedded within production bodies are called semantic actions. The position at which an action is to be executed is shown by enclosing it between curly braces and writing it within the production body.

When drawing a parse tree for a translation scheme, we indicate an action by constructing an extra child for it, connected by a dashed line to the node that corresponds to the head of the production. The node for a semantic action has no children.

E.g. Semantic actions for translating infix expressions to postfix expressions.

The implementation of a translation scheme must ensure that semantic actions are performed in the order they would appear during a postorder traversal of a parse tree.

2.4 Parsing

Parsing is the process of determining how a string of terminals can be generated by a grammar.

2.4.1 Top-Down Parsing

The top-down construction of a parse tree is done by starting with the root, labeled with the starting nonterminal stmt, and repeatedly performing the following two steps.

At node N, labeled with nonterminal A, select one of the productions for A and construct children at N for the symbols in the production body.
Find the next node at which a subtree is to be constructed, typically the leftmost unexpanded noterminal of the tree.

For some grammars, the above steps can be implemented during a single left-to-right scan of the input string. The current terminal being scanned in the input is frequently referred to as the lookahead symbol.

2.4.2 Predictive Parsing

Predictive parsing relies on information about the first symbols that can be generated by a production body. More precisely, let α be a string of grammar symbols (terminals and / or nonterminals). We define FIRST (α) to be the set of terminals that appear as the first symbols of one or more strings of terminals generated from α. If α is ε or can generate ε, then ε is also in FIRST (α).

The FIRST sets must be considered if there are two productions A -> α and A -> β. Ignoring ε-productions for the moment, predictive parsing requires FIRST (α) and FIRST (β) to be disjoint (不相交的). The lookahead symbol can then be used to decide which production to use; if the lookahead symbol is in FIRST (α), then α is used. Otherwise, if the lookahead symbol is in FIRST (β), then β is used.

2.4.3 Designing a Predictive Parser

Recall that a predictive parser is a program consisting of a procedure for every nonterminal. The procedure for nonterminal A does two things.

It decides which A-production to use by examining the lookahead symbol. The production with body α (where α is not ε, the empty string) is used if the lookahead symbol is in FIRST (α). If there is a conflict between two nonempty bodies for any lookahead symbol, then we cannot use this parsing method on this grammar. In addition, the ε-production for A, if it exists, is used if the lookahead symbol is not in the FIRST set for any other production body for A.
The procedure then mimics the body of the chosen production. That is ,the symbols of the body are "executed" in turn, from the left. A nonterminal is "executed" by a call to the procedure for that nonterminal, and a terminal matching the lookahead symbol is "executed" by reading the next input symbol. If at some point the terminal in the body does not match the lookahead symbol, a syntax error is reported.

2.4.5 Left Recursion

It is possible for a recursive-descent parser to loop forever. A problem arises with "left-recursive" productions like

expr -> expr + term

where the leftmost symbol of the body is the same as the nonterminal at the head of the production.

A left-recursive production can be eliminated by rewriting the offending production. Consider a nonterminal A with two productions

A -> Aα | β

where α and β are sequences of terminals and nonterminals that do not start with A. For example, in

expr -> expr + term | term

nonterminal A = expr, string α = + term, and string β = term.

The nonterminal A and its production are said to be left recursive, because the production A -> Aα has A itself as the leftmost symbol on the right side. Repeated application of this production builds up a sequence of α's to the right of A. When A is finally replaced by β, we have a β followed by a sequence of zero or more α's.

The left-recursion-elimination technique | The same effect can be achieved by rewriting the productions for A in the following manner, using a new nonterminal R:

A -> βR
R -> αR | ε

Nonterminal R and its production R -> αR are right recursive because this production for R has R itself as the last symbol on the right side.

2.5.1 Abstract and Concrete Syntax

A useful starting point for designing a translator is a data structure called an abstract syntax tree. In an abstract syntax tree for an expression, each interior node represents an operator; the children of the node represent the operands of the operator. More generally, any programming construct can be handled by making up an operator for the construct and treating as operands the semantically meaningful components of that construct.

Abstract syntax trees, or simply syntax tees, resemble parse trees to an extent. However, in the syntax tree, interior nodes represent programming constructs while in the parse tree, the interior node represent nonterminals. Many nonterminals of a grammar represent programming constructs, but others are "helpers" of one sort or another, such as those representing terms, factors, or other variations of expressions. Om the syntax tree, these helpers typically are not needed and are hence dropped. To emphasize the contrast, a parse tree is sometimes called a concrete syntax tree and the underlying grammar is called a concrete syntax for the language.

2.6 Lexical Analysis

A lexical analyzer reads characters from the input and groups them into "token objects".

A sequence of input characters that comprises (组成) a single token is called a lexeme. Thus, we can say that the lexical analyzer insulates (隔离) a parser from the lexeme representation of tokens.

2.6.1 Removal of White Space and Comments

2.6.2 Reading Ahead

2.6.3 Constants

<num, 31>

2.6.4 Recognizing Keywords and Identifiers

Grammars routinely treat identifiers as terminals to simplify the parser, which can then expect the same terminal, say id, each time any identifier appears in the input.

<id, "count">

2.6.5 A Lexical Analyzer

2.7 Symbol Tables

From Section 1.6.1, the scope of a declaration is the portion of a program to which the declaration applies. We shall implement scopes by setting up a separate symbol table for each scope. A program block with declarations will have its own symbol table with an entry for each declaration in the block.

2.7.1 Symbol Table Per Scope

The most-closely nested rule for blocks is that an identifier x is in the scope of the most-closely nested declaration of x; that is, the declaration of x found by examining blocks inside-out, starting with the block in which x appears.

In Java Implementation of chained symbol table in Fig 2.37 defines a class Env, short for environment.

Class Env supports three operations:

Create a new symbol table. The constructor Env(p) on lines 6 through 8 of Fig 2.37 creates an Env object with a hash table named table. The object is chained to the environment-valued parameter p by setting field prev to p. Although it is the Env objects that form a chain, it is convenient to talk of the tablesbeing chained.
Put a new entry in the current table.
Get an entry for an identifier by searching the chain of tables, starting with the table for the current block.

Chaining of symbol tables results in a tree structure, since more than on block can be nested inside an enclosing block.

2.7.2 The Use of Symbol Tables

In effect, the role of a symbol table is to pass information from declarations to uses. A semantic action "puts" information about identifier x into the symbol, when the declaration of x is analyzed. Subsequently, a semantic action associated with a production such as factor -> id "gets" information about the identifier from the symbol table.

2.8.1 Two Kinds of Intermediate Representations

The two most important intermediate representations are:

Trees, including parse trees and (abstract) syntax trees (introduced in Section 2.5.1).
Linear representations, especially "three-address code".

In addition to creating an intermediate representation, a compiler front end checks that the source program follows the syntactic and semantic rules of the source language. This checking is called static checking; in general "static" means "done by the compiler". Static checking assures that certain kinds of programming errors, including type mismatches, are detected and reported during compilation.

It is common for compilers to emit the three-address code while the parser "goes through the motions" of constructing a syntax tree, without actually constructing the complete tree data structure. Rather, the compiler stores nodes and their attributes needed for semantic checking or other purposes, along with the data structure used for parsing.

2.8.2 Construction of Syntax Trees

We shall first give a translation scheme that constructs syntax trees, and later show how the scheme can be modified to emit three-address code, along with, or instead of, the syntax tree.

All the nonterminals in the translation scheme have an attribute n, which is a node of the syntax tree. Nodes are implemented as objects of class Node.

Class Node has two immediate subclasses: Expr for all kinds of expressions, and Stmt for all kinds of statements. Each type of statement has a corresponding subclass of Stmt; for example, operator while corresponds to subclass While. A syntax-tree node for operator while with children x and y is created by the pseudocode new While (x, y) which creates an object of class While by calling constructor function While.

2.8.3 Static Checking

Static checks are consistency checks that are done during compilation. Static checks includes:

Syntactic Checking.
Type Checking.

The terms l-value and r-value refer to values that are appropriate on the left and right sides of an assignment, respectively.That is , r-values are what we usually think of as "values", while l-values are locations.

The idea of matching actual with expected types continues to apply (type checking), even in the following situations:

Coercions ( 强制类型转换 ). A coercion occurs if the type of an operand is automatically converted to the type expected by the operator.
Overloading ( 运算符重载 ). A symbol is said to be overloaded if it has different meanings depending on its context.

2.8.4 Three-Address Code

Three-address code is a sequence of instructions of the form

x = y op z

where x, y, and z are names, constants, or compiler-generated temporaries; and op stands for an operator.

Arrays will be handled by using the following two variants of instructions:

x [ y ] = z
x = y [ z ]

The following instructions are for control flow:

ifFalse x goto L. If x is false, next execute the instruction labeled L
ifTrue x goto L. If x is true, next execute the instruction labeled L
goto L. next execute the instruction labeled L

A label L can be attached to any instruction by prepending a prefix L:. Any instruction can have more than one label.

Translation of Statements

Statements are translated into three-addrress code by using jump instructions to implement the flow of control through the statement.

Translation of Expressions

2.9 Summary of Chapter 2

The syntax-directed techniques in this chapter can be used to construct compiler front ends, such as those illustrated in Fig 2.46.

最后编辑于：2018.08.20 16:15:48

人面猴
序言：七十年代末，一起剥皮案震惊了整个滨河市，随后出现的几起案子，更是在滨河造成了极大的恐慌，老刑警刘岩，带你破解...
沈念sama阅读 202,723评论 5赞 476
死咒
序言：滨河连续发生了三起死亡事件，死亡现场离奇诡异，居然都是意外死亡，警方通过查阅死者的电脑和手机，发现死者居然都...
沈念sama阅读 85,080评论 2赞 379
救了他两次的神仙让他今天三更去死
文/潘晓璐我一进店门，熙熙楼的掌柜王于贵愁眉苦脸地迎上来，“玉大人，你说我怎么就摊上这事。” “怎么了？”我有些...
开封第一讲书人阅读 149,604评论 0赞 335
道士缉凶录：失踪的卖姜人
文/不坏的土叔我叫张陵，是天一观的道长。经常有香客问我，道长，这世上最难降的妖魔是什么？我笑而不...
开封第一讲书人阅读 54,440评论 1赞 273
港岛之恋（遗憾婚礼）
正文为了忘掉前任，我火速办了婚礼，结果婚礼上，老公的妹妹穿的比我还像新娘。我一直安慰自己，他们只是感情好，可当我...
茶点故事阅读 63,431评论 5赞 364
恶毒庶女顶嫁案：这布局不是一般人想出来的
文/花漫我一把揭开白布。她就那样静静地躺着，像睡着了一般。火红的嫁衣衬着肌肤如雪。梳的纹丝不乱的头发上，一...
开封第一讲书人阅读 48,499评论 1赞 281
城市分裂传说
那天，我揣着相机与录音，去河边找鬼。笑死，一个胖子当着我的面吹牛，可吹牛的内容都是我干的。我是一名探鬼主播，决...
沈念sama阅读 37,893评论 3赞 395
双鸳鸯连环套：你想象不到人心有多黑
文/苍兰香墨我猛地睁开眼，长吁一口气：“原来是场噩梦啊……” “哼！你这毒妇竟也来了？” 一声冷哼从身侧响起，我...
开封第一讲书人阅读 36,541评论 0赞 256
万荣杀人案实录
序言：老挝万荣一对情侣失踪，失踪者是张志新（化名）和其女友刘颖，没想到半个月后，有当地人在树林里发现了一具尸体，经...
沈念sama阅读 40,751评论 1赞 296
护林员之死
正文独居荒郊野岭守林人离奇死亡，尸身上长有42处带血的脓包…… 初始之章·张勋以下内容为张勋视角年9月15日...
茶点故事阅读 35,547评论 2赞 319
白月光启示录
正文我和宋清朗相恋三年，在试婚纱的时候发现自己被绿了。大学时的朋友给我发了我未婚夫和他白月光在一起吃饭的照片。...
茶点故事阅读 37,619评论 1赞 329
活死人
序言：一个原本活蹦乱跳的男人离奇死亡，死状恐怖，灵堂内的尸体忽然破棺而出，到底是诈尸还是另有隐情，我是刑警宁泽，带...
沈念sama阅读 33,320评论 4赞 318
日本核电站爆炸内幕
正文年R本政府宣布，位于F岛的核电站，受9级特大地震影响，放射性物质发生泄漏。R本人自食恶果不足惜，却给世界环境...
茶点故事阅读 38,890评论 3赞 307
男人毒药：我在死后第九天来索命
文/蒙蒙一、第九天我趴在偏房一处隐蔽的房顶上张望。院中可真热闹，春花似锦、人声如沸。这庄子的主人今日做“春日...
开封第一讲书人阅读 29,896评论 0赞 19
一桩弑父案，背后竟有这般阴谋
文/苍兰香墨我抬头看了看天上的太阳。三九已至，却和暖如春，着一层夹袄步出监牢的瞬间，已是汗流浃背。一阵脚步声响...
开封第一讲书人阅读 31,137评论 1赞 259
情欲美人皮
我被黑心中介骗来泰国打工，没想到刚下飞机就差点儿被人妖公主榨干…… 1. 我叫王不留，地道东北人。一个月前我还...
沈念sama阅读 42,796评论 2赞 349
代替公主和亲
正文我出身青楼，却偏偏与公主长得像，于是被迫代替她去往敌国和亲。传闻我的和亲对象是个残疾皇子，可洞房花烛夜当晚...
茶点故事阅读 42,335评论 2赞 342