Lexical Analysis

CSC488 Lec1

Posted by Dino on January 18, 2021

Lexical Analysis

Lexical Analysis: Chop the program into tokens

  • Input: A character stream
  • Output: A token stream

Lexical analysis transforms its input (a stream of characters) from one or more source files into a stream of language-specific lexical tokens.
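For example, the character stream x1 = y + 42 might be transformed into the token stream IDENTIFIER(x1), EQUALS, IDENTIFIER(y), PLUS, INTEGER(42) (the token names here are illustrative).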

Lexical Analysis Tasks:

  • Delete irrelevant information (whitespace, comments)
  • Determine lexical token boundaries in the source stream
  • Transmit lexical tokens to the next pass
  • Deal with ill-formed lexical tokens; recover from lexical errors
  • Transmit source coordinates (file, line number) to the next pass
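As a rough illustration of these tasks, here is a minimal regex-based tokenizer in Python. The token names and the tiny token grammar are assumptions made for this sketch, not the course's token set; it deletes whitespace and line comments, finds token boundaries, attaches line numbers as source coordinates, and recovers from ill-formed characters by reporting and skipping them.

```python
import re

TOKEN_SPEC = [
    ("SKIP",    r"[ \t]+|//[^\n]*"),   # whitespace and line comments: deleted
    ("NEWLINE", r"\n"),                # tracked only for source coordinates
    ("INTEGER", r"0|[1-9][0-9]*"),
    ("IDENT",   r"[a-zA-Z][a-zA-Z0-9_]*"),
    ("SPECIAL", r"[+\-*/=]"),
    ("ERROR",   r"."),                 # ill-formed character: report and skip
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(source):
    line = 1
    for m in MASTER.finditer(source):
        kind = m.lastgroup
        if kind == "NEWLINE":
            line += 1
        elif kind == "ERROR":
            print(f"line {line}: skipping ill-formed character {m.group()!r}")
        elif kind != "SKIP":
            yield (kind, m.group(), line)   # token plus its source coordinate

print(list(tokenize("x = y1 + 42  // sum")))
# [('IDENT', 'x', 1), ('SPECIAL', '=', 1), ('IDENT', 'y1', 1),
#  ('SPECIAL', '+', 1), ('INTEGER', '42', 1)]
```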

Programming language objects a lexical analyzer must deal with:

  • Identifiers
  • Reserved words or Keywords
  • Literal constants: integer, real, string, etc.
  • Special characters: "+", "-", "*", "/", etc.
  • Comments

A programming language should be designed so that the lexical analyzer can determine lexical token boundaries easily.

In most languages, lexical tokens are designed so that they can be separated by a single left-to-right scan of the input stream, without backup.
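For example, in a typical language, after reading "<" the scanner needs only the next input character to decide between the tokens "<" and "<=": one character of lookahead, and no backup.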

Building a Lexical Analyzer

Describe the lexical structure of the language using regular expressions.

Modern scanner and parser generators like Flex, JFlex, and ANTLR will:

  • Generate an NFA that recognizes the strings (regular sets) specified by the regular expressions
  • Convert the NFA into a DFA (see the sketch after this list)
  • Implement the DFA using a table-driven algorithm
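As a sketch of the NFA-to-DFA step (the classic subset construction), here is a small Python example. The hand-written, epsilon-free NFA for (a|b)*ab and its dict encoding are assumptions made for illustration:

```python
from collections import deque

# NFA for (a|b)*ab, encoded as transitions[state][symbol] -> set of next states
nfa = {0: {'a': {0, 1}, 'b': {0}},
       1: {'b': {2}},
       2: {}}
nfa_start, nfa_accept, alphabet = 0, {2}, {'a', 'b'}

def subset_construction(nfa, start, accept, alphabet):
    """Each DFA state is a frozenset of NFA states reachable on some input."""
    start_set = frozenset({start})
    dfa, work = {}, deque([start_set])
    while work:
        current = work.popleft()
        if current in dfa:
            continue
        dfa[current] = {}
        for sym in alphabet:
            nxt = frozenset(t for s in current
                              for t in nfa.get(s, {}).get(sym, set()))
            if nxt:
                dfa[current][sym] = nxt
                work.append(nxt)
    accepting = {state for state in dfa if state & accept}
    return dfa, start_set, accepting

dfa, dfa_start, dfa_accept = subset_construction(nfa, nfa_start, nfa_accept, alphabet)

def accepts(text):
    state = dfa_start
    for ch in text:
        if ch not in dfa[state]:
            return False
        state = dfa[state][ch]
    return state in dfa_accept

print(accepts("abab"), accepts("abba"))   # True False
```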

Regular Expressions

Regular expression notation is a convenient way to precisely describe the lexical tokens in a language.

The sets of strings defined by regular expressions are regular sets.

Each regular expression defines a regular set.
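For example, the regular expression a(b|c)* defines the regular set {a, ab, ac, abb, abc, acb, acc, ...}.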

Notations:

  • "+": 1 or more repetitions
  • "*": 0 or more repetitions
  • "|": alternation (either of two alternatives)
  • "[]": denotes a set of characters
  • "()": grouping

Characters may be quoted in single quotes to avoid ambiguity, e.g., '+' denotes the literal plus character rather than one-or-more repetition.

Examples:

  • Identifier: [a-zA-Z][a-zA-Z0-9_]*
  • Digits: 0|[1-9][0-9]*
  • WhiteSpaceBlock: [ \t\r\n]+
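These patterns can be sanity-checked with Python's re module (fullmatch succeeds only if the whole string matches the pattern):

```python
import re

IDENT  = re.compile(r"[a-zA-Z][a-zA-Z0-9_]*")
DIGITS = re.compile(r"0|[1-9][0-9]*")
WSPACE = re.compile(r"[ \t\r\n]+")

assert IDENT.fullmatch("a123")
assert not IDENT.fullmatch("1abc")     # identifiers cannot start with a digit
assert DIGITS.fullmatch("0") and DIGITS.fullmatch("42")
assert not DIGITS.fullmatch("007")     # no leading zeros
assert WSPACE.fullmatch(" \t\n")
```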

Finite Automata

Finite automata (state machines) can be used to recognize the tokens specified by a regular expression.

A finite automaton (FA) is:

  • A finite set of states
  • A set of transitions from one state to another, where transitions depend on input characters from the vocabulary
  • A distinguished start state
  • A set of final (accepting) states
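For example, a minimal identifier recognizer has two states {Start, InId}, transitions Start -> InId on a letter and InId -> InId on a letter or digit, start state Start, and InId as its only accepting state.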

If the FA has a unique transition for every (state, input character) pair, then it is a deterministic FA (DFA).

The input to the DFA is the next input character to the lexical analyzer.

The main state of the DFA uses the character to classify the type of possible token, e.g., identifier, number, special character, whitespace.

For lexical tokens that can consist of more than one character, the main state transitions to a sub-state that deals with accumulating one type of lexical token, e.g., identifier or number.

Once a complete token has been recognized by the DFA, it is packaged into a lexical token data structure and sent onward to syntax analysis.

Each sub-state consumes the characters in its lexical token and returns to the main state to deal with the character immediately after its token.
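Here is a minimal hand-coded sketch of this main-state / sub-state structure in Python, assuming (as an illustration) that a token is just a (kind, text) pair; strings, comments, and multi-character special symbols are omitted:

```python
def scan(src):
    tokens, i = [], 0
    while i < len(src):
        ch = src[i]                          # the main state classifies on one character
        if ch.isspace():
            i += 1                           # whitespace is deleted
        elif ch.isalpha():                   # sub-state: accumulate an identifier
            start = i
            while i < len(src) and (src[i].isalnum() or src[i] == '_'):
                i += 1
            tokens.append(("IDENT", src[start:i]))
        elif ch.isdigit():                   # sub-state: accumulate a number
            start = i
            while i < len(src) and src[i].isdigit():
                i += 1
            tokens.append(("INTEGER", src[start:i]))
        else:                                # single special character
            tokens.append(("SPECIAL", ch))
            i += 1
    return tokens

print(scan("a123 + 42"))
# [('IDENT', 'a123'), ('SPECIAL', '+'), ('INTEGER', '42')]
```

Note that each sub-state branch leaves i at the character immediately after its token, which is exactly the return-to-main-state behaviour described above.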

Table-Driven DFA Scanner Example

A language with whitespace, integers, identifiers, strings, special characters, and C-style comments.

Make lexical analysis decisions with one character of lookahead.

Advance through the input, saving the start of longer tokens: the input pointer is saved on entry to each sub-state, and, where appropriate, the accumulated token is emitted on the return to state 1.

Notation:

  • N: Change to state N
  • +N: Advance input pointer and change to state N
[Figure: lexical table and lexical DFA]
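The figure is not reproduced here, so the following Python sketch shows the shape of such a scanner for three of its states only: state 1 (the main state), state 2 (integer), and state 3 (identifier). The table layout, character classes, and token names are assumptions for illustration; strings, comments, and special characters are omitted. Actions use the N / +N notation above.

```python
# TABLE[state][character class] -> (advance?, next state); this encodes the
# N / +N actions: (False, n) is "n", (True, n) is "+n".
def char_class(ch):
    if ch.isalpha(): return "letter"
    if ch.isdigit(): return "digit"
    return "space" if ch.isspace() else "other"

TABLE = {
    1: {"letter": (False, 3), "digit": (False, 2),
        "space":  (True, 1),  "other": (True, 1)},    # specials omitted here
    2: {"letter": (False, 1), "digit": (True, 2),
        "space":  (False, 1), "other": (False, 1)},
    3: {"letter": (True, 3),  "digit": (True, 3),
        "space":  (False, 1), "other": (False, 1)},
}
KIND = {2: "INTEGER", 3: "IDENT"}

def scan(src):
    tokens, state, i, start = [], 1, 0, 0
    while i <= len(src):                  # one extra step flushes a final token
        cls = char_class(src[i]) if i < len(src) else "space"
        advance, nxt = TABLE[state][cls]
        if state == 1 and nxt != 1:
            start = i                     # save input pointer on entry to a sub-state
        if state != 1 and nxt == 1:
            tokens.append((KIND[state], src[start:i]))   # emit on return to state 1
        if advance:
            i += 1
        state = nxt
    return tokens

print(scan("a123 42"))   # [('IDENT', 'a123'), ('INTEGER', '42')]
```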

Example:

  • Based on the lexical table and lexical DFA above, trace the processing of the string "a123":
State   Input            Transition
1       a                3     (classify as identifier; do not advance)
3       a                +3
3       1                +3
3       2                +3
3       3                +3
3       (end of input)   1     (emit IDENT "a123")
1                        (scan complete)
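Note that the first step (state 1 on "a", transition 3) classifies the token without advancing: the identifier sub-state saves the start pointer, consumes "a123" with +3 moves, and returns to state 1 at end of input, at which point the accumulated token is emitted. Running the table-driven sketch above on "a123" reproduces this sequence.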