Chapter 4: Building the Parser - Structure

In this chapter, we’ll set up the infrastructure for our parser. The parser transforms tokens into an Abstract Syntax Tree.

Parser Design Pattern

We’ll use recursive descent parsing - a simple and intuitive approach where:

Each grammar rule becomes a function
Functions call each other to match nested structures
We look ahead at tokens to decide which rule to apply

This approach works well for assembly language because the grammar is simple and unambiguous.

The Parser State

Our parser maintains this state:

#![allow(unused)]
fn main() {
pub struct Parser<'a, 'b> {
    scanner: &'a mut Scanner<'b>,
    source: &'b str,
    current: Token,
    previous: Token,
    errors: Vec<ParseError>,
}
}

Let’s understand each field:

Scanner Reference

#![allow(unused)]
fn main() {
scanner: &'a mut Scanner<'b>,
}

The parser doesn’t store tokens in advance. It asks the scanner for tokens one at a time. This is more memory efficient and allows for streaming parsing.

Source Reference

#![allow(unused)]
fn main() {
source: &'b str,
}

We keep a reference to the original source so we can extract text for identifiers and error messages.

Current and Previous Tokens

#![allow(unused)]
fn main() {
current: Token,
previous: Token,
}

current: The token we’re about to process
previous: The token we just processed

This gives us one token of lookahead, which is enough for ByteASM’s grammar.

Error Collection

#![allow(unused)]
fn main() {
errors: Vec<ParseError>,
}

Rather than stopping at the first error, we collect errors and continue parsing. This lets us report multiple problems in one pass.

Core Parser Utilities

Creating the Parser

#![allow(unused)]
fn main() {
impl<'a, 'b> Parser<'a, 'b> {
    pub fn new(scanner: &'a mut Scanner<'b>, source: &'b str) -> Self {
        // Get the first token
        let current = scanner.scan_token().unwrap_or_else(|_| Token {
            kind: TokenKind::EOF,
            value: None,
            location: Location::default(),
        });

        Self {
            scanner,
            source,
            current: current.clone(),
            previous: current,
            errors: Vec::new(),
        }
    }
}
}

We immediately scan the first token so current is ready.

Advancing Through Tokens

#![allow(unused)]
fn main() {
pub fn advance(&mut self) -> Token {
    let fallback_location = self.current.location;

    // Move current to previous, scan new current
    self.previous = std::mem::replace(
        &mut self.current,
        self.scanner.scan_token().unwrap_or_else(|_| Token {
            kind: TokenKind::EOF,
            value: None,
            location: fallback_location,
        }),
    );

    self.previous.clone()
}
}

advance() returns the old current token (now previous) and loads the next token.

Checking Token Types

#![allow(unused)]
fn main() {
pub fn check(&self, kind: TokenKind) -> bool {
    self.current.kind == kind
}

pub fn is_at_end(&self) -> bool {
    self.current.kind == TokenKind::EOF
}
}

Expecting Specific Tokens

#![allow(unused)]
fn main() {
pub fn expect(&mut self, kind: TokenKind, expected: &str) -> ParseResult<Token> {
    if self.check(kind) {
        Ok(self.advance())
    } else {
        Err(ParseError::UnexpectedToken {
            expected: expected.to_string(),
            found: self.current.kind,
            location: self.current.location,
        })
    }
}
}

expect() is used when we know what must come next. If the token doesn’t match, we report an error.

Error Types

Our parser can produce these errors:

#![allow(unused)]
fn main() {
pub enum ParseError {
    UnexpectedToken {
        expected: String,
        found: TokenKind,
        location: Location,
    },

    UnexpectedEof {
        location: Location,
    },

    InvalidOperand {
        message: String,
        location: Location,
    },

    InvalidExpression {
        message: String,
        location: Location,
    },

    InvalidDirective {
        message: String,
        location: Location,
    },

    InvalidLabel {
        message: String,
        location: Location,
    },
}
}

Each error includes a Location for precise error reporting.

Getting Error Location

#![allow(unused)]
fn main() {
impl ParseError {
    pub fn location(&self) -> &Location {
        match self {
            ParseError::UnexpectedToken { location, .. } => location,
            ParseError::UnexpectedEof { location } => location,
            ParseError::InvalidOperand { location, .. } => location,
            ParseError::InvalidExpression { location, .. } => location,
            ParseError::InvalidDirective { location, .. } => location,
            ParseError::InvalidLabel { location, .. } => location,
        }
    }
}
}

Error Recovery

When we encounter an error, we don’t want to stop immediately. We use panic mode recovery - skip tokens until we find a safe point to continue.

For assembly language, the safe point is the next line:

#![allow(unused)]
fn main() {
fn synchronize(&mut self) {
    while !self.is_at_end() {
        if self.check(TokenKind::NewLine) {
            self.advance();
            return;
        }
        self.advance();
    }
}
}

This means:

When an error occurs, add it to errors
Call synchronize() to skip to the next line
Continue parsing
At the end, return all collected errors

The Main Parse Loop

#![allow(unused)]
fn main() {
pub fn parse(&mut self) -> Result<Program, Vec<ParseError>> {
    let mut program = Program::new();

    while !self.is_at_end() {
        // Skip blank lines
        self.skip_empty_lines();

        if self.is_at_end() {
            break;
        }

        // Try to parse a line
        match self.parse_line() {
            Ok(statements) => {
                program.statements.extend(statements);
            }
            Err(e) => {
                self.errors.push(e);
                self.synchronize();
            }
        }
    }

    if self.errors.is_empty() {
        Ok(program)
    } else {
        Err(std::mem::take(&mut self.errors))
    }
}
}

Skipping Empty Lines

#![allow(unused)]
fn main() {
fn skip_empty_lines(&mut self) {
    while self.check(TokenKind::NewLine) || self.check(TokenKind::Comment) {
        self.advance();
    }
}
}

Line Structure

A single line can contain:

Just a label: main:
Just an instruction: nop
Just a directive: .org 0x8000
A label followed by an instruction: loop: dex

Our parse_line() function handles all these cases:

#![allow(unused)]
fn main() {
fn parse_line(&mut self) -> ParseResult<Vec<Statement>> {
    let mut statements = Vec::new();

    // Check for label
    if self.is_label_start() {
        statements.push(self.parse_label()?);
    }

    // Check for instruction or directive
    if self.check(TokenKind::Instruction) {
        statements.push(self.parse_instruction()?);
    } else if self.check(TokenKind::Directive) {
        statements.push(self.parse_directive()?);
    }

    // Expect end of line
    if !self.is_at_end() &&
       !self.check(TokenKind::NewLine) &&
       !self.check(TokenKind::Comment)
    {
        return Err(ParseError::UnexpectedToken {
            expected: "end of line".to_string(),
            found: self.current.kind,
            location: self.current.location,
        });
    }

    // Consume newline if present
    if self.check(TokenKind::NewLine) {
        self.advance();
    }

    Ok(statements)
}
}

Detecting Labels

A label is an identifier (or local label) followed by a colon:

#![allow(unused)]
fn main() {
fn is_label_start(&self) -> bool {
    (self.check(TokenKind::Identifier) || self.check(TokenKind::LocalLabel))
        // We need to look at what follows to confirm it's a label
        // For now, we try to parse and see
}
}

Result Type

#![allow(unused)]
fn main() {
pub type ParseResult<T> = Result<T, ParseError>;
}

Public API

The parser module exposes a simple function:

#![allow(unused)]
fn main() {
pub fn parse(source: &str) -> Result<Program, Vec<ParseError>> {
    let mut scanner = Scanner::new(source);
    let mut parser = Parser::new(&mut scanner, source);
    parser.parse()
}
}

Usage:

#![allow(unused)]
fn main() {
let source = ".org 0x8000\nlda #0x42";
let program = byte_asm::parser::parse(source)?;
}

Summary

In this chapter, we set up the parser infrastructure:

Parser state: scanner reference, current/previous tokens, error list
Core utilities: advance(), check(), expect()
Error types: with location information for each error
Error recovery: synchronize on newlines to continue after errors
Main loop: parse lines until EOF, collecting errors

In the next chapter, we’ll implement the actual parsing logic for labels, instructions, and directives.

Previous: Chapter 3 - Designing the AST | Next: Chapter 5 - Parser Implementation

Keyboard shortcuts

ByteASM Tutorial