Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Chapter 4: Building the Parser - Structure

In this chapter, we’ll set up the infrastructure for our parser. The parser transforms tokens into an Abstract Syntax Tree.

Parser Design Pattern

We’ll use recursive descent parsing - a simple and intuitive approach where:

  • Each grammar rule becomes a function
  • Functions call each other to match nested structures
  • We look ahead at tokens to decide which rule to apply

This approach works well for assembly language because the grammar is simple and unambiguous.

The Parser State

Our parser maintains this state:

#![allow(unused)]
fn main() {
pub struct Parser<'a, 'b> {
    scanner: &'a mut Scanner<'b>,
    source: &'b str,
    current: Token,
    previous: Token,
    errors: Vec<ParseError>,
}
}

Let’s understand each field:

Scanner Reference

#![allow(unused)]
fn main() {
scanner: &'a mut Scanner<'b>,
}

The parser doesn’t store tokens in advance. It asks the scanner for tokens one at a time. This is more memory efficient and allows for streaming parsing.

Source Reference

#![allow(unused)]
fn main() {
source: &'b str,
}

We keep a reference to the original source so we can extract text for identifiers and error messages.

Current and Previous Tokens

#![allow(unused)]
fn main() {
current: Token,
previous: Token,
}
  • current: The token we’re about to process
  • previous: The token we just processed

This gives us one token of lookahead, which is enough for ByteASM’s grammar.

Error Collection

#![allow(unused)]
fn main() {
errors: Vec<ParseError>,
}

Rather than stopping at the first error, we collect errors and continue parsing. This lets us report multiple problems in one pass.

Core Parser Utilities

Creating the Parser

#![allow(unused)]
fn main() {
impl<'a, 'b> Parser<'a, 'b> {
    pub fn new(scanner: &'a mut Scanner<'b>, source: &'b str) -> Self {
        // Get the first token
        let current = scanner.scan_token().unwrap_or_else(|_| Token {
            kind: TokenKind::EOF,
            value: None,
            location: Location::default(),
        });

        Self {
            scanner,
            source,
            current: current.clone(),
            previous: current,
            errors: Vec::new(),
        }
    }
}
}

We immediately scan the first token so current is ready.

Advancing Through Tokens

#![allow(unused)]
fn main() {
pub fn advance(&mut self) -> Token {
    let fallback_location = self.current.location;

    // Move current to previous, scan new current
    self.previous = std::mem::replace(
        &mut self.current,
        self.scanner.scan_token().unwrap_or_else(|_| Token {
            kind: TokenKind::EOF,
            value: None,
            location: fallback_location,
        }),
    );

    self.previous.clone()
}
}

advance() returns the old current token (now previous) and loads the next token.

Checking Token Types

#![allow(unused)]
fn main() {
pub fn check(&self, kind: TokenKind) -> bool {
    self.current.kind == kind
}

pub fn is_at_end(&self) -> bool {
    self.current.kind == TokenKind::EOF
}
}

Expecting Specific Tokens

#![allow(unused)]
fn main() {
pub fn expect(&mut self, kind: TokenKind, expected: &str) -> ParseResult<Token> {
    if self.check(kind) {
        Ok(self.advance())
    } else {
        Err(ParseError::UnexpectedToken {
            expected: expected.to_string(),
            found: self.current.kind,
            location: self.current.location,
        })
    }
}
}

expect() is used when we know what must come next. If the token doesn’t match, we report an error.

Error Types

Our parser can produce these errors:

#![allow(unused)]
fn main() {
pub enum ParseError {
    UnexpectedToken {
        expected: String,
        found: TokenKind,
        location: Location,
    },

    UnexpectedEof {
        location: Location,
    },

    InvalidOperand {
        message: String,
        location: Location,
    },

    InvalidExpression {
        message: String,
        location: Location,
    },

    InvalidDirective {
        message: String,
        location: Location,
    },

    InvalidLabel {
        message: String,
        location: Location,
    },
}
}

Each error includes a Location for precise error reporting.

Getting Error Location

#![allow(unused)]
fn main() {
impl ParseError {
    pub fn location(&self) -> &Location {
        match self {
            ParseError::UnexpectedToken { location, .. } => location,
            ParseError::UnexpectedEof { location } => location,
            ParseError::InvalidOperand { location, .. } => location,
            ParseError::InvalidExpression { location, .. } => location,
            ParseError::InvalidDirective { location, .. } => location,
            ParseError::InvalidLabel { location, .. } => location,
        }
    }
}
}

Error Recovery

When we encounter an error, we don’t want to stop immediately. We use panic mode recovery - skip tokens until we find a safe point to continue.

For assembly language, the safe point is the next line:

#![allow(unused)]
fn main() {
fn synchronize(&mut self) {
    while !self.is_at_end() {
        if self.check(TokenKind::NewLine) {
            self.advance();
            return;
        }
        self.advance();
    }
}
}

This means:

  1. When an error occurs, add it to errors
  2. Call synchronize() to skip to the next line
  3. Continue parsing
  4. At the end, return all collected errors

The Main Parse Loop

#![allow(unused)]
fn main() {
pub fn parse(&mut self) -> Result<Program, Vec<ParseError>> {
    let mut program = Program::new();

    while !self.is_at_end() {
        // Skip blank lines
        self.skip_empty_lines();

        if self.is_at_end() {
            break;
        }

        // Try to parse a line
        match self.parse_line() {
            Ok(statements) => {
                program.statements.extend(statements);
            }
            Err(e) => {
                self.errors.push(e);
                self.synchronize();
            }
        }
    }

    if self.errors.is_empty() {
        Ok(program)
    } else {
        Err(std::mem::take(&mut self.errors))
    }
}
}

Skipping Empty Lines

#![allow(unused)]
fn main() {
fn skip_empty_lines(&mut self) {
    while self.check(TokenKind::NewLine) || self.check(TokenKind::Comment) {
        self.advance();
    }
}
}

Line Structure

A single line can contain:

  • Just a label: main:
  • Just an instruction: nop
  • Just a directive: .org 0x8000
  • A label followed by an instruction: loop: dex

Our parse_line() function handles all these cases:

#![allow(unused)]
fn main() {
fn parse_line(&mut self) -> ParseResult<Vec<Statement>> {
    let mut statements = Vec::new();

    // Check for label
    if self.is_label_start() {
        statements.push(self.parse_label()?);
    }

    // Check for instruction or directive
    if self.check(TokenKind::Instruction) {
        statements.push(self.parse_instruction()?);
    } else if self.check(TokenKind::Directive) {
        statements.push(self.parse_directive()?);
    }

    // Expect end of line
    if !self.is_at_end() &&
       !self.check(TokenKind::NewLine) &&
       !self.check(TokenKind::Comment)
    {
        return Err(ParseError::UnexpectedToken {
            expected: "end of line".to_string(),
            found: self.current.kind,
            location: self.current.location,
        });
    }

    // Consume newline if present
    if self.check(TokenKind::NewLine) {
        self.advance();
    }

    Ok(statements)
}
}

Detecting Labels

A label is an identifier (or local label) followed by a colon:

#![allow(unused)]
fn main() {
fn is_label_start(&self) -> bool {
    (self.check(TokenKind::Identifier) || self.check(TokenKind::LocalLabel))
        // We need to look at what follows to confirm it's a label
        // For now, we try to parse and see
}
}

Result Type

#![allow(unused)]
fn main() {
pub type ParseResult<T> = Result<T, ParseError>;
}

Public API

The parser module exposes a simple function:

#![allow(unused)]
fn main() {
pub fn parse(source: &str) -> Result<Program, Vec<ParseError>> {
    let mut scanner = Scanner::new(source);
    let mut parser = Parser::new(&mut scanner, source);
    parser.parse()
}
}

Usage:

#![allow(unused)]
fn main() {
let source = ".org 0x8000\nlda #0x42";
let program = byte_asm::parser::parse(source)?;
}

Summary

In this chapter, we set up the parser infrastructure:

  • Parser state: scanner reference, current/previous tokens, error list
  • Core utilities: advance(), check(), expect()
  • Error types: with location information for each error
  • Error recovery: synchronize on newlines to continue after errors
  • Main loop: parse lines until EOF, collecting errors

In the next chapter, we’ll implement the actual parsing logic for labels, instructions, and directives.


Previous: Chapter 3 - Designing the AST | Next: Chapter 5 - Parser Implementation