Chapter 4: Building the Parser - Structure
In this chapter, we’ll set up the infrastructure for our parser. The parser transforms tokens into an Abstract Syntax Tree.
Parser Design Pattern
We’ll use recursive descent parsing - a simple and intuitive approach where:
- Each grammar rule becomes a function
- Functions call each other to match nested structures
- We look ahead at tokens to decide which rule to apply
This approach works well for assembly language because the grammar is simple and unambiguous.
The Parser State
Our parser maintains this state:
#![allow(unused)]
fn main() {
pub struct Parser<'a, 'b> {
scanner: &'a mut Scanner<'b>,
source: &'b str,
current: Token,
previous: Token,
errors: Vec<ParseError>,
}
}
Let’s understand each field:
Scanner Reference
#![allow(unused)]
fn main() {
scanner: &'a mut Scanner<'b>,
}
The parser doesn’t store tokens in advance. It asks the scanner for tokens one at a time. This is more memory efficient and allows for streaming parsing.
Source Reference
#![allow(unused)]
fn main() {
source: &'b str,
}
We keep a reference to the original source so we can extract text for identifiers and error messages.
Current and Previous Tokens
#![allow(unused)]
fn main() {
current: Token,
previous: Token,
}
current: The token we’re about to processprevious: The token we just processed
This gives us one token of lookahead, which is enough for ByteASM’s grammar.
Error Collection
#![allow(unused)]
fn main() {
errors: Vec<ParseError>,
}
Rather than stopping at the first error, we collect errors and continue parsing. This lets us report multiple problems in one pass.
Core Parser Utilities
Creating the Parser
#![allow(unused)]
fn main() {
impl<'a, 'b> Parser<'a, 'b> {
pub fn new(scanner: &'a mut Scanner<'b>, source: &'b str) -> Self {
// Get the first token
let current = scanner.scan_token().unwrap_or_else(|_| Token {
kind: TokenKind::EOF,
value: None,
location: Location::default(),
});
Self {
scanner,
source,
current: current.clone(),
previous: current,
errors: Vec::new(),
}
}
}
}
We immediately scan the first token so current is ready.
Advancing Through Tokens
#![allow(unused)]
fn main() {
pub fn advance(&mut self) -> Token {
let fallback_location = self.current.location;
// Move current to previous, scan new current
self.previous = std::mem::replace(
&mut self.current,
self.scanner.scan_token().unwrap_or_else(|_| Token {
kind: TokenKind::EOF,
value: None,
location: fallback_location,
}),
);
self.previous.clone()
}
}
advance() returns the old current token (now previous) and loads the next token.
Checking Token Types
#![allow(unused)]
fn main() {
pub fn check(&self, kind: TokenKind) -> bool {
self.current.kind == kind
}
pub fn is_at_end(&self) -> bool {
self.current.kind == TokenKind::EOF
}
}
Expecting Specific Tokens
#![allow(unused)]
fn main() {
pub fn expect(&mut self, kind: TokenKind, expected: &str) -> ParseResult<Token> {
if self.check(kind) {
Ok(self.advance())
} else {
Err(ParseError::UnexpectedToken {
expected: expected.to_string(),
found: self.current.kind,
location: self.current.location,
})
}
}
}
expect() is used when we know what must come next. If the token doesn’t match, we report an error.
Error Types
Our parser can produce these errors:
#![allow(unused)]
fn main() {
pub enum ParseError {
UnexpectedToken {
expected: String,
found: TokenKind,
location: Location,
},
UnexpectedEof {
location: Location,
},
InvalidOperand {
message: String,
location: Location,
},
InvalidExpression {
message: String,
location: Location,
},
InvalidDirective {
message: String,
location: Location,
},
InvalidLabel {
message: String,
location: Location,
},
}
}
Each error includes a Location for precise error reporting.
Getting Error Location
#![allow(unused)]
fn main() {
impl ParseError {
pub fn location(&self) -> &Location {
match self {
ParseError::UnexpectedToken { location, .. } => location,
ParseError::UnexpectedEof { location } => location,
ParseError::InvalidOperand { location, .. } => location,
ParseError::InvalidExpression { location, .. } => location,
ParseError::InvalidDirective { location, .. } => location,
ParseError::InvalidLabel { location, .. } => location,
}
}
}
}
Error Recovery
When we encounter an error, we don’t want to stop immediately. We use panic mode recovery - skip tokens until we find a safe point to continue.
For assembly language, the safe point is the next line:
#![allow(unused)]
fn main() {
fn synchronize(&mut self) {
while !self.is_at_end() {
if self.check(TokenKind::NewLine) {
self.advance();
return;
}
self.advance();
}
}
}
This means:
- When an error occurs, add it to
errors - Call
synchronize()to skip to the next line - Continue parsing
- At the end, return all collected errors
The Main Parse Loop
#![allow(unused)]
fn main() {
pub fn parse(&mut self) -> Result<Program, Vec<ParseError>> {
let mut program = Program::new();
while !self.is_at_end() {
// Skip blank lines
self.skip_empty_lines();
if self.is_at_end() {
break;
}
// Try to parse a line
match self.parse_line() {
Ok(statements) => {
program.statements.extend(statements);
}
Err(e) => {
self.errors.push(e);
self.synchronize();
}
}
}
if self.errors.is_empty() {
Ok(program)
} else {
Err(std::mem::take(&mut self.errors))
}
}
}
Skipping Empty Lines
#![allow(unused)]
fn main() {
fn skip_empty_lines(&mut self) {
while self.check(TokenKind::NewLine) || self.check(TokenKind::Comment) {
self.advance();
}
}
}
Line Structure
A single line can contain:
- Just a label:
main: - Just an instruction:
nop - Just a directive:
.org 0x8000 - A label followed by an instruction:
loop: dex
Our parse_line() function handles all these cases:
#![allow(unused)]
fn main() {
fn parse_line(&mut self) -> ParseResult<Vec<Statement>> {
let mut statements = Vec::new();
// Check for label
if self.is_label_start() {
statements.push(self.parse_label()?);
}
// Check for instruction or directive
if self.check(TokenKind::Instruction) {
statements.push(self.parse_instruction()?);
} else if self.check(TokenKind::Directive) {
statements.push(self.parse_directive()?);
}
// Expect end of line
if !self.is_at_end() &&
!self.check(TokenKind::NewLine) &&
!self.check(TokenKind::Comment)
{
return Err(ParseError::UnexpectedToken {
expected: "end of line".to_string(),
found: self.current.kind,
location: self.current.location,
});
}
// Consume newline if present
if self.check(TokenKind::NewLine) {
self.advance();
}
Ok(statements)
}
}
Detecting Labels
A label is an identifier (or local label) followed by a colon:
#![allow(unused)]
fn main() {
fn is_label_start(&self) -> bool {
(self.check(TokenKind::Identifier) || self.check(TokenKind::LocalLabel))
// We need to look at what follows to confirm it's a label
// For now, we try to parse and see
}
}
Result Type
#![allow(unused)]
fn main() {
pub type ParseResult<T> = Result<T, ParseError>;
}
Public API
The parser module exposes a simple function:
#![allow(unused)]
fn main() {
pub fn parse(source: &str) -> Result<Program, Vec<ParseError>> {
let mut scanner = Scanner::new(source);
let mut parser = Parser::new(&mut scanner, source);
parser.parse()
}
}
Usage:
#![allow(unused)]
fn main() {
let source = ".org 0x8000\nlda #0x42";
let program = byte_asm::parser::parse(source)?;
}
Summary
In this chapter, we set up the parser infrastructure:
- Parser state: scanner reference, current/previous tokens, error list
- Core utilities:
advance(),check(),expect() - Error types: with location information for each error
- Error recovery: synchronize on newlines to continue after errors
- Main loop: parse lines until EOF, collecting errors
In the next chapter, we’ll implement the actual parsing logic for labels, instructions, and directives.
Previous: Chapter 3 - Designing the AST | Next: Chapter 5 - Parser Implementation