Building a 6502 Assembler: A Complete Tutorial

Welcome to the Byte Fantasy Console Assembler Tutorial! This comprehensive guide will walk you through building a complete 6502 assembler from scratch.

What You’ll Build

By the end of this tutorial, you’ll have a fully functional assembler that can:

  • Parse 6502 assembly source code using the ByteASM language
  • Resolve labels and forward references
  • Generate machine code binaries
  • Handle directives like .org, .db, .dw, .equ, .include
  • Evaluate expressions (e.g., label + 5)
  • Report meaningful errors with line/column information

Prerequisites

  • Basic Rust knowledge (ownership, structs, enums, pattern matching)
  • No prior knowledge of assemblers or compilers required

Tutorial Chapters

Part 1: Foundations

  1. Introduction to Assemblers and the 6502
  2. Building the Scanner (Lexer)

Part 2: Parsing

  3. Designing the Abstract Syntax Tree
  4. Building the Parser - Structure
  5. Building the Parser - Implementation

Part 3: Assembly

  6. The Symbol Table
  7. Two-Pass Assembly
  8. Code Generation
  9. Expression Evaluation

Part 4: Completion

  10. Implementing Directives
  11. Error Handling and Reporting
  12. The Command-Line Interface
  13. Testing the Assembler
  14. Complete Example - A Game

Quick Start

Once you’ve completed the tutorial, you can assemble programs like this:

cargo run -p byte_asm -- bouncing_ball.s -o game.bin
cargo run -p byte_emu -- game.bin

ByteASM Language Overview

ByteASM is a clean, modern 6502 assembly language designed for the Byte fantasy console:

; bouncing_ball.s - A simple demo for Byte console

.equ BALL_X   0x00
.equ BALL_Y   0x01
.equ VRAM     0x1000

.org 0x8000

reset:
    lda #32           ; initialize ball position
    sta BALL_X
    sta BALL_Y

main_loop:
    jsr update_ball
    jsr draw_ball
    rti               ; wait for next frame

update_ball:
    lda BALL_X
    clc
    adc #1
    sta BALL_X
    rts

draw_ball:
    ldx BALL_X
    lda #1            ; white color
    sta VRAM,x
    rts

.org 0xFFFC
.dw reset             ; reset vector
.dw main_loop         ; IRQ vector

Key Features

  • Numeric literals: Decimal (255), hexadecimal (0xFF), binary (0b11111111)
  • All lowercase mnemonics and registers
  • Local labels with . or @ prefix
  • Expressions with +, -, *, / operators
  • Byte extraction with < (low byte) and > (high byte)
  • Current address with $

Project Structure

byte_asm/
├── src/
│   ├── lib.rs              # Library exports
│   ├── main.rs             # CLI application
│   ├── ast.rs              # Abstract Syntax Tree
│   ├── error.rs            # Unified error types
│   ├── symbol.rs           # Symbol table
│   ├── scanner/            # Lexical analysis
│   ├── parser/             # Syntax analysis
│   └── assembler/          # Code generation
├── tests/                  # Integration tests
└── examples/               # Example assembly programs

Let’s begin with Chapter 1: Introduction to Assemblers and the 6502!

Chapter 1: Introduction to Assemblers and the 6502

Welcome to the first chapter of our journey into building a 6502 assembler! In this chapter, we’ll establish the foundational concepts you need before diving into code.

What is an Assembler?

An assembler is a program that translates assembly language source code into machine code that a processor can execute directly.

The Translation Pipeline

Source Code  →  Assembler  →  Machine Code
  (text)                       (bytes)

Consider this simple example:

lda #0x42    ; Load 0x42 into accumulator
sta 0x00     ; Store to address 0x00

The assembler transforms this into:

A9 42 85 00

These four bytes are the actual instructions the 6502 CPU will execute.
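
To make the mapping concrete, here is a minimal sketch that hand-encodes just these two instructions. The `Instr` enum and the `encode` function are hypothetical names used only for illustration; they are not the data structures the tutorial builds later. The opcodes 0xA9 (LDA immediate) and 0x85 (STA zero page) are the ones shown in the bytes above.

```rust
/// The two instruction forms used in the example above (illustrative only).
enum Instr {
    LdaImmediate(u8),
    StaZeroPage(u8),
}

/// Encode a sequence of instructions into raw machine-code bytes.
fn encode(program: &[Instr]) -> Vec<u8> {
    let mut bytes = Vec::new();
    for instr in program {
        match instr {
            // LDA #value: opcode 0xA9 followed by the value itself
            Instr::LdaImmediate(value) => bytes.extend_from_slice(&[0xA9, *value]),
            // STA zp: opcode 0x85 followed by the zero-page address
            Instr::StaZeroPage(addr) => bytes.extend_from_slice(&[0x85, *addr]),
        }
    }
    bytes
}
```

Calling `encode(&[Instr::LdaImmediate(0x42), Instr::StaZeroPage(0x00)])` yields the four bytes listed above. A real assembler does the same thing, only driven by parsed source text and a full opcode table.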

Assembly vs Machine Code vs High-Level Languages

Machine Code: Raw bytes that the CPU understands. Each byte has a specific meaning - opcodes, operands, data. Writing directly in machine code is tedious and error-prone.

Assembly Language: A human-readable representation of machine code. Each instruction has a mnemonic (like LDA for “Load Accumulator”) that, together with its addressing mode, maps to a specific opcode. Each assembly instruction corresponds to exactly one machine instruction.

High-Level Languages: Languages like C, Rust, or Python abstract away machine details. One line of high-level code might compile to dozens of machine instructions.

The 6502 Microprocessor

A Brief History

The 6502, designed in 1975 by MOS Technology, became one of the most influential processors in computing history. The 6502, or a close variant of it, powered:

  • Apple II (1977)
  • Commodore 64 (1982)
  • Nintendo Entertainment System (1983)
  • Atari 2600 (1977)

Its simplicity and low cost made it ideal for early personal computers and game consoles.

Architecture Overview

The 6502 is an 8-bit processor with a 16-bit address bus:

  • 8-bit: It processes data 8 bits (1 byte) at a time
  • 16-bit address bus: It can address up to 64KB of memory (2^16 = 65,536 bytes)

Registers

The 6502 has a small set of registers:

| Register | Size | Purpose |
|----------|------|---------|
| A (Accumulator) | 8-bit | Main register for arithmetic and logic operations |
| X | 8-bit | Index register, used for addressing and counting |
| Y | 8-bit | Index register, similar to X |
| SP (Stack Pointer) | 8-bit | Points to current position in the stack (page 0x01) |
| PC (Program Counter) | 16-bit | Address of the next instruction to execute |
| P (Status/Flags) | 8-bit | Processor status flags |

Status Flags

The P register contains flags that reflect the result of operations:

7 6 5 4 3 2 1 0
N V - B D I Z C

| Flag | Name | Meaning |
|------|------|---------|
| N | Negative | Set if result is negative (bit 7 is 1) |
| V | Overflow | Set on signed arithmetic overflow |
| B | Break | Set by BRK instruction |
| D | Decimal | Enables BCD arithmetic mode |
| I | Interrupt | Disables IRQ when set |
| Z | Zero | Set if result is zero |
| C | Carry | Set when an addition carries out of bit 7; cleared when a subtraction borrows |
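
The bit layout above translates directly into bit masks. The sketch below is illustrative only: the constant names and the `update_zn` helper are assumptions, not part of the tutorial's code. It shows how an emulator might update the Z and N flags after a load such as LDA.

```rust
// Bit masks for the P register, following the layout above (bit 5 is unused).
const FLAG_C: u8 = 1 << 0;
const FLAG_Z: u8 = 1 << 1;
const FLAG_I: u8 = 1 << 2;
const FLAG_D: u8 = 1 << 3;
const FLAG_B: u8 = 1 << 4;
const FLAG_V: u8 = 1 << 6;
const FLAG_N: u8 = 1 << 7;

/// Return `p` with Z and N recomputed for `result`; all other flags are kept.
fn update_zn(p: u8, result: u8) -> u8 {
    // Clear the two flags first, then set whichever applies.
    let mut p = p & !(FLAG_Z | FLAG_N);
    if result == 0 {
        p |= FLAG_Z;
    }
    if result & 0x80 != 0 {
        p |= FLAG_N;
    }
    p
}
```

The assembler itself never touches these flags; they matter because branch instructions such as beq and bne test them at run time.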

6502 Instruction Format

Each 6502 instruction consists of:

  1. Opcode (1 byte): Identifies the instruction and addressing mode
  2. Operand (0, 1, or 2 bytes): The data or address the instruction operates on

Instruction Sizes

  • 1 byte: Opcode only (implied and accumulator modes)
  • 2 bytes: Opcode + 8-bit operand (immediate, zero page, zero page indexed, indexed indirect, indirect indexed, relative)
  • 3 bytes: Opcode + 16-bit operand (absolute, absolute indexed, indirect)
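
Because the size depends only on the addressing mode, it can be written as a small lookup. This is a sketch under assumptions: the `AddressingMode` enum and `instruction_size` function are hypothetical names, and the modes themselves are described in the Addressing Modes section of this chapter.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum AddressingMode {
    Implied,
    Accumulator,
    Immediate,
    ZeroPage,
    ZeroPageX,
    ZeroPageY,
    IndexedIndirect, // (zp,x)
    IndirectIndexed, // (zp),y
    Relative,
    Absolute,
    AbsoluteX,
    AbsoluteY,
    Indirect,
}

/// Total instruction length in bytes (opcode plus operand).
fn instruction_size(mode: AddressingMode) -> usize {
    use AddressingMode::*;
    match mode {
        // Opcode only
        Implied | Accumulator => 1,
        // Opcode + one operand byte
        Immediate | ZeroPage | ZeroPageX | ZeroPageY | IndexedIndirect
        | IndirectIndexed | Relative => 2,
        // Opcode + two operand bytes
        Absolute | AbsoluteX | AbsoluteY | Indirect => 3,
    }
}
```

An assembler needs this information to know how far the current address advances after each instruction, which in turn determines the address of every label.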

Little-Endian Byte Order

The 6502 uses little-endian byte order for 16-bit values. The low byte comes first:

Address 0x1234 is stored as: 34 12

This is important when emitting word values in our assembler.
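
In Rust, the standard library already provides the conversion. The helper below is a minimal sketch; `emit_word` is a hypothetical name, not a function defined elsewhere in this tutorial.

```rust
/// Append a 16-bit value to `out` in little-endian order (low byte first).
fn emit_word(out: &mut Vec<u8>, value: u16) {
    // `to_le_bytes` returns [low, high] regardless of the host's byte order.
    out.extend_from_slice(&value.to_le_bytes());
}
```

Using `to_le_bytes` rather than manual shifting keeps the output independent of the machine the assembler runs on.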

Addressing Modes

The 6502 supports multiple addressing modes that determine how the operand is interpreted. Each combination of instruction and addressing mode has a unique opcode.

Implied Mode

No operand - the instruction operates on a specific register or performs a fixed action.

nop         ; No operation (1 byte: EA)
clc         ; Clear carry flag (1 byte: 18)
rts         ; Return from subroutine (1 byte: 60)

Accumulator Mode

Operates directly on the A register.

asl a       ; Arithmetic shift left on A (1 byte: 0A)
ror a       ; Rotate right on A (1 byte: 6A)

Immediate Mode

The operand is the actual value to use.

lda #0xFF   ; Load 0xFF into A (2 bytes: A9 FF)
ldx #0x10   ; Load 0x10 into X (2 bytes: A2 10)

The # prefix indicates immediate mode.

Zero Page Mode

The operand is an 8-bit address in the first 256 bytes of memory (page zero).

lda 0x80    ; Load from address 0x0080 (2 bytes: A5 80)
sta 0x00    ; Store to address 0x0000 (2 bytes: 85 00)

Zero page access uses one fewer byte than absolute addressing and is typically one cycle faster.

Absolute Mode

The operand is a full 16-bit address.

lda 0x2000  ; Load from address 0x2000 (3 bytes: AD 00 20)
jmp 0x8000  ; Jump to address 0x8000 (3 bytes: 4C 00 80)

Indexed Modes

Add an index register to the address:

lda 0x2000,x    ; Load from 0x2000 + X (Absolute,X)
lda 0x2000,y    ; Load from 0x2000 + Y (Absolute,Y)
lda 0x80,x      ; Load from 0x80 + X (Zero Page,X)

Indirect Modes

Use an address stored in memory:

jmp (0x2000)    ; Jump to address stored at 0x2000-0x2001 (Indirect)
lda (0x80,x)    ; Indexed Indirect: address at (0x80+X)
lda (0x80),y    ; Indirect Indexed: (address at 0x80) + Y

Relative Mode

Used only for branch instructions. The operand is a signed 8-bit offset from the next instruction.

beq label       ; Branch if zero flag set
bne loop        ; Branch if zero flag clear
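
Since the offset is measured from the instruction after the branch, and a branch is always 2 bytes long, the operand can be computed as shown in this sketch. The function name `branch_offset` is an assumption for illustration, and the sketch ignores wraparound at the ends of the 16-bit address space.

```rust
use std::convert::TryFrom;

/// Compute the operand byte for a branch at `branch_addr` that targets `target`.
/// Returns `None` when the target is outside the signed 8-bit range.
fn branch_offset(branch_addr: u16, target: u16) -> Option<u8> {
    // The offset is relative to the address of the next instruction.
    let next = i32::from(branch_addr) + 2;
    let delta = i32::from(target) - next;
    // Only -128..=127 fits; the cast keeps the two's-complement bit pattern.
    i8::try_from(delta).ok().map(|d| d as u8)
}
```

For example, a branch that jumps back to itself has a delta of -2, which is encoded as 0xFE. When the function returns `None`, an assembler would normally report a "branch out of range" error.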

The Byte Fantasy Console Memory Map

Our assembler targets the Byte Fantasy Console, which has a specific memory layout:

0x0000 - 0x00FF : Zero Page (fast access RAM)
0x0100 - 0x01FF : Stack
0x0200 - 0x0FFF : General RAM
0x1000 - 0x1FFF : Video RAM (64x64 pixels)
...
0x8000 - 0xFFFB : Program ROM
0xFFFC - 0xFFFD : Reset Vector (16-bit address)
0xFFFE - 0xFFFF : IRQ Vector (16-bit address)

Special Registers

| Address | Name | Purpose |
|---------|------|---------|
| 0xFD | VID_PTR | Video page pointer (high byte of VRAM address) |
| 0xFE | RANDOM | Random number generator |
| 0xFF | INPUT | Controller input state |

Reset and IRQ Vectors

When the console powers on:

  1. It reads the 16-bit address at 0xFFFC-0xFFFD
  2. Jumps to that address (your reset/init code)

When an IRQ occurs (like VBLANK):

  1. It reads the address at 0xFFFE-0xFFFF
  2. Jumps to that address (your interrupt handler)

What We’ll Build

Our assembler will:

  1. Scan source code into tokens (lexical analysis)
  2. Parse tokens into an Abstract Syntax Tree (syntax analysis)
  3. Resolve labels and forward references (symbol table)
  4. Generate machine code bytes (code generation)
  5. Output a binary file ready for the emulator

Example Input

.org 0x8000

start:
    lda #0x42
    sta 0x00
    jmp start

.org 0xFFFC
.dw start

Example Output

A binary file containing:

  • At offset 0x0000 (address 0x8000): A9 42 85 00 4C 00 80
  • At offset 0x7FFC (address 0xFFFC): 00 80
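
The offsets above follow from subtracting the start of program ROM from the CPU address. The sketch below assumes, as the example suggests, that the output file covers exactly the range 0x8000-0xFFFF; `ROM_START` and `file_offset` are hypothetical names.

```rust
/// Start of program ROM in the memory map (assumed to be the first byte of the file).
const ROM_START: u16 = 0x8000;

/// Map a CPU address to an offset in the output file.
/// Returns `None` for addresses below the start of ROM.
fn file_offset(address: u16) -> Option<usize> {
    address.checked_sub(ROM_START).map(usize::from)
}
```

With this mapping, the reset vector at 0xFFFC lands at offset 0x7FFC, matching the example output.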

Summary

In this chapter, we learned:

  • An assembler translates human-readable assembly into machine code
  • The 6502 is an 8-bit processor with a 16-bit address space
  • Instructions consist of opcodes and operands
  • Different addressing modes determine how operands are interpreted
  • The Byte console has a specific memory map with special registers

In the next chapter, we’ll start building our scanner to tokenize assembly source code.



Chapter 2: Building the Scanner (Lexer)

In this chapter, we’ll build the scanner (also called a lexer) - the first stage of our assembler. The scanner converts source text into a stream of tokens.

What is Lexical Analysis?

The scanner performs lexical analysis: breaking the input stream of characters into meaningful chunks called tokens. Think of tokens as the “words” of our assembly language.

Characters to Tokens

Source:  lda #0xFF

Tokens:  [INSTRUCTION(LDA)] [HASH] [NUMBER(255)]

Each token has:

  • A kind (what type of token it is)
  • Optionally a value (the parsed data)
  • A location (line, column, start position)

Token Types in ByteASM

Let’s define all the token types our scanner will produce:

#![allow(unused)]
fn main() {
pub enum TokenKind {
    // Punctuation
    CloseParen,    // )
    Colon,         // :
    Comma,         // ,
    OpenParen,     // (

    // Operators
    Hash,          // #
    Minus,         // -
    Plus,          // +
    Slash,         // /
    Star,          // *
    LessThan,      // <
    GreaterThan,   // >

    // Special symbols
    Dollar,        // $ (current address)
    Semicolon,     // ; (comment start)

    // Literals
    Number,        // 0xFF, 0b1010, 42
    String,        // "hello"

    // Identifiers and keywords
    Directive,     // .org, .db, etc.
    Instruction,   // lda, sta, etc.
    Register,      // a, x, y
    Identifier,    // label names
    LocalLabel,    // .loop or @loop

    // Structure
    Comment,       // ; to end of line
    NewLine,       // \n
    EOF,           // End of file
}
}

The Token Structure

Each token carries information about its type, value, and position:

#![allow(unused)]
fn main() {
pub struct Token {
    pub kind: TokenKind,
    pub value: Option<TokenValue>,
    pub location: Location,
}

pub struct Location {
    pub line: usize,      // Line number (1-indexed)
    pub column: usize,    // Column number (1-indexed)
    pub start: usize,     // Byte offset in source
    pub length: usize,    // Length in bytes
}

pub enum TokenValue {
    Number(u64),
    String(String),
    Directive(Directive),
    Instruction(Mnemonic),
}
}

Scanner Architecture

Our scanner uses a cursor to track our position in the source:

#![allow(unused)]
fn main() {
pub struct Scanner<'a> {
    cursor: Cursor<'a>,
    source: &'a str,
}

pub struct Cursor<'a> {
    chars: Peekable<Chars<'a>>,
    line: usize,
    column: usize,
    current: usize,    // Current byte position
    start: usize,      // Start of current token
}
}

The Cursor

The cursor provides these operations:

impl<'a> Cursor<'a> {
    // Signatures only; the method bodies are omitted here.

    /// Peek at the next character without consuming it
    fn peek(&mut self) -> Option<char>

    /// Advance and return the next character
    fn advance(&mut self) -> Option<char>

    /// Mark the start of a new token
    fn sync(&mut self)

    /// Create a Location for the current token
    fn location(&self) -> Location

    /// Advance the line counter (after newline)
    fn advance_line(&mut self)
}
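
One possible way to fill in these operations is sketched below. It is not the tutorial's actual implementation: the `new` constructor, the 1-based starting values, and the way `location` derives the token's starting column (which assumes ASCII source, where bytes and columns coincide) are all assumptions made for illustration.

```rust
use std::iter::Peekable;
use std::str::Chars;

#[derive(Clone, Copy, Debug, Default, PartialEq)]
pub struct Location {
    pub line: usize,
    pub column: usize,
    pub start: usize,
    pub length: usize,
}

pub struct Cursor<'a> {
    chars: Peekable<Chars<'a>>,
    line: usize,
    column: usize,
    current: usize, // Current byte position
    start: usize,   // Start of current token
}

impl<'a> Cursor<'a> {
    /// Hypothetical constructor: lines and columns start at 1.
    pub fn new(source: &'a str) -> Self {
        Self { chars: source.chars().peekable(), line: 1, column: 1, current: 0, start: 0 }
    }

    pub fn peek(&mut self) -> Option<char> {
        self.chars.peek().copied()
    }

    pub fn advance(&mut self) -> Option<char> {
        let c = self.chars.next()?;
        self.current += c.len_utf8(); // byte offset, not character count
        self.column += 1;
        Some(c)
    }

    pub fn sync(&mut self) {
        self.start = self.current;
    }

    pub fn location(&self) -> Location {
        let length = self.current - self.start;
        Location {
            line: self.line,
            // Assumes ASCII tokens, so byte length equals column width.
            column: self.column.saturating_sub(length),
            start: self.start,
            length,
        }
    }

    pub fn advance_line(&mut self) {
        self.line += 1;
        self.column = 1;
    }
}
```

Keeping `current` as a byte offset is what lets the scanner slice the source string directly when it needs a token's text.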

The Main Scanning Loop

The heart of the scanner is the scan_token method:

pub fn scan_token(&mut self) -> Result<Token, ScannerError> {
    self.skip_whitespace();
    self.cursor.sync();

    // `make_token` is assumed to return a plain `Token` (as in
    // `scan_identifier`), so each arm wraps its result in `Ok`.
    match self.cursor.advance() {
        None => Ok(self.make_token(TokenKind::EOF, None)),
        Some(c) => match c {
            // Single-character tokens
            ')' => Ok(self.make_token(TokenKind::CloseParen, None)),
            '(' => Ok(self.make_token(TokenKind::OpenParen, None)),
            ',' => Ok(self.make_token(TokenKind::Comma, None)),
            ':' => Ok(self.make_token(TokenKind::Colon, None)),
            '#' => Ok(self.make_token(TokenKind::Hash, None)),
            '+' => Ok(self.make_token(TokenKind::Plus, None)),
            '-' => Ok(self.make_token(TokenKind::Minus, None)),
            '*' => Ok(self.make_token(TokenKind::Star, None)),
            '/' => Ok(self.make_token(TokenKind::Slash, None)),
            '<' => Ok(self.make_token(TokenKind::LessThan, None)),
            '>' => Ok(self.make_token(TokenKind::GreaterThan, None)),
            '$' => Ok(self.make_token(TokenKind::Dollar, None)),

            // Newline: build the token before moving to the next line,
            // so its location still refers to the line it terminates
            '\n' => {
                let token = self.make_token(TokenKind::NewLine, None);
                self.cursor.advance_line();
                Ok(token)
            }

            // Comment
            ';' => {
                self.scan_comment();
                Ok(self.make_token(TokenKind::Comment, None))
            }

            // More complex tokens...
            _ => self.scan_complex_token(c),
        },
    }
}

Scanning Numbers

ByteASM supports three number formats:

  • 0xFF - Hexadecimal (0x prefix)
  • 0b1010 - Binary (0b prefix)
  • 42 - Decimal (no prefix)
#![allow(unused)]
fn main() {
fn scan_number(&mut self, first_char: char) -> Result<Token, ScannerError> {
    // Check for prefix
    if first_char == '0' {
        match self.cursor.peek() {
            Some('x') | Some('X') => {
                self.cursor.advance();
                return self.scan_hex();
            }
            Some('b') | Some('B') => {
                self.cursor.advance();
                return self.scan_binary();
            }
            _ => {}
        }
    }

    self.scan_decimal()
}

fn scan_hex(&mut self) -> Result<Token, ScannerError> {
    let start = self.cursor.current;

    // Consume hex digits
    while let Some(c) = self.cursor.peek() {
        if c.is_ascii_hexdigit() {
            self.cursor.advance();
        } else {
            break;
        }
    }

    // Must have at least one digit after 0x
    if self.cursor.current == start {
        return Err(ScannerError::NumberExpected {
            line: self.cursor.line,
            column: self.cursor.column,
            symbol: 'x',
        });
    }

    // Parse the hex value
    let hex_str = &self.source[start..self.cursor.current];
    let value = u64::from_str_radix(hex_str, 16)?;

    Ok(self.make_token(TokenKind::Number, Some(TokenValue::Number(value))))
}
}
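
The prefix logic can also be tried out in isolation, without a cursor. The following standalone sketch is an illustration rather than part of the scanner: `parse_number` is a hypothetical helper that takes the full literal text and applies the same three rules.

```rust
/// Parse a ByteASM numeric literal: `0x` hex, `0b` binary, otherwise decimal.
/// Returns `None` for malformed or out-of-range literals.
fn parse_number(text: &str) -> Option<u64> {
    // Split off the prefix, if any, and pick the matching radix.
    let (digits, radix) = match text.get(..2) {
        Some("0x") | Some("0X") => (&text[2..], 16),
        Some("0b") | Some("0B") => (&text[2..], 2),
        _ => (text, 10),
    };
    // Require at least one digit, and only digits valid for the radix.
    if digits.is_empty() || !digits.chars().all(|c| c.is_digit(radix)) {
        return None;
    }
    u64::from_str_radix(digits, radix).ok()
}
```

The explicit digit check matters because `from_str_radix` on its own also accepts a leading `+` sign, which is not a valid ByteASM literal.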

Scanning Identifiers

Identifiers include:

  • Label names (main, loop_counter)
  • Instruction mnemonics (lda, sta)
  • Register names (a, x, y)
#![allow(unused)]
fn main() {
fn scan_identifier(&mut self) -> Result<Token, ScannerError> {
    // Consume alphanumeric characters and underscores
    while let Some(c) = self.cursor.peek() {
        if c.is_ascii_alphanumeric() || c == '_' {
            self.cursor.advance();
        } else {
            break;
        }
    }

    let text = &self.source[self.cursor.start..self.cursor.current];
    let lower = text.to_lowercase();

    // Check for register names
    if lower == "a" || lower == "x" || lower == "y" {
        return Ok(self.make_token(TokenKind::Register, None));
    }

    // Check for instruction mnemonics
    if let Ok(mnemonic) = Mnemonic::try_from(lower.to_uppercase().as_str()) {
        return Ok(self.make_token(
            TokenKind::Instruction,
            Some(TokenValue::Instruction(mnemonic))
        ));
    }

    // It's a general identifier
    Ok(self.make_token(TokenKind::Identifier, None))
}
}

Scanning Directives and Local Labels

Both directives and local labels start with .:

#![allow(unused)]
fn main() {
fn scan_dot(&mut self) -> Result<Token, ScannerError> {
    // Consume the identifier after the dot
    let start = self.cursor.current;
    while let Some(c) = self.cursor.peek() {
        if c.is_ascii_alphanumeric() || c == '_' {
            self.cursor.advance();
        } else {
            break;
        }
    }

    let text = &self.source[start..self.cursor.current];

    // Try to parse as directive
    if let Ok(directive) = Directive::try_from(text.to_uppercase().as_str()) {
        return Ok(self.make_token(
            TokenKind::Directive,
            Some(TokenValue::Directive(directive))
        ));
    }

    // It's a local label
    Ok(self.make_token(TokenKind::LocalLabel, None))
}
}

Local labels can also start with @:

#![allow(unused)]
fn main() {
'@' => {
    self.scan_identifier_rest();
    Ok(self.make_token(TokenKind::LocalLabel, None))
}
}

Scanning Strings

Strings appear in .db directives:

#![allow(unused)]
fn main() {
fn scan_string(&mut self, quote: char) -> Result<String, ScannerError> {
    let mut result = String::new();

    while let Some(c) = self.cursor.peek() {
        if c == quote || c == '\n' {
            break;
        }
        self.cursor.advance();

        // Handle escape sequences
        if c == '\\' {
            match self.cursor.peek() {
                Some('n') => { result.push('\n'); self.cursor.advance(); }
                Some('r') => { result.push('\r'); self.cursor.advance(); }
                Some('t') => { result.push('\t'); self.cursor.advance(); }
                Some('"') => { result.push('"'); self.cursor.advance(); }
                Some('\'') => { result.push('\''); self.cursor.advance(); }
                Some('\\') => { result.push('\\'); self.cursor.advance(); }
                Some(e) => { result.push(e); self.cursor.advance(); }
                None => continue,
            }
        } else {
            result.push(c);
        }
    }

    // Check for unterminated string
    if self.cursor.peek() != Some(quote) {
        return Err(ScannerError::UnterminatedString {
            line: self.cursor.line,
            column: self.cursor.column,
            quote,
        });
    }

    // Consume closing quote
    self.cursor.advance();
    Ok(result)
}
}

Handling Comments

Comments run from ; to the end of the line:

#![allow(unused)]
fn main() {
fn scan_comment(&mut self) {
    while let Some(c) = self.cursor.peek() {
        if c == '\n' {
            break;
        }
        self.cursor.advance();
    }
}
}

Whitespace Handling

We skip spaces, tabs, and carriage returns (but not newlines, which are significant):

#![allow(unused)]
fn main() {
fn skip_whitespace(&mut self) {
    while let Some(c) = self.cursor.peek() {
        match c {
            ' ' | '\r' | '\t' => { self.cursor.advance(); }
            _ => break,
        }
    }
}
}

Error Handling

The scanner can produce these errors:

#![allow(unused)]
fn main() {
pub enum ScannerError {
    UnknownCharacter { line: usize, column: usize, character: char },
    UnknownDirective { line: usize, column: usize, directive: String },
    NumberExpected { line: usize, column: usize, symbol: char },
    UnterminatedString { line: usize, column: usize, quote: char },
}
}

Running the Scanner

Let’s trace through scanning this example:

.org 0x8000
start:
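    lda #0x00

| Input | Token Kind | Value |
|-------|------------|-------|
| .org | Directive | ORG |
| 0x8000 | Number | 32768 |
| \n | NewLine | - |
| start | Identifier | - |
| : | Colon | - |
| \n | NewLine | - |
| lda | Instruction | LDA |
| # | Hash | - |
| 0x00 | Number | 0 |
| \n | NewLine | - |
| (end) | EOF | - |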

Summary

In this chapter, we built a scanner that:

  • Recognizes all ByteASM token types
  • Handles three number formats (hex, binary, decimal)
  • Distinguishes between identifiers, instructions, and registers
  • Parses directives and local labels
  • Handles strings with escape sequences
  • Tracks location information for error reporting

In the next chapter, we’ll design the Abstract Syntax Tree that will represent our parsed program.



Chapter 3: Designing the Abstract Syntax Tree

In this chapter, we’ll design the Abstract Syntax Tree (AST) - the data structure that represents a parsed assembly program.

What is an AST?

An Abstract Syntax Tree is a tree representation of the syntactic structure of source code. Unlike the flat list of tokens from the scanner, the AST captures the hierarchical relationships between program elements.

Why Not Just Use Tokens?

Tokens tell us what elements we have, but not how they relate to each other. Consider:

lda (0x80,x)

The tokens are: INSTRUCTION OPENPAREN NUMBER COMMA REGISTER CLOSEPAREN

But what we need to know is:

  • This is an instruction: LDA
  • With an operand in Indexed Indirect mode: (0x80,x)
  • The base address is: 0x80

The AST captures this structure explicitly.

Program Structure

A ByteASM program consists of statements:

#![allow(unused)]
fn main() {
pub struct Program {
    pub statements: Vec<Statement>,
    pub source_file: Option<String>,
}
}

Statement Types

There are three kinds of statements:

#![allow(unused)]
fn main() {
pub enum Statement {
    Label(LabelDef),
    Instruction(InstructionStmt),
    Directive(DirectiveStmt),
}
}

Label Definitions

#![allow(unused)]
fn main() {
pub struct LabelDef {
    pub name: String,
    pub is_local: bool,     // true for .loop or @loop
    pub location: Location,
}
}

Examples:

  • main: → LabelDef { name: "main", is_local: false }
  • .loop: → LabelDef { name: ".loop", is_local: true }

Instruction Statements

#![allow(unused)]
fn main() {
pub struct InstructionStmt {
    pub mnemonic: Mnemonic,
    pub operand: Option<Operand>,
    pub location: Location,
}
}

Examples:

  • nop → InstructionStmt { mnemonic: NOP, operand: None }
  • lda #0x42 → InstructionStmt { mnemonic: LDA, operand: Some(Immediate(Number(66))) }

Directive Statements

#![allow(unused)]
fn main() {
pub enum DirectiveStmt {
    Org {
        address: Expression,
        location: Location,
    },
    Db {
        values: Vec<DataValue>,
        location: Location,
    },
    Dw {
        values: Vec<Expression>,
        location: Location,
    },
    Equ {
        name: String,
        value: Expression,
        location: Location,
    },
    Include {
        path: String,
        location: Location,
    },
}
}

Operands and Addressing Modes

The operand type directly encodes the addressing mode:

#![allow(unused)]
fn main() {
pub enum Operand {
    // #value - Immediate mode
    Immediate(Expression),

    // a - Accumulator mode (for ASL, ROL, etc.)
    Accumulator,

    // address - Zero Page or Absolute
    Address(Expression),

    // address,x - Indexed by X
    IndexedX(Expression),

    // address,y - Indexed by Y
    IndexedY(Expression),

    // (address) - Indirect (JMP only)
    Indirect(Expression),

    // (zp,x) - Indexed Indirect
    IndirectX(Expression),

    // (zp),y - Indirect Indexed
    IndirectY(Expression),
}
}

Addressing Mode Mapping

| Syntax | Operand Type | 6502 Mode |
|--------|--------------|-----------|
| (none) | None | Implied |
| a | Accumulator | Accumulator |
| #expr | Immediate(expr) | Immediate |
| expr | Address(expr) | Zero Page or Absolute |
| expr,x | IndexedX(expr) | Zero Page,X or Absolute,X |
| expr,y | IndexedY(expr) | Zero Page,Y or Absolute,Y |
| (expr) | Indirect(expr) | Indirect |
| (expr,x) | IndirectX(expr) | Indexed Indirect |
| (expr),y | IndirectY(expr) | Indirect Indexed |

Note: The distinction between Zero Page and Absolute is determined during code generation based on the expression value.
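
In its simplest form, that decision is a range check on the evaluated address. The sketch below is only an illustration of the idea; the `AddressMode` enum and `select_address_mode` function are hypothetical names, and the code generator built in a later chapter may decide differently.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum AddressMode {
    ZeroPage,
    Absolute,
}

/// Pick the shortest encoding for an already-evaluated address.
fn select_address_mode(value: u16) -> AddressMode {
    // Addresses 0x0000-0x00FF fit in a single operand byte.
    if value <= 0x00FF {
        AddressMode::ZeroPage
    } else {
        AddressMode::Absolute
    }
}
```

A real code generator has more to consider: some instructions have no zero page form at all (jmp, for example), and the value of a forward reference may not be known yet when the instruction's size has to be fixed.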

Expressions

Expressions represent numeric values that can be computed:

#![allow(unused)]
fn main() {
pub enum Expression {
    // Literal number: 0xFF, 255, 0b1010
    Number(i64),

    // Label or constant name
    Identifier(String),

    // Local label: .loop, @loop
    LocalIdentifier(String),

    // Binary operation: left op right
    Binary {
        left: Box<Expression>,
        op: BinaryOp,
        right: Box<Expression>,
    },

    // Unary operation: op operand
    Unary {
        op: UnaryOp,
        operand: Box<Expression>,
    },

    // Current address: $
    CurrentAddress,
}
}

Binary Operators

#![allow(unused)]
fn main() {
pub enum BinaryOp {
    Add,  // +
    Sub,  // -
    Mul,  // *
    Div,  // /
}
}

Unary Operators

#![allow(unused)]
fn main() {
pub enum UnaryOp {
    Neg,     // - (negation)
    LoByte,  // < (extract low byte)
    HiByte,  // > (extract high byte)
}
}

Expression Examples

| Source | AST |
|--------|-----|
| 42 | Number(42) |
| label | Identifier("label") |
| .loop | LocalIdentifier(".loop") |
| $ | CurrentAddress |
| 10 + 5 | Binary { left: Number(10), op: Add, right: Number(5) } |
| <0x1234 | Unary { op: LoByte, operand: Number(0x1234) } |
| >label | Unary { op: HiByte, operand: Identifier("label") } |
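
To show how such a tree turns back into a number, here is a small recursive evaluator. It is a sketch under assumptions, not the evaluator developed later in the tutorial: the enums are repeated in simplified form (local labels are left out), symbols are looked up in a plain `HashMap`, and the `eval` function with its `pc` parameter is a hypothetical interface.

```rust
use std::collections::HashMap;

enum BinaryOp { Add, Sub, Mul, Div }
enum UnaryOp { Neg, LoByte, HiByte }

enum Expression {
    Number(i64),
    Identifier(String),
    Binary { left: Box<Expression>, op: BinaryOp, right: Box<Expression> },
    Unary { op: UnaryOp, operand: Box<Expression> },
    CurrentAddress,
}

/// Evaluate `expr`; returns `None` for unknown symbols, overflow,
/// or division by zero. `pc` is the value substituted for `$`.
fn eval(expr: &Expression, symbols: &HashMap<String, i64>, pc: i64) -> Option<i64> {
    Some(match expr {
        Expression::Number(n) => *n,
        Expression::Identifier(name) => *symbols.get(name)?,
        Expression::CurrentAddress => pc,
        Expression::Binary { left, op, right } => {
            // Evaluate both children first, then combine them.
            let l = eval(left, symbols, pc)?;
            let r = eval(right, symbols, pc)?;
            match op {
                BinaryOp::Add => l.checked_add(r)?,
                BinaryOp::Sub => l.checked_sub(r)?,
                BinaryOp::Mul => l.checked_mul(r)?,
                BinaryOp::Div => l.checked_div(r)?,
            }
        }
        Expression::Unary { op, operand } => {
            let v = eval(operand, symbols, pc)?;
            match op {
                UnaryOp::Neg => v.checked_neg()?,
                UnaryOp::LoByte => v & 0xFF,
                UnaryOp::HiByte => (v >> 8) & 0xFF,
            }
        }
    })
}
```

With `label` bound to 0x8000, evaluating the tree for `>label` gives 0x80, and the tree for `<0x1234` gives 0x34.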

Data Values for .db

The .db directive can contain bytes or strings:

#![allow(unused)]
fn main() {
pub enum DataValue {
    Byte(Expression),
    String(String),
}
}

Example:

.db "Hello", 0x0A, 0

Parses to:

#![allow(unused)]
fn main() {
Db {
    values: vec![
        DataValue::String("Hello"),
        DataValue::Byte(Number(0x0A)),
        DataValue::Byte(Number(0)),
    ]
}
}
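
Once the expressions have been evaluated, turning these values into output bytes is a matter of flattening the list. The sketch below makes several assumptions: `DataValue::Byte` holds an already-evaluated `i64` instead of an `Expression`, strings are emitted as their raw UTF-8 bytes, values outside 0-255 are rejected, and `emit_db` is a hypothetical name.

```rust
use std::convert::TryFrom;

/// Simplified stand-in for the AST type: bytes are already evaluated.
enum DataValue {
    Byte(i64),
    String(String),
}

/// Flatten the values of one .db directive into output bytes.
fn emit_db(values: &[DataValue]) -> Result<Vec<u8>, String> {
    let mut out = Vec::new();
    for value in values {
        match value {
            DataValue::Byte(n) => {
                // Reject anything that does not fit in a single byte.
                let byte = u8::try_from(*n)
                    .map_err(|_| format!("value {} does not fit in a byte", n))?;
                out.push(byte);
            }
            // Strings contribute one byte per UTF-8 code unit.
            DataValue::String(s) => out.extend_from_slice(s.as_bytes()),
        }
    }
    Ok(out)
}
```

For the example above, this produces the five bytes of "Hello" followed by 0x0A and 0x00.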

Complete AST Example

Let’s trace through parsing this program:

.equ SCREEN 0x1000

.org 0x8000

start:
    lda #>SCREEN
    sta 0xFD
.loop:
    jmp .loop

.org 0xFFFC
.dw start

The AST would be:

#![allow(unused)]
fn main() {
Program {
    statements: [
        Directive(Equ {
            name: "SCREEN",
            value: Number(0x1000),
        }),
        Directive(Org {
            address: Number(0x8000),
        }),
        Label(LabelDef {
            name: "start",
            is_local: false,
        }),
        Instruction(InstructionStmt {
            mnemonic: LDA,
            operand: Some(Immediate(
                Unary { op: HiByte, operand: Identifier("SCREEN") }
            )),
        }),
        Instruction(InstructionStmt {
            mnemonic: STA,
            operand: Some(Address(Number(0xFD))),
        }),
        Label(LabelDef {
            name: ".loop",
            is_local: true,
        }),
        Instruction(InstructionStmt {
            mnemonic: JMP,
            operand: Some(Address(LocalIdentifier(".loop"))),
        }),
        Directive(Org {
            address: Number(0xFFFC),
        }),
        Directive(Dw {
            values: [Identifier("start")],
        }),
    ],
    source_file: Some("example.s"),
}
}

Design Principles

1. Preserve Source Information

Every statement node includes a Location so we can report errors with line and column numbers:

#![allow(unused)]
fn main() {
impl Statement {
    pub fn location(&self) -> &Location {
        match self {
            Statement::Label(l) => &l.location,
            Statement::Instruction(i) => &i.location,
            Statement::Directive(d) => d.location(),
        }
    }
}
}

2. Explicit Over Implicit

Rather than encoding addressing modes as strings or as a separate mode field, we give each mode its own enum variant so that the structure is clear:

#![allow(unused)]
fn main() {
// Good: Structure is explicit
Operand::IndirectY(Expression::Number(0x80))

// Bad: Structure is implicit
Operand { mode: "indirect_y", value: 0x80 }
}

3. Expressions Are Recursive

Using Box<Expression> allows expressions to be nested:

#![allow(unused)]
fn main() {
// Represents: (label + offset) * 2
Binary {
    left: Box::new(Binary {
        left: Box::new(Identifier("label")),
        op: Add,
        right: Box::new(Identifier("offset")),
    }),
    op: Mul,
    right: Box::new(Number(2)),
}
}

4. Helper Constructors

We provide convenient constructors:

#![allow(unused)]
fn main() {
impl Expression {
    pub fn binary(left: Expression, op: BinaryOp, right: Expression) -> Self {
        Expression::Binary {
            left: Box::new(left),
            op,
            right: Box::new(right),
        }
    }

    pub fn unary(op: UnaryOp, operand: Expression) -> Self {
        Expression::Unary {
            op,
            operand: Box::new(operand),
        }
    }
}
}

Summary

In this chapter, we designed an AST that:

  • Represents the complete structure of a ByteASM program
  • Distinguishes between labels, instructions, and directives
  • Encodes addressing modes through operand types
  • Supports recursive expressions with operators
  • Preserves source location for error reporting

In the next chapter, we’ll start building the parser that constructs this AST from tokens.



Chapter 4: Building the Parser - Structure

In this chapter, we’ll set up the infrastructure for our parser. The parser transforms tokens into an Abstract Syntax Tree.

Parser Design Pattern

We’ll use recursive descent parsing - a simple and intuitive approach where:

  • Each grammar rule becomes a function
  • Functions call each other to match nested structures
  • We look ahead at tokens to decide which rule to apply

This approach works well for assembly language because the grammar is simple and unambiguous.
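
The three points above can be seen in a toy example that is deliberately unrelated to ByteASM's real grammar. Everything here is an assumption made for illustration: the `Tok` enum, the `MiniParser` type, and the grammar itself, which uses parentheses for grouping and computes a value directly instead of building an AST.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum Tok { Num(i64), Plus, Star, LParen, RParen }

struct MiniParser<'a> {
    tokens: &'a [Tok],
    pos: usize,
}

impl<'a> MiniParser<'a> {
    /// Look at the next token without consuming it (one token of lookahead).
    fn peek(&self) -> Option<Tok> {
        self.tokens.get(self.pos).copied()
    }

    /// Consume and return the next token.
    fn bump(&mut self) -> Option<Tok> {
        let tok = self.peek()?;
        self.pos += 1;
        Some(tok)
    }

    // expr := term ('+' term)*
    fn expr(&mut self) -> Option<i64> {
        let mut value = self.term()?;
        while self.peek() == Some(Tok::Plus) {
            self.bump();
            value += self.term()?;
        }
        Some(value)
    }

    // term := factor ('*' factor)*
    fn term(&mut self) -> Option<i64> {
        let mut value = self.factor()?;
        while self.peek() == Some(Tok::Star) {
            self.bump();
            value *= self.factor()?;
        }
        Some(value)
    }

    // factor := NUMBER | '(' expr ')'
    fn factor(&mut self) -> Option<i64> {
        match self.bump()? {
            Tok::Num(n) => Some(n),
            Tok::LParen => {
                let value = self.expr()?;
                if self.bump()? == Tok::RParen { Some(value) } else { None }
            }
            _ => None,
        }
    }
}

/// Parse a whole token slice; `None` means a syntax error or leftover tokens.
fn parse_all(tokens: &[Tok]) -> Option<i64> {
    let mut parser = MiniParser { tokens, pos: 0 };
    let value = parser.expr()?;
    if parser.pos == tokens.len() { Some(value) } else { None }
}
```

Each grammar rule is one function, `factor` calls back into `expr` for nested input, and `peek` decides whether a loop continues. Operator precedence falls out of the call structure: `2 + 3 * 4` evaluates to 14 because `term` binds tighter than `expr`.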

The Parser State

Our parser maintains this state:

#![allow(unused)]
fn main() {
pub struct Parser<'a, 'b> {
    scanner: &'a mut Scanner<'b>,
    source: &'b str,
    current: Token,
    previous: Token,
    errors: Vec<ParseError>,
}
}

Let’s understand each field:

Scanner Reference

scanner: &'a mut Scanner<'b>,

The parser doesn’t store tokens in advance. It asks the scanner for tokens one at a time. This is more memory efficient and allows for streaming parsing.

Source Reference

source: &'b str,

We keep a reference to the original source so we can extract text for identifiers and error messages.

Current and Previous Tokens

current: Token,
previous: Token,

  • current: The token we’re about to process
  • previous: The token we just processed

This gives us one token of lookahead, which is enough for ByteASM’s grammar.

Error Collection

errors: Vec<ParseError>,

Rather than stopping at the first error, we collect errors and continue parsing. This lets us report multiple problems in one pass.

Core Parser Utilities

Creating the Parser

#![allow(unused)]
fn main() {
impl<'a, 'b> Parser<'a, 'b> {
    pub fn new(scanner: &'a mut Scanner<'b>, source: &'b str) -> Self {
        // Get the first token
        let current = scanner.scan_token().unwrap_or_else(|_| Token {
            kind: TokenKind::EOF,
            value: None,
            location: Location::default(),
        });

        Self {
            scanner,
            source,
            current: current.clone(),
            previous: current,
            errors: Vec::new(),
        }
    }
}
}

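We immediately scan the first token so current is ready. Note that a scanner error is replaced with an EOF token here, so a failed scan ends parsing early instead of being reported.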

Advancing Through Tokens

#![allow(unused)]
fn main() {
pub fn advance(&mut self) -> Token {
    let fallback_location = self.current.location;

    // Move current to previous, scan new current
    self.previous = std::mem::replace(
        &mut self.current,
        self.scanner.scan_token().unwrap_or_else(|_| Token {
            kind: TokenKind::EOF,
            value: None,
            location: fallback_location,
        }),
    );

    self.previous.clone()
}
}

advance() returns the old current token (now previous) and loads the next token.
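To see the current/previous hand-off in isolation, here is a minimal sketch that swaps the scanner for a plain vector of tokens. The Tok enum and Cursor struct are hypothetical stand-ins invented for this example; the real Token also carries a value and a location.

```rust
// Hypothetical, simplified token kind (not the tutorial's Token type).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Tok {
    Ident,
    Colon,
    NewLine,
    Eof,
}

// Stand-in for Parser: a token source plus one token of lookahead.
struct Cursor {
    tokens: std::vec::IntoIter<Tok>,
    current: Tok,
    previous: Tok,
}

impl Cursor {
    fn new(tokens: Vec<Tok>) -> Self {
        let mut tokens = tokens.into_iter();
        // Prime the lookahead, just like Parser::new does.
        let current = tokens.next().unwrap_or(Tok::Eof);
        Cursor { tokens, current, previous: current }
    }

    // Move current into previous, load the next token, return the old current.
    fn advance(&mut self) -> Tok {
        let next = self.tokens.next().unwrap_or(Tok::Eof);
        self.previous = std::mem::replace(&mut self.current, next);
        self.previous
    }

    fn check(&self, kind: Tok) -> bool {
        self.current == kind
    }
}
```

Once the input is exhausted, advance() keeps returning Eof, which is why the parse loop can rely on is_at_end() alone to terminate.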

Checking Token Types

#![allow(unused)]
fn main() {
pub fn check(&self, kind: TokenKind) -> bool {
    self.current.kind == kind
}

pub fn is_at_end(&self) -> bool {
    self.current.kind == TokenKind::EOF
}
}

Expecting Specific Tokens

#![allow(unused)]
fn main() {
pub fn expect(&mut self, kind: TokenKind, expected: &str) -> ParseResult<Token> {
    if self.check(kind) {
        Ok(self.advance())
    } else {
        Err(ParseError::UnexpectedToken {
            expected: expected.to_string(),
            found: self.current.kind,
            location: self.current.location,
        })
    }
}
}

expect() is used when we know what must come next. If the token matches, it is consumed and returned; if not, we return an UnexpectedToken error and leave the token in place.

Error Types

Our parser can produce these errors:

#![allow(unused)]
fn main() {
pub enum ParseError {
    UnexpectedToken {
        expected: String,
        found: TokenKind,
        location: Location,
    },

    UnexpectedEof {
        location: Location,
    },

    InvalidOperand {
        message: String,
        location: Location,
    },

    InvalidExpression {
        message: String,
        location: Location,
    },

    InvalidDirective {
        message: String,
        location: Location,
    },

    InvalidLabel {
        message: String,
        location: Location,
    },
}
}

Each error includes a Location for precise error reporting.

Getting Error Location

#![allow(unused)]
fn main() {
impl ParseError {
    pub fn location(&self) -> &Location {
        match self {
            ParseError::UnexpectedToken { location, .. } => location,
            ParseError::UnexpectedEof { location } => location,
            ParseError::InvalidOperand { location, .. } => location,
            ParseError::InvalidExpression { location, .. } => location,
            ParseError::InvalidDirective { location, .. } => location,
            ParseError::InvalidLabel { location, .. } => location,
        }
    }
}
}

Error Recovery

When we encounter an error, we don’t want to stop immediately. We use panic mode recovery - skip tokens until we find a safe point to continue.

For assembly language, the safe point is the next line:

#![allow(unused)]
fn main() {
fn synchronize(&mut self) {
    while !self.is_at_end() {
        if self.check(TokenKind::NewLine) {
            self.advance();
            return;
        }
        self.advance();
    }
}
}

This means:

  1. When an error occurs, add it to errors
  2. Call synchronize() to skip to the next line
  3. Continue parsing
  4. At the end, return all collected errors
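The four steps above can be sketched on their own, without the rest of the parser. This is a simplified illustration, not the tutorial's code: the Tok enum, the Junk token, and parse_all are made up for the example, and a "line" here is just a run of words.

```rust
// Hypothetical tokens: Junk stands for anything the parser cannot accept.
#[derive(Debug, Clone, PartialEq)]
enum Tok {
    Word(String),
    Junk,
    NewLine,
}

// Parses every line, collecting one error per bad line instead of stopping.
// Returns (accepted words, error messages).
fn parse_all(tokens: &[Tok]) -> (Vec<String>, Vec<String>) {
    let mut pos = 0;
    let mut parsed = Vec::new();
    let mut errors = Vec::new();

    while pos < tokens.len() {
        match &tokens[pos] {
            Tok::NewLine => pos += 1, // blank line or end of a good line
            Tok::Word(w) => {
                parsed.push(w.clone());
                pos += 1;
            }
            Tok::Junk => {
                // Step 1: record the error.
                errors.push(format!("unexpected token at index {}", pos));
                // Step 2: synchronize by skipping to just past the next newline.
                while pos < tokens.len() && tokens[pos] != Tok::NewLine {
                    pos += 1;
                }
                if pos < tokens.len() {
                    pos += 1;
                }
                // Step 3: the outer loop simply continues with the next line.
            }
        }
    }

    // Step 4: hand back everything we collected.
    (parsed, errors)
}
```

Everything on a line after the first bad token is discarded, so one mistake produces one error rather than a cascade of follow-up errors.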

The Main Parse Loop

#![allow(unused)]
fn main() {
pub fn parse(&mut self) -> Result<Program, Vec<ParseError>> {
    let mut program = Program::new();

    while !self.is_at_end() {
        // Skip blank lines
        self.skip_empty_lines();

        if self.is_at_end() {
            break;
        }

        // Try to parse a line
        match self.parse_line() {
            Ok(statements) => {
                program.statements.extend(statements);
            }
            Err(e) => {
                self.errors.push(e);
                self.synchronize();
            }
        }
    }

    if self.errors.is_empty() {
        Ok(program)
    } else {
        Err(std::mem::take(&mut self.errors))
    }
}
}

Skipping Empty Lines

#![allow(unused)]
fn main() {
fn skip_empty_lines(&mut self) {
    while self.check(TokenKind::NewLine) || self.check(TokenKind::Comment) {
        self.advance();
    }
}
}

Line Structure

A single line can contain:

  • Just a label: main:
  • Just an instruction: nop
  • Just a directive: .org 0x8000
  • A label followed by an instruction: loop: dex

Our parse_line() function handles all these cases:

#![allow(unused)]
fn main() {
fn parse_line(&mut self) -> ParseResult<Vec<Statement>> {
    let mut statements = Vec::new();

    // Check for label
    if self.is_label_start() {
        statements.push(self.parse_label()?);
    }

    // Check for instruction or directive
    if self.check(TokenKind::Instruction) {
        statements.push(self.parse_instruction()?);
    } else if self.check(TokenKind::Directive) {
        statements.push(self.parse_directive()?);
    }

    // Expect end of line
    if !self.is_at_end() &&
       !self.check(TokenKind::NewLine) &&
       !self.check(TokenKind::Comment)
    {
        return Err(ParseError::UnexpectedToken {
            expected: "end of line".to_string(),
            found: self.current.kind,
            location: self.current.location,
        });
    }

    // Consume newline if present
    if self.check(TokenKind::NewLine) {
        self.advance();
    }

    Ok(statements)
}
}

Detecting Labels

A label is an identifier (or local label) followed by a colon. With one token of lookahead we can only test the first half of that rule here; parse_label() (next chapter) reports an error if the colon is missing:

#![allow(unused)]
fn main() {
fn is_label_start(&self) -> bool {
    // Only the first token is checked here;
    // parse_label() verifies that a colon follows.
    self.check(TokenKind::Identifier) || self.check(TokenKind::LocalLabel)
}
}

Result Type

#![allow(unused)]
fn main() {
pub type ParseResult<T> = Result<T, ParseError>;
}

Public API

The parser module exposes a simple function:

#![allow(unused)]
fn main() {
pub fn parse(source: &str) -> Result<Program, Vec<ParseError>> {
    let mut scanner = Scanner::new(source);
    let mut parser = Parser::new(&mut scanner, source);
    parser.parse()
}
}

Usage:

#![allow(unused)]
fn main() {
let source = ".org 0x8000\nlda #0x42";
let program = byte_asm::parser::parse(source)?;
}

Summary

In this chapter, we set up the parser infrastructure:

  • Parser state: scanner reference, current/previous tokens, error list
  • Core utilities: advance(), check(), expect()
  • Error types: with location information for each error
  • Error recovery: synchronize on newlines to continue after errors
  • Main loop: parse lines until EOF, collecting errors

In the next chapter, we’ll implement the actual parsing logic for labels, instructions, and directives.



Chapter 5: Building the Parser - Implementation

In this chapter, we’ll implement the complete parser for ByteASM, building on the infrastructure from Chapter 4.

Parsing Labels

Labels are identifiers followed by a colon:

#![allow(unused)]
fn main() {
fn parse_label(&mut self) -> ParseResult<Statement> {
    let token = self.advance();
    let name = token.text(self.source).to_string();
    let is_local = token.kind == TokenKind::LocalLabel;
    let location = token.location;

    // Expect colon after label name
    if !self.check(TokenKind::Colon) {
        return Err(ParseError::InvalidLabel {
            message: "expected ':' after label name".to_string(),
            location,
        });
    }
    self.advance(); // consume colon

    Ok(Statement::Label(LabelDef {
        name,
        is_local,
        location,
    }))
}
}

Examples

Input     Result
──────────────────────────────────────────────────────
main:     LabelDef { name: "main", is_local: false }
.loop:    LabelDef { name: ".loop", is_local: true }
@temp:    LabelDef { name: "@temp", is_local: true }

Parsing Instructions

Instructions have an optional operand:

#![allow(unused)]
fn main() {
fn parse_instruction(&mut self) -> ParseResult<Statement> {
    let token = self.advance();
    let mnemonic = token.mnemonic().unwrap();
    let location = token.location;

    // Check if there's an operand
    let operand = if self.has_operand() {
        Some(self.parse_operand()?)
    } else {
        None
    };

    Ok(Statement::Instruction(InstructionStmt {
        mnemonic,
        operand,
        location,
    }))
}

fn has_operand(&self) -> bool {
    matches!(
        self.current.kind,
        TokenKind::Hash
            | TokenKind::OpenParen
            | TokenKind::Identifier
            | TokenKind::LocalLabel
            | TokenKind::Number
            | TokenKind::Dollar
            | TokenKind::LessThan
            | TokenKind::GreaterThan
            | TokenKind::Register
    )
}
}

Parsing Operands

The operand determines the addressing mode. Here’s the decision tree:

Token           → Operand Type
──────────────────────────────────
#               → Immediate
a (register)    → Accumulator
(               → Indirect, IndirectX, or IndirectY
other           → Address, IndexedX, or IndexedY

#![allow(unused)]
fn main() {
fn parse_operand(&mut self) -> ParseResult<Operand> {
    // Immediate: #expr
    if self.check(TokenKind::Hash) {
        self.advance();
        let expr = self.parse_expression()?;
        return Ok(Operand::Immediate(expr));
    }

    // Accumulator: a
    if self.check(TokenKind::Register) {
        let text = self.current.text(self.source).to_lowercase();
        if text == "a" {
            self.advance();
            return Ok(Operand::Accumulator);
        }
    }

    // Indirect modes: (...)
    if self.check(TokenKind::OpenParen) {
        return self.parse_indirect_operand();
    }

    // Address or indexed: expr or expr,x or expr,y
    self.parse_address_operand()
}
}

Parsing Address Operands

#![allow(unused)]
fn main() {
fn parse_address_operand(&mut self) -> ParseResult<Operand> {
    let expr = self.parse_expression()?;

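    // Check for indexing
    if self.check(TokenKind::Comma) {
        self.advance();

        // A comma must be followed by an index register; otherwise
        // a stray comma (e.g. "lda addr,") would be silently accepted.
        if !self.check(TokenKind::Register) {
            return Err(ParseError::InvalidOperand {
                message: "expected 'x' or 'y' after ','".to_string(),
                location: self.current.location,
            });
        }

        let reg = self.current.text(self.source).to_lowercase();
        self.advance();

        return match reg.as_str() {
            "x" => Ok(Operand::IndexedX(expr)),
            "y" => Ok(Operand::IndexedY(expr)),
            _ => Err(ParseError::InvalidOperand {
                message: format!("expected 'x' or 'y', found '{}'", reg),
                location: self.previous.location,
            }),
        };
    }

    Ok(Operand::Address(expr))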
}
}

Parsing Indirect Operands

Indirect operands have three forms:

  • (addr) - Indirect
  • (zp,x) - Indexed Indirect
  • (zp),y - Indirect Indexed

#![allow(unused)]
fn main() {
fn parse_indirect_operand(&mut self) -> ParseResult<Operand> {
    self.advance(); // consume '('

    let expr = self.parse_expression()?;

    // (zp,x) - Indexed Indirect
    if self.check(TokenKind::Comma) {
        self.advance();

        if self.check(TokenKind::Register) {
            let reg = self.current.text(self.source).to_lowercase();
            self.advance();

            if reg != "x" {
                return Err(ParseError::InvalidOperand {
                    message: "indexed indirect only supports X register".to_string(),
                    location: self.previous.location,
                });
            }

            self.expect(TokenKind::CloseParen, "')'")?;
            return Ok(Operand::IndirectX(expr));
        }
    }

    self.expect(TokenKind::CloseParen, "')'")?;

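    // (addr),y - Indirect Indexed
    if self.check(TokenKind::Comma) {
        self.advance();

        // After "(addr)," only the Y register is valid; reject anything
        // else instead of silently treating the operand as plain indirect.
        if !self.check(TokenKind::Register) {
            return Err(ParseError::InvalidOperand {
                message: "expected 'y' after ','".to_string(),
                location: self.current.location,
            });
        }

        let reg = self.current.text(self.source).to_lowercase();
        self.advance();

        if reg != "y" {
            return Err(ParseError::InvalidOperand {
                message: "indirect indexed only supports Y register".to_string(),
                location: self.previous.location,
            });
        }

        return Ok(Operand::IndirectY(expr));
    }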

    // (addr) - Plain Indirect
    Ok(Operand::Indirect(expr))
}
}
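The three indirect shapes are easy to mix up, so here is a compact way to see them side by side. This sketch deliberately uses Rust slice patterns over a pre-collected token slice rather than recursive descent, and it only accepts a bare number as the address. Tok, Operand, and parse_indirect are hypothetical names for this example only.

```rust
// Hypothetical, simplified tokens: X and Y stand for the index registers.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Tok {
    Open,
    Close,
    Comma,
    Num(u16),
    X,
    Y,
}

#[derive(Debug, PartialEq)]
enum Operand {
    Indirect(u16),  // (addr)
    IndirectX(u16), // (zp,x): index first, then dereference
    IndirectY(u16), // (zp),y: dereference first, then index
}

fn parse_indirect(tokens: &[Tok]) -> Result<Operand, String> {
    match tokens {
        [Tok::Open, Tok::Num(n), Tok::Close] => Ok(Operand::Indirect(*n)),
        [Tok::Open, Tok::Num(n), Tok::Comma, Tok::X, Tok::Close] => {
            Ok(Operand::IndirectX(*n))
        }
        [Tok::Open, Tok::Num(n), Tok::Close, Tok::Comma, Tok::Y] => {
            Ok(Operand::IndirectY(*n))
        }
        // The register position decides which register is legal.
        [Tok::Open, Tok::Num(_), Tok::Comma, Tok::Y, Tok::Close] => {
            Err("indexed indirect only supports X register".to_string())
        }
        [Tok::Open, Tok::Num(_), Tok::Close, Tok::Comma, Tok::X] => {
            Err("indirect indexed only supports Y register".to_string())
        }
        _ => Err("malformed indirect operand".to_string()),
    }
}
```

The only difference between IndirectX and IndirectY is whether the register sits inside or outside the parentheses, which is exactly what the two comma checks in parse_indirect_operand() distinguish.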

Parsing Expressions

Expressions follow standard precedence rules:

  • *, / bind tighter than +, -
  • Unary operators (-, <, >) bind tightest

Expression Grammar

expression   → additive
additive     → multiplicative ( ('+' | '-') multiplicative )*
multiplicative → unary ( ('*' | '/') unary )*
unary        → ('-' | '<' | '>') unary | primary
primary      → NUMBER | IDENTIFIER | LOCAL_LABEL | '$' | '(' expression ')'

Implementation

#![allow(unused)]
fn main() {
pub fn parse_expression(&mut self) -> ParseResult<Expression> {
    self.parse_additive()
}

fn parse_additive(&mut self) -> ParseResult<Expression> {
    let mut left = self.parse_multiplicative()?;

    while self.check(TokenKind::Plus) || self.check(TokenKind::Minus) {
        let op = if self.check(TokenKind::Plus) {
            self.advance();
            BinaryOp::Add
        } else {
            self.advance();
            BinaryOp::Sub
        };

        let right = self.parse_multiplicative()?;
        left = Expression::binary(left, op, right);
    }

    Ok(left)
}

fn parse_multiplicative(&mut self) -> ParseResult<Expression> {
    let mut left = self.parse_unary()?;

    while self.check(TokenKind::Star) || self.check(TokenKind::Slash) {
        let op = if self.check(TokenKind::Star) {
            self.advance();
            BinaryOp::Mul
        } else {
            self.advance();
            BinaryOp::Div
        };

        let right = self.parse_unary()?;
        left = Expression::binary(left, op, right);
    }

    Ok(left)
}

fn parse_unary(&mut self) -> ParseResult<Expression> {
    if self.check(TokenKind::Minus) {
        self.advance();
        let operand = self.parse_unary()?;
        return Ok(Expression::unary(UnaryOp::Neg, operand));
    }

    if self.check(TokenKind::LessThan) {
        self.advance();
        let operand = self.parse_unary()?;
        return Ok(Expression::unary(UnaryOp::LoByte, operand));
    }

    if self.check(TokenKind::GreaterThan) {
        self.advance();
        let operand = self.parse_unary()?;
        return Ok(Expression::unary(UnaryOp::HiByte, operand));
    }

    self.parse_primary()
}

fn parse_primary(&mut self) -> ParseResult<Expression> {
    // Number literal
    if self.check(TokenKind::Number) {
        let token = self.advance();
        let value = token.number().unwrap_or(0);
        return Ok(Expression::Number(value as i64));
    }

    // Identifier
    if self.check(TokenKind::Identifier) {
        let token = self.advance();
        let name = token.text(self.source).to_string();
        return Ok(Expression::Identifier(name));
    }

    // Local label
    if self.check(TokenKind::LocalLabel) {
        let token = self.advance();
        let name = token.text(self.source).to_string();
        return Ok(Expression::LocalIdentifier(name));
    }

    // Current address ($)
    if self.check(TokenKind::Dollar) {
        self.advance();
        return Ok(Expression::CurrentAddress);
    }

    // Parenthesized expression
    if self.check(TokenKind::OpenParen) {
        self.advance();
        let expr = self.parse_expression()?;
        self.expect(TokenKind::CloseParen, "')'")?;
        return Ok(expr);
    }

    Err(ParseError::InvalidExpression {
        message: format!("expected expression, found {:?}", self.current.kind),
        location: self.current.location,
    })
}
}
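To check that the grammar really produces the precedence we want, it helps to run it on numbers. The sketch below follows the same four levels but evaluates directly instead of building an AST, and it supports only number literals. The Tok enum, the Eval struct, and the choice to compute < as the low byte and > as the high byte of a 16-bit value are assumptions made for this example.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Tok {
    Num(i64),
    Plus,
    Minus,
    Star,
    Slash,
    Lt,
    Gt,
    LParen,
    RParen,
}

struct Eval<'a> {
    toks: &'a [Tok],
    pos: usize,
}

impl<'a> Eval<'a> {
    fn peek(&self) -> Option<Tok> {
        self.toks.get(self.pos).copied()
    }

    // additive → multiplicative ( ('+' | '-') multiplicative )*
    fn additive(&mut self) -> Result<i64, String> {
        let mut left = self.multiplicative()?;
        while let Some(op @ (Tok::Plus | Tok::Minus)) = self.peek() {
            self.pos += 1;
            let right = self.multiplicative()?;
            left = if op == Tok::Plus { left + right } else { left - right };
        }
        Ok(left)
    }

    // multiplicative → unary ( ('*' | '/') unary )*
    fn multiplicative(&mut self) -> Result<i64, String> {
        let mut left = self.unary()?;
        while let Some(op @ (Tok::Star | Tok::Slash)) = self.peek() {
            self.pos += 1;
            let right = self.unary()?;
            if op == Tok::Slash && right == 0 {
                return Err("division by zero".to_string());
            }
            left = if op == Tok::Star { left * right } else { left / right };
        }
        Ok(left)
    }

    // unary → ('-' | '<' | '>') unary | primary
    fn unary(&mut self) -> Result<i64, String> {
        match self.peek() {
            Some(Tok::Minus) => { self.pos += 1; Ok(-self.unary()?) }
            Some(Tok::Lt) => { self.pos += 1; Ok(self.unary()? & 0xFF) }
            Some(Tok::Gt) => { self.pos += 1; Ok((self.unary()? >> 8) & 0xFF) }
            _ => self.primary(),
        }
    }

    // primary → NUMBER | '(' expression ')'
    fn primary(&mut self) -> Result<i64, String> {
        match self.peek() {
            Some(Tok::Num(n)) => { self.pos += 1; Ok(n) }
            Some(Tok::LParen) => {
                self.pos += 1;
                let value = self.additive()?;
                if self.peek() != Some(Tok::RParen) {
                    return Err("expected ')'".to_string());
                }
                self.pos += 1;
                Ok(value)
            }
            other => Err(format!("expected expression, found {:?}", other)),
        }
    }
}

fn eval(toks: &[Tok]) -> Result<i64, String> {
    let mut e = Eval { toks, pos: 0 };
    let value = e.additive()?;
    if e.pos != toks.len() {
        return Err("unexpected trailing tokens".to_string());
    }
    Ok(value)
}
```

Because unary operators bind tightest, >0x1234 + 1 takes the high byte first and then adds one; write >(0x1234 + 1) to take the high byte of the sum.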

Parsing Directives

Each directive has its own syntax:

#![allow(unused)]
fn main() {
fn parse_directive(&mut self) -> ParseResult<Statement> {
    let token = self.advance();
    let directive = token.directive().unwrap();
    let location = token.location;

    match directive {
        Directive::ORG => {
            let address = self.parse_expression()?;
            Ok(Statement::Directive(DirectiveStmt::Org { address, location }))
        }

        Directive::DB => {
            let values = self.parse_db_values()?;
            Ok(Statement::Directive(DirectiveStmt::Db { values, location }))
        }

        Directive::DW => {
            let values = self.parse_expression_list()?;
            Ok(Statement::Directive(DirectiveStmt::Dw { values, location }))
        }

        Directive::EQU => {
            // .equ NAME value
            if !self.check(TokenKind::Identifier) {
                return Err(ParseError::InvalidDirective {
                    message: "expected identifier after .equ".to_string(),
                    location: self.current.location,
                });
            }
            let name = self.advance().text(self.source).to_string();
            let value = self.parse_expression()?;
            Ok(Statement::Directive(DirectiveStmt::Equ { name, value, location }))
        }

        Directive::INCLUDE => {
            // .include "filename"
            if !self.check(TokenKind::String) {
                return Err(ParseError::InvalidDirective {
                    message: "expected string after .include".to_string(),
                    location: self.current.location,
                });
            }
            let path = self.advance().string().unwrap_or("").to_string();
            Ok(Statement::Directive(DirectiveStmt::Include { path, location }))
        }
    }
}
}

Parsing .db Values

The .db directive accepts bytes and strings:

#![allow(unused)]
fn main() {
fn parse_db_values(&mut self) -> ParseResult<Vec<DataValue>> {
    let mut values = Vec::new();

    loop {
        if self.check(TokenKind::String) {
            let token = self.advance();
            let s = token.string().unwrap_or("").to_string();
            values.push(DataValue::String(s));
        } else if can_start_expression(self.current.kind) {
            let expr = self.parse_expression()?;
            values.push(DataValue::Byte(expr));
        } else {
            break;
        }

        if !self.check(TokenKind::Comma) {
            break;
        }
        self.advance(); // consume comma
    }

    if values.is_empty() {
        return Err(ParseError::InvalidDirective {
            message: "expected at least one value for .db".to_string(),
            location: self.current.location,
        });
    }

    Ok(values)
}
}

Parsing Expression Lists

Used by .dw:

#![allow(unused)]
fn main() {
pub fn parse_expression_list(&mut self) -> ParseResult<Vec<Expression>> {
    let mut exprs = vec![self.parse_expression()?];

    while self.check(TokenKind::Comma) {
        self.advance();
        exprs.push(self.parse_expression()?);
    }

    Ok(exprs)
}
}

Local Label Resolution

Local labels are scoped to their parent global label. When parsing:

main:
.loop:
    bne .loop

other:
.loop:          ; different from main.loop
    bne .loop

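Local labels like .loop are qualified with the name of the most recent global label:

  • First .loop → main.loop
  • Second .loop → other.loop

The parser only records the label name and its is_local flag. Tracking the current global label and qualifying local names is handled in the symbol table (next chapter), not the parser.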

Complete Parsing Example

Let’s trace parsing:

.org 0x8000
start:
    lda (0x80),y

  1. Parse .org 0x8000

    • Token: Directive(ORG)
    • Parse expression: Number(0x8000)
    • Result: Directive(Org { address: Number(32768) })
  2. Parse start:

    • Token: Identifier
    • Text: “start”
    • Token: Colon
    • Result: Label(LabelDef { name: "start", is_local: false })
  3. Parse lda (0x80),y

    • Token: Instruction(LDA)
    • Has operand: yes (starts with ()
    • Parse indirect operand:
      • Token: OpenParen
      • Parse expression: Number(0x80)
      • Token: CloseParen
      • Token: Comma
      • Token: Register (y)
    • Result: Instruction(InstructionStmt { mnemonic: LDA, operand: IndirectY(Number(128)) })

Summary

In this chapter, we implemented:

  • Label parsing: identifier + colon, detecting local labels
  • Instruction parsing: mnemonic + optional operand
  • Operand parsing: immediate, accumulator, address, indexed, indirect modes
  • Expression parsing: with operator precedence
  • Directive parsing: .org, .db, .dw, .equ, .include

The parser now produces a complete AST from source code. In the next chapter, we’ll build the symbol table to track labels and constants.



Chapter 6: The Symbol Table

In this chapter, we’ll build the symbol table - the data structure that tracks labels and constants during assembly.

What Symbols Do We Track?

The symbol table stores:

  1. Labels: Names that refer to addresses in the program

    • main: → address where code follows
    • .loop: → local label within a function
  2. Constants: Named values defined with .equ

    • .equ SCREEN 0x1000 → SCREEN = 4096

Symbol Structure

#![allow(unused)]
fn main() {
pub struct Symbol {
    pub name: String,
    pub value: SymbolValue,
    pub defined_at: Location,
    pub referenced: bool,
}

pub enum SymbolValue {
    Address(u16),      // Label pointing to an address
    Constant(i64),     // Constant value from .equ
    Undefined,         // Forward reference not yet resolved
}
}

Symbol Constructors

#![allow(unused)]
fn main() {
impl Symbol {
    pub fn address(name: impl Into<String>, address: u16, location: Location) -> Self {
        Self {
            name: name.into(),
            value: SymbolValue::Address(address),
            defined_at: location,
            referenced: false,
        }
    }

    pub fn constant(name: impl Into<String>, value: i64, location: Location) -> Self {
        Self {
            name: name.into(),
            value: SymbolValue::Constant(value),
            defined_at: location,
            referenced: false,
        }
    }

    pub fn is_defined(&self) -> bool {
        !matches!(self.value, SymbolValue::Undefined)
    }

    pub fn numeric_value(&self) -> Option<i64> {
        match self.value {
            SymbolValue::Address(addr) => Some(addr as i64),
            SymbolValue::Constant(val) => Some(val),
            SymbolValue::Undefined => None,
        }
    }
}
}

The Symbol Table

#![allow(unused)]
fn main() {
pub struct SymbolTable {
    symbols: HashMap<String, Symbol>,
    current_parent: Option<String>,
}
}

The current_parent field tracks the most recent global label, which is used to resolve local labels.

Basic Operations

#![allow(unused)]
fn main() {
impl SymbolTable {
    pub fn new() -> Self {
        Self {
            symbols: HashMap::new(),
            current_parent: None,
        }
    }

    pub fn set_parent(&mut self, parent: Option<String>) {
        self.current_parent = parent;
    }

    pub fn parent(&self) -> Option<&str> {
        self.current_parent.as_deref()
    }
}
}

Defining Symbols

#![allow(unused)]
fn main() {
pub fn define(&mut self, symbol: Symbol) -> Result<(), CodeGenError> {
    let name = symbol.name.clone();
    let location = symbol.defined_at;

    if let Some(existing) = self.symbols.get(&name) {
        // If existing symbol is undefined (forward reference), update it
        if !existing.is_defined() {
            self.symbols.insert(name, symbol);
            return Ok(());
        }

        // Already defined - that's an error
        return Err(CodeGenError::DuplicateSymbol {
            name,
            first: existing.defined_at,
            second: location,
        });
    }

    self.symbols.insert(name, symbol);
    Ok(())
}
}
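Here is a stripped-down sketch of the duplicate check that can be run on its own. It leaves out the Undefined case and uses a plain line number in place of Location; the Table and Duplicate types are hypothetical names for this example, not part of the assembler.

```rust
use std::collections::HashMap;

// Hypothetical error type: remembers both definition sites.
#[derive(Debug, PartialEq)]
struct Duplicate {
    name: String,
    first_line: u32,
    second_line: u32,
}

// Maps a symbol name to (address, line where it was defined).
#[derive(Default)]
struct Table {
    symbols: HashMap<String, (u16, u32)>,
}

impl Table {
    fn define(&mut self, name: &str, address: u16, line: u32) -> Result<(), Duplicate> {
        // An existing entry means the name was already defined: report both sites.
        if let Some(&(_, first_line)) = self.symbols.get(name) {
            return Err(Duplicate {
                name: name.to_string(),
                first_line,
                second_line: line,
            });
        }
        self.symbols.insert(name.to_string(), (address, line));
        Ok(())
    }
}
```

Note that a rejected redefinition leaves the first definition untouched, so later lookups still resolve to the original address.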

Convenience Methods

#![allow(unused)]
fn main() {
pub fn define_label(
    &mut self,
    name: impl Into<String>,
    address: u16,
    location: Location,
) -> Result<(), CodeGenError> {
    self.define(Symbol::address(name, address, location))
}

pub fn define_constant(
    &mut self,
    name: impl Into<String>,
    value: i64,
    location: Location,
) -> Result<(), CodeGenError> {
    self.define(Symbol::constant(name, value, location))
}
}

Looking Up Symbols

#![allow(unused)]
fn main() {
pub fn lookup(&self, name: &str) -> Option<&Symbol> {
    // Try direct lookup
    if let Some(sym) = self.symbols.get(name) {
        return Some(sym);
    }

    // For local labels, try with parent prefix
    if (name.starts_with('.') || name.starts_with('@'))
        && self.current_parent.is_some()
    {
        let qualified = self.qualify_local_label(name);
        return self.symbols.get(&qualified);
    }

    None
}

pub fn lookup_value(&self, name: &str) -> Option<i64> {
    self.lookup(name).and_then(|s| s.numeric_value())
}

pub fn is_defined(&self, name: &str) -> bool {
    self.lookup(name)
        .map(|s| s.is_defined())
        .unwrap_or(false)
}
}

Local Label Handling

Local labels are scoped to their parent global label:

#![allow(unused)]
fn main() {
pub fn qualify_local_label(&self, name: &str) -> String {
    if let Some(parent) = &self.current_parent {
        if name.starts_with('.') {
            // .loop -> parent.loop
            format!("{}{}", parent, name)
        } else if name.starts_with('@') {
            // @loop -> parent.loop
            format!("{}.{}", parent, &name[1..])
        } else {
            name.to_string()
        }
    } else {
        name.to_string()
    }
}
}

Example

main:           ; current_parent = "main"
.loop:          ; stored as "main.loop"
    bne .loop   ; resolved to "main.loop"

other:          ; current_parent = "other"
.loop:          ; stored as "other.loop"
    bne .loop   ; resolved to "other.loop"

When we encounter main:, we call set_parent(Some("main")). When we encounter .loop:, we store it as main.loop. When we reference .loop, we look up main.loop.
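The same walk-through as a small runnable sketch. The qualify function mirrors the rules of qualify_local_label() as a free function; collect_labels and its (name, address) input are hypothetical helpers invented for this example.

```rust
use std::collections::HashMap;

// Same rules as qualify_local_label(), with the parent passed in explicitly.
fn qualify(parent: Option<&str>, name: &str) -> String {
    match parent {
        // .loop -> parent.loop
        Some(p) if name.starts_with('.') => format!("{}{}", p, name),
        // @loop -> parent.loop
        Some(p) if name.starts_with('@') => format!("{}.{}", p, &name[1..]),
        _ => name.to_string(),
    }
}

// Walks label definitions in source order, tracking the current global label.
fn collect_labels(labels: &[(&str, u16)]) -> HashMap<String, u16> {
    let mut table = HashMap::new();
    let mut parent: Option<String> = None;

    for &(name, addr) in labels {
        let is_local = name.starts_with('.') || name.starts_with('@');
        if !is_local {
            // A global label becomes the parent of the locals that follow.
            parent = Some(name.to_string());
        }
        table.insert(qualify(parent.as_deref(), name), addr);
    }
    table
}
```

One consequence of these rules: .done and @done under the same parent both qualify to parent.done, so they name the same symbol.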

The Forward Reference Problem

Consider this code:

    jmp end     ; 'end' not defined yet!
    nop
end:
    rts

When we encounter jmp end, the label end hasn’t been defined yet. This is called a forward reference.

The Two-Pass Solution

We solve this with two passes:

  1. Pass 1: Scan through the code, recording where each label is defined
  2. Pass 2: Generate code, now that all labels are known

In Pass 1, when we see jmp end:

  • We don’t know end’s address
  • We just calculate that this instruction takes 3 bytes
  • We move on

In Pass 2, when we generate code for jmp end:

  • We look up end in the symbol table
  • We now have its address
  • We emit the correct bytes
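The whole idea fits in a few lines if we restrict ourselves to the three instructions in the example. This is a minimal sketch, not the assembler we build in the next chapter: the Line enum and the assemble function are hypothetical, and the opcodes are the standard 6502 encodings for JMP absolute (4C), NOP (EA), and RTS (60).

```rust
use std::collections::HashMap;

// Hypothetical mini-AST covering only what the example needs.
enum Line<'a> {
    Label(&'a str),
    Jmp(&'a str),
    Nop,
    Rts,
}

// How many bytes each line occupies in the output.
fn size(line: &Line) -> u16 {
    match line {
        Line::Label(_) => 0,
        Line::Jmp(_) => 3, // opcode + 16-bit address
        Line::Nop | Line::Rts => 1,
    }
}

fn assemble(origin: u16, program: &[Line]) -> Result<Vec<u8>, String> {
    // Pass 1: record label addresses; instructions only advance the address.
    let mut symbols = HashMap::new();
    let mut addr = origin;
    for line in program {
        if let Line::Label(name) = line {
            symbols.insert(*name, addr);
        }
        addr = addr.wrapping_add(size(line));
    }

    // Pass 2: every label is known now, so forward references resolve.
    let mut out = Vec::new();
    for line in program {
        match line {
            Line::Label(_) => {}
            Line::Jmp(target) => {
                let target_addr = *symbols
                    .get(target)
                    .ok_or_else(|| format!("undefined label '{}'", target))?;
                out.push(0x4C);
                out.extend_from_slice(&target_addr.to_le_bytes()); // low byte first
            }
            Line::Nop => out.push(0xEA),
            Line::Rts => out.push(0x60),
        }
    }
    Ok(out)
}
```

With an origin of 0x8000, jmp occupies 0x8000 to 0x8002 and nop sits at 0x8003, so end resolves to 0x8004 even though the jump appears before the label.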

Tracking References

We track whether symbols are referenced:

#![allow(unused)]
fn main() {
pub fn mark_referenced(&mut self, name: &str) {
    if let Some(sym) = self.symbols.get_mut(name) {
        sym.referenced = true;
    } else if (name.starts_with('.') || name.starts_with('@'))
        && self.current_parent.is_some()
    {
        let qualified = self.qualify_local_label(name);
        if let Some(sym) = self.symbols.get_mut(&qualified) {
            sym.referenced = true;
        }
    }
}
}

This allows us to warn about unused labels:

#![allow(unused)]
fn main() {
pub fn unreferenced_symbols(&self) -> Vec<&Symbol> {
    self.symbols
        .values()
        .filter(|s| !s.referenced && s.is_defined())
        .collect()
}
}

Finding Undefined Symbols

After Pass 1, we can check for undefined symbols:

#![allow(unused)]
fn main() {
pub fn undefined_symbols(&self) -> Vec<&Symbol> {
    self.symbols
        .values()
        .filter(|s| !s.is_defined())
        .collect()
}
}

Complete Example

Let’s trace symbol table operations for:

.equ SCREEN 0x1000

.org 0x8000
start:
    lda #>SCREEN
.loop:
    jmp .loop

.dw start

Pass 1

Operation             Symbol Table
──────────────────────────────────────────────────────────────────────
.equ SCREEN 0x1000    { SCREEN: Constant(4096) }
.org 0x8000           (no change, just sets address)
start:                { SCREEN, start: Address(0x8000) }
lda #>SCREEN          (no change, just advances address)
.loop:                { SCREEN, start, start.loop: Address(0x8002) }
jmp .loop             (no change)
.dw start             (no change)

Pass 2

When generating code:

  • lda #>SCREEN → look up SCREEN → 0x1000 → high byte is 0x10 → emit A9 10
  • jmp .loop → look up start.loop → 0x8002 → emit 4C 02 80
  • .dw start → look up start → 0x8000 → emit 00 80
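The byte sequences above follow from two facts: > selects the high byte of a value, and the 6502 stores 16-bit values low byte first. The sketch below reproduces the three emissions with the symbol values hard-coded from the Pass 1 table; lo_byte, hi_byte, and emit_example are hypothetical helpers for this example.

```rust
fn lo_byte(value: u16) -> u8 {
    (value & 0xFF) as u8
}

fn hi_byte(value: u16) -> u8 {
    (value >> 8) as u8
}

// Emits the bytes for the traced program, using the values from Pass 1.
fn emit_example() -> Vec<u8> {
    let screen: u16 = 0x1000; // SCREEN
    let start: u16 = 0x8000; // start
    let start_loop: u16 = 0x8002; // start.loop

    let mut out = Vec::new();
    // lda #>SCREEN: opcode A9 (LDA immediate) + high byte of SCREEN
    out.extend_from_slice(&[0xA9, hi_byte(screen)]);
    // jmp .loop: opcode 4C (JMP absolute) + address, low byte first
    out.extend_from_slice(&[0x4C, lo_byte(start_loop), hi_byte(start_loop)]);
    // .dw start: a 16-bit word, low byte first
    out.extend_from_slice(&[lo_byte(start), hi_byte(start)]);
    out
}
```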

Summary

In this chapter, we built a symbol table that:

  • Stores labels (addresses) and constants (values)
  • Handles local labels scoped to parent global labels
  • Detects duplicate symbol definitions
  • Supports forward references via undefined symbols
  • Tracks which symbols are referenced

In the next chapter, we’ll implement the two-pass assembly process that uses this symbol table.



Chapter 7: Two-Pass Assembly

In this chapter, we’ll implement the two-pass assembly process that transforms our AST into machine code.

Why Two Passes?

Consider this program:

    jmp end
    nop
end:
    rts

When we reach jmp end, we need to know the address of end. But we haven’t seen end yet! This is the forward reference problem.

The solution is two passes:

  1. Pass 1: Collect all labels and calculate their addresses
  2. Pass 2: Generate code using the complete symbol table

The Assembler Structure

#![allow(unused)]
fn main() {
pub struct Assembler {
    symbols: SymbolTable,
    current_address: u16,
    origin: u16,
    output: Vec<u8>,
    errors: Vec<CodeGenError>,
    current_file: Option<String>,
}

impl Assembler {
    pub fn new() -> Self {
        Self {
            symbols: SymbolTable::new(),
            current_address: 0,
            origin: 0,
            output: Vec::new(),
            errors: Vec::new(),
            current_file: None,
        }
    }
}
}

The Main Assemble Function

#![allow(unused)]
fn main() {
pub fn assemble(&mut self, program: &Program) -> Result<Vec<u8>, AssemblerError> {
    self.current_file = program.source_file.clone();

    // Pass 1: Collect symbols
    self.pass1(program)?;

    // Pass 2: Generate code
    self.pass2(program)?;

    if !self.errors.is_empty() {
        return Err(AssemblerError::Multiple(
            self.errors.iter().cloned().map(AssemblerError::CodeGen).collect(),
        ));
    }

    Ok(std::mem::take(&mut self.output))
}
}

Pass 1: Symbol Collection

In Pass 1, we walk through the program and:

  1. Record label addresses
  2. Process constants from .equ
  3. Calculate instruction sizes to track the current address

#![allow(unused)]
fn main() {
fn pass1(&mut self, program: &Program) -> Result<(), AssemblerError> {
    self.current_address = self.origin;

    for stmt in &program.statements {
        match stmt {
            Statement::Label(label) => {
                self.pass1_label(label)?;
            }
            Statement::Instruction(instr) => {
                self.pass1_instruction(instr)?;
            }
            Statement::Directive(dir) => {
                self.pass1_directive(dir)?;
            }
        }
    }

    Ok(())
}
}

Processing Labels

#![allow(unused)]
fn main() {
fn pass1_label(&mut self, label: &LabelDef) -> Result<(), AssemblerError> {
    let name = if label.is_local {
        self.symbols.qualify_local_label(&label.name)
    } else {
        // Update parent for subsequent local labels
        self.symbols.set_parent(Some(label.name.clone()));
        label.name.clone()
    };

    if let Err(e) = self.symbols.define_label(&name, self.current_address, label.location) {
        self.errors.push(e);
    }

    Ok(())
}
}

Calculating Instruction Size

We need to determine how many bytes an instruction will take:

fn pass1_instruction(&mut self, instr: &InstructionStmt) -> Result<(), AssemblerError> {
    let size = self.instruction_size(instr);
    self.current_address = self.current_address.wrapping_add(size as u16);
    Ok(())
}

fn instruction_size(&self, instr: &InstructionStmt) -> u8 {
    match self.determine_addressing_mode(instr.mnemonic, &instr.operand) {
        Ok(mode) => {
            if let Some(opcode) = get_opcode(instr.mnemonic, mode) {
                opcode.size
            } else {
                1  // Invalid, will error in pass 2
            }
        }
        Err(_) => 1,
    }
}

Processing Directives in Pass 1

fn pass1_directive(&mut self, directive: &DirectiveStmt) -> Result<(), AssemblerError> {
    match directive {
        DirectiveStmt::Org { address, .. } => {
            let addr = self.evaluate(address)? as u16;
            self.origin = addr;
            self.current_address = addr;
        }

        DirectiveStmt::Db { values, .. } => {
            for value in values {
                match value {
                    DataValue::Byte(_) => {
                        self.current_address = self.current_address.wrapping_add(1);
                    }
                    DataValue::String(s) => {
                        // s.len() counts bytes, matching what pass 2 emits
                        self.current_address = self.current_address.wrapping_add(s.len() as u16);
                    }
                }
            }
        }

        DirectiveStmt::Dw { values, .. } => {
            self.current_address = self.current_address.wrapping_add((values.len() * 2) as u16);
        }

        DirectiveStmt::Equ { name, value, location } => {
            let val = self.evaluate(value)?;
            self.symbols.define_constant(name, val, *location)?;
        }

        DirectiveStmt::Include { .. } => {
            // Handle includes (recursive assembly)
        }
    }
    Ok(())
}

Pass 2: Code Generation

In Pass 2, we generate the actual machine code:

fn pass2(&mut self, program: &Program) -> Result<(), AssemblerError> {
    self.current_address = self.origin;
    self.output.clear();

    for stmt in &program.statements {
        match stmt {
            Statement::Label(label) => {
                // Update parent for local label resolution
                if !label.is_local {
                    self.symbols.set_parent(Some(label.name.clone()));
                }
            }

            Statement::Instruction(instr) => {
                if let Err(e) = self.emit_instruction(instr) {
                    self.errors.push(e);
                }
            }

            Statement::Directive(dir) => {
                if let Err(e) = self.emit_directive(dir) {
                    self.errors.push(e);
                }
            }
        }
    }

    Ok(())
}

Determining Addressing Mode

The same operand syntax can map to different addressing modes depending on the value:

fn determine_addressing_mode(
    &self,
    mnemonic: Mnemonic,
    operand: &Option<Operand>,
) -> Result<AddressingMode, CodeGenError> {
    match operand {
        None => Ok(AddressingMode::Implied),
        Some(Operand::Immediate(_)) => Ok(AddressingMode::Immediate),
        Some(Operand::Accumulator) => Ok(AddressingMode::Accumulator),
        Some(Operand::IndirectX(_)) => Ok(AddressingMode::IndirectX),
        Some(Operand::IndirectY(_)) => Ok(AddressingMode::IndirectY),
        Some(Operand::Indirect(_)) => Ok(AddressingMode::Indirect),

        Some(Operand::Address(expr)) => {
            // Branches always use relative
            if is_branch_instruction(mnemonic) {
                return Ok(AddressingMode::Relative);
            }

            // Try to evaluate; if small enough, use zero page
            match self.evaluate(expr) {
                Ok(value) if value >= 0 && value <= 0xFF => {
                    if get_opcode(mnemonic, AddressingMode::ZeroPage).is_some() {
                        Ok(AddressingMode::ZeroPage)
                    } else {
                        Ok(AddressingMode::Absolute)
                    }
                }
                // Value above 0xFF, or not evaluable yet (e.g. a forward reference)
                _ => Ok(AddressingMode::Absolute),
            }
        }

        Some(Operand::IndexedX(expr)) => {
            match self.evaluate(expr) {
                Ok(value) if value >= 0 && value <= 0xFF => {
                    if get_opcode(mnemonic, AddressingMode::ZeroPageX).is_some() {
                        Ok(AddressingMode::ZeroPageX)
                    } else {
                        Ok(AddressingMode::AbsoluteX)
                    }
                }
                _ => Ok(AddressingMode::AbsoluteX),
            }
        }

        Some(Operand::IndexedY(expr)) => {
            match self.evaluate(expr) {
                Ok(value) if value >= 0 && value <= 0xFF => {
                    if get_opcode(mnemonic, AddressingMode::ZeroPageY).is_some() {
                        Ok(AddressingMode::ZeroPageY)
                    } else {
                        Ok(AddressingMode::AbsoluteY)
                    }
                }
                _ => Ok(AddressingMode::AbsoluteY),
            }
        }
    }
}

Zero Page Optimization

The 6502 has faster, shorter instructions for zero page addresses (0x00-0xFF):

lda 0x80       ; Zero Page: A5 80 (2 bytes, 3 cycles)
lda 0x0180     ; Absolute:  AD 80 01 (3 bytes, 4 cycles)

Our assembler automatically uses zero page mode when:

  1. The address fits in 8 bits (0x00-0xFF)
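  2. A zero page variant exists for that instruction

If the operand cannot be evaluated yet (for example, a forward reference in Pass 1), determine_addressing_mode falls back to Absolute. A symbol that is unresolved in Pass 1 but resolves to 0xFF or less in Pass 2 would therefore be sized as 3 bytes and emitted as 2, shifting every later address. Defining zero page constants with .equ before their first use avoids this.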
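The selection rule can be sketched as a small standalone function. `Mode` and `choose_mode` are hypothetical names used only for illustration, not part of the tutorial's code; `value` is `None` when the operand cannot be evaluated yet.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum Mode {
    ZeroPage,
    Absolute,
}

/// Picks the shorter zero page form only when both conditions hold.
/// `value` is `None` when the operand is not known yet (assumption:
/// this models a failed evaluation such as a forward reference).
fn choose_mode(value: Option<i64>, has_zero_page_form: bool) -> Mode {
    match value {
        Some(v) if (0..=0xFF).contains(&v) && has_zero_page_form => Mode::ZeroPage,
        _ => Mode::Absolute,
    }
}
```

With this sketch, `choose_mode(Some(0x80), true)` selects zero page, while an instruction without a zero page form (such as JMP) or an unknown value stays absolute.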

Branch Instructions

Branches use relative addressing. The helper function identifies branch mnemonics:

fn is_branch_instruction(mnemonic: Mnemonic) -> bool {
    matches!(
        mnemonic,
        Mnemonic::BCC
            | Mnemonic::BCS
            | Mnemonic::BEQ
            | Mnemonic::BMI
            | Mnemonic::BNE
            | Mnemonic::BPL
            | Mnemonic::BVC
            | Mnemonic::BVS
    )
}

Output Helpers

fn emit_byte(&mut self, byte: u8) {
    self.output.push(byte);
    self.current_address = self.current_address.wrapping_add(1);
}

fn emit_word(&mut self, word: u16) {
    self.emit_byte((word & 0xFF) as u8);         // Low byte first
    self.emit_byte(((word >> 8) & 0xFF) as u8);  // High byte second
}

// Zero-fills the output until current_address reaches `address`
fn pad_to(&mut self, address: u16) {
    while self.current_address < address {
        self.emit_byte(0);
    }
}

Assembly Trace Example

Let’s trace assembling:

.org 0x8000
start:
    lda #0x42
    jmp start

Pass 1

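| Statement   | Action                | current_address |
|-------------|-----------------------|-----------------|
| .org 0x8000 | Set origin            | 0x8000          |
| start:      | Define start = 0x8000 | 0x8000          |
| lda #0x42   | Size = 2 bytes        | 0x8002          |
| jmp start   | Size = 3 bytes        | 0x8005          |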

Symbol Table: { start: Address(0x8000) }

Pass 2

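| Statement   | Output    | Description                                        |
|-------------|-----------|----------------------------------------------------|
| .org 0x8000 |           | (reset to 0x8000)                                  |
| start:      | (nothing) | Just a label                                       |
| lda #0x42   | A9 42     | LDA immediate = 0xA9                               |
| jmp start   | 4C 00 80  | JMP absolute = 0x4C, addr = 0x8000 (little-endian) |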

Final output: A9 42 4C 00 80
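The trace can be reproduced with a toy two-pass assembler. This is a minimal sketch, not the tutorial's Assembler: `Stmt`, `size`, and `assemble` are hypothetical stand-ins that support only labels, `lda #imm`, and `jmp label`.

```rust
use std::collections::HashMap;

// Hypothetical, heavily reduced statement type: just enough for the trace.
enum Stmt {
    Label(&'static str),
    LdaImm(u8),        // lda #imm  -> A9 imm   (2 bytes)
    Jmp(&'static str), // jmp label -> 4C lo hi (3 bytes)
}

/// Number of bytes a statement occupies in the output.
fn size(stmt: &Stmt) -> u16 {
    match stmt {
        Stmt::Label(_) => 0,
        Stmt::LdaImm(_) => 2,
        Stmt::Jmp(_) => 3,
    }
}

fn assemble(origin: u16, program: &[Stmt]) -> Result<Vec<u8>, String> {
    // Pass 1: record label addresses while tracking the current address.
    let mut symbols: HashMap<&str, u16> = HashMap::new();
    let mut address = origin;
    for stmt in program {
        if let Stmt::Label(name) = stmt {
            symbols.insert(*name, address);
        }
        address = address.wrapping_add(size(stmt));
    }

    // Pass 2: every label is known now, so bytes can be emitted.
    let mut output = Vec::new();
    for stmt in program {
        match stmt {
            Stmt::Label(_) => {}
            Stmt::LdaImm(value) => output.extend_from_slice(&[0xA9, *value]),
            Stmt::Jmp(name) => {
                let target = symbols
                    .get(name)
                    .ok_or_else(|| format!("undefined symbol '{}'", name))?;
                output.push(0x4C);
                output.extend_from_slice(&target.to_le_bytes());
            }
        }
    }
    Ok(output)
}
```

Because labels are collected before any bytes are emitted, a `jmp` to a label defined later in the program resolves the same way as a backward jump.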

Summary

In this chapter, we implemented two-pass assembly:

  • Pass 1: Collect labels and calculate addresses

    • Process .equ constants
    • Calculate instruction sizes
    • Handle .org to set addresses
  • Pass 2: Generate machine code

    • Look up all labels
    • Emit opcode and operand bytes
    • Handle zero page optimization

In the next chapter, we’ll implement the code generation details.



Chapter 8: Code Generation

In this chapter, we’ll implement the code generation phase that converts AST nodes into actual machine code bytes.

Instruction Encoding Overview

Each 6502 instruction consists of:

  1. Opcode byte: Identifies the instruction and addressing mode
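  2. Operand bytes: 0, 1, or 2 bytes depending on addressing mode

| Addressing Mode  | Size | Example     | Bytes     |
|------------------|------|-------------|-----------|
| Implied          | 1    | NOP         | EA        |
| Accumulator      | 1    | ASL A       | 0A        |
| Immediate        | 2    | LDA #$42    | A9 42     |
| Zero Page        | 2    | LDA $80     | A5 80     |
| Zero Page,X      | 2    | LDA $80,X   | B5 80     |
| Zero Page,Y      | 2    | LDX $80,Y   | B6 80     |
| Absolute         | 3    | LDA $2000   | AD 00 20  |
| Absolute,X       | 3    | LDA $2000,X | BD 00 20  |
| Absolute,Y       | 3    | LDA $2000,Y | B9 00 20  |
| Indirect         | 3    | JMP ($2000) | 6C 00 20  |
| Indexed Indirect | 2    | LDA ($80,X) | A1 80     |
| Indirect Indexed | 2    | LDA ($80),Y | B1 80     |
| Relative         | 2    | BEQ label   | F0 offset |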
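The Size column depends only on the addressing mode, which a short function can capture. The enum below reuses the mode names from the text but is a local redefinition for illustration; the tutorial itself reads the size from the opcode table, so `instruction_size` here is a hypothetical helper.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum AddressingMode {
    Implied,
    Accumulator,
    Immediate,
    ZeroPage,
    ZeroPageX,
    ZeroPageY,
    Absolute,
    AbsoluteX,
    AbsoluteY,
    Indirect,
    IndirectX,
    IndirectY,
    Relative,
}

/// Total instruction size in bytes: one opcode byte plus the operand bytes.
fn instruction_size(mode: AddressingMode) -> u8 {
    match mode {
        // No operand
        AddressingMode::Implied | AddressingMode::Accumulator => 1,
        // One operand byte (value, zero page address, or branch offset)
        AddressingMode::Immediate
        | AddressingMode::ZeroPage
        | AddressingMode::ZeroPageX
        | AddressingMode::ZeroPageY
        | AddressingMode::IndirectX
        | AddressingMode::IndirectY
        | AddressingMode::Relative => 2,
        // Two operand bytes (a 16-bit address)
        AddressingMode::Absolute
        | AddressingMode::AbsoluteX
        | AddressingMode::AbsoluteY
        | AddressingMode::Indirect => 3,
    }
}
```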

The emit_instruction Function

pub fn emit_instruction(&mut self, instr: &InstructionStmt) -> Result<(), CodeGenError> {
    // Save instruction start address for $ evaluation
    let instr_start = self.current_address;

    // Determine addressing mode
    let mode = self.determine_addressing_mode(instr.mnemonic, &instr.operand)?;

    // Look up the opcode
    let opcode = get_opcode(instr.mnemonic, mode).ok_or_else(|| {
        CodeGenError::InvalidAddressingMode {
            mnemonic: instr.mnemonic,
            mode,
            location: instr.location,
        }
    })?;

    // Emit opcode byte
    self.emit_byte(opcode.code);

    // Emit operand bytes
    self.emit_operand_bytes(&instr.operand, mode, instr.location, instr_start)?;

    Ok(())
}

Emitting Operand Bytes

Each addressing mode requires different operand handling:

fn emit_operand_bytes(
    &mut self,
    operand: &Option<Operand>,
    mode: AddressingMode,
    location: Location,
    instr_start: u16,
) -> Result<(), CodeGenError> {
    match mode {
        AddressingMode::Implied | AddressingMode::Accumulator => {
            // No operand bytes
        }

        AddressingMode::Immediate => {
            if let Some(Operand::Immediate(expr)) = operand {
                let value = self.evaluate_byte_at(expr, location, instr_start)?;
                self.emit_byte(value);
            }
        }

        AddressingMode::ZeroPage => {
            if let Some(Operand::Address(expr)) = operand {
                let value = self.evaluate_byte_at(expr, location, instr_start)?;
                self.emit_byte(value);
            }
        }

        AddressingMode::ZeroPageX => {
            if let Some(Operand::IndexedX(expr)) = operand {
                let value = self.evaluate_byte_at(expr, location, instr_start)?;
                self.emit_byte(value);
            }
        }

        AddressingMode::ZeroPageY => {
            if let Some(Operand::IndexedY(expr)) = operand {
                let value = self.evaluate_byte_at(expr, location, instr_start)?;
                self.emit_byte(value);
            }
        }

        AddressingMode::Absolute => {
            if let Some(Operand::Address(expr)) = operand {
                let value = self.evaluate_word_at(expr, location, instr_start)?;
                self.emit_word(value);
            }
        }

        AddressingMode::AbsoluteX => {
            if let Some(Operand::IndexedX(expr)) = operand {
                let value = self.evaluate_word_at(expr, location, instr_start)?;
                self.emit_word(value);
            }
        }

        AddressingMode::AbsoluteY => {
            if let Some(Operand::IndexedY(expr)) = operand {
                let value = self.evaluate_word_at(expr, location, instr_start)?;
                self.emit_word(value);
            }
        }

        AddressingMode::Indirect => {
            if let Some(Operand::Indirect(expr)) = operand {
                let value = self.evaluate_word_at(expr, location, instr_start)?;
                self.emit_word(value);
            }
        }

        AddressingMode::IndirectX => {
            if let Some(Operand::IndirectX(expr)) = operand {
                let value = self.evaluate_byte_at(expr, location, instr_start)?;
                self.emit_byte(value);
            }
        }

        AddressingMode::IndirectY => {
            if let Some(Operand::IndirectY(expr)) = operand {
                let value = self.evaluate_byte_at(expr, location, instr_start)?;
                self.emit_byte(value);
            }
        }

        AddressingMode::Relative => {
            if let Some(Operand::Address(expr)) = operand {
                let offset = self.calculate_branch_offset(expr, instr_start, location)?;
                self.emit_byte(offset as u8);
            }
        }
    }

    Ok(())
}

Relative Branch Calculation

Branch instructions use a signed 8-bit offset relative to the instruction after the branch:

pub fn calculate_branch_offset(
    &self,
    target_expr: &Expression,
    from_address: u16,
    location: Location,
) -> Result<i8, CodeGenError> {
    // Evaluate with $ bound to the start of the branch instruction.
    // The opcode byte has already been emitted at this point, so
    // self.current_address is one past from_address.
    let target = self.evaluate_at(target_expr, from_address, location)?;

    // Branch is relative to PC after the branch instruction (PC + 2)
    let offset = target - (from_address as i64 + 2);

    if offset < -128 || offset > 127 {
        return Err(CodeGenError::BranchOutOfRange { offset, location });
    }

    Ok(offset as i8)
}

Branch Offset Example

.org 0x8000
loop:           ; Address 0x8000
    dex         ; Address 0x8000 (1 byte)
    bne loop    ; Address 0x8001 (2 bytes)

For bne loop:

  • Current address when encoding: 0x8001
  • Target: 0x8000
  • Offset = target - (current + 2) = 0x8000 - 0x8003 = -3 = 0xFD

Output: D0 FD
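The same arithmetic can be checked in isolation. The sketch below is a standalone version of the calculation with a hypothetical name, `branch_offset`, that works on plain addresses instead of expressions and returns the out-of-range offset as the error value.

```rust
/// Returns the signed offset byte for a branch located at `branch_addr`.
/// On failure, the error carries the offset that did not fit.
fn branch_offset(branch_addr: u16, target: u16) -> Result<i8, i64> {
    // The CPU adds the offset to the address of the next instruction,
    // which is two bytes past the start of the branch.
    let offset = target as i64 - (branch_addr as i64 + 2);
    if (-128..=127).contains(&offset) {
        Ok(offset as i8)
    } else {
        Err(offset)
    }
}
```

For the example above, `branch_offset(0x8001, 0x8000)` gives `Ok(-3)`, and casting -3 to `u8` produces the 0xFD operand byte. The farthest forward target from a branch at 0x8000 is 0x8081 (offset 127).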

Little-Endian Word Emission

The 6502 uses little-endian byte order:

fn emit_word(&mut self, word: u16) {
    self.emit_byte((word & 0xFF) as u8);         // Low byte
    self.emit_byte(((word >> 8) & 0xFF) as u8);  // High byte
}

So address 0x2000 becomes bytes 00 20.
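The standard library performs the same split with `u16::to_le_bytes`. The helper below, `push_word`, is a hypothetical free-function variant of emit_word that writes into a plain `Vec<u8>`.

```rust
/// Appends `word` to `output` in little-endian order (low byte, then high byte).
fn push_word(output: &mut Vec<u8>, word: u16) {
    output.extend_from_slice(&word.to_le_bytes());
}
```

Reading the bytes back with `u16::from_le_bytes([0x00, 0x20])` recovers 0x2000, which is how an emulator would reassemble the operand.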

Opcode Lookup

We use the byte_common crate’s opcode table:

use byte_common::opcode::{get_opcode, AddressingMode, Mnemonic, Opcode};

// Returns Option<&'static Opcode>; None means the combination does not exist
let opcode = get_opcode(Mnemonic::LDA, AddressingMode::Immediate);
// For LDA immediate this is Some(op) with:
// op.code = 0xA9
// op.size = 2

The Opcode structure provides:

  • code: The actual opcode byte (0x00-0xFF)
  • size: Instruction size in bytes
  • tick: Cycle count (for emulation)
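To make the lookup concrete, here is a minimal sketch of how such a table could be organized. It is not the byte_common implementation: the `mnemonic` and `mode` fields, the `OPCODES` array, and the linear search are assumptions made for illustration, and only six real 6502 encodings are listed.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Mnemonic {
    LDA,
    JMP,
    NOP,
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum AddressingMode {
    Implied,
    Immediate,
    ZeroPage,
    Absolute,
    Indirect,
}

#[derive(Debug, PartialEq)]
struct Opcode {
    code: u8, // opcode byte
    size: u8, // instruction size in bytes
    tick: u8, // base cycle count
    mnemonic: Mnemonic,       // assumed field, used as lookup key
    mode: AddressingMode,     // assumed field, used as lookup key
}

// A small excerpt; a full table would list every documented opcode.
static OPCODES: [Opcode; 6] = [
    Opcode { code: 0xA9, size: 2, tick: 2, mnemonic: Mnemonic::LDA, mode: AddressingMode::Immediate },
    Opcode { code: 0xA5, size: 2, tick: 3, mnemonic: Mnemonic::LDA, mode: AddressingMode::ZeroPage },
    Opcode { code: 0xAD, size: 3, tick: 4, mnemonic: Mnemonic::LDA, mode: AddressingMode::Absolute },
    Opcode { code: 0x4C, size: 3, tick: 3, mnemonic: Mnemonic::JMP, mode: AddressingMode::Absolute },
    Opcode { code: 0x6C, size: 3, tick: 5, mnemonic: Mnemonic::JMP, mode: AddressingMode::Indirect },
    Opcode { code: 0xEA, size: 1, tick: 2, mnemonic: Mnemonic::NOP, mode: AddressingMode::Implied },
];

/// Finds the entry for a mnemonic and mode pair, if that pair exists.
fn get_opcode(mnemonic: Mnemonic, mode: AddressingMode) -> Option<&'static Opcode> {
    OPCODES
        .iter()
        .find(|op| op.mnemonic == mnemonic && op.mode == mode)
}
```

A `None` result is what lets emit_instruction report InvalidAddressingMode: for example, JMP has no immediate form, so that lookup finds nothing.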

Code Generation Examples

Example 1: NOP (Implied)

nop
  1. Mnemonic: NOP
  2. Operand: None
  3. Mode: Implied
  4. Opcode: 0xEA
  5. Output: EA

Example 2: LDA Immediate

lda #0x42
  1. Mnemonic: LDA
  2. Operand: Immediate(0x42)
  3. Mode: Immediate
  4. Opcode: 0xA9
  5. Operand byte: 0x42
  6. Output: A9 42

Example 3: LDA Zero Page

lda 0x80
  1. Mnemonic: LDA
  2. Operand: Address(0x80)
  3. Value fits in 8 bits → Mode: Zero Page
  4. Opcode: 0xA5
  5. Address byte: 0x80
  6. Output: A5 80

Example 4: LDA Absolute

lda 0x2000
  1. Mnemonic: LDA
  2. Operand: Address(0x2000)
  3. Value > 0xFF → Mode: Absolute
  4. Opcode: 0xAD
  5. Address word: 0x2000 → 00 20 (little-endian)
  6. Output: AD 00 20

Example 5: JMP Indirect

jmp (0x2000)
  1. Mnemonic: JMP
  2. Operand: Indirect(0x2000)
  3. Mode: Indirect
  4. Opcode: 0x6C
  5. Address word: 00 20
  6. Output: 6C 00 20

Example 6: Branch Instruction

.org 0x8000
start:
    ldx #5
loop:
    dex
    bne loop

At bne loop:

  • Current address: 0x8003
  • Target: 0x8002 (loop)
  • Offset: 0x8002 - 0x8005 = -3 = 0xFD

Output for bne loop: D0 FD

Error Handling

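Several errors can occur during code generation. The variants used in this chapter are shown below; later chapters also construct EvaluationError and Internal variants: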

pub enum CodeGenError {
    // Invalid mnemonic + mode combination
    InvalidAddressingMode {
        mnemonic: Mnemonic,
        mode: AddressingMode,
        location: Location,
    },

    // Branch target too far
    BranchOutOfRange {
        offset: i64,
        location: Location,
    },

    // Value too large for operand
    ValueOutOfRange {
        value: i64,
        max: i64,
        location: Location,
    },

    // Undefined symbol
    UndefinedSymbol {
        name: String,
        location: Location,
    },
}

Complete Instruction Table

For reference, here’s how common instructions encode:

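| Instruction | Mode             | Opcode |
|-------------|------------------|--------|
| LDA #nn     | Immediate        | A9     |
| LDA zp      | Zero Page        | A5     |
| LDA zp,x    | Zero Page,X      | B5     |
| LDA abs     | Absolute         | AD     |
| LDA abs,x   | Absolute,X       | BD     |
| LDA abs,y   | Absolute,Y       | B9     |
| LDA (zp,x)  | Indexed Indirect | A1     |
| LDA (zp),y  | Indirect Indexed | B1     |
| STA zp      | Zero Page        | 85     |
| STA abs     | Absolute         | 8D     |
| JMP abs     | Absolute         | 4C     |
| JMP (abs)   | Indirect         | 6C     |
| JSR abs     | Absolute         | 20     |
| RTS         | Implied          | 60     |
| NOP         | Implied          | EA     |
| BEQ rel     | Relative         | F0     |
| BNE rel     | Relative         | D0     |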

Summary

In this chapter, we implemented code generation:

  • Opcode lookup using mnemonic + addressing mode
  • Operand byte emission for each addressing mode
  • Little-endian word encoding
  • Relative branch offset calculation
  • Error handling for invalid combinations

In the next chapter, we’ll implement expression evaluation for calculating addresses and values.



Chapter 9: Expression Evaluation

In this chapter, we’ll implement the expression evaluator that computes numeric values from AST expressions.

Why Expressions Matter

Expressions allow powerful constructs in assembly:

.equ BUFFER      0x0200
.equ BUFFER_SIZE 256

lda #BUFFER_SIZE - 1     ; Compute at assembly time
ldx #<BUFFER             ; Low byte of address
ldy #>BUFFER             ; High byte of address
.dw BUFFER + BUFFER_SIZE ; End of buffer
jmp $                    ; Jump to current address

The Evaluator

The evaluator recursively walks an Expression tree and computes the result:

impl Assembler {
    pub fn evaluate(&self, expr: &Expression) -> Result<i64, CodeGenError> {
        self.evaluate_at(expr, self.current_address, Location::default())
    }

    fn evaluate_at(
        &self,
        expr: &Expression,
        current_addr: u16,
        location: Location,
    ) -> Result<i64, CodeGenError> {
        match expr {
            Expression::Number(n) => Ok(*n),

            Expression::CurrentAddress => Ok(current_addr as i64),

            Expression::Identifier(name) => {
                self.symbols.lookup_value(name).ok_or_else(|| {
                    CodeGenError::UndefinedSymbol {
                        name: name.clone(),
                        location,
                    }
                })
            }

            Expression::LocalIdentifier(name) => {
                // Try direct lookup
                if let Some(value) = self.symbols.lookup_value(name) {
                    return Ok(value);
                }
                // Try qualified lookup
                let qualified = self.symbols.qualify_local_label(name);
                self.symbols.lookup_value(&qualified).ok_or_else(|| {
                    CodeGenError::UndefinedSymbol {
                        name: name.clone(),
                        location,
                    }
                })
            }

            Expression::Binary { left, op, right } => {
                let l = self.evaluate_at(left, current_addr, location)?;
                let r = self.evaluate_at(right, current_addr, location)?;
                self.apply_binary_op(l, *op, r, location)
            }

            Expression::Unary { op, operand } => {
                let value = self.evaluate_at(operand, current_addr, location)?;
                self.apply_unary_op(*op, value)
            }
        }
    }
}

Binary Operations

fn apply_binary_op(
    &self,
    left: i64,
    op: BinaryOp,
    right: i64,
    location: Location,
) -> Result<i64, CodeGenError> {
    match op {
        BinaryOp::Add => Ok(left.wrapping_add(right)),
        BinaryOp::Sub => Ok(left.wrapping_sub(right)),
        BinaryOp::Mul => Ok(left.wrapping_mul(right)),
        BinaryOp::Div => {
            if right == 0 {
                Err(CodeGenError::EvaluationError {
                    message: "division by zero".to_string(),
                    location,
                })
            } else {
                // wrapping_div avoids a panic on i64::MIN / -1
                Ok(left.wrapping_div(right))
            }
        }
    }
}

Examples

| Expression     | Result |
|----------------|--------|
| 10 + 5         | 15     |
| 20 - 3         | 17     |
| 4 * 8          | 32     |
| 100 / 10       | 10     |
| 0x1000 + 0x100 | 0x1100 |
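The recursive walk can be tried out without the rest of the assembler. The sketch below uses a reduced, hypothetical `Expr` type (numbers and binary operations only) and returns a `String` error instead of CodeGenError; the `binary` constructor is a convenience helper that is not part of the tutorial.

```rust
#[derive(Clone, Copy)]
enum BinaryOp {
    Add,
    Sub,
    Mul,
    Div,
}

// Reduced stand-in for the tutorial's Expression type.
enum Expr {
    Number(i64),
    Binary { left: Box<Expr>, op: BinaryOp, right: Box<Expr> },
}

/// Evaluates the tree bottom-up: both children first, then the operator.
fn eval(expr: &Expr) -> Result<i64, String> {
    match expr {
        Expr::Number(n) => Ok(*n),
        Expr::Binary { left, op, right } => {
            let l = eval(left)?;
            let r = eval(right)?;
            match op {
                BinaryOp::Add => Ok(l.wrapping_add(r)),
                BinaryOp::Sub => Ok(l.wrapping_sub(r)),
                BinaryOp::Mul => Ok(l.wrapping_mul(r)),
                // checked_div returns None for a zero divisor or on overflow
                BinaryOp::Div => l
                    .checked_div(r)
                    .ok_or_else(|| "division by zero or overflow".to_string()),
            }
        }
    }
}

/// Convenience constructor for building test trees.
fn binary(left: Expr, op: BinaryOp, right: Expr) -> Expr {
    Expr::Binary { left: Box::new(left), op, right: Box::new(right) }
}
```

Precedence is not the evaluator's concern: the parser has already encoded it in the shape of the tree, so `2 + 3 * 4` arrives with the multiplication as the right child of the addition.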

Unary Operations

fn apply_unary_op(&self, op: UnaryOp, value: i64) -> Result<i64, CodeGenError> {
    match op {
        // wrapping_neg avoids a panic when negating i64::MIN
        UnaryOp::Neg => Ok(value.wrapping_neg()),
        UnaryOp::LoByte => Ok(value & 0xFF),
        UnaryOp::HiByte => Ok((value >> 8) & 0xFF),
    }
}

Lo-Byte and Hi-Byte Operators

These are essential for working with 16-bit addresses on an 8-bit processor:

.equ SCREEN 0x1234

lda #<SCREEN    ; Low byte: 0x34
sta ptr
lda #>SCREEN    ; High byte: 0x12
sta ptr+1

| Expression | Value  | Result |
|------------|--------|--------|
| <0x1234    | 0x1234 | 0x34   |
| >0x1234    | 0x1234 | 0x12   |
| <0xFF00    | 0xFF00 | 0x00   |
| >0x00FF    | 0x00FF | 0x00   |
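A short sketch shows the split and its inverse. `split` and `join` are hypothetical helper names; `join` mirrors how the 6502 reads a two-byte pointer such as ptr and ptr+1, low byte first.

```rust
/// Splits a 16-bit address into (low, high) bytes, like the < and > operators.
fn split(addr: u16) -> (u8, u8) {
    ((addr & 0xFF) as u8, (addr >> 8) as u8)
}

/// Rebuilds the address from a low byte and a high byte.
fn join(lo: u8, hi: u8) -> u16 {
    ((hi as u16) << 8) | (lo as u16)
}
```

For SCREEN = 0x1234, `split` yields (0x34, 0x12), and `join(0x34, 0x12)` gives back 0x1234.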

The Current Address ($)

The $ symbol represents the current address during assembly:

.org 0x8000
loop:
    jmp $       ; Jump to self (infinite loop) - jumps to 0x8000
    .dw $       ; Store current address

When evaluating $, we use the address at the start of the instruction, not after emitting bytes:

Expression::CurrentAddress => Ok(current_addr as i64),

This is why we pass instr_start when evaluating operands.

Range Checking

We provide variants that check that the result fits in the required size:

pub fn evaluate_byte(
    &self,
    expr: &Expression,
    location: Location,
) -> Result<u8, CodeGenError> {
    let value = self.evaluate_with_location(expr, location)?;

    // Allow -128 to 255 (signed or unsigned byte)
    if value < -128 || value > 255 {
        return Err(CodeGenError::ValueOutOfRange {
            value,
            max: 255,
            location,
        });
    }

    Ok(value as u8)
}

pub fn evaluate_word(
    &self,
    expr: &Expression,
    location: Location,
) -> Result<u16, CodeGenError> {
    let value = self.evaluate_with_location(expr, location)?;

    // Allow -32768 to 65535 (signed or unsigned word)
    if value < -32768 || value > 65535 {
        return Err(CodeGenError::ValueOutOfRange {
            value,
            max: 65535,
            location,
        });
    }

    Ok(value as u16)
}

Evaluation with Custom Address

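For correct $ handling during code generation, the _at variants take the address to use for $ as an explicit argument. evaluate_byte_at is shown here; evaluate_word_at, used in Chapter 8, is not shown but would follow the same pattern with the word range: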

pub fn evaluate_byte_at(
    &self,
    expr: &Expression,
    location: Location,
    current_addr: u16,
) -> Result<u8, CodeGenError> {
    let value = self.evaluate_at(expr, current_addr, location)?;

    if value < -128 || value > 255 {
        return Err(CodeGenError::ValueOutOfRange {
            value,
            max: 255,
            location,
        });
    }

    Ok(value as u8)
}
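The accepted range, -128 to 255, covers both signed and unsigned interpretations of a byte, and the final cast maps negative values to their two's-complement encoding. The sketch below isolates that behavior in a hypothetical free function, `to_byte`, with a `String` error whose wording follows the "out of range" message style used in this chapter.

```rust
/// Narrows an evaluated value to one byte, accepting -128..=255.
fn to_byte(value: i64) -> Result<u8, String> {
    if !(-128..=255).contains(&value) {
        return Err(format!("value {} out of range (max 255)", value));
    }
    // Negative values wrap to two's complement: -1 becomes 0xFF.
    Ok(value as u8)
}
```

So `lda #-1` and `lda #255` would both emit 0xFF as the operand byte, while 256 and -129 are rejected.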

Complex Expression Examples

Computing Buffer End

.equ BUFFER_START 0x0200
.equ BUFFER_SIZE  0x0100

; BUFFER_END = BUFFER_START + BUFFER_SIZE = 0x0300
lda #>(BUFFER_START + BUFFER_SIZE)  ; High byte of 0x0300: 0x03
lda #>BUFFER_START + BUFFER_SIZE    ; Unary > binds tighter than +: (>0x0200) + 0x0100 = 0x0102, out of range for a byte

Table Offsets

.equ SPRITE_SIZE 4
.equ SPRITE_COUNT 8

; Total size = 4 * 8 = 32 bytes
.equ SPRITE_TABLE_SIZE SPRITE_SIZE * SPRITE_COUNT

; Offset to sprite N: N * SPRITE_SIZE
lda sprite_table + 2 * SPRITE_SIZE  ; Third sprite

Relative Jumps

    beq $ + 3   ; Skip next 1-byte instruction if equal
    nop
    rts

Here $ + 3 evaluates to current_address + 3.

Evaluation Order

Expressions follow standard mathematical precedence:

  1. Parentheses () - highest
  2. Unary operators -, <, >
  3. Multiplication and division *, /
  4. Addition and subtraction +, - - lowest

So 2 + 3 * 4 evaluates as 2 + (3 * 4) = 14, not (2 + 3) * 4 = 20.
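One common way to get this ordering is recursive descent with one function per precedence level, each calling the next-tighter level. The sketch below is not the tutorial's parser: `Parser` and `evaluate` are hypothetical, it works directly on characters, supports decimal numbers only, and computes the value while parsing instead of building an AST.

```rust
use std::iter::Peekable;
use std::str::Chars;

struct Parser<'a> {
    chars: Peekable<Chars<'a>>,
}

impl<'a> Parser<'a> {
    fn new(source: &'a str) -> Self {
        Parser { chars: source.chars().peekable() }
    }

    /// Skips spaces and returns the next character without consuming it.
    fn peek(&mut self) -> Option<char> {
        while self.chars.peek() == Some(&' ') {
            self.chars.next();
        }
        self.chars.peek().copied()
    }

    /// Level 4 (lowest): addition and subtraction, left-associative.
    fn expr(&mut self) -> Result<i64, String> {
        let mut value = self.term()?;
        while let Some(op) = self.peek() {
            match op {
                '+' => { self.chars.next(); value = value.wrapping_add(self.term()?); }
                '-' => { self.chars.next(); value = value.wrapping_sub(self.term()?); }
                _ => break,
            }
        }
        Ok(value)
    }

    /// Level 3: multiplication and division.
    fn term(&mut self) -> Result<i64, String> {
        let mut value = self.unary()?;
        while let Some(op) = self.peek() {
            match op {
                '*' => { self.chars.next(); value = value.wrapping_mul(self.unary()?); }
                '/' => {
                    self.chars.next();
                    let rhs = self.unary()?;
                    value = value.checked_div(rhs).ok_or("division by zero or overflow")?;
                }
                _ => break,
            }
        }
        Ok(value)
    }

    /// Level 2: unary minus, low byte (<), high byte (>).
    fn unary(&mut self) -> Result<i64, String> {
        match self.peek() {
            Some('-') => { self.chars.next(); Ok(self.unary()?.wrapping_neg()) }
            Some('<') => { self.chars.next(); Ok(self.unary()? & 0xFF) }
            Some('>') => { self.chars.next(); Ok((self.unary()? >> 8) & 0xFF) }
            _ => self.primary(),
        }
    }

    /// Level 1 (highest): parenthesized expressions and decimal numbers.
    fn primary(&mut self) -> Result<i64, String> {
        match self.peek() {
            Some('(') => {
                self.chars.next();
                let value = self.expr()?;
                match self.peek() {
                    Some(')') => { self.chars.next(); Ok(value) }
                    other => Err(format!("expected ')', found {:?}", other)),
                }
            }
            Some(c) if c.is_ascii_digit() => {
                let mut value: i64 = 0;
                while let Some(d) = self.chars.peek().and_then(|c| c.to_digit(10)) {
                    self.chars.next();
                    value = value.wrapping_mul(10).wrapping_add(d as i64);
                }
                Ok(value)
            }
            other => Err(format!("expected a number or '(', found {:?}", other)),
        }
    }
}

/// Parses and evaluates a complete expression, rejecting trailing input.
fn evaluate(source: &str) -> Result<i64, String> {
    let mut parser = Parser::new(source);
    let value = parser.expr()?;
    match parser.peek() {
        None => Ok(value),
        Some(c) => Err(format!("unexpected character '{}'", c)),
    }
}
```

Because unary operators sit above the binary levels, `>512 + 256` evaluates as `(>512) + 256`, which is 258, while `>(512 + 256)` takes the high byte of 768 and gives 3.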

Error Messages

When evaluation fails, we provide helpful messages:

error: undefined symbol 'sprite_ptr'
  --> game.s:42:5
   |
42 |     lda sprite_ptr
   |         ^^^^^^^^^^ symbol not defined

error: value 300 out of range (max 255)
  --> game.s:15:9
   |
15 |     lda #300
   |         ^^^^ value too large for byte
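A message in this layout can be produced from the source text plus a 1-based line and column. The following is a minimal sketch under stated assumptions: `render_error` and its parameters are hypothetical, the source is assumed to be indented with spaces (no tabs), and the gutter is as wide as the line number.

```rust
/// Formats a diagnostic with the offending source line and a caret underline.
/// `line` and `column` are 1-based; `len` is the number of characters to mark.
fn render_error(
    file: &str,
    source: &str,
    line: usize,
    column: usize,
    len: usize,
    message: &str,
    hint: &str,
) -> String {
    let number = line.to_string();
    let pad = " ".repeat(number.len());
    // Fall back to an empty line if `line` is past the end of the source.
    let text = source.lines().nth(line.saturating_sub(1)).unwrap_or("");
    let indent = " ".repeat(column.saturating_sub(1));
    let carets = "^".repeat(len.max(1));

    format!(
        "error: {}\n{}--> {}:{}:{}\n{} |\n{} | {}\n{} | {}{} {}",
        message, pad, file, line, column, pad, number, text, pad, indent, carets, hint
    )
}
```

The caret line reuses the same gutter and separator as the source line, so a column offset of `column - 1` spaces places the first caret directly under the reported character.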

Summary

In this chapter, we implemented expression evaluation:

  • Numeric literals: Direct values
  • Identifiers: Symbol table lookups
  • Binary operations: +, -, *, /
  • Unary operations: negation, lo-byte, hi-byte
  • Current address: $ symbol
  • Range checking: Ensure values fit in bytes/words

In the next chapter, we’ll implement directive handlers for .org, .db, .dw, etc.



Chapter 10: Implementing Directives

In this chapter, we’ll implement handlers for all ByteASM directives.

Directive Overview

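| Directive | Purpose              | Example              |
|-----------|----------------------|----------------------|
| .org      | Set assembly address | .org 0x8000          |
| .db       | Define bytes         | .db 0x01, "Hi", 0    |
| .dw       | Define words         | .dw 0x1234, label    |
| .equ      | Define constant      | .equ SCREEN 0x1000   |
| .include  | Include file         | .include "utils.s"   |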

The .org Directive

.org (origin) sets the current assembly address.

Pass 1 Handling

DirectiveStmt::Org { address, .. } => {
    let addr = self.evaluate(address)? as u16;
    self.origin = addr;
    self.current_address = addr;
}

Pass 2 Handling

DirectiveStmt::Org { address, .. } => {
    let addr = self.evaluate(address)? as u16;

    // If moving forward, pad with zeros
    if addr > self.current_address {
        self.pad_to(addr);
    }

    self.current_address = addr;
}

Usage Examples

.org 0x8000         ; Start code at 0x8000
reset:
    ; ... code ...

.org 0xFFFC         ; Move to the reset vector location
.dw reset           ; Write reset vector

Multiple .org Directives

.org 0x8000
    lda #0x01       ; At 0x8000

.org 0x8100         ; Skip to 0x8100 (gap filled with zeros)
    lda #0x02       ; At 0x8100
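The gap-filling behavior can be sketched with a small output buffer that tracks its own address. `Output`, `emit`, and `org` are hypothetical names; as in the Pass 2 handler above, moving backward only changes the address and does not remove bytes already written.

```rust
struct Output {
    bytes: Vec<u8>,
    address: u16,
}

impl Output {
    fn new(origin: u16) -> Self {
        Output { bytes: Vec::new(), address: origin }
    }

    /// Appends one byte and advances the current address.
    fn emit(&mut self, byte: u8) {
        self.bytes.push(byte);
        self.address = self.address.wrapping_add(1);
    }

    /// Moves to `addr`, zero-filling the gap when moving forward.
    fn org(&mut self, addr: u16) {
        while self.address < addr {
            self.emit(0);
        }
        self.address = addr;
    }
}
```

For the example above, two bytes at 0x8000 followed by `.org 0x8100` yields 0xFE padding bytes, so the second `lda` lands at offset 0x100 in the output.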

The .db Directive

.db (define bytes) emits raw bytes and strings.

Pass 1 Handling

Calculate size without emitting:

DirectiveStmt::Db { values, .. } => {
    for value in values {
        match value {
            DataValue::Byte(_) => {
                self.current_address = self.current_address.wrapping_add(1);
            }
            DataValue::String(s) => {
                self.current_address = self.current_address.wrapping_add(s.len() as u16);
            }
        }
    }
}

Pass 2 Handling

Emit the actual bytes:

DirectiveStmt::Db { values, location } => {
    for value in values {
        match value {
            DataValue::Byte(expr) => {
                let byte = self.evaluate_byte(expr, *location)?;
                self.emit_byte(byte);
            }
            DataValue::String(s) => {
                for byte in s.bytes() {
                    self.emit_byte(byte);
                }
            }
        }
    }
}

Usage Examples

; Single bytes
.db 0x00, 0xFF, 0x42

; Expressions
.db 10 + 5, CONSTANT - 1

; Strings
.db "Hello, World!", 0x0A, 0

; Mixed
.db "Score: ", 0x30, 0  ; "Score: 0\0"

Strings and Escape Sequences

The scanner handles escape sequences, so:

.db "Line 1\nLine 2\0"

Emits: 4C 69 6E 65 20 31 0A 4C 69 6E 65 20 32 00
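The two passes must agree on the size of a .db directive: Pass 1 counts with `s.len()` and Pass 2 emits with `s.bytes()`, both of which work in bytes. The sketch below shows that pairing with a simplified, hypothetical `DataValue` whose Byte variant holds an already evaluated `u8` instead of an expression.

```rust
// Simplified stand-in: the tutorial's Byte variant holds an expression.
enum DataValue {
    Byte(u8),
    String(String),
}

/// Pass 1 view: how many bytes the directive occupies.
fn db_size(values: &[DataValue]) -> u16 {
    values
        .iter()
        .map(|value| match value {
            DataValue::Byte(_) => 1,
            DataValue::String(s) => s.len() as u16,
        })
        .fold(0u16, |total, n| total.wrapping_add(n))
}

/// Pass 2 view: the bytes the directive emits.
fn db_bytes(values: &[DataValue]) -> Vec<u8> {
    let mut out = Vec::new();
    for value in values {
        match value {
            DataValue::Byte(b) => out.push(*b),
            DataValue::String(s) => out.extend_from_slice(s.as_bytes()),
        }
    }
    out
}
```

The escape sequences are assumed to have been resolved by the scanner already, so the string passed in contains a real newline and NUL byte.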

The .dw Directive

.dw (define words) emits 16-bit values in little-endian format.

Pass 1 Handling

DirectiveStmt::Dw { values, .. } => {
    self.current_address = self.current_address
        .wrapping_add((values.len() * 2) as u16);
}

Pass 2 Handling

DirectiveStmt::Dw { values, location } => {
    for value in values {
        let word = self.evaluate_word(value, *location)?;
        self.emit_word(word);
    }
}

Usage Examples

; Numeric values
.dw 0x1234, 0x5678      ; Emits: 34 12 78 56

; Labels (addresses)
.dw start, main_loop    ; Emits addresses

; Interrupt vectors
.org 0xFFFC
.dw reset               ; Reset vector
.dw irq_handler         ; IRQ vector

The .equ Directive

.equ defines a constant that can be used throughout the program.

Pass 1 Handling

DirectiveStmt::Equ { name, value, location } => {
    let val = self.evaluate(value)?;
    self.symbols.define_constant(name, val, *location)?;
}

Pass 2 Handling

DirectiveStmt::Equ { .. } => {
    // Constants are handled in pass 1, nothing to emit
}

Usage Examples

; Hardware addresses
.equ VID_PTR    0xFD
.equ INPUT      0xFF

; Constants
.equ SCREEN_WIDTH  64
.equ SCREEN_HEIGHT 64
.equ PIXEL_COUNT   SCREEN_WIDTH * SCREEN_HEIGHT

; Usage
lda INPUT
sta VID_PTR
ldx #SCREEN_WIDTH

Constants vs Labels

Constants differ from labels:

  • Labels: Get their value from their position in code
  • Constants: Have values assigned directly

.equ CONST 0x42     ; CONST = 0x42 (assigned)

label:              ; label = current address (computed)
    nop

The .include Directive

.include inserts another source file at the current position.

Implementation (Simplified)

DirectiveStmt::Include { path, location } => {
    // Read the include file
    let source = std::fs::read_to_string(path)
        .map_err(|_| CodeGenError::Internal {
            message: format!("cannot read file: {}", path),
        })?;

    // Parse it
    let program = parser::parse(&source)?;

    // Recursively assemble
    for stmt in program.statements {
        self.process_statement(&stmt)?;
    }
}

Usage Examples

; main.s
.include "constants.s"
.include "macros.s"

.org 0x8000
reset:
    jsr init_screen
    ; ...

; constants.s
.equ VID_PTR 0xFD
.equ INPUT   0xFF
.equ VRAM    0x1000

Circular Include Detection

We need to detect and prevent circular includes:

; a.s includes b.s
; b.s includes a.s → Error!

Track included files and error if we see a repeat.
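One way to implement this is a stack of the files currently being assembled. This is a sketch of a variation on the idea above, not the tutorial's code: `IncludeStack`, `enter`, and `leave` are hypothetical names, and because only files still in progress are tracked, including the same file twice in sequence is allowed while a true cycle is rejected.

```rust
use std::path::{Path, PathBuf};

/// Files currently being assembled, outermost first.
struct IncludeStack {
    active: Vec<PathBuf>,
}

impl IncludeStack {
    fn new() -> Self {
        IncludeStack { active: Vec::new() }
    }

    /// Call before assembling `path`; fails if `path` is already in progress.
    fn enter(&mut self, path: &Path) -> Result<(), String> {
        if self.active.iter().any(|p| p.as_path() == path) {
            return Err(format!("circular include: {}", path.display()));
        }
        self.active.push(path.to_path_buf());
        Ok(())
    }

    /// Call after the file has been fully assembled.
    fn leave(&mut self) {
        self.active.pop();
    }
}
```

Paths are compared as written here. A more robust version would likely normalize them first (for example with `std::fs::canonicalize`) so that `./a.s` and `a.s` count as the same file.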

Directive Processing Flow

Complete Pass 1 Handler

pub fn pass1_directive(&mut self, directive: &DirectiveStmt) -> Result<(), CodeGenError> {
    match directive {
        DirectiveStmt::Org { address, .. } => {
            let addr = self.evaluate(address)? as u16;
            self.origin = addr;
            self.current_address = addr;
        }

        DirectiveStmt::Db { values, .. } => {
            for value in values {
                match value {
                    DataValue::Byte(_) => {
                        self.current_address = self.current_address.wrapping_add(1);
                    }
                    DataValue::String(s) => {
                        self.current_address =
                            self.current_address.wrapping_add(s.len() as u16);
                    }
                }
            }
        }

        DirectiveStmt::Dw { values, .. } => {
            self.current_address = self
                .current_address
                .wrapping_add((values.len() * 2) as u16);
        }

        DirectiveStmt::Equ { name, value, location } => {
            let val = self.evaluate(value)?;
            self.symbols.define_constant(name, val, *location)?;
        }

        DirectiveStmt::Include { .. } => {
            // Handle includes
        }
    }

    Ok(())
}

Complete Pass 2 Handler

pub fn emit_directive(&mut self, directive: &DirectiveStmt) -> Result<(), CodeGenError> {
    match directive {
        DirectiveStmt::Org { address, .. } => {
            let addr = self.evaluate(address)? as u16;
            if addr > self.current_address {
                self.pad_to(addr);
            }
            self.current_address = addr;
        }

        DirectiveStmt::Db { values, location } => {
            for value in values {
                match value {
                    DataValue::Byte(expr) => {
                        let byte = self.evaluate_byte(expr, *location)?;
                        self.emit_byte(byte);
                    }
                    DataValue::String(s) => {
                        for byte in s.bytes() {
                            self.emit_byte(byte);
                        }
                    }
                }
            }
        }

        DirectiveStmt::Dw { values, location } => {
            for value in values {
                let word = self.evaluate_word(value, *location)?;
                self.emit_word(word);
            }
        }

        DirectiveStmt::Equ { .. } => {
            // Nothing to emit
        }

        DirectiveStmt::Include { .. } => {
            // Handle includes
        }
    }

    Ok(())
}

Summary

In this chapter, we implemented all ByteASM directives:

  • .org: Sets the assembly address, pads with zeros
  • .db: Emits bytes and strings
  • .dw: Emits 16-bit words (little-endian)
  • .equ: Defines named constants
  • .include: Includes other source files

In the next chapter, we’ll implement comprehensive error handling and reporting.



Chapter 11: Error Handling and Reporting

In this chapter, we’ll implement comprehensive error handling to provide helpful messages when things go wrong.

Error Categories

Our assembler has three categories of errors:

  1. Scanner Errors: Problems reading source characters
  2. Parse Errors: Problems understanding the syntax
  3. Code Generation Errors: Problems during assembly

The Error Type Hierarchy

pub enum AssemblerError {
    Scanner(ScannerError),
    Parse(ParseError),
    CodeGen(CodeGenError),
    // File could not be read or written
    Io { message: String },
    // Several errors collected in a single run
    Multiple(Vec<AssemblerError>),
}

This unified type lets us handle errors consistently throughout the pipeline.
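One common way to get that consistency is to implement `From` for each stage's error, so the `?` operator converts automatically. The sketch below uses stand-in tuple structs for the three error types, plus hypothetical `parse_stage` and `run` functions, purely so it compiles on its own; the real enums carry the fields shown in the following sections.

```rust
// Stand-in error types for this sketch only.
#[derive(Debug, PartialEq)]
pub struct ScannerError(pub String);
#[derive(Debug, PartialEq)]
pub struct ParseError(pub String);
#[derive(Debug, PartialEq)]
pub struct CodeGenError(pub String);

#[derive(Debug, PartialEq)]
pub enum AssemblerError {
    Scanner(ScannerError),
    Parse(ParseError),
    CodeGen(CodeGenError),
    Io { message: String },
    Multiple(Vec<AssemblerError>),
}

impl From<ScannerError> for AssemblerError {
    fn from(e: ScannerError) -> Self {
        AssemblerError::Scanner(e)
    }
}

impl From<ParseError> for AssemblerError {
    fn from(e: ParseError) -> Self {
        AssemblerError::Parse(e)
    }
}

impl From<CodeGenError> for AssemblerError {
    fn from(e: CodeGenError) -> Self {
        AssemblerError::CodeGen(e)
    }
}

impl From<std::io::Error> for AssemblerError {
    fn from(e: std::io::Error) -> Self {
        AssemblerError::Io { message: e.to_string() }
    }
}

// Hypothetical pipeline stage that fails with its own error type.
fn parse_stage(ok: bool) -> Result<u32, ParseError> {
    if ok {
        Ok(1)
    } else {
        Err(ParseError("unexpected token".to_string()))
    }
}

// `?` turns the ParseError into AssemblerError::Parse via the From impl.
pub fn run(ok: bool) -> Result<u32, AssemblerError> {
    let statements = parse_stage(ok)?;
    Ok(statements)
}
```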

Scanner Errors

#![allow(unused)]
fn main() {
pub enum ScannerError {
    UnknownCharacter {
        line: usize,
        column: usize,
        character: char,
    },
    UnknownDirective {
        line: usize,
        column: usize,
        directive: String,
    },
    NumberExpected {
        line: usize,
        column: usize,
        symbol: char,
    },
    UnterminatedString {
        line: usize,
        column: usize,
        quote: char,
    },
}
}

Example Messages

error: unknown character '?'
  --> game.s:5:9
   |
 5 |     lda ?
   |          ^

error: unterminated string
  --> game.s:12:9
   |
12 |     .db "Hello
   |         ^^^^^^ missing closing quote

Parse Errors

#![allow(unused)]
fn main() {
pub enum ParseError {
    UnexpectedToken {
        expected: String,
        found: TokenKind,
        location: Location,
    },
    UnexpectedEof {
        location: Location,
    },
    InvalidOperand {
        message: String,
        location: Location,
    },
    InvalidExpression {
        message: String,
        location: Location,
    },
    InvalidDirective {
        message: String,
        location: Location,
    },
    InvalidLabel {
        message: String,
        location: Location,
    },
}
}

Example Messages

error: unexpected token: expected ')', found ','
  --> game.s:15:14
   |
15 |     lda (0x80,y)
   |              ^ expected ')' for indirect addressing

error: invalid operand: indexed indirect only supports X register
  --> game.s:20:9
   |
20 |     sta (0x80,y)
   |         ^^^^^^^^ use (zp),y for indirect indexed mode

Code Generation Errors

#![allow(unused)]
fn main() {
pub enum CodeGenError {
    UndefinedSymbol {
        name: String,
        location: Location,
    },
    DuplicateSymbol {
        name: String,
        first: Location,
        second: Location,
    },
    InvalidAddressingMode {
        mnemonic: Mnemonic,
        mode: AddressingMode,
        location: Location,
    },
    BranchOutOfRange {
        offset: i64,
        location: Location,
    },
    ValueOutOfRange {
        value: i64,
        max: i64,
        location: Location,
    },
    CircularInclude {
        path: String,
        location: Location,
    },
    EvaluationError {
        message: String,
        location: Location,
    },
}
}

Example Messages

error: undefined symbol 'sprite_ptr'
  --> game.s:42:9
   |
42 |     lda sprite_ptr
   |         ^^^^^^^^^^ symbol not defined

error: duplicate symbol 'main'
  --> game.s:30:1
   |
10 |  main:
   |  ---- first defined here
   |
30 |  main:
   |  ^^^^ redefined here

error: branch target out of range (offset -200)
  --> game.s:55:5
   |
55 |     bne far_label
   |     ^^^^^^^^^^^^^ target is 200 bytes away (max: -128 to +127)
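The range in that last message comes from the 6502's relative addressing: a branch stores a signed 8-bit offset measured from the address of the instruction that follows the two-byte branch. A minimal sketch of the check, where `branch_offset` is a hypothetical helper and not part of the tutorial's code:

```rust
/// Compute the relative offset for a 6502 branch.
/// `branch_addr` is the address of the branch opcode; the offset is
/// measured from the next instruction (branch_addr + 2).
/// On failure, the out-of-range offset is returned for the error message.
pub fn branch_offset(branch_addr: u16, target: u16) -> Result<i8, i64> {
    let next = branch_addr as i64 + 2;
    let offset = target as i64 - next;
    if (-128..=127).contains(&offset) {
        Ok(offset as i8)
    } else {
        Err(offset)
    }
}
```

The `Err` value is what a `BranchOutOfRange` error would carry in its `offset` field.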

Formatting Errors

The Diagnostic Structure

pub struct Diagnostic {
    pub severity: Severity,
    pub message: String,
    pub location: Option<Location>,
    pub file: Option<String>,
}

pub enum Severity {
    Error,
    Warning,
    Note,
}

Formatting with Source Context

#![allow(unused)]
fn main() {
impl Diagnostic {
    pub fn format(&self, source: &str) -> String {
        let severity = match self.severity {
            Severity::Error => "error",
            Severity::Warning => "warning",
            Severity::Note => "note",
        };

        let mut output = String::new();

        if let Some(loc) = &self.location {
            // Header line
            if let Some(file) = &self.file {
                output.push_str(&format!(
                    "{}: {}:{}:{}: {}\n",
                    severity, file, loc.line, loc.column, self.message
                ));
            } else {
                output.push_str(&format!(
                    "{}: [{}:{}]: {}\n",
                    severity, loc.line, loc.column, self.message
                ));
            }

            // Show source context
            let lines: Vec<&str> = source.lines().collect();
            if loc.line > 0 && loc.line <= lines.len() {
                let line = lines[loc.line - 1];
                let line_num = format!("{}", loc.line);
                let padding = " ".repeat(line_num.len());

                output.push_str(&format!("   {} |\n", padding));
                output.push_str(&format!("   {} | {}\n", line_num, line));

                // Underline
                let underline_padding = " ".repeat(loc.column.saturating_sub(1));
                let underline = "^".repeat(loc.length.max(1));
                output.push_str(&format!(
                    "   {} | {}{}\n",
                    padding, underline_padding, underline
                ));
            }
        } else {
            output.push_str(&format!("{}: {}\n", severity, self.message));
        }

        output
    }
}
}

Error Recovery

Rather than stopping at the first error, we continue parsing to find more issues.

Parser Error Recovery

#![allow(unused)]
fn main() {
fn synchronize(&mut self) {
    // Skip to the next line
    while !self.is_at_end() {
        if self.check(TokenKind::NewLine) {
            self.advance();
            return;
        }
        self.advance();
    }
}

pub fn parse(&mut self) -> Result<Program, Vec<ParseError>> {
    let mut program = Program::new();

    while !self.is_at_end() {
        self.skip_empty_lines();
        if self.is_at_end() { break; }

        match self.parse_line() {
            Ok(stmts) => program.statements.extend(stmts),
            Err(e) => {
                self.errors.push(e);
                self.synchronize();  // Skip to next line
            }
        }
    }

    if self.errors.is_empty() {
        Ok(program)
    } else {
        Err(std::mem::take(&mut self.errors))
    }
}
}

Assembler Error Collection

#![allow(unused)]
fn main() {
fn pass2(&mut self, program: &Program) -> Result<(), AssemblerError> {
    for stmt in &program.statements {
        match stmt {
            Statement::Instruction(instr) => {
                if let Err(e) = self.emit_instruction(instr) {
                    self.errors.push(e);  // Collect, don't stop
                }
            }
            // ...
        }
    }
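
    // Surface everything collected during the pass as one error
    // (assumes self.errors is a Vec<CodeGenError>)
    if self.errors.is_empty() {
        Ok(())
    } else {
        let errors = std::mem::take(&mut self.errors);
        Err(AssemblerError::Multiple(
            errors.into_iter().map(AssemblerError::CodeGen).collect(),
        ))
    }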
}
}
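Once errors are collected, the driver has to walk them, for example to print a "Found N errors." footer. The sketch below assumes collected errors are wrapped in `AssemblerError::Multiple`; the simplified enum and the `count_errors` and `summary` helpers are hypothetical.

```rust
// Simplified stand-in for the tutorial's AssemblerError.
#[derive(Debug)]
pub enum AssemblerError {
    CodeGen(String),
    Io { message: String },
    Multiple(Vec<AssemblerError>),
}

/// Count leaf errors, descending into nested `Multiple` values.
pub fn count_errors(error: &AssemblerError) -> usize {
    match error {
        AssemblerError::Multiple(errors) => errors.iter().map(count_errors).sum(),
        _ => 1,
    }
}

/// Build the footer line printed after all diagnostics.
pub fn summary(error: &AssemblerError) -> String {
    let n = count_errors(error);
    format!("Found {} error{}.", n, if n == 1 { "" } else { "s" })
}
```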

Warnings

Some issues don’t prevent assembly but should be reported:

#![allow(unused)]
fn main() {
// Zero page address used with absolute mode
pub fn warn_zp_as_absolute(&self, addr: u16, location: Location) {
    if addr <= 0xFF {
        eprintln!(
            "warning: [{}:{}]: address 0x{:02X} could use zero page mode",
            location.line, location.column, addr
        );
    }
}

// Unused symbol
pub fn warn_unused_symbols(&self) {
    for sym in self.symbols.unreferenced_symbols() {
        eprintln!(
            "warning: [{}:{}]: symbol '{}' is defined but never used",
            sym.defined_at.line, sym.defined_at.column, sym.name
        );
    }
}
}
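The unused-symbol warning relies on the symbol table remembering which symbols were looked up. A minimal sketch of one way to track that follows; the `Symbol` and `SymbolTable` shapes here are assumptions for illustration and may not match the table built in Chapter 6.

```rust
use std::collections::BTreeMap;

/// Hypothetical symbol record with a usage flag.
pub struct Symbol {
    pub name: String,
    pub value: u16,
    pub referenced: bool,
}

#[derive(Default)]
pub struct SymbolTable {
    // BTreeMap keeps warnings in a stable, sorted order.
    symbols: BTreeMap<String, Symbol>,
}

impl SymbolTable {
    pub fn define(&mut self, name: &str, value: u16) {
        let symbol = Symbol { name: name.to_string(), value, referenced: false };
        self.symbols.insert(name.to_string(), symbol);
    }

    /// Look up a symbol and mark it as used.
    pub fn lookup(&mut self, name: &str) -> Option<u16> {
        let symbol = self.symbols.get_mut(name)?;
        symbol.referenced = true;
        Some(symbol.value)
    }

    /// Symbols that were defined but never looked up.
    pub fn unreferenced_symbols(&self) -> Vec<&Symbol> {
        self.symbols.values().filter(|s| !s.referenced).collect()
    }
}
```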

CLI Error Formatting

use std::path::Path;

fn format_error(error: &ParseError, source: &str, file: &Path) -> String {
    let loc = error.location();
    let lines: Vec<&str> = source.lines().collect();

    // Header: the message, then the file:line:column pointer
    let mut output = format!(
        "error: {}\n  --> {}:{}:{}\n",
        error, file.display(), loc.line, loc.column
    );

    // Source context (line numbers are 1-based)
    if loc.line > 0 && loc.line <= lines.len() {
        let line = lines[loc.line - 1];
        let line_num = format!("{}", loc.line);
        let padding = " ".repeat(line_num.len());

        output.push_str(&format!("   {} |\n", padding));
        output.push_str(&format!("   {} | {}\n", line_num, line));
        // Underline the offending span with carets
        output.push_str(&format!(
            "   {} | {}{}\n",
            padding,
            " ".repeat(loc.column.saturating_sub(1)),
            "^".repeat(loc.length.max(1))
        ));
    }

    output
}

Complete Error Output Example

error: unexpected token: expected identifier, found Number
  --> game.s:5:6
   |
 5 | .equ 123 456
   |      ^^^ expected identifier after .equ

error: undefined symbol 'plyer_x'
  --> game.s:15:9
   |
15 |     lda plyer_x
   |         ^^^^^^^ did you mean 'player_x'?

error: branch target out of range (offset -150)
  --> game.s:42:5
   |
42 |     bne start
   |     ^^^^^^^^^ target is too far away

Found 3 errors.
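The "did you mean" hint in the second error is not implemented in this chapter. One possible approach, sketched here as an assumption rather than the tutorial's method, is to compare the unknown name against every defined symbol using Levenshtein edit distance and suggest the closest match within a small threshold. The names `edit_distance` and `suggest` and the threshold of 2 are illustrative choices.

```rust
/// Levenshtein edit distance using a single rolling row.
pub fn edit_distance(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    // row[j] holds the distance between the processed prefix of a and b[..j]
    let mut row: Vec<usize> = (0..=b.len()).collect();

    for i in 1..=a.len() {
        let mut prev_diag = row[0];
        row[0] = i;
        for j in 1..=b.len() {
            let cost = if a[i - 1] == b[j - 1] { 0 } else { 1 };
            let next = (row[j] + 1).min(row[j - 1] + 1).min(prev_diag + cost);
            prev_diag = row[j];
            row[j] = next;
        }
    }
    row[b.len()]
}

/// Return the closest known symbol within two edits, if any.
pub fn suggest<'a>(name: &str, known: &[&'a str]) -> Option<&'a str> {
    known
        .iter()
        .copied()
        .map(|candidate| (edit_distance(name, candidate), candidate))
        .filter(|(distance, _)| *distance <= 2)
        .min_by_key(|(distance, _)| *distance)
        .map(|(_, candidate)| candidate)
}
```

The threshold keeps unrelated names from being suggested; a real implementation might scale it with the length of the name.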

Summary

In this chapter, we implemented comprehensive error handling:

  • Three error categories: Scanner, parser, and code generation
  • Location tracking: Every error includes file, line, and column
  • Source context: Show the offending line with underline
  • Error recovery: Continue after errors to find more issues
  • Warnings: Non-fatal issues like unused symbols

In the next chapter, we’ll build the command-line interface for the assembler.



Chapter 12: The Command-Line Interface

In this chapter, we’ll build a usable command-line tool for our assembler.

CLI Design

byte_asm [OPTIONS] <input.s>

OPTIONS:
    -o, --output <file>   Output binary file (default: a.out)
    -v, --verbose         Show assembly progress
    --hex                 Output as hex dump instead of binary
    -h, --help            Show help message

Argument Parsing

We’ll parse arguments manually for simplicity:

#![allow(unused)]
fn main() {
use std::path::PathBuf;

struct Args {
    input: PathBuf,
    output: PathBuf,
    verbose: bool,
    hex_dump: bool,
}

fn parse_args() -> Result<Args, String> {
    let args: Vec<String> = std::env::args().collect();

    if args.len() < 2 {
        return Err(format!(
            "Usage: {} [OPTIONS] <input.s>\n\n\
             OPTIONS:\n\
             -o, --output <file>   Output binary file (default: a.out)\n\
             -v, --verbose         Show assembly progress\n\
             --hex                 Output as hex dump",
            args[0]
        ));
    }

    let mut input: Option<PathBuf> = None;
    let mut output = PathBuf::from("a.out");
    let mut verbose = false;
    let mut hex_dump = false;

    let mut i = 1;
    while i < args.len() {
        match args[i].as_str() {
            "-o" | "--output" => {
                i += 1;
                if i >= args.len() {
                    return Err("Expected output file after -o".to_string());
                }
                output = PathBuf::from(&args[i]);
            }
            "-v" | "--verbose" => verbose = true,
            "--hex" => hex_dump = true,
            "-h" | "--help" => {
                // Help is not an error: print it to stdout and exit with status 0
                println!(
                    "ByteASM - 6502 Assembler\n\n\
                     Usage: {} [OPTIONS] <input.s>\n\n\
                     OPTIONS:\n\
                     -o, --output <file>   Output file (default: a.out)\n\
                     -v, --verbose         Show progress\n\
                     --hex                 Output hex dump\n\
                     -h, --help            Show help",
                    args[0]
                );
                std::process::exit(0);
            }
            arg if arg.starts_with('-') => {
                return Err(format!("Unknown option: {}", arg));
            }
            _ => {
                input = Some(PathBuf::from(&args[i]));
            }
        }
        i += 1;
    }

    let input = input.ok_or("No input file specified")?;

    Ok(Args { input, output, verbose, hex_dump })
}
}

Main Function

use std::fs;

fn main() {
    let args = match parse_args() {
        Ok(args) => args,
        Err(msg) => {
            eprintln!("{}", msg);
            std::process::exit(1);
        }
    };

    if args.verbose {
        eprintln!("Assembling: {}", args.input.display());
    }

    // Read source file
    let source = match fs::read_to_string(&args.input) {
        Ok(s) => s,
        Err(e) => {
            eprintln!("Error reading {}: {}", args.input.display(), e);
            std::process::exit(1);
        }
    };

    // Parse
    let program = match parser::parse(&source) {
        Ok(p) => p,
        Err(errors) => {
            for error in errors {
                eprintln!("{}", format_error(&error, &source, &args.input));
            }
            std::process::exit(1);
        }
    };

    if args.verbose {
        eprintln!("Parsed {} statements", program.statements.len());
    }

    // Assemble
    let mut assembler = Assembler::new();
    let binary = match assembler.assemble(&program) {
        Ok(b) => b,
        Err(e) => {
            eprintln!("{}", format_assembler_error(&e, &source, &args.input));
            std::process::exit(1);
        }
    };

    if args.verbose {
        eprintln!("Generated {} bytes", binary.len());
        eprintln!("Defined {} symbols", assembler.symbols().len());
    }

    // Output
    if args.hex_dump {
        print_hex_dump(&binary);
    } else {
        if let Err(e) = fs::write(&args.output, &binary) {
            eprintln!("Error writing {}: {}", args.output.display(), e);
            std::process::exit(1);
        }

        if args.verbose {
            eprintln!("Wrote: {}", args.output.display());
        }
    }
}

Hex Dump Output

For debugging, we can output a hex dump instead of binary:

#![allow(unused)]
fn main() {
fn print_hex_dump(binary: &[u8]) {
    const BYTES_PER_LINE: usize = 16;

    for (i, chunk) in binary.chunks(BYTES_PER_LINE).enumerate() {
        // Address
        print!("{:04X}  ", i * BYTES_PER_LINE);

        // Hex bytes
        for (j, byte) in chunk.iter().enumerate() {
            print!("{:02X} ", byte);
            if j == 7 { print!(" "); }  // Extra space in middle
        }

        // Padding for incomplete lines
        if chunk.len() < BYTES_PER_LINE {
            let missing = BYTES_PER_LINE - chunk.len();
            for j in 0..missing {
                print!("   ");
                if chunk.len() + j == 7 { print!(" "); }
            }
        }

        // ASCII representation
        print!(" |");
        for byte in chunk {
            if *byte >= 0x20 && *byte < 0x7F {
                print!("{}", *byte as char);
            } else {
                print!(".");
            }
        }
        println!("|");
    }
}
}

Example Output

$ byte_asm --hex example.s

0000  A9 42 85 00 4C 00 80 00  00 00 00 00 00 00 00 00  |.B..L...........|
0010  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
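Printing straight to stdout makes the layout hard to test. A possible variation, shown here as a sketch and not the tutorial's code, formats one row into a `String` so a test can compare it with the expected text. The name `hex_line` is hypothetical, and the function assumes each chunk holds at most 16 bytes.

```rust
/// Format one hex-dump row: offset, 16 hex columns, then ASCII.
pub fn hex_line(offset: usize, chunk: &[u8]) -> String {
    const BYTES_PER_LINE: usize = 16;
    let mut line = format!("{:04X}  ", offset);

    for j in 0..BYTES_PER_LINE {
        match chunk.get(j) {
            Some(byte) => line.push_str(&format!("{:02X} ", byte)),
            None => line.push_str("   "), // pad missing columns
        }
        if j == 7 {
            line.push(' '); // extra space in the middle
        }
    }

    line.push_str(" |");
    for &byte in chunk {
        // Printable ASCII is shown as-is, everything else as a dot
        line.push(if (0x20..0x7F).contains(&byte) { byte as char } else { '.' });
    }
    line.push('|');
    line
}
```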

Integration with Byte Emulator

The output binary is ready to load into the emulator:

# Assemble
cargo run -p byte_asm -- game.s -o game.bin

# Run
cargo run -p byte_emu -- game.bin

Setting Up the Reset Vector

The emulator reads the reset vector from 0xFFFC-0xFFFD:

.equ VID_PTR  0xFD
.equ VID_PAGE 0x01       ; Video page 1 (0x1000 >> 12)

.org 0x8000
reset:
    ; Initialize hardware
    lda #VID_PAGE
    sta VID_PTR         ; Set video page

main_loop:
    ; Game logic
    rti

; Set up vectors
.org 0xFFFC
.dw reset           ; Reset vector - where to start
.dw main_loop       ; IRQ vector - called on VBLANK
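On the loading side, a 6502 fetches each vector as two bytes, low byte first. The sketch below shows how an emulator could read them from a 64 KB memory image; `read_vector` and the constant names are hypothetical and not taken from byte_emu.

```rust
pub const RESET_VECTOR: usize = 0xFFFC;
pub const IRQ_VECTOR: usize = 0xFFFE;

/// Read a little-endian 16-bit vector from a memory image.
/// Panics if `addr + 1` is outside the image.
pub fn read_vector(memory: &[u8], addr: usize) -> u16 {
    u16::from_le_bytes([memory[addr], memory[addr + 1]])
}
```

For the listing above, the two `.dw` directives fill exactly these four bytes, so the reset vector would read back as 0x8000.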

Verbose Output Example

$ byte_asm -v game.s -o game.bin

Assembling: game.s
Parsed 42 statements
Generated 156 bytes
Defined 12 symbols
Wrote: game.bin

Error Output Example

$ byte_asm broken.s

error: undefined symbol 'sprit_x'
  --> broken.s:15:9
   |
15 |     lda sprit_x
   |         ^^^^^^^ symbol not defined

error: branch target out of range
  --> broken.s:42:5
   |
42 |     bne far_label
   |     ^^^^^^^^^^^^^ offset -150, must be -128 to +127

Found 2 errors.

Usage Workflow

A typical workflow:

# Edit source
vim game.s

# Assemble
byte_asm game.s -o game.bin

# Test in emulator
byte_emu game.bin

# Debug with hex dump
byte_asm game.s --hex | less

# Verbose build
byte_asm -v game.s -o game.bin

Exit Codes

// Success
std::process::exit(0);

// Error (parse, assembly, I/O)
std::process::exit(1);

Summary

In this chapter, we built a command-line interface that:

  • Parses command-line arguments
  • Reads source files
  • Invokes the parser and assembler
  • Outputs binary or hex dump
  • Reports errors with context
  • Integrates with the byte emulator

In the next chapter, we’ll write tests to verify our assembler works correctly.



Chapter 13: Testing the Assembler

In this chapter, we’ll write tests to verify our assembler produces correct machine code.

Testing Strategy

We’ll test at multiple levels:

  1. Unit tests: Individual components (scanner, parser, evaluator)
  2. Integration tests: Complete assembly of programs
  3. Comparison tests: Compare output to known-good binaries
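For comparison tests, a failing `assert_eq!` on two long byte vectors is hard to read. A small helper that reports where the output first diverges from the known-good binary can help. This is a sketch; `first_mismatch` is a hypothetical name, not part of byte_asm.

```rust
/// Index of the first differing byte. If one slice is a prefix of the
/// other, the shorter length is returned. `None` means identical.
pub fn first_mismatch(actual: &[u8], expected: &[u8]) -> Option<usize> {
    let common = actual.len().min(expected.len());
    (0..common)
        .find(|&i| actual[i] != expected[i])
        .or(if actual.len() != expected.len() { Some(common) } else { None })
}
```

A comparison test could then assert that `first_mismatch(&binary, &golden)` is `None` and print the offset when it is not.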

Scanner Tests

Test that the scanner produces correct tokens:

#![allow(unused)]
fn main() {
#[cfg(test)]
mod scanner_tests {
    use byte_asm::scanner::*;

    #[test]
    fn test_number_formats() {
        let mut scanner = Scanner::new("0xFF 0b1010 42");

        let tok = scanner.scan_token().unwrap();
        assert_eq!(tok.kind, TokenKind::Number);
        assert_eq!(tok.number(), Some(255));

        let tok = scanner.scan_token().unwrap();
        assert_eq!(tok.kind, TokenKind::Number);
        assert_eq!(tok.number(), Some(10));

        let tok = scanner.scan_token().unwrap();
        assert_eq!(tok.kind, TokenKind::Number);
        assert_eq!(tok.number(), Some(42));
    }

    #[test]
    fn test_instruction() {
        let mut scanner = Scanner::new("lda");
        let tok = scanner.scan_token().unwrap();
        assert_eq!(tok.kind, TokenKind::Instruction);
    }

    #[test]
    fn test_directive() {
        let mut scanner = Scanner::new(".org");
        let tok = scanner.scan_token().unwrap();
        assert_eq!(tok.kind, TokenKind::Directive);
    }

    #[test]
    fn test_local_label() {
        let mut scanner = Scanner::new(".loop @temp");

        let tok = scanner.scan_token().unwrap();
        assert_eq!(tok.kind, TokenKind::LocalLabel);

        let tok = scanner.scan_token().unwrap();
        assert_eq!(tok.kind, TokenKind::LocalLabel);
    }
}
}

Parser Tests

Test that parsing produces correct AST:

#![allow(unused)]
fn main() {
#[cfg(test)]
mod parser_tests {
    use byte_asm::parser;
    use byte_asm::ast::*;

    #[test]
    fn test_parse_instruction() {
        let program = parser::parse("lda #0x42").unwrap();
        assert_eq!(program.statements.len(), 1);

        match &program.statements[0] {
            Statement::Instruction(i) => {
                assert!(matches!(i.operand, Some(Operand::Immediate(_))));
            }
            _ => panic!("Expected instruction"),
        }
    }

    #[test]
    fn test_parse_label() {
        let program = parser::parse("main:").unwrap();
        match &program.statements[0] {
            Statement::Label(l) => {
                assert_eq!(l.name, "main");
                assert!(!l.is_local);
            }
            _ => panic!("Expected label"),
        }
    }

    #[test]
    fn test_parse_addressing_modes() {
        // Immediate
        let p = parser::parse("lda #0x42").unwrap();
        assert!(matches!(
            &p.statements[0],
            Statement::Instruction(i) if matches!(i.operand, Some(Operand::Immediate(_)))
        ));

        // Zero Page / Absolute
        let p = parser::parse("lda 0x80").unwrap();
        assert!(matches!(
            &p.statements[0],
            Statement::Instruction(i) if matches!(i.operand, Some(Operand::Address(_)))
        ));

        // Indirect X
        let p = parser::parse("lda (0x80,x)").unwrap();
        assert!(matches!(
            &p.statements[0],
            Statement::Instruction(i) if matches!(i.operand, Some(Operand::IndirectX(_)))
        ));

        // Indirect Y
        let p = parser::parse("lda (0x80),y").unwrap();
        assert!(matches!(
            &p.statements[0],
            Statement::Instruction(i) if matches!(i.operand, Some(Operand::IndirectY(_)))
        ));
    }
}
}

Assembler Tests

Test that assembly produces correct bytes:

#![allow(unused)]
fn main() {
#[cfg(test)]
mod assembler_tests {
    use byte_asm::{parser, Assembler};

    fn assemble(source: &str) -> Vec<u8> {
        let program = parser::parse(source).unwrap();
        let mut asm = Assembler::new();
        asm.assemble(&program).unwrap()
    }

    #[test]
    fn test_nop() {
        let binary = assemble(".org 0x8000\nnop");
        assert_eq!(binary[0], 0xEA);
    }

    #[test]
    fn test_lda_immediate() {
        let binary = assemble(".org 0x8000\nlda #0x42");
        assert_eq!(&binary[0..2], &[0xA9, 0x42]);
    }

    #[test]
    fn test_lda_zero_page() {
        let binary = assemble(".org 0x8000\nlda 0x80");
        assert_eq!(&binary[0..2], &[0xA5, 0x80]);
    }

    #[test]
    fn test_lda_absolute() {
        let binary = assemble(".org 0x8000\nlda 0x2000");
        assert_eq!(&binary[0..3], &[0xAD, 0x00, 0x20]);
    }

    #[test]
    fn test_jmp() {
        let binary = assemble(".org 0x8000\njmp 0x9000");
        assert_eq!(&binary[0..3], &[0x4C, 0x00, 0x90]);
    }

    #[test]
    fn test_label_resolution() {
        let source = r#"
.org 0x8000
start:
    jmp end
    nop
end:
    rts
"#;
        let binary = assemble(source);
        // JMP end = 4C 04 80 (jmp is 3 bytes, nop is 1, so end is at 0x8004)
        assert_eq!(&binary[0..3], &[0x4C, 0x04, 0x80]);
    }

    #[test]
    fn test_branch() {
        let source = r#"
.org 0x8000
    ldx #5
loop:
    dex
    bne loop
"#;
        let binary = assemble(source);
        // BNE loop: D0 FD (offset -3)
        assert_eq!(&binary[3..5], &[0xD0, 0xFD]);
    }

    #[test]
    fn test_db_bytes() {
        let binary = assemble(".org 0x8000\n.db 0x01, 0x02, 0x03");
        assert_eq!(&binary[0..3], &[0x01, 0x02, 0x03]);
    }

    #[test]
    fn test_db_string() {
        let binary = assemble(".org 0x8000\n.db \"Hi\", 0");
        assert_eq!(&binary[0..3], &[0x48, 0x69, 0x00]);
    }

    #[test]
    fn test_dw() {
        let binary = assemble(".org 0x8000\n.dw 0x1234");
        assert_eq!(&binary[0..2], &[0x34, 0x12]); // Little-endian
    }

    #[test]
    fn test_equ() {
        let source = r#"
.equ VALUE 0x42
.org 0x8000
lda #VALUE
"#;
        let binary = assemble(source);
        assert_eq!(&binary[0..2], &[0xA9, 0x42]);
    }

    #[test]
    fn test_expressions() {
        let source = r#"
.equ BASE 0x1000
.org 0x8000
lda #>BASE
lda #<BASE
lda #10 + 5
"#;
        let binary = assemble(source);
        assert_eq!(&binary[0..6], &[0xA9, 0x10, 0xA9, 0x00, 0xA9, 0x0F]);
    }
}
}

Integration Tests

Test complete programs:

#![allow(unused)]
fn main() {
// tests/integration.rs

use byte_asm::{parser, Assembler};
use std::fs;

fn assemble_file(path: &str) -> Vec<u8> {
    let source = fs::read_to_string(path).unwrap();
    let program = parser::parse(&source).unwrap();
    let mut asm = Assembler::new();
    asm.assemble(&program).unwrap()
}

#[test]
fn test_basic_program() {
    let binary = assemble_file("tests/fixtures/basic.s");
    assert!(!binary.is_empty());
}

#[test]
fn test_all_addressing_modes() {
    let binary = assemble_file("tests/fixtures/addressing_modes.s");
    assert!(!binary.is_empty());
}
}

Test Fixtures

Create test assembly files:

; tests/fixtures/basic.s
.org 0x8000
start:
    lda #0x42
    sta 0x00
    nop
    brk
.org 0xFFFC
.dw start

; tests/fixtures/addressing_modes.s
.org 0x8000
    nop
    asl a
    lda #0xFF
    lda 0x80
    lda 0x80,x
    lda 0x2000
    lda 0x2000,x
    lda 0x2000,y
    lda (0x80,x)
    lda (0x80),y
.org 0xFFFC
.dw 0x8000

Running Tests

# Run all tests
cargo test -p byte_asm

# Run specific test
cargo test -p byte_asm test_lda_immediate

# Run with output
cargo test -p byte_asm -- --nocapture

Test Output

running 22 tests
test assembler::codegen::tests::test_absolute ... ok
test assembler::codegen::tests::test_accumulator ... ok
test assembler::codegen::tests::test_immediate ... ok
test assembler::codegen::tests::test_implied ... ok
test assembler::codegen::tests::test_indexed_x ... ok
test assembler::codegen::tests::test_indirect_x ... ok
test assembler::codegen::tests::test_indirect_y ... ok
test assembler::codegen::tests::test_zero_page ... ok
test assembler::directives::tests::test_db_bytes ... ok
test assembler::directives::tests::test_db_string ... ok
test assembler::directives::tests::test_dw ... ok
test assembler::directives::tests::test_equ ... ok
test assembler::directives::tests::test_org ... ok
test assembler::eval::tests::test_binary_ops ... ok
test assembler::eval::tests::test_current_address ... ok
test assembler::eval::tests::test_identifier ... ok
test assembler::eval::tests::test_number ... ok
test assembler::eval::tests::test_unary_ops ... ok
test symbol::tests::test_constants ... ok
test symbol::tests::test_define_and_lookup ... ok
test symbol::tests::test_duplicate_symbol ... ok
test symbol::tests::test_local_labels ... ok

test result: ok. 22 passed; 0 failed

Summary

In this chapter, we wrote comprehensive tests:

  • Scanner tests: Token recognition for all types
  • Parser tests: AST structure for statements and operands
  • Assembler tests: Machine code output verification
  • Integration tests: Complete program assembly
  • Test fixtures: Reusable assembly test files

In the final chapter, we’ll put it all together with a complete game example.



Chapter 14: Complete Example - A Bouncing Ball Game

In this final chapter, we’ll put everything together by examining a complete game that demonstrates all assembler features.

The Bouncing Ball Demo

This program displays a white ball that bounces around the screen, demonstrating:

  • Hardware initialization
  • Game loop structure
  • Sprite movement and collision
  • Video memory access

Memory Map

; Memory-mapped I/O registers
.equ VID_PTR  0xFD       ; Video page pointer (page number, not high byte)
.equ RANDOM   0xFE       ; Random number generator
.equ INPUT    0xFF       ; Input register

; VRAM location
.equ VRAM     0x1000     ; Start of video RAM
.equ VID_PAGE 0x01       ; Video page number (0x1000 >> 12)

; Screen dimensions
.equ WIDTH    64
.equ HEIGHT   64

; Zero page variables
.equ BALL_X   0x00       ; Ball X position
.equ BALL_Y   0x01       ; Ball Y position
.equ VEL_X    0x02       ; X velocity (signed)
.equ VEL_Y    0x03       ; Y velocity (signed)
.equ OLD_X    0x04       ; Previous X position
.equ OLD_Y    0x05       ; Previous Y position
.equ TEMP     0x06       ; Temporary variable

Using .equ for constants makes the code readable and maintainable.

Initialization

.org 0x8000

reset:
    ; Set video page to VRAM (page 1 = 0x1000-0x1FFF)
    lda #VID_PAGE
    sta VID_PTR

    ; Initialize ball position to center
    lda #32
    sta BALL_X
    sta BALL_Y

    ; Initialize velocity (moving down-right)
    lda #1
    sta VEL_X
    sta VEL_Y

    ; Clear the screen
    jsr clear_screen

The VID_PTR register takes a page number (0-15), where each page is 4KB. Page 1 corresponds to addresses 0x1000-0x1FFF.
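The page arithmetic can be checked with a short Rust sketch. `page_range` and `PAGE_SIZE` are hypothetical names used only for this illustration.

```rust
pub const PAGE_SIZE: u16 = 0x1000; // 4 KB per video page

/// First and last address covered by a video page (0-15).
pub fn page_range(page: u8) -> Option<(u16, u16)> {
    if page > 15 {
        return None;
    }
    let start = (page as u16) << 12;
    Some((start, start + (PAGE_SIZE - 1)))
}
```

Going the other way, `0x1000 >> 12` gives 1, which is the value stored in VID_PAGE.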

The Main Loop

The main loop is driven by the IRQ (VBLANK):

main_loop:
    ; Save old position for erasing
    lda BALL_X
    sta OLD_X
    lda BALL_Y
    sta OLD_Y

    ; Update ball position
    jsr update_ball

    ; Erase old ball
    ldx OLD_X
    ldy OLD_Y
    lda #0          ; black
    jsr draw_pixel

    ; Draw new ball
    ldx BALL_X
    ldy BALL_Y
    lda #1          ; white
    jsr draw_pixel

    ; Wait for next frame
    rti

The rti (return from interrupt) ends the handler. Because the IRQ vector points at main_loop, the next VBLANK interrupt runs the loop again.

Ball Update with Bouncing

update_ball:
    ; Update X position
    lda BALL_X
    clc
    adc VEL_X
    sta BALL_X

    ; Check X bounds
    cmp #WIDTH - 1
    bcs .bounce_x
    cmp #0
    beq .bounce_x
    jmp .check_y

.bounce_x:
    ; Reverse X velocity: VEL_X = 0 - VEL_X
    lda #0
    sec
    sbc VEL_X
    sta VEL_X

    ; Clamp X position
    lda BALL_X
    cmp #WIDTH
    bcc .clamp_x_done
    lda #WIDTH - 2
    sta BALL_X
.clamp_x_done:
    lda BALL_X
    bne .check_y
    lda #1
    sta BALL_X

.check_y:
    ; Similar logic for Y...
    rts

This demonstrates:

  • Local labels (.bounce_x, .check_y)
  • Expression in immediate (#WIDTH - 1)
  • Conditional branching
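To see why negating the velocity works with unsigned bytes, here is a Rust model of the X-axis logic above. It is a sketch for reasoning about the assembly, not code from the project; `step_axis` is a hypothetical name. The velocity is a two's-complement byte, so 0x01 means +1 and 0xFF means -1.

```rust
pub const WIDTH: u8 = 64;

/// One axis of update_ball, modeled with 8-bit wrapping arithmetic.
pub fn step_axis(pos: u8, vel: u8) -> (u8, u8) {
    let mut pos = pos.wrapping_add(vel); // clc / adc VEL_X
    let mut vel = vel;

    // bcs .bounce_x (pos >= WIDTH - 1) or beq .bounce_x (pos == 0)
    if pos >= WIDTH - 1 || pos == 0 {
        vel = 0u8.wrapping_sub(vel); // lda #0 / sec / sbc VEL_X
        if pos >= WIDTH {
            pos = WIDTH - 2; // clamp after overshooting the right edge
        }
        if pos == 0 {
            pos = 1; // clamp at the left edge
        }
    }
    (pos, vel)
}
```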

Pixel Drawing

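; Draw a pixel at (X, Y) with color in A
; X = column (0-63)
; Y = row (0-63)
; A = color
; Clobbers A and Y; uses TEMP + 1 and TEMP + 2 as a zero page pointer
draw_pixel:
    sta TEMP            ; Save color

    ; The offset Y * 64 + X can reach 4095, which does not fit in the
    ; 8-bit accumulator, so build a 16-bit pointer instead.

    ; High byte: (Y * 64) >> 8 = Y / 4, plus the VRAM base page
    tya                 ; A = Y
    lsr a               ; A = Y / 2
    lsr a               ; A = Y / 4
    clc
    adc #>VRAM          ; A = high byte of the pixel address
    sta TEMP + 2

    ; Low byte: (Y & 3) * 64 + X (the low byte of VRAM is 0x00)
    tya                 ; A = Y
    and #3              ; A = Y & 3
    asl a               ; A = (Y & 3) * 2
    asl a               ; A = (Y & 3) * 4
    asl a               ; A = (Y & 3) * 8
    asl a               ; A = (Y & 3) * 16
    asl a               ; A = (Y & 3) * 32
    asl a               ; A = (Y & 3) * 64
    stx TEMP + 1        ; Save X
    clc
    adc TEMP + 1        ; A = (Y & 3) * 64 + X
    sta TEMP + 1        ; Pointer low byte

    ; Store to VRAM through the pointer
    ldy #0
    lda TEMP            ; Get color back
    sta (TEMP + 1),y    ; Write to VRAM

    rts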
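The address arithmetic is easier to verify in a higher-level model. For a 64x64 screen the offset Y * 64 + X ranges up to 4095, so it needs more than 8 bits. The sketch below computes the address directly and also splits it into the low and high bytes a zero page pointer would hold. Both function names are hypothetical, and coordinates are assumed to be in the range 0-63.

```rust
pub const VRAM: u16 = 0x1000;
pub const WIDTH: u16 = 64;

/// Address of pixel (x, y), computed with 16-bit arithmetic.
pub fn pixel_address(x: u8, y: u8) -> u16 {
    VRAM + (y as u16) * WIDTH + (x as u16)
}

/// The same address as (low, high) bytes, using only 8-bit operations.
/// Assumes x and y are both in 0..=63.
pub fn pixel_pointer_bytes(x: u8, y: u8) -> (u8, u8) {
    let low = ((y & 3) << 6) | x; // (Y mod 4) * 64 + X
    let high = (VRAM >> 8) as u8 + (y >> 2); // base page + Y / 4
    (low, high)
}
```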

Screen Clearing

clear_screen:
    ldx #0
    lda #0
.loop:
    sta VRAM,x
    sta VRAM + 0x100,x
    sta VRAM + 0x200,x
    sta VRAM + 0x300,x
    ; ... (more pages)
    inx
    bne .loop
    rts

Using expressions like VRAM + 0x100 makes the code clearer.

Interrupt Vectors

; Set up reset and IRQ vectors
.org 0xFFFC
.dw reset           ; Reset vector
.dw main_loop       ; IRQ vector (VBLANK)

The .dw directive writes 16-bit addresses in little-endian format.

Building and Running

# Assemble the game
cargo run -p byte_asm -- bouncing_ball.s -o game.bin

# Run in emulator
cargo run -p byte_emu -- game.bin

With verbose output:

$ cargo run -p byte_asm -- -v bouncing_ball.s -o game.bin

Assembling: bouncing_ball.s
Parsed 85 statements
Generated 312 bytes
Defined 15 symbols
Wrote: game.bin

Features Demonstrated

This example uses all major assembler features:

Feature         Example
.org            .org 0x8000
.equ            .equ WIDTH 64
.db             (could add strings)
.dw             .dw reset
Labels          reset:, main_loop:
Local labels    .bounce_x:, .loop:
Immediate       lda #32
Zero Page       sta BALL_X
Absolute        sta VRAM,x
Indexed         sta VRAM + 0x100,x
Expressions     #WIDTH - 1, VRAM + 0x100
Branches        bne .loop, bcs .bounce_x
Subroutines     jsr update_ball, rts

Extending the Example

Ideas for additions:

  • User input to control ball direction
  • Multiple balls
  • Score counter
  • Sound effects
  • Paddle for a Pong-like game

Complete Source

See byte_asm/examples/bouncing_ball.s for the full source code.

Summary

In this tutorial, we built a complete 6502 assembler from scratch:

  1. Scanner: Tokenizes source into meaningful chunks
  2. Parser: Builds an AST representing program structure
  3. Symbol Table: Tracks labels and constants
  4. Two-Pass Assembler: Resolves forward references
  5. Code Generator: Emits correct machine code
  6. Expression Evaluator: Computes values at assembly time
  7. Directive Handlers: Process .org, .db, .dw, .equ, and .include
  8. Error Handling: Provides helpful error messages
  9. CLI: User-friendly command-line interface
  10. Testing: Verifies correctness

The result is a fully functional assembler that can build real programs for the Byte fantasy console.

What’s Next?

  • Macros: Add .macro and .endmacro for code reuse
  • Conditional Assembly: Add .if, .else, .endif
  • Include Path: Search multiple directories for includes
  • Listing Files: Generate human-readable assembly listings
  • Debug Symbols: Output symbol tables for debuggers

Happy assembling!

