Building a 6502 Assembler: A Complete Tutorial
Welcome to the Byte Fantasy Console Assembler Tutorial! This comprehensive guide will walk you through building a complete 6502 assembler from scratch.
What You’ll Build
By the end of this tutorial, you’ll have a fully functional assembler that can:
- Parse 6502 assembly source code using the ByteASM language
- Resolve labels and forward references
- Generate machine code binaries
- Handle directives like .org, .db, .dw, .equ, .include
- Evaluate expressions (e.g., label + 5)
- Report meaningful errors with line/column information
Prerequisites
- Basic Rust knowledge (ownership, structs, enums, pattern matching)
- No prior knowledge of assemblers or compilers required
Tutorial Chapters
Part 1: Foundations
Part 2: Parsing
- Designing the Abstract Syntax Tree
- Building the Parser - Structure
- Building the Parser - Implementation
Part 3: Assembly
Part 4: Completion
- Implementing Directives
- Error Handling and Reporting
- The Command-Line Interface
- Testing the Assembler
- Complete Example - A Game
Quick Start
Once you’ve completed the tutorial, you can assemble programs like this:
cargo run -p byte_asm -- bouncing_ball.s -o game.bin
cargo run -p byte_emu -- game.bin
ByteASM Language Overview
ByteASM is a clean, modern 6502 assembly language designed for the Byte fantasy console:
; bouncing_ball.s - A simple demo for Byte console
.equ BALL_X 0x00
.equ BALL_Y 0x01
.equ VRAM 0x1000
.org 0x8000
reset:
lda #32 ; initialize ball position
sta BALL_X
sta BALL_Y
main_loop:
jsr update_ball
jsr draw_ball
rti ; wait for next frame
update_ball:
lda BALL_X
clc
adc #1
sta BALL_X
rts
draw_ball:
ldx BALL_X
lda #1 ; white color
sta VRAM,x
rts
.org 0xFFFC
.dw reset ; reset vector
.dw main_loop ; IRQ vector
Key Features
- Numeric literals: Decimal (255), hexadecimal (0xFF), binary (0b11111111)
- All lowercase mnemonics and registers
- Local labels with . or @ prefix
- Expressions with +, -, *, / operators
- Byte extraction with < (low byte) and > (high byte)
- Current address with $
Project Structure
byte_asm/
├── src/
│ ├── lib.rs # Library exports
│ ├── main.rs # CLI application
│ ├── ast.rs # Abstract Syntax Tree
│ ├── error.rs # Unified error types
│ ├── symbol.rs # Symbol table
│ ├── scanner/ # Lexical analysis
│ ├── parser/ # Syntax analysis
│ └── assembler/ # Code generation
├── tests/ # Integration tests
└── examples/ # Example assembly programs
Let’s begin with Chapter 1: Introduction to Assemblers and the 6502!
Chapter 1: Introduction to Assemblers and the 6502
Welcome to the first chapter of our journey into building a 6502 assembler! In this chapter, we’ll establish the foundational concepts you need before diving into code.
What is an Assembler?
An assembler is a program that translates assembly language source code into machine code that a processor can execute directly.
The Translation Pipeline
Source Code → Assembler → Machine Code
(text) (bytes)
Consider this simple example:
lda #0x42 ; Load 0x42 into accumulator
sta 0x00 ; Store to address 0x00
The assembler transforms this into:
A9 42 85 00
These four bytes are the actual instructions the 6502 CPU will execute.
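As a rough illustration of this translation, the sketch below hard-codes the two opcodes from the example and emits the same four bytes. The `Instr` enum and `encode` function are hypothetical names invented for this illustration; they are not the design built later in the tutorial.

```rust
/// Two instruction forms, just enough for the example above (hypothetical type).
enum Instr {
    LdaImmediate(u8), // lda #value  -> opcode 0xA9
    StaZeroPage(u8),  // sta address -> opcode 0x85
}

/// Append the machine-code bytes for each instruction to a byte vector.
fn encode(program: &[Instr]) -> Vec<u8> {
    let mut bytes = Vec::new();
    for instr in program {
        match instr {
            Instr::LdaImmediate(value) => bytes.extend_from_slice(&[0xA9, *value]),
            Instr::StaZeroPage(addr) => bytes.extend_from_slice(&[0x85, *addr]),
        }
    }
    bytes
}

fn main() {
    let program = [Instr::LdaImmediate(0x42), Instr::StaZeroPage(0x00)];
    let hex: Vec<String> = encode(&program)
        .iter()
        .map(|b| format!("{:02X}", b))
        .collect();
    println!("{}", hex.join(" ")); // prints "A9 42 85 00"
}
```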
Assembly vs Machine Code vs High-Level Languages
Machine Code: Raw bytes that the CPU understands. Each byte has a specific meaning - opcodes, operands, data. Writing directly in machine code is tedious and error-prone.
Assembly Language: A human-readable representation of machine code. Each instruction has a mnemonic (like LDA for “Load Accumulator”), and each combination of mnemonic and addressing mode maps to a specific opcode, so one line of assembly corresponds to exactly one machine instruction.
High-Level Languages: Languages like C, Rust, or Python abstract away machine details. One line of high-level code might compile to dozens of machine instructions.
The 6502 Microprocessor
A Brief History
The 6502, introduced by MOS Technology in 1975, became one of the most influential processors in computing history. It, or a close variant of it, powered:
- Apple II (1977)
- Commodore 64 (1982, using the 6510 variant)
- Nintendo Entertainment System (1983, using the 6502-based Ricoh 2A03)
- Atari 2600 (1977, using the cut-down 6507)
Its simplicity and low cost made it ideal for early personal computers and game consoles.
Architecture Overview
The 6502 is an 8-bit processor with a 16-bit address bus:
- 8-bit: It processes data 8 bits (1 byte) at a time
- 16-bit address bus: It can address up to 64KB of memory (2^16 = 65,536 bytes)
Registers
The 6502 has a small set of registers:
| Register | Size | Purpose |
|---|---|---|
| A (Accumulator) | 8-bit | Main register for arithmetic and logic operations |
| X | 8-bit | Index register, used for addressing and counting |
| Y | 8-bit | Index register, similar to X |
| SP (Stack Pointer) | 8-bit | Points to current position in the stack (page 0x01, addresses 0x0100-0x01FF) |
| PC (Program Counter) | 16-bit | Address of the next instruction to execute |
| P (Status/Flags) | 8-bit | Processor status flags |
Status Flags
The P register contains flags that reflect the result of operations:
7 6 5 4 3 2 1 0
N V - B D I Z C
| Flag | Name | Meaning |
|---|---|---|
| N | Negative | Set if result is negative (bit 7 is 1) |
| V | Overflow | Set on signed arithmetic overflow |
| B | Break | Set by BRK instruction |
| D | Decimal | Enables BCD arithmetic mode |
| I | Interrupt | Disables IRQ when set |
| Z | Zero | Set if result is zero |
| C | Carry | Set when an addition overflows 8 bits; cleared when a subtraction borrows |
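To make the bit layout concrete, here is a small sketch that renders a P register value as text, one character per bit. The function name `describe_flags` and the output format (letter when set, `.` when clear) are choices made for this illustration only.

```rust
/// Render the P register as an 8-character string, bit 7 first.
/// A set flag shows its letter; a clear flag shows '.'.
fn describe_flags(p: u8) -> String {
    // Bit 7 down to bit 0; bit 5 is unused and named '-' in the layout above.
    const NAMES: [char; 8] = ['N', 'V', '-', 'B', 'D', 'I', 'Z', 'C'];
    NAMES
        .iter()
        .enumerate()
        .map(|(i, &name)| {
            let bit = 7 - i;
            if p & (1u8 << bit) != 0 { name } else { '.' }
        })
        .collect()
}

fn main() {
    // Negative and Zero set (not a combination a real result produces,
    // but it shows both ends of the register).
    println!("{}", describe_flags(0b1000_0010)); // prints "N.....Z."
}
```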
6502 Instruction Format
Each 6502 instruction consists of:
- Opcode (1 byte): Identifies the instruction and addressing mode
- Operand (0, 1, or 2 bytes): The data or address the instruction operates on
Instruction Sizes
- 1 byte: Instructions with no operand (implied, accumulator modes)
- 2 bytes: Opcode + 8-bit operand (immediate, zero page, relative)
- 3 bytes: Opcode + 16-bit operand (absolute addressing)
Little-Endian Byte Order
The 6502 uses little-endian byte order for 16-bit values. The low byte comes first:
Address 0x1234 is stored as: 34 12
This is important when emitting word values in our assembler.
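A minimal sketch of emitting a word in this order, assuming the output is collected in a `Vec<u8>`. The helper name `emit_word` is hypothetical; `u16::to_le_bytes` is the standard library call that does the byte split.

```rust
/// Append a 16-bit value to the output in little-endian order (low byte first).
fn emit_word(output: &mut Vec<u8>, value: u16) {
    output.extend_from_slice(&value.to_le_bytes());
}

fn main() {
    let mut output = Vec::new();
    emit_word(&mut output, 0x1234);
    println!("{:02X} {:02X}", output[0], output[1]); // prints "34 12"
}
```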
Addressing Modes
The 6502 supports multiple addressing modes that determine how the operand is interpreted. Each combination of instruction and addressing mode has a unique opcode.
Implied Mode
No operand - the instruction operates on a specific register or performs a fixed action.
nop ; No operation (1 byte: EA)
clc ; Clear carry flag (1 byte: 18)
rts ; Return from subroutine (1 byte: 60)
Accumulator Mode
Operates directly on the A register.
asl a ; Arithmetic shift left on A (1 byte: 0A)
ror a ; Rotate right on A (1 byte: 6A)
Immediate Mode
The operand is the actual value to use.
lda #0xFF ; Load 0xFF into A (2 bytes: A9 FF)
ldx #0x10 ; Load 0x10 into X (2 bytes: A2 10)
The # prefix indicates immediate mode.
Zero Page Mode
The operand is an 8-bit address in the first 256 bytes of memory (page zero).
lda 0x80 ; Load from address 0x0080 (2 bytes: A5 80)
sta 0x00 ; Store to address 0x0000 (2 bytes: 85 00)
Zero page instructions are one byte shorter than their absolute equivalents and typically execute one cycle faster.
Absolute Mode
The operand is a full 16-bit address.
lda 0x2000 ; Load from address 0x2000 (3 bytes: AD 00 20)
jmp 0x8000 ; Jump to address 0x8000 (3 bytes: 4C 00 80)
Indexed Modes
Add an index register to the address:
lda 0x2000,x ; Load from 0x2000 + X (Absolute,X)
lda 0x2000,y ; Load from 0x2000 + Y (Absolute,Y)
lda 0x80,x ; Load from 0x80 + X (Zero Page,X)
Indirect Modes
Use an address stored in memory:
jmp (0x2000) ; Jump to address stored at 0x2000-0x2001 (Indirect)
lda (0x80,x) ; Indexed Indirect: address at (0x80+X)
lda (0x80),y ; Indirect Indexed: (address at 0x80) + Y
Relative Mode
Used only for branch instructions. The operand is a signed 8-bit offset from the next instruction.
beq label ; Branch if zero flag set
bne loop ; Branch if zero flag clear
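Because the offset is measured from the instruction after the branch, the assembler has to subtract the address of the branch plus two (the size of the branch itself). The sketch below shows that arithmetic; `branch_offset` is a hypothetical helper, not a function from the tutorial's code base.

```rust
/// Compute the operand byte for a branch located at `branch_addr` that
/// targets `target`. Returns None when the target is outside the signed
/// 8-bit range (-128..=127) and therefore cannot be reached by a branch.
fn branch_offset(branch_addr: u16, target: u16) -> Option<u8> {
    // Branch instructions are 2 bytes long, so the CPU adds the offset
    // to branch_addr + 2.
    let next = i32::from(branch_addr) + 2;
    let delta = i32::from(target) - next;
    if (-128..=127).contains(&delta) {
        Some(delta as i8 as u8) // two's complement encoding of the offset
    } else {
        None
    }
}

fn main() {
    // A backward branch from 0x8005 to 0x8000 is -7, encoded as 0xF9.
    println!("{:?}", branch_offset(0x8005, 0x8000));
    // A target 4 KB away is out of range.
    println!("{:?}", branch_offset(0x8000, 0x9000));
}
```

An out-of-range result is the kind of condition an assembler would normally report as an error rather than silently truncate.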
The Byte Fantasy Console Memory Map
Our assembler targets the Byte Fantasy Console, which has a specific memory layout:
0x0000 - 0x00FF : Zero Page (fast access RAM)
0x0100 - 0x01FF : Stack
0x0200 - 0x0FFF : General RAM
0x1000 - 0x1FFF : Video RAM (64x64 pixels)
...
0x8000 - 0xFFFB : Program ROM
0xFFFC - 0xFFFD : Reset Vector (16-bit address)
0xFFFE - 0xFFFF : IRQ Vector (16-bit address)
Special Registers
| Address | Name | Purpose |
|---|---|---|
| 0xFD | VID_PTR | Video page pointer (high byte of VRAM address) |
| 0xFE | RANDOM | Random number generator |
| 0xFF | INPUT | Controller input state |
Reset and IRQ Vectors
When the console powers on:
- It reads the 16-bit address at 0xFFFC-0xFFFD
- Jumps to that address (your reset/init code)
When an IRQ occurs (like VBLANK):
- It reads the address at 0xFFFE-0xFFFF
- Jumps to that address (your interrupt handler)
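The vector lookup itself is a two-byte little-endian read. The following sketch models memory as a plain byte slice; `read_vector` is a hypothetical helper and says nothing about how the Byte emulator is actually implemented.

```rust
/// Read a little-endian 16-bit vector from a 64 KB memory image.
/// Assumes `memory` holds the full address space (0x10000 bytes).
fn read_vector(memory: &[u8], address: u16) -> u16 {
    let lo = memory[usize::from(address)];
    let hi = memory[usize::from(address.wrapping_add(1))];
    u16::from_le_bytes([lo, hi])
}

fn main() {
    let mut memory = vec![0u8; 0x10000];
    // Reset vector pointing at 0x8000, stored low byte first.
    memory[0xFFFC] = 0x00;
    memory[0xFFFD] = 0x80;
    println!("reset -> {:#06X}", read_vector(&memory, 0xFFFC));
}
```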
What We’ll Build
Our assembler will:
- Scan source code into tokens (lexical analysis)
- Parse tokens into an Abstract Syntax Tree (syntax analysis)
- Resolve labels and forward references (symbol table)
- Generate machine code bytes (code generation)
- Output a binary file ready for the emulator
Example Input
.org 0x8000
start:
lda #0x42
sta 0x00
jmp start
.org 0xFFFC
.dw start
Example Output
A binary file containing:
- At offset 0x0000 (address 0x8000): A9 42 85 00 4C 00 80
- At offset 0x7FFC (address 0xFFFC): 00 80
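The offsets above follow from subtracting the ROM base address from each target address. The sketch below builds such an image from (address, bytes) segments. It assumes the file covers exactly 0x8000 through 0xFFFF and is zero-filled where nothing is emitted; `ROM_BASE`, `ROM_SIZE`, and `build_image` are names invented for this illustration.

```rust
const ROM_BASE: u16 = 0x8000; // first address stored in the file (from the example)
const ROM_SIZE: usize = 0x8000; // assumption: the image spans 0x8000..=0xFFFF

/// Copy each (address, bytes) segment into a zero-filled ROM image.
/// Panics if a segment starts below ROM_BASE or runs past the end;
/// a real assembler would report these as errors instead.
fn build_image(segments: &[(u16, &[u8])]) -> Vec<u8> {
    let mut image = vec![0u8; ROM_SIZE];
    for (address, bytes) in segments {
        let offset = usize::from(*address - ROM_BASE);
        image[offset..offset + bytes.len()].copy_from_slice(bytes);
    }
    image
}

fn main() {
    let code: &[u8] = &[0xA9, 0x42, 0x85, 0x00, 0x4C, 0x00, 0x80];
    let vector: &[u8] = &[0x00, 0x80];
    let image = build_image(&[(0x8000, code), (0xFFFC, vector)]);
    println!("{} bytes in image", image.len());
}
```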
Summary
In this chapter, we learned:
- An assembler translates human-readable assembly into machine code
- The 6502 is an 8-bit processor with a 16-bit address space
- Instructions consist of opcodes and operands
- Different addressing modes determine how operands are interpreted
- The Byte console has a specific memory map with special registers
In the next chapter, we’ll start building our scanner to tokenize assembly source code.
Chapter 2: Building the Scanner (Lexer)
In this chapter, we’ll build the scanner (also called a lexer) - the first stage of our assembler. The scanner converts source text into a stream of tokens.
What is Lexical Analysis?
The scanner performs lexical analysis: breaking the input stream of characters into meaningful chunks called tokens. Think of tokens as the “words” of our assembly language.
Characters to Tokens
Source: lda #0xFF
Tokens: [INSTRUCTION(LDA)] [HASH] [NUMBER(255)]
Each token has:
- A kind (what type of token it is)
- Optionally a value (the parsed data)
- A location (line, column, start position)
Token Types in ByteASM
Let’s define all the token types our scanner will produce:
// Copy and PartialEq let the parser copy and compare token kinds directly.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TokenKind {
    // Punctuation
    CloseParen, // )
    Colon,      // :
    Comma,      // ,
    OpenParen,  // (
    // Operators
    Hash,        // #
    Minus,       // -
    Plus,        // +
    Slash,       // /
    Star,        // *
    LessThan,    // <
    GreaterThan, // >
    // Special symbols
    Dollar,    // $ (current address)
    Semicolon, // ; (comment start)
    // Literals
    Number, // 0xFF, 0b1010, 42
    String, // "hello"
    // Identifiers and keywords
    Directive,   // .org, .db, etc.
    Instruction, // lda, sta, etc.
    Register,    // a, x, y
    Identifier,  // label names
    LocalLabel,  // .loop or @loop
    // Structure
    Comment, // ; to end of line
    NewLine, // \n
    EOF,     // End of file
}
The Token Structure
Each token carries information about its type, value, and position:
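// Clone is needed because the parser keeps copies of the current and previous tokens.
#[derive(Debug, Clone)]
pub struct Token {
    pub kind: TokenKind,
    pub value: Option<TokenValue>,
    pub location: Location,
}
// Copy and Default are needed because the parser copies locations into errors
// and builds a placeholder with Location::default().
#[derive(Debug, Clone, Copy, Default, PartialEq, Eq)]
pub struct Location {
    pub line: usize,   // Line number (1-indexed)
    pub column: usize, // Column number (1-indexed)
    pub start: usize,  // Byte offset in source
    pub length: usize, // Length in bytes
}
// Directive and Mnemonic are assumed to implement Debug and Clone as well.
#[derive(Debug, Clone)]
pub enum TokenValue {
    Number(u64),
    String(String),
    Directive(Directive),
    Instruction(Mnemonic),
}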
Scanner Architecture
Our scanner uses a cursor to track our position in the source:
use std::iter::Peekable;
use std::str::Chars;

pub struct Scanner<'a> {
    cursor: Cursor<'a>,
    source: &'a str,
}
pub struct Cursor<'a> {
    chars: Peekable<Chars<'a>>,
    line: usize,
    column: usize,
    current: usize, // Current byte position
    start: usize,   // Start of current token
}
The Cursor
The cursor provides these operations:
// Signatures only; method bodies are omitted here.
impl<'a> Cursor<'a> {
    /// Peek at the next character without consuming it
    fn peek(&mut self) -> Option<char>;
    /// Advance and return the next character
    fn advance(&mut self) -> Option<char>;
    /// Mark the start of a new token
    fn sync(&mut self);
    /// Create a Location for the current token
    fn location(&self) -> Location;
    /// Advance the line counter (after newline)
    fn advance_line(&mut self);
}
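One possible way to fill in these operations is sketched below. It is self-contained, so it repeats the `Location` and `Cursor` definitions, and it adds a `start_column` field that the struct shown earlier does not have; that field is an assumption made here so `location()` can report the column where the token began.

```rust
use std::iter::Peekable;
use std::str::Chars;

#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Location {
    pub line: usize,
    pub column: usize,
    pub start: usize,
    pub length: usize,
}

pub struct Cursor<'a> {
    chars: Peekable<Chars<'a>>,
    line: usize,
    column: usize,
    current: usize,
    start: usize,
    start_column: usize, // assumption: not part of the tutorial's struct
}

impl<'a> Cursor<'a> {
    pub fn new(source: &'a str) -> Self {
        Cursor {
            chars: source.chars().peekable(),
            line: 1,
            column: 1,
            current: 0,
            start: 0,
            start_column: 1,
        }
    }
    fn peek(&mut self) -> Option<char> {
        self.chars.peek().copied()
    }
    fn advance(&mut self) -> Option<char> {
        let c = self.chars.next()?;
        self.current += c.len_utf8(); // byte offset, so slicing the source stays valid
        self.column += 1;
        Some(c)
    }
    fn sync(&mut self) {
        self.start = self.current;
        self.start_column = self.column;
    }
    fn location(&self) -> Location {
        Location {
            line: self.line,
            column: self.start_column,
            start: self.start,
            length: self.current - self.start,
        }
    }
    fn advance_line(&mut self) {
        self.line += 1;
        self.column = 1;
    }
}

fn main() {
    let mut cursor = Cursor::new("lda\nrts");
    cursor.sync();
    while let Some(c) = cursor.peek() {
        if !c.is_ascii_alphabetic() {
            break;
        }
        cursor.advance();
    }
    println!("{:?}", cursor.location());
    cursor.advance(); // consume '\n'
    cursor.advance_line();
}
```

Tracking `current` in bytes rather than characters keeps `&source[start..current]` valid even if a comment or string contains non-ASCII text.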
The Main Scanning Loop
The heart of the scanner is the scan_token method:
pub fn scan_token(&mut self) -> Result<Token, ScannerError> {
    self.skip_whitespace();
    self.cursor.sync();
    // make_token is assumed to return a plain Token (as scan_identifier
    // uses it), so the simple cases are wrapped in Ok at the end.
    let token = match self.cursor.advance() {
        None => self.make_token(TokenKind::EOF, None),
        Some(c) => match c {
            // Single-character tokens
            ')' => self.make_token(TokenKind::CloseParen, None),
            '(' => self.make_token(TokenKind::OpenParen, None),
            ',' => self.make_token(TokenKind::Comma, None),
            ':' => self.make_token(TokenKind::Colon, None),
            '#' => self.make_token(TokenKind::Hash, None),
            '+' => self.make_token(TokenKind::Plus, None),
            '-' => self.make_token(TokenKind::Minus, None),
            '*' => self.make_token(TokenKind::Star, None),
            '/' => self.make_token(TokenKind::Slash, None),
            '<' => self.make_token(TokenKind::LessThan, None),
            '>' => self.make_token(TokenKind::GreaterThan, None),
            '$' => self.make_token(TokenKind::Dollar, None),
            // Newline
            '\n' => {
                let token = self.make_token(TokenKind::NewLine, None);
                self.cursor.advance_line();
                token
            }
            // Comment
            ';' => {
                self.scan_comment();
                self.make_token(TokenKind::Comment, None)
            }
            // More complex tokens can fail, so they return their Result directly
            _ => return self.scan_complex_token(c),
        },
    };
    Ok(token)
}
Scanning Numbers
ByteASM supports three number formats:
- 0xFF - Hexadecimal (0x prefix)
- 0b1010 - Binary (0b prefix)
- 42 - Decimal (no prefix)
fn scan_number(&mut self, first_char: char) -> Result<Token, ScannerError> {
    // Check for prefix
    if first_char == '0' {
        match self.cursor.peek() {
            Some('x') | Some('X') => {
                self.cursor.advance();
                return self.scan_hex();
            }
            Some('b') | Some('B') => {
                self.cursor.advance();
                return self.scan_binary();
            }
            _ => {}
        }
    }
    self.scan_decimal()
}
fn scan_hex(&mut self) -> Result<Token, ScannerError> {
    let start = self.cursor.current;
    // Consume hex digits
    while let Some(c) = self.cursor.peek() {
        if c.is_ascii_hexdigit() {
            self.cursor.advance();
        } else {
            break;
        }
    }
    // Must have at least one digit after 0x
    if self.cursor.current == start {
        return Err(ScannerError::NumberExpected {
            line: self.cursor.line,
            column: self.cursor.column,
            symbol: 'x',
        });
    }
    // Parse the hex value. The `?` requires a
    // `From<std::num::ParseIntError>` impl for ScannerError; the only
    // failure left at this point is a value too large for u64.
    let hex_str = &self.source[start..self.cursor.current];
    let value = u64::from_str_radix(hex_str, 16)?;
    Ok(self.make_token(TokenKind::Number, Some(TokenValue::Number(value))))
}
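The binary and decimal cases follow the same pattern as `scan_hex`. As a compact illustration of all three formats, the sketch below parses a complete literal from a string slice instead of a cursor; `parse_number` is a hypothetical standalone helper, not the tutorial's `scan_binary` or `scan_decimal`.

```rust
/// Parse a ByteASM numeric literal: 0x.. (hex), 0b.. (binary), or decimal.
/// Returns None for empty digit runs, invalid digits, or overflow.
fn parse_number(text: &str) -> Option<u64> {
    let (digits, radix) = if let Some(rest) =
        text.strip_prefix("0x").or_else(|| text.strip_prefix("0X"))
    {
        (rest, 16)
    } else if let Some(rest) =
        text.strip_prefix("0b").or_else(|| text.strip_prefix("0B"))
    {
        (rest, 2)
    } else {
        (text, 10)
    };
    // Check digits explicitly: from_str_radix alone would accept a leading '+'.
    if digits.is_empty() || !digits.chars().all(|c| c.is_digit(radix)) {
        return None;
    }
    u64::from_str_radix(digits, radix).ok()
}

fn main() {
    for text in ["255", "0xFF", "0b11111111", "0x", "0b102"] {
        println!("{:>12} -> {:?}", text, parse_number(text));
    }
}
```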
Scanning Identifiers
Identifiers include:
- Label names (main, loop_counter)
- Instruction mnemonics (lda, sta)
- Register names (a, x, y)
fn scan_identifier(&mut self) -> Result<Token, ScannerError> {
    // Consume alphanumeric characters and underscores
    while let Some(c) = self.cursor.peek() {
        if c.is_ascii_alphanumeric() || c == '_' {
            self.cursor.advance();
        } else {
            break;
        }
    }
    let text = &self.source[self.cursor.start..self.cursor.current];
    // Check for register names (case-insensitive)
    let lower = text.to_ascii_lowercase();
    if lower == "a" || lower == "x" || lower == "y" {
        return Ok(self.make_token(TokenKind::Register, None));
    }
    // Check for instruction mnemonics (the lookup expects uppercase)
    if let Ok(mnemonic) = Mnemonic::try_from(text.to_ascii_uppercase().as_str()) {
        return Ok(self.make_token(
            TokenKind::Instruction,
            Some(TokenValue::Instruction(mnemonic)),
        ));
    }
    // It's a general identifier
    Ok(self.make_token(TokenKind::Identifier, None))
}
Scanning Directives and Local Labels
Both directives and local labels start with .:
fn scan_dot(&mut self) -> Result<Token, ScannerError> {
    // Consume the identifier after the dot
    let start = self.cursor.current;
    while let Some(c) = self.cursor.peek() {
        if c.is_ascii_alphanumeric() || c == '_' {
            self.cursor.advance();
        } else {
            break;
        }
    }
    let text = &self.source[start..self.cursor.current];
    // Try to parse as directive
    if let Ok(directive) = Directive::try_from(text.to_uppercase().as_str()) {
        return Ok(self.make_token(
            TokenKind::Directive,
            Some(TokenValue::Directive(directive)),
        ));
    }
    // It's a local label
    Ok(self.make_token(TokenKind::LocalLabel, None))
}
Local labels can also start with @:
'@' => {
    self.scan_identifier_rest();
    Ok(self.make_token(TokenKind::LocalLabel, None))
}
Scanning Strings
Strings appear in .db directives:
fn scan_string(&mut self, quote: char) -> Result<String, ScannerError> {
    let mut result = String::new();
    while let Some(c) = self.cursor.peek() {
        if c == quote || c == '\n' {
            break;
        }
        self.cursor.advance();
        // Handle escape sequences
        if c == '\\' {
            match self.cursor.peek() {
                Some('n') => { result.push('\n'); self.cursor.advance(); }
                Some('r') => { result.push('\r'); self.cursor.advance(); }
                Some('t') => { result.push('\t'); self.cursor.advance(); }
                Some('"') => { result.push('"'); self.cursor.advance(); }
                Some('\'') => { result.push('\''); self.cursor.advance(); }
                Some('\\') => { result.push('\\'); self.cursor.advance(); }
                Some(e) => { result.push(e); self.cursor.advance(); }
                None => continue,
            }
        } else {
            result.push(c);
        }
    }
    // Check for unterminated string
    if self.cursor.peek() != Some(quote) {
        return Err(ScannerError::UnterminatedString {
            line: self.cursor.line,
            column: self.cursor.column,
            quote,
        });
    }
    // Consume closing quote
    self.cursor.advance();
    Ok(result)
}
Handling Comments
Comments run from ; to the end of the line:
fn scan_comment(&mut self) {
    while let Some(c) = self.cursor.peek() {
        if c == '\n' {
            break;
        }
        self.cursor.advance();
    }
}
Whitespace Handling
We skip spaces, tabs, and carriage returns (but not newlines, which are significant):
fn skip_whitespace(&mut self) {
    while let Some(c) = self.cursor.peek() {
        match c {
            ' ' | '\r' | '\t' => { self.cursor.advance(); }
            _ => break,
        }
    }
}
Error Handling
The scanner can produce these errors:
pub enum ScannerError {
    UnknownCharacter { line: usize, column: usize, character: char },
    UnknownDirective { line: usize, column: usize, directive: String },
    NumberExpected { line: usize, column: usize, symbol: char },
    UnterminatedString { line: usize, column: usize, quote: char },
}
Running the Scanner
Let’s trace through scanning this example:
.org 0x8000
start:
lda #0x00
| Input | Token Kind | Value |
|---|---|---|
| .org | Directive | ORG |
| 0x8000 | Number | 32768 |
| \n | NewLine | - |
| start | Identifier | - |
| : | Colon | - |
| \n | NewLine | - |
| lda | Instruction | LDA |
| # | Hash | - |
| 0x00 | Number | 0 |
| \n | NewLine | - |
| (end) | EOF | - |
Summary
In this chapter, we built a scanner that:
- Recognizes all ByteASM token types
- Handles three number formats (hex, binary, decimal)
- Distinguishes between identifiers, instructions, and registers
- Parses directives and local labels
- Handles strings with escape sequences
- Tracks location information for error reporting
In the next chapter, we’ll design the Abstract Syntax Tree that will represent our parsed program.
Chapter 3: Designing the Abstract Syntax Tree
In this chapter, we’ll design the Abstract Syntax Tree (AST) - the data structure that represents a parsed assembly program.
What is an AST?
An Abstract Syntax Tree is a tree representation of the syntactic structure of source code. Unlike the flat list of tokens from the scanner, the AST captures the hierarchical relationships between program elements.
Why Not Just Use Tokens?
Tokens tell us what elements we have, but not how they relate to each other. Consider:
lda (0x80,x)
The tokens are: INSTRUCTION OPENPAREN NUMBER COMMA REGISTER CLOSEPAREN
But what we need to know is:
- This is an instruction: LDA
- With an operand in Indexed Indirect mode: (0x80,x)
- The base address is: 0x80
The AST captures this structure explicitly.
Program Structure
A ByteASM program consists of statements:
pub struct Program {
    pub statements: Vec<Statement>,
    pub source_file: Option<String>,
}
Statement Types
There are three kinds of statements:
pub enum Statement {
    Label(LabelDef),
    Instruction(InstructionStmt),
    Directive(DirectiveStmt),
}
Label Definitions
pub struct LabelDef {
    pub name: String,
    pub is_local: bool, // true for .loop or @loop
    pub location: Location,
}
Examples:
- main: → LabelDef { name: "main", is_local: false }
- .loop: → LabelDef { name: ".loop", is_local: true }
Instruction Statements
pub struct InstructionStmt {
    pub mnemonic: Mnemonic,
    pub operand: Option<Operand>,
    pub location: Location,
}
Examples:
- nop → InstructionStmt { mnemonic: NOP, operand: None }
- lda #0x42 → InstructionStmt { mnemonic: LDA, operand: Some(Immediate(Number(66))) }
Directive Statements
pub enum DirectiveStmt {
    Org {
        address: Expression,
        location: Location,
    },
    Db {
        values: Vec<DataValue>,
        location: Location,
    },
    Dw {
        values: Vec<Expression>,
        location: Location,
    },
    Equ {
        name: String,
        value: Expression,
        location: Location,
    },
    Include {
        path: String,
        location: Location,
    },
}
Operands and Addressing Modes
The operand type directly encodes the addressing mode:
pub enum Operand {
    // #value - Immediate mode
    Immediate(Expression),
    // a - Accumulator mode (for ASL, ROL, etc.)
    Accumulator,
    // address - Zero Page or Absolute
    Address(Expression),
    // address,x - Indexed by X
    IndexedX(Expression),
    // address,y - Indexed by Y
    IndexedY(Expression),
    // (address) - Indirect (JMP only)
    Indirect(Expression),
    // (zp,x) - Indexed Indirect
    IndirectX(Expression),
    // (zp),y - Indirect Indexed
    IndirectY(Expression),
}
Addressing Mode Mapping
| Syntax | Operand Type | 6502 Mode |
|---|---|---|
| (none) | None | Implied |
| a | Accumulator | Accumulator |
| #expr | Immediate(expr) | Immediate |
| expr | Address(expr) | Zero Page or Absolute |
| expr,x | IndexedX(expr) | Zero Page,X or Absolute,X |
| expr,y | IndexedY(expr) | Zero Page,Y or Absolute,Y |
| (expr) | Indirect(expr) | Indirect |
| (expr,x) | IndirectX(expr) | Indexed Indirect |
| (expr),y | IndirectY(expr) | Indirect Indexed |
Note: The distinction between Zero Page and Absolute is determined during code generation based on the expression value.
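A minimal sketch of that decision for a single instruction, assuming the expression has already been evaluated to a 16-bit value. `AddressSize`, `address_size`, and `encode_lda` are hypothetical names; the opcodes 0xA5 and 0xAD are the LDA zero page and absolute opcodes shown in Chapter 1. The sketch ignores forward references, whose value is not yet known when the size must be chosen.

```rust
#[derive(Debug, PartialEq)]
enum AddressSize {
    ZeroPage, // operand fits in one byte
    Absolute, // operand needs two bytes
}

/// Pick the shorter encoding whenever the address is in page zero.
fn address_size(value: u16) -> AddressSize {
    if value <= 0x00FF {
        AddressSize::ZeroPage
    } else {
        AddressSize::Absolute
    }
}

/// Encode `lda <address>` using the size chosen above.
fn encode_lda(address: u16) -> Vec<u8> {
    match address_size(address) {
        AddressSize::ZeroPage => vec![0xA5, address as u8],
        AddressSize::Absolute => {
            let [lo, hi] = address.to_le_bytes();
            vec![0xAD, lo, hi]
        }
    }
}

fn main() {
    println!("{:02X?}", encode_lda(0x0080)); // prints "[A5, 80]"
    println!("{:02X?}", encode_lda(0x2000)); // prints "[AD, 00, 20]"
}
```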
Expressions
Expressions represent numeric values that can be computed:
pub enum Expression {
    // Literal number: 0xFF, 255, 0b1010
    Number(i64),
    // Label or constant name
    Identifier(String),
    // Local label: .loop, @loop
    LocalIdentifier(String),
    // Binary operation: left op right
    Binary {
        left: Box<Expression>,
        op: BinaryOp,
        right: Box<Expression>,
    },
    // Unary operation: op operand
    Unary {
        op: UnaryOp,
        operand: Box<Expression>,
    },
    // Current address: $
    CurrentAddress,
}
Binary Operators
pub enum BinaryOp {
    Add, // +
    Sub, // -
    Mul, // *
    Div, // /
}
Unary Operators
pub enum UnaryOp {
    Neg,    // - (negation)
    LoByte, // < (extract low byte)
    HiByte, // > (extract high byte)
}
Expression Examples
| Source | AST |
|---|---|
| 42 | Number(42) |
| label | Identifier("label") |
| .loop | LocalIdentifier(".loop") |
| $ | CurrentAddress |
| 10 + 5 | Binary { left: Number(10), op: Add, right: Number(5) } |
| <0x1234 | Unary { op: LoByte, operand: Number(0x1234) } |
| >label | Unary { op: HiByte, operand: Identifier("label") } |
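To show how such a tree turns into a number, here is a hedged sketch of a recursive evaluator. It redefines a reduced `Expression` (without `LocalIdentifier`) so the example is self-contained, and it assumes symbols live in a `HashMap<String, i64>`; the `eval` function and its signature are illustrative, not the tutorial's implementation.

```rust
use std::collections::HashMap;

enum BinaryOp { Add, Sub, Mul, Div }
enum UnaryOp { Neg, LoByte, HiByte }

enum Expression {
    Number(i64),
    Identifier(String),
    Binary { left: Box<Expression>, op: BinaryOp, right: Box<Expression> },
    Unary { op: UnaryOp, operand: Box<Expression> },
    CurrentAddress,
}

/// Evaluate an expression. Returns None for unknown symbols,
/// arithmetic overflow, or division by zero.
fn eval(expr: &Expression, symbols: &HashMap<String, i64>, pc: i64) -> Option<i64> {
    match expr {
        Expression::Number(n) => Some(*n),
        Expression::Identifier(name) => symbols.get(name).copied(),
        Expression::CurrentAddress => Some(pc),
        Expression::Binary { left, op, right } => {
            let l = eval(left, symbols, pc)?;
            let r = eval(right, symbols, pc)?;
            match op {
                BinaryOp::Add => l.checked_add(r),
                BinaryOp::Sub => l.checked_sub(r),
                BinaryOp::Mul => l.checked_mul(r),
                BinaryOp::Div => l.checked_div(r),
            }
        }
        Expression::Unary { op, operand } => {
            let v = eval(operand, symbols, pc)?;
            match op {
                UnaryOp::Neg => v.checked_neg(),
                UnaryOp::LoByte => Some(v & 0xFF),
                UnaryOp::HiByte => Some((v >> 8) & 0xFF),
            }
        }
    }
}

fn main() {
    let mut symbols = HashMap::new();
    symbols.insert("label".to_string(), 0x1234);
    // >label
    let hi = Expression::Unary {
        op: UnaryOp::HiByte,
        operand: Box::new(Expression::Identifier("label".to_string())),
    };
    println!("{:?}", eval(&hi, &symbols, 0x8000));
}
```

Returning `Option` keeps the sketch short; a real assembler would more likely return an error that carries the source location.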
Data Values for .db
The .db directive can contain bytes or strings:
pub enum DataValue {
    Byte(Expression),
    String(String),
}
Example:
.db "Hello", 0x0A, 0
Parses to:
Db {
    values: vec![
        DataValue::String("Hello"),
        DataValue::Byte(Number(0x0A)),
        DataValue::Byte(Number(0)),
    ]
}
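For a sense of what these values become at code generation time, the sketch below flattens a list of data values into bytes. It uses a simplified stand-in for `DataValue` in which byte values are already evaluated to `u8`, and it assumes strings are emitted as their raw bytes with no terminator added; both points are assumptions made for the illustration.

```rust
/// Simplified stand-in for the tutorial's DataValue: bytes are already
/// evaluated to u8 here instead of holding an Expression (assumption).
enum DataValue {
    Byte(u8),
    Text(String),
}

/// Flatten .db arguments into the bytes that would be emitted.
fn encode_db(values: &[DataValue]) -> Vec<u8> {
    let mut bytes = Vec::new();
    for value in values {
        match value {
            DataValue::Byte(b) => bytes.push(*b),
            // Assumption: strings contribute their raw bytes, no terminator.
            DataValue::Text(s) => bytes.extend_from_slice(s.as_bytes()),
        }
    }
    bytes
}

fn main() {
    // Mirrors: .db "Hello", 0x0A, 0
    let values = [
        DataValue::Text("Hello".to_string()),
        DataValue::Byte(0x0A),
        DataValue::Byte(0),
    ];
    println!("{:02X?}", encode_db(&values));
}
```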
Complete AST Example
Let’s trace through parsing this program:
.equ SCREEN 0x1000
.org 0x8000
start:
lda #>SCREEN
sta 0xFD
.loop:
jmp .loop
.org 0xFFFC
.dw start
The AST would be:
Program {
    statements: [
        Directive(Equ {
            name: "SCREEN",
            value: Number(0x1000),
        }),
        Directive(Org {
            address: Number(0x8000),
        }),
        Label(LabelDef {
            name: "start",
            is_local: false,
        }),
        Instruction(InstructionStmt {
            mnemonic: LDA,
            operand: Some(Immediate(
                Unary { op: HiByte, operand: Identifier("SCREEN") }
            )),
        }),
        Instruction(InstructionStmt {
            mnemonic: STA,
            operand: Some(Address(Number(0xFD))),
        }),
        Label(LabelDef {
            name: ".loop",
            is_local: true,
        }),
        Instruction(InstructionStmt {
            mnemonic: JMP,
            operand: Some(Address(LocalIdentifier(".loop"))),
        }),
        Directive(Org {
            address: Number(0xFFFC),
        }),
        Directive(Dw {
            values: [Identifier("start")],
        }),
    ],
    source_file: Some("example.s"),
}
Design Principles
1. Preserve Source Information
Every node includes a Location so we can report errors with line numbers:
impl Statement {
    pub fn location(&self) -> &Location {
        match self {
            Statement::Label(l) => &l.location,
            Statement::Instruction(i) => &i.location,
            Statement::Directive(d) => d.location(),
        }
    }
}
2. Explicit Over Implicit
Rather than pairing a generic operand value with a separate mode tag (such as a string), we give each addressing mode its own enum variant that carries its data, which makes the structure clear:
// Good: Structure is explicit
Operand::IndirectY(Expression::Number(0x80))
// Bad: Structure is implicit
Operand { mode: "indirect_y", value: 0x80 }
3. Expressions Are Recursive
Using Box<Expression> allows expressions to be nested:
// Represents: (label + offset) * 2
Binary {
    left: Box::new(Binary {
        left: Box::new(Identifier("label")),
        op: Add,
        right: Box::new(Identifier("offset")),
    }),
    op: Mul,
    right: Box::new(Number(2)),
}
4. Helper Constructors
We provide convenient constructors:
impl Expression {
    pub fn binary(left: Expression, op: BinaryOp, right: Expression) -> Self {
        Expression::Binary {
            left: Box::new(left),
            op,
            right: Box::new(right),
        }
    }
    pub fn unary(op: UnaryOp, operand: Expression) -> Self {
        Expression::Unary {
            op,
            operand: Box::new(operand),
        }
    }
}
Summary
In this chapter, we designed an AST that:
- Represents the complete structure of a ByteASM program
- Distinguishes between labels, instructions, and directives
- Encodes addressing modes through operand types
- Supports recursive expressions with operators
- Preserves source location for error reporting
In the next chapter, we’ll start building the parser that constructs this AST from tokens.
Chapter 4: Building the Parser - Structure
In this chapter, we’ll set up the infrastructure for our parser. The parser transforms tokens into an Abstract Syntax Tree.
Parser Design Pattern
We’ll use recursive descent parsing - a simple and intuitive approach where:
- Each grammar rule becomes a function
- Functions call each other to match nested structures
- We look ahead at tokens to decide which rule to apply
This approach works well for assembly language because the grammar is simple and unambiguous.
The Parser State
Our parser maintains this state:
pub struct Parser<'a, 'b> {
    scanner: &'a mut Scanner<'b>,
    source: &'b str,
    current: Token,
    previous: Token,
    errors: Vec<ParseError>,
}
Let’s understand each field:
Scanner Reference
scanner: &'a mut Scanner<'b>,
The parser doesn’t store tokens in advance. It asks the scanner for tokens one at a time. This is more memory-efficient and allows for streaming parsing.
Source Reference
source: &'b str,
We keep a reference to the original source so we can extract text for identifiers and error messages.
Current and Previous Tokens
current: Token,
previous: Token,
- current: The token we’re about to process
- previous: The token we just processed
This gives us one token of lookahead, which is enough for ByteASM’s grammar.
Error Collection
errors: Vec<ParseError>,
Rather than stopping at the first error, we collect errors and continue parsing. This lets us report multiple problems in one pass.
Core Parser Utilities
Creating the Parser
impl<'a, 'b> Parser<'a, 'b> {
    pub fn new(scanner: &'a mut Scanner<'b>, source: &'b str) -> Self {
        // Get the first token
        let current = scanner.scan_token().unwrap_or_else(|_| Token {
            kind: TokenKind::EOF,
            value: None,
            location: Location::default(),
        });
        Self {
            scanner,
            source,
            current: current.clone(),
            previous: current,
            errors: Vec::new(),
        }
    }
}
We immediately scan the first token so current is ready.
Advancing Through Tokens
pub fn advance(&mut self) -> Token {
    let fallback_location = self.current.location;
    // Move current to previous, scan new current
    self.previous = std::mem::replace(
        &mut self.current,
        self.scanner.scan_token().unwrap_or_else(|_| Token {
            kind: TokenKind::EOF,
            value: None,
            location: fallback_location,
        }),
    );
    self.previous.clone()
}
advance() returns the old current token (now previous) and loads the next token.
Checking Token Types
pub fn check(&self, kind: TokenKind) -> bool {
    self.current.kind == kind
}
pub fn is_at_end(&self) -> bool {
    self.current.kind == TokenKind::EOF
}
Expecting Specific Tokens
pub fn expect(&mut self, kind: TokenKind, expected: &str) -> ParseResult<Token> {
    if self.check(kind) {
        Ok(self.advance())
    } else {
        Err(ParseError::UnexpectedToken {
            expected: expected.to_string(),
            found: self.current.kind,
            location: self.current.location,
        })
    }
}
expect() is used when we know what must come next. If the token doesn’t match, we report an error.
Error Types
Our parser can produce these errors:
pub enum ParseError {
    UnexpectedToken {
        expected: String,
        found: TokenKind,
        location: Location,
    },
    UnexpectedEof {
        location: Location,
    },
    InvalidOperand {
        message: String,
        location: Location,
    },
    InvalidExpression {
        message: String,
        location: Location,
    },
    InvalidDirective {
        message: String,
        location: Location,
    },
    InvalidLabel {
        message: String,
        location: Location,
    },
}
Each error includes a Location for precise error reporting.
Getting Error Location
impl ParseError {
    pub fn location(&self) -> &Location {
        match self {
            ParseError::UnexpectedToken { location, .. } => location,
            ParseError::UnexpectedEof { location } => location,
            ParseError::InvalidOperand { location, .. } => location,
            ParseError::InvalidExpression { location, .. } => location,
            ParseError::InvalidDirective { location, .. } => location,
            ParseError::InvalidLabel { location, .. } => location,
        }
    }
}
Error Recovery
When we encounter an error, we don’t want to stop immediately. We use panic mode recovery - skip tokens until we find a safe point to continue.
For assembly language, the safe point is the next line:
fn synchronize(&mut self) {
    while !self.is_at_end() {
        if self.check(TokenKind::NewLine) {
            self.advance();
            return;
        }
        self.advance();
    }
}
This means:
- When an error occurs, add it to errors
- Call synchronize() to skip to the next line
- Continue parsing
- At the end, return all collected errors
The Main Parse Loop
pub fn parse(&mut self) -> Result<Program, Vec<ParseError>> {
    let mut program = Program::new();
    while !self.is_at_end() {
        // Skip blank lines
        self.skip_empty_lines();
        if self.is_at_end() {
            break;
        }
        // Try to parse a line
        match self.parse_line() {
            Ok(statements) => {
                program.statements.extend(statements);
            }
            Err(e) => {
                self.errors.push(e);
                self.synchronize();
            }
        }
    }
    if self.errors.is_empty() {
        Ok(program)
    } else {
        Err(std::mem::take(&mut self.errors))
    }
}
Skipping Empty Lines
fn skip_empty_lines(&mut self) {
    while self.check(TokenKind::NewLine) || self.check(TokenKind::Comment) {
        self.advance();
    }
}
Line Structure
A single line can contain:
- Just a label: main:
- Just an instruction: nop
- Just a directive: .org 0x8000
- A label followed by an instruction: loop: dex
Our parse_line() function handles all these cases:
#![allow(unused)]
fn main() {
fn parse_line(&mut self) -> ParseResult<Vec<Statement>> {
let mut statements = Vec::new();
// Check for label
if self.is_label_start() {
statements.push(self.parse_label()?);
}
// Check for instruction or directive
if self.check(TokenKind::Instruction) {
statements.push(self.parse_instruction()?);
} else if self.check(TokenKind::Directive) {
statements.push(self.parse_directive()?);
}
// Expect end of line
if !self.is_at_end() &&
!self.check(TokenKind::NewLine) &&
!self.check(TokenKind::Comment)
{
return Err(ParseError::UnexpectedToken {
expected: "end of line".to_string(),
found: self.current.kind,
location: self.current.location,
});
}
// Consume newline if present
if self.check(TokenKind::NewLine) {
self.advance();
}
Ok(statements)
}
}
Detecting Labels
A label is an identifier (or local label) followed by a colon:
#![allow(unused)]
fn main() {
fn is_label_start(&self) -> bool {
(self.check(TokenKind::Identifier) || self.check(TokenKind::LocalLabel))
// We need to look at what follows to confirm it's a label
// For now, we try to parse and see
}
}
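If the token stream allows one token of lookahead, the colon can be confirmed up front. The sketch below assumes the tokens are available as a slice, which is not how the streaming scanner in this chapter works; the TokenKind enum here is a cut-down, hypothetical copy of the real one.

```rust
// Hypothetical token kinds, mirroring the names used in this chapter.
#[derive(Debug, Clone, Copy, PartialEq)]
enum TokenKind {
    Identifier,
    LocalLabel,
    Colon,
    Instruction,
    NewLine,
}

/// Returns true when the tokens starting at `pos` look like `name:`.
fn is_label_start(tokens: &[TokenKind], pos: usize) -> bool {
    // The first token must be a global or local label name.
    let is_name = matches!(
        tokens.get(pos),
        Some(TokenKind::Identifier | TokenKind::LocalLabel)
    );
    // One token of lookahead confirms the colon.
    is_name && tokens.get(pos + 1) == Some(&TokenKind::Colon)
}
```

With a streaming scanner the same idea needs a peek buffer; without one, parse_label() reports the missing colon as an InvalidLabel error instead.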
Result Type
#![allow(unused)]
fn main() {
pub type ParseResult<T> = Result<T, ParseError>;
}
Public API
The parser module exposes a simple function:
#![allow(unused)]
fn main() {
pub fn parse(source: &str) -> Result<Program, Vec<ParseError>> {
let mut scanner = Scanner::new(source);
let mut parser = Parser::new(&mut scanner, source);
parser.parse()
}
}
Usage:
#![allow(unused)]
fn main() {
let source = ".org 0x8000\nlda #0x42";
let program = byte_asm::parser::parse(source)?;
}
Summary
In this chapter, we set up the parser infrastructure:
- Parser state: scanner reference, current/previous tokens, error list
- Core utilities: advance(), check(), expect()
- Error types: with location information for each error
- Error recovery: synchronize on newlines to continue after errors
- Main loop: parse lines until EOF, collecting errors
In the next chapter, we’ll implement the actual parsing logic for labels, instructions, and directives.
Chapter 5: Building the Parser - Implementation
In this chapter, we’ll implement the complete parser for ByteASM, building on the infrastructure from Chapter 4.
Parsing Labels
Labels are identifiers followed by a colon:
#![allow(unused)]
fn main() {
fn parse_label(&mut self) -> ParseResult<Statement> {
let token = self.advance();
let name = token.text(self.source).to_string();
let is_local = token.kind == TokenKind::LocalLabel;
let location = token.location;
// Expect colon after label name
if !self.check(TokenKind::Colon) {
return Err(ParseError::InvalidLabel {
message: "expected ':' after label name".to_string(),
location,
});
}
self.advance(); // consume colon
Ok(Statement::Label(LabelDef {
name,
is_local,
location,
}))
}
}
Examples
| Input | Result |
|---|---|
| main: | LabelDef { name: "main", is_local: false } |
| .loop: | LabelDef { name: ".loop", is_local: true } |
| @temp: | LabelDef { name: "@temp", is_local: true } |
Parsing Instructions
Instructions have an optional operand:
#![allow(unused)]
fn main() {
fn parse_instruction(&mut self) -> ParseResult<Statement> {
let token = self.advance();
let mnemonic = token.mnemonic().unwrap();
let location = token.location;
// Check if there's an operand
let operand = if self.has_operand() {
Some(self.parse_operand()?)
} else {
None
};
Ok(Statement::Instruction(InstructionStmt {
mnemonic,
operand,
location,
}))
}
fn has_operand(&self) -> bool {
matches!(
self.current.kind,
TokenKind::Hash
| TokenKind::OpenParen
| TokenKind::Identifier
| TokenKind::LocalLabel
| TokenKind::Number
| TokenKind::Dollar
| TokenKind::LessThan
| TokenKind::GreaterThan
| TokenKind::Register
)
}
}
Parsing Operands
The operand determines the addressing mode. Here’s the decision tree:
Token → Operand Type
──────────────────────────────────
# → Immediate
a (register) → Accumulator
( → Indirect, IndirectX, or IndirectY
other → Address, IndexedX, or IndexedY
#![allow(unused)]
fn main() {
fn parse_operand(&mut self) -> ParseResult<Operand> {
// Immediate: #expr
if self.check(TokenKind::Hash) {
self.advance();
let expr = self.parse_expression()?;
return Ok(Operand::Immediate(expr));
}
// Accumulator: a
if self.check(TokenKind::Register) {
let text = self.current.text(self.source).to_lowercase();
if text == "a" {
self.advance();
return Ok(Operand::Accumulator);
}
}
// Indirect modes: (...)
if self.check(TokenKind::OpenParen) {
return self.parse_indirect_operand();
}
// Address or indexed: expr or expr,x or expr,y
self.parse_address_operand()
}
}
Parsing Address Operands
#![allow(unused)]
fn main() {
fn parse_address_operand(&mut self) -> ParseResult<Operand> {
let expr = self.parse_expression()?;
// Check for indexing
if self.check(TokenKind::Comma) {
self.advance();
if self.check(TokenKind::Register) {
let reg = self.current.text(self.source).to_lowercase();
self.advance();
return match reg.as_str() {
"x" => Ok(Operand::IndexedX(expr)),
"y" => Ok(Operand::IndexedY(expr)),
_ => Err(ParseError::InvalidOperand {
message: format!("expected 'x' or 'y', found '{}'", reg),
location: self.previous.location,
}),
};
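}
// A comma must be followed by an index register
return Err(ParseError::InvalidOperand {
message: "expected 'x' or 'y' after ','".to_string(),
location: self.current.location,
});
}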
Ok(Operand::Address(expr))
}
}
Parsing Indirect Operands
Indirect operands have three forms:
- (addr) - Indirect
- (zp,x) - Indexed Indirect
- (zp),y - Indirect Indexed
#![allow(unused)]
fn main() {
fn parse_indirect_operand(&mut self) -> ParseResult<Operand> {
self.advance(); // consume '('
let expr = self.parse_expression()?;
// (zp,x) - Indexed Indirect
if self.check(TokenKind::Comma) {
self.advance();
if self.check(TokenKind::Register) {
let reg = self.current.text(self.source).to_lowercase();
self.advance();
if reg != "x" {
return Err(ParseError::InvalidOperand {
message: "indexed indirect only supports X register".to_string(),
location: self.previous.location,
});
}
self.expect(TokenKind::CloseParen, "')'")?;
return Ok(Operand::IndirectX(expr));
}
}
self.expect(TokenKind::CloseParen, "')'")?;
// (addr),y - Indirect Indexed
if self.check(TokenKind::Comma) {
self.advance();
if self.check(TokenKind::Register) {
let reg = self.current.text(self.source).to_lowercase();
self.advance();
if reg != "y" {
return Err(ParseError::InvalidOperand {
message: "indirect indexed only supports Y register".to_string(),
location: self.previous.location,
});
}
return Ok(Operand::IndirectY(expr));
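}
// A comma after ')' must be followed by the Y register
return Err(ParseError::InvalidOperand {
message: "expected 'y' after ','".to_string(),
location: self.current.location,
});
}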
// (addr) - Plain Indirect
Ok(Operand::Indirect(expr))
}
}
Parsing Expressions
Expressions follow standard precedence rules:
- *, / bind tighter than +, -
- Unary operators (-, <, >) bind tightest
Expression Grammar
expression → additive
additive → multiplicative ( ('+' | '-') multiplicative )*
multiplicative → unary ( ('*' | '/') unary )*
unary → ('-' | '<' | '>') unary | primary
primary → NUMBER | IDENTIFIER | LOCAL_LABEL | '$' | '(' expression ')'
Implementation
#![allow(unused)]
fn main() {
pub fn parse_expression(&mut self) -> ParseResult<Expression> {
self.parse_additive()
}
fn parse_additive(&mut self) -> ParseResult<Expression> {
let mut left = self.parse_multiplicative()?;
while self.check(TokenKind::Plus) || self.check(TokenKind::Minus) {
let op = if self.check(TokenKind::Plus) {
self.advance();
BinaryOp::Add
} else {
self.advance();
BinaryOp::Sub
};
let right = self.parse_multiplicative()?;
left = Expression::binary(left, op, right);
}
Ok(left)
}
fn parse_multiplicative(&mut self) -> ParseResult<Expression> {
let mut left = self.parse_unary()?;
while self.check(TokenKind::Star) || self.check(TokenKind::Slash) {
let op = if self.check(TokenKind::Star) {
self.advance();
BinaryOp::Mul
} else {
self.advance();
BinaryOp::Div
};
let right = self.parse_unary()?;
left = Expression::binary(left, op, right);
}
Ok(left)
}
fn parse_unary(&mut self) -> ParseResult<Expression> {
if self.check(TokenKind::Minus) {
self.advance();
let operand = self.parse_unary()?;
return Ok(Expression::unary(UnaryOp::Neg, operand));
}
if self.check(TokenKind::LessThan) {
self.advance();
let operand = self.parse_unary()?;
return Ok(Expression::unary(UnaryOp::LoByte, operand));
}
if self.check(TokenKind::GreaterThan) {
self.advance();
let operand = self.parse_unary()?;
return Ok(Expression::unary(UnaryOp::HiByte, operand));
}
self.parse_primary()
}
fn parse_primary(&mut self) -> ParseResult<Expression> {
// Number literal
if self.check(TokenKind::Number) {
let token = self.advance();
let value = token.number().unwrap_or(0);
return Ok(Expression::Number(value as i64));
}
// Identifier
if self.check(TokenKind::Identifier) {
let token = self.advance();
let name = token.text(self.source).to_string();
return Ok(Expression::Identifier(name));
}
// Local label
if self.check(TokenKind::LocalLabel) {
let token = self.advance();
let name = token.text(self.source).to_string();
return Ok(Expression::LocalIdentifier(name));
}
// Current address ($)
if self.check(TokenKind::Dollar) {
self.advance();
return Ok(Expression::CurrentAddress);
}
// Parenthesized expression
if self.check(TokenKind::OpenParen) {
self.advance();
let expr = self.parse_expression()?;
self.expect(TokenKind::CloseParen, "')'")?;
return Ok(expr);
}
Err(ParseError::InvalidExpression {
message: format!("expected expression, found {:?}", self.current.kind),
location: self.current.location,
})
}
}
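The tree the parser builds already encodes precedence: for 2 + 3 * 4 the multiplication ends up as the right child of the addition. The sketch below evaluates such trees. Expr, UnaryOp and BinaryOp are simplified stand-ins for the tutorial's AST types (no identifiers, no current-address node), and eval is a hypothetical helper, not the assembler's real evaluator.

```rust
// Simplified stand-ins for the tutorial's Expression, UnaryOp and BinaryOp types.
#[derive(Debug)]
enum Expr {
    Number(i64),
    Unary(UnaryOp, Box<Expr>),
    Binary(Box<Expr>, BinaryOp, Box<Expr>),
}

#[derive(Debug, Clone, Copy)]
enum UnaryOp {
    Neg,
    LoByte, // `<expr`: bits 0-7
    HiByte, // `>expr`: bits 8-15
}

#[derive(Debug, Clone, Copy)]
enum BinaryOp {
    Add,
    Sub,
    Mul,
    Div,
}

/// Evaluates an expression tree.
/// Returns None on arithmetic overflow or division by zero.
fn eval(expr: &Expr) -> Option<i64> {
    match expr {
        Expr::Number(n) => Some(*n),
        Expr::Unary(op, inner) => {
            let v = eval(inner)?;
            match op {
                UnaryOp::Neg => v.checked_neg(),
                UnaryOp::LoByte => Some(v & 0xFF),
                UnaryOp::HiByte => Some((v >> 8) & 0xFF),
            }
        }
        Expr::Binary(left, op, right) => {
            let a = eval(left)?;
            let b = eval(right)?;
            match op {
                BinaryOp::Add => a.checked_add(b),
                BinaryOp::Sub => a.checked_sub(b),
                BinaryOp::Mul => a.checked_mul(b),
                BinaryOp::Div => a.checked_div(b),
            }
        }
    }
}
```

Evaluating the tree for 2 + 3 * 4 gives 14 rather than 20, because parse_additive() hands the whole 3 * 4 subtree to the addition as its right operand.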
Parsing Directives
Each directive has its own syntax:
#![allow(unused)]
fn main() {
fn parse_directive(&mut self) -> ParseResult<Statement> {
let token = self.advance();
let directive = token.directive().unwrap();
let location = token.location;
match directive {
Directive::ORG => {
let address = self.parse_expression()?;
Ok(Statement::Directive(DirectiveStmt::Org { address, location }))
}
Directive::DB => {
let values = self.parse_db_values()?;
Ok(Statement::Directive(DirectiveStmt::Db { values, location }))
}
Directive::DW => {
let values = self.parse_expression_list()?;
Ok(Statement::Directive(DirectiveStmt::Dw { values, location }))
}
Directive::EQU => {
// .equ NAME value
if !self.check(TokenKind::Identifier) {
return Err(ParseError::InvalidDirective {
message: "expected identifier after .equ".to_string(),
location: self.current.location,
});
}
let name = self.advance().text(self.source).to_string();
let value = self.parse_expression()?;
Ok(Statement::Directive(DirectiveStmt::Equ { name, value, location }))
}
Directive::INCLUDE => {
// .include "filename"
if !self.check(TokenKind::String) {
return Err(ParseError::InvalidDirective {
message: "expected string after .include".to_string(),
location: self.current.location,
});
}
let path = self.advance().string().unwrap_or("").to_string();
Ok(Statement::Directive(DirectiveStmt::Include { path, location }))
}
}
}
}
Parsing .db Values
The .db directive accepts bytes and strings:
#![allow(unused)]
fn main() {
fn parse_db_values(&mut self) -> ParseResult<Vec<DataValue>> {
let mut values = Vec::new();
loop {
if self.check(TokenKind::String) {
let token = self.advance();
let s = token.string().unwrap_or("").to_string();
values.push(DataValue::String(s));
} else if can_start_expression(self.current.kind) {
let expr = self.parse_expression()?;
values.push(DataValue::Byte(expr));
} else {
break;
}
if !self.check(TokenKind::Comma) {
break;
}
self.advance(); // consume comma
}
if values.is_empty() {
return Err(ParseError::InvalidDirective {
message: "expected at least one value for .db".to_string(),
location: self.current.location,
});
}
Ok(values)
}
}
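parse_db_values() relies on a can_start_expression helper that is not shown in this chapter. One plausible version, derived from the expression grammar above, accepts exactly the tokens that can begin a unary or primary rule. The TokenKind enum below is a reduced, hypothetical copy of the real one, so treat this as a sketch rather than the project's actual helper.

```rust
// Reduced, hypothetical copy of the token kinds used by the parser.
#[derive(Debug, Clone, Copy, PartialEq)]
enum TokenKind {
    Number,
    Identifier,
    LocalLabel,
    Dollar,
    OpenParen,
    Minus,
    LessThan,
    GreaterThan,
    Comma,
    String,
    NewLine,
}

/// Returns true if `kind` can begin an expression according to the grammar:
/// unary operators (-, <, >) or any token that starts a primary.
fn can_start_expression(kind: TokenKind) -> bool {
    matches!(
        kind,
        TokenKind::Minus
            | TokenKind::LessThan
            | TokenKind::GreaterThan
            | TokenKind::Number
            | TokenKind::Identifier
            | TokenKind::LocalLabel
            | TokenKind::Dollar
            | TokenKind::OpenParen
    )
}
```

Strings are deliberately excluded: .db handles them in a separate branch before falling back to expressions.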
Parsing Expression Lists
Used by .dw:
#![allow(unused)]
fn main() {
pub fn parse_expression_list(&mut self) -> ParseResult<Vec<Expression>> {
let mut exprs = vec![self.parse_expression()?];
while self.check(TokenKind::Comma) {
self.advance();
exprs.push(self.parse_expression()?);
}
Ok(exprs)
}
}
Local Label Resolution
Local labels are scoped to their parent global label. When parsing:
main:
.loop:
bne .loop
other:
.loop: ; different from main.loop
bne .loop
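The assembler keeps track of the current global label, and each local label such as .loop is qualified with it:
- First .loop → main.loop
- Second .loop → other.loop
The parser only records the label name and the is_local flag. The qualification itself is handled in the symbol table (next chapter), not the parser.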
Complete Parsing Example
Let’s trace parsing:
.org 0x8000
start:
lda (0x80),y
- Parse .org 0x8000
  - Token: Directive(ORG)
  - Parse expression: Number(0x8000)
  - Result: Directive(Org { address: Number(32768) })
- Parse start:
  - Token: Identifier
  - Text: “start”
  - Token: Colon
  - Result: Label(LabelDef { name: "start", is_local: false })
- Parse lda (0x80),y
  - Token: Instruction(LDA)
  - Has operand: yes (starts with "(")
  - Parse indirect operand:
    - Token: OpenParen
    - Parse expression: Number(0x80)
    - Token: CloseParen
    - Token: Comma
    - Token: Register (y)
  - Result: Instruction(InstructionStmt { mnemonic: LDA, operand: IndirectY(Number(128)) })
Summary
In this chapter, we implemented:
- Label parsing: identifier + colon, detecting local labels
- Instruction parsing: mnemonic + optional operand
- Operand parsing: immediate, accumulator, address, indexed, indirect modes
- Expression parsing: with operator precedence
- Directive parsing: .org, .db, .dw, .equ, .include
The parser now produces a complete AST from source code. In the next chapter, we’ll build the symbol table to track labels and constants.
Chapter 6: The Symbol Table
In this chapter, we’ll build the symbol table - the data structure that tracks labels and constants during assembly.
What Symbols Do We Track?
The symbol table stores:
- Labels: Names that refer to addresses in the program
  - main: → address where code follows
  - .loop: → local label within a function
- Constants: Named values defined with .equ
  - .equ SCREEN 0x1000 → SCREEN = 4096
Symbol Structure
#![allow(unused)]
fn main() {
pub struct Symbol {
pub name: String,
pub value: SymbolValue,
pub defined_at: Location,
pub referenced: bool,
}
pub enum SymbolValue {
Address(u16), // Label pointing to an address
Constant(i64), // Constant value from .equ
Undefined, // Forward reference not yet resolved
}
}
Symbol Constructors
#![allow(unused)]
fn main() {
impl Symbol {
pub fn address(name: impl Into<String>, address: u16, location: Location) -> Self {
Self {
name: name.into(),
value: SymbolValue::Address(address),
defined_at: location,
referenced: false,
}
}
pub fn constant(name: impl Into<String>, value: i64, location: Location) -> Self {
Self {
name: name.into(),
value: SymbolValue::Constant(value),
defined_at: location,
referenced: false,
}
}
pub fn is_defined(&self) -> bool {
!matches!(self.value, SymbolValue::Undefined)
}
pub fn numeric_value(&self) -> Option<i64> {
match self.value {
SymbolValue::Address(addr) => Some(addr as i64),
SymbolValue::Constant(val) => Some(val),
SymbolValue::Undefined => None,
}
}
}
}
The Symbol Table
#![allow(unused)]
fn main() {
pub struct SymbolTable {
symbols: HashMap<String, Symbol>,
current_parent: Option<String>,
}
}
The current_parent field tracks the most recent global label so that local labels can be resolved against it.
Basic Operations
#![allow(unused)]
fn main() {
impl SymbolTable {
pub fn new() -> Self {
Self {
symbols: HashMap::new(),
current_parent: None,
}
}
pub fn set_parent(&mut self, parent: Option<String>) {
self.current_parent = parent;
}
pub fn parent(&self) -> Option<&str> {
self.current_parent.as_deref()
}
}
}
Defining Symbols
#![allow(unused)]
fn main() {
pub fn define(&mut self, symbol: Symbol) -> Result<(), CodeGenError> {
let name = symbol.name.clone();
let location = symbol.defined_at;
if let Some(existing) = self.symbols.get(&name) {
// If existing symbol is undefined (forward reference), update it
if !existing.is_defined() {
self.symbols.insert(name, symbol);
return Ok(());
}
// Already defined - that's an error
return Err(CodeGenError::DuplicateSymbol {
name,
first: existing.defined_at,
second: location,
});
}
self.symbols.insert(name, symbol);
Ok(())
}
}
Convenience Methods
#![allow(unused)]
fn main() {
pub fn define_label(
&mut self,
name: impl Into<String>,
address: u16,
location: Location,
) -> Result<(), CodeGenError> {
self.define(Symbol::address(name, address, location))
}
pub fn define_constant(
&mut self,
name: impl Into<String>,
value: i64,
location: Location,
) -> Result<(), CodeGenError> {
self.define(Symbol::constant(name, value, location))
}
}
Looking Up Symbols
#![allow(unused)]
fn main() {
pub fn lookup(&self, name: &str) -> Option<&Symbol> {
// Try direct lookup
if let Some(sym) = self.symbols.get(name) {
return Some(sym);
}
// For local labels, try with parent prefix
if (name.starts_with('.') || name.starts_with('@'))
&& self.current_parent.is_some()
{
let qualified = self.qualify_local_label(name);
return self.symbols.get(&qualified);
}
None
}
pub fn lookup_value(&self, name: &str) -> Option<i64> {
self.lookup(name).and_then(|s| s.numeric_value())
}
pub fn is_defined(&self, name: &str) -> bool {
self.lookup(name)
.map(|s| s.is_defined())
.unwrap_or(false)
}
}
Local Label Handling
Local labels are scoped to their parent global label:
#![allow(unused)]
fn main() {
pub fn qualify_local_label(&self, name: &str) -> String {
if let Some(parent) = &self.current_parent {
if name.starts_with('.') {
// .loop -> parent.loop
format!("{}{}", parent, name)
} else if name.starts_with('@') {
// @loop -> parent.loop
format!("{}.{}", parent, &name[1..])
} else {
name.to_string()
}
} else {
name.to_string()
}
}
}
Example
main: ; current_parent = "main"
.loop: ; stored as "main.loop"
bne .loop ; resolved to "main.loop"
other: ; current_parent = "other"
.loop: ; stored as "other.loop"
bne .loop ; resolved to "other.loop"
When we encounter main:, we call set_parent(Some("main")).
When we encounter .loop:, we store it as main.loop.
When we reference .loop, we look up main.loop.
The Forward Reference Problem
Consider this code:
jmp end ; 'end' not defined yet!
nop
end:
rts
When we encounter jmp end, the label end hasn’t been defined yet. This is called a forward reference.
The Two-Pass Solution
We solve this with two passes:
- Pass 1: Scan through the code, recording where each label is defined
- Pass 2: Generate code, now that all labels are known
In Pass 1, when we see jmp end:
- We don’t know end’s address
- We just calculate that this instruction takes 3 bytes
- We move on
In Pass 2, when we generate code for jmp end:
- We look up end in the symbol table
- We now have its address
- We emit the correct bytes
Tracking References
We track whether symbols are referenced:
#![allow(unused)]
fn main() {
pub fn mark_referenced(&mut self, name: &str) {
if let Some(sym) = self.symbols.get_mut(name) {
sym.referenced = true;
} else if (name.starts_with('.') || name.starts_with('@'))
&& self.current_parent.is_some()
{
let qualified = self.qualify_local_label(name);
if let Some(sym) = self.symbols.get_mut(&qualified) {
sym.referenced = true;
}
}
}
}
This allows us to warn about unused labels:
#![allow(unused)]
fn main() {
pub fn unreferenced_symbols(&self) -> Vec<&Symbol> {
self.symbols
.values()
.filter(|s| !s.referenced && s.is_defined())
.collect()
}
}
Finding Undefined Symbols
After Pass 1, we can check for undefined symbols:
#![allow(unused)]
fn main() {
pub fn undefined_symbols(&self) -> Vec<&Symbol> {
self.symbols
.values()
.filter(|s| !s.is_defined())
.collect()
}
}
Complete Example
Let’s trace symbol table operations for:
.equ SCREEN 0x1000
.org 0x8000
start:
lda #>SCREEN
.loop:
jmp .loop
.dw start
Pass 1
| Operation | Symbol Table |
|---|---|
| .equ SCREEN 0x1000 | { SCREEN: Constant(4096) } |
| .org 0x8000 | (no change, just sets address) |
| start: | { SCREEN, start: Address(0x8000) } |
| lda #>SCREEN | (no change, just advances address) |
| .loop: | { SCREEN, start, start.loop: Address(0x8002) } |
| jmp .loop | (no change) |
| .dw start | (no change) |
Pass 2
When generating code:
- lda #>SCREEN → look up SCREEN → 0x1000 → high byte is 0x10 → emit A9 10
- jmp .loop → look up start.loop → 0x8002 → emit 4C 02 80
- .dw start → look up start → 0x8000 → emit 00 80
Summary
In this chapter, we built a symbol table that:
- Stores labels (addresses) and constants (values)
- Handles local labels scoped to parent global labels
- Detects duplicate symbol definitions
- Supports forward references via undefined symbols
- Tracks which symbols are referenced
In the next chapter, we’ll implement the two-pass assembly process that uses this symbol table.
Chapter 7: Two-Pass Assembly
In this chapter, we’ll implement the two-pass assembly process that transforms our AST into machine code.
Why Two Passes?
Consider this program:
jmp end
nop
end:
rts
When we reach jmp end, we need to know the address of end. But we haven’t seen end yet! This is the forward reference problem.
The solution is two passes:
- Pass 1: Collect all labels and calculate their addresses
- Pass 2: Generate code using the complete symbol table
The Assembler Structure
#![allow(unused)]
fn main() {
pub struct Assembler {
symbols: SymbolTable,
current_address: u16,
origin: u16,
output: Vec<u8>,
errors: Vec<CodeGenError>,
current_file: Option<String>,
}
impl Assembler {
pub fn new() -> Self {
Self {
symbols: SymbolTable::new(),
current_address: 0,
origin: 0,
output: Vec::new(),
errors: Vec::new(),
current_file: None,
}
}
}
}
The Main Assemble Function
#![allow(unused)]
fn main() {
pub fn assemble(&mut self, program: &Program) -> Result<Vec<u8>, AssemblerError> {
self.current_file = program.source_file.clone();
// Pass 1: Collect symbols
self.pass1(program)?;
// Pass 2: Generate code
self.pass2(program)?;
if !self.errors.is_empty() {
return Err(AssemblerError::Multiple(
self.errors.iter().cloned().map(AssemblerError::CodeGen).collect(),
));
}
Ok(std::mem::take(&mut self.output))
}
}
Pass 1: Symbol Collection
In Pass 1, we walk through the program and:
- Record label addresses
- Process constants from .equ
- Calculate instruction sizes to track the current address
#![allow(unused)]
fn main() {
fn pass1(&mut self, program: &Program) -> Result<(), AssemblerError> {
self.current_address = self.origin;
for stmt in &program.statements {
match stmt {
Statement::Label(label) => {
self.pass1_label(label)?;
}
Statement::Instruction(instr) => {
self.pass1_instruction(instr)?;
}
Statement::Directive(dir) => {
self.pass1_directive(dir)?;
}
}
}
Ok(())
}
}
Processing Labels
#![allow(unused)]
fn main() {
fn pass1_label(&mut self, label: &LabelDef) -> Result<(), AssemblerError> {
let name = if label.is_local {
self.symbols.qualify_local_label(&label.name)
} else {
// Update parent for subsequent local labels
self.symbols.set_parent(Some(label.name.clone()));
label.name.clone()
};
if let Err(e) = self.symbols.define_label(&name, self.current_address, label.location) {
self.errors.push(e);
}
Ok(())
}
}
Calculating Instruction Size
We need to determine how many bytes an instruction will take:
#![allow(unused)]
fn main() {
fn pass1_instruction(&mut self, instr: &InstructionStmt) -> Result<(), AssemblerError> {
let size = self.instruction_size(instr);
self.current_address = self.current_address.wrapping_add(size as u16);
Ok(())
}
fn instruction_size(&self, instr: &InstructionStmt) -> u8 {
match self.determine_addressing_mode(instr.mnemonic, &instr.operand) {
Ok(mode) => {
if let Some(opcode) = get_opcode(instr.mnemonic, mode) {
opcode.size
} else {
1 // Invalid, will error in pass 2
}
}
Err(_) => 1,
}
}
}
Processing Directives in Pass 1
#![allow(unused)]
fn main() {
fn pass1_directive(&mut self, directive: &DirectiveStmt) -> Result<(), AssemblerError> {
match directive {
DirectiveStmt::Org { address, .. } => {
let addr = self.evaluate(address)? as u16;
self.origin = addr;
self.current_address = addr;
}
DirectiveStmt::Db { values, .. } => {
for value in values {
match value {
DataValue::Byte(_) => {
self.current_address = self.current_address.wrapping_add(1);
}
DataValue::String(s) => {
self.current_address = self.current_address.wrapping_add(s.len() as u16);
}
}
}
}
DirectiveStmt::Dw { values, .. } => {
self.current_address = self.current_address.wrapping_add((values.len() * 2) as u16);
}
DirectiveStmt::Equ { name, value, location } => {
let val = self.evaluate(value)?;
self.symbols.define_constant(name, val, *location)?;
}
DirectiveStmt::Include { .. } => {
// Handle includes (recursive assembly)
}
}
Ok(())
}
}
Pass 2: Code Generation
In Pass 2, we generate the actual machine code:
#![allow(unused)]
fn main() {
fn pass2(&mut self, program: &Program) -> Result<(), AssemblerError> {
self.current_address = self.origin;
self.output.clear();
for stmt in &program.statements {
match stmt {
Statement::Label(label) => {
// Update parent for local label resolution
if !label.is_local {
self.symbols.set_parent(Some(label.name.clone()));
}
}
Statement::Instruction(instr) => {
if let Err(e) = self.emit_instruction(instr) {
self.errors.push(e);
}
}
Statement::Directive(dir) => {
if let Err(e) = self.emit_directive(dir) {
self.errors.push(e);
}
}
}
}
Ok(())
}
}
Determining Addressing Mode
The same operand syntax can map to different addressing modes depending on the value:
#![allow(unused)]
fn main() {
fn determine_addressing_mode(
&self,
mnemonic: Mnemonic,
operand: &Option<Operand>,
) -> Result<AddressingMode, CodeGenError> {
match operand {
None => Ok(AddressingMode::Implied),
Some(Operand::Immediate(_)) => Ok(AddressingMode::Immediate),
Some(Operand::Accumulator) => Ok(AddressingMode::Accumulator),
Some(Operand::IndirectX(_)) => Ok(AddressingMode::IndirectX),
Some(Operand::IndirectY(_)) => Ok(AddressingMode::IndirectY),
Some(Operand::Indirect(_)) => Ok(AddressingMode::Indirect),
Some(Operand::Address(expr)) => {
// Branches always use relative
if is_branch_instruction(mnemonic) {
return Ok(AddressingMode::Relative);
}
// Try to evaluate; if small enough, use zero page
match self.evaluate(expr) {
Ok(value) if value >= 0 && value <= 0xFF => {
if get_opcode(mnemonic, AddressingMode::ZeroPage).is_some() {
Ok(AddressingMode::ZeroPage)
} else {
Ok(AddressingMode::Absolute)
}
}
_ => Ok(AddressingMode::Absolute),
}
}
Some(Operand::IndexedX(expr)) => {
match self.evaluate(expr) {
Ok(value) if value >= 0 && value <= 0xFF => {
if get_opcode(mnemonic, AddressingMode::ZeroPageX).is_some() {
Ok(AddressingMode::ZeroPageX)
} else {
Ok(AddressingMode::AbsoluteX)
}
}
_ => Ok(AddressingMode::AbsoluteX),
}
}
Some(Operand::IndexedY(expr)) => {
match self.evaluate(expr) {
Ok(value) if value >= 0 && value <= 0xFF => {
if get_opcode(mnemonic, AddressingMode::ZeroPageY).is_some() {
Ok(AddressingMode::ZeroPageY)
} else {
Ok(AddressingMode::AbsoluteY)
}
}
_ => Ok(AddressingMode::AbsoluteY),
}
}
}
}
}
Zero Page Optimization
The 6502 has faster, shorter instructions for zero page addresses (0x00-0xFF):
lda 0x80 ; Zero Page: A5 80 (2 bytes, 3 cycles)
lda 0x0180 ; Absolute: AD 80 01 (3 bytes, 4 cycles)
Our assembler automatically uses zero page mode when:
- The address fits in 8 bits (0x00-0xFF)
- A zero page variant exists for that instruction
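The two conditions reduce to a small decision function. The sketch below uses a hypothetical two-variant Mode enum and a select_mode helper instead of the full AddressingMode type and opcode table, and it treats an operand that cannot be evaluated yet as absolute, matching the fallback in determine_addressing_mode().

```rust
// Hypothetical two-variant mode type, just for this sketch.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Mode {
    ZeroPage,
    Absolute,
}

/// Chooses between zero page and absolute addressing.
/// `value` is None when the operand cannot be evaluated yet
/// (for example, a label that is defined later in the source).
fn select_mode(value: Option<i64>, has_zero_page_form: bool) -> Mode {
    match value {
        Some(v) if (0..=0xFF).contains(&v) && has_zero_page_form => Mode::ZeroPage,
        _ => Mode::Absolute,
    }
}
```

One consequence worth keeping in mind: assuming evaluate() fails for symbols that are not defined yet, a zero page label that is referenced before its definition is sized as absolute in Pass 1 but can resolve to zero page in Pass 2. Defining zero page names with .equ at the top of the file, as the bouncing ball example does, avoids that mismatch.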
Branch Instructions
Branches use relative addressing. The helper function identifies branch mnemonics:
#![allow(unused)]
fn main() {
fn is_branch_instruction(mnemonic: Mnemonic) -> bool {
matches!(
mnemonic,
Mnemonic::BCC
| Mnemonic::BCS
| Mnemonic::BEQ
| Mnemonic::BMI
| Mnemonic::BNE
| Mnemonic::BPL
| Mnemonic::BVC
| Mnemonic::BVS
)
}
}
Output Helpers
#![allow(unused)]
fn main() {
fn emit_byte(&mut self, byte: u8) {
self.output.push(byte);
self.current_address = self.current_address.wrapping_add(1);
}
fn emit_word(&mut self, word: u16) {
self.emit_byte((word & 0xFF) as u8); // Low byte first
self.emit_byte(((word >> 8) & 0xFF) as u8); // High byte second
}
fn pad_to(&mut self, address: u16) {
while self.current_address < address {
self.emit_byte(0);
}
}
}
Assembly Trace Example
Let’s trace assembling:
.org 0x8000
start:
lda #0x42
jmp start
Pass 1
| Statement | Action | current_address |
|---|---|---|
| .org 0x8000 | Set origin | 0x8000 |
| start: | Define start = 0x8000 | 0x8000 |
| lda #0x42 | Size = 2 bytes | 0x8002 |
| jmp start | Size = 3 bytes | 0x8005 |
Symbol Table: { start: Address(0x8000) }
Pass 2
| Statement | Output | Description |
|---|---|---|
| .org 0x8000 | (reset to 0x8000) | |
| start: | (nothing) | Just a label |
| lda #0x42 | A9 42 | LDA immediate = 0xA9 |
| jmp start | 4C 00 80 | JMP absolute = 0x4C, addr = 0x8000 (little-endian) |
Final output: A9 42 4C 00 80
Summary
In this chapter, we implemented two-pass assembly:
- Pass 1: Collect labels and calculate addresses
  - Process .equ constants
  - Calculate instruction sizes
  - Handle .org to set addresses
- Pass 2: Generate machine code
  - Look up all labels
  - Emit opcode and operand bytes
  - Handle zero page optimization
In the next chapter, we’ll implement the code generation details.
Chapter 8: Code Generation
In this chapter, we’ll implement the code generation phase that converts AST nodes into actual machine code bytes.
Instruction Encoding Overview
Each 6502 instruction consists of:
- Opcode byte: Identifies the instruction and addressing mode
- Operand bytes: 0, 1, or 2 bytes depending on addressing mode
| Addressing Mode | Size | Example |
|---|---|---|
| Implied | 1 | NOP → EA |
| Accumulator | 1 | ASL A → 0A |
| Immediate | 2 | LDA #$42 → A9 42 |
| Zero Page | 2 | LDA $80 → A5 80 |
| Zero Page,X | 2 | LDA $80,X → B5 80 |
| Zero Page,Y | 2 | LDX $80,Y → B6 80 |
| Absolute | 3 | LDA $2000 → AD 00 20 |
| Absolute,X | 3 | LDA $2000,X → BD 00 20 |
| Absolute,Y | 3 | LDA $2000,Y → B9 00 20 |
| Indirect | 3 | JMP ($2000) → 6C 00 20 |
| Indexed Indirect | 2 | LDA ($80,X) → A1 80 |
| Indirect Indexed | 2 | LDA ($80),Y → B1 80 |
| Relative | 2 | BEQ label → F0 offset |
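The Size column depends only on the addressing mode, so it can be written as a single match. The tutorial reads the size from the opcode table (opcode.size); the enum and the instruction_size function below are a self-contained sketch of the same information, not the byte_common definitions.

```rust
// Stand-in for the addressing modes listed in the table above.
#[derive(Debug, Clone, Copy, PartialEq)]
enum AddressingMode {
    Implied,
    Accumulator,
    Immediate,
    ZeroPage,
    ZeroPageX,
    ZeroPageY,
    Absolute,
    AbsoluteX,
    AbsoluteY,
    Indirect,
    IndirectX,
    IndirectY,
    Relative,
}

/// Total instruction size in bytes: one opcode byte plus 0, 1 or 2 operand bytes.
fn instruction_size(mode: AddressingMode) -> u8 {
    use AddressingMode::*;
    match mode {
        // Opcode only
        Implied | Accumulator => 1,
        // Opcode + one operand byte
        Immediate | ZeroPage | ZeroPageX | ZeroPageY | IndirectX | IndirectY | Relative => 2,
        // Opcode + 16-bit address
        Absolute | AbsoluteX | AbsoluteY | Indirect => 3,
    }
}
```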
The emit_instruction Function
#![allow(unused)]
fn main() {
pub fn emit_instruction(&mut self, instr: &InstructionStmt) -> Result<(), CodeGenError> {
// Save instruction start address for $ evaluation
let instr_start = self.current_address;
// Determine addressing mode
let mode = self.determine_addressing_mode(instr.mnemonic, &instr.operand)?;
// Look up the opcode
let opcode = get_opcode(instr.mnemonic, mode).ok_or_else(|| {
CodeGenError::InvalidAddressingMode {
mnemonic: instr.mnemonic,
mode,
location: instr.location,
}
})?;
// Emit opcode byte
self.emit_byte(opcode.code);
// Emit operand bytes
self.emit_operand_bytes(&instr.operand, mode, instr.location, instr_start)?;
Ok(())
}
}
Emitting Operand Bytes
Each addressing mode requires different operand handling:
#![allow(unused)]
fn main() {
fn emit_operand_bytes(
&mut self,
operand: &Option<Operand>,
mode: AddressingMode,
location: Location,
instr_start: u16,
) -> Result<(), CodeGenError> {
match mode {
AddressingMode::Implied | AddressingMode::Accumulator => {
// No operand bytes
}
AddressingMode::Immediate => {
if let Some(Operand::Immediate(expr)) = operand {
let value = self.evaluate_byte_at(expr, location, instr_start)?;
self.emit_byte(value);
}
}
AddressingMode::ZeroPage => {
if let Some(Operand::Address(expr)) = operand {
let value = self.evaluate_byte_at(expr, location, instr_start)?;
self.emit_byte(value);
}
}
AddressingMode::ZeroPageX => {
if let Some(Operand::IndexedX(expr)) = operand {
let value = self.evaluate_byte_at(expr, location, instr_start)?;
self.emit_byte(value);
}
}
AddressingMode::ZeroPageY => {
if let Some(Operand::IndexedY(expr)) = operand {
let value = self.evaluate_byte_at(expr, location, instr_start)?;
self.emit_byte(value);
}
}
AddressingMode::Absolute => {
if let Some(Operand::Address(expr)) = operand {
let value = self.evaluate_word_at(expr, location, instr_start)?;
self.emit_word(value);
}
}
AddressingMode::AbsoluteX => {
if let Some(Operand::IndexedX(expr)) = operand {
let value = self.evaluate_word_at(expr, location, instr_start)?;
self.emit_word(value);
}
}
AddressingMode::AbsoluteY => {
if let Some(Operand::IndexedY(expr)) = operand {
let value = self.evaluate_word_at(expr, location, instr_start)?;
self.emit_word(value);
}
}
AddressingMode::Indirect => {
if let Some(Operand::Indirect(expr)) = operand {
let value = self.evaluate_word_at(expr, location, instr_start)?;
self.emit_word(value);
}
}
AddressingMode::IndirectX => {
if let Some(Operand::IndirectX(expr)) = operand {
let value = self.evaluate_byte_at(expr, location, instr_start)?;
self.emit_byte(value);
}
}
AddressingMode::IndirectY => {
if let Some(Operand::IndirectY(expr)) = operand {
let value = self.evaluate_byte_at(expr, location, instr_start)?;
self.emit_byte(value);
}
}
AddressingMode::Relative => {
if let Some(Operand::Address(expr)) = operand {
let offset = self.calculate_branch_offset(expr, instr_start, location)?;
self.emit_byte(offset as u8);
}
}
}
Ok(())
}
}
Relative Branch Calculation
Branch instructions use a signed 8-bit offset measured from the address of the instruction that follows the branch:
#![allow(unused)]
fn main() {
pub fn calculate_branch_offset(
&self,
target_expr: &Expression,
from_address: u16,
location: Location,
) -> Result<i8, CodeGenError> {
// Evaluate with $ bound to the address of the branch instruction itself
let target = self.evaluate_at(target_expr, from_address, location)?;
// Branch is relative to PC after the branch instruction (PC + 2)
let offset = target - (from_address as i64 + 2);
if offset < -128 || offset > 127 {
return Err(CodeGenError::BranchOutOfRange { offset, location });
}
Ok(offset as i8)
}
}
Branch Offset Example
.org 0x8000
loop: ; Address 0x8000
dex ; Address 0x8000 (1 byte)
bne loop ; Address 0x8001 (2 bytes)
For bne loop:
- Current address when encoding: 0x8001
- Target: 0x8000
- Offset = target - (current + 2) = 0x8000 - 0x8003 = -3 = 0xFD
Output: D0 FD
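To make the arithmetic easy to check in isolation, here is a minimal standalone sketch of the same calculation. The function name `branch_operand` and its plain integer error type are assumptions made for this illustration; they are not part of the assembler's API.

```rust
/// Computes the operand byte for a relative branch.
/// `branch_addr` is the address of the branch opcode itself.
/// On failure, returns the out-of-range offset (hypothetical error type).
fn branch_operand(target: u16, branch_addr: u16) -> Result<u8, i32> {
    // The CPU adds the offset to the address of the *next* instruction,
    // which is two bytes past the branch opcode.
    let offset = i32::from(target) - (i32::from(branch_addr) + 2);
    if !(-128..=127).contains(&offset) {
        return Err(offset);
    }
    // Reinterpret the signed offset as its two's-complement byte.
    Ok(offset as i8 as u8)
}
```

Calling `branch_operand(0x8000, 0x8001)` reproduces the `FD` operand from the example above, and a target 254 bytes ahead is rejected.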
Little-Endian Word Emission
The 6502 uses little-endian byte order:
#![allow(unused)]
fn main() {
fn emit_word(&mut self, word: u16) {
self.emit_byte((word & 0xFF) as u8); // Low byte
self.emit_byte(((word >> 8) & 0xFF) as u8); // High byte
}
}
So address 0x2000 becomes bytes 00 20.
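The standard library can do the byte splitting for us. The sketch below is an equivalent formulation using `u16::to_le_bytes`; the free function `push_word_le` and the `Vec<u8>` output buffer are assumptions for this example rather than the assembler's actual types.

```rust
/// Appends `word` to `out` in little-endian order (low byte first).
fn push_word_le(out: &mut Vec<u8>, word: u16) {
    // to_le_bytes returns [low, high] regardless of the host's endianness.
    out.extend_from_slice(&word.to_le_bytes());
}
```

Either form produces the same bytes; the manual masking version makes the byte order explicit, which is why the tutorial uses it.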
Opcode Lookup
We use the byte_common crate’s opcode table:
#![allow(unused)]
fn main() {
use byte_common::opcode::{get_opcode, AddressingMode, Mnemonic, Opcode};
// Returns Option<&'static Opcode>
let opcode = get_opcode(Mnemonic::LDA, AddressingMode::Immediate);
// opcode.code = 0xA9
// opcode.size = 2
}
The Opcode structure provides:
- code: The actual opcode byte (0x00-0xFF)
- size: Instruction size in bytes
- tick: Cycle count (for emulation)
Code Generation Examples
Example 1: NOP (Implied)
nop
- Mnemonic: NOP
- Operand: None
- Mode: Implied
- Opcode: 0xEA
- Output: EA
Example 2: LDA Immediate
lda #0x42
- Mnemonic: LDA
- Operand: Immediate(0x42)
- Mode: Immediate
- Opcode: 0xA9
- Operand byte: 0x42
- Output: A9 42
Example 3: LDA Zero Page
lda 0x80
- Mnemonic: LDA
- Operand: Address(0x80)
- Value fits in 8 bits → Mode: Zero Page
- Opcode: 0xA5
- Address byte: 0x80
- Output: A5 80
Example 4: LDA Absolute
lda 0x2000
- Mnemonic: LDA
- Operand: Address(0x2000)
- Value > 0xFF → Mode: Absolute
- Opcode: 0xAD
- Address word: 0x2000 → 00 20 (little-endian)
- Output: AD 00 20
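Examples 3 and 4 differ only in the operand's size. As a rough sketch of that decision, the code below picks the mode from the value and encodes `lda` accordingly. The `Mode` enum, `select_mode`, and `encode_lda` are hypothetical names for this illustration; the assembler's real mode selection lives in the earlier chapters and may need to account for more cases.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum Mode {
    ZeroPage,
    Absolute,
}

/// Chooses the shorter zero page form whenever the address fits in one byte.
fn select_mode(value: u16) -> Mode {
    if value <= 0xFF {
        Mode::ZeroPage
    } else {
        Mode::Absolute
    }
}

/// Encodes `lda <value>` using the opcodes from Examples 3 and 4.
fn encode_lda(value: u16) -> Vec<u8> {
    match select_mode(value) {
        Mode::ZeroPage => vec![0xA5, value as u8],
        Mode::Absolute => {
            let [lo, hi] = value.to_le_bytes();
            vec![0xAD, lo, hi]
        }
    }
}
```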
Example 5: JMP Indirect
jmp (0x2000)
- Mnemonic: JMP
- Operand: Indirect(0x2000)
- Mode: Indirect
- Opcode: 0x6C
- Address word: 00 20
- Output: 6C 00 20
Example 6: Branch Instruction
.org 0x8000
start:
ldx #5
loop:
dex
bne loop
At bne loop:
- Current address: 0x8003
- Target: 0x8002 (loop)
- Offset: 0x8002 - 0x8005 = -3 = 0xFD
Output for bne loop: D0 FD
Error Handling
Several errors can occur during code generation:
#![allow(unused)]
fn main() {
pub enum CodeGenError {
// Invalid mnemonic + mode combination
InvalidAddressingMode {
mnemonic: Mnemonic,
mode: AddressingMode,
location: Location,
},
// Branch target too far
BranchOutOfRange {
offset: i64,
location: Location,
},
// Value too large for operand
ValueOutOfRange {
value: i64,
max: i64,
location: Location,
},
// Undefined symbol
UndefinedSymbol {
name: String,
location: Location,
},
}
}
Complete Instruction Table
For reference, here’s how common instructions encode:
| Instruction | Mode | Opcode |
|---|---|---|
| LDA #nn | Immediate | A9 |
| LDA zp | Zero Page | A5 |
| LDA zp,x | Zero Page,X | B5 |
| LDA abs | Absolute | AD |
| LDA abs,x | Absolute,X | BD |
| LDA abs,y | Absolute,Y | B9 |
| LDA (zp,x) | Indexed Indirect | A1 |
| LDA (zp),y | Indirect Indexed | B1 |
| STA zp | Zero Page | 85 |
| STA abs | Absolute | 8D |
| JMP abs | Absolute | 4C |
| JMP (abs) | Indirect | 6C |
| JSR abs | Absolute | 20 |
| RTS | Implied | 60 |
| NOP | Implied | EA |
| BEQ rel | Relative | F0 |
| BNE rel | Relative | D0 |
Summary
In this chapter, we implemented code generation:
- Opcode lookup using mnemonic + addressing mode
- Operand byte emission for each addressing mode
- Little-endian word encoding
- Relative branch offset calculation
- Error handling for invalid combinations
In the next chapter, we’ll implement expression evaluation for calculating addresses and values.
Chapter 9: Expression Evaluation
In this chapter, we’ll implement the expression evaluator that computes numeric values from AST expressions.
Why Expressions Matter
Expressions allow powerful constructs in assembly:
.equ BUFFER 0x0200
.equ BUFFER_SIZE 256
lda #BUFFER_SIZE - 1 ; Compute at assembly time
ldx #<BUFFER ; Low byte of address
ldy #>BUFFER ; High byte of address
.dw BUFFER + BUFFER_SIZE ; End of buffer
jmp $ ; Jump to current address
The Evaluator
The evaluator recursively walks an Expression tree and computes the result:
#![allow(unused)]
fn main() {
impl Assembler {
pub fn evaluate(&self, expr: &Expression) -> Result<i64, CodeGenError> {
self.evaluate_at(expr, self.current_address, Location::default())
}
fn evaluate_at(
&self,
expr: &Expression,
current_addr: u16,
location: Location,
) -> Result<i64, CodeGenError> {
match expr {
Expression::Number(n) => Ok(*n),
Expression::CurrentAddress => Ok(current_addr as i64),
Expression::Identifier(name) => {
self.symbols.lookup_value(name).ok_or_else(|| {
CodeGenError::UndefinedSymbol {
name: name.clone(),
location,
}
})
}
Expression::LocalIdentifier(name) => {
// Try direct lookup
if let Some(value) = self.symbols.lookup_value(name) {
return Ok(value);
}
// Try qualified lookup
let qualified = self.symbols.qualify_local_label(name);
self.symbols.lookup_value(&qualified).ok_or_else(|| {
CodeGenError::UndefinedSymbol {
name: name.clone(),
location,
}
})
}
Expression::Binary { left, op, right } => {
let l = self.evaluate_at(left, current_addr, location)?;
let r = self.evaluate_at(right, current_addr, location)?;
self.apply_binary_op(l, *op, r, location)
}
Expression::Unary { op, operand } => {
let value = self.evaluate_at(operand, current_addr, location)?;
self.apply_unary_op(*op, value)
}
}
}
}
}
Binary Operations
#![allow(unused)]
fn main() {
fn apply_binary_op(
&self,
left: i64,
op: BinaryOp,
right: i64,
location: Location,
) -> Result<i64, CodeGenError> {
match op {
BinaryOp::Add => Ok(left.wrapping_add(right)),
BinaryOp::Sub => Ok(left.wrapping_sub(right)),
BinaryOp::Mul => Ok(left.wrapping_mul(right)),
BinaryOp::Div => {
if right == 0 {
Err(CodeGenError::EvaluationError {
message: "division by zero".to_string(),
location,
})
} else {
// wrapping_div avoids a panic on i64::MIN / -1
Ok(left.wrapping_div(right))
}
}
}
}
}
Examples
| Expression | Result |
|---|---|
| 10 + 5 | 15 |
| 20 - 3 | 17 |
| 4 * 8 | 32 |
| 100 / 10 | 10 |
| 0x1000 + 0x100 | 0x1100 |
Unary Operations
#![allow(unused)]
fn main() {
fn apply_unary_op(&self, op: UnaryOp, value: i64) -> Result<i64, CodeGenError> {
match op {
UnaryOp::Neg => Ok(value.wrapping_neg()),
UnaryOp::LoByte => Ok(value & 0xFF),
UnaryOp::HiByte => Ok((value >> 8) & 0xFF),
}
}
}
Lo-Byte and Hi-Byte Operators
These are essential for working with 16-bit addresses on an 8-bit processor:
.equ SCREEN 0x1234
lda #<SCREEN ; Low byte: 0x34
sta ptr
lda #>SCREEN ; High byte: 0x12
sta ptr+1
| Expression | Value | Result |
|---|---|---|
| <0x1234 | 0x1234 | 0x34 |
| >0x1234 | 0x1234 | 0x12 |
| <0xFF00 | 0xFF00 | 0x00 |
| >0x00FF | 0x00FF | 0x00 |
The Current Address ($)
The $ symbol represents the current address during assembly:
.org 0x8000
loop:
jmp $ ; Jump to self (infinite loop) - jumps to 0x8000
.dw $ ; Store current address
When evaluating $, we use the address at the start of the instruction, not after emitting bytes:
#![allow(unused)]
fn main() {
Expression::CurrentAddress => Ok(current_addr as i64),
}
This is why we pass instr_start when evaluating operands.
Range Checking
We provide variants that check the result fits in the required size:
#![allow(unused)]
fn main() {
pub fn evaluate_byte(
&self,
expr: &Expression,
location: Location,
) -> Result<u8, CodeGenError> {
let value = self.evaluate_with_location(expr, location)?;
// Allow -128 to 255 (signed or unsigned byte)
if value < -128 || value > 255 {
return Err(CodeGenError::ValueOutOfRange {
value,
max: 255,
location,
});
}
Ok(value as u8)
}
pub fn evaluate_word(
&self,
expr: &Expression,
location: Location,
) -> Result<u16, CodeGenError> {
let value = self.evaluate_with_location(expr, location)?;
if value < -32768 || value > 65535 {
return Err(CodeGenError::ValueOutOfRange {
value,
max: 65535,
location,
});
}
Ok(value as u16)
}
}
Evaluation with Custom Address
For correct $ handling during code generation, the _at variants take the instruction's start address explicitly:
#![allow(unused)]
fn main() {
pub fn evaluate_byte_at(
&self,
expr: &Expression,
location: Location,
current_addr: u16,
) -> Result<u8, CodeGenError> {
let value = self.evaluate_at(expr, current_addr, location)?;
if value < -128 || value > 255 {
return Err(CodeGenError::ValueOutOfRange {
value,
max: 255,
location,
});
}
Ok(value as u8)
}
}
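The accepted range of -128 to 255 means a negative value and its unsigned two's-complement form produce the same byte. The standalone sketch below shows that narrowing step on its own; `to_byte` and its `String` error are assumptions for this example, standing in for the `CodeGenError::ValueOutOfRange` path above.

```rust
/// Narrows an evaluated expression value to a single operand byte.
fn to_byte(value: i64) -> Result<u8, String> {
    // Accept both signed (-128..=127) and unsigned (0..=255) byte values.
    if !(-128..=255).contains(&value) {
        return Err(format!("value {} out of range (max 255)", value));
    }
    // Truncating keeps the low 8 bits, so -1 becomes 0xFF.
    Ok(value as u8)
}
```

So `lda #-1` and `lda #0xFF` would assemble to the same bytes, while `lda #300` is rejected.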
Complex Expression Examples
Computing Buffer End
.equ BUFFER_START 0x0200
.equ BUFFER_SIZE 0x0100
; BUFFER_END = BUFFER_START + BUFFER_SIZE = 0x0300
lda #>BUFFER_START + BUFFER_SIZE ; Error: unary > binds first, so this is 0x02 + 0x0100 = 0x0102
lda #>(BUFFER_START + BUFFER_SIZE) ; OK: high byte of 0x0300 = 0x03
Table Offsets
.equ SPRITE_SIZE 4
.equ SPRITE_COUNT 8
; Total size = 4 * 8 = 32 bytes
.equ SPRITE_TABLE_SIZE SPRITE_SIZE * SPRITE_COUNT
; Offset to sprite N: N * SPRITE_SIZE
lda sprite_table + 2 * SPRITE_SIZE ; Third sprite
Relative Jumps
beq $ + 3 ; Skip next 1-byte instruction if equal
nop
rts
Here $ + 3 evaluates to the address of the beq plus 3: two bytes for the branch itself and one for the nop, so the branch lands on rts.
Evaluation Order
Expressions follow standard mathematical precedence:
- Parentheses () (highest)
- Unary operators -, <, >
- Multiplication and division *, /
- Addition and subtraction +, - (lowest)
So 2 + 3 * 4 evaluates as 2 + (3 * 4) = 14, not (2 + 3) * 4 = 20.
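Precedence is decided by the parser; by the time the evaluator runs, the shape of the tree already encodes it. The sketch below uses a cut-down expression type to show how tree shape changes the result. `Expr`, `eval`, and `num` are simplified stand-ins invented for this example, not the tutorial's `Expression` type.

```rust
/// A reduced expression tree: just enough to show grouping.
enum Expr {
    Num(i64),
    Add(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
    /// The `>` (high byte) operator.
    Hi(Box<Expr>),
}

fn eval(e: &Expr) -> i64 {
    match e {
        Expr::Num(n) => *n,
        Expr::Add(l, r) => eval(l).wrapping_add(eval(r)),
        Expr::Mul(l, r) => eval(l).wrapping_mul(eval(r)),
        Expr::Hi(x) => (eval(x) >> 8) & 0xFF,
    }
}

/// Small helper to keep the trees below readable.
fn num(n: i64) -> Box<Expr> {
    Box::new(Expr::Num(n))
}
```

Because unary operators bind tighter than `+`, `>0x0200 + 0x0100` corresponds to `Add(Hi(0x0200), 0x0100)`, which evaluates to 0x0102, whereas `>(0x0200 + 0x0100)` corresponds to `Hi(Add(0x0200, 0x0100))` and evaluates to 0x03.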
Error Messages
When evaluation fails, we provide helpful messages:
error: undefined symbol 'sprite_ptr'
--> game.s:42:5
|
42 | lda sprite_ptr
| ^^^^^^^^^^ symbol not defined
error: value 300 out of range (max 255)
--> game.s:15:9
|
15 | lda #300
| ^^^^ value too large for byte
Summary
In this chapter, we implemented expression evaluation:
- Numeric literals: Direct values
- Identifiers: Symbol table lookups
- Binary operations: +, -, *, /
- Unary operations: negation, lo-byte, hi-byte
- Current address: $ symbol
- Range checking: Ensure values fit in bytes/words
In the next chapter, we’ll implement directive handlers for .org, .db, .dw, etc.
Chapter 10: Implementing Directives
In this chapter, we’ll implement handlers for all ByteASM directives.
Directive Overview
| Directive | Purpose | Example |
|---|---|---|
| .org | Set assembly address | .org 0x8000 |
| .db | Define bytes | .db 0x01, "Hi", 0 |
| .dw | Define words | .dw 0x1234, label |
| .equ | Define constant | .equ SCREEN 0x1000 |
| .include | Include file | .include "utils.s" |
The .org Directive
.org (origin) sets the current assembly address.
Pass 1 Handling
#![allow(unused)]
fn main() {
DirectiveStmt::Org { address, .. } => {
let addr = self.evaluate(address)? as u16;
self.origin = addr;
self.current_address = addr;
}
}
Pass 2 Handling
#![allow(unused)]
fn main() {
DirectiveStmt::Org { address, .. } => {
let addr = self.evaluate(address)? as u16;
// If moving forward, pad with zeros
if addr > self.current_address {
self.pad_to(addr);
}
self.current_address = addr;
}
}
Usage Examples
.org 0x8000 ; Start code at 0x8000
reset:
; ... code ...
.org 0xFFFC ; Jump to reset vector location
.dw reset ; Write reset vector
Multiple .org Directives
.org 0x8000
lda #0x01 ; At 0x8000
.org 0x8100 ; Skip to 0x8100 (gap filled with zeros)
lda #0x02 ; At 0x8100
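The pass 2 handler relies on `pad_to`, which is not shown in this chapter. One plausible shape for it is sketched below, assuming the output is a flat `Vec<u8>` whose first byte corresponds to the origin address. The `Output` struct and `offset_of` helper are hypothetical and exist only for this illustration.

```rust
/// Hypothetical output buffer: `bytes[0]` corresponds to address `origin`.
struct Output {
    origin: u16,
    current_address: u16,
    bytes: Vec<u8>,
}

impl Output {
    fn new(origin: u16) -> Self {
        Output { origin, current_address: origin, bytes: Vec::new() }
    }

    fn emit_byte(&mut self, byte: u8) {
        self.bytes.push(byte);
        self.current_address = self.current_address.wrapping_add(1);
    }

    /// Fills the gap up to `addr` with zeros; does nothing if `addr` is not ahead.
    fn pad_to(&mut self, addr: u16) {
        while self.current_address < addr {
            self.emit_byte(0x00);
        }
    }

    /// Maps an address to its index in `bytes`, if it is at or after the origin.
    fn offset_of(&self, addr: u16) -> Option<usize> {
        addr.checked_sub(self.origin).map(usize::from)
    }
}
```

With this layout, the two-instruction example above yields a 0x102-byte image: two bytes of code, 0xFE zero bytes of padding, and two more bytes of code at offset 0x100.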
The .db Directive
.db (define bytes) emits raw bytes and strings.
Pass 1 Handling
Calculate size without emitting:
#![allow(unused)]
fn main() {
DirectiveStmt::Db { values, .. } => {
for value in values {
match value {
DataValue::Byte(_) => {
self.current_address = self.current_address.wrapping_add(1);
}
DataValue::String(s) => {
self.current_address = self.current_address.wrapping_add(s.len() as u16);
}
}
}
}
}
Pass 2 Handling
Emit the actual bytes:
#![allow(unused)]
fn main() {
DirectiveStmt::Db { values, location } => {
for value in values {
match value {
DataValue::Byte(expr) => {
let byte = self.evaluate_byte(expr, *location)?;
self.emit_byte(byte);
}
DataValue::String(s) => {
for byte in s.bytes() {
self.emit_byte(byte);
}
}
}
}
}
}
Usage Examples
; Single bytes
.db 0x00, 0xFF, 0x42
; Expressions
.db 10 + 5, CONSTANT - 1
; Strings
.db "Hello, World!", 0x0A, 0
; Mixed
.db "Score: ", 0x30, 0 ; "Score: 0\0"
Strings and Escape Sequences
The scanner handles escape sequences, so:
.db "Line 1\nLine 2\0"
Emits: 4C 69 6E 65 20 31 0A 4C 69 6E 65 20 32 00
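Both passes must agree on how many bytes a string occupies: pass 1 uses `s.len()` and pass 2 iterates `s.bytes()`, and both count UTF-8 bytes rather than characters. The sketch below checks this on the example string. The two-variant `DataValue` here is a simplified stand-in (it holds an already-evaluated byte instead of an expression), and `db_bytes` is a hypothetical helper.

```rust
/// One `.db` item; escape sequences are assumed to be decoded by the scanner.
enum DataValue {
    Byte(u8),
    Text(String),
}

/// Flattens `.db` items into the bytes that would be emitted.
fn db_bytes(values: &[DataValue]) -> Vec<u8> {
    let mut out = Vec::new();
    for value in values {
        match value {
            DataValue::Byte(b) => out.push(*b),
            // bytes() yields the UTF-8 encoding, matching s.len() in pass 1.
            DataValue::Text(s) => out.extend(s.bytes()),
        }
    }
    out
}
```

A non-ASCII character such as `é` therefore occupies two bytes in the output, which is worth keeping in mind if the target console expects a single-byte character set.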
The .dw Directive
.dw (define words) emits 16-bit values in little-endian format.
Pass 1 Handling
#![allow(unused)]
fn main() {
DirectiveStmt::Dw { values, .. } => {
self.current_address = self.current_address
.wrapping_add((values.len() * 2) as u16);
}
}
Pass 2 Handling
#![allow(unused)]
fn main() {
DirectiveStmt::Dw { values, location } => {
for value in values {
let word = self.evaluate_word(value, *location)?;
self.emit_word(word);
}
}
}
Usage Examples
; Numeric values
.dw 0x1234, 0x5678 ; Emits: 34 12 78 56
; Labels (addresses)
.dw start, main_loop ; Emits addresses
; Interrupt vectors
.org 0xFFFC
.dw reset ; Reset vector
.dw irq_handler ; IRQ vector
The .equ Directive
.equ defines a constant that can be used throughout the program.
Pass 1 Handling
#![allow(unused)]
fn main() {
DirectiveStmt::Equ { name, value, location } => {
let val = self.evaluate(value)?;
self.symbols.define_constant(name, val, *location)?;
}
}
Pass 2 Handling
#![allow(unused)]
fn main() {
DirectiveStmt::Equ { .. } => {
// Constants are handled in pass 1, nothing to emit
}
}
Usage Examples
; Hardware addresses
.equ VID_PTR 0xFD
.equ INPUT 0xFF
; Constants
.equ SCREEN_WIDTH 64
.equ SCREEN_HEIGHT 64
.equ PIXEL_COUNT SCREEN_WIDTH * SCREEN_HEIGHT
; Usage
lda INPUT
sta VID_PTR
ldx #SCREEN_WIDTH
Constants vs Labels
Constants differ from labels:
- Labels: Get their value from their position in code
- Constants: Have values assigned directly
.equ CONST 0x42 ; CONST = 0x42 (assigned)
label: ; label = current address (computed)
nop
The .include Directive
.include inserts another source file at the current position.
Implementation (Simplified)
#![allow(unused)]
fn main() {
DirectiveStmt::Include { path, location } => {
// Read the include file
let source = std::fs::read_to_string(path)
.map_err(|_| CodeGenError::Internal {
message: format!("cannot read file: {}", path),
})?;
// Parse it
let program = parser::parse(&source)?;
// Recursively assemble
for stmt in program.statements {
self.process_statement(&stmt)?;
}
}
}
Usage Examples
; main.s
.include "constants.s"
.include "macros.s"
.org 0x8000
reset:
jsr init_screen
; ...
; constants.s
.equ VID_PTR 0xFD
.equ INPUT 0xFF
.equ VRAM 0x1000
Circular Include Detection
We need to detect and prevent circular includes:
; a.s includes b.s
; b.s includes a.s → Error!
Track the chain of files currently being included and report a CircularInclude error if a file appears in it again.
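One way to do this is to keep a stack of the files currently being processed. The sketch below is a minimal version under that assumption; `IncludeStack`, `enter`, and `leave` are hypothetical names, and the `String` error stands in for `CodeGenError::CircularInclude`.

```rust
use std::path::{Path, PathBuf};

/// Tracks the chain of files currently being included.
#[derive(Default)]
struct IncludeStack {
    active: Vec<PathBuf>,
}

impl IncludeStack {
    /// Enters `path`; fails with the offending chain if it is already active.
    fn enter(&mut self, path: &Path) -> Result<(), String> {
        if self.active.iter().any(|p| p.as_path() == path) {
            let chain: Vec<String> = self
                .active
                .iter()
                .map(|p| p.display().to_string())
                .collect();
            return Err(format!(
                "circular include: {} -> {}",
                chain.join(" -> "),
                path.display()
            ));
        }
        self.active.push(path.to_path_buf());
        Ok(())
    }

    /// Leaves the most recently entered file.
    fn leave(&mut self) {
        self.active.pop();
    }
}
```

A stack only rejects true cycles, so the same file can still be included twice from different places. This sketch compares paths as written; a fuller version would probably canonicalize them first so that `./a.s` and `a.s` are treated as the same file.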
Directive Processing Flow
Complete Pass 1 Handler
#![allow(unused)]
fn main() {
pub fn pass1_directive(&mut self, directive: &DirectiveStmt) -> Result<(), CodeGenError> {
match directive {
DirectiveStmt::Org { address, .. } => {
let addr = self.evaluate(address)? as u16;
self.origin = addr;
self.current_address = addr;
}
DirectiveStmt::Db { values, .. } => {
for value in values {
match value {
DataValue::Byte(_) => {
self.current_address = self.current_address.wrapping_add(1);
}
DataValue::String(s) => {
self.current_address =
self.current_address.wrapping_add(s.len() as u16);
}
}
}
}
DirectiveStmt::Dw { values, .. } => {
self.current_address = self
.current_address
.wrapping_add((values.len() * 2) as u16);
}
DirectiveStmt::Equ { name, value, location } => {
let val = self.evaluate(value)?;
self.symbols.define_constant(name, val, *location)?;
}
DirectiveStmt::Include { .. } => {
// Handle includes
}
}
Ok(())
}
}
Complete Pass 2 Handler
#![allow(unused)]
fn main() {
pub fn emit_directive(&mut self, directive: &DirectiveStmt) -> Result<(), CodeGenError> {
match directive {
DirectiveStmt::Org { address, .. } => {
let addr = self.evaluate(address)? as u16;
if addr > self.current_address {
self.pad_to(addr);
}
self.current_address = addr;
}
DirectiveStmt::Db { values, location } => {
for value in values {
match value {
DataValue::Byte(expr) => {
let byte = self.evaluate_byte(expr, *location)?;
self.emit_byte(byte);
}
DataValue::String(s) => {
for byte in s.bytes() {
self.emit_byte(byte);
}
}
}
}
}
DirectiveStmt::Dw { values, location } => {
for value in values {
let word = self.evaluate_word(value, *location)?;
self.emit_word(word);
}
}
DirectiveStmt::Equ { .. } => {
// Nothing to emit
}
DirectiveStmt::Include { .. } => {
// Handle includes
}
}
Ok(())
}
}
Summary
In this chapter, we implemented all ByteASM directives:
- .org: Sets the assembly address, pads with zeros
- .db: Emits bytes and strings
- .dw: Emits 16-bit words (little-endian)
- .equ: Defines named constants
- .include: Includes other source files
In the next chapter, we’ll implement comprehensive error handling and reporting.
Chapter 11: Error Handling and Reporting
In this chapter, we’ll implement comprehensive error handling to provide helpful messages when things go wrong.
Error Categories
Our assembler has three categories of errors:
- Scanner Errors: Problems reading source characters
- Parse Errors: Problems understanding the syntax
- Code Generation Errors: Problems during assembly
The Error Type Hierarchy
#![allow(unused)]
fn main() {
pub enum AssemblerError {
Scanner(ScannerError),
Parse(ParseError),
CodeGen(CodeGenError),
Io { message: String },
Multiple(Vec<AssemblerError>),
}
}
This unified type lets us handle errors consistently throughout the pipeline.
Scanner Errors
#![allow(unused)]
fn main() {
pub enum ScannerError {
UnknownCharacter {
line: usize,
column: usize,
character: char,
},
UnknownDirective {
line: usize,
column: usize,
directive: String,
},
NumberExpected {
line: usize,
column: usize,
symbol: char,
},
UnterminatedString {
line: usize,
column: usize,
quote: char,
},
}
}
Example Messages
error: unknown character '?'
--> game.s:5:10
|
5 | lda ?
| ^
error: unterminated string
--> game.s:12:9
|
12 | .db "Hello
| ^^^^^^ missing closing quote
Parse Errors
#![allow(unused)]
fn main() {
pub enum ParseError {
UnexpectedToken {
expected: String,
found: TokenKind,
location: Location,
},
UnexpectedEof {
location: Location,
},
InvalidOperand {
message: String,
location: Location,
},
InvalidExpression {
message: String,
location: Location,
},
InvalidDirective {
message: String,
location: Location,
},
InvalidLabel {
message: String,
location: Location,
},
}
}
Example Messages
error: unexpected token: expected ')', found ','
--> game.s:15:14
|
15 | lda (0x80,y)
| ^ expected ')' for indirect addressing
error: invalid operand: indexed indirect only supports X register
--> game.s:20:9
|
20 | sta (0x80,y)
| ^^^^^^^^ use (zp),y for indirect indexed mode
Code Generation Errors
#![allow(unused)]
fn main() {
pub enum CodeGenError {
UndefinedSymbol {
name: String,
location: Location,
},
DuplicateSymbol {
name: String,
first: Location,
second: Location,
},
InvalidAddressingMode {
mnemonic: Mnemonic,
mode: AddressingMode,
location: Location,
},
BranchOutOfRange {
offset: i64,
location: Location,
},
ValueOutOfRange {
value: i64,
max: i64,
location: Location,
},
CircularInclude {
path: String,
location: Location,
},
EvaluationError {
message: String,
location: Location,
},
}
}
Example Messages
error: undefined symbol 'sprite_ptr'
--> game.s:42:9
|
42 | lda sprite_ptr
| ^^^^^^^^^^ symbol not defined
error: duplicate symbol 'main'
--> game.s:30:1
|
10 | main:
| ---- first defined here
|
30 | main:
| ^^^^ redefined here
error: branch target out of range (offset -200)
--> game.s:55:5
|
55 | bne far_label
| ^^^^^^^^^^^^^ target is 200 bytes away (max: -128 to +127)
Formatting Errors
The Diagnostic Structure
#![allow(unused)]
fn main() {
pub struct Diagnostic {
pub severity: Severity,
pub message: String,
pub location: Option<Location>,
pub file: Option<String>,
}
pub enum Severity {
Error,
Warning,
Note,
}
}
Formatting with Source Context
#![allow(unused)]
fn main() {
impl Diagnostic {
pub fn format(&self, source: &str) -> String {
let severity = match self.severity {
Severity::Error => "error",
Severity::Warning => "warning",
Severity::Note => "note",
};
let mut output = String::new();
if let Some(loc) = &self.location {
// Header line
if let Some(file) = &self.file {
output.push_str(&format!(
"{}: {}:{}:{}: {}\n",
severity, file, loc.line, loc.column, self.message
));
} else {
output.push_str(&format!(
"{}: [{}:{}]: {}\n",
severity, loc.line, loc.column, self.message
));
}
// Show source context
let lines: Vec<&str> = source.lines().collect();
if loc.line > 0 && loc.line <= lines.len() {
let line = lines[loc.line - 1];
let line_num = format!("{}", loc.line);
let padding = " ".repeat(line_num.len());
output.push_str(&format!(" {} |\n", padding));
output.push_str(&format!(" {} | {}\n", line_num, line));
// Underline
let underline_padding = " ".repeat(loc.column.saturating_sub(1));
let underline = "^".repeat(loc.length.max(1));
output.push_str(&format!(
" {} | {}{}\n",
padding, underline_padding, underline
));
}
} else {
output.push_str(&format!("{}: {}\n", severity, self.message));
}
output
}
}
}
Error Recovery
Rather than stopping at the first error, we continue parsing to find more issues.
Parser Error Recovery
#![allow(unused)]
fn main() {
fn synchronize(&mut self) {
// Skip to the next line
while !self.is_at_end() {
if self.check(TokenKind::NewLine) {
self.advance();
return;
}
self.advance();
}
}
pub fn parse(&mut self) -> Result<Program, Vec<ParseError>> {
let mut program = Program::new();
while !self.is_at_end() {
self.skip_empty_lines();
if self.is_at_end() { break; }
match self.parse_line() {
Ok(stmts) => program.statements.extend(stmts),
Err(e) => {
self.errors.push(e);
self.synchronize(); // Skip to next line
}
}
}
if self.errors.is_empty() {
Ok(program)
} else {
Err(std::mem::take(&mut self.errors))
}
}
}
Assembler Error Collection
#![allow(unused)]
fn main() {
fn pass2(&mut self, program: &Program) -> Result<(), AssemblerError> {
for stmt in &program.statements {
match stmt {
Statement::Instruction(instr) => {
if let Err(e) = self.emit_instruction(instr) {
self.errors.push(e); // Collect, don't stop
}
}
// ...
}
}
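// Report everything collected during the pass instead of silently succeeding
if self.errors.is_empty() {
Ok(())
} else {
Err(AssemblerError::Multiple(
std::mem::take(&mut self.errors)
.into_iter()
.map(AssemblerError::CodeGen)
.collect(),
))
}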
}
}
Warnings
Some issues don’t prevent assembly but should be reported:
#![allow(unused)]
fn main() {
// Zero page address used with absolute mode
pub fn warn_zp_as_absolute(&self, addr: u16, location: Location) {
if addr <= 0xFF {
eprintln!(
"warning: [{}:{}]: address 0x{:02X} could use zero page mode",
location.line, location.column, addr
);
}
}
// Unused symbol
pub fn warn_unused_symbols(&self) {
for sym in self.symbols.unreferenced_symbols() {
eprintln!(
"warning: [{}:{}]: symbol '{}' is defined but never used",
sym.defined_at.line, sym.defined_at.column, sym.name
);
}
}
}
CLI Error Formatting
#![allow(unused)]
fn main() {
fn format_error(error: &ParseError, source: &str, file: &Path) -> String {
let loc = error.location();
let lines: Vec<&str> = source.lines().collect();
let mut output = format!(
"error: {}\n --> {}:{}:{}\n",
error, file.display(), loc.line, loc.column
);
if loc.line > 0 && loc.line <= lines.len() {
let line = lines[loc.line - 1];
let line_num = format!("{}", loc.line);
let padding = " ".repeat(line_num.len());
output.push_str(&format!(" {} |\n", padding));
output.push_str(&format!(" {} | {}\n", line_num, line));
output.push_str(&format!(
" {} | {}{}\n",
padding,
" ".repeat(loc.column.saturating_sub(1)),
"^".repeat(loc.length.max(1))
));
}
output
}
}
Complete Error Output Example
error: unexpected token: expected identifier, found Number
--> game.s:5:10
|
5 | .equ 123 456
| ^^^ expected identifier after .equ
error: undefined symbol 'plyer_x'
--> game.s:15:9
|
15 | lda plyer_x
| ^^^^^^^ did you mean 'player_x'?
error: branch target out of range (offset -150)
--> game.s:42:5
|
42 | bne start
| ^^^^^^^^^ target is too far away
Found 3 errors.
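The "did you mean" hint in the second error is not implemented in this chapter. A common way to produce such hints is to compare the unknown name against the defined symbols by edit distance. The sketch below shows that approach; `edit_distance` and `suggest` are hypothetical helpers, and the distance threshold is an arbitrary choice for illustration.

```rust
/// Levenshtein edit distance between two strings, counted in chars.
fn edit_distance(a: &str, b: &str) -> usize {
    let b_chars: Vec<char> = b.chars().collect();
    // prev[j] = distance between the first i chars of `a` and the first j of `b`.
    let mut prev: Vec<usize> = (0..=b_chars.len()).collect();
    for (i, ca) in a.chars().enumerate() {
        let mut curr = vec![i + 1];
        for (j, cb) in b_chars.iter().enumerate() {
            let cost = if ca == *cb { 0 } else { 1 };
            let best = (prev[j] + cost) // substitute (or match)
                .min(prev[j + 1] + 1) // delete a char from `a`
                .min(curr[j] + 1); // insert a char into `a`
            curr.push(best);
        }
        prev = curr;
    }
    prev[b_chars.len()]
}

/// Returns the closest known symbol within `max_distance` edits, if any.
fn suggest<'a>(unknown: &str, known: &[&'a str], max_distance: usize) -> Option<&'a str> {
    known
        .iter()
        .map(|name| (edit_distance(unknown, name), *name))
        .filter(|(d, _)| *d <= max_distance)
        .min_by_key(|(d, _)| *d)
        .map(|(_, name)| name)
}
```

For the misspelling above, `plyer_x` is one insertion away from `player_x`, so a threshold of 1 or 2 is enough to surface the suggestion without proposing unrelated names.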
Summary
In this chapter, we implemented comprehensive error handling:
- Three error categories: Scanner, parser, and code generation
- Location tracking: Every error includes file, line, and column
- Source context: Show the offending line with underline
- Error recovery: Continue after errors to find more issues
- Warnings: Non-fatal issues like unused symbols
In the next chapter, we’ll build the command-line interface for the assembler.
Chapter 12: The Command-Line Interface
In this chapter, we’ll build a usable command-line tool for our assembler.
CLI Design
byte_asm [OPTIONS] <input.s>
OPTIONS:
-o, --output <file> Output binary file (default: a.out)
-v, --verbose Show assembly progress
--hex Output as hex dump instead of binary
-h, --help Show help message
Argument Parsing
We’ll parse arguments manually for simplicity:
#![allow(unused)]
fn main() {
struct Args {
input: PathBuf,
output: PathBuf,
verbose: bool,
hex_dump: bool,
}
fn parse_args() -> Result<Args, String> {
let args: Vec<String> = std::env::args().collect();
if args.len() < 2 {
return Err(format!(
"Usage: {} [OPTIONS] <input.s>\n\n\
OPTIONS:\n\
-o, --output <file> Output binary file (default: a.out)\n\
-v, --verbose Show assembly progress\n\
--hex Output as hex dump",
args[0]
));
}
let mut input: Option<PathBuf> = None;
let mut output = PathBuf::from("a.out");
let mut verbose = false;
let mut hex_dump = false;
let mut i = 1;
while i < args.len() {
match args[i].as_str() {
"-o" | "--output" => {
i += 1;
if i >= args.len() {
return Err("Expected output file after -o".to_string());
}
output = PathBuf::from(&args[i]);
}
"-v" | "--verbose" => verbose = true,
"--hex" => hex_dump = true,
"-h" | "--help" => {
return Err(format!(
"ByteASM - 6502 Assembler\n\n\
Usage: {} [OPTIONS] <input.s>\n\n\
OPTIONS:\n\
-o, --output <file> Output file (default: a.out)\n\
-v, --verbose Show progress\n\
--hex Output hex dump\n\
-h, --help Show help",
args[0]
));
}
arg if arg.starts_with('-') => {
return Err(format!("Unknown option: {}", arg));
}
_ => {
input = Some(PathBuf::from(&args[i]));
}
}
i += 1;
}
let input = input.ok_or("No input file specified")?;
Ok(Args { input, output, verbose, hex_dump })
}
}
Main Function
use std::fs;
use std::path::{Path, PathBuf};
fn main() {
let args = match parse_args() {
Ok(args) => args,
Err(msg) => {
eprintln!("{}", msg);
std::process::exit(1);
}
};
if args.verbose {
eprintln!("Assembling: {}", args.input.display());
}
// Read source file
let source = match fs::read_to_string(&args.input) {
Ok(s) => s,
Err(e) => {
eprintln!("Error reading {}: {}", args.input.display(), e);
std::process::exit(1);
}
};
// Parse
let program = match parser::parse(&source) {
Ok(p) => p,
Err(errors) => {
for error in errors {
eprintln!("{}", format_error(&error, &source, &args.input));
}
std::process::exit(1);
}
};
if args.verbose {
eprintln!("Parsed {} statements", program.statements.len());
}
// Assemble
let mut assembler = Assembler::new();
let binary = match assembler.assemble(&program) {
Ok(b) => b,
Err(e) => {
eprintln!("{}", format_assembler_error(&e, &source, &args.input));
std::process::exit(1);
}
};
if args.verbose {
eprintln!("Generated {} bytes", binary.len());
eprintln!("Defined {} symbols", assembler.symbols().len());
}
// Output
if args.hex_dump {
print_hex_dump(&binary);
} else {
if let Err(e) = fs::write(&args.output, &binary) {
eprintln!("Error writing {}: {}", args.output.display(), e);
std::process::exit(1);
}
if args.verbose {
eprintln!("Wrote: {}", args.output.display());
}
}
}
Hex Dump Output
For debugging, we can output a hex dump instead of binary:
#![allow(unused)]
fn main() {
fn print_hex_dump(binary: &[u8]) {
const BYTES_PER_LINE: usize = 16;
for (i, chunk) in binary.chunks(BYTES_PER_LINE).enumerate() {
// Address
print!("{:04X} ", i * BYTES_PER_LINE);
// Hex bytes
for (j, byte) in chunk.iter().enumerate() {
print!("{:02X} ", byte);
if j == 7 { print!(" "); } // Extra space in middle
}
// Padding for incomplete lines
if chunk.len() < BYTES_PER_LINE {
let missing = BYTES_PER_LINE - chunk.len();
for j in 0..missing {
print!(" ");
if chunk.len() + j == 7 { print!(" "); }
}
}
// ASCII representation
print!(" |");
for byte in chunk {
if *byte >= 0x20 && *byte < 0x7F {
print!("{}", *byte as char);
} else {
print!(".");
}
}
println!("|");
}
}
}
Example Output
$ byte_asm --hex example.s
0000 A9 42 85 00 4C 00 80 00 00 00 00 00 00 00 00 00 |.B..L...........|
0010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
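Printing directly makes the dump hard to unit test. A variant that builds each row as a `String` can be checked with ordinary assertions. The sketch below is one such variant; `format_hex_line` is a hypothetical helper, and its exact column spacing is a choice made for this example rather than a reproduction of the output above.

```rust
/// Formats one hex-dump row; `offset` is the position of `chunk[0]` in the binary.
/// Expects at most 16 bytes per chunk.
fn format_hex_line(offset: usize, chunk: &[u8]) -> String {
    let mut line = format!("{:04X}  ", offset);
    for i in 0..16 {
        match chunk.get(i) {
            Some(byte) => line.push_str(&format!("{:02X} ", byte)),
            // Pad missing bytes so the ASCII column stays aligned.
            None => line.push_str("   "),
        }
        if i == 7 {
            line.push(' '); // extra gap between the two groups of eight
        }
    }
    line.push_str(" |");
    for &byte in chunk {
        // Printable ASCII is shown as-is; everything else becomes a dot.
        line.push(if (0x20..0x7F).contains(&byte) { byte as char } else { '.' });
    }
    line.push('|');
    line
}
```

`print_hex_dump` could then be reduced to a loop that calls this function on each 16-byte chunk and prints the result.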
Integration with Byte Emulator
The output binary is ready to load into the emulator:
# Assemble
cargo run -p byte_asm -- game.s -o game.bin
# Run
cargo run -p byte_emu -- game.bin
Setting Up the Reset Vector
The emulator reads the reset vector from 0xFFFC-0xFFFD:
.equ VID_PTR 0xFD
.equ VID_PAGE 0x01 ; Video page 1 (0x1000 >> 12)
.org 0x8000
reset:
; Initialize hardware
lda #VID_PAGE
sta VID_PTR ; Set video page
main_loop:
; Game logic
rti
; Set up vectors
.org 0xFFFC
.dw reset ; Reset vector - where to start
.dw main_loop ; IRQ vector - called on VBLANK
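When a program fails to start, a quick sanity check is to read the reset vector back out of the assembled image. The sketch below assumes the binary's first byte corresponds to the first `.org` address (0x8000 in this example); that layout, and the `reset_vector` helper itself, are assumptions for illustration.

```rust
/// Reads the reset vector (0xFFFC-0xFFFD) from an assembled image.
/// Assumes `binary[0]` corresponds to address `origin`.
fn reset_vector(binary: &[u8], origin: u16) -> Option<u16> {
    // Index of 0xFFFC within the image; None if the origin lies past it.
    let start = usize::from(0xFFFCu16.checked_sub(origin)?);
    // None if the image is too short to contain the vector.
    let bytes = binary.get(start..start + 2)?;
    Some(u16::from_le_bytes([bytes[0], bytes[1]]))
}
```

For the program above, the value read back should equal the address of `reset`, which is 0x8000.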
Verbose Output Example
$ byte_asm -v game.s -o game.bin
Assembling: game.s
Parsed 42 statements
Generated 156 bytes
Defined 12 symbols
Wrote: game.bin
Error Output Example
$ byte_asm broken.s
error: undefined symbol 'sprit_x'
--> broken.s:15:9
|
15 | lda sprit_x
| ^^^^^^^ symbol not defined
error: branch target out of range
--> broken.s:42:5
|
42 | bne far_label
| ^^^^^^^^^^^^^ offset -150, must be -128 to +127
Found 2 errors.
Usage Workflow
A typical workflow:
# Edit source
vim game.s
# Assemble
byte_asm game.s -o game.bin
# Test in emulator
byte_emu game.bin
# Debug with hex dump
byte_asm game.s --hex | less
# Verbose build
byte_asm -v game.s -o game.bin
Exit Codes
#![allow(unused)]
fn main() {
// Success
std::process::exit(0);
// Error (parse, assembly, I/O)
std::process::exit(1);
}
Summary
In this chapter, we built a command-line interface that:
- Parses command-line arguments
- Reads source files
- Invokes the parser and assembler
- Outputs binary or hex dump
- Reports errors with context
- Integrates with the byte emulator
In the next chapter, we’ll write tests to verify our assembler works correctly.
Chapter 13: Testing the Assembler
In this chapter, we’ll write tests to verify our assembler produces correct machine code.
Testing Strategy
We’ll test at multiple levels:
- Unit tests: Individual components (scanner, parser, evaluator)
- Integration tests: Complete assembly of programs
- Comparison tests: Compare output to known-good binaries
Scanner Tests
Test that the scanner produces correct tokens:
#[cfg(test)]
mod scanner_tests {
    use byte_asm::scanner::*;
    #[test]
    fn test_number_formats() {
        let mut scanner = Scanner::new("0xFF 0b1010 42");
        let tok = scanner.scan_token().unwrap();
        assert_eq!(tok.kind, TokenKind::Number);
        assert_eq!(tok.number(), Some(255));
        let tok = scanner.scan_token().unwrap();
        assert_eq!(tok.kind, TokenKind::Number);
        assert_eq!(tok.number(), Some(10));
        let tok = scanner.scan_token().unwrap();
        assert_eq!(tok.kind, TokenKind::Number);
        assert_eq!(tok.number(), Some(42));
    }
    #[test]
    fn test_instruction() {
        let mut scanner = Scanner::new("lda");
        let tok = scanner.scan_token().unwrap();
        assert_eq!(tok.kind, TokenKind::Instruction);
    }
    #[test]
    fn test_directive() {
        let mut scanner = Scanner::new(".org");
        let tok = scanner.scan_token().unwrap();
        assert_eq!(tok.kind, TokenKind::Directive);
    }
    #[test]
    fn test_local_label() {
        let mut scanner = Scanner::new(".loop @temp");
        let tok = scanner.scan_token().unwrap();
        assert_eq!(tok.kind, TokenKind::LocalLabel);
        let tok = scanner.scan_token().unwrap();
        assert_eq!(tok.kind, TokenKind::LocalLabel);
    }
}
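The three literals in test_number_formats differ only in prefix and radix. A sketch of how such literals could be converted, assuming a hypothetical function `parse_number` and a 16-bit value range (the scanner built earlier may store numbers differently):

```rust
/// Parses a ByteASM numeric literal: decimal (`42`), hexadecimal (`0xFF`),
/// or binary (`0b1010`). Returns None for malformed or out-of-range input.
/// (Hypothetical helper; the u16 range is an assumption.)
fn parse_number(text: &str) -> Option<u16> {
    let (digits, radix) = if let Some(hex) = text.strip_prefix("0x") {
        (hex, 16)
    } else if let Some(bin) = text.strip_prefix("0b") {
        (bin, 2)
    } else {
        (text, 10)
    };
    // from_str_radix tolerates a leading '+', which is not part of a
    // numeric literal, so reject it explicitly.
    if digits.starts_with('+') {
        return None;
    }
    u16::from_str_radix(digits, radix).ok()
}
```

With this sketch, `parse_number("0xFF")`, `parse_number("0b1010")`, and `parse_number("42")` yield the same values the scanner test expects: 255, 10, and 42.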
Parser Tests
Test that parsing produces correct AST:
#[cfg(test)]
mod parser_tests {
    use byte_asm::parser;
    use byte_asm::ast::*;
    #[test]
    fn test_parse_instruction() {
        let program = parser::parse("lda #0x42").unwrap();
        assert_eq!(program.statements.len(), 1);
        match &program.statements[0] {
            Statement::Instruction(i) => {
                assert!(matches!(i.operand, Some(Operand::Immediate(_))));
            }
            _ => panic!("Expected instruction"),
        }
    }
    #[test]
    fn test_parse_label() {
        let program = parser::parse("main:").unwrap();
        match &program.statements[0] {
            Statement::Label(l) => {
                assert_eq!(l.name, "main");
                assert!(!l.is_local);
            }
            _ => panic!("Expected label"),
        }
    }
    #[test]
    fn test_parse_addressing_modes() {
        // Immediate
        let p = parser::parse("lda #0x42").unwrap();
        assert!(matches!(
            &p.statements[0],
            Statement::Instruction(i) if matches!(i.operand, Some(Operand::Immediate(_)))
        ));
        // Zero Page / Absolute
        let p = parser::parse("lda 0x80").unwrap();
        assert!(matches!(
            &p.statements[0],
            Statement::Instruction(i) if matches!(i.operand, Some(Operand::Address(_)))
        ));
        // Indirect X
        let p = parser::parse("lda (0x80,x)").unwrap();
        assert!(matches!(
            &p.statements[0],
            Statement::Instruction(i) if matches!(i.operand, Some(Operand::IndirectX(_)))
        ));
        // Indirect Y
        let p = parser::parse("lda (0x80),y").unwrap();
        assert!(matches!(
            &p.statements[0],
            Statement::Instruction(i) if matches!(i.operand, Some(Operand::IndirectY(_)))
        ));
    }
}
Assembler Tests
Test that assembly produces correct bytes:
#[cfg(test)]
mod assembler_tests {
    use byte_asm::{parser, Assembler};
    // Parses and assembles a source string, panicking on any error.
    fn assemble(source: &str) -> Vec<u8> {
        let program = parser::parse(source).unwrap();
        let mut asm = Assembler::new();
        asm.assemble(&program).unwrap()
    }
    #[test]
    fn test_nop() {
        let binary = assemble(".org 0x8000\nnop");
        assert_eq!(binary[0], 0xEA);
    }
    #[test]
    fn test_lda_immediate() {
        let binary = assemble(".org 0x8000\nlda #0x42");
        assert_eq!(&binary[0..2], &[0xA9, 0x42]);
    }
    #[test]
    fn test_lda_zero_page() {
        let binary = assemble(".org 0x8000\nlda 0x80");
        assert_eq!(&binary[0..2], &[0xA5, 0x80]);
    }
    #[test]
    fn test_lda_absolute() {
        let binary = assemble(".org 0x8000\nlda 0x2000");
        assert_eq!(&binary[0..3], &[0xAD, 0x00, 0x20]);
    }
    #[test]
    fn test_jmp() {
        let binary = assemble(".org 0x8000\njmp 0x9000");
        assert_eq!(&binary[0..3], &[0x4C, 0x00, 0x90]);
    }
    #[test]
    fn test_label_resolution() {
        let source = r#"
            .org 0x8000
            start:
            jmp end
            nop
            end:
            rts
        "#;
        let binary = assemble(source);
        // jmp is 3 bytes (0x8000-0x8002) and nop is 1 byte (0x8003),
        // so end is at 0x8004: JMP end = 4C 04 80
        assert_eq!(&binary[0..3], &[0x4C, 0x04, 0x80]);
    }
    #[test]
    fn test_branch() {
        let source = r#"
            .org 0x8000
            ldx #5
            loop:
            dex
            bne loop
        "#;
        let binary = assemble(source);
        // BNE loop: D0 FD (offset -3)
        assert_eq!(&binary[3..5], &[0xD0, 0xFD]);
    }
    #[test]
    fn test_db_bytes() {
        let binary = assemble(".org 0x8000\n.db 0x01, 0x02, 0x03");
        assert_eq!(&binary[0..3], &[0x01, 0x02, 0x03]);
    }
    #[test]
    fn test_db_string() {
        let binary = assemble(".org 0x8000\n.db \"Hi\", 0");
        assert_eq!(&binary[0..3], &[0x48, 0x69, 0x00]);
    }
    #[test]
    fn test_dw() {
        let binary = assemble(".org 0x8000\n.dw 0x1234");
        assert_eq!(&binary[0..2], &[0x34, 0x12]); // Little-endian
    }
    #[test]
    fn test_equ() {
        let source = r#"
            .equ VALUE 0x42
            .org 0x8000
            lda #VALUE
        "#;
        let binary = assemble(source);
        assert_eq!(&binary[0..2], &[0xA9, 0x42]);
    }
    #[test]
    fn test_expressions() {
        let source = r#"
            .equ BASE 0x1000
            .org 0x8000
            lda #>BASE
            lda #<BASE
            lda #10 + 5
        "#;
        let binary = assemble(source);
        assert_eq!(&binary[0..6], &[0xA9, 0x10, 0xA9, 0x00, 0xA9, 0x0F]);
    }
}
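The expected bytes in test_expressions follow from the byte-extraction operators: > takes the high byte and < takes the low byte of a 16-bit value. A sketch of that arithmetic, assuming hypothetical helpers `lo`, `hi`, and `lda_immediate` (the assembler's evaluator and code generator may be structured differently):

```rust
/// Low byte of a 16-bit value (what the `<` operator yields).
fn lo(value: u16) -> u8 {
    (value & 0x00FF) as u8
}

/// High byte of a 16-bit value (what the `>` operator yields).
fn hi(value: u16) -> u8 {
    (value >> 8) as u8
}

/// Encodes `lda #value`: opcode 0xA9 followed by the operand byte.
fn lda_immediate(value: u8) -> [u8; 2] {
    [0xA9, value]
}
```

With BASE set to 0x1000, `hi(0x1000)` is 0x10 and `lo(0x1000)` is 0x00, and `10 + 5` evaluates to 0x0F, which gives the six bytes the test asserts.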
Integration Tests
Test complete programs:
// tests/integration.rs
use byte_asm::{parser, Assembler};
use std::fs;
// Reads, parses, and assembles a fixture file, panicking on any error.
fn assemble_file(path: &str) -> Vec<u8> {
    let source = fs::read_to_string(path).unwrap();
    let program = parser::parse(&source).unwrap();
    let mut asm = Assembler::new();
    asm.assemble(&program).unwrap()
}
#[test]
fn test_basic_program() {
    let binary = assemble_file("tests/fixtures/basic.s");
    assert!(!binary.is_empty());
}
#[test]
fn test_all_addressing_modes() {
    let binary = assemble_file("tests/fixtures/addressing_modes.s");
    assert!(!binary.is_empty());
}
Test Fixtures
Create test assembly files:
; tests/fixtures/basic.s
.org 0x8000
start:
lda #0x42
sta 0x00
nop
brk
.org 0xFFFC
.dw start
; tests/fixtures/addressing_modes.s
.org 0x8000
nop
asl a
lda #0xFF
lda 0x80
lda 0x80,x
lda 0x2000
lda 0x2000,x
lda 0x2000,y
lda (0x80,x)
lda (0x80),y
.org 0xFFFC
.dw 0x8000
Running Tests
# Run all tests
cargo test -p byte_asm
# Run specific test
cargo test -p byte_asm test_lda_immediate
# Run with output
cargo test -p byte_asm -- --nocapture
Test Output
running 22 tests
test assembler::codegen::tests::test_absolute ... ok
test assembler::codegen::tests::test_accumulator ... ok
test assembler::codegen::tests::test_immediate ... ok
test assembler::codegen::tests::test_implied ... ok
test assembler::codegen::tests::test_indexed_x ... ok
test assembler::codegen::tests::test_indirect_x ... ok
test assembler::codegen::tests::test_indirect_y ... ok
test assembler::codegen::tests::test_zero_page ... ok
test assembler::directives::tests::test_db_bytes ... ok
test assembler::directives::tests::test_db_string ... ok
test assembler::directives::tests::test_dw ... ok
test assembler::directives::tests::test_equ ... ok
test assembler::directives::tests::test_org ... ok
test assembler::eval::tests::test_binary_ops ... ok
test assembler::eval::tests::test_current_address ... ok
test assembler::eval::tests::test_identifier ... ok
test assembler::eval::tests::test_number ... ok
test assembler::eval::tests::test_unary_ops ... ok
test symbol::tests::test_constants ... ok
test symbol::tests::test_define_and_lookup ... ok
test symbol::tests::test_duplicate_symbol ... ok
test symbol::tests::test_local_labels ... ok
test result: ok. 22 passed; 0 failed
Summary
In this chapter, we wrote comprehensive tests:
- Scanner tests: Token recognition for numbers, instructions, directives, and local labels
- Parser tests: AST structure for statements and operands
- Assembler tests: Machine code output verification
- Integration tests: Complete program assembly
- Test fixtures: Reusable assembly test files
In the final chapter, we’ll put it all together with a complete game example.
Chapter 14: Complete Example - A Bouncing Ball Game
In this final chapter, we’ll put everything together by examining a complete game that demonstrates all assembler features.
The Bouncing Ball Demo
This program displays a white ball that bounces around the screen, demonstrating:
- Hardware initialization
- Game loop structure
- Sprite movement and collision
- Video memory access
Memory Map
; Memory-mapped I/O registers
.equ VID_PTR 0xFD ; Video page pointer (page number, not high byte)
.equ RANDOM 0xFE ; Random number generator
.equ INPUT 0xFF ; Input register
; VRAM location
.equ VRAM 0x1000 ; Start of video RAM
.equ VID_PAGE 0x01 ; Video page number (0x1000 >> 12)
; Screen dimensions
.equ WIDTH 64
.equ HEIGHT 64
; Zero page variables
.equ BALL_X 0x00 ; Ball X position
.equ BALL_Y 0x01 ; Ball Y position
.equ VEL_X 0x02 ; X velocity (signed)
.equ VEL_Y 0x03 ; Y velocity (signed)
.equ OLD_X 0x04 ; Previous X position
.equ OLD_Y 0x05 ; Previous Y position
.equ TEMP 0x06 ; Temporary variable
Using .equ for constants makes the code readable and maintainable.
Initialization
.org 0x8000
reset:
; Set video page to VRAM (page 1 = 0x1000-0x1FFF)
lda #VID_PAGE
sta VID_PTR
; Initialize ball position to center
lda #32
sta BALL_X
sta BALL_Y
; Initialize velocity (moving down-right)
lda #1
sta VEL_X
sta VEL_Y
; Clear the screen
jsr clear_screen
The VID_PTR register takes a page number (0-15), where each page is 4KB. Page 1 corresponds to addresses 0x1000-0x1FFF.
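The page arithmetic can be checked with a few lines of Rust. This is a sketch under the assumptions stated in the text (16 pages of 4KB each); the names `PAGE_SIZE`, `page_range`, and `page_of` are illustrative only:

```rust
/// Size of one video page in bytes (4KB).
const PAGE_SIZE: u16 = 0x1000;

/// Returns the inclusive address range covered by a video page (0-15),
/// or None if the page number is out of range.
fn page_range(page: u8) -> Option<(u16, u16)> {
    if page > 15 {
        return None;
    }
    let start = u16::from(page) * PAGE_SIZE;
    Some((start, start + (PAGE_SIZE - 1)))
}

/// Page number containing an address: its top four bits.
fn page_of(addr: u16) -> u8 {
    (addr >> 12) as u8
}
```

Here `page_of(0x1000)` is 1, matching VID_PAGE, and `page_range(1)` is `Some((0x1000, 0x1FFF))`.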
The Main Loop
The main loop is driven by the IRQ (VBLANK):
main_loop:
; Save old position for erasing
lda BALL_X
sta OLD_X
lda BALL_Y
sta OLD_Y
; Update ball position
jsr update_ball
; Erase old ball
ldx OLD_X
ldy OLD_Y
lda #0 ; black
jsr draw_pixel
; Draw new ball
ldx BALL_X
ldy BALL_Y
lda #1 ; white
jsr draw_pixel
; Wait for next frame
rti
The rti (return from interrupt) ends the handler. Because main_loop is installed as the IRQ vector, it runs again when the next VBLANK interrupt fires.
Ball Update with Bouncing
update_ball:
; Update X position
lda BALL_X
clc
adc VEL_X
sta BALL_X
; Check X bounds
cmp #WIDTH - 1
bcs .bounce_x
cmp #0
beq .bounce_x
jmp .check_y
.bounce_x:
; Reverse X velocity: VEL_X = 0 - VEL_X
lda #0
sec
sbc VEL_X
sta VEL_X
; Clamp X position
lda BALL_X
cmp #WIDTH
bcc .clamp_x_done
lda #WIDTH - 2
sta BALL_X
.clamp_x_done:
lda BALL_X
bne .check_y
lda #1
sta BALL_X
.check_y:
; Similar logic for Y...
rts
This demonstrates:
- Local labels (.bounce_x, .check_y)
- Expression in immediate (#WIDTH - 1)
- Conditional branching
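To see what the X-axis logic does step by step, it can be modeled in Rust with wrapping 8-bit arithmetic. This is a sketch that mirrors the assembly shown above, not code from the assembler; `step_x` is a hypothetical name, and the velocity is kept as a raw byte in two's complement, as VEL_X is:

```rust
const WIDTH: u8 = 64;

/// One X-axis step of update_ball. Takes the current position and the
/// velocity byte (two's complement) and returns the updated pair.
fn step_x(x: u8, vel: u8) -> (u8, u8) {
    let mut x = x.wrapping_add(vel); // clc / adc VEL_X
    let mut vel = vel;
    if x >= WIDTH - 1 || x == 0 {
        // bcs .bounce_x / beq .bounce_x
        vel = 0u8.wrapping_sub(vel); // lda #0 / sec / sbc VEL_X
        if x >= WIDTH {
            // cmp #WIDTH / bcc .clamp_x_done not taken
            x = WIDTH - 2;
        }
        if x == 0 {
            // bne .check_y not taken
            x = 1;
        }
    }
    (x, vel)
}
```

Stepping from x = 62 with velocity 1 gives (63, 0xFF): the ball reaches the right edge and the velocity flips to -1. Stepping from x = 1 with velocity 0xFF gives (1, 1): the ball hits column 0, is clamped back to 1, and heads right again.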
Pixel Drawing
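; Draw a pixel at (X, Y) with color in A
; X = column (0-63)
; Y = row (0-63)
; A = color
; Clobbers A and Y; uses TEMP + 1 and TEMP + 2 as a 16-bit pointer
draw_pixel:
sta TEMP ; Save color
; VRAM offset is Y * 64 + X (0-4095), so build a 16-bit address
; Low byte: low 8 bits of Y * 64, plus X
tya ; A = Y
asl a ; A = Y * 2
asl a ; A = Y * 4
asl a ; A = Y * 8
asl a ; A = Y * 16
asl a ; A = Y * 32
asl a ; A = low byte of Y * 64
stx TEMP + 1 ; Save X
clc
adc TEMP + 1 ; Add X (at most 192 + 63, so no carry)
sta TEMP + 1 ; Pointer low byte
; High byte: >VRAM + Y / 4
tya ; A = Y
lsr a ; A = Y / 2
lsr a ; A = Y / 4
clc
adc #>VRAM ; Add the VRAM base page
sta TEMP + 2 ; Pointer high byte
; Store to VRAM through the pointer
ldy #0
lda TEMP ; Get color back
sta (TEMP + 1),y ; Write to VRAM
rts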
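The offset Y * 64 + X reaches 4095 on a 64x64 screen, so it does not fit in the 8-bit accumulator or in an 8-bit index register: six `asl a` shifts keep only the low 8 bits of Y * 64. A sketch of the arithmetic in Rust (the helper names are illustrative, and VRAM is taken from the memory map above):

```rust
const VRAM: u16 = 0x1000;
const WIDTH: u16 = 64;

/// Full 16-bit VRAM address of pixel (x, y) on the 64x64 screen.
fn pixel_addr(x: u8, y: u8) -> u16 {
    VRAM + u16::from(y) * WIDTH + u16::from(x)
}

/// What six 8-bit left shifts produce: Y * 64 truncated to 8 bits.
fn shifted_8bit(y: u8) -> u8 {
    y << 6
}
```

For rows 0 to 3 the truncated value equals the true offset, but `shifted_8bit(4)` is already 0 while the real offset for row 4 is 256. The high byte of the address therefore has to carry Y / 4 on top of the VRAM base: `pixel_addr(5, 4)` is 0x1105.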
Screen Clearing
clear_screen:
ldx #0
lda #0
.loop:
sta VRAM,x
sta VRAM + 0x100,x
sta VRAM + 0x200,x
sta VRAM + 0x300,x
; ... (more pages)
inx
bne .loop
rts
Using expressions like VRAM + 0x100 makes the code clearer.
Interrupt Vectors
; Set up reset and IRQ vectors
.org 0xFFFC
.dw reset ; Reset vector
.dw main_loop ; IRQ vector (VBLANK)
The .dw directive writes 16-bit addresses in little-endian format.
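As an illustration of the byte order, the four bytes emitted at 0xFFFC can be computed with `u16::to_le_bytes`. The function name `vector_bytes` is hypothetical, and the IRQ handler address used in the example below is made up, since the real address of main_loop depends on the assembled code:

```rust
/// Builds the four bytes stored at 0xFFFC-0xFFFF: the reset vector
/// followed by the IRQ vector, each as a little-endian 16-bit address.
fn vector_bytes(reset: u16, irq: u16) -> [u8; 4] {
    let [reset_lo, reset_hi] = reset.to_le_bytes();
    let [irq_lo, irq_hi] = irq.to_le_bytes();
    [reset_lo, reset_hi, irq_lo, irq_hi]
}
```

With reset at 0x8000 and a handler assumed to sit at 0x8010, `vector_bytes(0x8000, 0x8010)` is `[0x00, 0x80, 0x10, 0x80]`: the low byte of each address comes first.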
Building and Running
# Assemble the game
cargo run -p byte_asm -- bouncing_ball.s -o game.bin
# Run in emulator
cargo run -p byte_emu -- game.bin
With verbose output:
$ cargo run -p byte_asm -- -v bouncing_ball.s -o game.bin
Assembling: bouncing_ball.s
Parsed 85 statements
Generated 312 bytes
Defined 15 symbols
Wrote: game.bin
Features Demonstrated
This example uses most of the assembler's major features (.db is the one it leaves out):
| Feature | Example |
|---|---|
| .org | .org 0x8000 |
| .equ | .equ WIDTH 64 |
| .db | (could add strings) |
| .dw | .dw reset |
| Labels | reset:, main_loop: |
| Local labels | .bounce_x:, .loop: |
| Immediate | lda #32 |
| Zero Page | sta BALL_X |
| Absolute | sta VRAM,x |
| Indexed | sta VRAM + 0x100,x |
| Expressions | #WIDTH - 1, VRAM + 0x100 |
| Branches | bne .loop, bcs .bounce_x |
| Subroutines | jsr update_ball, rts |
Extending the Example
Ideas for additions:
- User input to control ball direction
- Multiple balls
- Score counter
- Sound effects
- Paddle for a Pong-like game
Complete Source
See byte_asm/examples/bouncing_ball.s for the full source code.
Summary
In this tutorial, we built a complete 6502 assembler from scratch:
- Scanner: Tokenizes source into meaningful chunks
- Parser: Builds an AST representing program structure
- Symbol Table: Tracks labels and constants
- Two-Pass Assembler: Resolves forward references
- Code Generator: Emits correct machine code
- Expression Evaluator: Computes values at assembly time
- Directive Handlers: Process .org, .db, .dw, .equ
- Error Handling: Provides helpful error messages
- CLI: User-friendly command-line interface
- Testing: Verifies correctness
The result is a fully functional assembler that can build real programs for the Byte fantasy console.
What’s Next?
- Macros: Add .macro and .endmacro for code reuse
- Conditional Assembly: Add .if, .else, .endif
- Include Path: Search multiple directories for includes
- Listing Files: Generate human-readable assembly listings
- Debug Symbols: Output symbol tables for debuggers
Happy assembling!