Chapter 3: Designing the Abstract Syntax Tree
In this chapter, we’ll design the Abstract Syntax Tree (AST) - the data structure that represents a parsed assembly program.
What is an AST?
An Abstract Syntax Tree is a tree representation of the syntactic structure of source code. Unlike the flat list of tokens from the scanner, the AST captures the hierarchical relationships between program elements.
Why Not Just Use Tokens?
Tokens tell us what elements we have, but not how they relate to each other. Consider:
lda (0x80,x)
The tokens are: INSTRUCTION OPENPAREN NUMBER COMMA REGISTER CLOSEPAREN
But what we need to know is:
- This is an instruction:
LDA - With an operand in Indexed Indirect mode:
(0x80,x) - The base address is:
0x80
The AST captures this structure explicitly.
Program Structure
A ByteASM program consists of statements:
#![allow(unused)]
fn main() {
pub struct Program {
pub statements: Vec<Statement>,
pub source_file: Option<String>,
}
}
Statement Types
There are three kinds of statements:
#![allow(unused)]
fn main() {
pub enum Statement {
Label(LabelDef),
Instruction(InstructionStmt),
Directive(DirectiveStmt),
}
}
Label Definitions
#![allow(unused)]
fn main() {
pub struct LabelDef {
pub name: String,
pub is_local: bool, // true for .loop or @loop
pub location: Location,
}
}
Examples:
main:→LabelDef { name: "main", is_local: false }.loop:→LabelDef { name: ".loop", is_local: true }
Instruction Statements
#![allow(unused)]
fn main() {
pub struct InstructionStmt {
pub mnemonic: Mnemonic,
pub operand: Option<Operand>,
pub location: Location,
}
}
Examples:
nop→InstructionStmt { mnemonic: NOP, operand: None }lda #0x42→InstructionStmt { mnemonic: LDA, operand: Some(Immediate(Number(66))) }
Directive Statements
#![allow(unused)]
fn main() {
pub enum DirectiveStmt {
Org {
address: Expression,
location: Location,
},
Db {
values: Vec<DataValue>,
location: Location,
},
Dw {
values: Vec<Expression>,
location: Location,
},
Equ {
name: String,
value: Expression,
location: Location,
},
Include {
path: String,
location: Location,
},
}
}
Operands and Addressing Modes
The operand type directly encodes the addressing mode:
#![allow(unused)]
fn main() {
pub enum Operand {
// #value - Immediate mode
Immediate(Expression),
// a - Accumulator mode (for ASL, ROL, etc.)
Accumulator,
// address - Zero Page or Absolute
Address(Expression),
// address,x - Indexed by X
IndexedX(Expression),
// address,y - Indexed by Y
IndexedY(Expression),
// (address) - Indirect (JMP only)
Indirect(Expression),
// (zp,x) - Indexed Indirect
IndirectX(Expression),
// (zp),y - Indirect Indexed
IndirectY(Expression),
}
}
Addressing Mode Mapping
| Syntax | Operand Type | 6502 Mode |
|---|---|---|
| (none) | None | Implied |
a | Accumulator | Accumulator |
#expr | Immediate(expr) | Immediate |
expr | Address(expr) | Zero Page or Absolute |
expr,x | IndexedX(expr) | Zero Page,X or Absolute,X |
expr,y | IndexedY(expr) | Zero Page,Y or Absolute,Y |
(expr) | Indirect(expr) | Indirect |
(expr,x) | IndirectX(expr) | Indexed Indirect |
(expr),y | IndirectY(expr) | Indirect Indexed |
Note: The distinction between Zero Page and Absolute is determined during code generation based on the expression value.
Expressions
Expressions represent numeric values that can be computed:
#![allow(unused)]
fn main() {
pub enum Expression {
// Literal number: 0xFF, 255, 0b1010
Number(i64),
// Label or constant name
Identifier(String),
// Local label: .loop, @loop
LocalIdentifier(String),
// Binary operation: left op right
Binary {
left: Box<Expression>,
op: BinaryOp,
right: Box<Expression>,
},
// Unary operation: op operand
Unary {
op: UnaryOp,
operand: Box<Expression>,
},
// Current address: $
CurrentAddress,
}
}
Binary Operators
#![allow(unused)]
fn main() {
pub enum BinaryOp {
Add, // +
Sub, // -
Mul, // *
Div, // /
}
}
Unary Operators
#![allow(unused)]
fn main() {
pub enum UnaryOp {
Neg, // - (negation)
LoByte, // < (extract low byte)
HiByte, // > (extract high byte)
}
}
Expression Examples
| Source | AST |
|---|---|
42 | Number(42) |
label | Identifier("label") |
.loop | LocalIdentifier(".loop") |
$ | CurrentAddress |
10 + 5 | Binary { left: Number(10), op: Add, right: Number(5) } |
<0x1234 | Unary { op: LoByte, operand: Number(0x1234) } |
>label | Unary { op: HiByte, operand: Identifier("label") } |
Data Values for .db
The .db directive can contain bytes or strings:
#![allow(unused)]
fn main() {
pub enum DataValue {
Byte(Expression),
String(String),
}
}
Example:
.db "Hello", 0x0A, 0
Parses to:
#![allow(unused)]
fn main() {
Db {
values: vec![
DataValue::String("Hello"),
DataValue::Byte(Number(0x0A)),
DataValue::Byte(Number(0)),
]
}
}
Complete AST Example
Let’s trace through parsing this program:
.equ SCREEN 0x1000
.org 0x8000
start:
lda #>SCREEN
sta 0xFD
.loop:
jmp .loop
.org 0xFFFC
.dw start
The AST would be:
#![allow(unused)]
fn main() {
Program {
statements: [
Directive(Equ {
name: "SCREEN",
value: Number(0x1000),
}),
Directive(Org {
address: Number(0x8000),
}),
Label(LabelDef {
name: "start",
is_local: false,
}),
Instruction(InstructionStmt {
mnemonic: LDA,
operand: Some(Immediate(
Unary { op: HiByte, operand: Identifier("SCREEN") }
)),
}),
Instruction(InstructionStmt {
mnemonic: STA,
operand: Some(Address(Number(0xFD))),
}),
Label(LabelDef {
name: ".loop",
is_local: true,
}),
Instruction(InstructionStmt {
mnemonic: JMP,
operand: Some(Address(LocalIdentifier(".loop"))),
}),
Directive(Org {
address: Number(0xFFFC),
}),
Directive(Dw {
values: [Identifier("start")],
}),
],
source_file: Some("example.s"),
}
}
Design Principles
1. Preserve Source Information
Every node includes a Location so we can report errors with line numbers:
#![allow(unused)]
fn main() {
impl Statement {
pub fn location(&self) -> &Location {
match self {
Statement::Label(l) => &l.location,
Statement::Instruction(i) => &i.location,
Statement::Directive(d) => d.location(),
}
}
}
}
2. Explicit Over Implicit
Rather than encoding addressing modes as strings or enums, we use distinct types that make the structure clear:
#![allow(unused)]
fn main() {
// Good: Structure is explicit
Operand::IndirectY(Expression::Number(0x80))
// Bad: Structure is implicit
Operand { mode: "indirect_y", value: 0x80 }
}
3. Expressions Are Recursive
Using Box<Expression> allows expressions to be nested:
#![allow(unused)]
fn main() {
// Represents: (label + offset) * 2
Binary {
left: Box::new(Binary {
left: Box::new(Identifier("label")),
op: Add,
right: Box::new(Identifier("offset")),
}),
op: Mul,
right: Box::new(Number(2)),
}
}
4. Helper Constructors
We provide convenient constructors:
#![allow(unused)]
fn main() {
impl Expression {
pub fn binary(left: Expression, op: BinaryOp, right: Expression) -> Self {
Expression::Binary {
left: Box::new(left),
op,
right: Box::new(right),
}
}
pub fn unary(op: UnaryOp, operand: Expression) -> Self {
Expression::Unary {
op,
operand: Box::new(operand),
}
}
}
}
Summary
In this chapter, we designed an AST that:
- Represents the complete structure of a ByteASM program
- Distinguishes between labels, instructions, and directives
- Encodes addressing modes through operand types
- Supports recursive expressions with operators
- Preserves source location for error reporting
In the next chapter, we’ll start building the parser that constructs this AST from tokens.
Previous: Chapter 2 - Building the Scanner | Next: Chapter 4 - Parser Structure