Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Chapter 3: Designing the Abstract Syntax Tree

In this chapter, we’ll design the Abstract Syntax Tree (AST) - the data structure that represents a parsed assembly program.

What is an AST?

An Abstract Syntax Tree is a tree representation of the syntactic structure of source code. Unlike the flat list of tokens from the scanner, the AST captures the hierarchical relationships between program elements.

Why Not Just Use Tokens?

Tokens tell us what elements we have, but not how they relate to each other. Consider:

lda (0x80,x)

The tokens are: INSTRUCTION OPENPAREN NUMBER COMMA REGISTER CLOSEPAREN

But what we need to know is:

  • This is an instruction: LDA
  • With an operand in Indexed Indirect mode: (0x80,x)
  • The base address is: 0x80

The AST captures this structure explicitly.

Program Structure

A ByteASM program consists of statements:

#![allow(unused)]
fn main() {
pub struct Program {
    pub statements: Vec<Statement>,
    pub source_file: Option<String>,
}
}

Statement Types

There are three kinds of statements:

#![allow(unused)]
fn main() {
pub enum Statement {
    Label(LabelDef),
    Instruction(InstructionStmt),
    Directive(DirectiveStmt),
}
}

Label Definitions

#![allow(unused)]
fn main() {
pub struct LabelDef {
    pub name: String,
    pub is_local: bool,     // true for .loop or @loop
    pub location: Location,
}
}

Examples:

  • main:LabelDef { name: "main", is_local: false }
  • .loop:LabelDef { name: ".loop", is_local: true }

Instruction Statements

#![allow(unused)]
fn main() {
pub struct InstructionStmt {
    pub mnemonic: Mnemonic,
    pub operand: Option<Operand>,
    pub location: Location,
}
}

Examples:

  • nopInstructionStmt { mnemonic: NOP, operand: None }
  • lda #0x42InstructionStmt { mnemonic: LDA, operand: Some(Immediate(Number(66))) }

Directive Statements

#![allow(unused)]
fn main() {
pub enum DirectiveStmt {
    Org {
        address: Expression,
        location: Location,
    },
    Db {
        values: Vec<DataValue>,
        location: Location,
    },
    Dw {
        values: Vec<Expression>,
        location: Location,
    },
    Equ {
        name: String,
        value: Expression,
        location: Location,
    },
    Include {
        path: String,
        location: Location,
    },
}
}

Operands and Addressing Modes

The operand type directly encodes the addressing mode:

#![allow(unused)]
fn main() {
pub enum Operand {
    // #value - Immediate mode
    Immediate(Expression),

    // a - Accumulator mode (for ASL, ROL, etc.)
    Accumulator,

    // address - Zero Page or Absolute
    Address(Expression),

    // address,x - Indexed by X
    IndexedX(Expression),

    // address,y - Indexed by Y
    IndexedY(Expression),

    // (address) - Indirect (JMP only)
    Indirect(Expression),

    // (zp,x) - Indexed Indirect
    IndirectX(Expression),

    // (zp),y - Indirect Indexed
    IndirectY(Expression),
}
}

Addressing Mode Mapping

SyntaxOperand Type6502 Mode
(none)NoneImplied
aAccumulatorAccumulator
#exprImmediate(expr)Immediate
exprAddress(expr)Zero Page or Absolute
expr,xIndexedX(expr)Zero Page,X or Absolute,X
expr,yIndexedY(expr)Zero Page,Y or Absolute,Y
(expr)Indirect(expr)Indirect
(expr,x)IndirectX(expr)Indexed Indirect
(expr),yIndirectY(expr)Indirect Indexed

Note: The distinction between Zero Page and Absolute is determined during code generation based on the expression value.

Expressions

Expressions represent numeric values that can be computed:

#![allow(unused)]
fn main() {
pub enum Expression {
    // Literal number: 0xFF, 255, 0b1010
    Number(i64),

    // Label or constant name
    Identifier(String),

    // Local label: .loop, @loop
    LocalIdentifier(String),

    // Binary operation: left op right
    Binary {
        left: Box<Expression>,
        op: BinaryOp,
        right: Box<Expression>,
    },

    // Unary operation: op operand
    Unary {
        op: UnaryOp,
        operand: Box<Expression>,
    },

    // Current address: $
    CurrentAddress,
}
}

Binary Operators

#![allow(unused)]
fn main() {
pub enum BinaryOp {
    Add,  // +
    Sub,  // -
    Mul,  // *
    Div,  // /
}
}

Unary Operators

#![allow(unused)]
fn main() {
pub enum UnaryOp {
    Neg,     // - (negation)
    LoByte,  // < (extract low byte)
    HiByte,  // > (extract high byte)
}
}

Expression Examples

SourceAST
42Number(42)
labelIdentifier("label")
.loopLocalIdentifier(".loop")
$CurrentAddress
10 + 5Binary { left: Number(10), op: Add, right: Number(5) }
<0x1234Unary { op: LoByte, operand: Number(0x1234) }
>labelUnary { op: HiByte, operand: Identifier("label") }

Data Values for .db

The .db directive can contain bytes or strings:

#![allow(unused)]
fn main() {
pub enum DataValue {
    Byte(Expression),
    String(String),
}
}

Example:

.db "Hello", 0x0A, 0

Parses to:

#![allow(unused)]
fn main() {
Db {
    values: vec![
        DataValue::String("Hello"),
        DataValue::Byte(Number(0x0A)),
        DataValue::Byte(Number(0)),
    ]
}
}

Complete AST Example

Let’s trace through parsing this program:

.equ SCREEN 0x1000

.org 0x8000

start:
    lda #>SCREEN
    sta 0xFD
.loop:
    jmp .loop

.org 0xFFFC
.dw start

The AST would be:

#![allow(unused)]
fn main() {
Program {
    statements: [
        Directive(Equ {
            name: "SCREEN",
            value: Number(0x1000),
        }),
        Directive(Org {
            address: Number(0x8000),
        }),
        Label(LabelDef {
            name: "start",
            is_local: false,
        }),
        Instruction(InstructionStmt {
            mnemonic: LDA,
            operand: Some(Immediate(
                Unary { op: HiByte, operand: Identifier("SCREEN") }
            )),
        }),
        Instruction(InstructionStmt {
            mnemonic: STA,
            operand: Some(Address(Number(0xFD))),
        }),
        Label(LabelDef {
            name: ".loop",
            is_local: true,
        }),
        Instruction(InstructionStmt {
            mnemonic: JMP,
            operand: Some(Address(LocalIdentifier(".loop"))),
        }),
        Directive(Org {
            address: Number(0xFFFC),
        }),
        Directive(Dw {
            values: [Identifier("start")],
        }),
    ],
    source_file: Some("example.s"),
}
}

Design Principles

1. Preserve Source Information

Every node includes a Location so we can report errors with line numbers:

#![allow(unused)]
fn main() {
impl Statement {
    pub fn location(&self) -> &Location {
        match self {
            Statement::Label(l) => &l.location,
            Statement::Instruction(i) => &i.location,
            Statement::Directive(d) => d.location(),
        }
    }
}
}

2. Explicit Over Implicit

Rather than encoding addressing modes as strings or enums, we use distinct types that make the structure clear:

#![allow(unused)]
fn main() {
// Good: Structure is explicit
Operand::IndirectY(Expression::Number(0x80))

// Bad: Structure is implicit
Operand { mode: "indirect_y", value: 0x80 }
}

3. Expressions Are Recursive

Using Box<Expression> allows expressions to be nested:

#![allow(unused)]
fn main() {
// Represents: (label + offset) * 2
Binary {
    left: Box::new(Binary {
        left: Box::new(Identifier("label")),
        op: Add,
        right: Box::new(Identifier("offset")),
    }),
    op: Mul,
    right: Box::new(Number(2)),
}
}

4. Helper Constructors

We provide convenient constructors:

#![allow(unused)]
fn main() {
impl Expression {
    pub fn binary(left: Expression, op: BinaryOp, right: Expression) -> Self {
        Expression::Binary {
            left: Box::new(left),
            op,
            right: Box::new(right),
        }
    }

    pub fn unary(op: UnaryOp, operand: Expression) -> Self {
        Expression::Unary {
            op,
            operand: Box::new(operand),
        }
    }
}
}

Summary

In this chapter, we designed an AST that:

  • Represents the complete structure of a ByteASM program
  • Distinguishes between labels, instructions, and directives
  • Encodes addressing modes through operand types
  • Supports recursive expressions with operators
  • Preserves source location for error reporting

In the next chapter, we’ll start building the parser that constructs this AST from tokens.


Previous: Chapter 2 - Building the Scanner | Next: Chapter 4 - Parser Structure