
Chapter 2: Building the Scanner (Lexer)

In this chapter, we’ll build the scanner (also called a lexer), the first stage of our assembler. The scanner converts source text into a stream of tokens.

What is Lexical Analysis?

The scanner performs lexical analysis: breaking the input stream of characters into meaningful chunks called tokens. Think of tokens as the “words” of our assembly language.

Characters to Tokens

Source:  lda #0xFF

Tokens:  [INSTRUCTION(LDA)] [HASH] [NUMBER(255)]

Each token has:

  • A kind (what type of token it is)
  • Optionally a value (the parsed data)
  • A location (line, column, start position)

Token Types in ByteASM

Let’s define all the token types our scanner will produce:

pub enum TokenKind {
    // Punctuation
    CloseParen,    // )
    Colon,         // :
    Comma,         // ,
    OpenParen,     // (

    // Operators
    Hash,          // #
    Minus,         // -
    Plus,          // +
    Slash,         // /
    Star,          // *
    LessThan,      // <
    GreaterThan,   // >

    // Special symbols
    Dollar,        // $ (current address)
    Semicolon,     // ; (comment start)

    // Literals
    Number,        // 0xFF, 0b1010, 42
    String,        // "hello"

    // Identifiers and keywords
    Directive,     // .org, .db, etc.
    Instruction,   // lda, sta, etc.
    Register,      // a, x, y
    Identifier,    // label names
    LocalLabel,    // .loop or @loop

    // Structure
    Comment,       // ; to end of line
    NewLine,       // \n
    EOF,           // End of file
}

The Token Structure

Each token carries information about its type, value, and position:

pub struct Token {
    pub kind: TokenKind,
    pub value: Option<TokenValue>,
    pub location: Location,
}

pub struct Location {
    pub line: usize,      // Line number (1-indexed)
    pub column: usize,    // Column number (1-indexed)
    pub start: usize,     // Byte offset in source
    pub length: usize,    // Length in bytes
}

pub enum TokenValue {
    Number(u64),
    String(String),
    Directive(Directive),
    Instruction(Mnemonic),
}
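The start and length fields let later stages slice the exact lexeme back out of the source without storing a copy. A quick standalone illustration (the offsets here are worked out by hand for this particular string):

```rust
fn main() {
    let source = "lda #0xFF";
    // "0xFF" begins at byte offset 5 and is 4 bytes long, so a
    // Location { start: 5, length: 4, .. } recovers the lexeme:
    let (start, length) = (5, 4);
    assert_eq!(&source[start..start + length], "0xFF");
    println!("ok");
}
```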

Scanner Architecture

Our scanner uses a cursor to track our position in the source:

pub struct Scanner<'a> {
    cursor: Cursor<'a>,
    source: &'a str,
}

pub struct Cursor<'a> {
    chars: Peekable<Chars<'a>>,
    line: usize,
    column: usize,
    current: usize,    // Current byte position
    start: usize,      // Start of current token
}

The Cursor

The cursor provides these operations:

impl<'a> Cursor<'a> {
    /// Peek at the next character without consuming it
    fn peek(&mut self) -> Option<char>

    /// Advance and return the next character
    fn advance(&mut self) -> Option<char>

    /// Mark the start of a new token
    fn sync(&mut self)

    /// Create a Location for the current token
    fn location(&self) -> Location

    /// Advance the line counter (after a newline)
    fn advance_line(&mut self)
}
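Here is one possible minimal implementation of the peek/advance pair. This is a sketch, not the chapter's final code: it folds newline bookkeeping directly into advance, whereas ByteASM's cursor defers that to a separate advance_line so the NewLine token is still located on the line it ends.

```rust
use std::iter::Peekable;
use std::str::Chars;

struct Cursor<'a> {
    chars: Peekable<Chars<'a>>,
    line: usize,
    column: usize,
    current: usize, // byte offset into the source
}

impl<'a> Cursor<'a> {
    fn new(source: &'a str) -> Self {
        Cursor { chars: source.chars().peekable(), line: 1, column: 1, current: 0 }
    }

    /// Look at the next character without consuming it.
    fn peek(&mut self) -> Option<char> {
        self.chars.peek().copied()
    }

    /// Consume the next character, updating byte offset and line/column.
    fn advance(&mut self) -> Option<char> {
        let c = self.chars.next()?;
        self.current += c.len_utf8();
        if c == '\n' {
            self.line += 1;
            self.column = 1;
        } else {
            self.column += 1;
        }
        Some(c)
    }
}

fn main() {
    let mut cur = Cursor::new("ab\nc");
    assert_eq!(cur.peek(), Some('a')); // peeking does not consume
    assert_eq!(cur.advance(), Some('a'));
    cur.advance(); // 'b'
    cur.advance(); // '\n'
    assert_eq!((cur.line, cur.column), (2, 1));
    assert_eq!(cur.advance(), Some('c'));
    assert_eq!(cur.advance(), None);
    println!("ok");
}
```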

The Main Scanning Loop

The heart of the scanner is the scan_token method:

pub fn scan_token(&mut self) -> Result<Token, ScannerError> {
    self.skip_whitespace();
    self.cursor.sync();

    match self.cursor.advance() {
        None => Ok(self.make_token(TokenKind::EOF, None)),
        Some(c) => match c {
            // Single-character tokens
            ')' => Ok(self.make_token(TokenKind::CloseParen, None)),
            '(' => Ok(self.make_token(TokenKind::OpenParen, None)),
            ',' => Ok(self.make_token(TokenKind::Comma, None)),
            ':' => Ok(self.make_token(TokenKind::Colon, None)),
            '#' => Ok(self.make_token(TokenKind::Hash, None)),
            '+' => Ok(self.make_token(TokenKind::Plus, None)),
            '-' => Ok(self.make_token(TokenKind::Minus, None)),
            '*' => Ok(self.make_token(TokenKind::Star, None)),
            '/' => Ok(self.make_token(TokenKind::Slash, None)),
            '<' => Ok(self.make_token(TokenKind::LessThan, None)),
            '>' => Ok(self.make_token(TokenKind::GreaterThan, None)),
            '$' => Ok(self.make_token(TokenKind::Dollar, None)),

            // Newline: emit the token first, then bump the line counter
            '\n' => {
                let token = self.make_token(TokenKind::NewLine, None);
                self.cursor.advance_line();
                Ok(token)
            }

            // Comment: consume to end of line
            ';' => {
                self.scan_comment();
                Ok(self.make_token(TokenKind::Comment, None))
            }

            // Numbers, identifiers, directives, strings...
            _ => self.scan_complex_token(c),
        }
    }
}
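Stripped of cursor and location bookkeeping, the single-character arms reduce to a pure mapping from char to token kind. This toy version makes that shape easy to see (Kind is a cut-down stand-in for TokenKind, not the real enum):

```rust
#[derive(Debug, PartialEq)]
enum Kind {
    OpenParen,
    CloseParen,
    Comma,
    Hash,
    Unknown(char),
}

// Map each non-whitespace character to a token kind.
fn scan(source: &str) -> Vec<Kind> {
    source
        .chars()
        .filter(|c| !c.is_whitespace())
        .map(|c| match c {
            '(' => Kind::OpenParen,
            ')' => Kind::CloseParen,
            ',' => Kind::Comma,
            '#' => Kind::Hash,
            other => Kind::Unknown(other),
        })
        .collect()
}

fn main() {
    assert_eq!(
        scan("#, ()"),
        vec![Kind::Hash, Kind::Comma, Kind::OpenParen, Kind::CloseParen]
    );
    println!("ok");
}
```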

Scanning Numbers

ByteASM supports three number formats:

  • 0xFF - Hexadecimal (0x prefix)
  • 0b1010 - Binary (0b prefix)
  • 42 - Decimal (no prefix)
fn scan_number(&mut self, first_char: char) -> Result<Token, ScannerError> {
    // Check for a radix prefix
    if first_char == '0' {
        match self.cursor.peek() {
            Some('x') | Some('X') => {
                self.cursor.advance();
                return self.scan_hex();
            }
            Some('b') | Some('B') => {
                self.cursor.advance();
                return self.scan_binary();
            }
            _ => {}
        }
    }

    self.scan_decimal()
}

fn scan_hex(&mut self) -> Result<Token, ScannerError> {
    let start = self.cursor.current;

    // Consume hex digits
    while let Some(c) = self.cursor.peek() {
        if c.is_ascii_hexdigit() {
            self.cursor.advance();
        } else {
            break;
        }
    }

    // Must have at least one digit after 0x
    if self.cursor.current == start {
        return Err(ScannerError::NumberExpected {
            line: self.cursor.line,
            column: self.cursor.column,
            symbol: 'x',
        });
    }

    // Parse the hex value (the `?` relies on ScannerError
    // implementing From<ParseIntError>)
    let hex_str = &self.source[start..self.cursor.current];
    let value = u64::from_str_radix(hex_str, 16)?;

    Ok(self.make_token(TokenKind::Number, Some(TokenValue::Number(value))))
}
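The three-format rule can be exercised in isolation with the standard library's u64::from_str_radix, which is what scan_hex and scan_binary ultimately rely on. Here parse_number is a standalone helper written just for this demonstration, not part of the scanner:

```rust
// Parse a ByteASM-style number literal: 0x.. hex, 0b.. binary, else decimal.
fn parse_number(text: &str) -> Option<u64> {
    if let Some(rest) = text.strip_prefix("0x").or_else(|| text.strip_prefix("0X")) {
        u64::from_str_radix(rest, 16).ok()
    } else if let Some(rest) = text.strip_prefix("0b").or_else(|| text.strip_prefix("0B")) {
        u64::from_str_radix(rest, 2).ok()
    } else {
        text.parse::<u64>().ok()
    }
}

fn main() {
    assert_eq!(parse_number("0xFF"), Some(255));
    assert_eq!(parse_number("0b1010"), Some(10));
    assert_eq!(parse_number("42"), Some(42));
    assert_eq!(parse_number("0xZZ"), None); // invalid hex digits
    println!("ok");
}
```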

Scanning Identifiers

Identifiers include:

  • Label names (main, loop_counter)
  • Instruction mnemonics (lda, sta)
  • Register names (a, x, y)
fn scan_identifier(&mut self) -> Result<Token, ScannerError> {
    // Consume alphanumeric characters and underscores
    while let Some(c) = self.cursor.peek() {
        if c.is_ascii_alphanumeric() || c == '_' {
            self.cursor.advance();
        } else {
            break;
        }
    }

    let text = &self.source[self.cursor.start..self.cursor.current];

    // Check for register names (case-insensitive)
    let lower = text.to_lowercase();
    if lower == "a" || lower == "x" || lower == "y" {
        return Ok(self.make_token(TokenKind::Register, None));
    }

    // Check for instruction mnemonics
    if let Ok(mnemonic) = Mnemonic::try_from(text.to_uppercase().as_str()) {
        return Ok(self.make_token(
            TokenKind::Instruction,
            Some(TokenValue::Instruction(mnemonic)),
        ));
    }

    // It's a general identifier
    Ok(self.make_token(TokenKind::Identifier, None))
}
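The register/instruction/identifier decision is a case-insensitive lookup over the consumed text. The sketch below shows the same cascade with a hypothetical three-mnemonic subset standing in for the full Mnemonic table:

```rust
// Classify a word the way scan_identifier does: registers first, then
// mnemonics, then fall through to a plain identifier.
fn classify(word: &str) -> &'static str {
    match word.to_lowercase().as_str() {
        "a" | "x" | "y" => "Register",
        // Hypothetical subset of the mnemonic table, for illustration only
        "lda" | "sta" | "jmp" => "Instruction",
        _ => "Identifier",
    }
}

fn main() {
    assert_eq!(classify("X"), "Register");
    assert_eq!(classify("LDA"), "Instruction");
    assert_eq!(classify("loop_counter"), "Identifier");
    println!("ok");
}
```

Note the ordering matters: a is both a plausible label name and a register, so the register check must run first.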

Scanning Directives and Local Labels

Both directives and local labels start with .:

fn scan_dot(&mut self) -> Result<Token, ScannerError> {
    // Consume the identifier after the dot
    let start = self.cursor.current;
    while let Some(c) = self.cursor.peek() {
        if c.is_ascii_alphanumeric() || c == '_' {
            self.cursor.advance();
        } else {
            break;
        }
    }

    let text = &self.source[start..self.cursor.current];

    // Try to parse as a directive
    if let Ok(directive) = Directive::try_from(text.to_uppercase().as_str()) {
        return Ok(self.make_token(
            TokenKind::Directive,
            Some(TokenValue::Directive(directive)),
        ));
    }

    // Otherwise it's a local label
    Ok(self.make_token(TokenKind::LocalLabel, None))
}

Local labels can also start with @:

'@' => {
    self.scan_identifier_rest();
    Ok(self.make_token(TokenKind::LocalLabel, None))
}

Scanning Strings

Strings appear in .db directives:

fn scan_string(&mut self, quote: char) -> Result<String, ScannerError> {
    let mut result = String::new();

    while let Some(c) = self.cursor.peek() {
        if c == quote || c == '\n' {
            break;
        }
        self.cursor.advance();

        // Handle escape sequences
        if c == '\\' {
            match self.cursor.peek() {
                Some('n') => { result.push('\n'); self.cursor.advance(); }
                Some('r') => { result.push('\r'); self.cursor.advance(); }
                Some('t') => { result.push('\t'); self.cursor.advance(); }
                Some('"') => { result.push('"'); self.cursor.advance(); }
                Some('\'') => { result.push('\''); self.cursor.advance(); }
                Some('\\') => { result.push('\\'); self.cursor.advance(); }
                Some(e) => { result.push(e); self.cursor.advance(); }
                // EOF right after a backslash: the loop exits and the
                // unterminated-string check below reports the error
                None => continue,
            }
        } else {
            result.push(c);
        }
    }

    // Check for unterminated string (EOF or newline before the close quote)
    if self.cursor.peek() != Some(quote) {
        return Err(ScannerError::UnterminatedString {
            line: self.cursor.line,
            column: self.cursor.column,
            quote,
        });
    }

    // Consume the closing quote
    self.cursor.advance();
    Ok(result)
}
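The escape-handling core can be isolated into a small helper that works on an already-extracted string. This unescape function is illustrative only (the real scanner does the same work inline while consuming the source), but it implements the same rules: a recognized escape maps to its control character, and any other escaped character is passed through as-is.

```rust
// Expand \n, \r, \t; pass any other escaped character through unchanged.
fn unescape(raw: &str) -> String {
    let mut out = String::new();
    let mut chars = raw.chars();
    while let Some(c) = chars.next() {
        if c == '\\' {
            match chars.next() {
                Some('n') => out.push('\n'),
                Some('r') => out.push('\r'),
                Some('t') => out.push('\t'),
                Some(other) => out.push(other), // covers \" \' \\ and unknowns
                None => {} // trailing backslash: nothing to expand
            }
        } else {
            out.push(c);
        }
    }
    out
}

fn main() {
    assert_eq!(unescape(r"line1\nline2"), "line1\nline2");
    assert_eq!(unescape(r#"say \"hi\""#), "say \"hi\"");
    assert_eq!(unescape(r"back\\slash"), "back\\slash");
    println!("ok");
}
```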

Handling Comments

Comments run from ; to the end of the line:

fn scan_comment(&mut self) {
    while let Some(c) = self.cursor.peek() {
        if c == '\n' {
            break;
        }
        self.cursor.advance();
    }
}

Whitespace Handling

We skip spaces, tabs, and carriage returns (but not newlines, which are significant):

fn skip_whitespace(&mut self) {
    while let Some(c) = self.cursor.peek() {
        match c {
            ' ' | '\r' | '\t' => { self.cursor.advance(); }
            _ => break,
        }
    }
}

Error Handling

The scanner can produce these errors:

pub enum ScannerError {
    UnknownCharacter { line: usize, column: usize, character: char },
    UnknownDirective { line: usize, column: usize, directive: String },
    NumberExpected { line: usize, column: usize, symbol: char },
    UnterminatedString { line: usize, column: usize, quote: char },
}

Running the Scanner

Let’s trace through scanning this example:

.org 0x8000
start:
    lda #0x00

Input     Token Kind     Value
.org      Directive      ORG
0x8000    Number         32768
\n        NewLine        -
start     Identifier     -
:         Colon          -
\n        NewLine        -
lda       Instruction    LDA
#         Hash           -
0x00      Number         0
\n        NewLine        -
(end)     EOF            -

Summary

In this chapter, we built a scanner that:

  • Recognizes all ByteASM token types
  • Handles three number formats (hex, binary, decimal)
  • Distinguishes between identifiers, instructions, and registers
  • Parses directives and local labels
  • Handles strings with escape sequences
  • Tracks location information for error reporting

In the next chapter, we’ll design the Abstract Syntax Tree that will represent our parsed program.

