
Chapter 2: Building the Scanner (Lexer)

In this chapter, we’ll build the scanner (also called a lexer), the first stage of our assembler. The scanner converts source text into a stream of tokens.

What is Lexical Analysis?

The scanner performs lexical analysis: breaking the input stream of characters into meaningful chunks called tokens. Think of tokens as the “words” of our assembly language.

Characters to Tokens

Source:  lda #0xFF

Tokens:  [INSTRUCTION(LDA)] [HASH] [NUMBER(255)]

Each token has:

  • A kind (what type of token it is)
  • Optionally a value (the parsed data)
  • A location (line, column, start position)

Token Types in ByteASM

Let’s define all the token types our scanner will produce:

pub enum TokenKind {
    // Punctuation
    CloseParen,    // )
    Colon,         // :
    Comma,         // ,
    OpenParen,     // (

    // Operators
    Hash,          // #
    Minus,         // -
    Plus,          // +
    Slash,         // /
    Star,          // *
    LessThan,      // <
    GreaterThan,   // >

    // Special symbols
    Dollar,        // $ (current address)
    Semicolon,     // ; (comment start)

    // Literals
    Number,        // 0xFF, 0b1010, 42
    String,        // "hello"

    // Identifiers and keywords
    Directive,     // .org, .db, etc.
    Instruction,   // lda, sta, etc.
    Register,      // a, x, y
    Identifier,    // label names
    LocalLabel,    // .loop or @loop

    // Structure
    Comment,       // ; to end of line
    NewLine,       // \n
    EOF,           // End of file
}

The Token Structure

Each token carries information about its type, value, and position:

pub struct Token {
    pub kind: TokenKind,
    pub value: Option<TokenValue>,
    pub location: Location,
}

pub struct Location {
    pub line: usize,      // Line number (1-indexed)
    pub column: usize,    // Column number (1-indexed)
    pub start: usize,     // Byte offset in source
    pub length: usize,    // Length in bytes
}

pub enum TokenValue {
    Number(u64),
    String(String),
    Directive(Directive),
    Instruction(Mnemonic),
}
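The start and length fields let later stages slice the exact lexeme back out of the source without storing a copy. A quick standalone illustration (the offsets here are worked out by hand for this particular string):

```rust
fn main() {
    let source = "lda #0xFF";
    // "0xFF" begins at byte offset 5 and is 4 bytes long, so a
    // Location { start: 5, length: 4, .. } recovers the lexeme:
    let (start, length) = (5, 4);
    assert_eq!(&source[start..start + length], "0xFF");
    println!("ok");
}
```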

Scanner Architecture

Our scanner uses a cursor to track our position in the source:

pub struct Scanner<'a> {
    cursor: Cursor<'a>,
    source: &'a str,
}

pub struct Cursor<'a> {
    chars: Peekable<Chars<'a>>,
    line: usize,
    column: usize,
    current: usize,    // Current byte position
    start: usize,      // Start of current token
}

The Cursor

The cursor provides these operations:

impl<'a> Cursor<'a> {
    /// Peek at the next character without consuming it
    fn peek(&mut self) -> Option<char>

    /// Advance and return the next character
    fn advance(&mut self) -> Option<char>

    /// Mark the start of a new token
    fn sync(&mut self)

    /// Create a Location for the current token
    fn location(&self) -> Location

    /// Advance the line counter (after a newline)
    fn advance_line(&mut self)
}
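Here is one possible minimal implementation of the peek/advance pair. This is a sketch, not the chapter's final code: it folds newline bookkeeping directly into advance, whereas ByteASM's cursor defers that to a separate advance_line so the NewLine token is still located on the line it ends.

```rust
use std::iter::Peekable;
use std::str::Chars;

struct Cursor<'a> {
    chars: Peekable<Chars<'a>>,
    line: usize,
    column: usize,
    current: usize, // byte offset into the source
}

impl<'a> Cursor<'a> {
    fn new(source: &'a str) -> Self {
        Cursor { chars: source.chars().peekable(), line: 1, column: 1, current: 0 }
    }

    /// Look at the next character without consuming it.
    fn peek(&mut self) -> Option<char> {
        self.chars.peek().copied()
    }

    /// Consume the next character, updating byte offset and line/column.
    fn advance(&mut self) -> Option<char> {
        let c = self.chars.next()?;
        self.current += c.len_utf8();
        if c == '\n' {
            self.line += 1;
            self.column = 1;
        } else {
            self.column += 1;
        }
        Some(c)
    }
}

fn main() {
    let mut cur = Cursor::new("ab\nc");
    assert_eq!(cur.peek(), Some('a')); // peeking does not consume
    assert_eq!(cur.advance(), Some('a'));
    cur.advance(); // 'b'
    cur.advance(); // '\n'
    assert_eq!((cur.line, cur.column), (2, 1));
    assert_eq!(cur.advance(), Some('c'));
    assert_eq!(cur.advance(), None);
    println!("ok");
}
```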

The Main Scanning Loop

The heart of the scanner is the scan_token method:

pub fn scan_token(&mut self) -> Result<Token, ScannerError> {
    self.skip_whitespace();
    self.cursor.sync();

    match self.cursor.advance() {
        None => Ok(self.make_token(TokenKind::EOF, None)),
        Some(c) => match c {
            // Single-character tokens
            ')' => Ok(self.make_token(TokenKind::CloseParen, None)),
            '(' => Ok(self.make_token(TokenKind::OpenParen, None)),
            ',' => Ok(self.make_token(TokenKind::Comma, None)),
            ':' => Ok(self.make_token(TokenKind::Colon, None)),
            '#' => Ok(self.make_token(TokenKind::Hash, None)),
            '+' => Ok(self.make_token(TokenKind::Plus, None)),
            '-' => Ok(self.make_token(TokenKind::Minus, None)),
            '*' => Ok(self.make_token(TokenKind::Star, None)),
            '/' => Ok(self.make_token(TokenKind::Slash, None)),
            '<' => Ok(self.make_token(TokenKind::LessThan, None)),
            '>' => Ok(self.make_token(TokenKind::GreaterThan, None)),
            '$' => Ok(self.make_token(TokenKind::Dollar, None)),

            // Newline: emit the token first, then bump the line counter
            '\n' => {
                let token = self.make_token(TokenKind::NewLine, None);
                self.cursor.advance_line();
                Ok(token)
            }

            // Comment: consume to end of line
            ';' => {
                self.scan_comment();
                Ok(self.make_token(TokenKind::Comment, None))
            }

            // Numbers, identifiers, directives, strings...
            _ => self.scan_complex_token(c),
        }
    }
}
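Stripped of cursor and location bookkeeping, the single-character arms reduce to a pure mapping from char to token kind. This toy version makes that shape easy to see (Kind is a cut-down stand-in for TokenKind, not the real enum):

```rust
#[derive(Debug, PartialEq)]
enum Kind {
    OpenParen,
    CloseParen,
    Comma,
    Hash,
    Unknown(char),
}

// Map each non-whitespace character to a token kind.
fn scan(source: &str) -> Vec<Kind> {
    source
        .chars()
        .filter(|c| !c.is_whitespace())
        .map(|c| match c {
            '(' => Kind::OpenParen,
            ')' => Kind::CloseParen,
            ',' => Kind::Comma,
            '#' => Kind::Hash,
            other => Kind::Unknown(other),
        })
        .collect()
}

fn main() {
    assert_eq!(
        scan("#, ()"),
        vec![Kind::Hash, Kind::Comma, Kind::OpenParen, Kind::CloseParen]
    );
    println!("ok");
}
```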

Scanning Numbers

ByteASM supports three number formats:

  • 0xFF - Hexadecimal (0x prefix)
  • 0b1010 - Binary (0b prefix)
  • 42 - Decimal (no prefix)
fn scan_number(&mut self, first_char: char) -> Result<Token, ScannerError> {
    // Check for a radix prefix
    if first_char == '0' {
        match self.cursor.peek() {
            Some('x') | Some('X') => {
                self.cursor.advance();
                return self.scan_hex();
            }
            Some('b') | Some('B') => {
                self.cursor.advance();
                return self.scan_binary();
            }
            _ => {}
        }
    }

    self.scan_decimal()
}

fn scan_hex(&mut self) -> Result<Token, ScannerError> {
    let start = self.cursor.current;

    // Consume hex digits
    while let Some(c) = self.cursor.peek() {
        if c.is_ascii_hexdigit() {
            self.cursor.advance();
        } else {
            break;
        }
    }

    // Must have at least one digit after 0x
    if self.cursor.current == start {
        return Err(ScannerError::NumberExpected {
            line: self.cursor.line,
            column: self.cursor.column,
            symbol: 'x',
        });
    }

    // Parse the hex value (the `?` relies on ScannerError
    // implementing From<ParseIntError>)
    let hex_str = &self.source[start..self.cursor.current];
    let value = u64::from_str_radix(hex_str, 16)?;

    Ok(self.make_token(TokenKind::Number, Some(TokenValue::Number(value))))
}
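The three-format rule can be exercised in isolation with the standard library's u64::from_str_radix, which is what scan_hex and scan_binary ultimately rely on. Here parse_number is a standalone helper written just for this demonstration, not part of the scanner:

```rust
// Parse a ByteASM-style number literal: 0x.. hex, 0b.. binary, else decimal.
fn parse_number(text: &str) -> Option<u64> {
    if let Some(rest) = text.strip_prefix("0x").or_else(|| text.strip_prefix("0X")) {
        u64::from_str_radix(rest, 16).ok()
    } else if let Some(rest) = text.strip_prefix("0b").or_else(|| text.strip_prefix("0B")) {
        u64::from_str_radix(rest, 2).ok()
    } else {
        text.parse::<u64>().ok()
    }
}

fn main() {
    assert_eq!(parse_number("0xFF"), Some(255));
    assert_eq!(parse_number("0b1010"), Some(10));
    assert_eq!(parse_number("42"), Some(42));
    assert_eq!(parse_number("0xZZ"), None); // invalid hex digits
    println!("ok");
}
```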

Scanning Identifiers

Identifiers include:

  • Label names (main, loop_counter)
  • Instruction mnemonics (lda, sta)
  • Register names (a, x, y)
fn scan_identifier(&mut self) -> Result<Token, ScannerError> {
    // Consume alphanumeric characters and underscores
    while let Some(c) = self.cursor.peek() {
        if c.is_ascii_alphanumeric() || c == '_' {
            self.cursor.advance();
        } else {
            break;
        }
    }

    let text = &self.source[self.cursor.start..self.cursor.current];

    // Check for register names (case-insensitive)
    let lower = text.to_lowercase();
    if lower == "a" || lower == "x" || lower == "y" {
        return Ok(self.make_token(TokenKind::Register, None));
    }

    // Check for instruction mnemonics
    if let Ok(mnemonic) = Mnemonic::try_from(text.to_uppercase().as_str()) {
        return Ok(self.make_token(
            TokenKind::Instruction,
            Some(TokenValue::Instruction(mnemonic)),
        ));
    }

    // It's a general identifier
    Ok(self.make_token(TokenKind::Identifier, None))
}
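The register/instruction/identifier decision is a case-insensitive lookup over the consumed text. The sketch below shows the same cascade with a hypothetical three-mnemonic subset standing in for the full Mnemonic table:

```rust
// Classify a word the way scan_identifier does: registers first, then
// mnemonics, then fall through to a plain identifier.
fn classify(word: &str) -> &'static str {
    match word.to_lowercase().as_str() {
        "a" | "x" | "y" => "Register",
        // Hypothetical subset of the mnemonic table, for illustration only
        "lda" | "sta" | "jmp" => "Instruction",
        _ => "Identifier",
    }
}

fn main() {
    assert_eq!(classify("X"), "Register");
    assert_eq!(classify("LDA"), "Instruction");
    assert_eq!(classify("loop_counter"), "Identifier");
    println!("ok");
}
```

Note the ordering matters: a is both a plausible label name and a register, so the register check must run first.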

Scanning Directives and Local Labels

Both directives and local labels start with .:

fn scan_dot(&mut self) -> Result<Token, ScannerError> {
    // Consume the identifier after the dot
    let start = self.cursor.current;
    while let Some(c) = self.cursor.peek() {
        if c.is_ascii_alphanumeric() || c == '_' {
            self.cursor.advance();
        } else {
            break;
        }
    }

    let text = &self.source[start..self.cursor.current];

    // Try to parse as a directive
    if let Ok(directive) = Directive::try_from(text.to_uppercase().as_str()) {
        return Ok(self.make_token(
            TokenKind::Directive,
            Some(TokenValue::Directive(directive)),
        ));
    }

    // Otherwise it's a local label
    Ok(self.make_token(TokenKind::LocalLabel, None))
}

Local labels can also start with @:

'@' => {
    self.scan_identifier_rest();
    Ok(self.make_token(TokenKind::LocalLabel, None))
}

Scanning Strings

Strings appear in .db directives:

fn scan_string(&mut self, quote: char) -> Result<String, ScannerError> {
    let mut result = String::new();

    while let Some(c) = self.cursor.peek() {
        if c == quote || c == '\n' {
            break;
        }
        self.cursor.advance();

        // Handle escape sequences
        if c == '\\' {
            match self.cursor.peek() {
                Some('n') => { result.push('\n'); self.cursor.advance(); }
                Some('r') => { result.push('\r'); self.cursor.advance(); }
                Some('t') => { result.push('\t'); self.cursor.advance(); }
                Some('"') => { result.push('"'); self.cursor.advance(); }
                Some('\'') => { result.push('\''); self.cursor.advance(); }
                Some('\\') => { result.push('\\'); self.cursor.advance(); }
                Some(e) => { result.push(e); self.cursor.advance(); }
                // EOF right after a backslash: the loop exits and the
                // unterminated-string check below reports the error
                None => continue,
            }
        } else {
            result.push(c);
        }
    }

    // Check for unterminated string (EOF or newline before the close quote)
    if self.cursor.peek() != Some(quote) {
        return Err(ScannerError::UnterminatedString {
            line: self.cursor.line,
            column: self.cursor.column,
            quote,
        });
    }

    // Consume the closing quote
    self.cursor.advance();
    Ok(result)
}
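The escape-handling core can be isolated into a small helper that works on an already-extracted string. This unescape function is illustrative only (the real scanner does the same work inline while consuming the source), but it implements the same rules: a recognized escape maps to its control character, and any other escaped character is passed through as-is.

```rust
// Expand \n, \r, \t; pass any other escaped character through unchanged.
fn unescape(raw: &str) -> String {
    let mut out = String::new();
    let mut chars = raw.chars();
    while let Some(c) = chars.next() {
        if c == '\\' {
            match chars.next() {
                Some('n') => out.push('\n'),
                Some('r') => out.push('\r'),
                Some('t') => out.push('\t'),
                Some(other) => out.push(other), // covers \" \' \\ and unknowns
                None => {} // trailing backslash: nothing to expand
            }
        } else {
            out.push(c);
        }
    }
    out
}

fn main() {
    assert_eq!(unescape(r"line1\nline2"), "line1\nline2");
    assert_eq!(unescape(r#"say \"hi\""#), "say \"hi\"");
    assert_eq!(unescape(r"back\\slash"), "back\\slash");
    println!("ok");
}
```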

Handling Comments

Comments run from ; to the end of the line:

fn scan_comment(&mut self) {
    while let Some(c) = self.cursor.peek() {
        if c == '\n' {
            break;
        }
        self.cursor.advance();
    }
}

Whitespace Handling

We skip spaces, tabs, and carriage returns (but not newlines, which are significant):

fn skip_whitespace(&mut self) {
    while let Some(c) = self.cursor.peek() {
        match c {
            ' ' | '\r' | '\t' => { self.cursor.advance(); }
            _ => break,
        }
    }
}

Error Handling

The scanner can produce these errors:

pub enum ScannerError {
    UnknownCharacter { line: usize, column: usize, character: char },
    UnknownDirective { line: usize, column: usize, directive: String },
    NumberExpected { line: usize, column: usize, symbol: char },
    UnterminatedString { line: usize, column: usize, quote: char },
}

Running the Scanner

Let’s trace through scanning this example:

.org 0x8000
start:
    lda #0x00

Input     Token Kind     Value
.org      Directive      ORG
0x8000    Number         32768
\n        NewLine        -
start     Identifier     -
:         Colon          -
\n        NewLine        -
lda       Instruction    LDA
#         Hash           -
0x00      Number         0
\n        NewLine        -
(end)     EOF            -

Summary

In this chapter, we built a scanner that:

  • Recognizes all ByteASM token types
  • Handles three number formats (hex, binary, decimal)
  • Distinguishes between identifiers, instructions, and registers
  • Parses directives and local labels
  • Handles strings with escape sequences
  • Tracks location information for error reporting

In the next chapter, we’ll design the Abstract Syntax Tree that will represent our parsed program.

