# Chapter 2: Building the Scanner (Lexer)
In this chapter, we’ll build the scanner (also called a lexer) - the first stage of our assembler. The scanner converts source text into a stream of tokens.
## What is Lexical Analysis?
The scanner performs lexical analysis: breaking the input stream of characters into meaningful chunks called tokens. Think of tokens as the “words” of our assembly language.
### Characters to Tokens

```text
Source: lda #0xFF
Tokens: [INSTRUCTION(LDA)] [HASH] [NUMBER(255)]
```
Each token has:
- A kind (what type of token it is)
- Optionally a value (the parsed data)
- A location (line, column, start position)
## Token Types in ByteASM
Let’s define all the token types our scanner will produce:
```rust
pub enum TokenKind {
    // Punctuation
    CloseParen,   // )
    Colon,        // :
    Comma,        // ,
    OpenParen,    // (

    // Operators
    Hash,         // #
    Minus,        // -
    Plus,         // +
    Slash,        // /
    Star,         // *
    LessThan,     // <
    GreaterThan,  // >

    // Special symbols
    Dollar,       // $ (current address)
    Semicolon,    // ; (comment start)

    // Literals
    Number,       // 0xFF, 0b1010, 42
    String,       // "hello"

    // Identifiers and keywords
    Directive,    // .org, .db, etc.
    Instruction,  // lda, sta, etc.
    Register,     // a, x, y
    Identifier,   // label names
    LocalLabel,   // .loop or @loop

    // Structure
    Comment,      // ; to end of line
    NewLine,      // \n
    EOF,          // end of file
}
```
## The Token Structure
Each token carries information about its type, value, and position:
```rust
pub struct Token {
    pub kind: TokenKind,
    pub value: Option<TokenValue>,
    pub location: Location,
}

pub struct Location {
    pub line: usize,   // line number (1-indexed)
    pub column: usize, // column number (1-indexed)
    pub start: usize,  // byte offset in source
    pub length: usize, // length in bytes
}

pub enum TokenValue {
    Number(u64),
    String(String),
    Directive(Directive),
    Instruction(Mnemonic),
}
```
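As a concrete illustration, here is how a token for the literal `0xFF` might be built and inspected. The block repeats simplified, single-variant copies of the types above so it compiles on its own; the real enums have many more variants.

```rust
// Simplified copies of the types above, so this example is self-contained.
#[derive(Debug, PartialEq)]
pub enum TokenKind { Number }

#[derive(Debug, PartialEq)]
pub enum TokenValue { Number(u64) }

#[derive(Debug, PartialEq)]
pub struct Location {
    pub line: usize,
    pub column: usize,
    pub start: usize,
    pub length: usize,
}

#[derive(Debug, PartialEq)]
pub struct Token {
    pub kind: TokenKind,
    pub value: Option<TokenValue>,
    pub location: Location,
}

fn main() {
    // A token for the literal `0xFF` at line 1, column 5:
    // it starts at byte offset 4 and spans 4 bytes.
    let token = Token {
        kind: TokenKind::Number,
        value: Some(TokenValue::Number(0xFF)),
        location: Location { line: 1, column: 5, start: 4, length: 4 },
    };
    assert_eq!(token.value, Some(TokenValue::Number(255)));
    assert_eq!(token.location.length, 4);
}
```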
## Scanner Architecture

Our scanner uses a cursor to track its position in the source:
```rust
use std::iter::Peekable;
use std::str::Chars;

pub struct Scanner<'a> {
    cursor: Cursor<'a>,
    source: &'a str,
}

pub struct Cursor<'a> {
    chars: Peekable<Chars<'a>>,
    line: usize,
    column: usize,
    current: usize, // current byte position
    start: usize,   // start of current token
}
```
### The Cursor
The cursor provides these operations:
```rust
impl<'a> Cursor<'a> {
    /// Peek at the next character without consuming it
    fn peek(&mut self) -> Option<char> { /* ... */ }

    /// Advance and return the next character
    fn advance(&mut self) -> Option<char> { /* ... */ }

    /// Mark the start of a new token
    fn sync(&mut self) { /* ... */ }

    /// Create a Location for the current token
    fn location(&self) -> Location { /* ... */ }

    /// Advance the line counter (after a newline)
    fn advance_line(&mut self) { /* ... */ }
}
```
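The trickiest detail is keeping the byte offset correct while iterating over `char`s: `advance` must add the character's UTF-8 length, not `1`, or slicing the source by offset will panic on multi-byte input. A minimal self-contained sketch of the cursor (a plausible implementation of the interface above, not necessarily the book's exact code):

```rust
use std::iter::Peekable;
use std::str::Chars;

pub struct Cursor<'a> {
    chars: Peekable<Chars<'a>>,
    line: usize,
    column: usize,
    current: usize, // current byte position
    start: usize,   // start of the current token
}

impl<'a> Cursor<'a> {
    pub fn new(source: &'a str) -> Self {
        Cursor { chars: source.chars().peekable(), line: 1, column: 1, current: 0, start: 0 }
    }

    /// Peek at the next character without consuming it.
    pub fn peek(&mut self) -> Option<char> {
        self.chars.peek().copied()
    }

    /// Advance and return the next character, updating byte offset and column.
    pub fn advance(&mut self) -> Option<char> {
        let c = self.chars.next()?;
        self.current += c.len_utf8(); // byte offset, not char count
        self.column += 1;
        Some(c)
    }

    /// Mark the start of a new token.
    pub fn sync(&mut self) {
        self.start = self.current;
    }

    /// Advance the line counter (call after consuming a newline).
    pub fn advance_line(&mut self) {
        self.line += 1;
        self.column = 1;
    }
}

fn main() {
    let mut cur = Cursor::new("lda");
    assert_eq!(cur.peek(), Some('l'));
    assert_eq!(cur.advance(), Some('l'));
    assert_eq!(cur.advance(), Some('d'));
    assert_eq!(cur.current, 2); // two bytes consumed
}
```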
## The Main Scanning Loop

The heart of the scanner is the `scan_token` method:
```rust
pub fn scan_token(&mut self) -> Result<Token, ScannerError> {
    self.skip_whitespace();
    self.cursor.sync();

    match self.cursor.advance() {
        None => self.make_token(TokenKind::EOF, None),
        Some(c) => match c {
            // Single-character tokens
            ')' => self.make_token(TokenKind::CloseParen, None),
            '(' => self.make_token(TokenKind::OpenParen, None),
            ',' => self.make_token(TokenKind::Comma, None),
            ':' => self.make_token(TokenKind::Colon, None),
            '#' => self.make_token(TokenKind::Hash, None),
            '+' => self.make_token(TokenKind::Plus, None),
            '-' => self.make_token(TokenKind::Minus, None),
            '*' => self.make_token(TokenKind::Star, None),
            '/' => self.make_token(TokenKind::Slash, None),
            '<' => self.make_token(TokenKind::LessThan, None),
            '>' => self.make_token(TokenKind::GreaterThan, None),
            '$' => self.make_token(TokenKind::Dollar, None),

            // Newline: build the token first so its location stays on this line
            '\n' => {
                let token = self.make_token(TokenKind::NewLine, None);
                self.cursor.advance_line();
                token
            }

            // Comment
            ';' => {
                self.scan_comment();
                self.make_token(TokenKind::Comment, None)
            }

            // More complex tokens...
            _ => self.scan_complex_token(c),
        },
    }
}
```
## Scanning Numbers

ByteASM supports three number formats:

- `0xFF` - Hexadecimal (`0x` prefix)
- `0b1010` - Binary (`0b` prefix)
- `42` - Decimal (no prefix)
```rust
fn scan_number(&mut self, first_char: char) -> Result<Token, ScannerError> {
    // Check for a radix prefix
    if first_char == '0' {
        match self.cursor.peek() {
            Some('x') | Some('X') => {
                self.cursor.advance();
                return self.scan_hex();
            }
            Some('b') | Some('B') => {
                self.cursor.advance();
                return self.scan_binary();
            }
            _ => {}
        }
    }
    self.scan_decimal()
}

fn scan_hex(&mut self) -> Result<Token, ScannerError> {
    let start = self.cursor.current;

    // Consume hex digits
    while let Some(c) = self.cursor.peek() {
        if c.is_ascii_hexdigit() {
            self.cursor.advance();
        } else {
            break;
        }
    }

    // Must have at least one digit after 0x
    if self.cursor.current == start {
        return Err(ScannerError::NumberExpected {
            line: self.cursor.line,
            column: self.cursor.column,
            symbol: 'x',
        });
    }

    // Parse the hex value (the `?` relies on a From<ParseIntError>
    // conversion for ScannerError)
    let hex_str = &self.source[start..self.cursor.current];
    let value = u64::from_str_radix(hex_str, 16)?;

    self.make_token(TokenKind::Number, Some(TokenValue::Number(value)))
}
```
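The same prefix-dispatch logic can be exercised in isolation. This standalone helper is an illustration only (it is not part of the scanner, which works character by character on the cursor), but it parses the same three formats from a complete literal:

```rust
/// Parse a ByteASM number literal: 0x.. (hex), 0b.. (binary), or decimal.
/// Returns None for an empty or malformed literal.
fn parse_number(text: &str) -> Option<u64> {
    if let Some(hex) = text.strip_prefix("0x").or_else(|| text.strip_prefix("0X")) {
        u64::from_str_radix(hex, 16).ok()
    } else if let Some(bin) = text.strip_prefix("0b").or_else(|| text.strip_prefix("0B")) {
        u64::from_str_radix(bin, 2).ok()
    } else {
        text.parse::<u64>().ok()
    }
}

fn main() {
    assert_eq!(parse_number("0xFF"), Some(255));
    assert_eq!(parse_number("0b1010"), Some(10));
    assert_eq!(parse_number("42"), Some(42));
    assert_eq!(parse_number("0x"), None); // no digits after the prefix
}
```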
## Scanning Identifiers

Identifiers include:

- Label names (`main`, `loop_counter`)
- Instruction mnemonics (`lda`, `sta`)
- Register names (`a`, `x`, `y`)
```rust
fn scan_identifier(&mut self) -> Result<Token, ScannerError> {
    // Consume alphanumeric characters and underscores
    while let Some(c) = self.cursor.peek() {
        if c.is_ascii_alphanumeric() || c == '_' {
            self.cursor.advance();
        } else {
            break;
        }
    }

    let text = &self.source[self.cursor.start..self.cursor.current];
    let lower = text.to_lowercase();

    // Check for register names
    if lower == "a" || lower == "x" || lower == "y" {
        return self.make_token(TokenKind::Register, None);
    }

    // Check for instruction mnemonics
    if let Ok(mnemonic) = Mnemonic::try_from(text.to_uppercase().as_str()) {
        return self.make_token(
            TokenKind::Instruction,
            Some(TokenValue::Instruction(mnemonic)),
        );
    }

    // It's a general identifier
    self.make_token(TokenKind::Identifier, None)
}
```
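The classification order matters: registers are checked first, then mnemonics, and anything unrecognized falls through to `Identifier`. A standalone sketch of that fall-through, using a hypothetical two-entry mnemonic list in place of the real `Mnemonic` table:

```rust
#[derive(Debug, PartialEq)]
enum Class { Register, Instruction, Identifier }

/// Classify a scanned word, mirroring the fall-through order above.
/// The mnemonic list here is a tiny stand-in for the real table.
fn classify(text: &str) -> Class {
    let lower = text.to_lowercase();
    if matches!(lower.as_str(), "a" | "x" | "y") {
        return Class::Register;
    }
    if matches!(lower.as_str(), "lda" | "sta") {
        return Class::Instruction;
    }
    Class::Identifier
}

fn main() {
    assert_eq!(classify("X"), Class::Register);      // case-insensitive
    assert_eq!(classify("LDA"), Class::Instruction);
    assert_eq!(classify("loop_counter"), Class::Identifier);
}
```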
## Scanning Directives and Local Labels

Both directives and local labels start with `.`:
```rust
fn scan_dot(&mut self) -> Result<Token, ScannerError> {
    // Consume the identifier after the dot
    let start = self.cursor.current;
    while let Some(c) = self.cursor.peek() {
        if c.is_ascii_alphanumeric() || c == '_' {
            self.cursor.advance();
        } else {
            break;
        }
    }

    let text = &self.source[start..self.cursor.current];

    // Try to parse as a directive
    if let Ok(directive) = Directive::try_from(text.to_uppercase().as_str()) {
        return self.make_token(
            TokenKind::Directive,
            Some(TokenValue::Directive(directive)),
        );
    }

    // Otherwise it's a local label
    self.make_token(TokenKind::LocalLabel, None)
}
```
Local labels can also start with `@`:

```rust
// Inside scan_token's match:
'@' => {
    self.scan_identifier_rest();
    self.make_token(TokenKind::LocalLabel, None)
}
```
## Scanning Strings

Strings appear in `.db` directives:
```rust
fn scan_string(&mut self, quote: char) -> Result<String, ScannerError> {
    let mut result = String::new();

    while let Some(c) = self.cursor.peek() {
        // Stop at the closing quote or at a line break (unterminated string)
        if c == quote || c == '\n' {
            break;
        }
        self.cursor.advance();

        // Handle escape sequences
        if c == '\\' {
            match self.cursor.peek() {
                Some('n') => { result.push('\n'); self.cursor.advance(); }
                Some('r') => { result.push('\r'); self.cursor.advance(); }
                Some('t') => { result.push('\t'); self.cursor.advance(); }
                Some('"') => { result.push('"'); self.cursor.advance(); }
                Some('\'') => { result.push('\''); self.cursor.advance(); }
                Some('\\') => { result.push('\\'); self.cursor.advance(); }
                // Unknown escape: keep the escaped character as-is
                Some(e) => { result.push(e); self.cursor.advance(); }
                // Trailing backslash at end of input; the loop will exit
                None => continue,
            }
        } else {
            result.push(c);
        }
    }

    // Check for an unterminated string
    if self.cursor.peek() != Some(quote) {
        return Err(ScannerError::UnterminatedString {
            line: self.cursor.line,
            column: self.cursor.column,
            quote,
        });
    }

    // Consume the closing quote
    self.cursor.advance();
    Ok(result)
}
```
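The escape handling can be distilled into a standalone unescaper over an already-extracted string body. This is illustrative only (the real scanner works on the cursor directly), but it applies the same rules:

```rust
/// Expand the escape sequences the scanner recognizes: \n \r \t \" \' \\.
/// Unknown escapes pass the escaped character through unchanged.
fn unescape(body: &str) -> String {
    let mut result = String::new();
    let mut chars = body.chars();
    while let Some(c) = chars.next() {
        if c == '\\' {
            match chars.next() {
                Some('n') => result.push('\n'),
                Some('r') => result.push('\r'),
                Some('t') => result.push('\t'),
                Some(other) => result.push(other), // covers ", ', \, and unknowns
                None => break, // trailing backslash: nothing to escape
            }
        } else {
            result.push(c);
        }
    }
    result
}

fn main() {
    assert_eq!(unescape(r"hello\n"), "hello\n");
    assert_eq!(unescape(r#"say \"hi\""#), "say \"hi\"");
    assert_eq!(unescape(r"a\\b"), "a\\b");
}
```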
## Handling Comments

Comments run from `;` to the end of the line:
```rust
fn scan_comment(&mut self) {
    // Consume everything up to (but not including) the newline
    while let Some(c) = self.cursor.peek() {
        if c == '\n' {
            break;
        }
        self.cursor.advance();
    }
}
```
## Whitespace Handling

We skip spaces, tabs, and carriage returns (but not newlines, which are significant):
```rust
fn skip_whitespace(&mut self) {
    while let Some(c) = self.cursor.peek() {
        match c {
            ' ' | '\r' | '\t' => { self.cursor.advance(); }
            _ => break,
        }
    }
}
```
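A quick standalone check of that rule, written over a plain `Peekable<Chars>` rather than the cursor:

```rust
use std::iter::Peekable;
use std::str::Chars;

/// Skip spaces, tabs, and carriage returns, stopping at anything else,
/// including '\n', which the scanner must emit as a NewLine token.
fn skip_whitespace(chars: &mut Peekable<Chars>) {
    while let Some(&c) = chars.peek() {
        match c {
            ' ' | '\r' | '\t' => { chars.next(); }
            _ => break,
        }
    }
}

fn main() {
    let mut it = "  \t\r\nlda".chars().peekable();
    skip_whitespace(&mut it);
    assert_eq!(it.next(), Some('\n')); // the newline survives
}
```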
## Error Handling

The scanner can produce these errors:
```rust
pub enum ScannerError {
    UnknownCharacter { line: usize, column: usize, character: char },
    UnknownDirective { line: usize, column: usize, directive: String },
    NumberExpected { line: usize, column: usize, symbol: char },
    UnterminatedString { line: usize, column: usize, quote: char },
}
```
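For user-facing diagnostics, these variants can be rendered with a `Display` implementation. A minimal sketch covering two of the variants (the message wording here is an assumption, not the book's):

```rust
use std::fmt;

#[derive(Debug)]
pub enum ScannerError {
    UnknownCharacter { line: usize, column: usize, character: char },
    UnterminatedString { line: usize, column: usize, quote: char },
}

impl fmt::Display for ScannerError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ScannerError::UnknownCharacter { line, column, character } => {
                write!(f, "{line}:{column}: unknown character '{character}'")
            }
            ScannerError::UnterminatedString { line, column, quote } => {
                write!(f, "{line}:{column}: unterminated string (missing closing {quote})")
            }
        }
    }
}

fn main() {
    let err = ScannerError::UnterminatedString { line: 3, column: 10, quote: '"' };
    assert_eq!(err.to_string(), "3:10: unterminated string (missing closing \")");
}
```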
## Running the Scanner

Let's trace through scanning this example:

```
.org 0x8000
start:
    lda #0x00
```
| Input | Token Kind | Value |
|---|---|---|
| `.org` | Directive | ORG |
| `0x8000` | Number | 32768 |
| `\n` | NewLine | - |
| `start` | Identifier | - |
| `:` | Colon | - |
| `\n` | NewLine | - |
| `lda` | Instruction | LDA |
| `#` | Hash | - |
| `0x00` | Number | 0 |
| `\n` | NewLine | - |
| (end) | EOF | - |
## Summary
In this chapter, we built a scanner that:
- Recognizes all ByteASM token types
- Handles three number formats (hex, binary, decimal)
- Distinguishes between identifiers, instructions, and registers
- Parses directives and local labels
- Handles strings with escape sequences
- Tracks location information for error reporting
In the next chapter, we’ll design the Abstract Syntax Tree that will represent our parsed program.
Previous: Chapter 1 - Introduction | Next: Chapter 3 - Designing the AST