
Lexer: Breaking Code Into Tokens | Compiler Series Part 2

By alex
January 16, 2026
Tags: Assembly, Compiler, IC10 assembly, Stationeers
This is part 2 of my compiler series. If you missed it, part 1 covers why I built a compiler for a video game.

Every compiler starts the same way: staring at a wall of text and trying to make sense of it.

When you write let temp = 500, you see three distinct things — a keyword, a variable name, and a number. But the compiler just sees 14 characters. No structure. No meaning. Just bytes.

The lexer's job is to be the first to impose order on this chaos. It scans through characters and groups them into tokens — the atomic units that the rest of the compiler can actually work with.

What the compiler sees

Here's a simple line of ReScript:

let temp = sensor + 100

To us, the structure is obvious. To the lexer, it's this:

l e t   t e m p   =   s e n s o r   +   1 0 0

The lexer transforms this into:

[Let] [Identifier("temp")] [Assign] [Identifier("sensor")] [Plus] [IntLiteral(100)]

Now we have structure. Each token knows what it is. The parser can work with this.

The state machine approach

A lexer is fundamentally a state machine. You're always in some state, you read a character, and that determines your next state.

Here's the mental model:

START → see 'l' → READING_IDENTIFIER → see 'e' → READING_IDENTIFIER → see 't' → READING_IDENTIFIER → see ' ' → EMIT "let" token → START

My implementation tracks three things:

type lexer = {
  source: string,    // the entire source code
  position: int,     // where we are
  length: int,       // total length
}
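Everything else in this post builds on two tiny helpers over that record. Here's a minimal sketch of isAtEnd and peekChar (which the snippets below lean on), assuming single-byte ASCII characters; real code would need to handle multi-byte input:

```rescript
// Sketch of the helpers the later snippets rely on.
// Assumes ASCII input, one byte per character.
let isAtEnd = (lexer: lexer): bool => lexer.position >= lexer.length

let peekChar = (lexer: lexer): option<string> =>
  if isAtEnd(lexer) {
    None
  } else {
    Some(Js.String2.charAt(lexer.source, lexer.position))
  }
```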

Every operation returns a new lexer state plus whatever we found:

let nextToken = (lexer: lexer): (lexer, token)

Read that signature carefully. The lexer goes in, and a new lexer comes out alongside the token. The original lexer is unchanged.

This is the key insight that shaped my entire implementation.

Functional vs imperative: why it matters

Most lexer tutorials you'll find online look like this:

class Lexer {
  constructor(source) {
    this.source = source;
    this.position = 0;
  }
  
  advance() {
    this.position++;  // mutation!
  }
  
  nextToken() {
    // read characters, mutate position
    // return token
  }
}

This works. But there's a problem: the lexer's state is constantly changing. If something goes wrong, you can't easily replay what happened. If you want to "peek ahead" without committing, you need to save and restore state manually.

The functional approach is different:

let advance = (lexer: lexer): lexer => {
  {...lexer, position: lexer.position + 1}
}

let nextToken = (lexer: lexer): (lexer, token) => {
  // returns NEW lexer, original unchanged
}

No mutation. Every function takes a state and returns a new state. The old state still exists.

This gives you superpowers:

Free backtracking. Need to peek ahead three characters and then decide? Just don't use the new lexer state if you don't like what you see.

Easy debugging. Every state transition is explicit. You can log every intermediate state without affecting behavior.

Referential transparency. Same input always produces same output. No hidden state to worry about.
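To make the first point concrete: peeking ahead is just advancing a throwaway copy. A sketch, assuming the advance and peekChar helpers from this post:

```rescript
// Look at the character three positions ahead without committing.
// The original lexer is untouched; there is nothing to restore.
let peekAhead3 = (lexer: lexer): option<string> =>
  lexer->advance->advance->advance->peekChar
```

If the character three ahead isn't what you hoped for, nothing needs undoing: the original lexer was never modified.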

Here's how I handle the tricky = character, which could be =, ==, or =>:

| Some("=") =>
  let lexer = advance(lexer)  // consume first =
  switch peekChar(lexer) {
  | Some("=") => (advance(lexer), EqualEqual)  // ==
  | Some(">") => (advance(lexer), Arrow)       // =>
  | _ => (lexer, Assign)                       // just =
  }

I advance once, peek at the next character, and decide. If it's not what I expected, I still have the intermediate state — no need to "unread" anything.

Why ML languages dominate compiler writing

If you look at production compilers, you'll notice a pattern:

  • OCaml: Rust (originally), Flow, Hack, parts of Facebook's infrastructure, and the ReScript compiler itself
  • Haskell: GHC itself, PureScript, Elm
  • F#: the F# compiler, which is self-hosted

This isn't coincidence. ML-family languages have features that make compiler writing almost enjoyable:

Algebraic data types model ASTs perfectly:

type token =
  | Let
  | If
  | Identifier(string)
  | IntLiteral(int)
  | Plus
  | Invalid(string)

Try representing this in Java or Python. You'll end up with class hierarchies, visitor patterns, and boilerplate. In ReScript, it's just... data.

Pattern matching makes token dispatch natural:

switch token {
| Let => handleLet()
| If => handleIf()
| Identifier(name) => handleIdent(name)
| IntLiteral(n) => handleNumber(n)
}

The compiler checks exhaustiveness. Forget a case? Compile error. Add a new token type? The compiler tells you every place that needs updating.

Immutability by default prevents entire categories of bugs. I never accidentally modified lexer state when I meant to peek. The type system simply doesn't allow it.

Expression-oriented syntax means everything returns a value. No statements, no void functions, no forgetting to return. This matches how compilers think — transforming one representation into another.
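A tiny illustration of that, assuming the token type defined later in this post (describe is a hypothetical helper, not from my compiler):

```rescript
// Both `switch` and `if` are expressions: they evaluate to a value,
// so there's no `return` to forget.
let describe = (tok: token): string =>
  switch tok {
  | IntLiteral(n) => if n >= 0 { "number" } else { "negative number" }
  | Identifier(name) => "identifier " ++ name
  | _ => "operator or keyword"
  }
```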

When I chose ReScript for this project, I was choosing to implement a subset of the language using the language. There's something poetic about that. But more practically, I was choosing tools that fit the problem.

The edge cases that teach you

Simple cases are simple. + is always Plus. { is always LeftBrace. One character, one token, done.
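In the main switch, those arms are one-liners. A sketch in the same shape as the = case shown earlier, assuming advance and the token constructors listed at the end of the post:

```rescript
// One character, one token: consume it and emit.
| Some("+") => (advance(lexer), Plus)
| Some("-") => (advance(lexer), Minus)
| Some("{") => (advance(lexer), LeftBrace)
| Some("}") => (advance(lexer), RightBrace)
| Some(",") => (advance(lexer), Comma)
```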

The interesting stuff happens when one character could mean multiple things.

The colon problem. In my language, : alone means nothing — it's only valid as part of := (ref assignment). So when I see :, I must see = next:

| Some(":") =>
  let lexer = advance(lexer)
  switch peekChar(lexer) {
  | Some("=") => (advance(lexer), ColonEqual)
  | _ => (lexer, Invalid("Unexpected character ':'"))
  }

This is a design decision baked into the lexer. A different language might use : for type annotations, requiring different logic.

String literals need escape sequence handling:

let readStringLiteral = (lexer: lexer): (lexer, string) => {
  let rec loop = (l: lexer, acc: string): (lexer, string) => {
    switch peekChar(l) {
    | Some("\"") => (advance(l), acc)     // end quote
    | Some("\\") =>
      let l = advance(l)
      switch peekChar(l) {
      | Some("n") => loop(advance(l), acc ++ "\n")
      | Some("t") => loop(advance(l), acc ++ "\t")
      | Some("\\") => loop(advance(l), acc ++ "\\")
      | Some("\"") => loop(advance(l), acc ++ "\"")
      | Some(c) => loop(advance(l), acc ++ c)
      | None => (l, acc)
      }
    | Some(c) => loop(advance(l), acc ++ c)
    | None => (l, acc)  // unterminated string
    }
  }
  loop(lexer, "")  // assumes the caller already consumed the opening quote
}

See that recursive loop function? In an imperative lexer, this would be a while loop with mutable accumulator. Here, recursion is the loop, and the accumulator is just a function parameter.

Keywords vs identifiers is a classic problem. When I see letter, is it the keyword let plus ter, or the identifier letter?

The answer: read the whole thing first, then check.

| Some(c) if isLetter(c) =>
  let (lexer, ident) = readIdentifier(lexer)
  switch ident {
  | "let" => (lexer, Let)
  | "if" => (lexer, If)
  | "else" => (lexer, Else)
  | "while" => (lexer, While)
  | "type" => (lexer, Type)
  | "switch" => (lexer, Switch)
  | "ref" => (lexer, Ref)
  | "true" => (lexer, True)
  | _ => (lexer, Identifier(ident))
  }

This is called "maximal munch" — always read the longest possible token. letter is one identifier, not let + ter.
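readIdentifier itself is another recursive loop in the same style as readStringLiteral. A sketch, assuming peekChar and advance from earlier plus hypothetical isLetter/isDigit character predicates:

```rescript
// Maximal munch in action: keep consuming while the next character
// can extend the identifier, then hand back the whole lexeme.
let readIdentifier = (lexer: lexer): (lexer, string) => {
  let rec loop = (l: lexer, acc: string): (lexer, string) =>
    switch peekChar(l) {
    | Some(c) if isLetter(c) || isDigit(c) => loop(advance(l), acc ++ c)
    | _ => (l, acc)
    }
  loop(lexer, "")
}
```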

What I learned

Lexers are deceptively simple. The core logic is maybe 100 lines. But edge cases multiply that by 3x. Escape sequences, multi-character operators, error recovery — these are where the real work lives.

Functional programming fits compilers. I didn't choose ReScript randomly. Algebraic types, pattern matching, and immutability aren't just nice-to-haves — they're the right tools for the job. Now I understand why so many compilers are written in ML-family languages.

The lexer shapes everything downstream. Decisions made here ripple through the entire compiler. What counts as a token? How are errors reported? What's the interface to the parser? Get these wrong, and you'll be fighting your own code later.

Testing is straightforward. Input string → token array. No side effects, no mocking, no setup. Pure functions are easy to test.
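A typical test is one input string and one expected array. A sketch, assuming a hypothetical tokenize helper that drives nextToken until EOF:

```rescript
// Pure function in, tokens out: no mocks, no fixtures, no teardown.
let tokens = tokenize("let temp = sensor + 100")

assert(
  tokens == [
    Let,
    Identifier("temp"),
    Assign,
    Identifier("sensor"),
    Plus,
    IntLiteral(100),
    EOF,
  ]
)
```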

The token types

For reference, here's my complete token set — 31 types in all:

type token =
  | Let | If | Else | While        // control flow
  | Type | Switch | Ref | True     // declarations & literals
  | Identifier(string)             // variable names
  | IntLiteral(int)                // numbers
  | StringLiteral(string)          // "strings"
  | Plus | Minus | Multiply | Divide | Percent  // arithmetic
  | GreaterThan | LessThan | EqualEqual         // comparison
  | Assign | ColonEqual | Arrow | Pipe | Dot    // special operators
  | Comma | LeftParen | RightParen              // delimiters
  | LeftBrace | RightBrace
  | EOF | Invalid(string)          // control tokens

Each one is a building block for the parser. Speaking of which...

Next up

The lexer gives us tokens. But tokens aren't structure — they're just a flat list. if x < 10 { ... } and if { x 10 < } ... produce the same tokens in a different order, and the lexer happily accepts both.

The parser's job is to understand the grammar — which token sequences are valid, and what they mean.

Next article: how I built a recursive descent parser without a CS degree, and why "recursive descent" sounds scarier than it is.


The compiler is open source. Check out the lexer code on GitHub if you want to see the full implementation.