3. Pyxc: Better Errors

Where We Are

The parser from Chapter 2 works, but its error messages are rough.

ready> def bad(x) return x
Error: Expected ':' in function definition (token: -7)

By the end of this chapter the same mistake gives you:

Error (Line 1, Column 12): Expected ':' in function definition
def bad(x) return 
           ^~~~

Line number. Column number. The source line. A caret pointing at the problem. That's a real error message. Compare (token: -7): a raw enum value that means nothing to anyone who didn't write the lexer.

And if you mistype a number —

ready> 1.2.3
Parsed a top-level expression.

That's a bug: 1.2.3 isn't a valid number, but the lexer silently accepted 1.2 and left .3 sitting in the stream. By the end of this chapter the same mistake gives you:

Error (Line 2, Column 1): invalid number literal '1.2.3'
1.2.3
^~~~

Source Code

git clone --depth 1 https://github.com/alankarmisra/pyxc-llvm-tutorial
cd pyxc-llvm-tutorial/code/chapter-03

The Four Problems

Everything in this chapter addresses a concrete deficiency left over from the first two chapters:

  1. Keyword recognition is a chain of if comparisons. Adding a new keyword means editing the comparison chain. A table is cleaner.
  2. strtod can silently accept junk. strtod("1.2.3", nullptr) returns 1.2 and ignores .3. We need to detect the leftover.
  3. Error messages show raw token numbers. We need a human-readable name for every token.
  4. Error messages have no location. We need to track line and column as we read characters, and attach those coordinates to each error.

Problem 1: A Table for Keywords

Chapter 1's keyword check looks like this:

if (IdentifierStr == "def")    return tok_def;
if (IdentifierStr == "extern") return tok_extern;
if (IdentifierStr == "return") return tok_return;
return tok_identifier;

This works, but every new keyword needs a new if. A map is more honest about what's happening — it is a lookup table — and adding a keyword is a one-line change:

static map<string, Token> Keywords = {
    {"def", tok_def}, {"extern", tok_extern}, {"return", tok_return}};

The lookup replaces the chain:

auto It = Keywords.find(IdentifierStr);
return (It == Keywords.end()) ? tok_identifier : It->second;

Not found → it's an identifier. Found → return the mapped token. Same behavior, open for extension.

Problem 2: Catching Malformed Numbers

The standard library function strtod converts a string to a double. It stops at the first character it doesn't recognize and tells you where it stopped via a second argument:

char *End = nullptr;
NumVal = strtod(NumStr.c_str(), &End);

After the call, End points to the first character strtod didn't consume. If End points to the null terminator (*End == '\0'), the entire string was valid. If it points anywhere else, there's unconsumed text — which means the input was malformed.

if (!End || *End != '\0') {
    fprintf(stderr,
            "Error (Line %d, Column %d): invalid number literal '%s'\n",
            CurLoc.Line, CurLoc.Col, NumStr.c_str());
    PrintErrorSourceContext(CurLoc);
    return tok_error;
}

1.2.3 produces NumStr = "1.2.3". strtod stops at the second ., leaving End pointing at .3. Since *End != '\0', we emit an error and return tok_error — a new token value that signals "the lexer already diagnosed this, skip it."

We also save the literal string before calling strtod:

NumLiteralStr = NumStr;

NumLiteralStr is used by FormatTokenForMessage later when a parse error involves a number token. The lexer sets it; nobody else needs to care about it.

Problem 3: A Name for Every Token

Chapter 2's error messages printed the raw integer value of CurTok — helpful only if you have the Token enum open in another window. We want something like 'def', identifier, or newline instead.

We build a map from token value to string once, at startup, using an immediately-invoked lambda:

static map<int, string> TokenNames = [] {
  map<int, string> Names = {
      {tok_eof,        "end of input"},
      {tok_eol,        "newline"},
      {tok_error,      "error"},
      {tok_def,        "'def'"},
      {tok_extern,     "'extern'"},
      {tok_identifier, "identifier"},
      {tok_number,     "number"},
      {tok_return,     "'return'"},
  };

  // Single character tokens.
  for (int ch = 0; ch <= 255; ++ch) {
    if (isprint(static_cast<unsigned char>(ch))) // cast: isprint's argument must be representable as unsigned char
      Names[ch] = "'" + string(1, static_cast<char>(ch)) + "'"; // string(n, char): needs char, not int
    else if (ch == '\n')
      Names[ch] = "'\\n'";
    else if (ch == '\t')
      Names[ch] = "'\\t'";
    else if (ch == '\r')
      Names[ch] = "'\\r'";
    else if (ch == '\0')
      Names[ch] = "'\\0'";
    else {
      ostringstream OS;
      OS << "0x" << uppercase << hex << setw(2) << setfill('0') << ch;
      Names[ch] = OS.str();
    }
  }

  return Names;
}();

The named token values (negative integers) are in the initializer list. Every printable ASCII character gets a quoted name like '+'. Unprintable characters get either an escape sequence or a hex code. The lambda runs once and the result is stored. No runtime cost after startup.

FormatTokenForMessage uses this map, with special cases for the tokens that carry extra information:

static string FormatTokenForMessage(int Tok) {
  if (Tok == tok_identifier)
    return "identifier '" + IdentifierStr + "'";
  if (Tok == tok_number)
    return "number '" + NumLiteralStr + "'";

  auto It = TokenNames.find(Tok);
  if (It != TokenNames.end())
    return It->second;
  return "unknown token";
}

When the bad token is an identifier or a number, we include the actual text (identifier 'foo', number '3.14'). Everything else uses the static name from the map.

Problem 4: Tracking Where We Are

To report (Line 3, Column 8), we need to know the line and column as we read characters. We introduce two small pieces of data.

Tracking Position Through advance()

In Chapter 1, advance() already wrapped getchar() to normalize line endings. Here we expand it to also keep a running position:

struct SourceLocation {
  int Line;
  int Col;
};
static SourceLocation CurLoc;
static SourceLocation LexLoc = {1, 0};

Two location globals: LexLoc is where the lexer's character-read head currently sits. CurLoc is snapshotted at the start of each token — the position the parser sees.

advance() updates LexLoc every time a character is consumed:

static int advance() {
  int LastChar = getchar();
  if (LastChar == '\r') {
    int NextChar = getchar();
    if (NextChar != '\n' && NextChar != EOF)
      ungetc(NextChar, stdin);
    LexLoc.Line++;
    LexLoc.Col = 0;
    return '\n';
  }

  if (LastChar == '\n') {
    LexLoc.Line++;
    LexLoc.Col = 0;
  } else {
    LexLoc.Col++;
  }

  return LastChar;
}

LexLoc is updated on every character: a newline increments the line counter and resets the column to zero; any other character increments the column.

gettok() snapshots LexLoc into CurLoc once, after the whitespace-skip loop:

while (isspace(LastChar) && LastChar != '\n')
  LastChar = advance();

CurLoc = LexLoc;

This is the position the diagnostics infrastructure uses. Snapshotting here — after skipping whitespace, before consuming the token's characters — means CurLoc always points to the first character of the current token.

There is one edge case: when gettok returns tok_eol from the comment path (# branch), the snapshot at the top of the function pointed at #, not at the newline. We re-snapshot just before returning to get the correct post-newline position:

if (LastChar == '#') {
  do
    LastChar = advance();
  while (LastChar != EOF && LastChar != '\n');

  if (LastChar != EOF) {
    CurLoc = LexLoc;  // re-snapshot after consuming the whole comment + '\n'
    LastChar = ' ';
    return tok_eol;
  }
}

Buffering Source Lines for Caret Output

Knowing the position isn't enough on its own. To print:

def bad(x) return 
           ^~~~

we need the actual text of the line. We solve this by buffering lines as we read. SourceManager accumulates characters through onChar(), which gets called by advance() on every character consumed:

class SourceManager {
  vector<string> CompletedLines;
  string CurrentLine;

public:
  void reset() {
    CompletedLines.clear();
    CurrentLine.clear();
  }

  void onChar(int C) {
    if (C == '\n') {
      CompletedLines.push_back(CurrentLine);
      CurrentLine.clear();
      return;
    }
    if (C != EOF)
      CurrentLine.push_back(static_cast<char>(C));
  }

  const string *getLine(int OneBasedLine) const {
    if (OneBasedLine <= 0)
      return nullptr;
    size_t Index = static_cast<size_t>(OneBasedLine - 1);
    if (Index < CompletedLines.size())
      return &CompletedLines[Index];
    if (Index == CompletedLines.size())
      return &CurrentLine;
    return nullptr;
  }
};

static SourceManager PyxcSourceMgr;

Characters accumulate in CurrentLine as they are read. When a newline arrives, CurrentLine is moved into CompletedLines and the CurrentLine buffer is reset. getLine(N) takes a 1-based line number and returns from CompletedLines for finished lines, or from CurrentLine for the line still being read.

We integrate SourceManager into advance():

  if (LastChar == '\r') {
    ...
    PyxcSourceMgr.onChar('\n'); // add this
    LexLoc.Line++;
    ...
  }

  if (LastChar == '\n') {
    PyxcSourceMgr.onChar('\n'); // add this
    LexLoc.Line++;
    LexLoc.Col = 0;
  } else {
    PyxcSourceMgr.onChar(LastChar); // add this
    LexLoc.Col++;
  }

Every character consumed by the lexer passes through onChar before being returned. SourceManager sees the whole character stream and builds its line buffer passively — no other part of the lexer needs to know about it.

Printing the Caret

With a stored line and a column number, printing the context is straightforward:

static void PrintErrorSourceContext(SourceLocation Loc) {
  const string *LineText = PyxcSourceMgr.getLine(Loc.Line);
  if (!LineText)
    return;

  fprintf(stderr, "%s\n", LineText->c_str());
  int spaces = Loc.Col - 1;
  if (spaces < 0)
    spaces = 0;
  for (int i = 0; i < spaces; ++i)
    fputc(' ', stderr);
  fprintf(stderr, "^~~~\n");
}

Print the line, then print (Col - 1) spaces, then ^~~~. The -1 converts from 1-based column to a 0-based offset into the string.

Pointing at the Right Place for tok_eol

When the parser fails on a newline token — for example, when the user types def foo(x) and hits Enter without a : — the error is logically at the end of the previous line, not at the start of the next one.

Because CurLoc for tok_eol is snapshotted after advance() has consumed the \n and incremented LexLoc.Line, CurLoc.Line is already the next line number. GetDiagnosticAnchorLoc steps back by one (Loc.Line - 1) to arrive at the line that just ended, then reports a column one past its last character so the caret appears just after the final token:

static SourceLocation GetDiagnosticAnchorLoc(SourceLocation Loc, int Tok) {
  if (Tok != tok_eol)
    return Loc;

  int PrevLine = Loc.Line - 1;
  if (PrevLine <= 0)
    return Loc;

  const string *PrevLineText = PyxcSourceMgr.getLine(PrevLine);
  if (!PrevLineText)
    return Loc;

  return {PrevLine, static_cast<int>(PrevLineText->size()) + 1};
}

For any other token, CurLoc is returned as-is.

For def foo(x) followed by Enter, this produces:

Error (Line 1, Column 11): Expected ':' in function definition
def foo(x)
          ^~~~

The caret lands just past the ) — exactly where the : was missing.

Putting It Together: LogError

LogError overloads now use the location infrastructure:

unique_ptr<ExprAST> LogError(const char *Str) {
  SourceLocation Anchor = GetDiagnosticAnchorLoc(CurLoc, CurTok);
  fprintf(stderr, "Error (Line %d, Column %d): %s\n",
          Anchor.Line, Anchor.Col, Str);
  PrintErrorSourceContext(Anchor);
  return nullptr;
}

Since LogErrorP and LogErrorF delegate to LogError, they get this for free.

Every parser error now shows:

  • The location of the bad token (or end of line, for tok_eol)
  • The source line
  • A ^~~~ caret

Error Recovery: tok_error and SynchronizeToLineBoundary

The lexer now returns tok_error for malformed input (like 1.2.3). The rest of the lexer has no idea how to handle that token — it's not a number, not an operator, not a keyword. If we let it fall through to ParsePrimary, it hits the default: branch and emits a second, confusing error: "unknown token when expecting an expression" — on top of the error the lexer already printed.

The fix is to intercept tok_error early and skip to the next line before trying to parse anything:

static void SynchronizeToLineBoundary() {
  while (CurTok != tok_eol && CurTok != tok_eof)
    getNextToken();
}

This is panic-mode error recovery: when something goes wrong and we can't reason about the current state, advance unconditionally to the next line boundary and restart parsing there. It's a blunt instrument — we discard the rest of the line — but it's reliable: after SynchronizeToLineBoundary(), CurTok is always tok_eol or tok_eof, and the REPL's main loop knows exactly how to handle those.

MainLoop calls it for tok_error:

if (CurTok == tok_error) {
  SynchronizeToLineBoundary();
  continue;
}

The Handle functions also call it on parse failure and on unexpected trailing tokens:

static void HandleDefinition() {
  if (ParseDefinition()) {
    if (CurTok != tok_eol && CurTok != tok_eof) {
      LogError(("Unexpected " + FormatTokenForMessage(CurTok)).c_str());
      SynchronizeToLineBoundary();
      return;
    }
    fprintf(stderr, "Parsed a function definition.\n");
  } else {
    SynchronizeToLineBoundary();
  }
}

The same pattern applies to HandleExtern and HandleTopLevelExpression. After any failure — whether the parser returned nullptr or left unexpected tokens in CurTok — we synchronize to the line boundary and let the main loop print a fresh prompt.

Build and Run

cd code/chapter-03
cmake -S . -B build && cmake --build build
./build/pyxc

Tests

llvm-lit code/chapter-03/test/

The test suite covers the error cases introduced in this chapter — malformed numbers, missing colons, bad separators — as well as location accuracy across sequential lines, comments, and recovery after an error. Peek into code/chapter-03/test/ for examples.

Try It

ready> def add(x, y):
   return x + y
Parsed a function definition.
ready> 1.2.3
Error (Line 3, Column 1): invalid number literal '1.2.3'
1.2.3
^~~~
ready> def bad(x) return x
Error (Line 4, Column 12): Expected ':' in function definition
def bad(x) return 
           ^~~~
ready> def missing_colon(x)
Error (Line 5, Column 21): Expected ':' in function definition
def missing_colon(x)
                    ^~~~
ready>^D

A few things to notice:

  • 1.2.3 is caught in the lexer now. The error fires before the parser ever sees the token.
  • def bad(x) return x — the caret points at return (column 12), the position where : was expected instead.
  • def missing_colon(x) — the caret points just past the closing ), where : should have appeared. That's GetDiagnosticAnchorLoc at work: CurLoc for tok_eol is on the next line, so the function steps back by one and points to the end of the line that just ended.

Things Worth Knowing

  • tok_error is handled in MainLoop, not in the parse functions. When the lexer returns tok_error, MainLoop intercepts it and calls SynchronizeToLineBoundary() without forwarding to any Handle* function — by that point the lexer has already printed the error, so the parser has nothing to add.

What's Next

The lexer and parser are solid. Error messages are readable. The next step is to connect this to LLVM: walk the AST and emit LLVM IR — the intermediate representation that real machine code is compiled from — for the first time.

Before that, Chapter 4 covers installing LLVM and setting up the build system. It's mostly infrastructure, but you only do it once.

Need Help?

Build issues? Questions?

Include:

  • Your OS and version
  • Full error message
  • Output of cmake --version and ninja --version

We'll figure it out.