3. pyxc: Better Errors

Where We Are

We have a nice little parser after Chapter 2, but the error messages are rough. As we grow the language and type code in our invented syntax, better error messages will help us narrow down whether something is a syntax problem in our code or a bug in the compiler. We take compiler correctness for granted when we use production-level languages, but in inventing our own we have to be wary that the compiler itself might be doing things wrong. Better error messages go a long way toward tracing the problem, so we tackle this first, before moving on to generating machine code from our source in the following chapters.

We are going to attempt to make this:

ready> def bad(x) return x
Error: Expected ':' in function definition (token: -7)

look like this:

Error (Line 1, Column 12): Expected ':' in function definition
def bad(x) return 
           ^~~~

Line number. Column number. The source line. A caret pointing at the problem. That's a real error message.

And then there's this bug which we revealed in the previous chapter:

ready> 1.2.3
Parsed a top-level expression.

Internally this is accepted as 1.2 and the .3 is silently ignored. We will fix this too, so we get the following:

Error (Line 2, Column 1): invalid number literal '1.2.3'
1.2.3
^~~~

That's it. Just two fixes and we'll be well on our way to making our code execute in the next few chapters. We're getting there. Don't give up on me now.

Source Code

git clone --depth 1 https://github.com/alankarmisra/pyxc-llvm-tutorial
cd pyxc-llvm-tutorial/code/chapter-03

A Name for Every Token

To print sensible error strings, we want to convert Token values to strings like def, identifier, or newline. A map serves our purpose well. But we also want to generate some of the strings in a loop, so we wrap the whole thing in a lambda and execute it immediately.

static map<int, string> TokenNames = [] {
  map<int, string> Names = {
      {tok_eof,        "end of input"},
      {tok_eol,        "newline"},
      {tok_error,      "error"},
      {tok_def,        "'def'"},
      {tok_extern,     "'extern'"},
      {tok_identifier, "identifier"},
      {tok_number,     "number"},
      {tok_return,     "'return'"},
  };

  // Single character tokens.
  for (int ch = 0; ch <= 255; ++ch) {
    if (isprint(static_cast<unsigned char>(ch))) // isprint takes an unsigned-char value
      Names[ch] = "'" + string(1, static_cast<char>(ch)) + "'"; // string(count, char) wants a char, not an int
    else if (ch == '\n')
      Names[ch] = "'\\n'";
    else if (ch == '\t')
      Names[ch] = "'\\t'";
    else if (ch == '\r')
      Names[ch] = "'\\r'";
    else if (ch == '\0')
      Names[ch] = "'\\0'";
    else {
      ostringstream OS;
      OS << "0x" << uppercase << hex << setw(2) << setfill('0') << ch;
      Names[ch] = OS.str();
    }
  }

  return Names;
}();

The named token values (negative integers) are in the initializer list. Every printable ASCII character gets a quoted name like '+'. Unprintable characters get either an escape sequence or a hex code. The lambda runs once and the result is stored.

FormatTokenForMessage uses this map, with special cases for the tokens that carry extra information:

static string FormatTokenForMessage(int Tok) {
  if (Tok == tok_identifier)
    return "identifier '" + IdentifierStr + "'";
  if (Tok == tok_number)
    return "number '" + NumLiteralStr + "'";

  auto It = TokenNames.find(Tok);
  if (It != TokenNames.end())
    return It->second;
  return "unknown token";
}

When the bad token is an identifier or a number, we include the actual text (identifier 'foo', number '3.14'). Everything else uses the static name from the map.

Tracking Where We Are

To report (Line 3, Column 8), we need to know the line and column as we read characters. We introduce two small pieces of data.

struct SourceLocation {
  int Line;
  int Col;
};
static SourceLocation CurLoc;
static SourceLocation LexLoc = {1, 0};

Two location globals: LexLoc is where the lexer's character-read head currently sits. CurLoc is snapshotted at the start of each token — the position the parser sees.

In Chapter 1, advance() already wrapped getchar() to normalize line endings. Here we expand it to also keep a running position:

static int advance() {
  int LastChar = getchar();
  if (LastChar == '\r') {
    int NextChar = getchar();
    if (NextChar != '\n' && NextChar != EOF)
      ungetc(NextChar, stdin);
    LexLoc.Line++;
    LexLoc.Col = 0;
    return '\n';
  }

  if (LastChar == '\n') {
    LexLoc.Line++;
    LexLoc.Col = 0;
  } else {
    LexLoc.Col++;
  }

  return LastChar;
}

A newline increments the LexLoc line counter and resets the column to zero; any other character increments the column.

gettok() snapshots LexLoc into CurLoc once, after the whitespace-skip loop:

while (isspace(LastChar) && LastChar != '\n')
  LastChar = advance();

CurLoc = LexLoc;

This is the position that will be printed, should an error occur. Snapshotting here — after skipping whitespace, before consuming the token's characters — means CurLoc always points to the first character of the current token.

There is one edge case: when gettok returns tok_eol from the comment path (# branch), the snapshot at the top of the function pointed at #, not at the newline. In this case, we re-snapshot just before returning from the function. That way, we get the correct post-newline position:

if (LastChar == '#') {
  do
    LastChar = advance();
  while (LastChar != EOF && LastChar != '\n');

  if (LastChar != EOF) {
    CurLoc = LexLoc;  // re-snapshot after consuming the whole comment + '\n'
    LastChar = ' ';
    return tok_eol;
  }
}

Buffering Source Lines for Caret Output

Knowing the position isn't enough on its own. To print:

def bad(x) return 
           ^~~~

we need the actual text of the line. We solve this by buffering lines as we read. SourceManager accumulates characters through onChar(), which gets called by advance() on every character consumed:

class SourceManager {
  vector<string> CompletedLines;
  string CurrentLine;

public:
  void reset() {
    CompletedLines.clear();
    CurrentLine.clear();
  }

  void onChar(int C) {
    if (C == '\n') {
      CompletedLines.push_back(CurrentLine);
      CurrentLine.clear();
      return;
    }
    if (C != EOF)
      CurrentLine.push_back(static_cast<char>(C));
  }

  const string *getLine(int OneBasedLine) const {
    if (OneBasedLine <= 0)
      return nullptr;
    size_t Index = static_cast<size_t>(OneBasedLine - 1);
    if (Index < CompletedLines.size())
      return &CompletedLines[Index];
    if (Index == CompletedLines.size())
      return &CurrentLine;
    return nullptr;
  }
};

static SourceManager PyxcSourceMgr;

Characters accumulate in CurrentLine as they are read. When a newline arrives, CurrentLine is moved into CompletedLines and the CurrentLine buffer is reset. getLine(N) takes a 1-based line number and returns from CompletedLines for finished lines, or from CurrentLine for the line still being read.

We integrate SourceManager into advance():

  if (LastChar == '\r') {
    ...
    PyxcSourceMgr.onChar('\n'); // add this
    LexLoc.Line++;
    ...
  }

  if (LastChar == '\n') {
    PyxcSourceMgr.onChar('\n'); // add this
    LexLoc.Line++;
    LexLoc.Col = 0;
  } else {
    PyxcSourceMgr.onChar(LastChar); // add this
    LexLoc.Col++;
  }

Every character consumed by the lexer passes through onChar before being returned. SourceManager sees the whole character stream and builds its line buffer passively — no other part of the lexer needs to know about it.

Printing the Caret

With a stored line and a column number, printing the context is straightforward:

static void PrintErrorSourceContext(SourceLocation Loc) {
  const string *LineText = PyxcSourceMgr.getLine(Loc.Line);
  if (!LineText)
    return;

  fprintf(stderr, "%s\n", LineText->c_str());
  int spaces = Loc.Col - 1;
  if (spaces < 0)
    spaces = 0;
  for (int i = 0; i < spaces; ++i)
    fputc(' ', stderr);
  fprintf(stderr, "^~~~\n");
}

Print the line, then print (Col - 1) spaces, then ^~~~. The -1 converts from 1-based column to a 0-based offset into the string.

Pointing at the Right Place for tok_eol

When the parser fails on a newline token — for example, when the user types def foo(x) and hits Enter without a : — the error is logically at the end of the previous line, not at the start of the next one.

Because CurLoc for tok_eol is snapshotted after advance() has consumed the \n and incremented LexLoc.Line, CurLoc.Line is already the next line number. GetDiagnosticAnchorLoc steps back by one (Loc.Line - 1) to arrive at the line that just ended, then reports a column one past its last character so the caret appears just after the final token:

static SourceLocation GetDiagnosticAnchorLoc(SourceLocation Loc, int Tok) {
  if (Tok != tok_eol)
    return Loc;

  int PrevLine = Loc.Line - 1;
  if (PrevLine <= 0)
    return Loc;

  const string *PrevLineText = PyxcSourceMgr.getLine(PrevLine);
  if (!PrevLineText)
    return Loc;

  return {PrevLine, static_cast<int>(PrevLineText->size()) + 1};
}

For any other token, CurLoc is returned as-is.

For def foo(x) followed by Enter, this produces:

Error (Line 1, Column 11): Expected ':' in function definition
def foo(x)
          ^~~~

The caret lands just past the ) — exactly where the : was missing.

Putting It Together: LogError

LogError overloads now use the location infrastructure:

unique_ptr<ExprAST> LogError(const char *Str) {
  SourceLocation Anchor = GetDiagnosticAnchorLoc(CurLoc, CurTok);
  fprintf(stderr, "Error (Line %d, Column %d): %s\n",
          Anchor.Line, Anchor.Col, Str);
  PrintErrorSourceContext(Anchor);
  return nullptr;
}

Since LogErrorP and LogErrorF delegate to LogError, they get this for free.

Every parser error now shows:

  • The location of the bad token (or end of line, for tok_eol)
  • The source line
  • A ^~~~ caret

Error Recovery: tok_error and SynchronizeToLineBoundary

The lexer now returns tok_error for funky input (like 1.2.3). The parser has no idea how to handle that token: it's not a number, not an operator, not a keyword. If we let it fall through to ParsePrimary, it hits the default: branch and emits a second, confusing error ("unknown token when expecting an expression") on top of the error the lexer already printed.

The fix is to intercept tok_error early and skip to the next line before trying to parse anything:

static void SynchronizeToLineBoundary() {
  while (CurTok != tok_eol && CurTok != tok_eof)
    getNextToken();
}

This is panic-mode error recovery: when something goes wrong and we can't reason about the current state, advance unconditionally to the next line boundary and restart parsing there. It's a blunt instrument — we discard the rest of the line — but it's reliable: after SynchronizeToLineBoundary(), CurTok is always tok_eol or tok_eof, and the REPL's main loop knows exactly how to handle those.

MainLoop calls it for tok_error:

if (CurTok == tok_error) {
  SynchronizeToLineBoundary();
  continue;
}

The Handle* functions also call it on parse failure and on unexpected trailing tokens:

static void HandleDefinition() {
  if (ParseDefinition()) {
    if (CurTok != tok_eol && CurTok != tok_eof) {
      LogError(("Unexpected " + FormatTokenForMessage(CurTok)).c_str());
      SynchronizeToLineBoundary();
      return;
    }
    fprintf(stderr, "Parsed a function definition.\n");
  } else {
    SynchronizeToLineBoundary();
  }
}

The same pattern applies to HandleExtern and HandleTopLevelExpression. After any failure — whether the parser returned nullptr or left unexpected tokens in CurTok — we synchronize to the line boundary and let the main loop print a fresh prompt.

Catching Malformed Numbers

Let's deal with malformed numbers now. It's a really quick fix. The standard library function strtod converts a string to a double. It stops at the first character it doesn't recognize and tells you where it stopped via a second argument:

char *End = nullptr;
NumVal = strtod(NumStr.c_str(), &End);

After the call, End points to the first character strtod didn't consume. If End points to the null terminator (*End == '\0'), the entire string was valid. If it points anywhere else, there's unconsumed text, which means the input was not a valid number. Here's the code.

if (!End || *End != '\0') {
    fprintf(stderr,
            "Error (Line %d, Column %d): invalid number literal '%s'\n",
            CurLoc.Line, CurLoc.Col, NumStr.c_str());
    PrintErrorSourceContext(CurLoc);
    return tok_error;
}

1.2.3 produces NumStr = "1.2.3". strtod stops at the second ., leaving End pointing at .3. Since *End != '\0', we emit an error and return tok_error — a new token value that signals "the lexer already diagnosed this, skip it."

We also save the literal string before calling strtod:

NumLiteralStr = NumStr;

NumLiteralStr is used by FormatTokenForMessage later when a parse error involves a number token. The lexer sets it; nobody else needs to care about it.

Notice that we are in the lexer here, inside gettok(), which returns an int. So we can't return LogError(...) as we do with parser-level errors, which return nullptr. For now, we just print the error within gettok() and move on. If we find ourselves printing more and more lexer errors inline, we will refactor.

Cleanup: a Table for Keywords

First let's simplify our keyword lookup code.

if (IdentifierStr == "def")    return tok_def;
if (IdentifierStr == "extern") return tok_extern;
if (IdentifierStr == "return") return tok_return;
return tok_identifier;

This works, but every new keyword needs a new if. A map is more honest about what's happening — it is a lookup table — and adding a keyword is a one-line change:

static map<string, Token> Keywords = {
    {"def", tok_def}, {"extern", tok_extern}, {"return", tok_return}};

The lookup replaces the chain:

auto It = Keywords.find(IdentifierStr);
return (It == Keywords.end()) ? tok_identifier : It->second;

If an identifier-like string is not a keyword, it's an identifier. As with most languages, we won't allow keywords like def to be used as variables. If we did, it would make the language a little goofy.

Build and Run

cd code/chapter-03
cmake -S . -B build && cmake --build build
./build/pyxc

Tests

llvm-lit code/chapter-03/test/

The test suite covers the error cases introduced in this chapter — malformed numbers, missing colons, bad separators — as well as location accuracy across sequential lines, comments, and recovery after an error. Peek into code/chapter-03/test/ for examples.

Try It

ready> def add(x, y):
   return x + y
Parsed a function definition.
ready> 1.2.3
Error (Line 3, Column 1): invalid number literal '1.2.3'
1.2.3
^~~~
ready> def bad(x) return x
Error (Line 4, Column 12): Expected ':' in function definition
def bad(x) return 
           ^~~~
ready> def missing_colon(x)
Error (Line 5, Column 21): Expected ':' in function definition
def missing_colon(x)
                    ^~~~
ready>^D

What's Next

The lexer and parser are solid. Error messages are readable. The next step is to connect this to LLVM: walk the AST and emit LLVM IR, the compiler's intermediate representation, for the first time.

Before that, Chapter 4 covers installing LLVM and setting up the build system. It's mostly infrastructure, but you only do it once.

Need Help?

Build issues? Questions?

Include:

  • Your OS and version
  • Full error message
  • Output of cmake --version and ninja --version

We'll figure it out.