17. pyxc: Structs

Where We Are

Chapter 16 gave Pyxc ten scalar types. Every value is still a single number — an int, a float, a bool. If you want to group a pair of coordinates and pass them around as one thing, you're out of luck.

This chapter adds structs. By the end of it, this program compiles and runs:

struct Point:
  x: int
  y: int

def distance_sq(p: Point) -> float64:
  return float64(p.x * p.x + p.y * p.y)

def main() -> int:
  var p: Point
  p.x = 3
  p.y = 4
  printd(distance_sq(p))  # 25.000000
  return 0

Source Code

git clone --depth 1 https://github.com/alankarmisra/pyxc-llvm-tutorial
cd pyxc-llvm-tutorial/code/chapter-17

Grammar

One new declaration and one new expression form:

struct-def    ::= 'struct' identifier ':' NEWLINE INDENT field+ DEDENT

field         ::= identifier ':' type NEWLINE

field-expr    ::= identifier ('.' identifier)+

field-assign  ::= field-expr '=' expression

type          ::= ...
               | identifier   (* struct name — must be declared above the point of use *)

struct is a top-level declaration, like def or extern def. It is not an expression. You cannot declare a struct inside a function.

Field access — p.x, o.inner.value — works both as an expression (read) and on the left side of = (write). Field access must start with a named variable. make_point().x is not supported yet.

A Lurking Lexer Bug

Before anything else: a bug that was already there but only surfaced now. The number lexer entered the float-parsing path whenever it saw a standalone .:

// Before — wrong
if (isdigit(LexerLastChar) || LexerLastChar == '.') {

With that check, p.x lexed as the identifier p, after which the lexer saw ., entered the number-parsing path, found x where it expected a digit, and produced garbage. That was harmless while . had no other meaning. It is fatal now that . separates a variable from its field.

The fix — only enter the float path when the character after . is actually a digit:

// After — correct
if (isdigit(LexerLastChar) ||
    (LexerLastChar == '.' && isdigit(peek()))) {

.5 still works as a float literal. p.x no longer gets eaten.
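To see the lookahead rule in isolation, here is a minimal sketch of a tokenizer that applies it. This is not the tutorial's lexer: startsNumber and tokenize are hypothetical names, and representing tokens as plain strings is an assumption made to keep the example short.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for the lexer's check: does a number start at Src[I]?
static bool startsNumber(const std::string &Src, size_t I) {
  unsigned char C = static_cast<unsigned char>(Src[I]);
  if (std::isdigit(C))
    return true;
  // A '.' begins a number only when a digit follows it.
  return C == '.' && I + 1 < Src.size() &&
         std::isdigit(static_cast<unsigned char>(Src[I + 1]));
}

// Splits Src into identifier, number, and single-character tokens.
// Numbers are not validated here; the point is only where they start.
static std::vector<std::string> tokenize(const std::string &Src) {
  std::vector<std::string> Toks;
  size_t I = 0;
  while (I < Src.size()) {
    unsigned char C = static_cast<unsigned char>(Src[I]);
    if (std::isspace(C)) {
      ++I;
      continue;
    }
    size_t Start = I;
    if (std::isalpha(C) || C == '_') {
      while (I < Src.size() &&
             (std::isalnum(static_cast<unsigned char>(Src[I])) || Src[I] == '_'))
        ++I;
    } else if (startsNumber(Src, I)) {
      while (I < Src.size() &&
             (std::isdigit(static_cast<unsigned char>(Src[I])) || Src[I] == '.'))
        ++I;
    } else {
      ++I; // any other character is a one-character token, including '.'
    }
    assert(I > Start && "every token consumes at least one character");
    Toks.push_back(Src.substr(Start, I - Start));
  }
  return Toks;
}
```

Under these assumptions, tokenize("p.x") yields the three tokens p, ., and x, while tokenize(".5") yields the single token .5.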

The struct Keyword

tok_struct = -34,

Registered in the keyword map alongside the other keywords:

{"struct", tok_struct}

ParseTypeToken now recognises struct names as types:

case tok_identifier: {
  string TyName = IdentifierStr;
  if (!StructTypes.count(TyName)) {
    LogError(("Unknown type '" + TyName + "'").c_str());
    return ValueType::Error;
  }
  getNextToken();
  if (StructName)
    *StructName = TyName;
  return ValueType::Struct;
}

ValueType::Struct is a new entry in the enum. Unlike the scalar types, a struct value is not self-describing — ValueType::Struct alone doesn't tell you which struct. You need the name alongside it to know the field layout. This is why ParseTypeToken now takes an optional string *StructName output parameter, and why every place that stores a ValueType for a struct also stores a StructName string next to it. There is a lot of that in this chapter.

Tracking Struct Definitions at Parse Time

Two structs hold what the parser knows about a declared struct:

struct StructFieldInfo {
  string Name;
  ValueType Type = ValueType::Error;
  string StructName;  // only set if Type == Struct
};

struct StructTypeInfo {
  string Name;
  vector<StructFieldInfo> Fields;
  std::map<string, size_t> FieldIndex;  // field name → index into Fields
};

static std::map<string, StructTypeInfo> StructTypes;

StructTypes is the global registry of all declared structs. It is populated at parse time and consulted from then on: the parser looks a struct up here to validate every field access and every struct type annotation, and codegen later reads the same entries to build LLVM types and find field indices.

FieldIndex maps field name to position in Fields. It exists for two reasons: O(log n) lookup during field access parsing, and duplicate field detection during struct declaration.
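The two roles of FieldIndex can be sketched outside the parser. The snippet below repeats the two structs with a trimmed-down ValueType enum (an assumption; the real enum has more entries) and adds two hypothetical helpers, addField and findField, which are not part of the tutorial's code.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

enum class ValueType { Error, Int, Float64, Struct }; // trimmed stand-in

struct StructFieldInfo {
  std::string Name;
  ValueType Type = ValueType::Error;
  std::string StructName; // only set if Type == Struct
};

struct StructTypeInfo {
  std::string Name;
  std::vector<StructFieldInfo> Fields;
  std::map<std::string, size_t> FieldIndex; // field name -> index into Fields
};

// Hypothetical helper: appends a field, or returns false on a duplicate name.
static bool addField(StructTypeInfo &Info, const std::string &Name,
                     ValueType Type, const std::string &StructName = "") {
  if (Info.FieldIndex.count(Name))
    return false; // duplicate detection
  Info.FieldIndex[Name] = Info.Fields.size();
  StructFieldInfo F;
  F.Name = Name;
  F.Type = Type;
  F.StructName = StructName;
  Info.Fields.push_back(F);
  return true;
}

// Hypothetical helper: O(log n) lookup by name, nullptr if the field is unknown.
static const StructFieldInfo *findField(const StructTypeInfo &Info,
                                        const std::string &Name) {
  auto It = Info.FieldIndex.find(Name);
  if (It == Info.FieldIndex.end())
    return nullptr;
  assert(It->second < Info.Fields.size());
  return &Info.Fields[It->second];
}
```

Adding x and y to an empty StructTypeInfo succeeds, and a second x is refused. findField then returns the y entry at index 1 and a null pointer for any name that was never declared.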

Parsing a Struct Definition

ParseStructDefinition is called when the top-level loop sees tok_struct. It reads the struct name, body, and field list, populating a StructTypeInfo and registering it:

static bool ParseStructDefinition() {
  getNextToken(); // eat 'struct'
  string StructName = IdentifierStr;
  if (StructTypes.count(StructName)) {
    LogError(("Struct '" + StructName + "' is already defined").c_str());
    return false;
  }
  getNextToken(); // eat struct name
  StructTypeInfo Info; // collects the fields parsed below
  Info.Name = StructName;
  // ... eat ':', newline, INDENT ...
  while (CurTok != tok_dedent && CurTok != tok_eof) {
    string FieldName = IdentifierStr;
    // ... eat ':', parse type ...
    if (Info.FieldIndex.count(FieldName)) {
      LogError(("Duplicate struct field '" + FieldName + "'").c_str());
      return false;
    }
    Info.FieldIndex[FieldName] = Info.Fields.size();
    Info.Fields.push_back({FieldName, FieldType, FieldStructName});
  }
  StructTypes[StructName] = std::move(Info);
  return true;
}

Struct bodies follow the same indentation rules as function bodies. Redefining a struct and declaring duplicate fields are both errors. Forward references are not supported — a struct must be declared before any use of it as a type.

Two New AST Nodes

FieldExprAST

A field read: p.x, o.inner.value.

class FieldExprAST : public ExprAST {
  string BaseName;           // the variable at the root: "p" or "o"
  vector<string> FieldPath;  // the chain of field names: ["x"] or ["inner", "value"]
  ...
};

The type of the expression (set in the constructor) is the type of the last field in the path. getLValueName() returns &BaseName — used by assignment codegen to find the root pointer.

FieldAssignmentExprAST

A field write: p.x = 5.

class FieldAssignmentExprAST : public ExprAST {
  unique_ptr<FieldExprAST> LHS;
  unique_ptr<ExprAST> RHS;
  ...
};

shouldPrintValue() returns false — assignments produce no REPL output.

Parsing Field Access

ParseFieldAccessExpr is called when the parser sees a . after an identifier that resolved to a struct variable. It walks the dot chain, validating each field against StructTypes:

static unique_ptr<FieldExprAST> ParseFieldAccessExpr(
    string BaseName, ValueType BaseType, string BaseStructName) {
  vector<string> Path;
  ValueType CurType = BaseType;
  string CurStruct = BaseStructName;
  while (CurTok == '.') {
    getNextToken(); // eat '.'
    string Field = IdentifierStr;
    getNextToken(); // eat field name
    // look up Field in CurStruct's FieldIndex,
    // advance CurType and CurStruct to that field's type
    Path.push_back(Field);
  }
  return make_unique<FieldExprAST>(BaseName, Path, CurType, CurStruct);
}

Each step resolves the field type from StructTypes. By the time the loop exits, CurType and CurStruct describe the leaf field — the type the whole expression produces.

Field access on the left of = goes through the same ParseFieldAccessExpr, then into ParseFieldAssignmentRHS, which type-checks the RHS and wraps it in FieldAssignmentExprAST.
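The lookup that the loop's comment elides can be sketched as a standalone function. This is an illustration under assumptions, not the tutorial's implementation: resolvePath, Resolved, Registry, and makeExampleRegistry are hypothetical names, the ValueType enum is trimmed, and errors are reported by returning ValueType::Error rather than through LogError.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

enum class ValueType { Error, Int, Float64, Struct }; // trimmed stand-in

struct FieldInfo {
  std::string Name;
  ValueType Type;
  std::string StructName; // only meaningful if Type == Struct
};

struct StructInfo {
  std::vector<FieldInfo> Fields;
  std::map<std::string, size_t> FieldIndex;
};

using Registry = std::map<std::string, StructInfo>;

struct Resolved {
  ValueType Type;
  std::string StructName;
};

// Walks Path starting from a variable whose type is the struct BaseStruct.
// Returns Type == Error if a step names an unknown field or dots into a scalar.
static Resolved resolvePath(const Registry &Structs,
                            const std::string &BaseStruct,
                            const std::vector<std::string> &Path) {
  const Resolved Failure{ValueType::Error, ""};
  Resolved Cur{ValueType::Struct, BaseStruct};
  for (const std::string &Field : Path) {
    if (Cur.Type != ValueType::Struct)
      return Failure; // cannot take a field of a scalar
    auto SI = Structs.find(Cur.StructName);
    if (SI == Structs.end())
      return Failure;
    auto FI = SI->second.FieldIndex.find(Field);
    if (FI == SI->second.FieldIndex.end())
      return Failure; // no such field
    assert(FI->second < SI->second.Fields.size());
    const FieldInfo &F = SI->second.Fields[FI->second];
    Cur = Resolved{F.Type, F.StructName}; // advance to this field's type
  }
  return Cur;
}

// Builds a registry holding Inner { value: int } and Outer { inner: Inner }.
static Registry makeExampleRegistry() {
  Registry R;
  R["Inner"].Fields.push_back(FieldInfo{"value", ValueType::Int, ""});
  R["Inner"].FieldIndex["value"] = 0;
  R["Outer"].Fields.push_back(FieldInfo{"inner", ValueType::Struct, "Inner"});
  R["Outer"].FieldIndex["inner"] = 0;
  return R;
}
```

With that registry, the path inner, value on an Outer resolves to Int, the path inner alone resolves to the struct Inner, and both a misspelled field and a dot applied to value resolve to Error.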

Tracking Struct Names in Scope

Chapter 16 added VarScopes: vector<map<string, ValueType>> — a stack of maps from variable name to type. Struct variables need the struct name alongside ValueType::Struct, so a parallel stack is added:

static vector<std::map<string, string>> VarStructScopes;

Every time a struct variable enters scope, both stacks are updated:

static void DeclareVar(const string &Name, ValueType Type,
                       const string &StructName = "") {
  VarScopes.back()[Name] = Type;
  if (Type == ValueType::Struct)
    VarStructScopes.back()[Name] = StructName;
}

LookupVarStructName searches VarStructScopes innermost-first, then falls back to GlobalVarStructTypes for globals — mirroring how LookupVarType works:

static string LookupVarStructName(const string &Name) {
  for (auto It = VarStructScopes.rbegin(); It != VarStructScopes.rend(); ++It) {
    auto Found = It->find(Name);
    if (Found != It->end())
      return Found->second;
  }
  auto GI = GlobalVarStructTypes.find(Name);
  if (GI != GlobalVarStructTypes.end())
    return GI->second;
  return "";
}
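The innermost-first search is what makes shadowing work. The sketch below wraps the same lookup in a small class so it can be exercised on its own. StructNameScopes and its method names are hypothetical; the tutorial keeps this state in file-level globals instead.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical wrapper around the scope stack and the global fallback map.
class StructNameScopes {
  std::vector<std::map<std::string, std::string>> Scopes;
  std::map<std::string, std::string> Globals;

public:
  void push() { Scopes.emplace_back(); }

  void pop() {
    assert(!Scopes.empty() && "pop without a matching push");
    Scopes.pop_back();
  }

  void declareLocal(const std::string &Var, const std::string &StructName) {
    assert(!Scopes.empty() && "declareLocal needs an open scope");
    Scopes.back()[Var] = StructName;
  }

  void declareGlobal(const std::string &Var, const std::string &StructName) {
    Globals[Var] = StructName;
  }

  // Searches scopes innermost-first, then globals; "" means "not a struct".
  std::string lookup(const std::string &Var) const {
    for (auto It = Scopes.rbegin(); It != Scopes.rend(); ++It) {
      auto Found = It->find(Var);
      if (Found != It->end())
        return Found->second;
    }
    auto GI = Globals.find(Var);
    if (GI != Globals.end())
      return GI->second;
    return std::string();
  }
};
```

If an outer scope declares p as a Point and an inner scope declares p as a Box, lookup("p") returns Box until the inner scope is popped and Point afterwards. A name found in no scope falls through to the globals.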

PrototypeAST also grows a ReturnStructName field, and the pair<string, ValueType> per argument from chapter 16 becomes an ArgInfo struct with Name, Type, and StructName. Same mechanics; just more to carry per argument.

From Struct Name to LLVM Type

LLVM represents struct types as StructType* objects. GetOrCreateLLVMStructType converts a Pyxc struct name to the corresponding LLVM type, creating it on first use and caching the result:

static std::map<string, StructType *> LLVMStructTypes;

static Type *GetOrCreateLLVMStructType(const string &StructName) {
  auto It = LLVMStructTypes.find(StructName);
  if (It != LLVMStructTypes.end())
    return It->second;

  auto *ST = StructType::create(*TheContext, "struct." + StructName);
  LLVMStructTypes[StructName] = ST;  // register before filling the body

  vector<Type *> FieldTys;
  for (const auto &Field : StructTypes[StructName].Fields)
    FieldTys.push_back(LLVMTypeFor(Field.Type, Field.StructName));
  ST->setBody(FieldTys, false);
  return ST;
}

Three things worth noting here.

First, the cache lookup is essential. StructType::create makes a new, distinct StructType object on every call, even when the name is already taken. LLVM does not deduplicate them; it renames the later one with a numeric suffix. Without the cache, the same Pyxc struct could end up with two unrelated LLVM types that have the same layout but different identities, and wherever a value of one meets a slot declared with the other (a by-value call argument or a return value, for example), the IR verifier reports a type mismatch.

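Second, the type is registered in LLVMStructTypes before its body is filled. This is deliberate: it allows a struct to contain a pointer to itself without infinite recursion. A struct containing itself by value would be infinitely large, and the parser shown earlier rejects it anyway, because a struct's name is not added to StructTypes until its whole body has been parsed, so a field of the struct's own type fails with an "Unknown type" error.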

Third, setBody(FieldTys, false) — the false means non-packed. Fields are laid out with natural alignment, the same as a C struct by default.

LLVMTypeFor dispatches to this function for ValueType::Struct:

case ValueType::Struct:
  return GetOrCreateLLVMStructType(StructName);
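The create-on-first-use pattern, and the reason for registering before filling the body, can be shown without LLVM. The following is a toy model only: ToyType and ToyTypeCache are hypothetical, fields are modelled as pointer links to other struct types, and nothing here calls the LLVM API.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Toy stand-in for an LLVM struct type: identity matters, not just the name.
struct ToyType {
  std::string Name;
  std::vector<const ToyType *> Fields; // links to struct-typed fields only
};

class ToyTypeCache {
  // Declared structs: struct name -> names of the structs its fields refer to.
  std::map<std::string, std::vector<std::string>> Decls;
  std::map<std::string, std::unique_ptr<ToyType>> Cache;

public:
  void declare(const std::string &Name, std::vector<std::string> FieldStructs) {
    Decls[Name] = std::move(FieldStructs);
  }

  // Returns the one ToyType for Name, creating it on first use.
  const ToyType *getOrCreate(const std::string &Name) {
    auto It = Cache.find(Name);
    if (It != Cache.end())
      return It->second.get();

    std::unique_ptr<ToyType> Owned(new ToyType());
    ToyType *T = Owned.get();
    T->Name = Name;
    Cache[Name] = std::move(Owned); // register before filling the body

    // A field that refers back to Name now hits the cache instead of recursing.
    for (const std::string &FieldStruct : Decls[Name])
      T->Fields.push_back(getOrCreate(FieldStruct));
    assert(T->Fields.size() == Decls[Name].size());
    return T;
  }
};
```

Calling getOrCreate twice with the same name returns the same pointer, and a self-referential declaration such as Node referring to Node terminates because the second lookup finds the entry registered a moment earlier.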

The IR Layout

For:

struct Point:
  x: int
  y: int

The LLVM type, named with the "struct." prefix:

%struct.Point = type { i64, i64 }

int is pointer-width (i64 on a 64-bit host). A struct with a float64 field:

struct Circle:
  radius: float64

%struct.Circle = type { double }

Fields appear in declaration order. LLVM inserts padding according to the target's data layout — it is not visible in the IR but is present in the machine code.
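The padding rule for a non-packed struct can be modelled in a few lines. This sketch is not LLVM's DataLayout API. computeLayout, FieldLayout, and StructLayout are hypothetical names, and the sizes and alignments passed in are assumptions about a typical 64-bit target, where int8 is 1 byte with 1-byte alignment and int is 8 bytes with 8-byte alignment.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct FieldLayout {
  size_t Size;  // bytes occupied by the field
  size_t Align; // required alignment in bytes
};

struct StructLayout {
  std::vector<size_t> Offsets; // byte offset of each field, in declaration order
  size_t Size = 0;             // total size including tail padding
  size_t Align = 1;            // alignment of the struct as a whole
};

// Rounds Offset up to the next multiple of Align.
static size_t alignTo(size_t Offset, size_t Align) {
  assert(Align != 0);
  return (Offset + Align - 1) / Align * Align;
}

// Lays fields out in order, padding each one up to its own alignment.
static StructLayout computeLayout(const std::vector<FieldLayout> &Fields) {
  StructLayout L;
  size_t Offset = 0;
  for (const FieldLayout &F : Fields) {
    Offset = alignTo(Offset, F.Align); // padding before the field, if any
    L.Offsets.push_back(Offset);
    Offset += F.Size;
    if (F.Align > L.Align)
      L.Align = F.Align;
  }
  L.Size = alignTo(Offset, L.Align); // tail padding
  return L;
}
```

For Point, computeLayout({{8, 8}, {8, 8}}) gives offsets 0 and 8 and a size of 16 with no padding. For a hypothetical struct with an int8 field followed by an int field, computeLayout({{1, 1}, {8, 8}}) also gives offsets 0 and 8 and a size of 16, so 7 of those bytes are padding.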

Codegen: Getting a Field's Address

Reading or writing a field means computing a pointer to it first. GetFieldAddress does this by walking FieldPath one step at a time:

static Value *GetFieldAddress(const string &BaseName,
                              const vector<string> &FieldPath, ...) {
  // find the base pointer — local alloca or global variable
  Value *Ptr = BasePtr;
  for (const auto &FieldName : FieldPath) {
    size_t Idx = StructTypes[CurStruct].FieldIndex[FieldName];
    Type *BaseLLVM = LLVMTypeFor(CurType, CurStruct);
    Ptr = Builder->CreateStructGEP(BaseLLVM, Ptr, Idx, "fieldptr");
    // advance CurType and CurStruct to this field's type
  }
  return Ptr;
}

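CreateStructGEP emits a getelementptr inbounds instruction that computes the address of one field. In its index list, the first i32 0 steps through the base pointer itself and the second operand is the field index. For p.x on a Point: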

%fieldptr = getelementptr inbounds %struct.Point, ptr %p, i32 0, i32 0

For o.inner.value where inner is an Inner:

%fieldptr  = getelementptr inbounds %struct.Outer, ptr %o, i32 0, i32 0
%fieldptr1 = getelementptr inbounds %struct.Inner, ptr %fieldptr, i32 0, i32 0

One GEP per field step, not one big multi-index GEP. Simpler codegen, same result.
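The address arithmetic behind the two chained GEPs can be reproduced in C++ with offsetof. This is an analogy, not generated code. The Inner and Outer structs below each carry an extra leading field (tag and id, which are not in the tutorial's example) so that the offsets are non-zero, int is assumed to be 64 bits wide, and fieldAddress is a hypothetical helper.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

struct Inner {
  int64_t tag;   // extra field, only here to push value off offset 0
  int64_t value;
};

struct Outer {
  int64_t id;    // extra field, only here to push inner off offset 0
  Inner inner;
};

// Computes the address of O->inner.value one field step at a time,
// the way one GEP per step would.
static int64_t *fieldAddress(Outer *O) {
  assert(O != nullptr);
  char *Ptr = reinterpret_cast<char *>(O);
  Ptr += offsetof(Outer, inner); // step 1: Outer -> inner
  Ptr += offsetof(Inner, value); // step 2: Inner -> value
  return reinterpret_cast<int64_t *>(Ptr);
}
```

The pointer returned by fieldAddress(&o) is the same address as &o.inner.value. Adding the two offsets one after the other lands in the same place as adding their sum, which is why the stepwise form and a single multi-index GEP are equivalent.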

Codegen: Reading and Writing Fields

Read:

Value *FieldExprAST::codegen() {
  Value *Ptr = GetFieldAddress(*getLValueName(), FieldPath, ...);
  return Builder->CreateLoad(LLVMTypeFor(LeafType, LeafStruct), Ptr, "fieldload");
}

Compute the pointer, load from it. For p.x where x: int:

%fieldptr  = getelementptr inbounds %struct.Point, ptr %p, i32 0, i32 0
%fieldload = load i64, ptr %fieldptr

Write:

Value *FieldAssignmentExprAST::codegen() {
  Value *Ptr = GetFieldAddress(*LHS->getLValueName(), LHS->getFieldPath(), ...);
  Value *Val = RHS->codegen();
  Val = EmitImplicitCast(Val, RHS->getType(), DestType);
  Builder->CreateStore(Val, Ptr);
  return Val;
}

Compute the pointer, codegen the RHS, implicit cast if needed, store. For p.x = 5 where x: int:

%fieldptr = getelementptr inbounds %struct.Point, ptr %p, i32 0, i32 0
store i64 5, ptr %fieldptr

The implicit cast rules from chapter 16 apply to field assignments. Assigning a float64 to an int field is a type error. Assigning an int8 to an int field widens silently.

Struct Variables and Zero Initialization

var p: Point with no initializer allocates stack space and zero-initializes the struct:

InitVal = ZeroConstant(VarType, VarStructName);
// ...
Builder->CreateStore(InitVal, Alloca);

ZeroConstant for a struct calls Constant::getNullValue(LLVMTypeFor(Type, StructName)), which produces a zero aggregate constant:

%p = alloca %struct.Point
store %struct.Point zeroinitializer, ptr %p

There is no struct initializer syntax yet — var p: Point = Point{x: 1, y: 2} is not supported. Struct variables always start zeroed. Fields are then assigned individually.

Structs Are Passed by Value

When a function takes a struct parameter, the caller passes a copy:

struct Box:
  value: int

def clobber(b: Box) -> None:
  b.value = 0

def main() -> int:
  var b: Box
  b.value = 99
  clobber(b)
  # b.value is still 99 here
  return 0

The function signature in IR:

define void @clobber(%struct.Box %b) {
entry:
  %b.addr = alloca %struct.Box
  store %struct.Box %b, ptr %b.addr
  %fieldptr = getelementptr inbounds %struct.Box, ptr %b.addr, i32 0, i32 0
  store i64 0, ptr %fieldptr
  ret void
}

clobber receives a copy of b. Writing to b.value inside clobber writes to that copy. The caller's struct is unchanged after the call. If you want a function to modify the caller's struct, you need a pointer — that's chapter 18.
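C++ passes structs by value in the same way, so the behaviour is easy to reproduce there. The sketch below mirrors the Pyxc example and adds a pointer-taking variant for contrast. clobberThroughPointer is a hypothetical name, and it illustrates the idea in C++ only, not the Pyxc syntax that chapter 18 introduces.

```cpp
#include <cassert>
#include <cstdint>

struct Box {
  int64_t value;
};

// Receives its own copy of the struct; the write never reaches the caller.
static void clobber(Box b) {
  b.value = 0;
  (void)b; // silence unused-value warnings; the copy is simply discarded
}

// Receives the caller's address; the write is visible after the call.
static void clobberThroughPointer(Box *b) {
  assert(b != nullptr);
  b->value = 0;
}
```

Starting from a Box whose value is 99, the value is still 99 after clobber(b) and becomes 0 only after clobberThroughPointer(&b).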

Global Struct Variables

Struct variables at global scope work the same as scalar globals:

struct Counter:
  value: int

var g: Counter

Zero-initialized at program start:

@g = global %struct.Counter zeroinitializer

Field reads and writes on globals go through the same GetFieldAddress path — it checks NamedValues for locals first, then falls back to GetGlobalVariable.

Build and Run

cd code/chapter-17
cmake -S . -B build && cmake --build build

Try It

Basic field access

struct Point:
  x: int
  y: int

extern def printd(x: float64)

def main() -> int:
  var p: Point
  p.x = 3
  p.y = 4
  printd(float64(p.x + p.y))
  return 0

Output:

7.000000

Passing a struct to a function

struct Point:
  x: int
  y: int

extern def printd(x: float64)

def sum_point(p: Point) -> int:
  return p.x + p.y

def main() -> int:
  var p: Point
  p.x = 5
  p.y = 7
  printd(float64(sum_point(p)))
  return 0

Output:

12.000000

Nested field access

struct Inner:
  value: int

struct Outer:
  inner: Inner

extern def printd(x: float64)

def main() -> int:
  var o: Outer
  o.inner.value = 9
  printd(float64(o.inner.value))
  return 0

Output:

9.000000

Inspect the IR

pyxc --emit llvm-ir -o out.ll program.pyxc
grep -E 'struct|getelementptr|alloca' out.ll

Known Limitations

No struct initializer syntax. var p: Point = Point{x: 1, y: 2} is not supported. Fields must be assigned individually after declaration.

No struct-to-struct copy. var p2: Point = p1 is not supported. Whole-struct initialization from another variable isn't implemented yet.

Field access must start with a named variable. make_point().x is rejected — the base must be a variable in scope, not an expression.

No pointer-to-struct. Functions take structs by value. To share a struct across functions and have modifications be visible to the caller, you need a pointer — that's chapter 18.

What's Next

Chapter 18 adds pointers: ptr[T] as a type, addr(x) to take the address of a variable, and p[i] for pointer indexing. With pointers, you can pass a struct by reference and have functions modify the caller's data.

Need Help?

Build issues? Questions?

Include:

  • Your OS and version
  • Full error message
  • Output of cmake --version, ninja --version, and llvm-config --version

We'll figure it out.