Engineering a Custom LLM for Hyperlambda

I have been fine-tuning a Large Language Model to natively write Hyperlambda.

That sounds at first like a normal code-model project.

It is not.

Because Hyperlambda is not just another programming language.

It is a declarative relational file format for creating execution trees, which can then be executed inside Magic, a secure cloud operating system implemented in C# on .NET 10. In memory, these structures become lambda objects. On disk, they exist as Hyperlambda text. The important point is that the language is fundamentally AST-oriented, structurally dense, mutable, and executable.

That changes the fine-tuning problem completely.

I am not trying to train a model to emit code that merely looks plausible to a human reader. I am trying to train a model to emit a highly compact deterministic execution graph where small structural mistakes are fatal.

In a language like Python, JavaScript, or C#, the model has some room to be sloppy and still appear useful. In Hyperlambda, the tolerance is much lower. A missing structural cue, a malformed slot invocation, a wrong dot prefix, or broken indentation does not produce something that is slightly ugly. It produces something that is semantically wrong.

That is the central challenge.

What makes Hyperlambda a special target

To understand why this matters, you first have to understand what Hyperlambda actually is.

Hyperlambda is built from nodes.

Each node has three conceptual properties.

  • A name
  • A value
  • Zero or more children

Names and values are separated by a colon. Children are scoped by three additional spaces of indentation. The executor walks nodes sequentially from top to bottom and assumes that everything is a slot invocation unless the node name starts with a dot.

That last detail matters a lot.

Nodes starting with a dot are normally treated as data. Nodes without a dot are normally treated as executable slots.

This means syntax is not merely cosmetic. It is directly tied to runtime meaning.

To illustrate this, consider a simple data segment followed by a loop.

.data
   item1:Hello from Item1
   item2:Hello from Item2
   item3:Hello from Item3
for-each:x:@.data/*
   log.info:x:@.dp/#

The .data node is inert data because it starts with a dot.

The for-each node is executable because it does not.

That distinction is easy for a human to read.

But for an LLM, it means a single-character drift can change the role of an entire subtree.

Hyperlambda is also unusually compact. It is less verbose than JSON or XML for representing execution trees because it does not require unique child names and because nested execution semantics are expressed directly through indentation and node ordering.

This compactness is one of the reasons it is so useful for machine-generated software.

It is code as mutable data.

You can create it, transform it, splice new nodes into it, inspect it, and execute it, all while preserving a simple graph structure. That makes it a very strong DSL for orchestration, code generation, dynamic composition, and autonomous backend work.

It also makes it unusually unforgiving when a model starts drifting.

Why I wanted a model that speaks Hyperlambda natively

The actual goal was to create an autonomous full-stack software agent.

Not a chatbot that explains code. Not an assistant that writes partial stubs. An agent that can reason through a task, generate backend logic, build authenticated CRUD flows, construct frontends, choose the right tools, and do useful engineering work in a secure runtime.

Hyperlambda is a very good target for that kind of system because it already maps naturally to execution trees, slot invocations, HTTP endpoints, file manipulation, and database operations.

It is particularly strong for backend automation because the language is so close to runtime intent.

That means if the model really learns Hyperlambda, it is not just learning syntax. It is learning a compact executable representation for orchestrating real work.

And initially, the results were better than I expected.

The baseline model demonstrated strong zero-shot generalization. It could autonomously generate a fully authenticated full-stack SQLite CRUD API, including a custom Magic Auth frontend, in roughly 30 minutes.

More interestingly, it showed evidence of operational reasoning.

For web scraping tasks, it did not simply imitate human workflows. It chose a more efficient execution path, preferring a fast raw HTTP GET through Hyperlambda over a clunky headless browser when the problem did not actually require a browser.

That is exactly the kind of behavior I wanted.

Not just code emission. Reasoning plus tool selection.

At that point it looked like the project was on track.

Then the model started getting smarter in exactly the wrong way.

The failure mode was subtle and dangerous

As the model learned to generate larger and more sophisticated architectures, its high-level reasoning improved.

But its low-level syntax reliability began to degrade.

That is the worst kind of failure in a deterministic DSL.

The model could still produce impressive large-scale solutions. It could still reason across multiple moving parts. It could still compose backend and frontend logic.

But at the same time, it started hallucinating simple foundational constructs it had previously known perfectly.

Basic if and else patterns drifted. Simple http.get usage became less stable. Core slot invocation forms became unreliable.

This was not ordinary hallucination.

It was catastrophic forgetting.

And it makes perfect sense once you look at the training dynamics from the optimizer's point of view instead of the human curator's.

The real problem was token volume

The core mistake is easy to describe.

I originally treated the dataset as though balancing snippet counts was enough.

It was not.

Loss is computed per token, not per snippet. Gradient updates are driven by token-level prediction error, not by the fact that two folders might contain the same number of examples.

This means that a 600-token sample does not have the same influence as a 40-token sample just because each counts as one row in a dataset.

It has vastly more optimization mass.

If your long-form architectural snippets are fifteen times larger than your foundational syntax snippets, then they present many more prediction positions, generate much more aggregate loss, and therefore exert much stronger pressure on the weights.

That pressure accumulates.

So even if your dataset looks balanced in terms of item count, the optimizer may still be spending most of its effort trying to fit long compositional trajectories.
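The imbalance is plain arithmetic. Here is a minimal sketch; the sample counts are hypothetical, and only the 40-versus-600 token sizes come from the discussion above:

```python
# Illustrative only: compare the optimization mass of short vs. long samples.
# Loss is summed per token, so a sample's gradient pressure scales with its
# token count, not with how many rows it occupies in the dataset.

short_samples = 1000   # hypothetical count of foundational syntax snippets
short_tokens = 40      # tokens per short snippet

long_samples = 1000    # the SAME row count of long-form architectural snippets
long_tokens = 600      # tokens per long snippet

short_mass = short_samples * short_tokens   # 40,000 prediction positions
long_mass = long_samples * long_tokens      # 600,000 prediction positions

print(long_mass / short_mass)  # 15.0 -> long samples dominate 15:1
```

Equal row counts, fifteen-to-one token mass. That ratio, not the folder sizes, is what the optimizer sees.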

That is exactly what happened here.

As the model learned larger and larger Hyperlambda architectures, the gradients from those long samples started mathematically dominating the optimization process. The foundational grammar did not disappear because it was missing from the corpus. It disappeared because it was underweighted relative to the long-form code.

The complex samples were hijacking the updates.

This is the trap I fell into.

The model was learning more reasoning while simultaneously losing pieces of the syntax substrate that made that reasoning executable.

Why this problem is amplified in deterministic DSLs

This issue exists in many code-model projects, but deterministic DSLs expose it much more aggressively.

The reason is simple.

In Hyperlambda, many short snippets carry disproportionately important grammatical information.

A tiny snippet might encode:

  • how slot invocation works
  • how arguments are attached
  • how lambda expressions bind to nodes
  • when a node is data versus executable
  • how indentation scopes children
  • how types are declared
  • how sequential execution composes nested calls

These are not low-value beginner examples.

They are the grammar kernel of the language.

If the model loses reliability there, then it can still sound sophisticated while producing invalid execution trees.

That is why deterministic DSL fine-tuning is a different engineering problem than general code fine-tuning.

The small examples often matter more than their token count suggests.

Re-architecting the corpus

To solve this, I stopped thinking about the dataset as one flat pile of training snippets.

Instead, I analyzed and tokenized the entire corpus and separated it into a five-bucket complexity pyramid.

  • Bucket 1 under 60 tokens for core grammar and primitives
  • Bucket 2 under 120 tokens for extended syntax
  • Bucket 3 under 250 tokens for mid-level composition
  • Bucket 4 under 600 tokens for complex logic
  • Bucket 5 over 600 tokens for long-form reasoning and full applications

This made the structure of the problem visible.

Bucket 1 was not just easy data. It was the base grammar. Bucket 5 was not just advanced data. It was where the token mass lived.

Once the dataset was partitioned this way, the solution became obvious.

I had to decouple snippet count from token influence.

The fix was probabilistic curriculum learning

The approach I ended up with used three mechanisms.

  • Stochastic oversampling
  • Randomized placement
  • Proportional batch chunking

The objective was to keep foundational syntax continuously present during training without turning the short examples into rigid memorization targets.

Stochastic oversampling

Buckets 1 and 2 were artificially inflated using probabilities instead of deterministic copying.

For Bucket 1, each snippet had:

  • 15 percent chance of getting 2 extra copies
  • 35 percent chance of getting 1 extra copy
  • 50 percent chance of getting no extra copies

For Bucket 2, each snippet had:

  • 10 percent chance of getting 2 extra copies
  • 20 percent chance of getting 1 extra copy
  • 70 percent chance of getting no extra copies

This added roughly 15,000 extra foundational examples without creating a repetitive deterministic duplication pattern.

That distinction matters.

If you duplicate everything mechanically, the model starts memorizing exact local strings. If you duplicate stochastically, the grammar becomes more prevalent without becoming a rigid pattern.

It behaves more like ambient reinforcement.

Randomized placement

The duplicates were then injected at random positions inside the arrays.

This prevented local clustering.

That matters because if duplicated syntax snippets bunch together, the model gets bursts of concentrated reinforcement followed by long absences. I did not want spikes. I wanted distribution.

Random placement ensured that foundational syntax remained present throughout training rather than arriving in obvious waves.
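The two mechanisms above can be sketched together. The probability tables are the ones stated earlier; the function names and the uniform-insertion strategy are illustrative assumptions, not the pipeline's actual code:

```python
import random

# Duplication probabilities from the text: (extra_copies, probability).
DUPLICATION = {
    1: [(2, 0.15), (1, 0.35), (0, 0.50)],  # Bucket 1
    2: [(2, 0.10), (1, 0.20), (0, 0.70)],  # Bucket 2
}

def extra_copies(bucket: int, rng: random.Random) -> int:
    """Draw how many extra copies a single snippet receives."""
    r = rng.random()
    cumulative = 0.0
    for copies, p in DUPLICATION[bucket]:
        cumulative += p
        if r < cumulative:
            return copies
    return 0

def inflate(snippets, bucket, rng=None):
    """Stochastically duplicate snippets, then inject the duplicates at
    random positions so reinforcement is spread out, not clustered."""
    rng = rng or random.Random()
    result = list(snippets)
    duplicates = []
    for s in snippets:
        duplicates.extend([s] * extra_copies(bucket, rng))
    for d in duplicates:
        result.insert(rng.randrange(len(result) + 1), d)
    return result
```

With the Bucket 1 table, expected inflation is 1 + 0.15 × 2 + 0.35 × 1 = 1.65× the original count, but no individual snippet is guaranteed a fixed number of copies, and no duplicate lands in a predictable position.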

Proportional batch chunking

This was the most important part.

To avoid batches dominated by long-form samples, the five buckets were drained simultaneously using a fixed extraction ratio. For the final post-inflation dataset of roughly 77,000 items and a target batch size of 32, the chunk ratio was engineered as follows.

  • 19 from Bucket 1
  • 9 from Bucket 2
  • 7 from Bucket 3
  • 5 from Bucket 4
  • 2 from Bucket 5

This ratio ensured that all buckets emptied at roughly the same time.
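Draining under a fixed ratio can be sketched as follows. Only the per-chunk counts come from the list above; the bucket contents and the `interleave` helper are illustrative:

```python
# Sketch: drain all five buckets simultaneously with a fixed extraction ratio,
# so every chunk mixes core grammar with long-form samples.

RATIO = {1: 19, 2: 9, 3: 7, 4: 5, 5: 2}

def interleave(buckets, ratio=RATIO):
    """Yield mixed chunks until every bucket is empty."""
    queues = {b: list(items) for b, items in buckets.items()}
    while any(queues.values()):
        chunk = []
        for b, take in ratio.items():
            chunk.extend(queues[b][:take])
            del queues[b][:take]
        yield chunk
```

When bucket sizes match the ratio (e.g. 190, 90, 70, 50 and 20 items), every chunk contains all five buckets and all five run dry on the same iteration, which is exactly the "empty at roughly the same time" property described above.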

In practical terms, every segment of training mixed core grammar with long-form reasoning.

That changed the optimization landscape entirely.

The model was no longer being forced to choose between remembering syntax and learning architecture. It was being trained to keep both active at the same time.

Signal purity mattered just as much as balancing

Once you are tuning a model for a deterministic DSL, every irrelevant token becomes noise.

So I aggressively sanitized the dataset.

Completions contained no conversational filler at all. No explanatory prose. No wrapper text. Just instructions mapped directly to raw Hyperlambda.

Markdown fences were removed completely.

This was important because I did not want the model learning that executable code should be wrapped in presentation syntax. In production, I needed raw output that could be executed immediately.

Line endings were normalized from Windows carriage returns to Unix newlines to avoid unnecessary token fragmentation.
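A minimal sketch of this sanitization pass, assuming fence lines begin with triple backticks and that dropping whole fence lines is acceptable; the pipeline's actual implementation is not shown here:

```python
# Sketch: normalize CRLF line endings and strip markdown code fences so
# completions contain nothing but raw, immediately executable Hyperlambda.
# Hyperlambda lines never start with backticks, so dropping fence lines is safe.

def sanitize(completion: str) -> str:
    text = completion.replace("\r\n", "\n")  # Windows CRLF -> Unix LF
    lines = [ln for ln in text.split("\n") if not ln.startswith("```")]
    return "\n".join(lines)
```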

And the system prompt explicitly anchored Hyperlambda's strict three-space indentation rule.

That might sound minor.

It is not minor in a language where indentation is part of the execution tree.

Again, this is one of the things that makes DSL fine-tuning different. In a deterministic executable format, formatting discipline is not style. It is syntax.

Hyperparameters had to reflect the new dataset

Because the corpus was now inflated in the lower buckets, I also had to be conservative with training settings.

The final parameters were:

  • Batch size 32
  • Learning rate multiplier 0.5
  • Epochs 2

Batch size 32 gave stable blended updates across the bucketed chunks.

The reduced learning rate helped prevent aggressive overwriting in a language where tiny regressions matter.

And dropping from 3 epochs to 2 was important because once foundational examples are probabilistically duplicated, an extra epoch starts increasing the risk of memorization.

The goal was not to make the model recite short snippets.

The goal was to preserve the grammar while learning the larger programs.

What changed after the fix

After re-architecting the corpus, the model stopped drifting on foundational syntax while retaining its long-form reasoning ability.

That was the real success condition.

I did not want a model that became syntactically correct by becoming less capable. I wanted a model that could still build large systems without forgetting how the language itself works.

That is what the new pipeline delivered.

The result was a much more balanced Hyperlambda-native engineering agent with stronger syntax retention, fewer low-level hallucinations, and better reliability during autonomous generation.

What I learned from this

The biggest lesson is that fine-tuning LLMs for deterministic DSLs is its own engineering discipline.

You cannot treat the corpus like a generic code dataset. You cannot assume equal snippet counts imply balanced learning. And you definitely cannot assume that if the model gets better at complex reasoning, it is also preserving the grammatical substrate required to execute that reasoning.

In a language like Hyperlambda, the opposite can happen.

The model can become globally more impressive while locally becoming less trustworthy.

That is what catastrophic forgetting looked like in this project.

And the fix was not magical.

It was mathematical.

Once I started treating token volume, gradient pressure, curriculum structure, and syntax purity as first-class concerns, the path forward became clear.

If you want a model to write a deterministic DSL reliably, you have to train for grammar retention explicitly.

Otherwise the large samples will eventually bulldoze the small ones.

And in a language where the small ones define the execution grammar, that is where everything starts breaking.