neural voxel parkour
the idea
I wanted a small 3D voxel parkour game where the physics step could be swapped
at runtime. The analytic engine handles acceleration, friction, gravity, jumping,
and axis-separated AABB collision in one function: physics_step().
The goal was to train a neural network that predicts the same state transition,
then push inference latency down until it is close to the analytic baseline.
That means two parallel optimization tracks. On the algorithmic side: keep the teacher fast and deterministic. On the neural side: shrink the observation, quantize weights, fuse normalization, and tune the matmul kernels until the gap is measured in single-digit microseconds, not milliseconds.
Target accuracy is at least 98% one-step grounded agreement on held-out data. Every model we kept clears that bar. The harder metric is rollout: run the network for hundreds of ticks and compare to analytic physics. That is what you feel in-game as drift, jitter, and phasing through voxels.
The current default is model_q8_h24_p3.bin: 39 inputs (3x3x3 voxels
+ 12 scalars), 24-neuron int8 MLP, ~0.95 us per neural step (~12x analytic at
~0.08 us). It is not the fastest option we trained, but it is the most stable
on long runs among the models we still ship: ~89% rollout grounded agreement,
~2.4% of airborne steps overlap solids (analytic: 0%).
Most of the code — C inference, training pipeline, benchmarks, int8 export — was built in one afternoon with Cursor agents from a written spec. See implementation and building with cursor below for file layout, the record→train→bench loop, and what made iteration fast.
the game
16x16x16 voxel world, procedural parkour maps, orange cube player, WASD and space to jump. Each launch generates a new map via random-walk platforms with gaps and height changes. Blue start, green goal. Fall below y = -4 and respawn. Reach the goal and the world regenerates.
architecture
The physics pass is isolated behind a single call site so analytic and neural paths share the same game loop:
if (physics_mode == PHYSICS_ANALYTIC)
    physics_step(&player, input, dt);
else
    neural_physics_step(&player, input, dt);
The analytic engine is the teacher. During data collection it records paired before/after states under randomized input policies. The neural path loads a binary MLP from disk and runs inference in C with no pytorch at runtime.
Output is 7 floats: dx, dy, dz, next vx, next vy, next vz, next grounded. Training uses pytorch; inference uses exported float or int8 weights with per-row scale factors.
implementation
The repo is plain C99 + raylib, CMake, and a small Python training script. No engine, no ECS, no runtime ML dependencies. The whole project is small enough to hold in your head: seven C files, one header of shared constants, two Python tools.
code layout
| file | role |
|---|---|
| main.c | raylib loop, input, Tab toggle, model load order |
| physics.c | analytic teacher: accel, friction, gravity, axis AABB collision |
| world.c | 16³ voxel grid, procedural map, player spawn nudge |
| observation.c | 3³ patch around player, feature packing for train + inference |
| neural.c | binary model load, int8/float forward pass, AVX2 matmul, neural_physics_step() |
| sim.c | headless dataset recorder, rollout benchmark vs analytic |
| common.h | PATCH_D, INPUT_DIM, packed TrainingRecord, model magic bytes |
| tools/train.py | load dataset, train MLP, fuse norm, export int8/float .bin |
closed loop
Everything hangs off one pipeline. Same binary, different CLI flags:
# 1. record 500k one-step transitions from analytic physics (headless, ~seconds)
./voxel_parkour.exe --record 500000 --out data/train.bin
# 2. train + export int8 models (~13 min stable sweep on CPU)
python tools/train.py --sweep-stable --epochs 40
# 3. measure rollout + latency vs analytic
./voxel_parkour.exe --bench models/model_q8_h24_p3.bin
# 4. play, toggle mid-game
./voxel_parkour.exe
Recording never opens a window. It runs the same physics_step() as the
game, writes fixed-size rows to disk, and rotates maps when the player falls or
an episode hits 400 steps. Training reads those rows, holds out 20% of map seeds
for validation, and exports weights the C loader understands.
observation format
Each training row is 90 bytes, packed to match C struct layout exactly:
- 27 bytes — voxel types in the 3³ neighborhood (0=empty, 1=solid, …)
- 12 bytes — sub-voxel offset (3 floats)
- 12 bytes — velocity before the step
- 6 bytes — grounded flag + 5 input buttons
- 4 bytes — fixed dt (1/60)
- 25 bytes — analytic targets: Δpos, next vel, next grounded
- 4 bytes — map seed (for train/val split)
At inference, pack_player_features() builds the same 39-float input
vector the trainer sees: voxels / 4, offsets, normalized vel, grounded, one-hot
inputs. Train/serve skew here would show up immediately as rollout drift, so the
packing logic is mirrored exactly once in C, in a single function, rather than
split across two different codepaths.
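As a rough illustration of the layout above, the 90-byte row can be described with Python's struct module. The field order follows the list; the little-endian byte order, the uint32 seed type, and the vel_scale constant are my assumptions, and parse_record / to_features are hypothetical names, not necessarily what train.py uses.

```python
import struct

# Assumed: little-endian, no padding, fields in the order listed above.
# The seed is treated as uint32; the post only says it is 4 bytes.
RECORD = struct.Struct("<27B3f3f6Bf6fBI")
assert RECORD.size == 90  # 27 + 12 + 12 + 6 + 4 + 25 + 4


def parse_record(buf):
    """Unpack one 90-byte row into named fields."""
    f = RECORD.unpack(buf)
    return {
        "voxels": f[0:27],      # 3x3x3 voxel types
        "offset": f[27:30],     # sub-voxel offset
        "vel": f[30:33],        # velocity before the step
        "grounded": f[33],
        "buttons": f[34:39],    # 5 input buttons
        "dt": f[39],
        "target": f[40:47],     # dpos (3), next vel (3), next grounded (1)
        "seed": f[47],
    }


def to_features(rec, vel_scale=10.0):
    """Build the 39-float input vector. vel_scale is a placeholder value."""
    x = [v / 4.0 for v in rec["voxels"]]
    x += list(rec["offset"])
    x += [v / vel_scale for v in rec["vel"]]
    x.append(float(rec["grounded"]))
    x += [float(b) for b in rec["buttons"]]
    return x  # 27 + 3 + 3 + 1 + 5 = 39 values
```

Note that dt is stored in the row but is not one of the 39 inputs, which matches the fixed-timestep constraint: 27 voxels plus 12 scalars leaves no slot for it.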
model binary format
Shipped models are int8 v2 blobs (~1–4 KB). Header: magic MLP!, version,
four ints (input_dim, h1, h2, output_dim). Weights are per-row quantized int8
with float scales; biases are pre-fused with input/output normalization at export
time so inference never touches mean/std arrays. Forward pass: int8 matmul →
scale → ReLU → int8 matmul → scale → 7 outputs applied directly to player state.
void neural_physics_step(Player *p, InputState input, float dt) {
    float in[INPUT_DIM], out[OUTPUT_DIM];
    pack_player_features(p, input, in);
    neural_forward(in, out);
    p->pos.x += out[0]; p->pos.y += out[1]; p->pos.z += out[2];
    p->vel.x = out[3]; p->vel.y = out[4]; p->vel.z = out[5];
    p->grounded = out[6] >= 0.5f ? 1 : 0;
}
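The forward pass that neural_forward() performs could be written roughly like this in Python. This is a reading aid, not the C kernel: it assumes float activations multiplied against int8 weight rows with one float scale per row and float biases, and the function names are illustrative.

```python
def int8_linear(x, w_q, scales, bias):
    """y[i] = scales[i] * sum_j(w_q[i][j] * x[j]) + bias[i]."""
    out = []
    for row, s, b in zip(w_q, scales, bias):
        acc = 0.0
        for w, xi in zip(row, x):
            acc += w * xi          # int8 weight times float activation
        out.append(acc * s + b)    # per-row scale, then float bias
    return out


def relu(v):
    return [a if a > 0.0 else 0.0 for a in v]


def forward(x, layer1, layer2):
    """39 -> h -> 7. Each layer is a (w_q, scales, bias) tuple."""
    h = relu(int8_linear(x, *layer1))
    return int8_linear(h, *layer2)
```

Because normalization is already folded into the weights and biases, the seven outputs come out in world units and can be applied to the player state directly.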
AVX2 FMA kernels live only in neural.c. Applying -march=native
to the whole executable caused stack misalignment crashes in raylib's render path;
scoping SIMD to the inference translation unit fixed that.
benchmark modes
--bench runs 50 episodes × up to 300 steps. Each step: same random
input drives analytic and neural players in parallel from identical starting
state. We track mean position/velocity error, grounded mismatch %, and tunnel
count (neural AABB overlapping solids while airborne). That last metric matches
what you see as phasing in play.
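The two headline rollout numbers reduce to simple counting. A minimal sketch, assuming each benchmark step is summarized as a tuple of (analytic grounded, neural grounded, neural AABB overlaps a solid); the tuple layout and function name are mine, not sim.c's.

```python
def rollout_metrics(steps):
    """Aggregate grounded agreement and tunnel rate over benchmark steps."""
    total = agree = airborne = tunnels = 0
    for analytic_grounded, neural_grounded, overlaps_solid in steps:
        total += 1
        agree += int(bool(analytic_grounded) == bool(neural_grounded))
        if not neural_grounded:           # tunnel is only counted while airborne
            airborne += 1
            tunnels += int(bool(overlaps_solid))
    return {
        "grounded_agreement": agree / total if total else 0.0,
        "tunnel_rate": tunnels / airborne if airborne else 0.0,
    }
```

With this definition the tunnel rate is a fraction of airborne steps, not of all steps, so the denominator is smaller than episodes × steps.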
analytic teacher (what the network imitates)
The teacher is deliberately boring: axis-separated AABB collision against a
16×16×16 voxel grid, no rotation, no swept tests. Each frame applies input
acceleration, horizontal friction, speed clamp, optional jump, gravity, then
moves X → Y → Z independently — if a move collides, snap back and zero velocity
on that axis. Landing on Y sets grounded = 1.
void physics_step(Player *p, InputState input, float dt) {
    /* accel + friction + jump + gravity */
    p->grounded = 0;
    move_axis(p, 0, p->vel.x * dt); /* X, then Y, then Z */
    move_axis(p, 1, p->vel.y * dt);
    move_axis(p, 2, p->vel.z * dt);
}
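For readers who have not written axis-separated collision before, here is a small Python sketch of what a move_axis() like the one above might do. The player half-extent, the set-of-cells world, and the "falling onto something means grounded" rule are assumptions for illustration; the real physics.c may differ in all of them.

```python
import math


def overlaps_solid(pos, half, solid):
    """True if the AABB centered at pos with half-extent `half` covers a solid cell."""
    lo = [int(math.floor(p - half)) for p in pos]
    hi = [int(math.floor(p + half - 1e-6)) for p in pos]  # touching a face is not overlap
    for x in range(lo[0], hi[0] + 1):
        for y in range(lo[1], hi[1] + 1):
            for z in range(lo[2], hi[2] + 1):
                if (x, y, z) in solid:
                    return True
    return False


def move_axis(player, axis, delta, solid, half=0.4):
    """Move along one axis; on collision, snap back and zero that axis' velocity."""
    old = player["pos"][axis]
    player["pos"][axis] = old + delta
    if overlaps_solid(player["pos"], half, solid):
        player["pos"][axis] = old            # snap back to the pre-move position
        player["vel"][axis] = 0.0
        if axis == 1 and delta < 0:          # assumed: y is up, so this is a landing
            player["grounded"] = 1
```

Resolving one axis at a time is what keeps the teacher cheap: a blocked Y move never cancels the X or Z motion in the same tick, so the player slides along walls and floors without any swept test.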
Recording snapshots the player before the step, runs
physics_step(), snapshots after, and writes the delta as
training targets. The network never sees collision resolution code — it learns
the end-to-end mapping from local geometry + state to the teacher's outcome.
training and export
train.py loads the binary dataset with numpy struct unpacking
(one frombuffer pass over 500k rows, not a Python loop per sample).
Map seeds split train/val 80/20 so the network generalizes across procedural
layouts, not just held-out frames from the same map.
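Splitting by map seed rather than by row can be sketched in a few lines. This version is deterministic (it holds out the highest seeds) purely to keep the example short; whether train.py shuffles seeds first is not something I am asserting, and split_by_seed is a hypothetical name.

```python
def split_by_seed(seeds, val_fraction=0.2):
    """Return (train_idx, val_idx) so that no map seed appears on both sides."""
    unique = sorted(set(seeds))
    n_val = max(1, round(len(unique) * val_fraction))
    val_seeds = set(unique[-n_val:])       # hold out whole maps, not single rows
    train_idx = [i for i, s in enumerate(seeds) if s not in val_seeds]
    val_idx = [i for i, s in enumerate(seeds) if s in val_seeds]
    return train_idx, val_idx
```

The point of the design is that consecutive frames from one map are highly correlated; a per-row split would leak near-duplicates of training frames into validation and inflate the one-step score.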
The MLP is a plain two-layer ReLU net in PyTorch: 39 → h → 7. After training,
fuse_weights() bakes input/output z-score normalization into the
first and last layer so C inference never loads mean/std arrays:
w1 = w1 / x_std; b1 = b1 - w1 @ x_mean              # input norm → first layer (uses the rescaled w1)
w3 = w3 * y_std[:, None]; b3 = b3 * y_std + y_mean  # output denorm → last layer, scaled per output row
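To see why the fusion is exact and not an approximation, here is the same algebra in dependency-free Python with a tiny check. Weights are assumed to be stored as one row per output neuron (the PyTorch Linear convention); the helper names are illustrative.

```python
def matvec(w, x):
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def fuse_input_norm(w, b, mean, std):
    """Fold (x - mean) / std into the layer: W' = W / std, b' = b - W' @ mean."""
    w_f = [[wij / s for wij, s in zip(row, std)] for row in w]
    b_f = [bi - dot for bi, dot in zip(b, matvec(w_f, mean))]
    return w_f, b_f


def fuse_output_denorm(w, b, y_mean, y_std):
    """Fold y * y_std + y_mean into the layer, scaling each output row."""
    w_f = [[wij * s for wij in row] for row, s in zip(w, y_std)]
    b_f = [bi * s + m for bi, s, m in zip(b, y_std, y_mean)]
    return w_f, b_f
```

For example, with w = [[1, 2]], b = [0.5], mean = [1, 2], std = [2, 4] and input [3, 6], the normalized path gives 1·1 + 2·1 + 0.5 = 3.5, and the fused layer gives 0.5·3 + 0.5·6 − 1.0 = 3.5.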
int8 export quantizes each weight row independently (per-row scale, max abs → 127).
Biases stay float32. The C loader reads magic MLP!, dims, packed int8
weights + scales, and runs the same graph as PyTorch at eval time — within
quantization error. Sweep flags train multiple hidden widths in one invocation,
reusing the cached dataset between models.
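Per-row quantization as described above is small enough to show in full. A sketch, assuming symmetric quantization with no zero point and a scale of 1.0 for all-zero rows (that fallback is my choice, not something the post specifies):

```python
def quantize_rows(w):
    """Per-row symmetric int8: the largest |weight| in each row maps to 127."""
    q_rows, scales = [], []
    for row in w:
        max_abs = max(abs(v) for v in row)
        scale = max_abs / 127.0 if max_abs > 0.0 else 1.0
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


def dequantize_rows(q_rows, scales):
    """Recover approximate float weights: w ≈ q * scale."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]
```

Giving each row its own scale means one large weight only costs precision in its own output neuron; the round-trip error per weight is at most half that row's scale.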
build and entry points
CMake fetches raylib 5.5, builds one executable with Release defaults
(-O3 -ffast-math). AVX2/FMA flags apply only to neural.c,
not the whole binary. Same exe, three modes:
| flag | behavior |
|---|---|
| (none) | raylib game, Tab toggles analytic/neural, N reloads model |
| --record N --out path | headless, writes N rows to path |
| --bench model.bin | headless rollout + forward-pass timing vs analytic |
Model load tries several relative paths (models/…,
../models/…) so the same binary works from repo root or
build/. Default search order prefers h24, then h48×24, h64, h48.
building with cursor
I did not hand-write most of this. I wrote spec.md — constraints,
data formats, accuracy floor, swap interface — and drove the implementation
through Cursor agents in a single long session over one day. My job was picking
tradeoffs and reading benchmark output; the agent's job was generating code,
running builds, fixing compile errors, and iterating until the numbers moved.
why this project fits agents well
- Closed loop with objective metrics. Record, train, bench, play. Every hypothesis ends in a number: microseconds, grounded %, tunnel count. Agents can run the loop without me context-switching into five tools.
- Small surface area. ~2k lines of C, one MLP, one training script. Fits in context. Refactors don't require understanding a whole engine.
- Spec as source of truth. Fixed dt, no pytorch at runtime, 98% one-step floor, binary export format — agents had concrete acceptance criteria instead of "make it feel good."
- Embarrassingly parallel experiments. "Try 7³ patch." "Add int8 export." "Sweep h12–h32." Each is a prompt + a bench run, not a week of manual porting.
spec.md as the contract
Before any code landed, I wrote a spec: observation layout, record format, model binary layout, accuracy floor, swap interface. That file became the agent's acceptance test. When rollout looked bad, I didn't argue about feel — I pointed at spec constraints ("no analytic fallback at runtime") and asked for new metrics. When the agent added collision fallback to hide drift, I rejected it against the spec and the real problem surfaced.
Concrete fields in the spec — 90-byte rows, 7-float outputs, fixed 1/60 dt — meant the agent could generate matching C structs, Python loaders, and export code in one pass without me hand-aligning offsets.
prompts that actually moved the needle
Short, outcome-oriented prompts worked best. The agent had shell access, so "make it faster" became "Release build, fuse norm, re-bench" in one turn. Examples, paraphrased from the session:
- "Implement the full pipeline from spec.md: record mode, train.py, C inference, Tab toggle." — greenfield in ~2 hours of agent time.
- "305 µs is too slow. Optimize inference without changing accuracy." — Release flags, fused norm, single-pass pack; ~50× in one iteration.
- "Try shrinking the patch to 7³, then 3³. Sweep and bench each." Touched common.h, crop logic, re-export, full metric table.
- "Remove analytic fallback, pure neural only." One function edit, rollout truth exposed.
- "Training sweep takes forever. Vectorize the loader." 500k-row parse dropped from minutes to 0.3 s.
- "Game segfaults on launch after AVX." gdb trace, scoped SIMD to neural.c, no raylib rabbit hole.
- "Bench numbers don't match playtesting." Found sample_in[729] overflow in timing loop.
I rarely pasted stack traces or file paths. The agent searched the repo, ran commands, and reported numbers back. My input was mostly constraints and yes/no on tradeoffs.
what a typical iteration looked like
Prompt → agent edits C/Python/CMake → agent runs cmake --build →
agent runs --bench or train.py --sweep → I read the
table → next prompt. Examples from this project:
First neural build was Debug, unoptimized, with runtime normalization passes. One prompt: Release flags, fuse norm into weights, single-pass feature pack. Agent rebuilt, re-benched. ~50× faster in one turn. I didn't profile manually — the agent ran the before/after.
I asked to try 7³, then 3³. Agent updated common.h, crop logic in
train.py, re-exported models, updated load order in main.c,
ran the sweep. Entire patch-size pivot in one conversation thread without me
touching individual files.
Game crashed intermittently after adding global AVX flags. Agent bisected with
gdb, found crash at BeginDrawing, traced to stack misalignment from
-march=native on main.c, scoped AVX to neural.c
only. I would have blamed raylib for hours.
First training script parsed 500k rows in a Python loop, reloaded the dataset four times per sweep. Agent vectorized loading (~0.3 s), cached data across models, bumped batch size, added ETA logging. Stable sweep dropped to ~13 min on CPU with visible progress each epoch.
Early 3³ benches reported suspiciously good numbers. Agent found
sample_in[729] writing past a 39-element array in sim.c
during forward-pass timing. One-line fix, re-benched, rollout numbers finally
matched playtesting (phasing, drift).
what I still had to do
Agents are fast at implementation; they don't replace judgment calls:
- Remove analytic fallback — agent added collision fallback when rollout looked bad. I said no: pure neural or nothing. That made the real rollout problem visible.
- Pick the default model — h12 is fastest; h24 is most stable on long runs. Agent would have kept optimizing latency; I chose rollout over raw µs.
- Native 3³ re-record — cropping center 3³ from old 9³ data was a hack. I asked for honest recording at PATCH_D=3. Accuracy similar, rollout different, old weights discarded.
- Read the game — benches don't capture feel. I playtested, reported phasing, agent updated the blog and pointed at tunnel metrics. Subjective + objective together.
iteration speed (rough)
| task | wall time | who |
|---|---|---|
| initial game + analytic physics + recorder | ~1 hr | agent (from spec) |
| pytorch train + float export + C inference | ~1 hr | agent |
| Release + AVX + int8 + patch sweeps (9³→7³→3³) | ~2 hr | agent + my bench reviews |
| 500k sample re-record (native 3³) | <5 sec | agent runs exe |
| int8 stable sweep (4 models, 40 epochs) | ~13 min | agent runs train.py |
| debug segfault / bench overflow / train loader | ~30 min each | agent diagnose + fix |
| blog + benchmark docs | ~20 min | agent draft, I correct metrics |
where time actually goes
Cursor compresses typing and glue, not GPU/CPU training. A typical experiment loop:
- ~30 sec — agent edits 2–5 files, rebuilds Release binary
- <5 sec — re-record 500k rows if observation shape changed
- ~13 min — train + export 4 int8 variants (the real bottleneck)
- ~10 sec — bench all models, paste comparison table
- ~2 min — me: read table, pick next hypothesis, send one prompt
Nine patch/architecture pivots in one afternoon is feasible because steps 1, 2, and 4 are agent-owned and step 5 costs me about two minutes. I was never waiting on myself to write matmul kernels or debug CMake. I was waiting on epoch 37 of four parallel MLPs, but that runs unattended while the agent logs ETA.
Compare to solo dev without an agent: same afternoon might cover "get pytorch export matching C struct layout" and one broken bench run. Here that was the first hour, and the rest was optimization and experimentation.
human vs agent split
| agent owned | I owned |
|---|---|
| C/Python/CMake edits, SIMD kernels, binary I/O | spec constraints, accuracy vs latency tradeoffs |
| cmake --build, --record, --bench, sweeps | rejecting analytic fallback, picking h24 default |
| gdb segfault bisect, buffer overflow find | playtesting feel, calling out phasing |
| blog drafts, metric tables, reproduce commands | correcting one-step vs rollout confusion |
End-to-end from empty repo to playable neural swap with documented benchmarks: one afternoon. Without Cursor that is easily a week of setup, training pipeline boilerplate, and inference export glue — work I didn't want to write by hand.
starting assumptions
These were the constraints I set before any optimization work began. They shaped every later decision.
- Fixed dt only. 1/60 s everywhere. No variable timestep in training or inference.
- Pure C MLP inference. No pytorch, no BLAS dependency at runtime. Weights load from a binary blob.
- Analytic teacher. physics_step() generates labels. The network learns to imitate it, not invent its own physics.
- One-step supervision. Train on single-tick transitions. Rollout drift is measured separately; it now drives default model selection, not just diagnostics.
- Swappable call site. Same game loop, same inputs, same observation packing. Only the step function changes.
- No collision correction at runtime. The network sets position and velocity directly. There is no analytic fallback. Small per-frame errors compound; the player can end up inside voxels.
- 98% grounded floor (one-step). Any model below that on held-out validation is rejected. Rollout stability is a separate, harder filter for picking the default.
Full writeup of the original spec lives in spec.md in the repo.
See building with cursor above for how those constraints were enforced
in practice.
current models
Four int8 models remain in models/. All use native 3x3 observations
(500k recorded samples, 90-byte rows). Older 7x7, 9x9, float, and speed-ladder
(h12/h16/h32) checkpoints were removed after rollout testing.
| file | hidden | one-step | rollout | tunnel* | step |
|---|---|---|---|---|---|
| model_q8_h24_p3.bin (default) | 24 | 98.7% | ~89% | ~2.4% | ~0.95 us |
| model_q8_h48_p3.bin | 48 | 98.7% | ~80% | ~2.8% | ~1.8 us |
| model_q8_h64_p3.bin | 64 | 98.7% | ~73% | ~2.7% | ~2.9 us |
| model_q8_h48x24_p3.bin | 48x24 | 98.7% | ~78% | ~3.2% | ~2.4 us |
*tunnel = fraction of steps where the neural player overlaps solid voxels while
not grounded. Analytic physics: 0%. Measured on 50 episodes x up to 300 steps with
--bench.
optimization progression
Each checkpoint below is a real model variant we trained and benchmarked. For each one I list what we assumed would work, what actually drove the numbers, and what we learned before moving to the next step.
Model: hand-written physics_step()
Latency: ~0.08 us per tick
assumption
A few arithmetic ops plus three axis-separated collision checks would stay effectively free compared to any neural path. This is the target we are trying to approach, not beat.
what drives performance
- Fixed small constant work: accel, friction, gravity, jump, three axis moves
- No allocations, no branches beyond collision resolution
- Compiler inlines everything at -O3
result
Confirmed. Analytic physics uses roughly 0.0005% of a 60 FPS frame budget (~0.08 us out of ~16,667 us). Every neural model is measured as a multiple of this number.
Input: 741 dims (729 voxels + 12 scalars)
Architecture: 741 -> 128 -> 128 -> 7
Latency: ~305 us full step (~3800x analytic)
One-step accuracy: 99.2% grounded, pos RMSE 0.00253
assumption
A 9x9x9 local patch gives enough spatial context to learn collision and jumping. A two-hidden-layer 128x128 MLP has enough capacity to match the teacher on one-step labels. Accuracy first, speed later.
what drives performance
- Matmul dominates. ~741x128 + 128x128 + 128x7 multiply-adds every tick
- Debug build. No -O3, no SIMD, naive dot products
- Runtime normalization. Separate mean/std pass before layer 1, denorm after layer 3
- Two-pass feature packing. Build observation struct, then pack to floats
- Float32 weights. Full bandwidth on every weight read
result
Accuracy was good immediately. Speed was not playable-adjacent without optimization. The 9x9 patch assumption held for accuracy but was overkill for this simple geometry. We had proof the swap worked; next goal was inference.
Input: 741 dims
Architecture: 741 -> 64 -> 64 -> 7
Latency: ~6 us full step (~75x analytic), forward ~4.9 us
One-step accuracy: 99.1% grounded, pos RMSE 0.00324
assumption
Most of the 305 us was implementation overhead, not fundamental compute. Release build, smaller hidden layers, and a faster inference path would get us into single-digit microseconds without retraining on a smaller patch.
what drives performance (changes applied)
- Release build: -O3 -ffast-math -fno-math-errno
- Fused normalization: input/output mean and std baked into weights at load time. No runtime norm arrays.
- Single-pass packing: pack_player_features() writes the 741-float vector directly from world state
- Unrolled scalar matmul: 8-wide manual unroll, restrict pointers
- Smaller network: 64x64 shrinks the middle matmul 4x vs 128x128; total multiply-adds drop ~2.2x because the 741-wide first layer only halves
result
~50x speedup from checkpoint 1, mostly from compiler flags and removing redundant normalization passes. Still ~75x slower than analytic because the 741-wide first layer dominates. Patch size was now the obvious next lever.
Input: 741 dims
Architecture: 741 -> 32 -> 7 (one hidden layer)
Latency: ~3.2 us full step (~40x analytic), forward ~2.7 us
One-step accuracy: 99.0% grounded, pos RMSE 0.00465
assumption
We could drop the second hidden layer entirely and cut neurons to 32 without losing playability. AVX2 FMA would accelerate the still-large 741-input first layer.
what drives performance (changes applied)
- AVX2 FMA kernels: 8-wide SIMD dot products on all layers
- Single-layer MLP: eliminates 64x64 middle matmul entirely
- Row-wise AVX: we tried input-major layout for layer 1 first; it broke rollout accuracy. Row-wise AVX on output-major weights was the fix.
- -mavx2 -mfma compile flags
result
Another ~2x win, but the 741-input first layer is still the bottleneck. One-step accuracy held above 99%. Input-major AVX was a useful negative result: layout matters as much as SIMD when in_n is large and out_n is small.
Input: 355 dims (343 voxels + 12 scalars), cropped from 9x9 records
Architecture: 355 -> 16 -> 7 int8
Latency: ~2.1 us full step (~26x analytic), forward ~1.8 us
One-step accuracy: 98.83% grounded, pos RMSE 0.00571
assumption
We do not need 9x9 context for this parkour. Cropping to 7x7 halves input dims without re-recording the dataset. int8 quantization cuts weight memory bandwidth. A 16-neuron single layer is enough if we accept slightly lower accuracy.
what drives performance (changes applied)
- Smaller patch: PATCH_R 3 (7x7x7). Training crops center 7x7 from existing 792-byte records.
- int8 weights: per-output-row scale factors, int8 dot products in C
- Pre-fused normalization at export: int8 models skip runtime mean/std entirely
- 16-neuron width: smallest hidden width that cleared 98% grounded in sweep
- -march=native: CPU-specific instruction selection
result
Accuracy was not the issue: h16 cleared the 98% floor at 98.83%. The
combined patch shrink + int8 + tiny hidden layer brought the step down to ~26x analytic.
These 7x7 models are historical only; they were removed from models/
once native 3x3 training landed.
Input: 39 dims (27 voxels + 12 scalars)
Architectures: h12, h16, h24, h32 single-layer int8 MLPs
Best latency: ~0.75 us full step (h12), ~0.52 us (h16)
One-step accuracy: 98.55% to 98.67% grounded (all above floor)
assumption
Immediate neighbors are enough for this world. Platforms are 2-4 voxels wide, gaps are small, and the player only needs to know what is directly adjacent plus velocity and input state. The failure signal would be one-step grounded accuracy dropping below the 98% floor.
what drives performance (changes applied)
- 39-input first layer: down from 741. First matmul is ~19x fewer ops than original.
- Native 3x3 dataset: 500k samples recorded at PATCH_D=3 (record_size=90). No crop step.
- int8 sweep: python tools/train.py --sweep trains h12 through h32
- Feature pack is cheap: 27 voxel reads + 12 scalars. Small relative to matmul at h24+.
result
3x3 held on one-step metrics. h12 was the expected latency winner (~0.75 us full step), yet h16 oddly measured faster on full step (~0.52 us) despite a larger forward pass. All four cleared 98%. Playtesting and rollout benchmarks showed one-step numbers were not enough to pick a default.
Input: 39 dims (27 voxels + 12 scalars)
Models kept: h24, h48, h64, h48x24 int8 on 3x3 patch
Default: model_q8_h24_p3.bin (~0.95 us, ~89% rollout grounded)
One-step accuracy: 98.65% to 98.74% grounded on kept models
Playtest reality: ~2.4% of airborne benchmark steps still overlap solids (293 / 12408). You can feel this as occasional phasing. Analytic: 0 overlaps.
assumption
Multi-step drift is the real failure mode. A model that looks fine on held-out one-step labels can still tunnel through voxels after hundreds of ticks. Wider hidden layers, a two-layer MLP, or a 5x5 patch might improve rollout without giving back too much latency.
what drives performance (changes applied)
- Rollout benchmark: 50 episodes x 300 steps, analytic vs neural in parallel, grounded agreement tracked
- Stable sweep: --sweep-stable trains h48, h64, and 48x24 two-layer models
- 5x5 patch experiment: tried larger observation windows; h24 on 3x3 still won rollout
- Model cleanup: old 7x7 and 9x9 models removed from models/
- Default selection: h24 balances ~0.95 us step cost with best rollout among kept models
result
Bigger is not always better. h48x24 (two-layer) hits 98.74% one-step but only ~78% rollout at ~2.4 us. h64 is ~73% rollout at ~2.9 us. h48 is ~80% at ~1.8 us. h24 stays the default: ~89% rollout, ~0.95 us, 98.67% one-step. Still ~12x analytic. Wider models (h48, h64, 48x24) did not improve rollout; a 5x5 patch re-record and retrain also lost to h24 on 3x3. The remaining gap is structural: open-loop prediction with no post-step collision fix.
where time goes (current 3x3 h24 model)
At 39 inputs and 24 hidden neurons, the full neural step breaks down roughly as:
- pack_player_features(): 27 voxel lookups + 12 scalar writes.
- Layer 1 (39 -> 24 int8): 936 int8 multiply-adds with per-row scale. Dominates forward pass (~0.76 us).
- Layer 2 (24 -> 7 int8): 168 multiply-adds. Minor.
- State apply: write pos delta, velocity, grounded flag. No collision pass.
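The multiply-add counts quoted here and in the earlier checkpoints are just products of adjacent layer sizes, which makes them easy to sanity-check. A tiny helper (the name mlp_macs is mine):

```python
def mlp_macs(dims):
    """Multiply-adds for one forward pass through dense layers of the given sizes."""
    return sum(a * b for a, b in zip(dims, dims[1:]))


print(mlp_macs([39, 24, 7]))         # 1104  (936 + 168, the current default)
print(mlp_macs([741, 128, 128, 7]))  # 112128 (the first 9x9 model)
print(mlp_macs([741, 64, 64, 7]))    # 51968
```

Raw counts do not translate directly into microseconds, since packing, scaling, and memory traffic all add overhead, but they do show why shrinking the input from 741 to 39 was a bigger lever than any hidden-width change.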
AVX2 FMA lives only in neural.c (not whole-program -march=native);
global AVX flags caused stack misalignment crashes when calling raylib from main.
Compared to analytic (~0.08 us), the network still does hundreds of multiply-adds per tick. We removed the speed-ladder models (h12/h16/h32) from disk; they were faster on paper but worse on rollout.
benchmarks
| checkpoint | input | arch | neural step | one-step | rollout | tunnel | vs analytic |
|---|---|---|---|---|---|---|---|
| analytic baseline | - | - | ~0.08 us | 100% | 100% | 0% | 1x |
| model.bin (unoptimized) | 741 | 128x128 f32 | ~305 us | 99.2% | - | - | ~3800x |
| model_fast.bin | 741 | 64x64 f32 | ~6 us | 99.1% | - | - | ~75x |
| model_tiny.bin | 741 | 32 f32 | ~3.2 us | 99.0% | - | - | ~40x |
| model_q8_h16 (7x7, removed) | 355 | 16 int8 | ~2.1 us | 98.8% | - | - | ~26x |
| model_q8_h24_p3 (default) | 39 | 24 int8 | ~0.95 us | 98.7% | ~89% | ~2.4% | ~12x |
| model_q8_h48_p3 | 39 | 48 int8 | ~1.8 us | 98.7% | ~80% | ~2.8% | ~22x |
| model_q8_h64_p3 | 39 | 64 int8 | ~2.9 us | 98.7% | ~73% | ~2.7% | ~36x |
| model_q8_h48x24_p3 | 39 | 48x24 int8 | ~2.4 us | 98.7% | ~78% | ~3.2% | ~30x |
lessons
one-step accuracy lies
Every checkpoint looked fine on held-out one-step validation (98-99% grounded). Early writeups sometimes reported those numbers as if they were long-run rollout scores. They are not. Rollout benchmarks (50 episodes x 300 steps, analytic vs neural in parallel) show ~89% grounded agreement for the default h24 model. Errors compound: position drifts, the 3x3 patch sees the wrong voxels, and the player phases through blocks (~2.4% of airborne steps in bench). Analytic physics has zero such overlaps.
why old builds felt more stable
Crop-trained weights from the first 3x3 sweep are gone; we re-recorded native 3x3 data and retrained. One-step metrics stayed similar but rollout changed. There is no analytic collision fallback (removed on purpose). If an older build felt smoother, it may have been different weights, not a magic architecture change. The current default is the best rollout model we still have on disk.
patch size mattered more than hidden width
Going from 741 to 355 to 39 inputs gave larger wins than any matmul kernel tweak. The 9x9 assumption was safe but expensive. 3x3 was the risky bet that paid off for this specific world generator.
implementation details compound
Release flags, fused normalization, single-pass feature packing, and int8 export each gave meaningful gains alone. Stacked together they turned a research prototype into something that runs at ~0.95 us (h24 default) without changing the fundamental fact that matmul is slower than analytic collision code.
bigger models do not fix rollout
h48, h64, and 48x24 all land within about 0.1 points of h24 on one-step metrics (98.65% to 98.74%) but score worse on rollout and tunnel counts. A 5x5 patch experiment did not beat h24 on 3x3 either. More neurons or context within this MLP formulation does not solve open-loop drift. Next levers: multi-step training, or a minimal post-neural depenetration pass (nudge out of solids without full analytic physics).
analytic will always win on raw speed
The goal was never to beat analytic. It was to get neural close enough that the swap is a viable design option. At ~12x and well under 0.01% of a 60 FPS frame budget, we are there for latency. Rollout fidelity is the open problem.
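The frame-budget claim is quick to check by hand. A minimal sketch using the measured step times from this post; only the 60 FPS frame length is derived:

```python
FRAME_US = 1_000_000 / 60  # one 60 FPS frame, in microseconds (~16,667)


def frame_share_percent(step_us):
    """Percentage of a 60 FPS frame consumed by one physics step."""
    return 100.0 * step_us / FRAME_US


neural_share = frame_share_percent(0.95)   # default h24 model
slowdown = 0.95 / 0.08                     # neural step vs analytic step
```

Here neural_share comes out to about 0.0057%, comfortably under 0.01%, and slowdown is about 11.9, which is where the ~12x figure comes from.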
reproduce
cd build && cmake .. -DCMAKE_BUILD_TYPE=Release && cmake --build .
./voxel_parkour.exe --record 500000 --out ../data/train.bin
(cd .. && python tools/train.py --sweep-stable --epochs 40)   # run from repo root; h48, h64, h48x24 on 3x3 data
./voxel_parkour.exe --bench ../models/model_q8_h24_p3.bin
./voxel_parkour.exe
Tab toggles analytic and neural physics mid-game. N reloads the model.
Game loads model_q8_h24_p3.bin first, then falls back through
h48x24, h64, and h48. Training takes ~13 min for the stable sweep on CPU;
--log-every 5 prints ETA during training.
what is next
Rollout is ~89% on the default model, not perfect. Phasing still happens in play despite good one-step numbers. Likely next experiments: multi-step training loss (teach stable trajectories, not single frames), or a lightweight depenetration clamp after the neural step (not full analytic fallback). Residual learning (predict a delta on top of analytic) is another path.