Under the hood

How the coach actually works.

The app doesn't just log your workout. It decides every set for you, and it gets the call right by learning who you are. Here it is two ways: the plain version, and the one with the formulas.

It learns you, not the average person

Most gym apps are spreadsheets you type into. This one does the thinking: it tells you exactly what to lift, set by set. A generic program assumes you're the average lifter. This coach builds a picture of you instead: how strong you are on each movement, how fast you recover, how you respond to a hard week. Every set you log sharpens that picture.

Why it tells you this exact weight

It watches your honest hard sets and keeps a running, smoothed estimate of your real strength on each lift. Recent sets count for more than old ones, so the estimate moves with you as you get stronger. When it hands you a weight, it's reading that estimate, not pulling a number out of the air.

It remembers what you tell it

Say “left shoulder felt tweaky” after a set. The coach files it away. Next time you train that movement, it pulls the memory back up and works around it. It's not just tracking numbers; it remembers the words too, and brings back the ones that matter.

It gets sharper every session

After each workout a learning engine recomputes everything: are you progressing or stalling on this movement? Recovered enough to push, or still cooked from last time? It even grades its own past advice: if it keeps prescribing a rep count that you keep beating, it notices and corrects itself. And the bar it holds you to only ever ratchets up as you get stronger. It never quietly lowers the target.

1. You do the set
2. It reads your strength
3. It updates its model of you
4. It plans your next set

…and that runs on every single set.

It won't make things up

If the coach can't get a confident answer, it says so and asks. It never invents a number just to fill the silence. Every line it gives you is tied to something real from your training. No hype, no “you crushed it!”, just the honest read from someone who watched the session.

The shape of it

Three tiers. The iOS app runs the live session and calls the model; Supabase (Postgres + row-level security + Deno Edge Functions) is the source of truth and the learning engine; Claude is stateless coaching reasoning. All the learned state lives in Postgres; the model holds nothing between calls.

Your iPhone
SwiftUI app · live workout session · single-owner session coordinator · local-first cache · write-ahead outbox
Supabase
Postgres (RLS on) · Edge Functions = the learning engine (EWMA, recovery, phases, transfer) · per-user isolation
Claude
Stateless reasoning: program, per-day plan, per-set prescription, note classifier. No memory of its own.

The phone calls Claude directly for generation; the heavy learning math runs server-side, never on the client.

The memory (the “memory-chain RAG”)

Three layers get stitched into one JSON context object on every set:

  1. Vector RAG. A query built from the exercise + muscles is embedded (OpenAI text-embedding-3-small, 1536-dim) and run as a pgvector cosine search over your past workout notes, top 3, similarity ≥ 0.60.
  2. Trainee-model digest. A token-economical projection of the server-maintained model of you: plateau verdicts, the coach's own accuracy bias, cross-exercise transfer coefficients, fatigue interactions, active limitations.
  3. Session log. An append-only record of every set so far this session: the model's working memory.

The heavy reasoning (plateau, transfer, fatigue) is computed deterministically server-side and distilled into that digest. The LLM is a constrained final-mile reasoner reading pre-chewed context, not a black box doing the math.
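
A minimal sketch of the first layer and the stitching step, in Python for brevity. The real search runs in pgvector and the real context is assembled by the app; the function names and the context keys here are hypothetical, and the toy vectors stand in for 1536-dim embeddings. Only the top-3 cut and the 0.60 similarity floor come from the description above.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def retrieve_notes(query_vec, notes, top_k=3, min_similarity=0.60):
    """notes is a list of (text, embedding). Returns up to top_k (text, score), best first."""
    scored = [(text, cosine_similarity(query_vec, emb)) for text, emb in notes]
    kept = [item for item in scored if item[1] >= min_similarity]
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept[:top_k]

def build_context(retrieved, digest, session_log):
    """Stitch the three layers into one JSON-ready dict (key names are assumptions)."""
    return {
        "notes": [text for text, _ in retrieved],
        "trainee_model": digest,
        "session_log": session_log,
    }
```

Notes that fall under the similarity floor are dropped before the top-3 cut, so a session with no relevant history sends an empty notes list rather than three weak matches.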

Go deeper: what's actually in the digest

The digest is a request-time projection of the full server model, narrowed hard for token economy: per-pattern plateau verdicts, the coach's own prescription-accuracy bias cells, cross-exercise transfer coefficients (only those past R² ≥ 0.4 with ≥ 5 paired observations), cross-pattern fatigue interactions (only past confidence ≥ 0.7), active injury limitations, and weekly-fatigue / deload flags. The note store is pgvector with an HNSW cosine index; the session log is the model's primary working memory inside a single workout.
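
The narrowing can be pictured as a plain filter over the full model. This is a sketch under assumptions: the dictionary shape and field names are hypothetical, and only the three thresholds (R² ≥ 0.4, ≥ 5 paired observations, confidence ≥ 0.7) come from the text.

```python
MIN_R_SQUARED = 0.4
MIN_PAIRED_OBSERVATIONS = 5
MIN_FATIGUE_CONFIDENCE = 0.7

def project_digest(model):
    """Keep only the learned values that clear their publishing thresholds."""
    return {
        "transfer": [
            t for t in model["transfer"]
            if t["r_squared"] >= MIN_R_SQUARED and t["pairs"] >= MIN_PAIRED_OBSERVATIONS
        ],
        "fatigue_interactions": [
            f for f in model["fatigue_interactions"]
            if f["confidence"] >= MIN_FATIGUE_CONFIDENCE
        ],
        "limitations": [l for l in model["limitations"] if l["active"]],
    }
```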

The models & the contract

  • Claude Sonnet 4.5: per-set prescription (8 s budget) and 12-week program generation.
  • Claude Haiku 4.5: classifying free-text notes and post-workout insight bullets.
  • Output is prompt-instructed JSON, validated in Swift against hard ranges. There's no tool-use/structured-output API in the loop; an invalid or low-confidence response is a permanent error that surfaces a retry sheet, never a silently fabricated set.
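
The real validator is Swift; the same idea fits in a few lines of Python. The field names and ranges below are hypothetical placeholders, not the app's actual limits. The point is the shape: decode, check every field against a hard range, and raise a permanent error instead of patching the value.

```python
import json

class PermanentError(Exception):
    """The response can't be trusted; the caller shows a retry sheet."""

# Hypothetical fields and ranges, for illustration only.
RANGES = {"weight_kg": (0.0, 500.0), "reps": (1, 30), "rest_seconds": (15, 600)}

def parse_prescription(raw):
    """Decode the model's JSON and reject anything outside the hard ranges."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise PermanentError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise PermanentError("expected a JSON object")
    for field, (low, high) in RANGES.items():
        value = data.get(field)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise PermanentError(f"{field} is missing or not a number")
        if not low <= value <= high:
            raise PermanentError(f"{field}={value} is outside [{low}, {high}]")
    return data
```

There is deliberately no clamping branch: an out-of-range value is never nudged back into range, because that would be a fabricated set.
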
Go deeper: failure handling
  • Transient failures (429 / 502 / 503 / 504 / 529, dropped connections) retry with exponential backoff 1 → 2 → 4 → 8 s inside the 8-second budget. A decode or range-validation failure is permanent (no same-prompt retry) and surfaces a Retry / Pause sheet.
  • Program generation gets one extra guardrail: if the model returns valid JSON that breaks a domain rule (an empty training day, equipment your gym doesn't have), it's re-prompted once with the violation named, then throws rather than ship a broken plan. There's also a hand-rolled JSON-repair pass for malformed generation output.
  • A deterministic safety gate runs after the model: if a set is flagged for pain, rest is forced to ≥ 180 s regardless of the number the model returned.
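
A sketch of how the retry schedule and the safety gate might fit together. How the real client accounts for the 8-second budget isn't spelled out here, so the budgeting rule below (a wait is only taken if the wait plus the next attempt still fits) is an assumption, as are the function names and the rest_seconds key.

```python
TRANSIENT_STATUS = {429, 502, 503, 504, 529}
BACKOFF_SECONDS = [1, 2, 4, 8]
BUDGET_SECONDS = 8.0
MIN_PAIN_REST_SECONDS = 180

def is_transient(status_code):
    """Transient failures are retried; everything else fails loud."""
    return status_code in TRANSIENT_STATUS

def waits_within_budget(attempt_seconds, budget=BUDGET_SECONDS):
    """Backoff waits that fit in the budget, assuming each attempt takes attempt_seconds."""
    waits = []
    elapsed = attempt_seconds  # the first attempt has already been spent
    for delay in BACKOFF_SECONDS:
        if elapsed + delay + attempt_seconds > budget:
            break
        waits.append(delay)
        elapsed += delay + attempt_seconds
    return waits

def apply_safety_gate(prescription, pain_flagged):
    """Runs after the model: a pain flag forces rest up to at least 180 s."""
    gated = dict(prescription)
    if pain_flagged:
        gated["rest_seconds"] = max(gated["rest_seconds"], MIN_PAIN_REST_SECONDS)
    return gated
```

Under this accounting, half-second attempts leave room for the 1 s and 2 s waits only; the later steps of the schedule exist but the budget ends the loop first. The gate only ever raises rest, so a model answer of 240 s stays at 240 s.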

How the numbers are computed

  • Strength estimate (e1RM). Epley: e1RM = weight × (1 + reps/30), counted on top sets only (3–10 reps), so a high-rep burnout never inflates it.
  • Smoothing. An EWMA with α = 0.333 over the last 5 top sets, recent work weighted heavier. After a layoff, deload, or phase change it switches to “transition mode”: a 3-session mean that re-anchors fast instead of dragging stale numbers along.
  • Capability bands. Per movement pattern: a floor (round-down median of recent sessions, so it never overstates) and a stretch (the floor raised by a trend-based margin: 7.5% progressing, 4% plateaued, 2.5% declining).
  • The floor ratchets up, only. When you durably outgrow a band (current ≥ stretch + one full band-width), it re-derives the floor upward. It is monotonic: it never lowers a target or invents progress.
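
The first two bullets, sketched in Python. The real engine lives in the Edge Functions, so treat this as an illustration: the function names are hypothetical, a "top set" is reduced here to the 3–10 rep filter, and seeding the average with the oldest set in the window is an assumption.

```python
ALPHA = 0.333
WINDOW = 5

def epley_e1rm(weight, reps):
    """Epley estimate of a one-rep max."""
    return weight * (1 + reps / 30)

def is_top_set(reps):
    """Only 3-10 rep sets feed the estimate, so burnout sets are ignored."""
    return 3 <= reps <= 10

def smoothed_e1rm(sets, alpha=ALPHA, window=WINDOW):
    """sets is a list of (weight, reps), oldest first. Returns None with no top sets."""
    estimates = [epley_e1rm(w, r) for w, r in sets if is_top_set(r)][-window:]
    if not estimates:
        return None
    ewma = estimates[0]  # assumption: seed with the oldest value in the window
    for value in estimates[1:]:
        ewma = alpha * value + (1 - alpha) * ewma
    return ewma
```

For example, 100 kg × 5 gives an e1RM of 100 × (1 + 5/30) ≈ 116.7 kg, while a 20-rep burnout at the same weight is skipped entirely.
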
Go deeper: the parts that bite
  • Transition mode isn't a flag; it's a different estimator. On a calibration, a deload-end, a phase change, or a long layoff, the estimate collapses from the 5-set EWMA to a 3-session plain mean with Bessel-corrected variance. After a ≥ 28-day gap it goes further and trims the pre-gap sets out of the window entirely (the naive average was shown to move the estimate the wrong way after time off), and keeps re-anchoring until the old sets age out.
  • “Capability” is two numbers, not one. The coach's per-exercise estimate is the EWMA; the floor/stretch bands use a recent-window median instead (deliberately), so a single big day can't yank a target around. The round-down median is the conservative one.
  • The ratchet is structurally idempotent. Because the new floor is round-down(current), a just-recalibrated pattern always reads back below its own stretch, so it can't re-fire “achieved” until a fresh full band-width of real growth. The banner that surfaces it is watermark-armed, not a one-shot boolean.
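
The band and the ratchet, as a sketch. Several details are assumptions labelled in the code: the 2.5 rounding increment, band-width meaning stretch minus floor, and the stretch being floor × (1 + margin). The margins and the trigger (current ≥ stretch + one band-width) are from the text.

```python
import math
import statistics

MARGINS = {"progressing": 0.075, "plateaued": 0.04, "declining": 0.025}
INCREMENT = 2.5  # hypothetical rounding step

def round_down(value, increment=INCREMENT):
    """Round down to the nearest increment so a band never overstates."""
    return math.floor(value / increment) * increment

def derive_band(recent_values, trend):
    """Floor is the round-down median; stretch adds the trend-based margin."""
    floor = round_down(statistics.median(recent_values))
    stretch = floor * (1 + MARGINS[trend])  # assumption: margin is multiplicative
    return floor, stretch

def ratchet_floor(floor, stretch, current):
    """Raise the floor only after a full extra band-width of growth; never lower it."""
    band_width = stretch - floor  # assumption: band-width = stretch - floor
    if current >= stretch + band_width:
        return max(floor, round_down(current))
    return floor
```

With a floor of 100 and a stretch of 107.5, the trigger is 115. A current value of 116 re-derives the floor to 115, whose own trigger is then above 132, so the same 116 cannot fire the ratchet again.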

How it self-improves

Every completed session runs a deterministic, idempotent server-side pipeline:

  • Stimulus → recovery. Each set is classified neuromuscular / metabolic / both, feeding two recovery curves with different time-constants (τ ≈ 30 h neuromuscular, 12 h metabolic).
  • Plateau detection. A two-track verdict (e1RM spread + weekly volume) decides progressing / plateaued / declining, which drives phase advance, deload, and the stretch margin.
  • Cross-exercise transfer. A per-pair log-log regression learns how one lift carries to another, published only at ≥5 paired observations and R² ≥ 0.4. No textbook defaults: it waits for your data before it propagates.
  • The coach grades itself. It tracks the error between the reps it prescribed and the reps you actually hit, per movement, and feeds its own miscalibration back into the prompt. It's meta-coaching: the coach coaching the coach.
  • Confidence lifecycle. Each axis (exercise / pattern / muscle) advances forward-only as evidence accrues, so the app never claims more certainty than the data supports.
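
The transfer bullet is the easiest one to show in code. This is a plain least-squares fit in log space with the two publishing gates from the list; it is a sketch, not the server's implementation, and the function name and return shape are hypothetical.

```python
import math

MIN_PAIRS = 5
MIN_R_SQUARED = 0.4

def fit_transfer(pairs):
    """pairs is a list of (from_e1rm, to_e1rm). Returns (coef, intercept, r_squared),
    or None when the pair hasn't earned publication yet."""
    if len(pairs) < MIN_PAIRS:
        return None
    xs = [math.log(source) for source, _ in pairs]
    ys = [math.log(target) for _, target in pairs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None  # no variation, nothing to learn from
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    coef = sxy / sxx
    intercept = mean_y - coef * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    if r_squared < MIN_R_SQUARED:
        return None
    return coef, intercept, r_squared
```

Returning None rather than a default coefficient is the whole point: a pair with four observations, or a noisy fit, contributes nothing to the prompt.
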
Go deeper: the actual formulas
  • Recovery curve. readiness(t) = clamp(0, 1, 0.3 + 0.7·(1 − e^(−t/τ))) with a 0.3 residual floor, τ = 30 h for neuromuscular work and 12 h for metabolic. A set at RPE ≥ 9 is bumped to count as both, since under-counting fatigue is the one error it won't make silently.
  • Plateau verdict. Two tracks: an e1RM track (spread (max−min)/mean ≤ 2.5% over a frequency-scaled window, RPE-gated) and a weekly volume-load track (≤ 5% over the trailing 4 weeks). Aggregation is declining-wins.
  • Transfer regression. A log-log linear fit ln(toE1RM) = coef·ln(fromE1RM) + b per exercise pair. A Spearman rank check at ≥ 10 observations catches monotonic-but-non-linear pairs and widens the standard error additively by residual-stddev × 1.0. coef > 1 means the second lift (toE1RM) grows proportionally faster than the first (fromE1RM).
  • Self-grading buckets. Prescription accuracy is tracked per (pattern × intent) as rep-error over a 30-observation window, split into time-since-last-session buckets (< 48 h / 48–72 h / > 72 h) so fatigue-stacking shows up on its own. It only speaks up past 5 samples, and only when bias > 5% or RMSE > 10%.
  • Confidence gates. An exercise reaches “established” at ≥ 8 sessions and an e1RM coefficient of variation ≤ 7.5%; a pattern at ≥ 6 sessions with a data-backed trend; a muscle once two-thirds of its participating patterns are established. The ladder advances at most one rung per session and never regresses.
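
Two of these formulas translate almost directly. A sketch in Python, with hypothetical function names; the frequency-scaled window and the RPE gate on the e1RM track are left out, so the spread check below covers only the 2.5% threshold itself.

```python
import math

TAU_HOURS = {"neuromuscular": 30.0, "metabolic": 12.0}
E1RM_SPREAD_LIMIT = 0.025

def readiness(hours_since, kind):
    """readiness(t) = clamp(0, 1, 0.3 + 0.7 * (1 - e^(-t / tau)))."""
    tau = TAU_HOURS[kind]
    raw = 0.3 + 0.7 * (1 - math.exp(-hours_since / tau))
    return min(1.0, max(0.0, raw))

def stimulus_kinds(classified, rpe):
    """A set at RPE >= 9 counts as both kinds, whatever the classifier said."""
    if rpe >= 9:
        return {"neuromuscular", "metabolic"}
    return set(classified)

def e1rm_is_flat(values):
    """True when (max - min) / mean is within the 2.5% plateau spread."""
    mean = sum(values) / len(values)
    return (max(values) - min(values)) / mean <= E1RM_SPREAD_LIMIT
```

At t = 0 readiness sits on the 0.3 residual floor, and after one time-constant (30 h neuromuscular, 12 h metabolic) it has climbed to roughly 0.74.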

The hard parts

The intricacy isn't in the happy path; it's in what breaks. A few failures that shaped the design:

  • The day the app forgot 75% of someone's history. Lift history joins on a day's label. On a program regen the model minted fresh labels, the joins silently stopped resolving, and three-quarters of a user's history dropped out of the math. Fix: the day-label is now a stable per-user key the model is told to reuse, never reinvent.
  • Two screens, one session, no referee. Two workout views sat over a single session actor with nothing owning “which day is live.” The result was silent wrong-day completions and “Session Not Found.” Fix: a durable training_day_id plus a single ActiveSessionCoordinator that is the only owner of live/paused state, read from one atomic snapshot.
  • Retries that must not double-count. The outbox can replay a model update after a crash or a flaky network. A (user, session) primary key with ON CONFLICT DO NOTHING makes every replay converge to exactly one apply: idempotency in the schema, not in hopeful code.
  • It refuses to fake a number. Every learned value (transfer, fatigue, even confidence) waits for your data before it propagates, with no textbook defaults to fall back on. Paired with the no-silent-fallback rule, the whole system is built to say “I don't know yet” out loud rather than guess.
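
The replay guarantee can be shown with a composite primary key and ON CONFLICT DO NOTHING. This sketch uses Python's built-in sqlite3 as a stand-in for Postgres (SQLite accepts the same clause from version 3.24); the table and column names are hypothetical.

```python
import sqlite3

def open_store():
    """In-memory stand-in for the table that records applied model updates."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE applied_updates (
            user_id    TEXT NOT NULL,
            session_id TEXT NOT NULL,
            payload    TEXT NOT NULL,
            PRIMARY KEY (user_id, session_id)
        )
        """
    )
    return conn

def apply_update(conn, user_id, session_id, payload):
    """Returns True if this call applied the update, False if it was a replay."""
    cursor = conn.execute(
        "INSERT INTO applied_updates (user_id, session_id, payload) "
        "VALUES (?, ?, ?) "
        "ON CONFLICT (user_id, session_id) DO NOTHING",
        (user_id, session_id, payload),
    )
    conn.commit()
    return cursor.rowcount == 1
```

Replaying the same (user, session) any number of times leaves exactly one row, and the first payload wins. The caller never has to remember whether it already sent the update.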

Decisions I'd defend

  • The program is a queue, not a calendar. It advances on training events, not dates, so it never shames you for a “missed Tuesday.”
  • Learning runs server-side. One implementation of the math for everyone, no split-brain across app versions; updates are idempotent by a (user, session) primary key.
  • No silent fallbacks. Errors are typed transient (retry) vs permanent (fail loud). A failed AI call shows a retry sheet; it never substitutes a guess.
  • Anonymous auth + enforced RLS. No sign-up friction, but every row is scoped to auth.uid() and Edge Functions re-check the JWT owner.

Honest caveats

  • Generation runs on Sonnet 4.5, not Opus (some older comments say otherwise).
  • JSON is prompt-instructed and hand-validated, not API-enforced structured output.
  • Auth is alpha-grade: anonymous accounts don't survive a reinstall, the key isn't rotated yet, and there's no CAPTCHA. All of that is deferred to a real launch.
  • Prompt caching is wired up but doesn't yet fire on the per-set path.

Want the messier version, with the bugs and the dead ends? That's the whole log.