MDx OS

The Closed Loop

65 days of building in the open.
A hypothesis about autonomous software development.
And what I actually learned.

by [Author] · March 2026 · A Real Talk Publication · 14 min read

Movement 01

65 Days

This publication is not about a finished product. It's about a 65-day experiment... and the hypothesis that drove it.

A note: this is stream of consciousness, written across a few early mornings. I prioritized getting the learnings out over polishing every sentence. References to Phase 14 or Phase 18 won't mean much without context, and that's fine. You'll get the thread.

Since January, I've been building in public...early mornings, weekends, flights. Not as a hobby. To test a thesis about where AI-native software development is heading, and what the infrastructure underneath it needs to look like.

Across those 65 days, I built MDx Twin, MDx Code, MDx OS, MDx Message, Forge, Pulse, Ella... a set of interconnected products, each one teaching me something the last one didn't. The experience compounds. What I know about context management from building Twin shaped how I designed the knowledge layer in Code. What I learned about real-time infrastructure from Message informed how I thought about the process supervisor in the autonomous pipeline. None of this happened in isolation. It's all one continuous build.

MDx Code...the focus of this piece...is where that accumulated understanding led me to a specific question I wanted to answer: can you close the loop entirely? From intent to production, with a human at the checkpoints but not in the middle?

The short answer, after Phase 18, is: yes. The closed loop now runs. But I want to be honest about what that means, what it doesn't mean, and where the real work still lives.

Twin

MDx Twin → The Cognitive System

Multi-agent cognitive twin. Five agents, parallel execution, three-tier knowledge architecture. Where the idea of "orchestration as the real problem" first crystallized.

Code

MDx Code → The SDLC Experiment

Started as an interactive coding assistant. Evolved, phase by phase, into a hypothesis about autonomous software development. Phase 18 showed the loop closes. Still an experiment.

MDx OS → The Operating Layer

The shared infrastructure underneath everything else. Governance, knowledge, orchestration, observability. The reason the individual apps don't feel like islands.

Message

MDx Message → The Communication Primitive

Real-time Rust relay, Stream Fabric OS primitive, agent-native participant model. Built in 11 sessions across 4 waves.

Ella / Pulse

Ella + Pulse → Applied Intelligence

Agentic wellness benefits coach and biometric intelligence layer. Where the OS abstractions get tested against real domain complexity.

Movement 02

The Starting Point

MDx Code started as a coding assistant. Useful...but still fundamentally interactive. I described what I wanted, it wrote code, I reviewed it, I deployed it. The human in the loop at every step.

That's how Cursor works. Copilot. Every AI coding tool on the market right now...the AI accelerates the human while the human remains the engine. I kept pulling on a different thread: if the AI can write the code, why can't it also plan the approach, review its own work, test the results, handle failures, and ship the outcome? Why am I copy-pasting between a plan document, a terminal, a browser, and a deployment dashboard? Why am I the glue?

The answer isn't that the AI can't do it. The answer is that I haven't found the system that replaces the glue. The models are capable. The tools exist. The orchestration...the thing that connects intent to deployed software without a human threading every needle...that's the missing piece. So I decided to try building it, one phase at a time, using everything I'd already learned from Twin, OS, and the rest of the ecosystem.

I also removed the IDE from my workflow partway through. Not as a stunt. As a constraint that forces the right frame. When you can't open a file in an editor, you stop thinking about code as something you read and start thinking about it as something the system produces. You focus on the inputs (specs, plans, constraints) and the outputs (working software, passing tests, clean deployments). The middle becomes the machine's problem. That constraint changed how I designed every module from Phase 13 onward.

"Why am I the glue? If the AI can write the code, plan the approach, test the results, and handle failures...why am I still threading every needle?"

Movement 03

The Pipeline

Phase 18 built the autonomous SDLC pipeline. Here's what it actually does...and what it doesn't do yet.

The current state: the closed loop runs end-to-end through the Claude Code CLI. That's what Phase 18 proved. The human touches two points...approving the plan, and reviewing SQL before it hits production. Everything in between is automated. The next step, which I haven't taken yet but is close, is connecting this to the MDx Code CLI directly...a relatively straightforward wiring exercise now that the pipeline modules exist and are tested.

There's also a web UI version of MDx Code that diverged during the build...a three-panel workspace interface built in Phase 14. It exists, it's built, and I haven't seriously tested it in this context yet. The CLI loop is proven. The web interface is a parallel track that needs its own validation.

Here's what the pipeline does today:

Intent → Production

Human checkpoints at plan approval and SQL review. Pipeline handles the rest.

//plan phase-N "Theme"

The GovernanceEngine checks that required artifacts exist. The GapTriageEngine scans the plan for holes...missing edge cases, unaddressed dependencies. Then the ExpertPanel runs: 15 simulated domain experts review the plan in parallel, challenge it from different angles, surface risks. The output is a blocking gate. The plan doesn't advance without a structured expert review document. Human reads it, approves or adjusts. One decision point.

→ GovernanceEngine · GapTriageEngine · ExpertPanel (15 personas) · Human approval

//run phase-N

The ProcessSupervisor creates an integration branch, generates self-contained session briefs, and spawns Claude Code workers in isolated git worktrees. Parallel waves. Each worker reads its brief from disk, builds its piece, reports completion through a sentinel contract...a specific string the Supervisor detects in stdout. If a worker gets stuck, the AutonomousTriageRelay intervenes: up to three attempts with one model, then two more with a different model for fresh perspective. If it's still stuck after five attempts, the session gets marked BLOCKED and the wave continues without it. One stuck worker never stops the phase.

→ ProcessSupervisor · SessionBriefGenerator · MultiModelRouter · AutonomousTriageRelay

Auto-finalize

When all waves complete, the pipeline generates a phase summary, the SQLConsolidationEngine pulls all migrations into a single file, compliance checks run, and status moves to PENDING_DEPLOY. No human intervention between //run and here.

→ SQLConsolidationEngine · ComplianceChecks · PhaseCloseEngine

SQL review → //close

The one place I insist on human eyes: production database migrations. The pipeline has already consolidated all SQL into a single reviewed file. I read it. I run it. Then //close merges the integration branch to main and triggers deployment.

→ Human reviews SQL · //close merges to main · Deployment triggered

Why Workers Don't Share Memory

Each worker reads its brief from disk, builds its piece, and reports back. No shared context between workers. This is deliberate...when an AI's context window compacts mid-build (and it will, especially across long sessions), the worker recovers by re-reading its brief from the filesystem. The truth lives on disk, not in memory. The filesystem is the source of truth.

The ten modules Phase 18 produced...GovernanceEngine, GapTriageEngine, ExpertPanel, SessionBriefGenerator, ProcessSupervisor, MultiModelRouter, AutonomousTriageRelay, SQLConsolidationEngine, PhaseCloseArchiveEngine, and CLI integration...each one handles a specific class of problem. They're composable. Failures in one module don't cascade. The pipeline is the integration of all ten, and that integration is where the real difficulty lived.

⚙️ GovernanceEngine

Checks required artifacts exist before anything starts. First gate. Nothing advances without a valid foundation.

S128 · 29 tests

🔍 GapTriageEngine

Scans the plan for holes before the expert panel runs. Structured auto-triage rubric. Catches what you didn't think of.

S130 · 18 tests

🧠 ExpertPanel

15 domain expert personas in parallel, modeled after real engineering archetypes. They challenge the plan from different angles. Blocking gate before execution.

S131 · 26 tests · 15 personas

🖥️ ProcessSupervisor

The black box. PTY spawn, isolated git worktrees, wave orchestration, stdout monitoring, auto-triage on stuck workers.

S133 · 31 tests

🔀 MultiModelRouter

7 TaskTypes, routing table, fallback logic. Triage switches models after three failures for genuinely fresh perspective.

S134 · 18 tests

🚨 AutonomousTriageRelay

Analyze, escalate, inject. Up to 5 attempts before marking BLOCKED. Clean injection into stuck worker context.

S135 · 28 tests

Movement 04

The 80/20

The last 20% of this system took 80% of the effort. Anyone who's built distributed systems knows this...but knowing it doesn't make it easier.

The architecture was designed in a day. Each individual module...built in a session or two. Clean, testable, modular. Then I tried to connect them, and the integration boundary revealed everything it had been hiding.

The ProcessSupervisor spawns real operating system processes...terminals, PTY sessions, Claude Code instances running in isolated directories. And the failure modes were a different class of problem than anything in the modules themselves:

PTY Startup Race Condition

Worker launches, produces no output for 30 seconds...because the CLI wasn't ready when the kickoff prompt was sent. The process existed. The PTY was open. The application layer wasn't listening yet. Explicit readiness checks and a handshake sequence fixed it, but diagnosing it required reading process logs character by character.

Stale Worker Branches

Worktrees created from an old HEAD because the integration branch hadn't been updated before workers were spawned. Every worker built from a stale baseline, then all branches conflicted at merge. The fix was obvious after the fact: create integration branch, update to latest HEAD, then spawn. Integration failures always look obvious after the fact.

Session Number Collisions

Two different phases both had a session labeled S128 internally. The naming convention had a gap, and when the pipeline referenced historical sessions across the full phase history, it pulled the wrong one. A namespace problem that only surfaces when you have enough phases that session numbers overlap globally.

Asyncio Event Loop Pollution

Tests passed individually. Failed when run together. The event loop from one test suite leaked into the next, leaving the async runtime in a state that caused downstream tests to hang. Non-deterministic...depended on execution order. The kind of failure that makes you question your confidence in the entire test suite until you find it.

Context Compaction Silent Waits

Workers that appeared stuck were actually waiting on a prompt swallowed during context compaction. Context window compressed, pending prompt was lost, process sitting there waiting for input that was never going to arrive. The filesystem-based brief recovery was designed partly for this...but diagnosing the first instance meant reading PTY output stream by stream to understand what the worker was waiting for.

The module is code. The orchestration is systems engineering. They're different disciplines, and they fail in different ways. Every one of these bugs was an integration-boundary failure...the kind that only shows up when real processes interact with real filesystems under real timing constraints. If you're building something like this...budget 4x what you think for the integration phase. I certainly should have.

"The module is code. The orchestration is systems engineering. And the integration boundary between real OS processes running AI agents in isolated worktrees is where you find out what you didn't know you didn't know."

Movement 05

The Thesis

Here's what I believe is happening, and why this experiment matters beyond the pipeline itself.

LLMs are getting better faster than most people expected. At the same time, they're a bigger black box than ever. Frontier labs are still building tools to understand how their own models work. The systems generating our code are systems we cannot fully inspect or explain. And yet... the code they generate works. Not always, not perfectly, but increasingly and reliably enough to ship to production.

I believe we are on a path from "intent" to "production" with no human in the middle...for some teams, perhaps as early as next year, and for safety-critical systems, a few more years out. But directionally...that's where this goes. The entire build-test-deploy cycle is becoming more of a black box, and the question isn't whether to accept that. The question is how to design the safety mechanisms that make a black box build process trustworthy.

Review Isn't the Safety Mechanism

Humans don't review code as well as we think we do...we miss things, we get tired, we rubber-stamp PRs because we trust the author. Code review is as much a social ritual as it is a quality gate, and it can't possibly keep pace with AI generating thousands of lines per hour across parallel workers.

So the answer isn't "humans review harder." It's to design the system so that review isn't the primary safety mechanism...governance gates, expert panels, sentinel contracts, idempotency guarantees, triage systems that handle failure without human intervention. Structural safety rather than procedural safety. That's what the pipeline is attempting. Phase 18 provides early evidence it's tractable...but it's not a proof.

"Design the system so that review isn't the primary safety mechanism. Safety should be structural...governance gates, sentinel contracts, idempotency guarantees, triage systems. "

Movement 06

The Inversion

Most people think about AI coding tools as a way to help humans write code faster. That's a reasonable frame, but it's not the one I was working from.

The question I kept returning to: if you removed the human entirely, what system would you need? Design that system. Build it. Make it work end-to-end with zero human intervention. Then...and only then...ask where the human adds value.

That's how I arrived at two points. The human adds value at intent...what should we build, and does this plan reflect that? And at production safety...is this SQL safe to run against real data? Everything between those two points...the human was glue. Expensive, error-prone, context-losing glue. The pipeline replaces the glue with structure.

And that inverts the traditional SDLC in a way that changes what you optimize for. In the traditional model, you're optimizing for writing speed, code quality at the line level, PR throughput. In the inverted model, you're optimizing for plan quality, constraint specification, structural safety. The creative act moves earlier...the human's judgment matters most before the machine starts working, not during. And the output you evaluate is working software in production, not clean code in a PR.

"If you removed the human entirely, what system would you need? Design that system first. Then ask where the human adds value. "

Movement 07

On BMAD

A number of people I work with are experimenting with BMAD...and it's worth being precise about how what I'm building here differs, because they're genuinely different bets.

BMAD is an agent-based development framework. Specialized personas...PM, Architect, Developer, Scrum Master...structured artifacts, and guided phase handoffs that give the AI consistent context across a build lifecycle. You're still the engine...switching personas, feeding context, deciding what to build next. It's structured human-AI collaboration, and it's a legitimate approach. If your goal is to work with Claude Code or Cursor more effectively, BMAD is worth understanding.

What I'm experimenting with is a different question entirely. Not "how do we make the human more effective at directing AI?" but "how do we remove the human from the middle of the build process while keeping them at the decision points that actually require judgment?" Those are different hypotheses, not competing approaches to the same problem. BMAD makes the human a better conductor. The MDx Code pipeline is testing whether you need a conductor at all.

BMAD

Better Structured Human-AI Collaboration

Structured prompting methodology. Role-based persona switching. Human remains the engine...directing, reviewing, deciding at each step. The AI executes more coherently because the human's instructions are better organized. Solid approach. Makes you a more effective director.

MDx Code Hypothesis

Remove the Human From the Middle

Autonomous orchestration pipeline. GovernanceEngine, ExpertPanel, ProcessSupervisor, AutonomousTriageRelay...actual running code that spawns OS processes, monitors outputs, switches models on failure, and decides when to mark something BLOCKED. Human provides intent and reviews SQL. Pipeline handles the rest.

I'm not saying one is better. They're just different questions. If you're trying to ship faster with your current workflow, BMAD is more immediately useful. If you're trying to understand what a future looks like where the build process is largely autonomous and the human is a checkpoint rather than the engine...that's what I'm exploring here. Both are worth doing. They're just not the same experiment.

Movement 08

What's Next

I started building MDx Code because I wanted to understand whether the closed loop was possible. After 65 days...I have a provisional answer: yes, it closes. The loop runs. The pipeline produces working software with two human checkpoints and nothing in between.

That's not the same as saying this is ready for anyone else, or that it's hardened, or that the vision is proven at scale. It's an experiment that produced a result I found compelling enough to keep pushing on. The next step is connecting the pipeline to the MDx Code CLI directly...a relatively straightforward wiring exercise now that the modules exist and are tested. After that, I want to run a few more phases through the full loop and see where it breaks.

The human is doing what humans are best at: deciding what matters and judging whether the outcome serves the purpose. Everything between intent and result...that's what I'm handing to the machine. Phase 18 provides early evidence that's tractable.

I'm going to keep experimenting. And I'd like to bring more people along for that...engineers who are curious about this direction, willing to poke holes in it, and interested in what comes next when the loop gets tighter. If that's you, find me.

More soon.

~ [Author]
mdx.yourcompany.com · github.com/yourorg/mdx-code