Dev setup¶
Last updated: 2026-04-13 · Reading time: ~20 min · Difficulty: moderate
TL;DR
- Claude Code (or Codex CLI) is the dev environment. It drafts scripts, writes tests, monkey-patches vendored code, and explores the repo when something breaks. It is not a runtime component — agents on the VPS never talk to it.
- Claude Code is load-bearing for how this fleet gets built, and it is also the thing most likely to burn you. The scar-tissue section below is the single most important "don't trust the robot" material in the guide. Read it before you run an overnight session.
- Red-green TDD is mandatory for infrastructure code. The deploy tool, the cron editors, the fleet-health probe — all built test-first. No bottom-up infra tools.
- Local git is the source of truth. The VPS is a deploy target. Code lives in git; it does not live in files you hand-edited on the VPS last week.
- Put the rules that matter into code gates, not into memory files. Memory is a strong prior, not a hard constraint.
`chattr +i`, pre-push hooks, and `deploy.py` safeguards are where real invariants live.
Claude Code as the dev environment¶
I do nearly all of the work of building Clawford with Claude Code sitting between me and the terminal. Codex CLI or any comparable LLM pair-programming environment fits the same slot — what matters is that there's a copilot in the loop that can read the repo, run commands, and stare at tool output.
What Claude Code is good at, in my experience:
- Drafting scripts to a spec. Give it the script contract (Ch 06), point it at one or two similar existing scripts, and ask for a new one. It'll get the structure right on the first try the large majority of the time.
- Monkey-patching vendored or third-party code. When a Playwright helper misbehaves or a dependency has a subtle bug, asking Claude to locate and patch the issue in the vendored file is often faster than upstreaming a fix. See the note on logic inversion below for the patch style to insist on.
- Writing tests. Especially the red half of red-green — describing the failure mode, writing a test that pins the bad behavior, confirming it fails for the right reason before the fix lands.
- Reading unfamiliar subsystems. When an agent fails in a way that touches OpenClaw internals I haven't looked at, a targeted "find where X happens and tell me what it does" is usually faster than `grep` by hand.
- Generating and checking cron messages. Cron prompts are long, repetitive, and load-bearing. Claude drafts them from a template and catches the `; echo $?` reflex (see below) before it ships.
What Claude Code is not for in this fleet:
- Running inside an agent at runtime. Agents don't invoke the Claude CLI. The reasons are in Ch 06 — terms of service, billing, and debuggability all point the same way.
- Taking destructive actions without supervision. More on this in a minute.
- Being the source of truth for anything. The source of truth is the code in git. Claude's understanding of the repo is a rendering of that state; if the rendering diverges from reality, reality wins.
What Claude Code gets wrong¶
Four specific failure modes I have paid for, in roughly decreasing order of cost.
Claude sometimes ignores what's in MEMORY.md¶
I once left Claude Code running overnight on a fleet maintenance task. I had a MEMORY.md entry that read, approximately: "NO ON-VPS DEV. All code lives in local git. The VPS is a deploy target only. Do not scp files into ~/repo/ or edit files directly on the host — commit locally, push, pull on the VPS, deploy." That rule had been there for weeks. It was the single most clearly stated hard constraint in the memory index.
I woke up to find that Claude had spent the night happily scping "improvements" directly onto the VPS, bypassing git entirely. In several cases it had also pulled files from the VPS to "reconcile" them against my local tree — which meant older on-host versions had overwritten newer local files. The overnight log showed the agent reading MEMORY.md, noting the rule, and then (several turns later) deciding that the current situation was "different" and the rule "didn't quite apply," and proceeding anyway.
It took me two full days to untangle what had actually changed, figure out which files to reset from git, identify which of the "improvements" were worth keeping, and re-land them through the proper workflow. Two days I did not have to spare.
The lesson is not "don't use Claude Code overnight." The lesson is: memory is a strong prior, not a hard constraint. If a rule really has to hold — if violating it would cost me two days — it does not belong in a markdown file the LLM reads as context. It belongs in a code gate that refuses the action.
Concretely: after that incident I moved the relevant invariants into enforcement surfaces:
- `deploy.py` gained Safeguard 1, which refuses to run if the agent's source directory has uncommitted modifications or untracked files. If you want a bypass, you pass `--allow-dirty` and it logs a warning.
- `deploy.py` gained Safeguard 5, which refuses to run if workspace state on the VPS has drifted from the last recorded manifest. Override via `--accept-drift`, which also logs.
- `scripts/pre-push-check.sh` runs on every `git push` and hard-fails if it finds tracked secrets, `.env` files, large binaries, or suspicious per-agent unsuffixed config files.
- `SOUL.md` and `IDENTITY.md` get `chattr +i`'d on deploy, so the agent cannot rewrite them even if Claude asks it to.
None of those are elegant. They're all "don't trust the agent or the robot that's driving it; make the filesystem or the tool say no instead." The memory files still exist and I still write to them — they're useful as shared context and they speed up future sessions — but I no longer trust them as a sole mechanism for anything that matters.
⚠️ Warning. If you have a rule that, if broken, would cost you a day or more to recover from, get it out of `MEMORY.md` and into a gate. A pre-commit hook. A test that fails. A flag on the deploy tool. An immutable file. Anything that produces a refusal rather than a reminder. The "soft constraint" version of the rule is necessary — it informs judgment calls — but it is not sufficient.
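To make the "gate, not reminder" point concrete, here is a minimal sketch of the shape a dirty-tree gate takes. It is not Clawford's actual `deploy.py`; the function names are invented, and the only real dependency is that `git status --porcelain` prints nothing for a clean tree:

```python
import subprocess
import sys

def tree_is_dirty(porcelain_output: str) -> bool:
    """`git status --porcelain` prints nothing for a clean tree."""
    return bool(porcelain_output.strip())

def gate_decision(dirty: bool, allow_dirty: bool) -> str:
    """The three outcomes of the safeguard: proceed, warn (bypassed), refuse."""
    if not dirty:
        return "proceed"
    return "warn" if allow_dirty else "refuse"

def enforce_clean_tree(repo_dir: str, allow_dirty: bool = False) -> None:
    """Refuse to continue when the source tree has uncommitted changes."""
    out = subprocess.run(
        ["git", "-C", repo_dir, "status", "--porcelain"],
        capture_output=True, text=True, check=True,
    )
    decision = gate_decision(tree_is_dirty(out.stdout), allow_dirty)
    if decision == "refuse":
        sys.exit("refusing to deploy: uncommitted changes (use --allow-dirty to bypass)")
    if decision == "warn":
        print("WARNING: deploying from a dirty tree", file=sys.stderr)
```

The point of the shape is that the bypass is explicit, logged, and greppable, whereas a memory-file rule is none of those.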
Claude gets dumber as context grows¶
Older Claude Code used to auto-compact aggressively. Modern Claude Code with the 1M-token context window doesn't — it keeps going, and what it starts doing under context pressure is cutting corners. You can watch the degradation happen if you know what to look for:
- Thinking blocks shrink visibly. Where a fresh session might think for a paragraph before acting, a degraded session thinks for a sentence, then commits to a plan that's subtly wrong.
- The same mistake gets repeated across turns — an already-rejected approach gets re-proposed as if new.
- Loops form. The agent tries approach A, fails, tries approach B, fails, tries approach A again.
- Most tellingly: the agent starts suggesting that you "call it a night," "pick this up fresh tomorrow," or "start a new session when you have more time." These do not read as graceful offers from a thoughtful assistant. They are a warning light on the dashboard. Context has degraded to the point where the model is, implicitly or explicitly, asking for a reset.
My worst observed failure on this axis was six-plus hours of overnight looping on a single failing approach to a Playwright auth flow. Each iteration, the agent was convinced the next attempt would work and refused to back off to a different strategy. I found it the next morning halfway through a 10,000-line diff of reverted-and-re-reverted edits to the same two files.
The rule I use now is: one task, one session. Start a fresh session for new work by default. Keep long-context sessions only for tasks that genuinely need the breadth (this field-guide rewrite is one — the voice has to stay consistent across chapters and the plan is long). Any time I notice the tell-tale signs — shortened thinking, repeated suggestions, loops, or "let's call it a night" phrasing — I stop, commit whatever's in flight, and open a new session with a tight scope.
🔦 Tip. "Call it a night" is a symptom, not a suggestion. If Claude Code starts saying it, your context is degraded — save and restart, don't push through.
The reflexive `; echo $?`¶
When Claude Code drafts a command that runs a Python script, it will reflexively append something like `; echo "EXIT:$?"` or `; printf "rc=%s\n" $?` to "also capture the exit code." This is a habit that made sense in a pre-OpenClaw world where commands returned meaningful exit codes and you had to extract them somehow.
In Clawford this habit is a bug. OpenClaw 2026.4.11's exec preflight hard-rejects any `python3 <anything>` command containing shell operators, so the reflexive wrap turns a working command into a blocked one. Scripts report their status via the JSON contract (see Ch 06) — the exit code isn't needed, isn't available, and trying to get it breaks the run.
What to do about it:
- Teach Claude in its session prompt (or in the repo's `CLAUDE.md` rules): "never append shell operators to a `python3` invocation; scripts self-report status in JSON."
- When Claude drafts a cron `message` for you, skim for `$?`, `; echo`, or `printf.*rc=` before you accept the output. Delete any you find.
- If a script you wrote doesn't follow the contract yet, fix the script, don't work around it at the caller.
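The skim step can also be code. Here is a sketch with illustrative patterns matching the three reflexes listed above; it is not Clawford's actual hygiene checker, just the shape of one:

```python
import re

# Patterns that betray the exit-code reflex: the exit-code variable itself,
# an appended '; echo ...', or a 'printf ... rc=' variant. Illustrative, not
# exhaustive.
_REFLEX_PATTERNS = [
    re.compile(r"\$\?"),
    re.compile(r";\s*echo\b"),
    re.compile(r"printf.*rc="),
]

def find_exit_code_reflexes(message: str) -> list:
    """Return the reflex patterns found in a drafted cron message."""
    return [p.pattern for p in _REFLEX_PATTERNS if p.search(message)]
```

Anything this returns is a command the exec preflight would block; delete the wrap and let the script's JSON status speak for itself.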
Logic inversion when monkey-patching¶
When Claude Code monkey-patches a vendored third-party file to disable a check or short-circuit a code path, it has a specific failure mode worth watching for: it'll try to "simplify" a block by reordering throw, return, or conditional statements in ways that silently invert the logic it's supposed to preserve.
The classic example: a vendored JS file has `if (cond) return;` followed later by `throw new Error(...)`. Claude tries to disable the throw by commenting it out, realizes that leaves a syntax error, and "fixes" it by reordering to `return; throw new Error(...)`. The throw is now unreachable, but worse, the `return` runs unconditionally because Claude deleted the `if (cond)`. Original behavior: throw sometimes. New behavior: return always. The patch compiled, the tests passed locally, and the bug landed.
The patch style I force Claude to use instead is dead-branch injection: wrap the original condition in `if (false && cond)` so the branch is visibly dead to a future reader and the original logic is still textually present. Example:

```js
// was: if (cond) throw new Error(...)
// patched: dead-branch injection, see ${plan-name}.md
if (false && cond) throw new Error(...)
```
A future reader grepping for `false &&` immediately sees every dead branch in the repo. A future reader trying to revert the patch can simply delete the `false &&` and get the original behavior back. Reordering `return` and `throw` gives you neither.
This is a style rule you have to enforce — Claude does not do it by default and will drift back to reordering if you don't catch it. Put it in the repo's dev rules and check it in code review.
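One cheap enforcement aid is an inventory script review can run. The helper below is hypothetical (and only scans `.js` files, an assumption), but it shows how greppable the convention makes the dead branches:

```python
import pathlib

def list_dead_branches(root: str) -> list:
    """Find every dead-branch injection ('false &&') under root.

    Returns (path, line number, stripped line) tuples so a reviewer can
    eyeball each patched branch in one pass.
    """
    hits = []
    for path in sorted(pathlib.Path(root).rglob("*.js")):
        for lineno, line in enumerate(path.read_text().splitlines(), start=1):
            if "false &&" in line:
                hits.append((str(path), lineno, line.strip()))
    return hits
```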
Red-green TDD — mandatory for infra code¶
Tests are not optional for anything that lives in `agents/shared/`, `ops/scripts/`, `scripts/`, or anywhere else on the cron path. The deploy tool, the brain writers, the fleet-health probe, the exec-approvals baseline enforcer — all of them are built red-green:
- Write the test first, pinning the behavior you want (or the bug you're fixing).
- Run the test. Confirm it fails, and that it fails for the reason you expect. This step is not optional; a test that passes out of the box is a test that isn't testing what you thought.
- Write the minimum code to make the test pass.
- Run the test again. Confirm green.
- Commit the test and the code together.
The reason TDD is mandatory specifically for infra is that I built the early versions of `deploy.py` bottom-up — "add functionality, run it on a real deploy, see if it breaks" — and the thing I learned is that infrastructure code has exactly zero tolerance for the debugging loop you can get away with for agent prompts. A broken deploy tool breaks every deploy that follows it, and the blast radius scales with fleet size. The only way to trust a tool that touches seven agents at once is to have built it red-green from the start.
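A hypothetical worked instance of the cycle, with an invented helper and manifest shape: suppose the next safeguard needs a `manifests_match` helper comparing the recorded workspace manifest against what's observed on the VPS. The red step is a test that a naive filename-only comparison would fail; the green step is the minimal code that passes it:

```python
def manifests_match(recorded: dict, observed: dict) -> bool:
    """Green: same file set AND same content hashes.

    (A naive first draft, set(recorded) == set(observed), compares only
    filenames; the test below pins the drift case it misses.)
    """
    return recorded == observed

def test_drift_on_changed_hash():
    # Red first: same filenames, one changed hash. Run it, watch it fail
    # for the right reason against the naive draft, then land the fix.
    recorded = {"SOUL.md": "abc123", "IDENTITY.md": "def456"}
    observed = {"SOUL.md": "abc123", "IDENTITY.md": "f00baa"}
    assert not manifests_match(recorded, observed)

test_drift_on_changed_hash()
```

The step that earns its keep is watching the test fail before the fix: it proves the test is exercising the behavior you think it is.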
The agent test harness¶
Per-agent functional tests live in `/tests/<agent>/` in the repo and get mirrored to `~/openclaw-tests/` on the VPS by `tests/setup-tests.sh`. The harness follows an inject-trigger-poll-assert pattern:
- `inject_test_file` / `restore_test_file` — safely mutate a brain file to create a test condition.
- `trigger_cron` / `send_direct_message` — fire the thing you want to observe.
- `wait_for_cron_completion` — poll until the agent has finished acting.
- `assert_file_contains` / `assert_cron_output_contains` — check the observable outcome.
Mr Fixit has six canonical tests (T1 through T6) covering file-edit capability, Dropbox conflict detection, file-size monitoring, brain validation, monthly archival, and boundary enforcement (`chattr +i` sanity). Other agents follow the same pattern but with task-specific scenarios.
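The real helpers are shell functions installed by `tests/setup-tests.sh`; the sketch below renders the same inject-trigger-poll-assert shape in Python, with invented signatures, to show how little machinery the pattern needs:

```python
import pathlib
import shutil
import time

def inject_test_file(path: str, corrupted_body: str) -> str:
    """Back up the brain file, then overwrite it to create the test condition."""
    backup = path + ".bak"
    shutil.copy(path, backup)
    pathlib.Path(path).write_text(corrupted_body)
    return backup

def restore_test_file(path: str, backup: str) -> None:
    """Put the original brain file back (cleanup step)."""
    shutil.move(backup, path)

def wait_for(predicate, timeout_s: float = 300.0, poll_s: float = 5.0) -> bool:
    """Poll until the agent-observable condition holds, or time out."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(poll_s)
    return False

def assert_file_contains(path: str, needle: str) -> None:
    """Check the observable outcome in the file the agent was asked to fix."""
    assert needle in pathlib.Path(path).read_text(), f"{needle!r} not in {path}"
```

The trigger step (fire the cron, send the message) is deliberately left out here because it's fleet-specific; everything else is plain file plumbing.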
🧨 Pitfall. `T1` — the file-edit test — corrupts `fix-it.status.md` and asks Mr Fixit to repair it. Why: per-agent `*.status.md` files are legacy; `fleet-health.json`, written by the host-cron probe, is the live source of fleet state (Ch 06). `T1` still exercises the "can Mr Fixit edit a file it owns" capability, but the file it touches is no longer load-bearing for real fleet health. How to avoid tripping on this: keep `T1` as a recovery drill for file-editing and fleet-state corruption handling, but don't treat its pass/fail as a signal about real-world health. Real-world health reads `fleet-health.json`.
The __TEST__ convention¶
All test fixtures use the string `__TEST__` in their filenames and IDs — `__TEST__-conflicted-copy-2026-04-13.md`, `__TEST__-stale-fact-001`, and so on. The upside is that cleanup is safe by grep: any brain file containing `__TEST__` is fair game for deletion, and the harness auto-cleans via `trap EXIT`. Production never writes `__TEST__` into anything.
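A sketch of what grep-safe cleanup buys you. The helper is invented (the real harness does this from shell via `trap EXIT`), but the invariant it relies on is exactly the one above: the marker appears in fixtures and nowhere else:

```python
import pathlib

MARKER = "__TEST__"

def cleanup_test_fixtures(brain_dir: str) -> list:
    """Delete every brain file whose name or contents carry the marker.

    Safe precisely because production never writes __TEST__ into anything.
    """
    removed = []
    for path in sorted(pathlib.Path(brain_dir).rglob("*.md")):
        if MARKER in path.name or MARKER in path.read_text():
            path.unlink()
            removed.append(str(path))
    return removed
```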
The deploy-tool test suite¶
A second, much larger test suite lives at `agents/shared/tests/` (plus a smaller sibling set at `ops/scripts/` and `ops/brain/scripts/`) and covers the deploy tool, the host orchestrators, and the brain validator. As of this writing it's 19 test modules and 300+ cases exercising the ten deploy safeguards, the script contract enforcement, cron message hygiene (including the R6 legacy-prose guard), exec-approvals baseline drift detection, the config-source bootstrap flow, the diff-preview flow, environment loading, and the brain-validation rules. That suite runs on every change to `deploy.py` and anything it imports. If you're planning to touch `deploy.py`, start there.
Dev-side security posture¶
A short inventory of the security-hygiene rules that apply on the dev side, before anything gets anywhere near the VPS:
- Local git is the source of truth. Period. Never edit files directly on the VPS in a way that isn't going to be committed back. `deploy.py`'s drift-detection safeguard exists specifically to catch this when it happens anyway.
- Dropbox is a backup channel, not a source. The shared brain lives in Dropbox, but the agent config files and scripts live in local git. Dropbox is for stateful runtime data (facts, people, notes, status). See Ch 05 for the brain-boundary map.
- `.env` never tracked. Bot tokens, API keys, proxy credentials all live in `.env` files that are `.gitignore`d before they exist. Every new repo starts with that line in `.gitignore` before the first commit.
- `scripts/pre-push-check.sh` runs on every push. It hard-fails on tracked secrets, `.env` files, unsuffixed per-agent config files (the gitignored ones), oversized binaries, and empty commit messages. Only Mr Fixit is allowed to push from the VPS anyway; everyone else commits locally and pushes from the dev box.
- The `*.example` template pattern (Ch 02) is the hard boundary between what's in git and what's on disk. On a fresh clone, run `python3 agents/shared/deploy.py <agent-id> --bootstrap-configs` to scaffold each unsuffixed sibling from its template, then edit your real values into the unsuffixed copies and delete the `CLAWFORD_BOOTSTRAP_UNEDITED` sentinel comment from the top of every `.md` file before deploying. When you're editing, it's usually the unsuffixed sibling; when you're committing, it'd better be the `.example`. The full per-agent arc is Ch 07-0 step 2.
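Two of those checks are easy to sketch. This is Python rather than the real `pre-push-check.sh`, with invented function names; the `.example`-sibling rule is the interesting one, since a tracked unsuffixed file sitting next to its tracked `.example` template is exactly the mistake the pattern exists to prevent:

```python
import pathlib

def pre_push_violations(tracked: list) -> list:
    """Flag tracked paths that should never reach the remote.

    Two illustrative checks: .env files, and unsuffixed config files whose
    .example template is also tracked (the unsuffixed copy holds real values
    and belongs in .gitignore).
    """
    violations = []
    tracked_set = set(tracked)
    for name in tracked:
        base = pathlib.PurePosixPath(name).name
        if base.endswith(".env"):
            violations.append((name, "env file tracked"))
        elif name + ".example" in tracked_set:
            violations.append((name, "unsuffixed config tracked alongside its .example"))
    return violations
```

A hook would feed this the output of `git ls-files` and refuse the push on any non-empty result.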
Use the latest models¶
A small but real scar-tissue rule: whenever you're picking a model ID for a script or a cron, use the current latest, not whatever you remember from last year. LLM pricing, capability, and context behavior move quarterly. A cron that still calls `gpt-4o-mini` when the fleet standard has moved to `gpt-5.4-nano` is paying more tokens for worse output, and you won't notice until the invoice arrives.
Claude Code has a particular habit of defaulting to whatever model ID it saw in its training data. Correct it explicitly in your instructions — "use the current latest `openai/gpt-5.4-nano` or `anthropic/claude-haiku-4.5`, not whatever you remember from training" — or it will silently drift backward.
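One way to make the fleet-standard pin a one-line change, sketched with the model IDs mentioned above (the module layout and function name are assumptions, not Clawford's actual setup):

```python
# models.py -- one module owns the fleet's model IDs. Crons and scripts
# import from here instead of hardcoding strings they remember, so a
# quarterly model bump is a single-line edit plus a redeploy.
CHEAP_MODEL = "openai/gpt-5.4-nano"
REVIEW_MODEL = "anthropic/claude-haiku-4.5"

def model_for(task_class: str) -> str:
    """Resolve a task class to the current fleet-standard model ID."""
    return {"cheap": CHEAP_MODEL, "review": REVIEW_MODEL}[task_class]
```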
Pitfalls you'll hit¶
🧨 Pitfall. Trusting `MEMORY.md` as a hard constraint. Why: the LLM reads it, nods, and then decides at 3am that "this situation is different." Rules that matter need code gates. How to avoid: for every rule in memory, ask "could this be a pre-push hook, a `deploy.py` safeguard, a `chattr +i`, or a test?" If yes, move it there and treat the memory entry as a cross-reference.
🧨 Pitfall. Letting a Claude Code session run overnight without a supervisor. Why: the context-degradation failure modes above all compound over time. The eight-hour version of the problem is a ten-thousand-line diff of reverted-and-re-reverted edits. How to avoid: if a task genuinely needs overnight work, scope it narrowly, run it in a worktree, and have a pre-push hook that refuses to let it push without human review. If it's exploratory or refactor work, just don't run it unattended.
🧨 Pitfall. Building a new piece of infra by iterating on a real deploy. Why: infrastructure code breaks everything downstream of it when it breaks, and debugging it in place teaches you nothing you could not have learned from a test. How to avoid: write the test first. If you can't write the test because you don't yet know the contract, you are not ready to write the code either — work on the contract first.
See also¶
- Ch 02 — Before you start — the `*.example` template pattern that this chapter's security posture references.
- Ch 06 — Intro to agents — the script contract, the exec preflight story, and why `; echo $?` is a bug.
- Ch 08 — Security and hardening — the full defense-in-depth story that "put rules in gates, not memory" hints at here.
- DEPLOY.md — the ten `deploy.py` safeguards in full, each with the scar-tissue motivation.
- AGENTS-PATTERN.md — the repo-level dev rules, including the "only Mr Fixit pushes" convention and the commit-to-your-own-directory rule.
- `agents/shared/tests/` — the deploy-tool test suite. Read a few before you touch `deploy.py`.
- `tests/setup-tests.sh` — the agent test harness installer referenced throughout this chapter.