GitHire

Case 01 · Real production incident

From one prompt
to a production incident,
in just 20 minutes.

22 lines of code, written by AI, passed local tests, were merged to main, and were deployed to production, where they overloaded Redis until requests came back as 503s. This is a retro of that real incident, walked through the GitHire six-step workflow to find which step was skipped, because that is the step where the incident was buried.

  1. 10:19 User submits the prompt
  2. 10:24 AI writes SCAN+HGETALL
  3. 10:39 Commit · opens PR
  4. 21:35 PR merged · deployed
  5. 23:54 Production incident · first firefight
  6. +1d 11:34 Redis SET fully replaces SCAN

Incident code: 22 lines · Firefight commits: 5 · Time to full fix: 25 hours

Why the workflow exists · Method

The speed at which AI writes code
has long outrun the speed humans review it.

The old engineering paradigm assumed 'humans write code, AI assists,' so review was the last step in the sequence — humans write → tests → review. In the AI-native era that line is reversed: AI writes, humans review; humans frame, AI executes.

The six-step workflow is not guardrails for AI — it is decision points for humans. The Issue is the framing decision; architect review is the direction decision; sandbox and AI review are two more 'still time to back out' decisions. The moment a decision point becomes ceremonial, AI's speed turns into the speed of incidents.

The case below is walked through all six steps:
what each step should be in the ideal, versus what it actually was on the day.

See the full workflow first ↗

STEP 01 / 06 · ISSUE

Ideal: spell out what you are solving, and for whom
Actual: one line in a chat box.

Real prompt · 2026-05-14 10:19 +0800 · Codex CLI · gpt-5.5

Right now the domestic-model check is done by prefix matching. I want to turn it into a field on model detail — I think there was already a made_in_china field.

We should use that field and make it more flexible and configurable. In CI we should also run search checks separately for the China site and the global site, to make sure:

  1. China site: the corresponding search endpoint only returns models that are made_in_china.
  2. Global site: no restriction, returns normally.

It is suggested to add a small E2E step in smoke tests to cover this.

STEP 02 / 06 · SANDBOX

Ideal: a long-running sandbox with real data
Actual: local repo, pytest all green.

Codex edits files in the local repo and runs pytest tests/test_site_mode.py -q; 103 test cases all pass. But pytest is wired to fakeredis or an in-memory mock with only a handful of test rows. In real production model_detail::* has dozens to hundreds of keys, and every user pings /api/site/config at front-end startup.

The value of a sandbox is not "the environment runs"; it is "the environment looks like production". Data scale, QPS, concurrency: leave out any one of them, and an implementation that passes locally can fall over once the magnitude jumps.
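A minimal sketch of a scale-sensitive check, assuming a hypothetical in-memory stand-in (`CountingStore`) instead of a real Redis client. It counts the commands a SCAN-plus-HGETALL lookup issues at test scale and at a production-like scale, so the difference shows up as a number rather than a green tick. The helper names and key counts are illustrative and are not taken from the incident repo.

```python
import fnmatch
from typing import Dict, List, Tuple


class CountingStore:
    """Hypothetical in-memory stand-in for Redis that counts issued commands."""

    def __init__(self) -> None:
        self.hashes: Dict[str, Dict[str, str]] = {}
        self.commands = 0

    def scan(self, cursor: int, match: str, count: int) -> Tuple[int, List[str]]:
        # Simplified cursor: an index into the sorted key list; 0 means "done".
        self.commands += 1
        keys = sorted(self.hashes)
        batch = keys[cursor:cursor + count]
        next_cursor = cursor + count if cursor + count < len(keys) else 0
        return next_cursor, [k for k in batch if fnmatch.fnmatch(k, match)]

    def hgetall(self, key: str) -> Dict[str, str]:
        self.commands += 1
        return dict(self.hashes.get(key, {}))


def seed_models(store: CountingStore, n: int) -> None:
    """Fill the store with n synthetic model_detail hashes."""
    for i in range(n):
        store.hashes[f"model_detail::m{i}"] = {
            "engine": f"m{i}",
            "made_in_china": "1" if i % 2 == 0 else "0",
        }


def scan_lookup(store: CountingStore) -> List[str]:
    """Same shape as the incident code: walk every key, read every hash."""
    ids: List[str] = []
    cursor = 0
    while True:
        cursor, keys = store.scan(cursor, match="model_detail::*", count=200)
        for key in keys:
            if store.hgetall(key).get("made_in_china") == "1":
                ids.append(key.split("::", 1)[1])
        if cursor == 0:
            break
    return sorted(ids)


def commands_per_request(n_keys: int) -> int:
    """Seed a fresh store with n_keys models and count commands for one lookup."""
    store = CountingStore()
    seed_models(store, n_keys)
    scan_lookup(store)
    return store.commands


if __name__ == "__main__":
    for n in (5, 80, 500):
        print(n, "keys ->", commands_per_request(n), "commands per request")
```

A budget assertion such as `assert commands_per_request(500) <= 5` would fail for this lookup while the same assertion on a five-row fixture looks harmless. Pipelining, as in the incident code, reduces round trips but not the number of commands the server has to execute.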

STEP 03 / 06 · EXECUTE

Ideal: AI completes the implementation inside the sandbox
Actual: five minutes in, 22 lines of patch are out.

wowchat/site_config.py · generated by AI · 10:24:27
def get_domestic_model_ids() -> List[str]:
    model_ids = []
    cursor = 0
    while True:
        cursor, keys = r.scan(cursor, match='model_detail::*', count=200)
        if keys:
            pipe = r.pipeline()
            for key in keys:
                pipe.hgetall(key)
            for key, detail in zip(keys, pipe.execute()):
                if not is_made_in_china_model_detail(detail):
                    continue
                model_id = detail.get('engine') or key.split('::', 1)[1]
                if model_id:
                    model_ids.append(model_id)
        if cursor == 0:
            break
    return sorted(set(model_ids))

@router.get('/config')
async def get_site_config(request: Request):
    return ResponseWrapper(result=SiteConfigModel(
        site_mode=resolve_site_mode(request),
        domestic_model_ids=get_domestic_model_ids(),  # ← runs on every request
    ))

STEP 04 / 06 · AI REVIEW

Ideal: a second AI reads the PR
Actual: this step did not happen.

From prompt to commit, the whole loop ran inside one Codex session; no second agent ever read the diff. Local tests passing became the only "go" signal.

If there had been an independent review agent with a system prompt that included "Flag any new Redis SCAN / KEYS / HGETALL across full keyspace in request-path code", this incident would most likely have been caught right here.
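The rule quoted above can also be approximated mechanically as a cheap pre-filter before a review agent reads the diff. This is a rough sketch, not the GitHire review step: the regex, the function name, and the sample diff are assumptions made for illustration, and a pattern this simple will also flag harmless calls such as `dict.keys()`.

```python
import re
from typing import List, Tuple

# Illustrative patterns only; a real rule set would be tuned to the codebase.
RISKY_CALLS = re.compile(r"\.(scan|scan_iter|keys|hgetall)\s*\(", re.IGNORECASE)


def flag_risky_redis_calls(diff_text: str) -> List[Tuple[int, str]]:
    """Return (diff line number, code) for added lines that call SCAN/KEYS/HGETALL."""
    findings: List[Tuple[int, str]] = []
    for lineno, line in enumerate(diff_text.splitlines(), start=1):
        # Only inspect added lines; skip the "+++ b/file" header.
        if not line.startswith("+") or line.startswith("+++"):
            continue
        if RISKY_CALLS.search(line):
            findings.append((lineno, line[1:].strip()))
    return findings


# A fragment shaped like the incident patch, used only as sample input.
SAMPLE_DIFF = """\
+++ b/wowchat/site_config.py
+        cursor, keys = r.scan(cursor, match='model_detail::*', count=200)
+                pipe.hgetall(key)
+    return sorted(set(model_ids))
"""

if __name__ == "__main__":
    for lineno, code in flag_risky_redis_calls(SAMPLE_DIFF):
        print(f"diff line {lineno}: {code}")
```

A textual match cannot tell whether the call sits on a request path; that judgement still belongs to the review agent or the architect.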

What AI review really is

Not "re-run the tests", but reading the same code with a different set of priors. A review agent should know performance, security, and maintainability patterns, so that it complements the generating agent rather than duplicating it.

STEP 05 / 06 · ARCHITECT · the key step

Ideal: a human architect calls the direction
Actual: this step was skipped.

From prompt to commit, 20 minutes total.
At no point did someone who knows the system look up and ask: "Every /api/site/config call scans all of model_detail::* — what does that mean?"

Questions the architect should have asked

  1. How many calls per minute does this endpoint take?
  2. How many model_detail::* keys today? Projected to how many?
  3. Could we maintain a set of "domestic model ids" explicitly and avoid runtime scanning?
  4. Is this change on a hot or cold path? Should it go through a cache?
  5. If Redis jitters or slows, will this call drag the upstream API down with it?
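Questions 1 and 2 turn into a number with one line of arithmetic. The sketch below is a back-of-envelope estimate, not a measurement: the request rate and total keyspace size are hypothetical, while the 80 and 500 model counts are the figures quoted in the rewritten Issue later in this retro.

```python
import math


def redis_commands_per_minute(requests_per_minute: int, total_keys: int,
                              matching_keys: int, scan_count: int = 200) -> int:
    """Rough estimate for a per-request SCAN + HGETALL lookup.

    SCAN walks the whole keyspace (MATCH filters after the walk), so the number
    of SCAN calls tracks total_keys; one HGETALL is issued per matching key.
    COUNT is only a hint to Redis, so treat the result as an approximation.
    """
    scans = max(1, math.ceil(total_keys / scan_count))
    return requests_per_minute * (scans + matching_keys)


if __name__ == "__main__":
    # Hypothetical traffic: 600 requests/minute against a 10,000-key Redis.
    for models in (80, 500):
        print(models, "models ->", redis_commands_per_minute(600, 10_000, models))
```

With these assumed inputs the estimate is 78,000 commands per minute at 80 models and 330,000 at 500, which is the kind of figure that makes question 3 the obvious next one to ask.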

Why AI won’t ask these on its own

AI sees the code context of the current repo. It does not see production QPS, Redis capacity curves, or the pitfalls this team has already hit. That is the "system-side context" only an architect has, which is exactly why this step has to be a human.

AI can write a correct implementation in 5 minutes; an architect only needs 30 seconds to recognise "wrong path". Those 30 seconds are the most expensive time in the whole workflow.

STEP 06 / 06 · PRODUCTION

Ideal: merge to main, ship cleanly
Actual: 23:54 that night, production alerted.

About two hours after deploy, monitors started firing: /api/site/config p99 jumped into seconds, Redis slow logs filled with SCAN and a flood of HGETALL, CPU pinned, and every other Redis-backed endpoint was dragged down too. First paint stalled; new users could not get in.

When the change shipped, no one thought anything was wrong: CI was all green, the PR description was clear, the AI-written code "looked tidy". That is the signature of this kind of incident: it passes every traditional gate.

Fix chain · 5 commits · 25 hours

No single shot lands the fix —
it is a chain of not-quite-cures.

  1. Incident +5 min

    Cache the result + move SCAN to a threadpool

    First reflex: keep SCAN off the main thread. The problem is postponed, not solved — the moment the cache expires, things spike again.

  2. +30 min

    Switch to in-memory cache + 60s background refresh

    SCAN drops from "every request" to "every minute". Blast radius shrinks, but startup still slams it.

  3. +90 min

    Drop background refresh · startup-only warmup

    Concede that online refresh is risky; retreat to scanning once at startup. But every restart still takes the hit.

  4. +6 hours

    Restore the handler · fix the caching contract

    Fix the endpoint behaviour so the fallback path does not introduce new bugs.

  5. +25 hours

    Maintain domestic ids in a Redis SET

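    This is the cure: write into an explicit set when models are added or removed, and read it with a single SMEMBERS call. Each request drops from a SCAN over the whole keyspace plus one HGETALL per model to one command whose cost depends only on the number of domestic ids. The SCAN path is removed from production code entirely.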

Each firefight "looked a little better than the last," but the full cure took 25 hours. Thirty extra seconds from the architect at Step 05 would have erased these 25 hours.
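The final fix in the chain can be sketched as follows. This is an illustration under stated assumptions, not the team's actual commit: the hook names (`on_model_saved`, `on_model_removed`) and the `InMemorySets` stand-in are hypothetical, and the set name is the example given in the rewritten Issue. The stand-in's methods mirror redis-py's `sadd`, `srem`, and `smembers`, so a real client could be passed in instead.

```python
from typing import Dict, List, Set

DOMESTIC_SET_KEY = "domestic_model_ids"  # example name from the rewritten Issue


class InMemorySets:
    """Hypothetical stand-in exposing the three set commands used below."""

    def __init__(self) -> None:
        self._sets: Dict[str, Set[str]] = {}

    def sadd(self, name: str, *values: str) -> int:
        target = self._sets.setdefault(name, set())
        before = len(target)
        target.update(values)
        return len(target) - before

    def srem(self, name: str, *values: str) -> int:
        target = self._sets.get(name, set())
        removed = len(target & set(values))
        target.difference_update(values)
        return removed

    def smembers(self, name: str) -> Set[str]:
        return set(self._sets.get(name, set()))


def on_model_saved(client, model_id: str, made_in_china: bool) -> None:
    """Write path: keep the set in sync whenever a model is added or edited."""
    if made_in_china:
        client.sadd(DOMESTIC_SET_KEY, model_id)
    else:
        client.srem(DOMESTIC_SET_KEY, model_id)


def on_model_removed(client, model_id: str) -> None:
    """Write path: drop the id when a model is taken down."""
    client.srem(DOMESTIC_SET_KEY, model_id)


def get_domestic_model_ids(client) -> List[str]:
    """Read path: one SMEMBERS call, no keyspace scan."""
    return sorted(client.smembers(DOMESTIC_SET_KEY))
```

The cost moves from the read path, which runs on every page load, to the write path, which runs only when a model changes. With a real redis-py client, `smembers` returns bytes unless the client is created with `decode_responses=True`.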

Rewrite · same spec

What an architect-readable Issue
actually looks like.

Original prompt · four things missing

BAD

Right now the domestic-model check is done by prefix matching. I want to turn it into a field on model detail…

Run search checks separately for the China site and the global site…

Suggest adding an E2E step in smoke tests.

  • ✗ Constraints
  • ✗ Non-goals
  • ✗ Verification
  • ✗ Architecture notes

Same spec · rewritten

GOOD

Goal Move the "domestic model" check from engine-prefix matching to a model_detail.made_in_china field — so adding or removing a model is a one-field change.

Constraints /api/site/config is a high-frequency endpoint; every user hits it on front-end startup. Scanning the Redis keyspace per request is not allowed. model_detail is ~80 entries today, projected to grow to ~500.

Non-goals Do not implement via SCAN / KEYS. Do not run work that scales with the keyspace on the request path. This change does not touch the front-end cache strategy.

Architecture notes Maintain an explicit Redis SET (e.g. domestic_model_ids); write on add/remove; reads are a single SMEMBERS call with no keyspace scan.

Verification China site search only returns made_in_china=1; global site unchanged; live /api/site/config p99 does not increase; smoke gets a new E2E assertion.

  • ✓ Constraints
  • ✓ Non-goals
  • ✓ Verification
  • ✓ Architecture notes
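The search half of the Verification line can be turned into a small smoke assertion. The sketch below assumes a hypothetical response shape (a list of dicts with `id` and `made_in_china` fields) and hypothetical site-mode values `"china"` and `"global"`; the real payload and mode names are not shown in this retro.

```python
from typing import Any, Dict, List


def non_domestic_results(results: List[Dict[str, Any]]) -> List[str]:
    """Return ids of results whose made_in_china flag is not 1 / "1" / True."""
    offenders: List[str] = []
    for item in results:
        flag = item.get("made_in_china")
        if str(flag).strip().lower() not in {"1", "true"}:
            offenders.append(str(item.get("id", "<missing id>")))
    return offenders


def check_site_search(site_mode: str, results: List[Dict[str, Any]]) -> None:
    """China site: every result must be domestic. Global site: no restriction."""
    if site_mode == "china":
        offenders = non_domestic_results(results)
        assert not offenders, f"China site returned non-domestic models: {offenders}"
```

In a smoke run, `results` would come from one search request per site. The p99 clause of the Verification line needs production metrics and cannot be covered by an assertion like this.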

The rewritten version is about 150 words longer. Those 150 words buy back 25 hours of firefighting and one production incident.
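A check for blank sections can be automated before an Issue ever reaches an agent. This is a minimal sketch, assuming the five headings used in the rewritten Issue and assuming each section starts its own line; the function name is hypothetical and this is not the Prompt Spec Skill itself.

```python
from typing import List

# Headings taken from the rewritten Issue in this retro.
REQUIRED_SECTIONS = ("Goal", "Constraints", "Non-goals", "Architecture notes", "Verification")


def missing_sections(issue_text: str) -> List[str]:
    """Return the required section names that no line of the Issue starts with."""
    lines = [line.strip().lower() for line in issue_text.splitlines()]
    return [
        name for name in REQUIRED_SECTIONS
        if not any(line.startswith(name.lower()) for line in lines)
    ]


if __name__ == "__main__":
    original_prompt = "Right now the domestic-model check is done by prefix matching."
    print(missing_sections(original_prompt))
```

A heading check only proves the sections exist. Whether the Constraints section carries real numbers is still for a human to judge.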

Install the six-section template as the Prompt Spec Skill ↗: read SKILL.md directly, or run npx skills add realRoc/skills --skill prompt-spec

Takeaways

This case teaches three things.

  1. AI coding speed forces review upstream.

    From prompt to prod-bound commit took 20 minutes. By the time "PR review" sees the problem, it is already too late: the real review has to happen at the prompt stage, through issue framing.

  2. A prompt without constraints / non-goals is a blank check.

    AI will not ask itself "what is the call frequency" or "how large is the data". Those numbers have to be put in the Issue by a human. Every blank in the spec is an entry point for the next incident.

  3. Architect review is the most expensive 30 seconds in the workflow.

    AI writes code in 5 minutes; the architect needs 30 seconds to recognise "wrong path". This step has to be a human — only humans have the system-side context (QPS curves, past incidents, capacity plans).