Background Agents

When you select rows in the spreadsheet and hit run, Tern spins up one agent per row. Each agent gets a sandboxed checkout of your repo and follows the task instructions you assigned to that row. You can run a single row to test, a filtered batch, or an entire PR’s worth of rows at once.

What Happens During a Run

Inside its sandboxed checkout, each agent works through your task step by step:

  1. Reads the task instructions attached to its row.
  2. Executes each step against the file, using context from the spreadsheet (file path, stored variables from earlier steps).
  3. Runs any validation commands you’ve defined (e.g., run: npm test {file}).
  4. If validation fails and max_retries is set, the agent sees the error and tries again.
  5. Context from ## Store blocks carries between steps, so the agent can gather information in one pass and use it in the next.

See Task Instructions for the full syntax: steps, validation, retries, and stored context.
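To make the loop concrete, here is a hypothetical sketch of what a row's instructions might combine. The ## Store block, the run: validation line, and max_retries appear in this document; the step headings, their wording, and the {old_import} variable are invented for illustration and are not Tern's confirmed syntax:

```
## Step 1: Find the deprecated import
Locate the line where {file} imports the legacy logger.

## Store
old_import: the exact import line found in step 1

## Step 2: Replace it
Replace {old_import} with the structured logger, updating call sites.

run: npm test {file}
max_retries: 2
```

If step 2's validation fails, the agent sees the npm test error output and retries up to twice, with {old_import} still available from the earlier step.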

When the agents finish, Tern commits the changes to a branch and opens a pull request. If you organized rows into named PRs in the spreadsheet, each group gets its own branch and its own PR, scoped to exactly the files you planned, ready for review.

That’s the whole loop: plan in the spreadsheet, hit run, get a PR back.

Results

The output of a run is a grid: every row you selected × every step in your task. Each cell shows whether that step succeeded for that row, how long it took, and what the agent did. You can click into any cell to read the full agent conversation.

This is where iteration happens. When 180 of 200 files pass and 20 fail, you can inspect the failures, see that they all hit the same edge case in step 2, fix your task instructions once, and re-run just the failures. One improvement applies everywhere.

Results flow back into the spreadsheet as columns: pass/fail status, timing, and cost (tokens and dollars).
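As a rough illustration of the rows-by-steps grid, it might render something like the sketch below. The file names, step counts, and values are hypothetical, made up purely to show the shape of the output:

```
file                  step 1   step 2   status   duration   cost
src/auth/login.ts     pass     pass     pass     41s        $0.12
src/auth/token.ts     pass     fail     fail     2m 10s     $0.31
```

Clicking the failing step-2 cell would open that agent's full conversation for inspection.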

Iterating with Evals

You want to improve your task instructions without breaking what already works.

Golden files solve this. A golden file is a file from a previous successful run, locked to the git SHA it came from, so it preserves the codebase exactly as it was at the time of that run.

An eval run takes your current task instructions and runs them against the golden files at their original SHA. Because task instructions are versioned implicitly on every change, an eval answers a single question: does this version of my instructions still produce correct results on files that previously passed?

Eval runs are ephemeral; nothing changes in your repo. You’re scoring the task, not modifying code.

What You Can Measure

Success. Did the transformation apply correctly? Eval runs compare against your golden results to catch regressions.

Timing. How long did each step take? If a step takes 10 seconds on one file and 2 minutes on another, something’s different about those files.

Cost. Tokens and dollars spent on the run.