Metrics — Dynvo docs

Health score

0–100 · per feature & flow

What it means: A single composite score that rolls bug-fix ratio, churn, and recency-weighted regression density into one number. Recent bug fixes count 2x because last-quarter pain is where today's attention belongs.

Why it matters: It's the one number an EM can sort a backlog by. ≥70 is healthy (quietly works, so deprioritise), 50–70 is caution (watch list), below 50 is firefighting (actively bleeding velocity). Below 50 sits in the top decile of bug-fix density across every feature we've ever observed. One caveat we validated ourselves: health describes the PRESENT and barely predicts next quarter (AUC ~0.55 against future fixes in our calibration study). Sort by it to see today's pain; use Risk for forward-looking triage.

How it's computed: Empirically calibrated against ~2500 dev features across 17 trained repos. Pure git history: no LLM, no test runs.

Bug-fix ratio

0–1 · feature drill-down

What it means: The fraction of recent commits on a feature classified as fixes, matched by commit-message regex (fix:, bug, hotfix, regression).

Why it matters: Under 0.15 is normal maintenance noise; 0.15–0.30 means something is structurally off; above 0.30 means roughly a third or more of the commits here are firefighting. Industry-typical SaaS runs 8–15%, so anything sustained above that is structural, not transient. We removed this column from the main table after our own calibration study showed it is near-chance as a PREDICTOR of future bugs (AUC 0.555). It remains here in the feature drill-down as an honest descriptive stat, with the underlying fix commits as receipts.

How it's computed: Pure git over the rolling window of recent commits per feature. No LLM.
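The classification described above can be sketched in a few lines. This is a minimal illustration, not the engine's actual code: the exact regex, the `bug_fix_ratio` and `band` helper names, and the treatment of an empty window are assumptions; only the four tokens and the 0.15 / 0.30 thresholds come from the text.

```python
import re

# Assumed pattern built from the four tokens named in the text
# (fix:, bug, hotfix, regression). The real regex may differ.
FIX_PATTERN = re.compile(r"\bfix:|\bbug|\bhotfix|\bregression", re.IGNORECASE)


def is_fix(message: str) -> bool:
    """True when a commit message looks like a fix."""
    return FIX_PATTERN.search(message) is not None


def bug_fix_ratio(messages: list[str]) -> float:
    """Fraction of commits in the window classified as fixes (0 to 1)."""
    if not messages:
        return 0.0  # assumption: an empty window reads as 0
    return sum(is_fix(m) for m in messages) / len(messages)


def band(ratio: float) -> str:
    """Map a ratio onto the three bands described in the text."""
    if ratio < 0.15:
        return "normal"
    if ratio <= 0.30:
        return "structural"
    return "firefighting"


recent = [
    "fix: null check in exporter",
    "feat: add CSV export",
    "Hotfix for login redirect",
    "docs: typo",
    "refactor parser",
]
ratio = bug_fix_ratio(recent)
print(ratio, band(ratio))
```

Two of the five sample messages match, so the sample feature lands in the firefighting band.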

Impact score

0–100 · blast radius

What it means: How many other features depend on this one via imports, shared files, and co-change. Touching high-impact code ripples downstream.

Why it matters: ≥70 is critical (a change here ripples across the product), 40–70 is connective (touches multiple flows), under 40 is a safe-to-refactor leaf. Impact ≥70 + low coverage is the case for mandatory senior review; wire it to CODEOWNERS.

How it's computed: Graph-centrality percentiles across the feature dependency graph. The top quartile is the "Stripe webhook handler" shape: touched by everything.
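The senior-review rule can be expressed as a simple filter. This sketch assumes that "low coverage" means a Coverage value under 40; the Impact section does not define that cutoff, so `COVERAGE_LOW` is an assumption, as are the `Feature` type and the helper names.

```python
from dataclasses import dataclass

IMPACT_CRITICAL = 70  # from the text: >=70 is critical
COVERAGE_LOW = 40     # assumption: "low coverage" read as under 40


@dataclass
class Feature:
    name: str
    impact: float    # 0-100 blast radius
    coverage: float  # 0-100 test reach


def impact_band(score: float) -> str:
    """Map an Impact score onto the three bands in the text."""
    if score >= 70:
        return "critical"
    if score >= 40:
        return "connective"
    return "leaf"


def needs_senior_review(feature: Feature) -> bool:
    """High blast radius combined with low test reach."""
    return feature.impact >= IMPACT_CRITICAL and feature.coverage < COVERAGE_LOW


features = [
    Feature("billing-webhooks", impact=88, coverage=22),
    Feature("settings-page", impact=35, coverage=15),
    Feature("auth", impact=91, coverage=76),
]
flagged = [f.name for f in features if needs_senior_review(f)]
print(flagged)
```

The flagged names are the candidates for a CODEOWNERS entry that requires a senior reviewer.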

Coverage

0–100 · behavioral + lcov merge

What it means: Per-feature / per-flow test reach. It answers "is this feature tested?", not "did some line execute?". Most OSS repos ship no lcov report, so we estimate behavioral coverage from git history with zero setup, then merge real lcov line coverage on top if you upload it.

Why it matters: ≥70 is covered (regressions surface in CI), 40–70 is partial (happy path only; edge cases ship to prod), under 40 is uncovered. Features below 40 correlate with a 3× higher post-deploy incident rate. Don't refactor a feature below 40; write tests first.

How it's computed: A 7-signal composite with zero LLM calls: co_change (0.35) + bug_fix_test (0.25) + freshness (0.15) + density (0.10) + co_author (0.10) + ci_workflow (0.05) + commit_msg (0.05). Each signal and the composite are clamped to [0, 1]. Every number carries a confidence band: high (5+ signals fired or lcov merged), medium (3–4), low (≤2), so a noisy read can't masquerade as ground truth.
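A minimal sketch of the weighted composite and its confidence band, using the weights listed above. Note that the listed weights sum to 1.05, which is one reason the final clamp matters. The function names, the rule that a signal "fires" when it is greater than zero, and the omission of the lcov merge step are assumptions.

```python
# Weights as listed in the text; they sum to 1.05, so the composite
# can exceed 1 before the final clamp.
WEIGHTS = {
    "co_change": 0.35,
    "bug_fix_test": 0.25,
    "freshness": 0.15,
    "density": 0.10,
    "co_author": 0.10,
    "ci_workflow": 0.05,
    "commit_msg": 0.05,
}


def clamp01(x: float) -> float:
    return max(0.0, min(1.0, x))


def coverage_composite(signals: dict[str, float]) -> float:
    """Weighted sum of clamped signals, itself clamped to [0, 1]."""
    total = sum(w * clamp01(signals.get(name, 0.0)) for name, w in WEIGHTS.items())
    return clamp01(total)


def confidence_band(signals: dict[str, float], lcov_merged: bool = False) -> str:
    """high: 5+ signals fired or lcov merged; medium: 3-4; low: 2 or fewer."""
    # Assumption: a signal counts as "fired" when its value is above zero.
    fired = sum(1 for name in WEIGHTS if signals.get(name, 0.0) > 0)
    if lcov_merged or fired >= 5:
        return "high"
    if fired >= 3:
        return "medium"
    return "low"


sample = {"co_change": 1.0, "bug_fix_test": 1.0}
print(coverage_composite(sample), confidence_band(sample))
```

Multiplying the composite by 100 to match the 0–100 column is an assumed final step, not something the text states.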

Ownership / bus factor

distinct authors · 90d

What it means: The count of distinct authors who touched a feature in the last 90 days. Bus factor 1 means one person knows this code: if they leave, you bleed.

Why it matters: ≥3 is resilient (safe to lose any single author), 2 is fragile (pair-program the next change), 1 is critical (single point of failure). Sub-3 ownership correlates with 2–4 week onboarding delays per incident on the affected feature, and "8 critical features owned by one person" is the conversation that gets headcount approved.

How it's computed: Derived from git blame / author history over the trailing 90 days, scoped per feature. No LLM.
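As a rough illustration, counting distinct authors over a trailing window looks like this. The input shape (a list of `(author, timestamp)` pairs already scoped to one feature), the function names, and the label for a feature with no recent authors are assumptions.

```python
from datetime import datetime, timedelta


def bus_factor(commits, now, window_days=90):
    """Distinct authors with at least one commit in the trailing window.

    commits: iterable of (author, datetime) pairs for a single feature.
    """
    cutoff = now - timedelta(days=window_days)
    return len({author for author, when in commits if cutoff <= when <= now})


def resilience(count: int) -> str:
    """Map an author count onto the bands described in the text."""
    if count >= 3:
        return "resilient"
    if count == 2:
        return "fragile"
    if count == 1:
        return "critical"
    return "no recent authors"  # assumption: the text does not name this case


now = datetime(2024, 6, 1)
history = [
    ("ana", datetime(2024, 5, 20)),
    ("ana", datetime(2024, 4, 2)),
    ("ben", datetime(2024, 3, 10)),
    ("cy", datetime(2023, 12, 1)),  # outside the 90-day window
]
count = bus_factor(history, now)
print(count, resilience(count))
```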

AI %

0–100 · line-weighted, M7

What it means: The share of a feature's current lines whose last author is an AI coding agent, classified from git identity + Co-Authored-By trailers against a versioned agent registry, not a guess.

Why it matters: This is a LOWER BOUND by construction: a human committing agent output under their own identity with no trailer is undetectable, so the real share is always ≥ what's shown. Use it as a floor, not a ceiling, when deciding where extra review scrutiny belongs.

How it's computed: `git blame` at the scanned commit, each line inheriting its commit's classification (agents.yaml registry); bot commits excluded from both sides. Falls back to commit-weighted share on very large repos (line-blame pre-flight gated at 20k tracked files).
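The line-weighted share can be sketched as follows, assuming each blamed line has already been labelled "human", "agent", or "bot" by the registry lookup. Those label strings, the `ai_percent` name, and returning `None` when no countable lines remain are assumptions made for illustration.

```python
from typing import Optional


def ai_percent(line_classes: list[str]) -> Optional[float]:
    """Share of non-bot lines whose last author is an AI agent, 0 to 100.

    Bot lines are dropped from both numerator and denominator,
    mirroring "bot commits excluded from both sides".
    """
    counted = [c for c in line_classes if c != "bot"]
    if not counted:
        return None  # assumption: nothing left to measure
    agent_lines = sum(1 for c in counted if c == "agent")
    return 100.0 * agent_lines / len(counted)


# 6 human lines, 2 agent lines, 2 bot lines -> 2 of 8 counted lines.
blamed = ["human"] * 6 + ["agent"] * 2 + ["bot"] * 2
print(ai_percent(blamed))
```

Because unlabelled agent output is classified as human, the value this returns is a floor on the true share.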

How to read it

· (no data): no agent trailers anywhere in history. This does NOT mean no AI; it means no disclosure discipline. If the team writes with Copilot/Cursor, the real share is invisible until trailers are on.
~1%: the open-source median in our 38-repo study. Assistants are in use, but there are few full-agent commits.
10–30%: active agent-assisted development (we measure formbricks at ~10%, inbox-zero at ~28%). Read it next to review hygiene; this is where second-review rules start earning their keep.
60%+: an AI-native codebase (we've measured 63–64% in the wild). The human REVIEWER is now the real knowledge owner, so bus factor should be counted over reviewers, not authors.

How to improve it: You don't push this number up or down; you make it HONEST. Turn on Co-Authored-By trailer discipline (Claude Code and Cursor emit trailers by default), so the floor approaches the truth; then route high-AI% features to stricter review instead of arguing about the total.

Rework %

0–100 · 28d horizon, M3

What it means: The share of a feature's added lines that were rewritten or deleted within 28 days of landing. It tracks LINE SURVIVAL, not commit volume: a feature can be busy and still have every line stick.

Why it matters: High rework on freshly-landed code is a quality-on-entry signal, catchable BEFORE the bug ships: code that gets rewritten within a month of landing usually wasn't right the first time. Split self- vs cross-author rework to tell "iterating on my own work" apart from "someone else had to fix this."

How it's computed: One streaming `git log --reverse -U0` pass; content-hashed added lines are retired as churned by a later matching delete/modify within the horizon; pure moves and formatting-sweep commits (>30% of repo lines) are excluded so a rename or a prettier run never counts as churn.
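The line-survival idea can be illustrated on a simplified event stream. This sketch assumes the git log has already been parsed into chronological `(day, kind, text)` tuples with `day` as an integer offset, matches a delete to the oldest live add with the same content hash, and leaves out the move and formatting-sweep exclusions entirely. All names here are hypothetical.

```python
import hashlib
from collections import defaultdict, deque

HORIZON_DAYS = 28  # from the text


def line_key(text: str) -> str:
    """Content hash used to match a later delete to an earlier add."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def rework_percent(events) -> float:
    """Share of added lines deleted again within the horizon, 0 to 100.

    events: chronological (day, kind, text) tuples, kind in {"add", "del"}.
    """
    live = defaultdict(deque)  # content hash -> days on which it was added
    added = 0
    churned = 0
    for day, kind, text in events:
        key = line_key(text)
        if kind == "add":
            live[key].append(day)
            added += 1
        elif kind == "del" and live[key]:
            landed = live[key].popleft()  # assumption: oldest add is retired first
            if day - landed <= HORIZON_DAYS:
                churned += 1
    return 100.0 * churned / added if added else 0.0


events = [
    (0, "add", "a = 1"),
    (0, "add", "b = 2"),
    (0, "add", "c = 3"),
    (0, "add", "d = 4"),
    (10, "del", "a = 1"),  # removed after 10 days: counts as rework
    (40, "del", "b = 2"),  # removed after 40 days: outside the horizon
]
print(rework_percent(events))
```

Deletes of lines that were never seen being added (code that predates the stream) are ignored, so they do not inflate the figure.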

How to read it

0–5%: very stable intake. Code lands right the first time (mature, well-reviewed projects; documenso sits near 3%).
8–15%: normal for active development (our corpus median is ~13%). Nothing to fix; just watch the trend.
20%+: code is systematically redone within a month of landing. Look for the cause: missing review, prototype phase, or unreviewed agent output.
AI ≫ human gap: the sharpest read of all. When AI-line rework runs 2–3x the human rate (we've measured 22% vs 8% in one repo), review is not keeping pace with the agents.

How to improve it: Tighten quality-at-entry with real review on agent PRs, characterization tests before refactors, and smaller PRs. Because the horizon is 28 days, the number responds within one or two windows, which makes it the fastest-moving quality metric we ship.

Orphaned %

0–100 · departed-author lines, M1

What it means: The share of a feature's current lines whose last author has made no commits anywhere in the repo within the recent activity window; that is, code nobody currently on the team wrote.

Why it matters: Orphaned code is risk hiding in plain sight: no one active can answer "why is this like this" without archaeology. Rendered muted at exactly 0% (silence over noise) so a healthy feature doesn't compete for attention with the ones that need it.

How it's computed: Git author history over a 90-day activity window; AI/bot-authored lines inherit their human reviewer via Co-Authored-By trailer order when derivable. Confidence demotes to low on a young repo or a stale clone, where the window can't be trusted.
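A small sketch of the departed-author check, assuming the last author of each current line and each author's most recent commit date anywhere in the repo are already known. It skips the reviewer inheritance for AI/bot lines and the confidence demotion; the function name and the choice to treat an unknown author as departed are assumptions.

```python
from datetime import datetime, timedelta


def orphaned_percent(line_authors, last_commit_by_author, now, window_days=90):
    """Share of lines whose last author has no commit in the window, 0 to 100.

    line_authors: last author of each current line in the feature.
    last_commit_by_author: author -> datetime of latest commit in the repo.
    """
    if not line_authors:
        return 0.0
    cutoff = now - timedelta(days=window_days)

    def departed(author: str) -> bool:
        last_seen = last_commit_by_author.get(author)
        # Assumption: an author with no recorded commit counts as departed.
        return last_seen is None or last_seen < cutoff

    orphaned = sum(1 for author in line_authors if departed(author))
    return 100.0 * orphaned / len(line_authors)


now = datetime(2024, 6, 1)
last_commit = {
    "ana": datetime(2024, 5, 20),
    "cy": datetime(2023, 12, 1),
}
lines = ["ana", "ana", "ana", "cy"]
print(orphaned_percent(lines, last_commit, now))
```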

How to read it

0% (muted dot): everyone who wrote this code is still active. Rendered as a quiet dot on purpose: zero here is a state, not an achievement, and it shouldn't compete for attention.
5–15%: natural team turnover. Fine on its own; worth noting on features that are also high-Risk.
25%+ on a feature: editing this zone is archaeology, because nobody active can answer "why is it like this". PR comments start warning "the author of this code has departed".
50–90%+: effectively nobody's code (we've seen 93% zones in real repos). First candidate for re-ownership, documentation, or a deliberate rewrite decision.

How to improve it: Rotate knowledge. Give the worst zones an explicit active owner, onboard new hires into them, and document while fixing small issues. The number falls as active authors touch the lines, so it responds to real work, not to policy documents.

Risk

0–100 · within-repo percentile, P4 calibration

What it means: A feature's within-scan percentile rank of predicted FUTURE bug-fix risk: 0 is the safest feature in this repo, 100 is the riskiest. Built from four scan-JSON fields (no git walk): prior fix density, days since last touch, contributor count, commit volume.

Why it matters: This is a rank, not a probability, so never read it as "N% chance of a bug". Use it to triage which features to review or test first THIS quarter. The P4 study found the engine's existing Health/bug-fix-ratio is close to chance (AUC 0.555) at predicting future fixes; this signal (pooled grouped-CV AUC 0.78) is validated specifically for within-repo ranking, not for comparing raw scores across different repos.

How it's computed: The study's own published logistic-regression coefficients (docs/specs/metrics/p4-calibration-report.md §5), applied to within-scan average-rank-transformed inputs. This is the same recipe (`rankdata(method="average")/n`) the study validated with GroupKFold(5)-by-repo cross-validation. Confidence is low when the feature has under 5 commits in the scan (too little activity to trust the rank).
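The rank transform is easy to reproduce without SciPy. The sketch below implements the average-rank recipe in pure Python and feeds the transformed inputs through a logistic model. The real coefficients live in the calibration report and are not reproduced here, so `coefs` and `intercept` are caller-supplied placeholders, and the field names and helper names are hypothetical.

```python
import math


def average_rank_fraction(values):
    """Pure-Python equivalent of rankdata(values, method="average") / n."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j across a run of tied values.
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1  # 1-based average rank of the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = average / n
        i = j + 1
    return ranks


def risk_scores(rows, coefs, intercept):
    """Logistic score per feature from rank-transformed inputs.

    rows: one dict of raw field values per feature in the scan.
    coefs: field -> coefficient (placeholders here, not the study's values).
    """
    ranked = {f: average_rank_fraction([r[f] for r in rows]) for f in coefs}
    scores = []
    for i in range(len(rows)):
        z = intercept + sum(c * ranked[f][i] for f, c in coefs.items())
        scores.append(1.0 / (1.0 + math.exp(-z)))
    return scores


def is_low_confidence(commit_count: int) -> bool:
    """Under 5 commits in the scan is too little activity to trust the rank."""
    return commit_count < 5


print(average_rank_fraction([10, 20, 20, 40]))  # [0.25, 0.625, 0.625, 1.0]
```

Because every input is ranked within one scan, the resulting scores are only meaningful relative to other features in the same repo.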

How to read it

· (low confidence): under 5 commits in the scan, which is too little activity to rank. Absence of a rank is honest, not a bug.
0–40: the calm tail of this repo. Don't spend scarce review budget here.
60–80: elevated. Put these features on this quarter's test/review list.
90+: the top decile of THIS repo. Combined with high Impact, this quadrant is where your next incident is statistically most likely to come from.

How to improve it: Lower its inputs by stabilizing fix-dense features (fewer fix commits per feature) and letting hot zones cool. It's a within-repo RANK: it falls when a feature stops being the hottest thing in its own repo, and raw values must never be compared across repositories.