Bug-hunt fixes: cve-watch 403, repro-check, hardening-drift guard, publish/prune #1

Merged
zach merged 4 commits from fix/bughunt-cve-watch-repro-config-publish into main 2026-05-28 23:23:14 +00:00

Fixes from an Opus 4.8 bug hunt (3 read-only Explore agents) across the CI workflows, build pipeline, and publish/repro tooling. Four self-contained clusters, one commit each. All changes pass the validate.yml gates (shellcheck, yamllint, py_compile, intent-matches-policy) plus targeted unit tests of the pure logic; end-to-end repro/prune runs need a Linux+docker host and are flagged as operator verification.

Clusters

cve-watch (37d4aa4) — the open / update tracking issue step was dying with an opaque curl: (22) ... 403.

  • Replaced curl -fsS with a helper capturing HTTP status and body, printed on >=400 — the next failure is self-diagnosing.
  • Dropped the server-side &q= issue search (needs the issue indexer + extra token scope); the existing client-side jq title match suffices.
  • Normalized the upstream-vs-local comparison to the X.Y.Z-hardenedN core, fixing a false "new upstream" that fired every hour because local .P packaging tags never match upstream.
  • Reconciled the ISSUE_TOKEN scope docs to write:issue + read:repository.

reproducibility (bdff799): repro-check.sh could not pass for any release.

  • strip-signatures.sh now decompresses .ko.xz before stripping (it was scanning compressed bytes -> no-op) and truncates at the correct offset idx-12-sig_len. Verified: project-key and ephemeral-key signatures over the same module strip to identical bytes.
  • Added the missing gcc-<ver>-plugin-dev to the rebuild deps (without it the fidelity assertion aborts on CONFIG_GCC_PLUGINS), added sbsigntool to the preflight, and turned the silent released-.deb fetch skip into a loud, actionable warning.

config integrity (b583129)

  • New tools/check-hardening-drift.sh, run in build-kernel.yml, fails the build if an inherited hardening symbol (not covered by the fidelity assertion) regresses vs the most recent released final.config. Excludes intent-pinned symbols and "disable-is-hardened" families to avoid inverted false positives. Verified on the real 1.2->1.3 pair (passes) and a synthetic LSM drop (fails).
  • Stopped the per-build tmpfs signing-key path leaking into committed final.config; scrubbed the three already-committed configs.
  • Anchored bump-seed.sh's signer check (was an unanchored substring grep).
  • Doc accuracy: the seed intentionally lags the kernel version; the drift guard, not a version-match gate, covers inherited symbols. (Seed bump to 7.0.10 deferred — anthraxx hasn't packaged it.)

publish/prune (a24d4ba)

  • Retention groups by version core so a packaging-rebuild burst can't evict an older supported kernel; malformed versions are logged, not silently bucketed.
  • Failed Fastly purges now fail loud (were silently exit-0, risking stale apt metadata); archive is idempotent (409/404 tolerant); bounded retry on transient 5xx; atomic manifest write.

Operator follow-ups

  • Re-mint ISSUE_TOKEN with write:issue + read:repository — the code makes the failure diagnosable and drops the indexer dependency, but the scope is what actually clears the 403.
  • The first seed-era build may trip the drift guard vs the pre-seed 1.3 baseline; review and pin the symbol or set ALLOW_HARDENING_REGRESSION=1 once.

Deferred (documented, not in this PR)

repro-check.yml's workflow_run trigger may not fire on Forgejo; draft-release 409 fallback; intent-matches-policy substring match; unshare -n silent no-isolation fallback; tag regex accepting .0.

🤖 Generated with Claude Code

The `open / update tracking issue` step died with an opaque
`curl: (22) ... 403`. Three compounding problems:

1. `curl -fsS` discarded the response body, so the operator saw only
   exit 22 with no Forgejo error. Replace both API calls with an
   api_call helper that captures `%{http_code}` + body and prints the
   body to stderr on >=400 — the next failure is self-diagnosing.

2. The dedup search used a server-side `&q=` query, which depends on
   the issue indexer and requires the repository-read scope category.
   Drop it: the code already filters the open-issue list client-side by
   exact title (jq select), so `state=open&type=issues&limit=50` plus
   the existing jq match is sufficient and needs less scope surface.

3. ISSUE_TOKEN was documented as `write:issue` only, but listing issues
   needs `read:repository` too (Forgejo enforces token scopes even on
   anonymously-public endpoints). Reconcile the workflow comment and the
   three operator docs to `write:issue` + `read:repository`. (Re-minting
   the PAT is the operator action that actually clears the 403.)

Separately, the upstream-vs-local comparison compared the newest local
tag (which carries our `.P` packaging suffix, e.g. v7.0.10-hardened1.4)
against an upstream value that can only ever be `...-hardenedN`, so
`new=true` fired every hour and tried to refile. Normalize the local
tag to its X.Y.Z-hardenedN core before comparing.
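
The normalization described above can be sketched as follows. This is a minimal illustration, not the workflow's actual code: the function names and the exact tag grammar (optional `v` prefix, optional `.P` suffix) are assumptions, and "new" is taken to mean that the cores differ.

```python
import re
from typing import Optional

# Assumed tag shape: optional "v", X.Y.Z, "-hardenedN", optional ".P" packaging suffix.
_TAG_RE = re.compile(r"^v?(\d+\.\d+\.\d+-hardened\d+)(?:\.\d+)?$")


def version_core(tag: str) -> Optional[str]:
    """Return the X.Y.Z-hardenedN core of a tag, or None if it does not parse."""
    m = _TAG_RE.match(tag.strip())
    return m.group(1) if m else None


def is_new_upstream(upstream: str, newest_local_tag: str) -> bool:
    """Compare cores, so a local .P packaging suffix never counts as a difference."""
    up = version_core(upstream)
    local = version_core(newest_local_tag)
    if up is None or local is None:
        # Hypothetical policy: refuse to guess rather than file a spurious issue.
        raise ValueError(f"unparseable version: {upstream!r} / {newest_local_tag!r}")
    return up != local
```

With this shape, `v7.0.10-hardened1.4` and an upstream `7.0.10-hardened1` compare equal, so the hourly run stays quiet until the upstream core actually changes.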

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
repro-check.sh could not succeed for any seed-era release; four distinct
bugs, each of which failed or misled silently:

1. strip-signatures.sh searched for the module-signature magic in the
   raw bytes of *.ko.xz. Modules are signed THEN compressed, so the magic
   is inside the xz stream -> rfind missed -> no-op. Both sides no-op'd
   identically, but the signed bytes differ (project key vs the rebuild's
   ephemeral key), so the image .deb always mismatched. Now decompress
   (.ko/.ko.xz/.ko.gz/.ko.zst) and canonicalize to the DECOMPRESSED,
   unsigned form on both sides; never recompress (xz/zstd output isn't
   byte-stable across versions). One python walk instead of one-per-.ko.

2. The truncation offset was idx-12, which left the variable-length
   signature blob behind. Cut at idx-12-sig_len, where sig_len is the
   big-endian u32 in the last 4 bytes of struct module_signature.

3. repro-check.sh's rebuild container omitted gcc-<ver>-plugin-dev, which
   build-dep does not pull. Without it kconfig drops the GCC_PLUGINS
   subtree and configure-kernel.sh fidelity-aborts on CONFIG_GCC_PLUGINS
   -> "rebuild produced no .debs". Install it, mirroring build-kernel.yml.

4. Default mode silently skipped fetching the released .debs (nothing
   ever writes the canonical_path the loop reads), so a mismatch produced
   no diffoscope report and the operator couldn't tell why. The verdict
   (hash vs manifest) doesn't need them, so warn loudly + explain the
   missing diff and point at RELEASED_DIR, instead of skipping in silence.

Also require sbsigntool (sbattach) in the preflight: without it the
released vmlinuz's Authenticode signature isn't stripped, which would
itself produce a false mismatch on the image .deb.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Config-integrity hardening (seed bump to 7.0.10 deliberately deferred:
anthraxx hasn't packaged it yet, so the 7.0.9-seed / 7.0.10-tree skew is
expected and will persist for now).

B2 — ephemeral signing-key path leaked into committed final.config.
configure-kernel.sh's post-fidelity override repoints CONFIG_MODULE_SIG_KEY
at a per-build tmpfs path; build-debs.sh copied that verbatim into
final.config, which publish.yml commits. Result: a runner path
(/tmp/tmp.XXXX/signing.pem) in public git and a spurious one-line diff
every build. Normalize the line back to the intent placeholder when
emitting final.config, and scrub the three already-committed configs.
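
A sketch of the normalization step, for illustration only. The function name is hypothetical and the placeholder value is passed in by the caller, since the intent placeholder itself is not quoted in this commit.

```python
import re

# Matches only the key-path line; CONFIG_MODULE_SIG_KEY_TYPE_* lines are left alone
# because the character after "KEY" there is "_", not "=".
_SIG_KEY_RE = re.compile(r"^CONFIG_MODULE_SIG_KEY=.*$", re.M)


def scrub_signing_key(config_text: str, placeholder: str) -> str:
    """Rewrite the per-build key path back to a stable placeholder value."""
    line = f'CONFIG_MODULE_SIG_KEY="{placeholder}"'
    # A callable replacement avoids backslash-escape surprises in the placeholder.
    return _SIG_KEY_RE.sub(lambda _m: line, config_text)
```

The rewrite is idempotent, so applying it to an already-scrubbed config leaves it unchanged and produces no diff.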

B1 — inherited hardening symbols bypassed all drift detection. The
fidelity assertion only covers the ~59 symbols pinned in intent.config;
inherited ones (LSMs, BPF_UNPRIV_DEFAULT_OFF, ...) could silently regress.
Add tools/check-hardening-drift.sh and run it in build-kernel.yml after the
build: it fails if an enabled hardening symbol dropped vs the most recent
released final.config. It excludes intent-pinned symbols (already covered)
and "disable-for-hardening" families (DEVMEM/KEXEC/HIBERNATION/... where
off is the hardened state, incl. their children) so turning those off, or
their subtrees vanishing, is not misread as a regression. Verified on the
real 1.2->1.3 pair (no false positive) and a synthetic LSM drop (caught).
It runs in the workflow, not build-debs.sh, so repro-check's rebuild isn't
coupled to a policy gate. ALLOW_HARDENING_REGRESSION=1 overrides a reviewed
intentional drop.
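
The core comparison might look like the sketch below. It is not the contents of tools/check-hardening-drift.sh: how the real script decides which symbols count as "hardening" is not stated here, so the `watched` set, the prefix-based family match, and all function names are assumptions.

```python
import re

_ENABLED_RE = re.compile(r"^(CONFIG_[A-Za-z0-9_]+)=(.+)$")


def enabled_symbols(config_text):
    """Map each enabled symbol to its value; '# CONFIG_X is not set' lines are skipped."""
    out = {}
    for line in config_text.splitlines():
        m = _ENABLED_RE.match(line.strip())
        if m and m.group(2) != "n":
            out[m.group(1)] = m.group(2)
    return out


def hardening_regressions(baseline, current, watched, pinned, disable_prefixes):
    """Return watched symbols enabled in the baseline but no longer enabled."""
    base = enabled_symbols(baseline)
    cur = enabled_symbols(current)
    dropped = []
    for sym in sorted(base):
        if sym not in watched or sym in pinned:
            continue  # pinned symbols are already covered by the fidelity assertion
        if any(sym.startswith(p) for p in disable_prefixes):
            continue  # "off is hardened" families: a drop here is not a regression
        if sym not in cur:
            dropped.append(sym)
    return dropped


def build_should_fail(dropped, env):
    """Fail on any drop unless the reviewed override is set."""
    return bool(dropped) and env.get("ALLOW_HARDENING_REGRESSION") != "1"
```

Excluding the disable-is-hardened families before the comparison is what prevents the inverted false positive: turning KEXEC off shows up as a "dropped" symbol, but it is the hardened direction.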

B3 — docs claimed the seed is "version-matched to the patched tree". It
isn't (and needn't be). config-architecture.md and the upstream-seed.toml
comment now state the seed routinely lags the kernel version and that the
drift guard, not a version-match gate, covers inherited symbols.

C12 — bump-seed.sh's signer check was an unanchored substring grep. Use
the same anchored "Primary key fingerprint:" / "using RSA key" parse plus
exact de-spaced fingerprint comparison as build/fetch-seed.sh.
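
To illustrate why anchoring matters, here is a rough sketch of such a parse. It is not the code in bump-seed.sh or build/fetch-seed.sh: the function names are made up, the line shapes are an assumption about typical `gpg --verify` output, and preferring the primary-fingerprint line over the "using RSA key" line is a guess at the intended precedence.

```python
import re
from typing import Optional

# Anchored at line start, so a fingerprint embedded in a user ID cannot match.
_PRIMARY_RE = re.compile(r"^Primary key fingerprint:[ \t]*([0-9A-Fa-f \t]+)$", re.M)
_USING_RE = re.compile(r"^gpg:[ \t]+using RSA key[ \t]+([0-9A-Fa-f]+)[ \t]*$", re.M)


def _despace(fpr: str) -> str:
    return "".join(fpr.split()).upper()


def extract_signer(gpg_output: str) -> Optional[str]:
    """Return the de-spaced fingerprint from an anchored line, or None."""
    m = _PRIMARY_RE.search(gpg_output) or _USING_RE.search(gpg_output)
    return _despace(m.group(1)) if m else None


def signer_matches(gpg_output: str, expected_fpr: str) -> bool:
    """Exact comparison of the parsed fingerprint against the expected one."""
    found = extract_signer(gpg_output)
    return found is not None and found == _despace(expected_fpr)
```

An unanchored substring grep would accept any output that merely contains the expected fingerprint, for example inside an attacker-chosen user ID string; the anchored parse only trusts the lines gpg itself emits for the signing key.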

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
fix(publish/prune): correct retention; fail loud on stale-metadata risks
All checks were successful
validate / shellcheck (pull_request) Successful in 12s
validate / yamllint (pull_request) Successful in 11s
validate / pycompile (pull_request) Successful in 4s
validate / intent-matches-policy (pull_request) Successful in 4s
validate / no-placeholder-digests (pull_request) Has been skipped
a24d4ba31e
C5 — prune retention grouped by major only, so a burst of packaging
rebuilds on the current kernel (7.0.10-hardened1, .1, .2, .3) filled the
keep-3 quota and evicted an older, still-supported kernel (7.0.9-hardened1)
from main — the opposite of the documented "keep N minor versions per
major". Group by the version CORE (major.minor.patch-hardenedN) instead:
keep the newest N cores per major, and within a kept core collapse to the
newest packaging rebuild (older .P rebuilds are archived). Rebuilds no longer count
against the version quota.

C6 — the unparseable-version sentinel (0,) was truthy, so the `if not
parts` guard never fired; junk-named versions landed in a bogus "major 0"
bucket and were partly archived in arbitrary order. Return () (falsy), and
leave unparseable versions in main with a logged warning — prune should
not auto-archive a version it can't interpret.
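
C5 and C6 together can be sketched as below. This is a simplified illustration, not the prune tool itself: the version grammar, function names, and return shape are assumptions.

```python
import re
from collections import defaultdict

_VER_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)-hardened(\d+)(?:\.(\d+))?$")


def parse_version(v):
    """Return (major, minor, patch, hardenedN, P), or () when unparseable.

    The empty tuple is falsy, unlike a (0,) sentinel, so `if not parts` fires.
    """
    m = _VER_RE.match(v)
    if not m:
        return ()
    major, minor, patch, hard, pkg = m.groups()
    return (int(major), int(minor), int(patch), int(hard), int(pkg or 0))


def plan_retention(versions, keep=3):
    """Return (kept, archived, unparseable); unparseable versions stay in main."""
    kept, archived, unparseable = [], [], []
    by_major = defaultdict(lambda: defaultdict(list))  # major -> core -> [(P, version)]
    for v in versions:
        parts = parse_version(v)
        if not parts:
            unparseable.append(v)  # caller logs a warning; never auto-archived
            continue
        core, pkg = parts[:4], parts[4]
        by_major[core[0]][core].append((pkg, v))
    for cores in by_major.values():
        for rank, core in enumerate(sorted(cores, reverse=True)):
            builds = sorted(cores[core], reverse=True)  # newest rebuild first
            if rank < keep:
                kept.append(builds[0][1])
                archived.extend(v for _, v in builds[1:])
            else:
                archived.extend(v for _, v in builds)
    return kept, archived, unparseable
```

Because the quota counts cores rather than individual builds, the four 7.0.10-hardened1 rebuilds in the example above occupy a single slot and 7.0.9-hardened1 survives.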

C8 — both tools recorded a failed Fastly purge and still exited 0. A
publish/rotation that updates pool files but fails to purge leaves apt
clients served stale InRelease/Packages until TTL (and prune can leave a
Packages index pointing at deleted pool files). Now an ATTEMPTED-but-failed
purge exits non-zero; absent creds is a loud warning, not silence (a
direct-origin/no-CDN deploy is supported per SECURITY.md).

C7 — archive (upload-to-archive then delete-from-main) was not idempotent:
a failure mid-rotation left a package in both components and a re-run
409'd. upload_deb gains conflict_ok (archive-side 409 = already moved ->
proceed), and http_delete tolerates 404 (already removed) — so a resumed
rotation converges.

C9 — upload_deb turned every error into a non-retryable SystemExit, so a
transient 503/network blip aborted a multi-deb publish. Add bounded
exponential-backoff retry on 5xx and network errors only; 4xx (incl. 409)
is never retried.
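
The status handling in C7 and C9 can be summarized in one sketch. It is illustrative only: `classify` and `call_with_retry` are hypothetical names, `send` stands in for whatever performs the HTTP request and returns a status code, and the attempt count and delays are placeholder values.

```python
import time


def classify(status, conflict_ok=False, missing_ok=False):
    """Map an HTTP status to 'ok', 'retry', or 'fail'."""
    if 200 <= status < 300:
        return "ok"
    if status == 409 and conflict_ok:
        return "ok"  # archive side: already moved, proceed
    if status == 404 and missing_ok:
        return "ok"  # delete side: already removed
    if 500 <= status < 600:
        return "retry"  # transient server error
    return "fail"  # any other 4xx (incl. untolerated 409) is never retried


def call_with_retry(send, attempts=4, base_delay=1.0, sleep=time.sleep, **tolerate):
    """Call send() with bounded exponential backoff on 5xx and network errors."""
    for attempt in range(attempts):
        try:
            verdict = classify(send(), **tolerate)
        except OSError:  # network-level failure (ConnectionError, timeouts, ...)
            verdict = "retry"
        if verdict == "ok":
            return
        if verdict == "fail":
            raise RuntimeError("non-retryable HTTP error")
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"still failing after {attempts} attempts")
```

Keeping the tolerance flags per call site (`conflict_ok` for the archive upload, `missing_ok` for the delete) means a resumed rotation converges without masking a genuine 409 or 404 elsewhere.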

C10 — the manifest was written in place and publish.py runs twice (main,
debug) against the same file, so a crash mid-write could truncate it. Write
via temp + os.replace, with a trailing newline to match build-manifest.py.
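
A minimal sketch of the temp + os.replace pattern, assuming the manifest is JSON (the format is not stated in this commit) and using a hypothetical function name:

```python
import json
import os
import tempfile


def write_manifest_atomic(path, manifest):
    """Write the manifest so readers see either the old file or the new one."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live in the same directory so os.replace stays on one filesystem.
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".manifest-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(manifest, fh, indent=2, sort_keys=True)
            fh.write("\n")  # trailing newline
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp, path)  # atomic rename over the destination
    except BaseException:
        try:
            os.unlink(tmp)
        except FileNotFoundError:
            pass
        raise
```

A crash before `os.replace` leaves the previous manifest intact, which matters when two publish runs (main, debug) touch the same file in sequence.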

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
zach merged commit 68d16cb7f7 into main 2026-05-28 23:23:14 +00:00
Reference
unredacted/linux-hardened-unredacted!1