Problem statement
An empty weekly digest is useful telemetry only if collectors were healthy. If your platform release feed depends on manual checks or a few bookmarked blogs, you will either miss bursts of activity or publish stale summaries that waste engineers' attention. The operational goal is simple: deterministically detect authoritative publish events (releases, tags, blog posts), record verifiable provenance, and surface collector health so an absence of items is actionable.
Why determinism matters
Platform teams manage many repos and publishing surfaces (GitHub Releases, Git tags, Backstage blog posts, vendor blogs, and third-party research repos). A missed release can cause undocumented breaking changes, missed security updates, or loss of trust in downstream digests and IDPs. Two requirements follow:
- Deterministically detect authoritative artifacts (releases, tags, published posts).
- Attach verifiable provenance (canonical URL, publishedAt, tag/sha, delivery/request id).
Treat the absence of news as data — but only after proving the collection pipeline was healthy.
Detection mechanisms: prefer push, fallback to pull with health checks
- GitHub Releases REST API (polling or on-demand)
Use the REST Releases endpoint when webhooks are unavailable or for on-demand checks. Record tag_name, published_at, html_url, name, prerelease and the request/response metadata so you can surface collector errors.
Example (Node.js + @octokit/rest) — add retries, backoff and rate-limit handling for production:
// package: @octokit/rest
const { Octokit } = require('@octokit/rest');
const octokit = new Octokit({ auth: process.env.GITHUB_TOKEN });
// Returns the newest published (non-draft) release, or null if there is none.
async function latestRelease(owner, repo) {
  try {
    const resp = await octokit.rest.repos.listReleases({ owner, repo, per_page: 5 });
    // Drafts are not publish events, so never fall back to one.
    const release = (resp.data || []).find(r => !r.draft);
    if (!release) return null;
    return {
      tag: release.tag_name,
      publishedAt: release.published_at,
      url: release.html_url,
      name: release.name,
      prerelease: release.prerelease,
      // provenance: GitHub's request id header plus the fetch timestamp
      requestId: resp.headers['x-github-request-id'],
      fetchedAt: new Date().toISOString()
    };
  } catch (err) {
    // record error for health dashboards and surface as collector-degraded
    console.error('listReleases error', err.message);
    throw err;
  }
}
// usage
(async () => {
  const r = await latestRelease('backstage', 'backstage');
  console.log(r);
})().catch(() => { process.exitCode = 1; }); // error was already logged above
- GitHub GraphQL (efficient aggregation across many repos)
GraphQL is efficient when aggregating across many repositories. Request only the fields you need and page results. Example query for a single repo's recent releases:
query($owner:String!, $repo:String!) {
  repository(owner: $owner, name: $repo) {
    releases(first: 5, orderBy: {field: CREATED_AT, direction: DESC}) {
      nodes {
        tagName
        name
        url
        publishedAt
        isPrerelease
      }
    }
  }
}
When polling many repos, batch queries and implement cursor-based pagination to stay within GitHub's GraphQL node and rate limits.
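One way to batch is to alias several repository lookups into a single request. The sketch below only builds the query string; the helper names buildBatchedReleaseQuery and chunk, the r0/r1 alias scheme, and the second sample repository are illustrative assumptions, and cursor pagination is left out for brevity.

```javascript
// Hypothetical helper: build one GraphQL query that fetches recent releases
// for several repositories by giving each repository(...) field an alias.
function buildBatchedReleaseQuery(repos, perRepo = 5) {
  const fields = repos.map(({ owner, name }, i) => {
    // JSON.stringify quotes and escapes the values; this is adequate for
    // ordinary owner/repo names, which is all this sketch assumes.
    const args = `owner: ${JSON.stringify(owner)}, name: ${JSON.stringify(name)}`;
    return [
      `  r${i}: repository(${args}) {`,
      '    nameWithOwner',
      `    releases(first: ${perRepo}, orderBy: {field: CREATED_AT, direction: DESC}) {`,
      '      nodes { tagName name url publishedAt isPrerelease }',
      '    }',
      '  }',
    ].join('\n');
  });
  return `query {\n${fields.join('\n')}\n}`;
}

// Split a long repo list into fixed-size batches so each request stays small.
function chunk(list, size) {
  const out = [];
  for (let i = 0; i < list.length; i += size) out.push(list.slice(i, i + size));
  return out;
}

const repos = [
  { owner: 'backstage', name: 'backstage' },
  { owner: 'octokit', name: 'rest.js' },
];
const batchedQuery = buildBatchedReleaseQuery(repos);
console.log(batchedQuery);
```

Each alias (r0, r1, ...) comes back as its own key in the response, so results can be mapped back to the input list by index. The batch size is something to tune against the limits you actually observe.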
- Webhooks — the preferred push path
Subscribe to the repository "release" event and any relevant publishing events for vendor blogs or CMS platforms. Webhooks deliver immediate, authoritative signals and keep polling costs down.
Key operational points:
- Validate payloads using X-Hub-Signature-256 and the raw request body.
- Use X-GitHub-Delivery for idempotency and deduplication.
- Persist the canonical tuple (repo, tag, published_at, html_url) and delivery/request identifiers.
Minimal Express handler with correct signature verification (use raw body to compute HMAC):
const express = require('express');
const crypto = require('crypto');
const app = express();
// capture raw body for signature verification
app.use(express.json({ verify: (req, res, buf) => { req.rawBody = buf; } }));
function verifySig(req, secret) {
  const sig = req.get('x-hub-signature-256');
  if (!sig || !req.rawBody) return false;
  const hmac = crypto.createHmac('sha256', secret);
  const digest = 'sha256=' + hmac.update(req.rawBody).digest('hex');
  try {
    return crypto.timingSafeEqual(Buffer.from(sig), Buffer.from(digest));
  } catch (e) {
    return false;
  }
}
app.post('/github-webhook', (req, res) => {
  if (!verifySig(req, process.env.WEBHOOK_SECRET)) return res.status(401).end();
  const event = req.get('x-github-event');
  if (event !== 'release') return res.status(204).end();
  const payload = req.body;
  if (payload.action === 'published') {
    const release = payload.release;
    const deliveryId = req.get('x-github-delivery');
    // persist canonical record: owner/repo, tag, published_at, html_url, author
    // include provenance: deliveryId, request timestamp
    console.log('release published:', release.tag_name, release.published_at, 'delivery:', deliveryId);
  }
  res.status(200).end();
});
app.listen(8080);
- RSS/Atom and vendor blogs (lower trust)
Prefer RSS/Atom where available. For scraped HTML, require a canonical URL and published date, and snapshot the content (WARC or HTML) for provenance. Mark scraped items as lower confidence and require human verification before surfacing them in high-trust channels.
Ingest patterns and canonical event shape
After detection, normalize items into a canonical event schema so downstream consumers (IDP, Backstage, weekly digests) can rely on a consistent shape.
Canonical event shape (JSON):
- id: owner/repo@tag or blog://canonical-url
- kind: release | blog-post | report
- source: repository or site (e.g., github.com/backstage/backstage)
- published_at: ISO8601
- detected_at: ISO8601
- canonical_url
- title
- author
- metadata: { tag, sha, changelog_url }
- provenance: { request_id, delivery_id, raw_snapshot_url }
Persist events into an append-only store (Kafka topic, BigQuery table, or object store with an index). Separation of detection and deterministic enrichment keeps the pipeline auditable.
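As a rough illustration of the normalize-then-append step, the sketch below maps a GitHub release webhook payload onto the shape listed above and appends it once per canonical id. The names normalizeRelease and EventLog, the in-memory store, and the sample payload values are all assumptions made for the example; a real pipeline would write to one of the stores mentioned above.

```javascript
// Hypothetical normalizer: map a GitHub "release" webhook payload to the
// canonical event shape. Fields that enrichment fills in later start as null.
function normalizeRelease(payload, deliveryId, detectedAt = new Date().toISOString()) {
  const { release, repository } = payload;
  return {
    id: `${repository.full_name}@${release.tag_name}`,
    kind: 'release',
    source: `github.com/${repository.full_name}`,
    published_at: release.published_at,
    detected_at: detectedAt,
    canonical_url: release.html_url,
    title: release.name || release.tag_name,
    author: release.author ? release.author.login : null,
    metadata: { tag: release.tag_name, sha: null, changelog_url: null },
    provenance: { request_id: null, delivery_id: deliveryId, raw_snapshot_url: null },
  };
}

// In-memory stand-in for an append-only store (illustration only).
class EventLog {
  constructor() {
    this.events = [];
    this.seenIds = new Set();
  }
  // Appends once per canonical id; returns false when the id was already stored.
  append(event) {
    if (this.seenIds.has(event.id)) return false;
    this.seenIds.add(event.id);
    this.events.push(Object.freeze(event));
    return true;
  }
}

// Made-up payload containing only the fields the normalizer reads.
const samplePayload = {
  action: 'published',
  repository: { full_name: 'example-org/example-repo' },
  release: {
    tag_name: 'v1.2.0',
    name: 'Example 1.2.0',
    published_at: '2024-01-01T12:00:00Z',
    html_url: 'https://github.com/example-org/example-repo/releases/tag/v1.2.0',
    author: { login: 'example-user' },
  },
};

const log = new EventLog();
const event = normalizeRelease(samplePayload, 'delivery-123');
console.log(log.append(event)); // first delivery is stored
console.log(log.append(event)); // redelivery of the same release is skipped
```

Keying the store on the canonical id rather than the delivery id means a webhook redelivery and a later poll of the same release collapse into one record.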
Deterministic enrichment tasks
Run deterministic enrichment jobs that:
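- Resolve tag -> commit sha (git refs API). For a lightweight tag, object.sha is the commit itself; for an annotated tag, object.type is "tag" and the sha must be dereferenced through the git tags API to reach the commit. Example:
curl -s -H "Authorization: token $GITHUB_TOKEN" \
  "https://api.github.com/repos/backstage/backstage/git/ref/tags/v1.0.0" | jq -r '.object | "\(.type) \(.sha)"'
- Download release assets when needed and snapshot release notes.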
- Compute a simple impact score (breaking-change heuristics, presence of migration docs, major vs. patch version bump).
- Index enriched records for search and consumption.
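The impact-score step could look something like the sketch below. The weights, the keyword patterns, and the function names (parseSemver, bumpType, impactScore) are assumptions chosen for illustration rather than a recommended scoring model; the point is that the same inputs always produce the same score.

```javascript
// Parse "v1.2.3" or "1.2.3" into [major, minor, patch]; null if not semver-like.
function parseSemver(tag) {
  const m = /^v?(\d+)\.(\d+)\.(\d+)/.exec(tag);
  return m ? m.slice(1, 4).map(Number) : null;
}

// Classify the change between two tags.
function bumpType(prevTag, nextTag) {
  const a = parseSemver(prevTag);
  const b = parseSemver(nextTag);
  if (!a || !b) return 'unknown';
  if (b[0] !== a[0]) return 'major';
  if (b[1] !== a[1]) return 'minor';
  if (b[2] !== a[2]) return 'patch';
  return 'none';
}

// Hypothetical weights: tune these against your own release history.
const BUMP_WEIGHT = { major: 3, minor: 1, patch: 0, none: 0, unknown: 1 };

function impactScore({ prevTag, tag, notes = '' }) {
  const bump = bumpType(prevTag, tag);
  let score = BUMP_WEIGHT[bump];
  if (/\bbreaking\b/i.test(notes)) score += 3;           // breaking-change heuristic
  if (/migration|upgrade guide/i.test(notes)) score += 1; // migration docs present
  return { bump, score };
}

const majorExample = impactScore({
  prevTag: 'v1.4.2',
  tag: 'v2.0.0',
  notes: 'BREAKING: removed legacy API. See the migration guide.',
});
console.log(majorExample); // { bump: 'major', score: 7 }
```

Because the score depends only on the stored tags and release notes, it can be recomputed later from the append-only log if the heuristics change.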
Operational checks to avoid false negatives
- Export metrics from every collector (pollers and webhook handlers): last_success_timestamp, last_duration_seconds, consecutive_errors. Alert when consecutive_errors spikes or when last_success_timestamp is more than 1 hour old.
- Record upstream errors and surface collector-degraded state in the digest rather than silence (e.g., "collection degraded for GitHub API: 403 rate limit").
- Maintain a per-source watermark (latest published_at seen). If a poll's newest published_at is older than that watermark, investigate for deletions or filter regressions.
- Implement idempotency using canonical IDs and dedupe by delivery/request id.
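The health and watermark checks above can be sketched as a small in-process tracker. The class name CollectorHealth, its method names, and the one-hour default are assumptions for illustration; in practice these values would be exported to whatever metrics system you already run.

```javascript
// Hypothetical per-collector health tracker. Times are epoch milliseconds.
class CollectorHealth {
  constructor(name, maxSilenceMs = 60 * 60 * 1000) {
    this.name = name;
    this.maxSilenceMs = maxSilenceMs;
    this.lastSuccessTimestamp = null;
    this.lastDurationSeconds = null;
    this.consecutiveErrors = 0;
    this.lastError = null;
    this.watermark = null; // latest published_at seen, ISO 8601 string
  }

  recordSuccess(startedAtMs, finishedAtMs, latestPublishedAt) {
    this.lastSuccessTimestamp = finishedAtMs;
    this.lastDurationSeconds = (finishedAtMs - startedAtMs) / 1000;
    this.consecutiveErrors = 0;
    this.lastError = null;
    // The watermark only ever moves forward.
    if (latestPublishedAt &&
        (!this.watermark || Date.parse(latestPublishedAt) > Date.parse(this.watermark))) {
      this.watermark = latestPublishedAt;
    }
  }

  recordError(message) {
    this.consecutiveErrors += 1;
    this.lastError = message;
  }

  // DEGRADED when the last run failed or no success landed within maxSilenceMs.
  status(nowMs) {
    const stale = this.lastSuccessTimestamp === null ||
      nowMs - this.lastSuccessTimestamp > this.maxSilenceMs;
    if (this.consecutiveErrors > 0 || stale) {
      return { state: 'DEGRADED', reason: this.lastError || 'no recent successful run' };
    }
    return { state: 'OK', reason: null };
  }

  // True when a poll's newest item is older than what was already seen.
  regressed(latestPublishedAt) {
    return this.watermark !== null &&
      Date.parse(latestPublishedAt) < Date.parse(this.watermark);
  }
}

const health = new CollectorHealth('github-releases');
const t0 = Date.parse('2024-01-01T00:00:00Z');
health.recordSuccess(t0, t0 + 2000, '2023-12-30T10:00:00Z');
console.log(health.status(t0 + 5000));
health.recordError('GitHub API: 403 rate limit');
console.log(health.status(t0 + 6000));
```

A collector that has never succeeded reports DEGRADED, so a freshly deployed pipeline cannot produce a trusted empty digest by accident.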
Example: assembling a weekly digest via GitHub Actions
A practical flow: scheduled GitHub Action runs a GraphQL aggregator for a curated list of repos, writes a digest file into a repo, and opens a PR. Requirements:
- Include a sentinel that records collector health (OK/DEGRADED) in the PR.
- Only publish the digest when at least one new authoritative item is present or when the collectors report degraded state (so emptiness is explainable).
- Include delivery/request IDs for each observed item so reviewers can cross-check.
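A minimal sketch of the digest step, assuming items already follow the canonical event shape and that collector health arrives as an object with state and reason fields. The function name buildDigest, the output wording, and the sample item are illustrative assumptions.

```javascript
// Hypothetical digest builder: returns the digest body and a publish decision
// following the rule above (publish on new items, or when collectors are degraded).
function buildDigest(items, health) {
  const degraded = health.state !== 'OK';
  const sentinel = `Collector health: ${health.state}` +
    (health.reason ? ` (${health.reason})` : '');
  const lines = [sentinel, ''];
  if (items.length === 0) {
    lines.push(degraded
      ? 'No items collected. Collection was degraded, so this absence is not reliable.'
      : 'No new authoritative items.');
  }
  for (const item of items) {
    const deliveryId = item.provenance.delivery_id || 'n/a';
    lines.push(`- ${item.title} (${item.canonical_url}) [delivery: ${deliveryId}]`);
  }
  return { publish: items.length > 0 || degraded, body: lines.join('\n') };
}

// Made-up sample item for illustration.
const sampleItems = [{
  title: 'Example 1.2.0',
  canonical_url: 'https://github.com/example-org/example-repo/releases/tag/v1.2.0',
  provenance: { delivery_id: 'delivery-123' },
}];

const digest = buildDigest(sampleItems, { state: 'OK', reason: null });
console.log(digest.body);
```

In a scheduled workflow, the body would become the digest file in the PR and the publish flag would decide whether the PR is opened at all.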
What to implement first (practical checklist)
- Add webhooks for repositories you control and implement signature verification with raw-body HMAC.
- Build a small GraphQL poller for repos you cannot webhook; add health metrics and backoff.
- Normalize detected items into a canonical event shape and persist to an append-only store.
- Add deterministic enrichment (tag->sha, impact scoring) and index results.
- Surface collector health in the digest or IDP and treat absence of items as actionable telemetry.
Conservative integration into IDPs
Only surface items that meet a confidence threshold (authoritative source + collector health OK). For lower-confidence items (scraped HTML), mark as unverified and queue for human review before promoting into Backstage components or dashboards.
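The gating rule can be expressed as a small pure function. This sketch assumes each event carries a provenance.method field ("webhook", "api", "feed", or "scrape"), which is not part of the canonical shape described earlier and is introduced here only for illustration; treating webhook and API sources as authoritative, and sending authoritative items from a degraded collector to review, are likewise assumptions.

```javascript
// Assumption: only webhook- and API-sourced events count as authoritative.
const AUTHORITATIVE_METHODS = new Set(['webhook', 'api']);

// Decide whether an event may be surfaced in a high-trust channel.
function triage(event, collectorState) {
  const authoritative = AUTHORITATIVE_METHODS.has(event.provenance.method);
  const surface = authoritative && collectorState === 'OK';
  return {
    surface,                     // safe to show in the IDP
    unverified: !authoritative,  // label shown next to lower-confidence items
    needsReview: !surface,       // queue for a human before promotion
  };
}

const scraped = triage({ provenance: { method: 'scrape' } }, 'OK');
console.log(scraped);
```

Keeping the decision in one function makes the threshold easy to audit and to tighten later without touching the collectors.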
Summary
Deterministic release detection reduces missed updates and restores trust in digests and IDP feeds. Prefer push (webhooks), use GraphQL for efficient aggregation, record verifiable provenance, and instrument collector health so an empty digest is operationally meaningful.