Scraping resilience, metadata tooling, and repository hygiene

Consolidates mosaic and session hardening (login retry, skip processed scans, no retry on 404, started_at), progress reporting (Markdown tables, by-year rollup, rolling-window rate/ETA), and metadata workflow scripts (run_metadata_scan.sh, scan_progress_report.py, export_machine_metadata.py). Adds mosaic reconstruction sample JPEGs referenced by the report. Updates .gitignore for backup/ and .claude/; sample_random_scans helper is documented for branch testing/sample-runs only (see README).
2026-05-14 19:52:53 -04:00
parent 752c278dff
commit 6390f5d529
23 changed files with 788 additions and 188 deletions
@@ -14,6 +14,7 @@ from bs4 import BeautifulSoup

 from spruce.download_result import (
    OK,
+    PERMANENT_MISSING,
    UNKNOWN,
    DownloadResult,
    classify_http_error,
@@ -263,6 +264,10 @@ class MachineSession:
                    and exc.response is not None
                ):
                    sc = exc.response.status_code
+                cl = classify_http_error(sc, exc)
+                if cl == PERMANENT_MISSING:
+                    # 404/410 will never succeed — don't waste time retrying.
+                    return DownloadResult(0, sc, str(exc), cl)
                if attempt < retries:
                    log.warning(
                        "Attempt %d/%d failed %s: %s — retrying in %.0fs",
@@ -281,7 +286,6 @@ class MachineSession:
                        url,
                        exc,
                    )
-                    cl = classify_http_error(sc, exc)
                    return DownloadResult(0, sc, str(exc), cl)
        return DownloadResult(0, None, "download_file: exhausted", UNKNOWN)