Add --metadata-only mode; harden resume and idempotency

- Add --metadata-only flag: fetches scan detail pages, writes metadata.json + scans.csv rows, skips all image downloads. Re-runs skip scans whose metadata.json already exists.
- Atomic progress.json saves (temp-file rename).
- Heal-on-resume: tiles on disk but not in progress are silently re-marked before building the pending list.
- scans.csv dedup: skip row if mosaic URL already in progress.
- Rename mosaic_downloaded -> mosaic_on_disk (reflects disk state).
- --recheck now checks mosaics as well as tiles.
- RunStats dataclass replaces raw int return; richer run summary.
- Fix argparse allow_abbrev reverted; fix --scan-id + --metadata-only glob fallback when scan_time is absent.
- Add .venv/ to .gitignore.
- README: fix typo, update worker counts, document all new behaviour.
@@ -42,15 +42,15 @@ A full-tube scan covers a 310 mm × 740 mm cylinder at 3.01 × 2.26 mm steps, pr

 ### Download speed

-Tile downloads are server-limited: the RootView PHP backend renders tiles on-demand, sustaining ~**0.67 tiles/sec** with 8 parallel workers regardless of local bandwidth. Mosaics are pre-rendered and download ~20× faster per MB.
+Tile downloads are server-limited: the RootView PHP backend renders tiles on-demand, sustaining ~**0.67 tiles/sec** with 4 parallel workers regardless of local bandwidth. Mosaics are pre-rendered and download ~20× faster per MB.

 | Scenario | Estimated time |
 |---|---|
 | All mosaics (4 workers) | ~3 months |
-| Full tiles for one scan (8 workers) | ~14 hours |
+| Full tiles for one scan (4 workers) | ~14 hours |
 | All tiles, full-tube machines only | Years — not recommended |

-**Recommended approach:** archive mosaics first (`--mosaic-only`), then selectively download tiles for priority scans.
+**Recommended approach:** inventory all scans first (`--metadata-only`, ~80 hours serial or ~7 hours if machines run in parallel), then archive mosaics (`--mosaic-only`), then selectively download tiles for priority scans.
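
The ~14 hour estimate for one scan can be cross-checked from numbers already quoted in this README. The sketch below is a back-of-envelope calculation, not code from `scraper.py`. It assumes the grid size along each axis is the extent divided by the step, rounded up, which reproduces the 33,784 tiles quoted for a full-tube scan. The helper names are made up for illustration.

```python
import math


def grid_size(extent_mm: float, step_mm: float) -> int:
    """Grid positions needed to cover extent_mm at step_mm spacing (assumed: round up)."""
    return math.ceil(extent_mm / step_mm)


def hours_for_tiles(tile_count: int, tiles_per_sec: float) -> float:
    """Wall-clock hours to fetch tile_count tiles at a sustained server rate."""
    return tile_count / tiles_per_sec / 3600


# Full-tube geometry from the README: 310 mm x 740 mm at 3.01 x 2.26 mm steps.
n_short = grid_size(310, 3.01)   # 103 positions along the 310 mm axis
n_long = grid_size(740, 2.26)    # 328 positions along the 740 mm axis
total_tiles = n_short * n_long   # 33,784 tiles

# Sustained rate quoted in the README: ~0.67 tiles/sec.
hours = hours_for_tiles(total_tiles, 0.67)
print(f"{total_tiles} tiles -> about {hours:.1f} hours")
```

Because the rate is limited by the server rather than local bandwidth, the estimate scales linearly with tile count.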

 ---

@@ -58,7 +58,7 @@ Tile downloads are server-limited: the RootView PHP backend renders tiles on-dem

```bash
 # 1. Clone / download this repo
-cd spruce_scrapper
+cd spruce_scraper

 # 2. Install dependencies (Python 3.10+)
 pip install -r requirements.txt
@@ -84,6 +84,10 @@ python scraper.py --list-scans --machine "BW3-20 [AMR-26]"
 # Preview what would be downloaded (dry run)
 python scraper.py --machine "BW3-20 [AMR-26]" --dry-run

+# Inventory scan parameters only (no images downloaded) — very fast
+python scraper.py --metadata-only
+python scraper.py --machine "BW3-20 [AMR-26]" --metadata-only
+
 # Download mosaics only for one machine
 python scraper.py --machine "BW3-20 [AMR-26]" --mosaic-only

@@ -103,11 +107,12 @@ python scraper.py --machine "BW3-20 [AMR-26]" --scan-id 158374 --workers 4
 |---|---|
 | `--config FILE` | Config file path (default: `config.yaml`) |
 | `--machine LABEL` | Restrict to one machine, e.g. `"BW3-20 [AMR-26]"` |
-| `--scan-id ID` | Download only this scan (use with `--machine`) |
+| `--scan-id ID` | Restrict to one scan ID (use with `--machine`; works with all modes) |
 | `--mosaic-only` | Download mosaics only; skip individual tiles |
+| `--metadata-only` | Fetch scan parameters only; write `metadata.json` + `scans.csv` rows, skip all images. Re-runs skip scans whose `metadata.json` already exists |
 | `--dry-run` | Print what would be downloaded without saving |
 | `--workers N` | Parallel download threads (default: 2, hard cap: 4) |
-| `--recheck` | Scan archive for zero-byte/missing tiles and remove them from `.progress.json` so they re-download on next run |
+| `--recheck` | Scan archive for zero-byte/missing tiles and mosaics; remove bad entries from `.progress.json` so they re-download on next run |
 | `--list-machines` | Print all machines and exit |
 | `--list-scans` | Print all scans for `--machine` and exit |
 | `--verbose` / `-v` | Debug logging |
@@ -128,7 +133,7 @@ archives/
 ├── metadata.json # full scan parameters (grid, timestamps, etc.)
 ├── mosaic.jpg # pre-stitched full image (~16 MB)
 └── tiles/
-├── tile_r000_c000.jpg # row 0, column 0
+├── tile_r000_c000.jpg # row 0, column 0 (zero-padding matches grid size)
 ├── tile_r000_c001.jpg
 └── ... # 33,784 tiles total for a full-tube scan
```
@@ -137,10 +142,14 @@ Tile filenames encode position: `tile_r{row}_c{col}.jpg` where row increases wit

 ### Metadata files

-**`scans.csv`** columns: `machine`, `machine_id`, `scan_id`, `name`, `scan_time`, `start_x`, `start_y`, `end_x`, `end_y`, `dx`, `dy`, `nx`, `ny`, `total_tiles`, `scan_lines`, `scan_mode`, `start_datetime`, `end_datetime`, `status`, `user`, `disk_space_mb`, `mosaic_url`, `mosaic_local_path`, `mosaic_downloaded`
+**`scans.csv`** columns: `machine`, `machine_id`, `scan_id`, `name`, `scan_time`, `start_x`, `start_y`, `end_x`, `end_y`, `dx`, `dy`, `nx`, `ny`, `total_tiles`, `scan_lines`, `scan_mode`, `start_datetime`, `end_datetime`, `status`, `user`, `disk_space_mb`, `mosaic_url`, `mosaic_local_path`, `mosaic_on_disk`
+
+- `mosaic_on_disk`: `True` if `mosaic.jpg` exists on disk at row-write time, regardless of which run downloaded it. Useful for inventory — reflects actual archive state rather than what happened in the current run.

 **`tiles.csv`** columns: `machine`, `machine_id`, `scan_id`, `scan_time`, `row_index`, `col_index`, `x_mm`, `y_mm`, `url`, `local_path`, `downloaded_at`, `file_size_bytes`
+
+- `downloaded_at`: ISO 8601 UTC timestamp of when the tile was fetched. Empty if the download failed.
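
One way to use `mosaic_on_disk` for inventory is to list the scans whose mosaic has not been archived yet. The sketch below is illustrative only. It assumes the column is serialised as the literal strings `True` and `False`, and the helper name and the second scan ID in the sample data are made up.

```python
import csv
import io
from typing import Iterable


def scans_missing_mosaic(lines: Iterable[str]) -> list[str]:
    """Return the scan_id of every row whose mosaic_on_disk value is not 'True'."""
    reader = csv.DictReader(lines)
    return [
        row["scan_id"]
        for row in reader
        if row["mosaic_on_disk"].strip() != "True"  # assumed string encoding
    ]


# Tiny in-memory stand-in for archives/scans.csv (only the columns used here).
sample = io.StringIO(
    "machine,scan_id,mosaic_on_disk\n"
    "BW3-20 [AMR-26],158374,True\n"
    "BW3-20 [AMR-26],158375,False\n"
)
missing = scans_missing_mosaic(sample)
print(missing)
```

Against a real archive the same function can be fed an open file, for example `open("archives/scans.csv", newline="", encoding="utf-8")`.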

 ---

 ## Site structure (RootView)
@@ -161,20 +170,47 @@ Grid coordinates (X, Y) are in millimetres, starting from `(start_x, start_y)` w

 ## Resume and reliability

-- **Resumable**: `.progress.json` records every completed URL. Re-running the same command skips already-downloaded files.
+- **Resumable**: `.progress.json` records every completed URL. Re-running the same command skips already-downloaded files. `--metadata-only` re-runs additionally skip any scan whose `metadata.json` already exists on disk — no HTTP request is made.
+- **Atomic progress saves**: `.progress.json` is written via a temp-file rename, so a crash mid-save never produces a corrupt or empty progress file.
+- **Heal on resume**: at the start of each scan's tile pass, any tile file that exists on disk but isn't recorded in progress is silently re-marked as complete, preventing duplicate `tiles.csv` rows and redundant re-downloads.
 - **Retry logic**: each tile download retries up to 3 times with exponential backoff (5 s → 10 s → 20 s) before logging a warning and moving on.
-- **Worker cap**: the RootView server renders tiles on a single-threaded PHP process. Running more than 4 concurrent requests causes cascading read timeouts. The default is 2 workers; the scraper hard-caps at 4 and warns loudly if you try to exceed it.
-- **Crash recovery**: if a run is killed mid-flight, some in-progress tiles may have been written as zero-byte files without being marked complete. Run `--recheck` before resuming — it deletes zero-byte files on disk and removes their URLs from `.progress.json` so they are cleanly re-downloaded.
+- **Worker cap**: the RootView server renders tiles on a single-threaded PHP process. Running more than 4 concurrent requests causes cascading timeouts. The default is 2 workers; the scraper hard-caps at 4 and warns if you try to exceed it.
+- **Crash recovery**: run `--recheck` to find and remove zero-byte or missing tile and mosaic files from `.progress.json` so they are cleanly re-downloaded on the next run.
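
The retry schedule in the list above can be read as one initial attempt followed by up to three retries, with waits of 5 s, 10 s and 20 s. The sketch below shows that reading. It is not the scraper's actual code: the function name, the `OSError` catch and the injectable `sleep` parameter are assumptions made to keep the example self-contained and fast to run.

```python
import time
from typing import Callable, Optional


def fetch_with_backoff(
    fetch: Callable[[], bytes],
    retries: int = 3,
    base_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[bytes]:
    """Call fetch(); on OSError wait base_delay * 2**attempt and retry, up to `retries` times."""
    for attempt in range(retries + 1):
        try:
            return fetch()
        except OSError as exc:
            if attempt == retries:
                # Out of retries: log a warning and let the caller move on.
                print(f"warning: giving up after {retries} retries: {exc}")
                return None
            sleep(base_delay * 2 ** attempt)  # 5 s, 10 s, 20 s with the defaults
    return None


# Demo 1: a fetch that fails twice, then succeeds. Delays are recorded, not slept.
delays: list[float] = []
calls = {"n": 0}


def flaky_fetch() -> bytes:
    calls["n"] += 1
    if calls["n"] <= 2:
        raise TimeoutError("simulated read timeout")
    return b"tile-bytes"


data = fetch_with_backoff(flaky_fetch, sleep=delays.append)


# Demo 2: a fetch that never succeeds uses the whole schedule and returns None.
def dead_fetch() -> bytes:
    raise ConnectionError("simulated outage")


dead_delays: list[float] = []
result = fetch_with_backoff(dead_fetch, sleep=dead_delays.append)
```

Passing `sleep` as a parameter keeps the schedule testable without real waiting.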

 ```bash
-# After any interrupted run, always do this first:
+# After a hard crash, optionally run recheck before resuming:
 python scraper.py --recheck
-# Then resume normally:
+# Then resume normally — the scraper picks up where it left off:
 python scraper.py --machine "BW3-20 [AMR-26]" --scan-id 158374
 ```
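
The atomic-save and heal-on-resume behaviour described in this section can be sketched as follows. This is a minimal illustration, not the scraper's implementation. It assumes the progress file is a JSON list of completed URLs, and it treats zero-byte files as not healable (an extra assumption, chosen to match what `--recheck` considers bad). All function names and the `example.invalid` URLs are hypothetical.

```python
import json
import os
import tempfile
from pathlib import Path


def save_progress_atomic(path: Path, done_urls: set[str]) -> None:
    """Write to a temp file in the same directory, then rename it over the target."""
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(sorted(done_urls), tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # push bytes to disk before the rename
        os.replace(tmp_name, path)  # readers see the old file or the new one, never a partial
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise


def heal_progress(done_urls: set[str], url_to_path: dict[str, Path]) -> int:
    """Mark tiles that exist on disk (non-empty) but are missing from progress."""
    healed = 0
    for url, tile_path in url_to_path.items():
        if url not in done_urls and tile_path.is_file() and tile_path.stat().st_size > 0:
            done_urls.add(url)
            healed += 1
    return healed


# Demo in a throwaway directory: one good tile, one zero-byte tile, empty progress.
with tempfile.TemporaryDirectory() as tmp_dir:
    root = Path(tmp_dir)
    good = root / "tile_r000_c000.jpg"
    good.write_bytes(b"fake jpeg bytes")
    empty = root / "tile_r000_c001.jpg"
    empty.write_bytes(b"")

    expected = {
        "https://example.invalid/t0": good,
        "https://example.invalid/t1": empty,
    }
    done: set[str] = set()
    healed = heal_progress(done, expected)

    progress_file = root / ".progress.json"
    save_progress_atomic(progress_file, done)
    reloaded = json.loads(progress_file.read_text(encoding="utf-8"))
    leftovers = [p.name for p in root.iterdir() if p.name.endswith(".tmp")]
```

Creating the temp file in the same directory as the target matters: a rename is only atomic when source and destination are on the same filesystem.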

 ---

+## Run summary
+
+Every run prints a summary table on completion:
+
+```
+──────────────────────────────────────────────
+Run complete
+──────────────────────────────────────────────
+Machines: 1
+Scans fetched: 428 (2 already cached, 0 failed)
+Metadata written: 428 (new JSON files)
+──────────────────────────────────────────────
+Scans CSV: archives/scans.csv
+Progress: archives/.progress.json
+──────────────────────────────────────────────
+```
+
+- **Scans fetched**: metadata detail page was retrieved from the server this run.
+- **Already cached**: `metadata.json` already existed on disk; no HTTP request was made.
+- **Failed**: fetch error or scan missing required grid parameters.
+- **Metadata written**: new `metadata.json` files created (shown in `--metadata-only` mode).
+- Mosaic and tile counts appear in their respective modes.
+
+---
+
 ## Dependencies

 | Package | Purpose |