Kolpack-Software-Consulting/SPRUCE-scraper

poprhythm f2193011ca Add --metadata-only mode; harden resume and idempotency

- Add --metadata-only flag: fetches scan detail pages, writes
  metadata.json + scans.csv rows, skips all image downloads.
  Re-runs skip scans whose metadata.json already exists.
- Atomic progress.json saves (temp-file rename).
- Heal-on-resume: tiles on disk but not in progress are silently
  re-marked before building the pending list.
- scans.csv dedup: skip row if mosaic URL already in progress.
- Rename mosaic_downloaded -> mosaic_on_disk (reflects disk state).
- --recheck now checks mosaics as well as tiles.
- RunStats dataclass replaces raw int return; richer run summary.
- Fix argparse allow_abbrev reverted; fix --scan-id + --metadata-only
  glob fallback when scan_time is absent.
- Add .venv/ to .gitignore.
- README: fix typo, update worker counts, document all new behaviour.

2026-04-24 09:44:57 -04:00


Spruce Minirhizotron Scraper

A Python tool for archiving image data collected by minirhizotron cameras at the SPRUCE experiment site. It authenticates against the RootView web interface, enumerates all scans across all 12 camera machines, and downloads image tiles and mosaics to a structured local archive with full metadata.

Background

Minirhizotron cameras are inserted into clear tubes buried in the ground to image root systems non-destructively over time. This project archives data from the SPRUCE (Spruce and Peatland Responses Under Changing Environments) experiment, which monitors boreal peatland responses to warming and elevated CO₂.

The 12 AMR camera machines (BW1-4 through BW3-21) are managed by a RootView web application at http://205.149.147.131:8010. Each scan captures a grid of overlapping image tiles along a buried tube. The server also pre-renders a full stitched mosaic for each scan.

Archive inventory (as of April 2026)

| Machine | Scans | Scan type (sampled) |
| --- | --- | --- |
| BW1-4 [AMR-15] | 6,121 | Mixed (full-tube + partial) |
| BW1-6 [AMR-19] | 18,198 | Full-tube (~33,784 tiles, ~1.7 GB each) |
| BW1-7 [AMR-18] | 430 | Full-tube (~33,784 tiles, ~1.8 GB each) |
| BW2-8 [AMR-25] | 8,191 | Partial (~400 tiles, ~10 MB each) |
| BW2-10 [AMR-22] | 16,537 | Not yet sampled |
| BW2-11 [AMR-23] | 26,763 | Not yet sampled |
| BW2-13 [AMR-24] | 13,537 | Not yet sampled |
| BW3-16 [AMR-16] | 7,325 | Not yet sampled |
| BW3-17 [AMR-20] | 471 | Not yet sampled |
| BW3-19 [AMR-21] | 15,186 | Not yet sampled |
| BW3-20 [AMR-26] | 23,052 | Full-tube (~33,784 tiles, ~1.95 GB each) |
| BW3-21 [AMR-17] | 10,115 | Not yet sampled |
| Total | 145,926 | |

Storage estimates

| What | Size | Notes |
| --- | --- | --- |
| Mosaics only | ~2.4 TB | 145,926 × 16.6 MB per mosaic |
| Full tiles (mixed scans) | ~160 TB | Assumes 40% full-tube, 60% partial |
| Full tiles (worst case) | ~368 TB | If all scans are full-tube |

A full-tube scan covers a 310 mm × 740 mm cylinder at 3.01 × 2.26 mm steps, producing a 103 × 328 = 33,784 tile grid. Each tile is ~79 KB on average (JPEG, 137 KB at the tube surface).
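The tile count above can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes the grid dimensions are obtained by rounding each extent up to a whole number of steps; `grid_size` is a hypothetical helper, not a function from `scraper.py`.

```python
import math


def grid_size(extent_x_mm, extent_y_mm, dx_mm, dy_mm):
    """Return (nx, ny): steps needed to cover each extent, rounded up.

    Rounding up is an assumption of this sketch; the scraper may instead
    read nx and ny directly from the scan detail page.
    """
    nx = math.ceil(extent_x_mm / dx_mm)
    ny = math.ceil(extent_y_mm / dy_mm)
    return nx, ny


# Figures quoted in the README for a full-tube scan.
nx, ny = grid_size(310, 740, 3.01, 2.26)
total_tiles = nx * ny
print(nx, ny, total_tiles)  # 103 328 33784
```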

Download speed

Tile downloads are server-limited: the RootView PHP backend renders tiles on demand and sustains ~0.67 tiles/sec with 4 parallel workers, regardless of local bandwidth. Mosaics are pre-rendered and download ~20× faster per MB.

| Scenario | Estimated time |
| --- | --- |
| All mosaics (4 workers) | ~3 months |
| Full tiles for one scan (4 workers) | ~14 hours |
| All tiles, full-tube machines only | Years (not recommended) |

Recommended approach: inventory all scans first (--metadata-only, ~80 hours serial or ~7 hours if machines run in parallel), then archive mosaics (--mosaic-only), then selectively download tiles for priority scans.
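The per-scan tile estimate follows directly from the sustained rate quoted above. A minimal sketch, using only the README's own figures; `hours_to_download` is a hypothetical helper, and the parallel metadata figure assumes the work splits evenly across the 12 machines, which the uneven scan counts in the inventory table suggest is optimistic.

```python
def hours_to_download(tile_count, tiles_per_sec=0.67):
    """Hours needed to fetch tile_count tiles at a sustained server rate."""
    return tile_count / tiles_per_sec / 3600


# One full-tube scan at the README's ~0.67 tiles/sec (4 workers).
full_tube_hours = hours_to_download(33_784)
print(f"{full_tube_hours:.1f} h")  # 14.0 h

# Metadata inventory: ~80 serial hours split evenly over 12 machines.
parallel_metadata_hours = 80 / 12
print(f"{parallel_metadata_hours:.1f} h")  # 6.7 h
```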

Setup

# 1. Clone / download this repo
cd spruce_scraper

# 2. Install dependencies (Python 3.10+)
pip install -r requirements.txt

# 3. Configure credentials
cp config.example.yaml config.yaml
# Edit config.yaml: set username and password

config.yaml is gitignored and never committed.

Usage

# List all available machines (no login needed)
python scraper.py --list-machines

# List all scans for a machine
python scraper.py --list-scans --machine "BW3-20 [AMR-26]"

# Preview what would be downloaded (dry run)
python scraper.py --machine "BW3-20 [AMR-26]" --dry-run

# Inventory scan parameters only (no images downloaded); far faster than image downloads
python scraper.py --metadata-only
python scraper.py --machine "BW3-20 [AMR-26]" --metadata-only

# Download mosaics only for one machine
python scraper.py --machine "BW3-20 [AMR-26]" --mosaic-only

# Download mosaics for all machines
python scraper.py --mosaic-only

# Download all tiles for a specific scan
python scraper.py --machine "BW3-20 [AMR-26]" --scan-id 158374 --workers 4

# Resume an interrupted download (automatically skips completed files)
python scraper.py --machine "BW3-20 [AMR-26]" --scan-id 158374 --workers 4

All options

| Flag | Description |
| --- | --- |
| `--config FILE` | Config file path (default: `config.yaml`) |
| `--machine LABEL` | Restrict to one machine, e.g. `"BW3-20 [AMR-26]"` |
| `--scan-id ID` | Restrict to one scan ID (use with `--machine`; works with all modes) |
| `--mosaic-only` | Download mosaics only; skip individual tiles |
| `--metadata-only` | Fetch scan parameters only; write `metadata.json` + `scans.csv` rows, skip all images. Re-runs skip scans whose `metadata.json` already exists |
| `--dry-run` | Print what would be downloaded without saving |
| `--workers N` | Parallel download threads (default: 2, hard cap: 4) |
| `--recheck` | Scan archive for zero-byte/missing tiles and mosaics; remove bad entries from `.progress.json` so they re-download on next run |
| `--list-machines` | Print all machines and exit |
| `--list-scans` | Print all scans for `--machine` and exit |
| `--verbose` / `-v` | Debug logging |

Output layout

archives/
├── .progress.json              # tracks completed URLs for resume support
├── scans.csv                   # scan-level metadata for every processed scan
├── tiles.csv                   # tile-level metadata for every downloaded tile
│
└── BW3-20__AMR-26/
    └── 2024-07-29/
        └── 158374/
            ├── metadata.json   # full scan parameters (grid, timestamps, etc.)
            ├── mosaic.jpg      # pre-stitched full image (~16 MB)
            └── tiles/
                ├── tile_r000_c000.jpg   # row 0, column 0 (zero-padding matches grid size)
                ├── tile_r000_c001.jpg
                └── ...                 # 33,784 tiles total for a full-tube scan

Tile filenames encode position: tile_r{row}_c{col}.jpg where row increases with depth (Y in mm) and column increases along the tube circumference (X in mm).
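For downstream scripts that walk the archive, the naming scheme can be generated and parsed as follows. A minimal sketch: `tile_filename` and `parse_tile_filename` are hypothetical helpers, and the fixed three-digit padding is an assumption taken from the example layout above (the scraper sizes the padding to the grid).

```python
import re

# Matches names such as tile_r000_c001.jpg and captures row and column.
TILE_NAME = re.compile(r"^tile_r(\d+)_c(\d+)\.jpg$")


def tile_filename(row, col, width=3):
    """Build a tile filename; width=3 is assumed from the example layout."""
    return f"tile_r{row:0{width}d}_c{col:0{width}d}.jpg"


def parse_tile_filename(name):
    """Return (row, col) for a tile filename, or None if it does not match."""
    match = TILE_NAME.match(name)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


print(tile_filename(0, 1))                        # tile_r000_c001.jpg
print(parse_tile_filename("tile_r327_c102.jpg"))  # (327, 102)
```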

Metadata files

scans.csv columns: machine, machine_id, scan_id, name, scan_time, start_x, start_y, end_x, end_y, dx, dy, nx, ny, total_tiles, scan_lines, scan_mode, start_datetime, end_datetime, status, user, disk_space_mb, mosaic_url, mosaic_local_path, mosaic_on_disk

mosaic_on_disk: True if mosaic.jpg exists on disk at row-write time, regardless of which run downloaded it. Useful for inventory — reflects actual archive state rather than what happened in the current run.

tiles.csv columns: machine, machine_id, scan_id, scan_time, row_index, col_index, x_mm, y_mm, url, local_path, downloaded_at, file_size_bytes

downloaded_at: ISO 8601 UTC timestamp of when the tile was fetched. Empty if the download failed.
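As an example of using `scans.csv` for inventory, the sketch below lists scans whose mosaic is not yet on disk. It assumes `mosaic_on_disk` is serialised as the text `True` or `False`; the sample rows are made up for illustration and trimmed to three columns, and `scans_missing_mosaic` is a hypothetical helper.

```python
import csv
import io

# Illustrative rows only: invented values, trimmed to three of the columns.
SAMPLE = """machine,scan_id,mosaic_on_disk
BW3-20 [AMR-26],158374,True
BW3-20 [AMR-26],158375,False
"""


def scans_missing_mosaic(csv_file):
    """Return scan IDs whose mosaic_on_disk column is not 'True'."""
    reader = csv.DictReader(csv_file)
    return [
        row["scan_id"]
        for row in reader
        if row["mosaic_on_disk"].strip().lower() != "true"
    ]


missing = scans_missing_mosaic(io.StringIO(SAMPLE))
print(missing)  # ['158375']
```

Against a real archive, pass `open("archives/scans.csv", newline="", encoding="utf-8")` instead of the in-memory sample.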

Site structure (RootView)

The RootView interface runs on a standard PHP stack. Key endpoints discovered:

| Endpoint | Description |
| --- | --- |
| `POST index.php` | Login (`RTLLogin=1`, `RTLNAME`, `RTLUSER`, `RTLPWD`) |
| `POST index.php {cmd:scan, start:N, FilterCount:320}` | Paginated scan list |
| `GET index.php?cmd=scan&mode=view&id=ID` | Scan detail (grid params, disk usage) |
| `GET index.php?cmd=image&mode=image_scan&id=ID&s=1&x=X&y=Y` | Individual tile JPEG |
| `GET http://<host>:8011/RootView_Database/ID/mosaic.jpg` | Pre-stitched mosaic |

Grid coordinates (X, Y) are in millimetres, starting from (start_x, start_y) with step (dx, dy).
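Combining the tile endpoint with the grid rule gives the URL for any tile. A minimal sketch under stated assumptions: `tile_url` is a hypothetical helper, `index.php` is assumed to sit at the server root, columns are assumed to map to X and rows to Y (as in the filename convention), and rounding to two decimals is a guess at the number format the server accepts.

```python
from urllib.parse import urlencode

# Assumed location of index.php on the RootView host named in this README.
BASE_URL = "http://205.149.147.131:8010/index.php"


def tile_url(scan_id, col, row, start_x, start_y, dx, dy):
    """Build the tile URL for grid cell (col, row) of a scan."""
    # Millimetre position: start offset plus a whole number of steps.
    x = round(start_x + col * dx, 2)
    y = round(start_y + row * dy, 2)
    query = urlencode(
        {"cmd": "image", "mode": "image_scan", "id": scan_id, "s": 1, "x": x, "y": y}
    )
    return f"{BASE_URL}?{query}"


print(tile_url(158374, col=1, row=2, start_x=0.0, start_y=0.0, dx=3.01, dy=2.26))
```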

Resume and reliability

- Resumable: .progress.json records every completed URL. Re-running the same command skips already-downloaded files. --metadata-only re-runs additionally skip any scan whose metadata.json already exists on disk, without making an HTTP request.
- Atomic progress saves: .progress.json is written via a temp-file rename, so a crash mid-save never produces a corrupt or empty progress file.
- Heal on resume: at the start of each scan's tile pass, any tile file that exists on disk but isn't recorded in progress is silently re-marked as complete, preventing duplicate tiles.csv rows and redundant re-downloads.
- Retry logic: each tile download retries up to 3 times with exponential backoff (5 s → 10 s → 20 s) before logging a warning and moving on.
- Worker cap: the RootView server renders tiles on a single-threaded PHP process. Running more than 4 concurrent requests causes cascading timeouts. The default is 2 workers; the scraper hard-caps at 4 and warns if you try to exceed it.
- Crash recovery: run --recheck to find and remove zero-byte or missing tile and mosaic files from .progress.json so they are cleanly re-downloaded on the next run.
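Two of the techniques above, the temp-file rename and the retry schedule, can be sketched generically. This is not the scraper's actual code: `save_progress_atomic` and `fetch_with_retry` are hypothetical names, and the retry helper reads "up to 3 retries" as one initial attempt followed by waits of 5 s, 10 s and 20 s.

```python
import json
import os
import tempfile
import time


def save_progress_atomic(path, progress):
    """Write progress as JSON to a temp file, then rename it over path."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live in the same directory so the rename is atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(progress, fh)
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp_path, path)  # readers see the old or new file, never half
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def fetch_with_retry(fetch, retries=3, first_delay=5.0, sleep=time.sleep):
    """Call fetch(); on failure wait and retry, doubling the delay each time.

    Returns None once all retries are exhausted, so the caller can log a
    warning and move on to the next tile.
    """
    delay = first_delay
    for attempt in range(retries + 1):
        try:
            return fetch()
        except Exception:
            if attempt == retries:
                return None
            sleep(delay)
            delay *= 2
```

Passing `sleep` as a parameter keeps the backoff schedule testable without real waiting.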

# After a hard crash, optionally run recheck before resuming:
python scraper.py --recheck
# Then resume normally — the scraper picks up where it left off:
python scraper.py --machine "BW3-20 [AMR-26]" --scan-id 158374
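Conceptually, a recheck pass drops every progress entry whose file is missing or empty. The sketch below assumes, purely for illustration, that progress is a mapping from URL to local file path; the real `.progress.json` layout is not documented here, and `drop_bad_entries` is a hypothetical name.

```python
import os


def drop_bad_entries(progress):
    """Remove entries whose file is missing or zero bytes; return their URLs.

    `progress` is assumed to map URL -> local path (an assumption of this
    sketch, not the documented .progress.json format).
    """
    bad = [
        url
        for url, path in progress.items()
        if not os.path.isfile(path) or os.path.getsize(path) == 0
    ]
    for url in bad:
        del progress[url]
    return bad
```

Entries removed this way are no longer treated as complete, so the next normal run downloads them again.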

Run summary

Every run prints a summary table on completion:

──────────────────────────────────────────────
  Run complete
──────────────────────────────────────────────
  Machines:             1
  Scans fetched:        428  (2 already cached, 0 failed)
  Metadata written:     428  (new JSON files)
──────────────────────────────────────────────
  Scans CSV:            archives/scans.csv
  Progress:             archives/.progress.json
──────────────────────────────────────────────

- Scans fetched: metadata detail page was retrieved from the server this run.
- Already cached: metadata.json already existed on disk; no HTTP request was made.
- Failed: fetch error or scan missing required grid parameters.
- Metadata written: new metadata.json files created (shown in --metadata-only mode).
- Mosaic and tile counts appear in their respective modes.

Dependencies

| Package | Purpose |
| --- | --- |
| `requests` | HTTP client |
| `beautifulsoup4` + `lxml` | HTML parsing |
| `pyyaml` | Config file |
| `tqdm` | Progress bars |
