
# Spruce Minirhizotron Scraper

A Python tool for archiving image data collected by minirhizotron cameras at the SPRUCE experiment site. It authenticates against the RootView web interface, enumerates all scans across all 12 camera machines, and downloads image tiles and mosaics to a structured local archive with full metadata.

---

## Background

[Minirhizotron cameras](https://en.wikipedia.org/wiki/Minirhizotron) are inserted into clear tubes buried in the ground to image root systems non-destructively over time. This project archives data from the **SPRUCE** (Spruce and Peatland Responses Under Changing Environments) experiment, which monitors boreal peatland responses to warming and elevated CO₂.

The 12 AMR camera machines (`BW1-4` through `BW3-21`) are managed by a **RootView** web application at `http://205.149.147.131:8010`. Each scan captures a grid of overlapping image tiles along a buried tube. The server also pre-renders a full stitched mosaic for each scan.

---

## Archive inventory (as of April 2026)

| Machine | Scans | Scan type (sampled) |
|---|---:|---|
| BW1-4 [AMR-15] | 6,121 | Mixed (full-tube + partial) |
| BW1-6 [AMR-19] | 18,198 | Full-tube (~33,784 tiles, ~1.7 GB each) |
| BW1-7 [AMR-18] | 430 | Full-tube (~33,784 tiles, ~1.8 GB each) |
| BW2-8 [AMR-25] | 8,191 | Partial (~400 tiles, ~10 MB each) |
| BW2-10 [AMR-22] | 16,537 | Not yet sampled |
| BW2-11 [AMR-23] | 26,763 | Not yet sampled |
| BW2-13 [AMR-24] | 13,537 | Not yet sampled |
| BW3-16 [AMR-16] | 7,325 | Not yet sampled |
| BW3-17 [AMR-20] | 471 | Not yet sampled |
| BW3-19 [AMR-21] | 15,186 | Not yet sampled |
| BW3-20 [AMR-26] | 23,052 | Full-tube (~33,784 tiles, ~1.95 GB each) |
| BW3-21 [AMR-17] | 10,115 | Not yet sampled |
| **Total** | **145,926** | |

### Storage estimates

| What | Size | Notes |
|---|---|---|
| Mosaics only | ~2.4 TB | 145,926 × 16.6 MB per mosaic |
| Full tiles (mixed scans) | ~160 TB | Assumes 40% full-tube, 60% partial |
| Full tiles (worst case) | ~368 TB | If all scans are full-tube |

A full-tube scan covers a 310 mm × 740 mm cylinder in 3.01 mm × 2.26 mm steps, producing a **103 × 328 grid (33,784 tiles)**. Tiles are JPEGs averaging ~79 KB each (~137 KB at the tube surface).
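
The grid and mosaic figures above can be checked with a few lines of arithmetic. This is only a sketch: `grid_shape` is a hypothetical helper, and rounding up with `ceil` is an assumption that happens to reproduce the stated 103 × 328 grid; the server may derive the grid differently.

```python
import math

def grid_shape(width_mm, height_mm, dx_mm, dy_mm):
    """Return (columns, rows) needed to cover the area at the given step sizes."""
    # Assumption: a partial step at the edge still needs one more tile.
    return math.ceil(width_mm / dx_mm), math.ceil(height_mm / dy_mm)

cols, rows = grid_shape(310, 740, 3.01, 2.26)
print(cols, rows, cols * rows)  # 103 328 33784

# Mosaic-only storage: scan count times the per-mosaic size from the table (MB -> TB).
mosaic_tb = 145_926 * 16.6 / 1_000_000
print(round(mosaic_tb, 2))  # 2.42
```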

### Download speed

Tile downloads are server-limited: the RootView PHP backend renders each tile on demand and sustains ~**0.67 tiles/sec** with 4 parallel workers, regardless of local bandwidth. Mosaics are pre-rendered and download ~20× faster per MB.

| Scenario | Estimated time |
|---|---|
| All mosaics (4 workers) | ~3 months |
| Full tiles for one scan (4 workers) | ~14 hours |
| All tiles, full-tube machines only | Years — not recommended |

**Recommended approach:** inventory all scans first (`--metadata-only`, ~80 hours serial or ~7 hours if machines run in parallel), then archive mosaics (`--mosaic-only`), then selectively download tiles for priority scans.
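
As a rough cross-check of the time estimates above, the sketch below turns the quoted throughput into hours. The helper name is hypothetical, and the multi-year figure assumes every scan on one machine is full-tube, which the inventory table only samples.

```python
TILES_PER_SCAN = 103 * 328   # full-tube grid from the storage section
TILES_PER_SECOND = 0.67      # sustained rate quoted for 4 workers

def hours_for_tiles(n_tiles, tiles_per_second=TILES_PER_SECOND):
    """Estimate download time in hours at a fixed server-limited rate."""
    return n_tiles / tiles_per_second / 3600

# One full-tube scan.
print(round(hours_for_tiles(TILES_PER_SCAN), 1))  # 14.0

# Metadata inventory: ~80 hours serial, split evenly across 12 machines.
print(round(80 / 12, 1))  # 6.7

# All 23,052 scans on BW3-20, assuming every one is full-tube: about 37 years.
years = hours_for_tiles(23_052 * TILES_PER_SCAN) / 24 / 365
```

The last figure is why downloading every tile is listed as not recommended.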

---

## Setup

```bash
# 1. Clone / download this repo
cd spruce_scraper

# 2. Install dependencies (Python 3.10+)
pip install -r requirements.txt

# 3. Configure credentials
cp config.example.yaml config.yaml
# Edit config.yaml: set username and password
```

`config.yaml` is gitignored and never committed.

---

## Usage

```bash
# List all available machines (no login needed)
python scraper.py --list-machines

# List all scans for a machine
python scraper.py --list-scans --machine "BW3-20 [AMR-26]"

# List only the first page of the scan table (one HTTP request; up to 320 scans, in the server's default order)
python scraper.py --list-scans --list-scans-first-page-only --machine "BW3-20 [AMR-26]"

# Preview what would be downloaded (dry run)
python scraper.py --machine "BW3-20 [AMR-26]" --dry-run

# Inventory scan parameters only (no images downloaded); far faster than downloading images
python scraper.py --metadata-only
python scraper.py --machine "BW3-20 [AMR-26]" --metadata-only

# Download mosaics only for one machine
python scraper.py --machine "BW3-20 [AMR-26]" --mosaic-only

# Download mosaics for all machines
python scraper.py --mosaic-only

# One random completed scan per machine (helper script): check out branch `testing/sample-runs`,
# then see `scripts/sample_random_scans.sh` and `docs/sample_random_scans_run_progress.md`.

# Download all tiles for a specific scan
python scraper.py --machine "BW3-20 [AMR-26]" --scan-id 158374 --workers 4

# Resume an interrupted download (automatically skips completed files)
python scraper.py --machine "BW3-20 [AMR-26]" --scan-id 158374 --workers 4
```

### All options

| Flag | Description |
|---|---|
| `--config FILE` | Config file path (default: `config.yaml`) |
| `--machine LABEL` | Restrict to one machine, e.g. `"BW3-20 [AMR-26]"` |
| `--scan-id ID` | Restrict to one scan ID (use with `--machine`; works with all modes) |
| `--mosaic-only` | Download mosaics only; skip individual tiles |
| `--metadata-only` | Fetch scan parameters only; write `metadata.json` + `scans.csv` rows, skip all images. Re-runs skip scans whose `metadata.json` already exists |
| `--dry-run` | Print what would be downloaded without saving |
| `--workers N` | Parallel download threads (default: 2, hard cap: 4) |
| `--recheck` | Scan archive for zero-byte/missing tiles and mosaics; remove bad entries from `.progress.json` so they re-download on next run |
| `--list-machines` | Print all machines and exit |
| `--list-scans` | Print all scans for `--machine` and exit |
| `--list-scans-first-page-only` | With `--list-scans`: a single list request (up to 320 scans) instead of paginating the full history |
| `--verbose` / `-v` | Debug logging |

### `config.yaml` (optional keys)

| Key | Description |
|---|---|
| `write_exif` | If true (default), write EXIF to each `mosaic.jpg` after download. Set to false to skip. |
| `machine_metadata` | Map of machine label → optional fields for mosaic EXIF: `plot_number`, `enclosure` (bool), `temp_treatment` (number or string), `co2_treatment` (`ambient` / `elevated`), `latitude_wgs_84`, `longitude_wgs_84`, `elevation_masl`. Omitted keys are not written. |

`config.example.yaml` lists all 12 machine labels with full `machine_metadata` (plot, enclosure, treatments, WGS84 coordinates, elevation) and an optional `machines` filter (commented).

---

## Output layout

```
archives/
├── .progress.json              # tracks completed URLs for resume support
├── scans.csv                   # scan-level metadata for every processed scan
├── tiles.csv                   # tile-level metadata for every downloaded tile
│
└── BW3-20__AMR-26/
    └── 2024-07-29/
        └── 158374/
            ├── metadata.json   # full scan parameters (grid, timestamps, etc.)
            ├── mosaic.jpg      # pre-stitched full image (~16 MB), EXIF after download
            └── tiles/
                ├── tile_r000_c000.jpg   # row 0, column 0 (zero-padding matches grid size)
                ├── tile_r000_c001.jpg
                └── ...                 # 33,784 tiles total for a full-tube scan
```

Tile filenames encode position: `tile_r{row}_c{col}.jpg` where row increases with depth (Y in mm) and column increases along the tube circumference (X in mm).
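
A minimal sketch of building and parsing these names follows. Both helpers are hypothetical, and the padding rule (as many digits as the largest index in the grid) is an assumption inferred from the `tile_r000_c000.jpg` example for a 328 × 103 grid; the scraper's actual rule may differ.

```python
import re

def tile_filename(row, col, n_rows, n_cols):
    """Build a tile name, zero-padded to the width of the largest grid index."""
    # Assumption: rows and columns share one padding width.
    width = len(str(max(n_rows, n_cols) - 1))
    return f"tile_r{row:0{width}d}_c{col:0{width}d}.jpg"

TILE_NAME = re.compile(r"tile_r(\d+)_c(\d+)\.jpg$")

def parse_tile_filename(name):
    """Return (row, col) from a tile name, or None if it does not match."""
    match = TILE_NAME.search(name)
    return (int(match.group(1)), int(match.group(2))) if match else None

print(tile_filename(0, 1, n_rows=328, n_cols=103))  # tile_r000_c001.jpg
print(parse_tile_filename("tile_r012_c101.jpg"))    # (12, 101)
```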

**Mosaic `mosaic.jpg` EXIF** (when `write_exif` is true in `config.yaml`, the default): tags are written immediately after a successful download via `piexif`, without re-encoding the image. The tags are:

- `DateTime` / `DateTimeOriginal`: scan time
- `ImageDescription`: machine, scan ID, name
- `Make`: RootView
- `Model`: machine label
- `Software`: RootView + server version
- `ProcessingSoftware`: this scraper
- `Artist`: user
- `UserComment`: one line with the grid size, a pointer to `metadata.json`, and, when set in `machine_metadata`, `plot_number`, `enclosure`, `temp_treatment`, `co2_treatment`
- `XPKeywords`: the same treatment fields, when any of those four are set
- GPS tags: when `latitude_wgs_84`, `longitude_wgs_84`, and optionally `elevation_masl` are set

See `config.example.yaml` for the `machine_metadata` layout.
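
EXIF stores GPS coordinates as a hemisphere reference plus degree/minute/second rationals, so decimal values such as `latitude_wgs_84` have to be converted before they are written. The sketch below shows one way to do that conversion with the standard library only; `to_exif_gps` is a hypothetical helper, not the scraper's code, and the coordinates are illustrative values rather than the site's.

```python
def to_exif_gps(decimal_degrees, is_latitude, sec_denominator=100):
    """Return (ref, ((deg, 1), (min, 1), (sec_num, sec_den))) for one coordinate."""
    if is_latitude:
        ref = "N" if decimal_degrees >= 0 else "S"
    else:
        ref = "E" if decimal_degrees >= 0 else "W"
    # Work in integer fractions of a second so rounding never yields 60 seconds.
    total = round(abs(decimal_degrees) * 3600 * sec_denominator)
    degrees, remainder = divmod(total, 3600 * sec_denominator)
    minutes, seconds = divmod(remainder, 60 * sec_denominator)
    return ref, ((degrees, 1), (minutes, 1), (seconds, sec_denominator))

print(to_exif_gps(47.5, is_latitude=True))     # ('N', ((47, 1), (30, 1), (0, 100)))
print(to_exif_gps(-93.25, is_latitude=False))  # ('W', ((93, 1), (15, 1), (0, 100)))
```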

### Metadata files

**`scans.csv`** columns: `machine`, `machine_id`, `scan_id`, `name`, `scan_time`, `start_x`, `start_y`, `end_x`, `end_y`, `dx`, `dy`, `nx`, `ny`, `total_tiles`, `scan_lines`, `scan_mode`, `start_datetime`, `end_datetime`, `status`, `user`, `disk_space_mb`, `mosaic_url`, `mosaic_local_path`, `mosaic_on_disk`, `mosaic_download_status`, `mosaic_error`, `mosaic_error_code`, `mosaic_error_class`

- `mosaic_on_disk`: `True` if `mosaic.jpg` exists on disk at row-write time, regardless of which run downloaded it. Useful for inventory — reflects actual archive state rather than what happened in the current run.
- `mosaic_download_status`: one of `downloaded`, `failed`, `already_done`, `dry_run`, `skipped_metadata_only` (in `--metadata-only` mode). Failed attempts are still written so you can see missing server-side images in the same CSV.
- `mosaic_error` / `mosaic_error_code` / `mosaic_error_class`: set when the URL was tried and the file was not stored successfully. **`mosaic_error_class`** is a coarse hint: `permanent_missing` for HTTP 404/410, `transient` for 5xx or common network/timeout-style failures, and `unknown` for other cases (including a 200 with an empty body). **Rows are append-only;** a failed download leaves an audit record without overwriting prior runs’ history. Delete or rotate the CSVs if you need a new header (see `spruce.settings.SCANS_CSV_FIELDS` / `TILES_CSV_FIELDS`).

**`tiles.csv`** columns: `machine`, `machine_id`, `scan_id`, `scan_time`, `row_index`, `col_index`, `x_mm`, `y_mm`, `url`, `local_path`, `status`, `error`, `error_code`, `error_class`, `downloaded_at`, `file_size_bytes`

- `status`: `downloaded`, `failed`, or `dry_run` (if `--dry-run`). Failed rows are kept for the same reason as mosaics.
- `error` / `error_code` / `error_class`: same rough semantics as the mosaic fields (`permanent_missing` / `transient` / `unknown`). `error_code` is the HTTP status when available.
- `downloaded_at`: ISO 8601 UTC timestamp when the tile was fetched. Empty on failure.
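
The coarse error classes used in both CSVs can be sketched as a small mapping. This is an illustration of the rules described above, not the scraper's implementation: `classify_error` is a hypothetical name, and it checks Python's built-in exception types, whereas a `requests`-based client would check that library's own exception classes.

```python
def classify_error(status_code=None, exc=None):
    """Map an HTTP status or exception to a coarse error class."""
    if status_code in (404, 410):
        return "permanent_missing"   # the server says the resource is gone
    if status_code is not None and 500 <= status_code <= 599:
        return "transient"           # server-side failure, worth retrying later
    if isinstance(exc, (TimeoutError, ConnectionError)):
        return "transient"           # network or timeout style failure
    return "unknown"                 # e.g. a 200 response with an empty body

print(classify_error(404))                    # permanent_missing
print(classify_error(503))                    # transient
print(classify_error(exc=TimeoutError()))     # transient
print(classify_error(200))                    # unknown
```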

---

## Site structure (RootView)

The RootView interface runs on a standard PHP stack. Key endpoints discovered:

| Endpoint | Description |
|---|---|
| `POST index.php` | Login (`RTLLogin=1`, `RTLNAME`, `RTLUSER`, `RTLPWD`) |
| `POST index.php {cmd:scan, start:N, FilterCount:320}` | Paginated scan list |
| `GET index.php?cmd=scan&mode=view&id=ID` | Scan detail (grid params, disk usage) |
| `GET index.php?cmd=image&mode=image_scan&id=ID&s=1&x=X&y=Y` | Individual tile JPEG |
| `GET http://<host>:8011/RootView_Database/ID/mosaic.jpg` | Pre-stitched mosaic |

Grid coordinates (X, Y) are in millimetres, starting from `(start_x, start_y)` with step `(dx, dy)`.
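
A minimal sketch of turning a grid index into a tile request follows. The helper names are hypothetical, serving `index.php` from the server root is an assumption, and so is rounding coordinates to two decimals; the exact number format the server expects is not documented here.

```python
from urllib.parse import urlencode

HOST = "205.149.147.131"  # RootView host from the Background section

def tile_position(row, col, start_x, start_y, dx, dy):
    """Return the (x, y) position in mm of the tile at a grid index."""
    # Assumption: two decimals is enough precision for the server.
    return round(start_x + col * dx, 2), round(start_y + row * dy, 2)

def tile_url(scan_id, x, y):
    """Build the tile request URL for a scan at position (x, y)."""
    query = urlencode({"cmd": "image", "mode": "image_scan",
                       "id": scan_id, "s": 1, "x": x, "y": y})
    return f"http://{HOST}:8010/index.php?{query}"

def mosaic_url(scan_id):
    """Build the pre-stitched mosaic URL (served on port 8011)."""
    return f"http://{HOST}:8011/RootView_Database/{scan_id}/mosaic.jpg"

x, y = tile_position(row=2, col=1, start_x=0.0, start_y=0.0, dx=3.01, dy=2.26)
print(x, y)  # 3.01 4.52
print(tile_url(158374, x, y))
```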

---

## Resume and reliability

- **Resumable**: `.progress.json` records every completed URL. Re-running the same command skips already-downloaded files. `--metadata-only` re-runs additionally skip any scan whose `metadata.json` already exists on disk — no HTTP request is made.
- **Atomic progress saves**: `.progress.json` is written via a temp-file rename, so a crash mid-save never produces a corrupt or empty progress file.
- **Heal on resume**: at the start of each scan's tile pass, any tile file that exists on disk but isn't recorded in progress is silently re-marked as complete, preventing duplicate `tiles.csv` rows and redundant re-downloads.
- **Retry logic**: each tile download retries up to 3 times with exponential backoff (5 s → 10 s → 20 s) before logging a warning and moving on.
- **Worker cap**: the RootView server renders tiles on a single-threaded PHP process. Running more than 4 concurrent requests causes cascading timeouts. The default is 2 workers; the scraper hard-caps at 4 and warns if you try to exceed it.
- **Crash recovery**: run `--recheck` to find and remove zero-byte or missing tile and mosaic files from `.progress.json` so they are cleanly re-downloaded on the next run.
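
The retry schedule in the list above can be sketched as follows. This assumes one initial attempt followed by up to 3 retries, which is one reading of the 5 s → 10 s → 20 s sequence; `fetch_with_retry` and its parameters are hypothetical, and the real scraper may catch narrower exceptions.

```python
import logging
import time

log = logging.getLogger("retry_sketch")

def fetch_with_retry(fetch, retries=3, base_delay=5.0, sleep=time.sleep):
    """Call fetch(); on failure wait 5 s, 10 s, 20 s between attempts, then give up."""
    for attempt in range(retries + 1):
        try:
            return fetch()
        except Exception as exc:  # a real client would catch specific errors
            if attempt == retries:
                log.warning("giving up after %d retries: %s", retries, exc)
                return None
            # Double the delay on every failed attempt.
            sleep(base_delay * 2 ** attempt)
```

Passing `sleep` as a parameter keeps the function easy to exercise without real waiting, for example by recording the delays in a list.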

```bash
# After a hard crash, optionally run recheck before resuming:
python scraper.py --recheck
# Then resume normally — the scraper picks up where it left off:
python scraper.py --machine "BW3-20 [AMR-26]" --scan-id 158374
```
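
For readers unfamiliar with the temp-file rename technique behind the atomic progress saves, here is a minimal sketch. The function name and the `{"completed": [...]}` structure are hypothetical; the actual layout of `.progress.json` is not specified here.

```python
import json
import os
import tempfile

def save_progress_atomic(path, progress):
    """Write JSON to a temp file in the same directory, then rename it over path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".progress.", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(progress, handle)
            handle.flush()
            os.fsync(handle.fileno())  # make sure the bytes reach the disk
        os.replace(tmp_path, path)     # atomic when both paths share a filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)        # never leave a stray temp file behind
        raise
```

Because the rename either fully happens or does not, a reader of `path` sees the old file or the new one, never a half-written one.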

---

## Run summary

Every run prints a summary table on completion:

```
──────────────────────────────────────────────
  Run complete
──────────────────────────────────────────────
  Machines:             1
  Scans (metadata) fetched: 428  (2 already cached, 0 metadata failed)
  Metadata written:     428  (new JSON files)
──────────────────────────────────────────────
  Scans CSV:            archives/scans.csv
  Progress:             archives/.progress.json
──────────────────────────────────────────────
```

- **Scans (metadata) fetched**: RootView scan detail page was retrieved (grid params, etc.). This does not mean the mosaic downloaded successfully; use **Mosaics downloaded** / **Mosaics failed** when not in `--metadata-only` mode.
- **Already cached**: `metadata.json` already existed on disk; no HTTP request was made.
- **metadata failed**: metadata fetch error or scan missing required grid parameters.
- **Metadata written**: new `metadata.json` files created (shown in `--metadata-only` mode).
- **Mosaics failed** (when present): mosaic URL was requested but the file was not saved (e.g. HTTP 404, or empty body). Check the log for the exact URL.
- Mosaic and tile counts appear in their respective modes.

---

## Dependencies

| Package | Purpose |
|---|---|
| `requests` | HTTP client |
| `beautifulsoup4` + `lxml` | HTML parsing |
| `pyyaml` | Config file |
| `tqdm` | Progress bars |
| `piexif` | EXIF for downloaded mosaics |