poprhythm e122f6435a Initial commit
Add spruce scraper with CLI, session management, parsers, progress tracking,
recheck logic, and test suite. Includes example config and README.
2026-04-22 10:41:18 -04:00


Spruce Minirhizotron Scraper

A Python tool for archiving image data collected by minirhizotron cameras at the SPRUCE experiment site. It authenticates against the RootView web interface, enumerates every scan on all 12 camera machines, and downloads image tiles and mosaics to a structured local archive with full metadata.


Background

Minirhizotron cameras are inserted into clear tubes buried in the ground to image root systems non-destructively over time. This project archives data from the SPRUCE (Spruce and Peatland Responses Under Changing Environments) experiment, which monitors boreal peatland responses to warming and elevated CO₂.

The 12 AMR camera machines (BW1-4 through BW3-21) are managed by a RootView web application at http://205.149.147.131:8010. Each scan captures a grid of overlapping image tiles along a buried tube. The server also pre-renders a full stitched mosaic for each scan.


Archive inventory (as of April 2026)

| Machine | Scans | Scan type (sampled) |
|---|---|---|
| BW1-4 [AMR-15] | 6,121 | Mixed (full-tube + partial) |
| BW1-6 [AMR-19] | 18,198 | Full-tube (~33,784 tiles, ~1.7 GB each) |
| BW1-7 [AMR-18] | 430 | Full-tube (~33,784 tiles, ~1.8 GB each) |
| BW2-8 [AMR-25] | 8,191 | Partial (~400 tiles, ~10 MB each) |
| BW2-10 [AMR-22] | 16,537 | Not yet sampled |
| BW2-11 [AMR-23] | 26,763 | Not yet sampled |
| BW2-13 [AMR-24] | 13,537 | Not yet sampled |
| BW3-16 [AMR-16] | 7,325 | Not yet sampled |
| BW3-17 [AMR-20] | 471 | Not yet sampled |
| BW3-19 [AMR-21] | 15,186 | Not yet sampled |
| BW3-20 [AMR-26] | 23,052 | Full-tube (~33,784 tiles, ~1.95 GB each) |
| BW3-21 [AMR-17] | 10,115 | Not yet sampled |
| Total | 145,926 | |

Storage estimates

| What | Size | Notes |
|---|---|---|
| Mosaics only | ~2.4 TB | 145,926 × 16.6 MB per mosaic |
| Full tiles (mixed scans) | ~160 TB | Assumes 40% full-tube, 60% partial |
| Full tiles (worst case) | ~368 TB | If all scans are full-tube |

A full-tube scan covers a 310 mm × 740 mm cylinder at 3.01 × 2.26 mm steps, producing a 103 × 328 = 33,784 tile grid. Each tile is ~79 KB on average (JPEG, 137 KB at the tube surface).
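
The grid size quoted above can be reproduced from the scan extent and the step sizes. This is a minimal sketch, assuming the tile count along each axis is the extent divided by the step, rounded up; the helper name `grid_shape` is illustrative and not part of the scraper.

```python
import math


def grid_shape(width_mm: float, height_mm: float,
               dx_mm: float, dy_mm: float) -> tuple[int, int]:
    """Return (columns, rows) needed to cover the extent.

    Rounding up is an assumption; it matches the 103 x 328 grid in the text.
    """
    return math.ceil(width_mm / dx_mm), math.ceil(height_mm / dy_mm)


nx, ny = grid_shape(310, 740, 3.01, 2.26)
print(nx, ny, nx * ny)  # 103 328 33784
```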

Download speed

Tile downloads are server-limited: the RootView PHP backend renders tiles on-demand, sustaining ~0.67 tiles/sec with 8 parallel workers regardless of local bandwidth. Mosaics are pre-rendered and download ~20× faster per MB.

| Scenario | Estimated time |
|---|---|
| All mosaics (4 workers) | ~3 months |
| Full tiles for one scan (8 workers) | ~14 hours |
| All tiles, full-tube machines only | Years (not recommended) |
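
The per-scan estimate follows directly from the measured throughput. A small sketch of the arithmetic, using only figures quoted in this README (33,784 tiles per full-tube scan, ~0.67 tiles/sec, 145,926 scans); the function name is illustrative.

```python
def hours_to_download(tiles: int, tiles_per_sec: float) -> float:
    """Wall-clock hours to fetch `tiles` at a sustained server-side rate."""
    return tiles / tiles_per_sec / 3600


full_tube_tiles = 103 * 328  # 33,784 tiles per full-tube scan
one_scan_hours = hours_to_download(full_tube_tiles, 0.67)
print(f"{one_scan_hours:.1f} h")  # 14.0 h

# Worst case: every one of the 145,926 scans is full-tube.
worst_case_years = hours_to_download(145_926 * full_tube_tiles, 0.67) / (24 * 365)
print(f"{worst_case_years:.0f} years")  # well over a century
```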

Recommended approach: archive mosaics first (--mosaic-only), then selectively download tiles for priority scans.


Setup

# 1. Clone / download this repo
cd SPRUCE-scraper

# 2. Install dependencies (Python 3.10+)
pip install -r requirements.txt

# 3. Configure credentials
cp config.example.yaml config.yaml
# Edit config.yaml: set username and password

config.yaml is gitignored and never committed.


Usage

# List all available machines (no login needed)
python scraper.py --list-machines

# List all scans for a machine
python scraper.py --list-scans --machine "BW3-20 [AMR-26]"

# Preview what would be downloaded (dry run)
python scraper.py --machine "BW3-20 [AMR-26]" --dry-run

# Download mosaics only for one machine
python scraper.py --machine "BW3-20 [AMR-26]" --mosaic-only

# Download mosaics for all machines
python scraper.py --mosaic-only

# Download all tiles for a specific scan
python scraper.py --machine "BW3-20 [AMR-26]" --scan-id 158374 --workers 4

# Resume an interrupted download (automatically skips completed files)
python scraper.py --machine "BW3-20 [AMR-26]" --scan-id 158374 --workers 4

All options

| Flag | Description |
|---|---|
| --config FILE | Config file path (default: config.yaml) |
| --machine LABEL | Restrict to one machine, e.g. "BW3-20 [AMR-26]" |
| --scan-id ID | Download only this scan (use with --machine) |
| --mosaic-only | Download mosaics only; skip individual tiles |
| --dry-run | Print what would be downloaded without saving |
| --workers N | Parallel download threads (default: 2, hard cap: 4) |
| --recheck | Scan archive for zero-byte/missing tiles and remove them from .progress.json so they re-download on next run |
| --list-machines | Print all machines and exit |
| --list-scans | Print all scans for --machine and exit |
| --verbose / -v | Debug logging |
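
For orientation, the flags above could be declared with `argparse` roughly as follows. This is a sketch only, not the actual contents of scraper.py: the flag names and defaults come from the table, while `build_parser`, `clamp_workers`, and `MAX_WORKERS` are hypothetical names, and treating `--scan-id` as an integer is an assumption.

```python
import argparse
import logging

MAX_WORKERS = 4  # hard cap stated in the options table


def build_parser() -> argparse.ArgumentParser:
    p = argparse.ArgumentParser(prog="scraper.py")
    p.add_argument("--config", default="config.yaml", metavar="FILE")
    p.add_argument("--machine", metavar="LABEL")
    p.add_argument("--scan-id", type=int, metavar="ID")  # int is an assumption
    p.add_argument("--mosaic-only", action="store_true")
    p.add_argument("--dry-run", action="store_true")
    p.add_argument("--workers", type=int, default=2, metavar="N")
    p.add_argument("--recheck", action="store_true")
    p.add_argument("--list-machines", action="store_true")
    p.add_argument("--list-scans", action="store_true")
    p.add_argument("--verbose", "-v", action="store_true")
    return p


def clamp_workers(requested: int) -> int:
    """Limit the worker count to the range 1..MAX_WORKERS, warning when capped."""
    if requested > MAX_WORKERS:
        logging.warning("--workers %d exceeds the cap; using %d", requested, MAX_WORKERS)
        return MAX_WORKERS
    return max(1, requested)


args = build_parser().parse_args(["--machine", "BW3-20 [AMR-26]", "--workers", "8"])
print(clamp_workers(args.workers))  # 4
```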

Output layout

archives/
├── .progress.json              # tracks completed URLs for resume support
├── scans.csv                   # scan-level metadata for every processed scan
├── tiles.csv                   # tile-level metadata for every downloaded tile
│
└── BW3-20__AMR-26/
    └── 2024-07-29/
        └── 158374/
            ├── metadata.json   # full scan parameters (grid, timestamps, etc.)
            ├── mosaic.jpg      # pre-stitched full image (~16 MB)
            └── tiles/
                ├── tile_r000_c000.jpg   # row 0, column 0
                ├── tile_r000_c001.jpg
                └── ...                 # 33,784 tiles total for a full-tube scan

Tile filenames encode position: tile_r{row}_c{col}.jpg where row increases with depth (Y in mm) and column increases along the tube circumference (X in mm).
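
The layout and naming scheme can be expressed as a few small helpers. This is a minimal sketch: the three-digit zero padding is taken from the example filenames, and the rule for turning a machine label into a directory name is inferred from the single `BW3-20__AMR-26` example, so both are assumptions. All function names are illustrative.

```python
import re
from pathlib import Path

TILE_RE = re.compile(r"^tile_r(\d+)_c(\d+)\.jpg$")


def machine_dir(label: str) -> str:
    """'BW3-20 [AMR-26]' -> 'BW3-20__AMR-26' (rule inferred from one example)."""
    return label.replace(" [", "__").rstrip("]")


def tile_name(row: int, col: int) -> str:
    """Build a tile filename with zero-padded row and column indices."""
    return f"tile_r{row:03d}_c{col:03d}.jpg"


def parse_tile_name(name: str) -> tuple[int, int]:
    """Recover (row, col) from a tile filename."""
    m = TILE_RE.match(name)
    if m is None:
        raise ValueError(f"not a tile filename: {name!r}")
    return int(m.group(1)), int(m.group(2))


def tile_path(root: str, label: str, scan_date: str, scan_id: int,
              row: int, col: int) -> Path:
    """Compose archives/<machine>/<date>/<scan_id>/tiles/<tile file>."""
    return (Path(root) / machine_dir(label) / scan_date / str(scan_id)
            / "tiles" / tile_name(row, col))


print(tile_path("archives", "BW3-20 [AMR-26]", "2024-07-29", 158374, 0, 1).as_posix())
# archives/BW3-20__AMR-26/2024-07-29/158374/tiles/tile_r000_c001.jpg
```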

Metadata files

scans.csv columns: machine, machine_id, scan_id, name, scan_time, start_x, start_y, end_x, end_y, dx, dy, nx, ny, total_tiles, scan_lines, scan_mode, start_datetime, end_datetime, status, user, disk_space_mb, mosaic_url, mosaic_local_path, mosaic_downloaded

tiles.csv columns: machine, machine_id, scan_id, scan_time, row_index, col_index, x_mm, y_mm, url, local_path, downloaded_at, file_size_bytes
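
Because both files are plain CSV, they can be inspected with the standard library alone. The sketch below sums downloaded bytes per scan from tiles.csv. The column names come from the list above; the two sample rows (including the `machine_id` value, URLs, and timestamps) are made up purely for illustration.

```python
import csv
import io
from collections import defaultdict


def bytes_per_scan(tiles_csv) -> dict[str, int]:
    """Sum file_size_bytes for each scan_id in an open tiles.csv file."""
    totals: dict[str, int] = defaultdict(int)
    for row in csv.DictReader(tiles_csv):
        totals[row["scan_id"]] += int(row["file_size_bytes"])
    return dict(totals)


# Illustrative rows only; real values will differ.
sample = io.StringIO(
    "machine,machine_id,scan_id,scan_time,row_index,col_index,x_mm,y_mm,"
    "url,local_path,downloaded_at,file_size_bytes\n"
    "BW3-20 [AMR-26],26,158374,2024-07-29,0,0,0.0,0.0,"
    "http://example.invalid/a,tiles/tile_r000_c000.jpg,2026-04-01T00:00:00,80000\n"
    "BW3-20 [AMR-26],26,158374,2024-07-29,0,1,3.01,0.0,"
    "http://example.invalid/b,tiles/tile_r000_c001.jpg,2026-04-01T00:00:02,70000\n"
)
result = bytes_per_scan(sample)
print(result)  # {'158374': 150000}
```

With a real archive, pass an open file instead: `with open("archives/tiles.csv", newline="") as f: bytes_per_scan(f)`.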


Site structure (RootView)

The RootView interface runs on a standard PHP stack. Key endpoints discovered:

| Endpoint | Description |
|---|---|
| POST index.php | Login (RTLLogin=1, RTLNAME, RTLUSER, RTLPWD) |
| POST index.php {cmd:scan, start:N, FilterCount:320} | Paginated scan list |
| GET index.php?cmd=scan&mode=view&id=ID | Scan detail (grid params, disk usage) |
| GET index.php?cmd=image&mode=image_scan&id=ID&s=1&x=X&y=Y | Individual tile JPEG |
| GET http://<host>:8011/RootView_Database/ID/mosaic.jpg | Pre-stitched mosaic |

Grid coordinates (X, Y) are in millimetres, starting from (start_x, start_y) with step (dx, dy).
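
The mapping from grid indices to a tile request can be sketched as below. The query parameters are taken from the endpoint list; everything else is an assumption made for illustration: that `index.php` sits at the root of the host given in Background, that X advances with the column index and Y with the row index, and that coordinates are sent with two decimal places. The function names are hypothetical.

```python
from urllib.parse import urlencode

# Host from the Background section; the root path is an assumption.
BASE_URL = "http://205.149.147.131:8010/index.php"


def tile_coords(start_x: float, start_y: float, dx: float, dy: float,
                row: int, col: int) -> tuple[float, float]:
    """Millimetre position of the tile at (row, col)."""
    return start_x + col * dx, start_y + row * dy


def tile_url(scan_id: int, x_mm: float, y_mm: float, base: str = BASE_URL) -> str:
    """Build the tile request URL; two-decimal formatting is an assumption."""
    query = urlencode({
        "cmd": "image",
        "mode": "image_scan",
        "id": scan_id,
        "s": 1,
        "x": f"{x_mm:.2f}",
        "y": f"{y_mm:.2f}",
    })
    return f"{base}?{query}"


x, y = tile_coords(0.0, 0.0, 3.01, 2.26, row=2, col=1)
print(tile_url(158374, x, y))
# http://205.149.147.131:8010/index.php?cmd=image&mode=image_scan&id=158374&s=1&x=3.01&y=4.52
```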


Resume and reliability

  • Resumable: .progress.json records every completed URL. Re-running the same command skips already-downloaded files.
  • Retry logic: each tile download retries up to 3 times with exponential backoff (5 s → 10 s → 20 s) before logging a warning and moving on.
  • Worker cap: the RootView server renders tiles on a single-threaded PHP process. Running more than 4 concurrent requests causes cascading read timeouts. The default is 2 workers; the scraper hard-caps at 4 and warns loudly if you try to exceed it.
  • Crash recovery: if a run is killed mid-flight, some in-progress tiles may have been written as zero-byte files without being marked complete. Run --recheck before resuming — it deletes zero-byte files on disk and removes their URLs from .progress.json so they are cleanly re-downloaded.
# After any interrupted run, always do this first:
python scraper.py --recheck
# Then resume normally:
python scraper.py --machine "BW3-20 [AMR-26]" --scan-id 158374
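
The retry and recheck behaviour described above can be sketched as follows. This is not the scraper's actual code. It assumes one initial attempt followed by up to three retries with the stated waits, that .progress.json holds a JSON list of completed URLs, and that a URL-to-path mapping is available (for example, built from tiles.csv). `fetch_with_retry`, `recheck`, and their parameters are hypothetical names.

```python
import json
import logging
import time
from pathlib import Path


def fetch_with_retry(fetch, url, delays=(5, 10, 20), sleep=time.sleep):
    """Call fetch(url); on failure wait 5 s, 10 s, then 20 s between retries.

    Returns fetch's result, or None once the final retry has failed.
    """
    for wait in (0, *delays):
        if wait:
            sleep(wait)
        try:
            return fetch(url)
        except Exception as exc:  # broad on purpose: this is a sketch
            logging.debug("fetch failed for %s: %s", url, exc)
    logging.warning("giving up on %s after %d retries", url, len(delays))
    return None


def recheck(archive_root: Path, progress_path: Path,
            url_to_path: dict[str, Path]) -> set[str]:
    """Delete zero-byte JPEGs, then forget URLs whose files are missing."""
    for f in archive_root.rglob("*.jpg"):
        if f.stat().st_size == 0:
            f.unlink()
    completed = set(json.loads(progress_path.read_text()))
    stale = {u for u in completed
             if u in url_to_path and not url_to_path[u].exists()}
    progress_path.write_text(json.dumps(sorted(completed - stale)))
    return stale
```

Passing `sleep` as a parameter keeps the backoff testable without real waiting; in normal use the default `time.sleep` applies.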

Dependencies

| Package | Purpose |
|---|---|
| requests | HTTP client |
| beautifulsoup4 + lxml | HTML parsing |
| pyyaml | Config file |
| tqdm | Progress bars |