Dedup file system tools

Derek Slone-Zhen 29d2b87d4d `compare_within`		2026-04-29 13:06:31 +10:00

Inventory Workspace

This repository is a Rust-first implementation of the design in cross-disk-inventory-dedupe-framework.md.

Current Shape

crates/inventory-core
- scan, compare, verify, prune, and SQLite schema helpers
crates/inventory-cli
- inventory CLI surface
cross-disk-inventory-dedupe-framework.md
- architecture and workflow note

Design Intent

The hot path lives in Rust:

- scan mounted trees into per-root _Inventory/inventory.sqlite
- capture filesystem UUID/source/type as inventory metadata
- compare two inventories cheaply with size + head_hash + tail_hash
- scan only the head BLAKE3 window initially, then fill tail hashes lazily during compare
- lazily compute full BLAKE3 only for current candidates
- persist verified duplicate pairs into both the master and slave inventory DBs
- generate a reviewable slave-delete script from the verified set
- keep an audit trail of suggested slave deletions in both DBs
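
The cheap comparison step above can be sketched as a join on (size, head_hash, tail_hash). This is a minimal illustration, not the code in crates/inventory-core: the FileRecord type, the cheap_key and candidate_pairs names, and the 32-byte digest representation are all assumptions made for the example. Files whose head or tail hash has not been filled yet simply do not take part.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory view of one inventoried file (not the real schema).
/// Hashes are optional because head and tail hashes are filled lazily.
#[derive(Debug, Clone)]
struct FileRecord {
    path: String,
    size: u64,
    head_hash: Option<[u8; 32]>,
    tail_hash: Option<[u8; 32]>,
}

/// The cheap comparison key: size plus head and tail digests.
type CheapKey = (u64, [u8; 32], [u8; 32]);

/// Returns None while either cheap hash is still missing.
fn cheap_key(f: &FileRecord) -> Option<CheapKey> {
    Some((f.size, f.head_hash?, f.tail_hash?))
}

/// Pair every master file with every slave file that shares the cheap key.
/// The result is sorted so repeated runs give the same order.
fn candidate_pairs<'a>(
    master: &'a [FileRecord],
    slave: &'a [FileRecord],
) -> Vec<(&'a str, &'a str)> {
    let mut by_key: HashMap<CheapKey, Vec<&FileRecord>> = HashMap::new();
    for m in master {
        if let Some(key) = cheap_key(m) {
            by_key.entry(key).or_default().push(m);
        }
    }

    let mut pairs = Vec::new();
    for s in slave {
        let Some(key) = cheap_key(s) else { continue };
        if let Some(matches) = by_key.get(&key) {
            for &m in matches {
                pairs.push((m.path.as_str(), s.path.as_str()));
            }
        }
    }
    pairs.sort();
    pairs
}
```

A pair produced this way is only a candidate. Matching size, head, and tail windows does not prove the middle of the files is identical, which is why the full BLAKE3 pass is reserved for these candidates.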

The UI stays thin:

- CLI first
- ratatui later for a native Rust TUI
- Datasette remains a separate read/search surface over aggregated SQLite data

What Works Today

Implemented now:

inventory init-db
inventory scan
inventory aggregate
inventory compare
inventory inspect
inventory verify
inventory prune-slave

CLI Shape

inventory init-db --kind inventory /mnt/disk/_Inventory/inventory.sqlite
inventory scan /mnt/disk
inventory scan /mnt/disk --hash-mode none
inventory scan /mnt/disk --hash-mode head --hash-min-size 1048576
inventory scan /mnt/22Tb-mirror/12Tb-mirror/Multimedia/video \
  --inventory-root /mnt/22Tb-mirror
inventory scan /mnt/disk --progress
inventory scan /mnt/read-only-source --inventory-dir-path /some/writable/place/source-inventory
inventory aggregate /srv/catalog/catalog.sqlite \
  /mnt/diskA/_Inventory/inventory.sqlite \
  /mnt/diskB/_Inventory/inventory.sqlite

inventory compare master.sqlite slave.sqlite
inventory compare master.sqlite slave.sqlite --progress
inventory compare master.sqlite slave.sqlite --min-size 1048576
inventory compare-within /mnt/disk/_Inventory/inventory.sqlite
inventory inspect candidates master.sqlite slave.sqlite
inventory inspect verified master.sqlite slave.sqlite --limit 20
inventory inspect prune-actions master.sqlite slave.sqlite
inventory verify master.sqlite slave.sqlite
inventory verify master.sqlite slave.sqlite --progress
inventory verify master.sqlite slave.sqlite --byte-compare
inventory prune-slave master.sqlite slave.sqlite --script delete-slave-dupes.sh

Command Intent

inventory scan
- updates one per-root inventory DB
- supports scanning a subtree while storing paths relative to a larger inventory root
- supports --hash-mode head|none, with head as the default
- supports --hash-min-size so initial or follow-up scan hashing can ignore smaller files
- computes or reuses only the head BLAKE3 window when hashing is enabled
- can do a metadata-only first pass with --hash-mode none
- optional --progress prints periodic counters and current path to stderr
- checkpoints long scans into SQLite roughly every 30s
- Ctrl-C stops after the current file, flushes the current batch, and leaves missing-file reconciliation for the next completed scan
- permission-denied or otherwise unreadable entries are counted and skipped instead of aborting the scan
- files that can be listed but cannot be opened for hashing stay inventoried but are left unhashed
- does not precompute full hashes
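
As a rough sketch of the head-window behaviour described for inventory scan, the snippet below reads only the first bytes of a file and applies a --hash-min-size style threshold. The window size of 64 KiB, the function names, and the inclusive threshold are assumptions for illustration; this README does not state the real values. The actual hashing is left out to keep the example std-only; the bytes returned here would be fed to a BLAKE3 hasher.

```rust
use std::io::{self, Read};

/// Assumed head-window size; the real value is not stated in this README.
const HEAD_WINDOW: u64 = 64 * 1024;

/// Read at most `window` bytes from the start of `reader`.
/// Files shorter than the window are returned whole.
fn read_head_window<R: Read>(reader: R, window: u64) -> io::Result<Vec<u8>> {
    let mut buf = Vec::new();
    reader.take(window).read_to_end(&mut buf)?;
    Ok(buf)
}

/// Decide whether scan should hash a file of `size` bytes.
/// `hash_enabled` models --hash-mode head (true) versus none (false).
/// Treating the threshold as inclusive is an assumption.
fn should_hash(size: u64, hash_min_size: u64, hash_enabled: bool) -> bool {
    hash_enabled && size >= hash_min_size
}
```

Because read_head_window accepts any Read, it works equally on an opened std::fs::File or on an in-memory byte slice in tests.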
inventory compare
- starts from size-only collisions between the two inventory DBs, then lazily fills missing head hashes only for those files
- narrows the size collisions to cheap size+head candidates
- lazily computes tail hashes only for those size+head collisions
- ignores files smaller than 4096 bytes by default, configurable with --min-size
- ignores .git, .Trash-*, and $RECYCLE.BIN paths even if they already exist in the inventory DB
- optional --progress prints candidate and tail-hash progress to stderr
- checkpoints long tail-hash runs into SQLite roughly every 30s
- Ctrl-C stops after the current file, flushes the current batch, and leaves existing candidate rows untouched until the next full compare run
- persists candidate_pairs into both DBs
- clears stale verified_pairs for that master/slave pairing
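
The eligibility rules for inventory compare (minimum size and ignored paths) could look roughly like the following. The function names are hypothetical, and two details are assumptions: that an ignored name matches any path component, and that matching is case-sensitive.

```rust
use std::path::{Component, Path};

/// Default documented above: files smaller than 4096 bytes are ignored.
const DEFAULT_MIN_SIZE: u64 = 4096;

/// True for directory or file names that compare skips.
fn is_ignored_component(name: &str) -> bool {
    name == ".git" || name == "$RECYCLE.BIN" || name.starts_with(".Trash-")
}

/// True if any normal component of `path` is an ignored name.
/// Non-UTF-8 components are treated as not ignored in this sketch.
fn is_ignored_path(path: &Path) -> bool {
    path.components().any(|c| match c {
        Component::Normal(os) => os.to_str().map_or(false, is_ignored_component),
        _ => false,
    })
}

/// A file takes part in compare only if it is large enough and not ignored.
fn is_compare_eligible(path: &Path, size: u64, min_size: u64) -> bool {
    size >= min_size && !is_ignored_path(path)
}
```

Applying the filter at compare time, rather than only at scan time, is what lets already-inventoried .git or trash entries be skipped without rescanning.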
inventory compare-within
- finds suspected duplicate pairs inside one inventory DB
- stores one ordered pair per duplicate relationship: no self-matches, and never both directions
- reuses the same size/head/tail candidate pipeline as cross-disk compare
- inherits the same periodic checkpointing and Ctrl-C flush behavior as inventory compare
- does not yet apply a keep-policy for which in-tree path should win
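
One way to get "one ordered pair per duplicate relationship" inside a single inventory is to group rows by their candidate key and emit only pairs where the first path sorts before the second. This is a sketch under assumptions: the within_pairs name is hypothetical, the key type K stands in for the size/head/tail key, and ordering by path (rather than, say, by row id) is a choice made for the example.

```rust
use std::collections::BTreeMap;

/// Given (path, key) rows from one inventory, emit each duplicate
/// relationship exactly once as (lower_path, higher_path).
fn within_pairs<K: Ord + Clone>(rows: &[(String, K)]) -> Vec<(String, String)> {
    // Group paths by key; BTreeMap keeps the output order deterministic.
    let mut groups: BTreeMap<K, Vec<&str>> = BTreeMap::new();
    for (path, key) in rows {
        groups.entry(key.clone()).or_default().push(path.as_str());
    }

    let mut pairs = Vec::new();
    for paths in groups.values_mut() {
        paths.sort_unstable();
        paths.dedup(); // a path listed twice is not a duplicate of itself
        for i in 0..paths.len() {
            // j starts after i, so there are no self-matches and no (b, a) mirror of (a, b).
            for j in (i + 1)..paths.len() {
                pairs.push((paths[i].to_string(), paths[j].to_string()));
            }
        }
    }
    pairs
}
```

A group of n files with the same key yields n * (n - 1) / 2 pairs, so very large groups of identical files grow quadratically.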
inventory verify
- loads the persisted candidate set
- computes missing full hashes only for those candidate files
- reuses cached full hashes while file metadata is unchanged
- hashes master and slave in parallel when they appear to be on different backing disks
- optional --progress prints full-hash progress to stderr
- checkpoints long full-hash runs into SQLite roughly every 30s
- Ctrl-C stops after the current file, flushes the current batch, and leaves verified_pairs to be refreshed on the next full verify run
- persists verified_pairs into both DBs
- optional --byte-compare adds a stricter final proof layer
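
The "reuse cached full hashes while file metadata is unchanged" rule can be sketched as a simple validity check. Which metadata fields the real implementation compares is not stated here; this example assumes size plus modification time, and the type and function names are hypothetical.

```rust
/// Assumed metadata snapshot used to validate a cached hash.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct FileStamp {
    size: u64,
    mtime_ns: i128,
}

/// A previously computed full digest together with the stamp it was computed under.
struct CachedFullHash {
    stamp: FileStamp,
    digest: [u8; 32],
}

/// Return the cached digest only if the file still has the same stamp.
/// None means the full hash must be recomputed.
fn reusable_full_hash(
    cached: Option<&CachedFullHash>,
    current: FileStamp,
) -> Option<[u8; 32]> {
    cached.filter(|c| c.stamp == current).map(|c| c.digest)
}
```

A check like this trades certainty for speed: a file rewritten in place with the same size and timestamp would slip through, which is one reason a stricter layer such as --byte-compare is useful.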
inventory inspect
- gives a read-only view over candidate_pairs, verified_pairs, or prune_actions
- filters by the current master/slave pairing
- avoids dropping into raw SQLite for common review tasks
inventory prune-slave
- reads the persisted verified set
- writes a deterministic shell script of quoted rm -- ... commands
- records the suggestion in prune_actions in both DBs
- during --execute, flushes deletion audit incrementally so an interruption still leaves a trail for files already removed
- can later support direct --execute, but script review is the intended workflow
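
A deterministic script of quoted rm -- commands could be rendered along these lines. This is an illustrative sketch, not the tool's actual output format: the function names, the #!/bin/sh header, and the set -eu line are assumptions. The quoting uses the standard POSIX technique of wrapping in single quotes and rewriting each embedded single quote as '\''.

```rust
/// Quote one argument for a POSIX shell using single quotes.
fn shell_quote(arg: &str) -> String {
    let mut out = String::with_capacity(arg.len() + 2);
    out.push('\'');
    for ch in arg.chars() {
        if ch == '\'' {
            // Close the quote, emit an escaped quote, reopen the quote.
            out.push_str("'\\''");
        } else {
            out.push(ch);
        }
    }
    out.push('\'');
    out
}

/// Render one `rm -- <path>` line per path, sorted and de-duplicated
/// so the same verified set always produces the same script.
fn render_prune_script(paths: &[&str]) -> String {
    let mut sorted: Vec<&str> = paths.to_vec();
    sorted.sort_unstable();
    sorted.dedup();

    // Header lines are an assumption for this sketch.
    let mut script = String::from("#!/bin/sh\nset -eu\n");
    for p in sorted {
        script.push_str("rm -- ");
        script.push_str(&shell_quote(p));
        script.push('\n');
    }
    script
}
```

The -- before each path stops rm from treating a file name that begins with a dash as an option, and sorting keeps diffs between two generated scripts readable during review.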

Current SQLite Roles

Per inventory DB:

files
- path, size, metadata, presence state
inventory_meta
- root path, filesystem UUID/source/type, scan timestamps, scanner version
file_hashes
- head, tail, and optional full BLAKE3
candidate_pairs
- compare-time duplicate candidates
verified_pairs
- verify-time confirmed duplicates
prune_actions
- audit trail for suggested or executed slave deletions

Build And Test

From this repo:

cargo fmt
cargo test

Immediate Next Steps

Add richer prune policies, for example extension filters or trusted-master-root filters.
Add a purpose-built TUI view over inspect data and long-running scan/verify jobs.
Add ratatui once the backend behavior feels settled.