Dedup file system tools
Inventory Workspace
This repository is a Rust-first implementation of the design in cross-disk-inventory-dedupe-framework.md.
Current Shape
- crates/inventory-core: scan, compare, verify, prune, and SQLite schema helpers
- crates/inventory-cli: inventory CLI surface
- cross-disk-inventory-dedupe-framework.md: architecture and workflow note
Design Intent
The hot path lives in Rust:
- scan mounted trees into per-root _Inventory/inventory.sqlite
- capture filesystem UUID/source/type as inventory metadata
- compare two inventories cheaply with size + head_hash + tail_hash
- scan only the head BLAKE3 window initially, then fill tail hashes lazily during compare
- lazily compute full BLAKE3 only for current candidates
- persist verified duplicate pairs into both the master and slave inventory DBs
- generate a reviewable slave-delete script from the verified set
- keep an audit trail of suggested slave deletions in both DBs
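The cheap comparison step above can be sketched in memory as follows. This is a minimal illustration, not the repository's code: the Entry struct and the candidate_pairs function are hypothetical names, the u64 hash fields stand in for BLAKE3 digests of the head and tail windows, and the real tool works against SQLite rather than slices.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory view of one inventoried file.
/// The u64 hashes are stand-ins for BLAKE3 digests of the head and tail windows.
#[derive(Debug, Clone)]
struct Entry {
    path: String,
    size: u64,
    head_hash: u64,
    tail_hash: u64,
}

/// Cheap comparison key: two files are duplicate candidates only if all three parts match.
fn cheap_key(e: &Entry) -> (u64, u64, u64) {
    (e.size, e.head_hash, e.tail_hash)
}

/// Return (master_path, slave_path) pairs whose cheap keys collide.
/// These are candidates only; a full hash is still needed to confirm them.
fn candidate_pairs(master: &[Entry], slave: &[Entry]) -> Vec<(String, String)> {
    // Index the master side once so each slave entry is a single lookup.
    let mut by_key: HashMap<(u64, u64, u64), Vec<&Entry>> = HashMap::new();
    for m in master {
        by_key.entry(cheap_key(m)).or_default().push(m);
    }

    let mut pairs = Vec::new();
    for s in slave {
        if let Some(matches) = by_key.get(&cheap_key(s)) {
            for m in matches {
                pairs.push((m.path.clone(), s.path.clone()));
            }
        }
    }
    // Sort so the result does not depend on HashMap iteration order.
    pairs.sort();
    pairs
}
```

Matching on the three-part key keeps the expensive full hash off the hot path: only files that already agree on size, head, and tail need to be read end to end.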
The UI stays thin:
- CLI first
- ratatui later for a native Rust TUI
- Datasette remains a separate read/search surface over aggregated SQLite data
What Works Today
Implemented now:
- inventory init-db
- inventory scan
- inventory aggregate
- inventory compare
- inventory inspect
- inventory verify
- inventory prune-slave
CLI Shape
inventory init-db --kind inventory /mnt/disk/_Inventory/inventory.sqlite
inventory scan /mnt/disk
inventory scan /mnt/disk --hash-mode none
inventory scan /mnt/disk --hash-mode head --hash-min-size 1048576
inventory scan /mnt/22Tb-mirror/12Tb-mirror/Multimedia/video \
  --inventory-root /mnt/22Tb-mirror
inventory scan /mnt/disk --progress
inventory scan /mnt/read-only-source --inventory-dir-path /some/writable/place/source-inventory
inventory aggregate /srv/catalog/catalog.sqlite \
  /mnt/diskA/_Inventory/inventory.sqlite \
  /mnt/diskB/_Inventory/inventory.sqlite
inventory compare master.sqlite slave.sqlite
inventory compare master.sqlite slave.sqlite --progress
inventory compare master.sqlite slave.sqlite --min-size 1048576
inventory compare-within /mnt/disk/_Inventory/inventory.sqlite
inventory inspect candidates master.sqlite slave.sqlite
inventory inspect verified master.sqlite slave.sqlite --limit 20
inventory inspect prune-actions master.sqlite slave.sqlite
inventory verify master.sqlite slave.sqlite
inventory verify master.sqlite slave.sqlite --progress
inventory verify master.sqlite slave.sqlite --byte-compare
inventory prune-slave master.sqlite slave.sqlite --script delete-slave-dupes.sh
Command Intent
- inventory scan
  - updates one per-root inventory DB
  - supports scanning a subtree while storing paths relative to a larger inventory root
  - supports --hash-mode head|none, with head as the default
  - supports --hash-min-size so initial or follow-up scan hashing can ignore smaller files
  - computes or reuses only the head BLAKE3 window when hashing is enabled
  - can do a metadata-only first pass with --hash-mode none
  - optional --progress prints periodic counters and current path to stderr
  - checkpoints long scans into SQLite roughly every 30s
  - Ctrl-C stops after the current file, flushes the current batch, and leaves missing-file reconciliation for the next completed scan
  - permission-denied or otherwise unreadable entries are counted and skipped instead of aborting the scan
  - readable files that cannot be opened for hashing stay inventoried but are left unhashed
  - does not precompute full hashes
- inventory compare
  - starts from size-only collisions, then lazily fills missing head hashes only for those candidates
  - finds cheap size+head candidates from two inventory DBs
  - lazily computes tail hashes only for those size+head collisions
  - ignores files smaller than 4096 bytes by default, configurable with --min-size
  - ignores .git, .Trash-*, and $RECYCLE.BIN paths even if they already exist in the inventory DB
  - optional --progress prints candidate and tail-hash progress to stderr
  - checkpoints long tail-hash runs into SQLite roughly every 30s
  - Ctrl-C stops after the current file, flushes the current batch, and leaves existing candidate rows untouched until the next full compare run
  - persists candidate_pairs into both DBs
  - clears stale verified_pairs for that master/slave pairing
- inventory compare-within
  - finds suspected duplicate pairs inside one inventory DB
  - stores one ordered pair per duplicate relationship, not self-matches and not both directions
  - reuses the same size/head/tail candidate pipeline as cross-disk compare
  - inherits the same periodic checkpointing and Ctrl-C flush behavior as inventory compare
  - does not yet apply a keep-policy for which in-tree path should win
- inventory verify
  - loads the persisted candidate set
  - computes missing full hashes only for those candidate files
  - reuses cached full hashes while file metadata is unchanged
  - hashes master and slave in parallel when they appear to be on different backing disks
  - optional --progress prints full-hash progress to stderr
  - checkpoints long full-hash runs into SQLite roughly every 30s
  - Ctrl-C stops after the current file, flushes the current batch, and leaves verified_pairs to be refreshed on the next full verify run
  - persists verified_pairs into both DBs
  - optional --byte-compare adds a stricter final proof layer
- inventory inspect
  - gives a read-only view over candidate_pairs, verified_pairs, or prune_actions
  - filters by the current master/slave pairing
  - avoids dropping into raw SQLite for common review tasks
- inventory prune-slave
  - reads the persisted verified set
  - writes a deterministic shell script of quoted rm -- ... commands
  - records the suggestion in prune_actions in both DBs
  - during --execute, flushes deletion audit incrementally so an interruption still leaves a trail for files already removed
  - can later support direct --execute, but script review is the intended workflow
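As an illustration of the reviewable script that inventory prune-slave writes, the sketch below shows one way to produce deterministic, safely quoted rm -- lines. It is an assumption-laden example rather than the tool's actual output format: shell_quote and render_prune_script are hypothetical names, the shebang line is an assumption, and paths are treated as UTF-8 strings for brevity.

```rust
/// Quote a path for POSIX sh: wrap it in single quotes and rewrite each
/// embedded single quote as '\'' (close quote, escaped quote, reopen quote).
fn shell_quote(path: &str) -> String {
    format!("'{}'", path.replace('\'', r"'\''"))
}

/// Render a delete script for the given slave paths (hypothetical helper).
/// Sorting and de-duplicating makes the output identical for the same input set.
fn render_prune_script(slave_paths: &[&str]) -> String {
    let mut paths: Vec<&str> = slave_paths.to_vec();
    paths.sort();
    paths.dedup();

    // The shebang is an assumption; the source only specifies quoted `rm --` commands.
    let mut script = String::from("#!/bin/sh\n");
    for p in paths {
        // `--` ends option parsing, so a path starting with `-` is not read as a flag.
        script.push_str("rm -- ");
        script.push_str(&shell_quote(p));
        script.push('\n');
    }
    script
}
```

Single-quoting keeps spaces, globs, and dollar signs literal, which matters for paths such as $RECYCLE.BIN; reviewing the generated file before running it remains the intended safeguard.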
Current SQLite Roles
Per inventory DB:
- files: path, size, metadata, presence state
- inventory_meta: root path, filesystem UUID/source/type, scan timestamps, scanner version
- file_hashes: head, tail, and optional full BLAKE3
- candidate_pairs: compare-time duplicate candidates
- verified_pairs: verify-time confirmed duplicates
- prune_actions: audit trail for suggested or executed slave deletions
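To show how files and file_hashes might interact when cached full hashes are reused, here is a minimal sketch. The struct names and fields are hypothetical and do not reflect the actual schema, and comparing size plus modification time is only an assumed definition of "metadata is unchanged".

```rust
/// Hypothetical subset of a `files` row: the metadata seen by the latest scan.
struct FileRow {
    size: u64,
    mtime_ns: i64,
}

/// Hypothetical subset of a `file_hashes` row: the metadata recorded when the
/// hash was computed, plus the optional full digest as a hex string.
struct HashRow {
    hashed_size: u64,
    hashed_mtime_ns: i64,
    full_hash: Option<String>,
}

/// Return the cached full hash only if the file still looks the same as it did
/// when the hash was computed; otherwise the caller should rehash.
fn reusable_full_hash<'a>(file: &FileRow, cached: &'a HashRow) -> Option<&'a str> {
    if file.size == cached.hashed_size && file.mtime_ns == cached.hashed_mtime_ns {
        cached.full_hash.as_deref()
    } else {
        None
    }
}
```

Returning None both for changed metadata and for a missing digest lets the caller treat the two cases the same way: compute the full hash and store it.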
Build And Test
From this repo:
cargo fmt
cargo test
Immediate Next Steps
- Add richer prune policies, for example extension filters or trusted-master-root filters.
- Add a purpose-built TUI view over inspect data and long-running scan/verify jobs.
- Add ratatui once the backend behavior feels settled.