Dedup file system tools

Inventory Workspace

This repository is a Rust-first implementation of the design in cross-disk-inventory-dedupe-framework.md.

Current Shape

  • crates/inventory-core
    • scan, compare, verify, prune, and SQLite schema helpers
  • crates/inventory-cli
    • inventory CLI surface
  • cross-disk-inventory-dedupe-framework.md
    • architecture and workflow note

Design Intent

The hot path lives in Rust:

  • scan mounted trees into per-root _Inventory/inventory.sqlite
  • capture filesystem UUID/source/type as inventory metadata
  • compare two inventories cheaply with size + head_hash + tail_hash
  • hash only the head BLAKE3 window during the initial scan, then fill in tail hashes lazily during compare
  • compute full BLAKE3 hashes lazily, and only for current candidates
  • persist verified duplicate pairs into both the master and slave inventory DBs
  • generate a reviewable slave-delete script from the verified set
  • keep an audit trail of suggested slave deletions in both DBs
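The staged size, head, tail filter above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the code in crates/inventory-core: the 64 KiB WINDOW, the names window_digest and cheap_candidates, and the use of std's DefaultHasher as a stand-in for BLAKE3 are all hypothetical, and the sketch walks paths directly instead of reading and caching rows in the inventory DBs.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::fs::{self, File};
use std::hash::Hasher;
use std::io::{self, Read, Seek, SeekFrom};
use std::path::{Path, PathBuf};

/// Hypothetical window length; the real head/tail window size is not stated here.
const WINDOW: u64 = 64 * 1024;

/// Digest of the first or last WINDOW bytes of a file.
/// DefaultHasher is only a stand-in for BLAKE3 so the sketch needs no external crates.
fn window_digest(path: &Path, from_end: bool) -> io::Result<u64> {
    let mut file = File::open(path)?;
    let len = file.metadata()?.len();
    let take = len.min(WINDOW);
    let start = if from_end { len - take } else { 0 };
    file.seek(SeekFrom::Start(start))?;
    let mut buf = vec![0u8; take as usize];
    file.read_exact(&mut buf)?;
    let mut hasher = DefaultHasher::new();
    hasher.write(&buf);
    Ok(hasher.finish())
}

/// Returns (master, slave) pairs that agree on size, head digest, and tail digest.
/// Each later stage runs only for files that survived the cheaper stage before it.
fn cheap_candidates(
    master: &[PathBuf],
    slave: &[PathBuf],
) -> io::Result<Vec<(PathBuf, PathBuf)>> {
    // Stage 1: index master files by size (metadata only, no file reads).
    let mut by_size: HashMap<u64, Vec<&PathBuf>> = HashMap::new();
    for m in master {
        by_size.entry(fs::metadata(m)?.len()).or_default().push(m);
    }

    let mut pairs = Vec::new();
    for s in slave {
        let size = fs::metadata(s)?.len();
        let Some(peers) = by_size.get(&size) else {
            continue; // no size collision, so nothing is hashed for this file
        };
        // Stage 2: head digests, computed only for size collisions.
        let s_head = window_digest(s, false)?;
        let mut s_tail: Option<u64> = None;
        for &m in peers {
            if window_digest(m, false)? != s_head {
                continue;
            }
            // Stage 3: tail digests, computed only for size+head collisions.
            let tail = match s_tail {
                Some(t) => t,
                None => {
                    let t = window_digest(s, true)?;
                    s_tail = Some(t);
                    t
                }
            };
            if window_digest(m, true)? == tail {
                pairs.push((m.clone(), s.clone()));
            }
        }
    }
    Ok(pairs)
}
```

A pair returned here is still only a candidate. Confirming it with a full hash, and optionally a byte compare, is the job of the verify step.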

The UI stays thin:

  • CLI first
  • ratatui later for a native Rust TUI
  • Datasette remains a separate read/search surface over aggregated SQLite data

What Works Today

Implemented now:

  • inventory init-db
  • inventory scan
  • inventory aggregate
  • inventory compare
  • inventory compare-within
  • inventory inspect
  • inventory verify
  • inventory prune-slave

CLI Shape

inventory init-db --kind inventory /mnt/disk/_Inventory/inventory.sqlite
inventory scan /mnt/disk
inventory scan /mnt/disk --hash-mode none
inventory scan /mnt/disk --hash-mode head --hash-min-size 1048576
inventory scan /mnt/22Tb-mirror/12Tb-mirror/Multimedia/video \
  --inventory-root /mnt/22Tb-mirror
inventory scan /mnt/disk --progress
inventory scan /mnt/read-only-source --inventory-dir-path /some/writable/place/source-inventory
inventory aggregate /srv/catalog/catalog.sqlite \
  /mnt/diskA/_Inventory/inventory.sqlite \
  /mnt/diskB/_Inventory/inventory.sqlite

inventory compare master.sqlite slave.sqlite
inventory compare master.sqlite slave.sqlite --progress
inventory compare master.sqlite slave.sqlite --min-size 1048576
inventory compare-within /mnt/disk/_Inventory/inventory.sqlite
inventory inspect candidates master.sqlite slave.sqlite
inventory inspect verified master.sqlite slave.sqlite --limit 20
inventory inspect prune-actions master.sqlite slave.sqlite
inventory verify master.sqlite slave.sqlite
inventory verify master.sqlite slave.sqlite --progress
inventory verify master.sqlite slave.sqlite --byte-compare
inventory prune-slave master.sqlite slave.sqlite --script delete-slave-dupes.sh

Command Intent

  • inventory scan

    • updates one per-root inventory DB
    • supports scanning a subtree while storing paths relative to a larger inventory root
    • supports --hash-mode head|none, with head as the default
    • supports --hash-min-size so initial or follow-up scan hashing can ignore smaller files
    • computes or reuses only the head BLAKE3 window when hashing is enabled
    • can do a metadata-only first pass with --hash-mode none
    • optional --progress prints periodic counters and current path to stderr
    • checkpoints long scans into SQLite roughly every 30s
    • Ctrl-C stops after the current file, flushes the current batch, and leaves missing-file reconciliation for the next completed scan
    • permission-denied or otherwise unreadable entries are counted and skipped instead of aborting the scan
    • files that can be listed but cannot be opened for hashing stay in the inventory and are left unhashed
    • does not precompute full hashes
  • inventory compare

    • starts from size-only collisions, then lazily fills missing head hashes only for those candidates
    • finds cheap size+head candidates from two inventory DBs
    • lazily computes tail hashes only for those size+head collisions
    • ignores files smaller than 4096 bytes by default, configurable with --min-size
    • ignores .git, .Trash-*, and $RECYCLE.BIN paths even if they already exist in the inventory DB
    • optional --progress prints candidate and tail-hash progress to stderr
    • checkpoints long tail-hash runs into SQLite roughly every 30s
    • Ctrl-C stops after the current file, flushes the current batch, and leaves existing candidate rows untouched until the next full compare run
    • persists candidate_pairs into both DBs
    • clears stale verified_pairs for that master/slave pairing
  • inventory compare-within

    • finds suspected duplicate pairs inside one inventory DB
    • stores one ordered pair per duplicate relationship, excluding self-matches and never recording both directions
    • reuses the same size/head/tail candidate pipeline as cross-disk compare
    • inherits the same periodic checkpointing and Ctrl-C flush behavior as inventory compare
    • does not yet apply a keep-policy for which in-tree path should win
  • inventory verify

    • loads the persisted candidate set
    • computes missing full hashes only for those candidate files
    • reuses cached full hashes while file metadata is unchanged
    • hashes master and slave in parallel when they appear to be on different backing disks
    • optional --progress prints full-hash progress to stderr
    • checkpoints long full-hash runs into SQLite roughly every 30s
    • Ctrl-C stops after the current file, flushes the current batch, and leaves verified_pairs to be refreshed on the next full verify run
    • persists verified_pairs into both DBs
    • optional --byte-compare adds a stricter final proof layer
  • inventory inspect

    • gives a read-only view over candidate_pairs, verified_pairs, or prune_actions
    • filters by the current master/slave pairing
    • avoids dropping into raw SQLite for common review tasks
  • inventory prune-slave

    • reads the persisted verified set
    • writes a deterministic shell script of quoted rm -- ... commands
    • records the suggestion in prune_actions in both DBs
    • supports direct --execute, but script review is the intended workflow
    • during --execute, flushes the deletion audit incrementally so an interruption still leaves a trail for files already removed
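The "deterministic shell script of quoted rm -- ... commands" idea from inventory prune-slave can be illustrated as follows. This is a sketch under assumptions, not the actual generator: sh_quote and render_prune_script are hypothetical names, the #!/bin/sh header is an assumption, and paths are taken as UTF-8 strings even though real Unix paths may not be valid UTF-8.

```rust
/// Quote a string for POSIX sh: wrap it in single quotes and
/// rewrite each embedded single quote as '\'' (close, escaped quote, reopen).
fn sh_quote(s: &str) -> String {
    let mut out = String::with_capacity(s.len() + 2);
    out.push('\'');
    for ch in s.chars() {
        if ch == '\'' {
            out.push_str("'\\''");
        } else {
            out.push(ch);
        }
    }
    out.push('\'');
    out
}

/// Render one `rm -- <quoted path>` line per slave path.
/// Sorting and de-duplicating makes the output independent of input order,
/// which is one simple way to keep the script deterministic.
fn render_prune_script(slave_paths: &[&str]) -> String {
    let mut sorted: Vec<&str> = slave_paths.to_vec();
    sorted.sort_unstable();
    sorted.dedup();

    let mut script = String::from("#!/bin/sh\n");
    for path in sorted {
        script.push_str("rm -- ");
        script.push_str(&sh_quote(path));
        script.push('\n');
    }
    script
}
```

The `--` stops rm from reading a path that begins with `-` as an option, and single-quoting keeps spaces, `$`, and glob characters literal, so the script can be reviewed line by line before it is run.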

Current SQLite Roles

Per inventory DB:

  • files
    • path, size, metadata, presence state
  • inventory_meta
    • root path, filesystem UUID/source/type, scan timestamps, scanner version
  • file_hashes
    • head, tail, and optional full BLAKE3
  • candidate_pairs
    • compare-time duplicate candidates
  • verified_pairs
    • verify-time confirmed duplicates
  • prune_actions
    • audit trail for suggested or executed slave deletions
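One way to read the split between files and file_hashes is that a cached hash is only trustworthy while the metadata it was computed against still matches the file row. The sketch below illustrates that check with hypothetical Rust types. The field names (mtime_ns, present, and so on) are assumptions for illustration and are not taken from the real schema. The 32-byte arrays reflect BLAKE3's default output length.

```rust
/// Hypothetical in-memory view of a `files` row.
#[derive(Debug, Clone, PartialEq, Eq)]
struct FileRow {
    path: String,
    size: u64,
    mtime_ns: i128,
    present: bool,
}

/// Hypothetical in-memory view of a `file_hashes` row.
/// `size` and `mtime_ns` record the metadata the hashes were computed against.
#[derive(Debug, Clone, PartialEq, Eq, Default)]
struct HashRow {
    size: u64,
    mtime_ns: i128,
    head: Option<[u8; 32]>,
    tail: Option<[u8; 32]>,
    full: Option<[u8; 32]>,
}

/// Return the cached full hash only if the file is still present and its
/// size and modification time match what was hashed; otherwise signal
/// that the hash has to be recomputed.
fn reusable_full(file: &FileRow, hashes: &HashRow) -> Option<[u8; 32]> {
    if file.present && file.size == hashes.size && file.mtime_ns == hashes.mtime_ns {
        hashes.full
    } else {
        None
    }
}
```

Returning None both when no full hash exists and when the metadata has drifted lets a caller treat the two cases the same way: hash the file again and store the result.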

Build And Test

From this repo:

cargo fmt
cargo test

Immediate Next Steps

  1. Add richer prune policies, for example extension filters or trusted-master-root filters.
  2. Add a purpose-built TUI view over inspect data and long-running scan/verify jobs.
  3. Add ratatui once the backend behavior feels settled.