Analysis Workflow Comparison

find_clones

Hash bucketing and line similarity scoring to find structurally similar code. Detects the same algorithm even when variable names differ.

  • Token reduction: −86%
  • Native baseline: partial (text-only grep)
  • Tokens (CodeSift vs native): 1,871 vs 13,107

What Native Tools Cannot Do

Grep finds exact text matches. If someone copy-pasted a function and renamed the variables, grep will not find it. If someone reimplemented the same algorithm with different formatting, grep will not find it. If two functions follow the same structural pattern (fetch, validate, transform, return) with completely different domain logic, grep will not find it.

find_clones uses structural similarity detection. It hashes code blocks by their AST shape and groups structurally similar functions together, then scores them by line-level similarity. Two functions that implement the same pattern with different variable names, different string literals, and different formatting will be detected as clones.
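To make the idea concrete, here is a minimal sketch of identifier-and-literal normalization, the core trick that lets structural detection catch renamed clones. This is an illustration using Python's tokenize module, not CodeSift's actual implementation:

```python
import io
import keyword
import tokenize

def normalize(src: str) -> str:
    """Replace identifiers and literals with placeholders so that
    renamed copies of the same code normalize to the same string."""
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(src).readline):
        if tok.type == tokenize.NAME:
            # keep keywords (control flow is structure), mask names
            out.append(tok.string if keyword.iskeyword(tok.string) else "ID")
        elif tok.type in (tokenize.NUMBER, tokenize.STRING):
            out.append("LIT")
        elif tok.type == tokenize.OP:
            out.append(tok.string)
    return " ".join(out)

# Same algorithm, different identifiers: grep sees two unrelated
# strings, but the normalized forms are identical.
a = "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s\n"
b = "def sum_up(vals):\n    acc = 0\n    for v in vals:\n        acc += v\n    return acc\n"
print(normalize(a) == normalize(b))  # True
```

Both functions reduce to `def ID ( ID ) : ID = LIT for ID in ID : ID += ID return ID`, which is why a hash of that shape groups them as clone candidates.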

How It Works

The detection pipeline has two stages:

  1. Hash bucketing. Functions are hashed by their structural shape — the pattern of control flow, calls, and assignments — ignoring identifiers and literals. Functions with the same structural hash are grouped as candidates.

  2. Line similarity scoring. Within each candidate group, line-by-line comparison produces a similarity percentage. The min_similarity parameter (default 0.7) filters out pairs that share structure but diverge significantly in implementation.

This two-stage approach is faster than pairwise comparison of all functions (which would be O(n^2)) while still catching clones that differ in naming and formatting.
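The two stages can be sketched as follows. This is a hedged illustration, not CodeSift's code: the `shape` strings stand in for the identifier-free structural fingerprint, and difflib's character-level ratio stands in for the line-level scorer:

```python
import difflib
import hashlib
from collections import defaultdict
from itertools import combinations

def find_clone_pairs(functions, min_similarity=0.7):
    """functions: list of (name, shape, source) tuples.

    Stage 1: bucket by structural hash, so only functions with the
    same shape become candidates (avoids comparing all O(n^2) pairs).
    Stage 2: score candidate pairs by textual similarity and keep
    those at or above min_similarity.
    """
    buckets = defaultdict(list)
    for name, shape, source in functions:
        key = hashlib.sha1(shape.encode()).hexdigest()
        buckets[key].append((name, source))

    clones = []
    for group in buckets.values():
        for (name_a, src_a), (name_b, src_b) in combinations(group, 2):
            score = difflib.SequenceMatcher(None, src_a, src_b).ratio()
            if score >= min_similarity:
                clones.append((name_a, name_b, round(score, 2)))
    return clones

pairs = find_clone_pairs([
    ("sum_items", "assign;loop;augassign;return",
     "n = 0\nfor x in items:\n    n += x\nreturn n"),
    ("sum_values", "assign;loop;augassign;return",
     "m = 0\nfor y in items:\n    m += y\nreturn m"),
    ("identity", "return", "return value"),
])
print(pairs)  # the renamed loops pair up; "identity" is never compared
```

Note how the hash buckets do the heavy lifting: the structurally different function never reaches the pairwise scoring stage at all.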

Benchmark

Approach | Tokens | What You Get
grep for suspected patterns | 13,107 | Exact text matches only; misses renamed clones.
find_clones | 1,871 | Structural clones with similarity scores, file locations, and paired source excerpts.

The native approach requires you to already suspect which patterns are duplicated and search for them specifically. find_clones discovers duplication you did not know existed.

What the Output Contains

Each clone pair includes:

  • Both function names and locations — file path and line number for each
  • Similarity score — percentage of structural and textual overlap
  • Paired source excerpts — the relevant code from both functions, so you can judge whether the duplication is worth extracting

When to Use It

find_clones is a cleanup and architecture tool. Use it when:

  • You are starting a DRY (Don’t Repeat Yourself) effort and need to find extraction candidates. Clone pairs with similarity above 80% are strong candidates for shared utility functions.
  • A codebase has grown through AI-assisted development, which tends to generate copy-paste variants. Internal testing on CodeSift’s own codebase found 17 instances of the same countLines utility reimplemented across files.
  • You are reviewing a module and suspect patterns are repeated. Clone detection confirms or refutes the suspicion with data.
  • You are planning a major refactoring and want to know which functions can be consolidated.

The min_similarity parameter controls sensitivity. At the default of 0.7, you get pairs that score 70% or higher. Raise it to 0.9 to report only near-exact duplicates. Lower it to 0.5 for looser structural similarity that may reveal shared patterns worth abstracting.
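To see how the threshold behaves, here is a small illustration using difflib as a stand-in scorer; the exact numbers CodeSift assigns will differ, but the filtering works the same way:

```python
import difflib

def score(a: str, b: str) -> float:
    # stand-in for the line-similarity scorer, not CodeSift's own
    return difflib.SequenceMatcher(None, a, b).ratio()

# A near-exact rename versus a looser reimplementation of the same idea.
near_exact = score(
    "total = sum(prices)\nreturn total",
    "total = sum(costs)\nreturn total",
)
loose = score(
    "total = sum(prices)\nreturn total",
    "t = 0.0\nfor p in prices:\n    t += p\nreturn round(t, 2)",
)

# A pair is reported only if its score clears min_similarity.
for min_similarity in (0.5, 0.7, 0.9):
    survivors = [round(s, 2) for s in (near_exact, loose) if s >= min_similarity]
    print(min_similarity, survivors)
```

The renamed pair scores high and survives strict thresholds; the reimplementation scores low and only shows up as you loosen min_similarity.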

Pair with frequency_analysis for a different angle on the same problem: find_clones finds pairwise duplicates, while frequency_analysis finds recurring AST shapes across the entire codebase.

Benchmark note

This benchmark compares CodeSift against the closest practical native workflow an agent would use for the same task. For some tools, that baseline is a direct shell equivalent such as rg or find. For AST-aware, graph-aware, and LSP-backed tools, the baseline is a multi-step workflow rather than a strictly identical command. Results should be read as agent-workflow comparisons: token cost, call count, and practical context efficiency.