Rabin-Karp String Matching in Java - Interview Preparation Guide

Rabin-Karp is the rolling-hash string-matching pattern for comparing many same-length substrings quickly. Strong candidates emphasize that the hash is a fast filter, not a proof by itself, and they explain collision handling explicitly.

Interview lens

A strong explanation should name the invariant, the safe transition, and the condition that makes this pattern preferable to brute force.

Pattern Summary Table

Pattern	When to Use	Key Idea	Example
04 14 Rabin Karp String Matching	many fixed-length substring comparisons are needed	maintain a rolling hash so each window shift is O(1)	Find Pattern in Text

Problem Statement

Given a text and pattern, or many equal-length substring checks, avoid rebuilding string comparisons from scratch at every shift.

Note

Emphasize the constraints before coding. The real signal is often whether the brute-force search space, update volume, or graph model makes the naive solution impossible.

Pattern Recognition Signals

Keywords in the problem: rolling hash, substring hash, collision, fixed-length window.
Structural signal: consecutive windows differ by one outgoing and one incoming character.
Complexity signal: the optimized version avoids repeated rescans, recomputation, or state explosion that brute force would suffer.

Important

If many same-length substring checks are needed and rolling comparison helps, think Rabin-Karp.

Java Template

// Rolling hash window update.
long hash = 0;
for (int i = 0; i < m; i++) hash = (hash * base + s.charAt(i)) % mod;

Single-Pattern Search Example

public int rabinKarp(String text, String pat) {
    int n = text.length(), m = pat.length();
    if (m == 0) return 0;
    if (m > n) return -1;

    long mod = 1_000_000_007L, base = 911382323L;
    long pow = 1, ph = 0, th = 0;
    for (int i = 0; i < m; i++) {
        ph = (ph * base + pat.charAt(i)) % mod;
        th = (th * base + text.charAt(i)) % mod;
        if (i > 0) pow = (pow * base) % mod;
    }

    for (int i = 0; i + m <= n; i++) {
        if (ph == th && text.regionMatches(i, pat, 0, m)) return i; // collision-safe check
        if (i + m < n) {
            th = (th - text.charAt(i) * pow) % mod;
            if (th < 0) th += mod;
            th = (th * base + text.charAt(i + m)) % mod;
        }
    }
    return -1;
}

Dry Run (Conceptual)

Text: "abracadabra", pattern: "cad" (m=3)

compute hash of pattern and first window "abr"
slide one char each step, update hash in O(1)
when hash matches at window "cad", verify exact substring to avoid collision
return starting index

Rolling hash avoids re-hashing full substring each shift.

Collision Handling Rule

Hash match is necessary, not sufficient. Always verify actual substring equality before declaring match.

For very large-scale matching, use double hashing to reduce collision probability further.

Problem 1: Find All Pattern Occurrences

Problem description: Given text and pattern, return all starting indices where pattern appears in text.

What we are solving actually: We want to compare many same-length windows quickly. Rolling hash lets us update the next window in O(1) instead of re-checking every character from scratch.

What we are doing actually:

Compute the pattern hash and the first window hash.
Compare hashes for each window.
Verify the substring only when hashes match.
Slide the window by removing the old left character and adding the new right character.

public List<Integer> search(String text, String pattern) {
    List<Integer> ans = new ArrayList<>();
    int m = pattern.length();
    if (m > text.length()) return ans;

    long base = 911382323L;
    long mod = 1_000_000_007L;
    long patternHash = 0, windowHash = 0, power = 1;

    for (int i = 0; i < m; i++) {
        patternHash = (patternHash * base + pattern.charAt(i)) % mod;
        windowHash = (windowHash * base + text.charAt(i)) % mod;
        if (i + 1 < m) power = (power * base) % mod; // base^(m-1) is needed when removing the leftmost character.
    }

    for (int i = 0; i + m <= text.length(); i++) {
        if (patternHash == windowHash && text.regionMatches(i, pattern, 0, m)) {
            ans.add(i); // Verify equal hashes to guard against rare collisions.
        }
        if (i + m == text.length()) break;

        windowHash = (windowHash - text.charAt(i) * power) % mod;
        if (windowHash < 0) windowHash += mod;
        windowHash = (windowHash * base + text.charAt(i + m)) % mod; // Slide one character to the right.
    }
    return ans;
}

Debug steps:

print the rolling hash before and after each slide to confirm the update formula
test one repeated-text case like "aaaaa" with pattern "aa"
verify the invariant that windowHash always represents exactly the current length-m window

Problem-Fit Checklist

Identify whether input size or query count requires preprocessing or specialized data structures.
Confirm problem constraints (sorted input, non-negative weights, DAG-only, immutable array, etc.).
Validate that the pattern gives asymptotic improvement over brute-force under worst-case input.
Define explicit success criteria: value only, index recovery, count, path reconstruction, or ordering.

Invariant and Reasoning

Write one invariant that must stay true after every transition (loop step, recursion return, or update).
Ensure each step makes measurable progress toward termination.
Guard boundary states explicitly (empty input, singleton, duplicates, overflow, disconnected graph).
Add a quick correctness check using a tiny hand-worked example before coding full solution.

Complexity and Design Notes

Compute time complexity for both preprocessing and per-query/per-update operations.
Track memory overhead and object allocations, not only Big-O notation.
Prefer primitives and tight loops in hot paths to reduce GC pressure in Java.
If multiple variants exist, choose the one with the simplest correctness proof first.

Production Perspective

Convert algorithmic states into explicit metrics (queue size, active nodes, cache hit ratio, relaxation count).
Add guardrails for pathological inputs to avoid latency spikes.
Keep implementation deterministic where possible to simplify debugging and incident analysis.
Separate pure algorithm logic from I/O and parsing so the core stays testable.

Implementation Workflow

Implement the minimal correct template with clear invariants.
Add edge-case tests before optimizing.
Measure complexity-sensitive operations on realistic input sizes.
Refactor for readability only after behavior is locked by tests.

Common Mistakes

Choosing the pattern without proving problem fit.
Ignoring edge cases (empty input, duplicates, overflow, disconnected state).
Mixing multiple strategies without clear invariants.
No complexity analysis against worst-case input.

Practice Set (Recommended Order)

Find the Index of the First Occurrence in a String (LC 28)
LeetCode
Repeated DNA Sequences (LC 187)
LeetCode

Key Takeaways

This pattern is most effective when transitions are explicit and invariants are enforced at every step.
Strong preconditions and boundary handling make these implementations production-safe.
Reuse this template and adapt it per problem constraints.

Find posts and pages