The Algorithm Hiding Behind git diff
You've run git diff hundreds of times.
Red lines. Green lines. Done.
But have you ever stopped and asked — what algorithm is actually doing that?
It turns out, the idea is closely related to one of the most classic problems in computer science: Longest Common Subsequence.
In this article, we'll explore how Git-style diffing works, why LCS is the right mental model, how the actual algorithm Git uses — Myers diff — connects to it, and what tradeoffs real tools make when choosing a diff algorithm.
This is the first article in my series "DSA Application in Real Life" — where I explore how common data structures and algorithms power the tools developers use every day.
The Problem Git Is Solving
Imagine we have an old version of a file:
function add(a, b) {
  return a + b;
}
Then we update it:
function addNumbers(a, b) {
  return a + b;
}
When we run git diff, Git shows:
-function add(a, b) {
+function addNumbers(a, b) {
   return a + b;
 }
This looks obvious to us as humans. Only the function name changed.
But Git does not "understand" JavaScript the way we do. At the diffing level, Git treats the file as a sequence of lines. Its job is to compare two sequences and decide:
- Which lines stayed the same?
- Which lines were deleted?
- Which lines were added?
This is a sequence comparison problem — and that's exactly where LCS comes in.
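To make the "sequence of lines" view concrete, here is a small sketch using Python's standard difflib module. Note the assumption: difflib uses its own matching heuristic, not Git's algorithm, so this only illustrates the input and output shape. The file names old.js and new.js are made up for the example.

```python
import difflib

old_text = "function add(a, b) {\n  return a + b;\n}\n"
new_text = "function addNumbers(a, b) {\n  return a + b;\n}\n"

# A diff tool sees each file as a plain list of lines, nothing more.
old_lines = old_text.splitlines()
new_lines = new_text.splitlines()

# unified_diff yields file headers, a hunk marker, then lines
# prefixed with ' ' (unchanged), '-' (deleted) or '+' (added).
diff_lines = list(
    difflib.unified_diff(old_lines, new_lines, "old.js", "new.js", lineterm="")
)
print("\n".join(diff_lines))
```

The tool never parses the JavaScript. It only compares two lists of strings.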
Why Simple Line-by-Line Comparison Is Not Enough
A beginner might think Git just compares files line by line:
Old line 1 vs New line 1
Old line 2 vs New line 2
Old line 3 vs New line 3
This works only when changes happen at the same position. But real code changes are rarely that simple.
Consider this old file:
login()
validate()
save()
logout()
Now we insert one new line:
login()
checkPermission()
validate()
save()
logout()
A naive line-by-line comparison would produce:
Old: login()      New: login()            same
Old: validate()   New: checkPermission()  different
Old: save()       New: validate()         different
Old: logout()     New: save()             different
Old: (nothing)    New: logout()           added
That makes it look like almost the entire file changed — which is completely wrong. Only one line was added.
A smarter approach does not compare by position only. It first finds what stayed common between the two files.
That is the LCS idea.
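The failure of positional comparison is easy to reproduce. The sketch below is a deliberately naive comparer (naive_compare is a hypothetical helper written for this article, not something Git contains) run on the same two files:

```python
from itertools import zip_longest

def naive_compare(old, new):
    """Compare lines purely by position (the flawed approach)."""
    report = []
    # zip_longest pads the shorter file with None.
    for old_line, new_line in zip_longest(old, new):
        if old_line == new_line:
            report.append(("same", old_line))
        elif old_line is None:
            report.append(("added", new_line))
        elif new_line is None:
            report.append(("deleted", old_line))
        else:
            report.append(("different", f"{old_line} -> {new_line}"))
    return report

old = ["login()", "validate()", "save()", "logout()"]
new = ["login()", "checkPermission()", "validate()", "save()", "logout()"]

result = naive_compare(old, new)
for status, detail in result:
    print(f"{status:10} {detail}")
```

One inserted line shifts every later position, so three unchanged lines are reported as different.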
LCS: The Mental Model Behind Diffing
LCS stands for Longest Common Subsequence.
A subsequence means you can pick elements from a sequence while keeping their relative order — but they do not need to be adjacent.
Example:
Old = [A, B, C, D]
New = [A, C, E, D]
The longest common subsequence is:
[A, C, D]
Because A, C, and D appear in both sequences in the same order.
Applied to file diffing, the lines of each file become the sequences:
Old = [login(), validate(), save(), logout()]
New = [login(), checkPermission(), validate(), save(), logout()]
The LCS is:
[login(), validate(), save(), logout()]
Now Git-style diffing can reason:
- These lines are common → unchanged
- checkPermission() is only in the new file → added
Result:
 login()
+checkPermission()
 validate()
 save()
 logout()
That's the core idea.
The Actual LCS Algorithm with Code
Here's the classic dynamic programming solution you've likely seen in competitive programming:
def lcs_length(A, B):
    m, n = len(A), len(B)
    # dp[i][j] = LCS length of A[:i] and B[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if A[i - 1] == B[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]
For sequences:
A = [A, B, C, D]
B = [A, C, E, D]
The DP table looks like this:
       ""  A  C  E  D
""  [  0   0  0  0  0 ]
A   [  0   1  1  1  1 ]
B   [  0   1  1  1  1 ]
C   [  0   1  2  2  2 ]
D   [  0   1  2  2  3 ]
The answer is:
dp[4][4] = 3
So the LCS length is 3, and the LCS is:
[A, C, D]
Time complexity: O(m × n)
Space complexity: O(m × n)
For large files, this gets expensive — which is why Git does not use textbook LCS directly.
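If only the LCS length is needed, the memory cost can be reduced, because each row of the table depends only on the row above it. Here is a minimal sketch of that idea (lcs_length_two_rows is a name chosen for this example). Running time stays O(m × n), and the common lines themselves can no longer be read back from the table.

```python
def lcs_length_two_rows(A, B):
    """LCS length using two rows instead of the full table."""
    # Put the shorter sequence along the row to keep the rows small.
    if len(B) > len(A):
        A, B = B, A
    prev = [0] * (len(B) + 1)          # row i - 1 of the full table
    for a in A:
        curr = [0]                     # row i, built left to right
        for j, b in enumerate(B, start=1):
            if a == b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

print(lcs_length_two_rows(["A", "B", "C", "D"], ["A", "C", "E", "D"]))
```

Space drops to O(min(m, n)), which helps with memory but not with the quadratic running time.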
Reconstructing the LCS from the DP Table
The DP table gives us the length of the LCS.
But to build an actual diff, we also need the common lines themselves.
We can get them by backtracking from the bottom-right corner of the table.
def build_lcs(A, B):
    m, n = len(A), len(B)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    # Build DP table
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if A[i - 1] == B[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    # Backtrack to reconstruct the actual LCS
    lcs = []
    i, j = m, n
    while i > 0 and j > 0:
        if A[i - 1] == B[j - 1]:
            lcs.append(A[i - 1])
            i -= 1
            j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return lcs[::-1]

old_file = ["A", "B", "C", "D"]
new_file = ["A", "C", "E", "D"]
print(build_lcs(old_file, new_file))
# Output: ['A', 'C', 'D']
Now we do not only know the LCS length. We also know the actual common lines.
That is what lets us decide which lines stayed unchanged, which lines were deleted, and which lines were added.
How LCS Builds the Diff
Once you know the LCS, building the diff becomes straightforward:
- Lines in the LCS → unchanged
- Lines in old but not in the LCS → deleted
- Lines in new but not in the LCS → added
Example:
Old = [A, B, C, D]
New = [A, C, E, D]
LCS = [A, C, D]
B is only in Old → deleted
E is only in New → added
Diff output:
 A
-B
 C
+E
 D
This is the basic shape of what Git, GitHub pull requests, VS Code comparison, and merge tools show: unchanged lines, deleted lines, and added lines.
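The three rules above can be turned into code by walking the DP table backwards and emitting a marker for every step. This is a minimal sketch of LCS-based diffing, not Git's implementation, and lcs_table and lcs_diff are names invented for this example.

```python
def lcs_table(old, new):
    """dp[i][j] = LCS length of old[:i] and new[:j]."""
    m, n = len(old), len(new)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if old[i - 1] == new[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp

def lcs_diff(old, new):
    """Return diff lines prefixed with ' ', '-' or '+'."""
    dp = lcs_table(old, new)
    i, j = len(old), len(new)
    out = []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and old[i - 1] == new[j - 1]:
            out.append(" " + old[i - 1])     # line is in the LCS: unchanged
            i -= 1
            j -= 1
        elif j > 0 and (i == 0 or dp[i][j - 1] >= dp[i - 1][j]):
            out.append("+" + new[j - 1])     # only in new: added
            j -= 1
        else:
            out.append("-" + old[i - 1])     # only in old: deleted
            i -= 1
    return out[::-1]                         # we walked backwards

print("\n".join(lcs_diff(["A", "B", "C", "D"], ["A", "C", "E", "D"])))
```

Because the walk runs from the end of the files, taking additions first during backtracking means deletions appear before additions in the final output, which matches the order most diff tools show.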
Does Git Actually Use Textbook LCS?
Not directly.
Git's default algorithm is Myers diff — and it solves a slightly different but deeply related problem called the Shortest Edit Script.
The Shortest Edit Script asks:
What is the smallest number of insertions and deletions needed to transform the old file into the new file?
LCS and Shortest Edit Script are closely connected.
LCS asks:
What is the longest structure that stayed the same?
Shortest Edit Script asks:
What is the smallest set of changes needed to transform one sequence into another?
When only insertions and deletions are allowed, minimizing the edit script is equivalent to maximizing the LCS length: every line outside the LCS must be either deleted from the old file or inserted from the new one.
For two sequences with lengths m and n:
edit distance = m + n - 2 × LCS length
So yes, they are two sides of the same coin — but they approach the problem from different directions.
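The formula is easy to check numerically. The sketch below computes both quantities independently with two small DP functions (lcs_length and indel_distance are helper names chosen for this example) and compares them:

```python
def lcs_length(a, b):
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

def indel_distance(a, b):
    """Fewest insertions + deletions that turn a into b (no substitutions)."""
    m, n = len(a), len(b)
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i                 # delete everything in a[:i]
    for j in range(n + 1):
        dist[0][j] = j                 # insert everything in b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dist[i][j] = dist[i - 1][j - 1]
            else:
                dist[i][j] = 1 + min(dist[i - 1][j], dist[i][j - 1])
    return dist[m][n]

old = ["A", "B", "C", "D"]
new = ["A", "C", "E", "D"]
print(indel_distance(old, new))                      # 2
print(len(old) + len(new) - 2 * lcs_length(old, new))  # 2
```

Both lines print 2 for this pair: 4 + 4 - 2 × 3 = 2.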
When we say "Git uses LCS-based diffing," the accurate meaning is:
Git's diffing is based on sequence-comparison ideas rooted in LCS, but its default implementation uses Myers' shortest edit script algorithm, which is faster in practice.
How Myers Diff Actually Works
Myers models the diff problem as a graph search.
Imagine a grid where:
- The X-axis represents lines of the old file
- The Y-axis represents lines of the new file
- Moving right means deleting a line from the old file
- Moving down means inserting a line from the new file
- Moving diagonally means the lines match, so no edit is needed
For:
Old = [A, B, C, D]
New = [A, C, E, D]
The matching positions are:
A matches A
C matches C
D matches D
A simplified grid looks like this:
              Old
          A    B    C    D
        ┌────┬────┬────┬────┐
New   A │ ╲  │    │    │    │
        ├────┼────┼────┼────┤
      C │    │    │ ╲  │    │
        ├────┼────┼────┼────┤
      E │    │    │    │    │
        ├────┼────┼────┼────┤
      D │    │    │    │ ╲  │
        └────┴────┴────┴────┘
The diagonal marks show where the two sequences have the same line.
But the important part is not only the matching cells.
The important part is the path from the top-left corner to the bottom-right corner.
For this example, one shortest path is:
(0,0)
│
├─ diagonal: A matches A
│
(1,1)
│
├─ right: delete B
│
(2,1)
│
├─ diagonal: C matches C
│
(3,2)
│
├─ down: insert E
│
(3,3)
│
├─ diagonal: D matches D
│
(4,4)
So the path is:
diagonal → right → diagonal → down → diagonal
Which means:
Keep A
Delete B
Keep C
Insert E
Keep D
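The translation from a path to an edit script is mechanical. Here is a small sketch under the grid convention above, where x counts lines consumed from the old file and y counts lines consumed from the new one. The function name moves_to_script and the move labels are made up for this example.

```python
def moves_to_script(old, new, moves):
    """Translate a path through the edit graph into Keep/Delete/Insert steps."""
    x = y = 0                     # x indexes old, y indexes new
    script = []
    for move in moves:
        if move == "diagonal":
            if old[x] != new[y]:
                raise ValueError("diagonal move on lines that do not match")
            script.append(f"Keep {old[x]}")
            x += 1
            y += 1
        elif move == "right":
            script.append(f"Delete {old[x]}")
            x += 1
        elif move == "down":
            script.append(f"Insert {new[y]}")
            y += 1
        else:
            raise ValueError(f"unknown move: {move}")
    if (x, y) != (len(old), len(new)):
        raise ValueError("path does not end at the bottom-right corner")
    return script

path = ["diagonal", "right", "diagonal", "down", "diagonal"]
print(moves_to_script(["A", "B", "C", "D"], ["A", "C", "E", "D"], path))
```

Only the right and down moves cost anything, so counting them gives the size of the edit script for that path.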
Now let's walk through that path step by step.
Step 1: Match A
Both files start with A, so Myers can move diagonally.
Old: A B C D
New: A C E D
Match: A
No edit is needed.
Step 2: Delete B
After A, the old file has B, but the new file has C.
They do not match, so one shortest path deletes B.
-B
Step 3: Match C
Now both sides line up at C, so Myers moves diagonally again.
Match: C
Step 4: Insert E
After C, the new file has E, but the next line in the old file is D.
They do not match, so Myers inserts E and moves down.
+E
Step 5: Match D
Finally, both files match again at D.
The final shortest edit script is:
 A
-B
 C
+E
 D
In this example, the shortest edit script has only two edits:
Delete B
Insert E
So here, D = 2.
That is the key idea behind Myers.
It is not randomly comparing lines.
It is searching the edit graph for the shortest path that converts one sequence into another.
The path with the fewest right and down moves gives the shortest edit script.
Diagonal moves are free because they represent lines that already match.
The algorithm is commonly described as:
Time complexity: O(ND)
Where:
- N is the total number of lines across both files
- D is the size of the shortest edit script
In simple words, D means how many insertions and deletions are needed to transform the old file into the new file.
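For space complexity, it depends on the implementation:
Computing only D: O(N), for the array of furthest-reaching points
Basic Myers that also recovers the edit script: O(D²), because it keeps a snapshot of that array for every round
Linear-space Myers refinement (divide and conquer): O(N)
The O(ND) running time is why Myers performs very well when two files are mostly similar, which is the common case in real codebases.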
Instead of comparing every possible pair of lines like textbook LCS DP, Myers focuses on finding a short path of edits between the two versions.
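To show where O(ND) comes from, here is a compact sketch of the greedy core of Myers' algorithm. It computes only D, the number of edits, and does not recover the script itself. This is a teaching version written for this article, not Git's actual code, and it uses a dict where real implementations use an offset array.

```python
def myers_edit_distance(old, new):
    """Size of the shortest insert/delete script, via Myers' greedy search."""
    n, m = len(old), len(new)
    # v[k] = furthest x reached so far on diagonal k, where k = x - y.
    v = {1: 0}
    for d in range(n + m + 1):            # d = number of edits spent
        for k in range(-d, d + 1, 2):     # diagonals reachable with d edits
            if k == -d or (k != d and v[k - 1] < v[k + 1]):
                x = v[k + 1]              # move down: insert a line from new
            else:
                x = v[k - 1] + 1          # move right: delete a line from old
            y = x - k
            # Follow free diagonal moves while the lines match.
            while x < n and y < m and old[x] == new[y]:
                x += 1
                y += 1
            v[k] = x
            if x >= n and y >= m:
                return d                  # reached the bottom-right corner
    return n + m                          # not reached: d = n + m always suffices

print(myers_edit_distance(["A", "B", "C", "D"], ["A", "C", "E", "D"]))
```

The outer loop runs D + 1 times and each round touches at most d + 1 diagonals, so for similar files the search stops after a handful of rounds no matter how long the files are.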
Diff as an Edit Script
Let's walk through a concrete edit script:
Old file → [A, B, C, D]
New file → [A, C, E, D]
Step 1: Delete B
A, C, D
Step 2: Insert E after C
A, C, E, D
Edit script:
Delete B, Insert E
That is just two operations.
Git-style diff output:
 A
-B
 C
+E
 D
Clean, minimal, and easy to understand.
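An edit script is only useful if replaying it on the old file produces the new file. The sketch below does that replay for diff lines that carry a one-character marker followed by the line text. The helper apply_diff is hypothetical and far simpler than a real patch tool, which also handles hunks and context offsets.

```python
def apply_diff(old, diff_lines):
    """Replay ' ', '-' and '+' lines against old and return the new file."""
    result = []
    pos = 0                                   # next unread line in old
    for line in diff_lines:
        marker, text = line[0], line[1:]
        if marker == "+":
            result.append(text)               # added: comes from the diff
        elif marker in (" ", "-"):
            if pos >= len(old) or old[pos] != text:
                raise ValueError(f"diff does not apply at old line {pos + 1}")
            if marker == " ":
                result.append(text)           # unchanged: copy through
            pos += 1                          # deleted lines are skipped
        else:
            raise ValueError(f"unknown marker: {marker!r}")
    if pos != len(old):
        raise ValueError("diff ended before the old file did")
    return result

old = ["A", "B", "C", "D"]
print(apply_diff(old, [" A", "-B", " C", "+E", " D"]))
```

The context and deletion lines double as a safety check: if the old file does not contain them in order, the diff was made for a different version.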
Why This Matters in Real Development
When we review code, we're not just looking at text changes — we're trying to understand intent.
A good diff makes that easy:
 function calculateTotal(items) {
-  return items.length;
+  return items.reduce((sum, item) => sum + item.price, 0);
 }
Any reviewer immediately understands: the old code counted items, and the new code sums their prices.
A bad diff creates noise and confusion.
That's why diff algorithms matter. They are not only about correctness. They are also about readability.
The Tradeoff: Shortest Diff vs Most Readable Diff
The smallest diff is not always the most readable one — especially in code with repeated patterns:
if (user) {
  return true;
}
if (admin) {
  return true;
}
if (owner) {
  return true;
}
When many lines look similar, a diff algorithm can match the wrong lines.
The result may be technically correct, but hard to read.
That's why Git ships multiple diff algorithms.
Git's Four Diff Algorithms
git diff --diff-algorithm=myers # default
git diff --diff-algorithm=minimal
git diff --diff-algorithm=patience
git diff --diff-algorithm=histogram
Here's what each one does and when to use it.
Myers
Fast and generally good.
This is what runs when you just type:
git diff
Best for everyday use.
Minimal
Tries harder to find the smallest possible diff.
It can be slower, but useful when patch size matters.
Patience
Prioritizes human readability.
It matches unique lines first, which helps avoid false alignments on repeated code.
Best for reviewing refactors or moved code blocks.
Histogram
An evolution of Patience that also handles low-frequency lines well.
It often produces readable output for real codebases. Some developers prefer setting it as their global default because it can make source code diffs easier to review.
To set Histogram as your global default:
git config --global diff.algorithm histogram
Myers vs Patience Diff
Myers is very good at finding a short edit script.
But sometimes the shortest diff is not the most readable diff.
This usually happens when a file has repeated or similar-looking lines. In that case, the algorithm may choose matches that are technically valid but not ideal for human review.
Consider this example.
Old version:
def validate_user(user):
    if not user.email:
        return False
    return True

def save_user(user):
    database.save(user)

def validate_admin(admin):
    if not admin.email:
        return False
    return True
New version:
def validate_admin(admin):
    if not admin.email:
        return False
    return True

def validate_user(user):
    if not user.email:
        return False
    return True

def save_user(user):
    database.save(user)
Here, validate_admin moved from the bottom to the top.
Because the functions contain repeated lines like:
        return False
    return True
a shortest-diff algorithm can sometimes align the repeated lines in a confusing way.
A Myers-style diff may produce a technically correct result like this:
+def validate_admin(admin):
+    if not admin.email:
+        return False
+    return True
+
 def validate_user(user):
     if not user.email:
         return False
     return True
 
 def save_user(user):
     database.save(user)
-
-def validate_admin(admin):
-    if not admin.email:
-        return False
-    return True
This output is correct: it shows that validate_admin was added at the top and removed from the bottom.
But for a reviewer, the important idea is simpler:
One function moved position.
Patience diff tries to make this kind of refactor easier to read by first looking for unique lines as anchors, such as:
def validate_user(user):
def save_user(user):
def validate_admin(admin):
These unique lines help the algorithm avoid matching only repeated lines like return False and return True.
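The anchor step can be sketched in a few lines. The code below finds lines that occur exactly once in both files, then keeps the longest run of them that appears in the same order on both sides, using the patience-sorting technique that gives the algorithm its name. This is only the first stage. A full Patience diff goes on to diff the gaps between anchors, which is omitted here, and unique_anchors is a name chosen for this example.

```python
from bisect import bisect_left
from collections import Counter

def unique_anchors(old, new):
    """Return (old_index, new_index) pairs usable as Patience anchors."""
    old_counts, new_counts = Counter(old), Counter(new)
    new_pos = {line: j for j, line in enumerate(new) if new_counts[line] == 1}
    # Lines unique in both files, listed in old-file order.
    pairs = [(i, new_pos[line]) for i, line in enumerate(old)
             if old_counts[line] == 1 and line in new_pos]

    # Longest increasing subsequence over new_index (patience sorting).
    tail_vals = []                 # smallest new_index ending a run of each length
    tail_ids = []                  # index into pairs for each entry of tail_vals
    prev = [None] * len(pairs)     # back-pointers for reconstruction
    for idx, (_, j) in enumerate(pairs):
        k = bisect_left(tail_vals, j)
        if k == len(tail_vals):
            tail_vals.append(j)
            tail_ids.append(idx)
        else:
            tail_vals[k] = j
            tail_ids[k] = idx
        prev[idx] = tail_ids[k - 1] if k > 0 else None

    anchors = []
    idx = tail_ids[-1] if tail_ids else None
    while idx is not None:
        anchors.append(pairs[idx])
        idx = prev[idx]
    return anchors[::-1]

old = ["def validate_user(user):", "    if not user.email:", "        return False",
       "    return True", "def save_user(user):", "    database.save(user)",
       "def validate_admin(admin):", "    if not admin.email:",
       "        return False", "    return True"]
new = old[6:] + old[:6]            # validate_admin moved to the top

anchors = unique_anchors(old, new)
for i, j in anchors:
    print(f"old[{i}] <-> new[{j}]: {old[i].strip()}")
```

On this input the anchors come from validate_user and save_user. The validate_admin lines are unique too, but they cross the others in order, so they are left out and end up as a deletion plus an addition.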
For this small example, Patience may still produce an output that looks very similar to Myers:
+def validate_admin(admin):
+    if not admin.email:
+        return False
+    return True
+
 def validate_user(user):
     if not user.email:
         return False
     return True
 
 def save_user(user):
     database.save(user)
-
-def validate_admin(admin):
-    if not admin.email:
-        return False
-    return True
So the important point is not that Patience magically shows a special "move" operation.
Git diffs are still usually represented as additions and deletions.
The real benefit of Patience appears more clearly in larger refactors, especially when a file has many repeated lines such as:
return False
return True
else:
break
continue
}
In those cases, Myers may match repeated lines too aggressively, while Patience prefers stronger unique anchors. That often makes the final diff easier for humans to review.
The exact output can vary depending on file context and Git version, but the idea is the same:
- Myers focuses on finding a short edit script.
- Patience focuses more on stable, unique anchors.
- The shortest diff is not always the clearest diff.
Algorithm Complexity Summary
| Algorithm | Rough Idea | Best For |
|---|---|---|
| Textbook LCS DP | O(m × n) time and space | Learning the concept |
| Myers diff | O(ND) in the common description | Default everyday diffs |
| Minimal | Spends extra work to reduce diff size | Smaller patches |
| Patience | Uses unique lines as anchors | Refactors / moved blocks |
| Histogram | Extends Patience using low-frequency lines | Often readable code diffs |
Where:
- m = number of lines in the old file
- n = number of lines in the new file
- N = total number of lines across both files
- D = size of the shortest edit script
Where the DSA Is Hiding
In competitive programming, LCS is a textbook DP problem.
In the real world, the same idea appears in:
Git diff
GitHub pull request review
VS Code file comparison
Merge conflict resolution
Google Docs version history
Code review platforms
Patch generation
The input changes — lines of code, words in a document, DOM nodes in a UI, events in a timeline — but the core question is always the same:
What stayed the same, and what changed?
A Real-World Developer Example
Old code:
function createUser(name, email) {
  const user = { name, email };
  saveUser(user);
  return user;
}
New code:
function createUser(name, email, role) {
  const user = { name, email, role };
  validateUser(user);
  saveUser(user);
  return user;
}
A well-tuned diff shows:
-function createUser(name, email) {
+function createUser(name, email, role) {
-  const user = { name, email };
+  const user = { name, email, role };
+  validateUser(user);
   saveUser(user);
   return user;
 }
Any reviewer immediately understands:
- a role parameter was added
- the role is stored on the user object
- validation was introduced before saving
That's the value of a good diff algorithm.
It is not just computing differences. It is helping humans understand change.
Why Git Usually Works at the Line Level
By default, Git usually presents diffs at the line level because source code is naturally organized line by line.
For example, if we change this line:
-const total = price * quantity;
+const total = price * quantity * tax;
A character-level diff could say that only " * tax" was inserted before the semicolon.
That is more precise, but precision is not always the same as readability.
In real code reviews, developers usually care about which lines changed and how those changes affect the surrounding code.
Character-level diffs can become noisy very quickly, especially when formatting, indentation, or multiple small edits happen in the same line.
That is why line-level diffing is a good default for most developer workflows.
But Git still gives you more detailed options when you need them:
git diff --word-diff
This shows word-level changes inside modified lines.
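The same sequence-comparison idea works on words instead of lines. Here is a rough sketch that splits each line on whitespace and marks removed and added words in a [-...-] and {+...+} style. It relies on Python's difflib.SequenceMatcher, which uses its own matching heuristic rather than Git's algorithm, so treat the output as an approximation of what git diff --word-diff prints. The function name word_diff is invented for this example.

```python
import difflib

def word_diff(old_line, new_line):
    """Mark word-level changes between two lines."""
    old_words, new_words = old_line.split(), new_line.split()
    matcher = difflib.SequenceMatcher(None, old_words, new_words, autojunk=False)
    pieces = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            pieces.extend(old_words[i1:i2])
            continue
        if i2 > i1:                               # words removed from old
            pieces.append("[-" + " ".join(old_words[i1:i2]) + "-]")
        if j2 > j1:                               # words added in new
            pieces.append("{+" + " ".join(new_words[j1:j2]) + "+}")
    return " ".join(pieces)

print(word_diff("const total = price * quantity;",
                "const total = price * quantity * tax;"))
# const total = price * [-quantity;-] {+quantity * tax;+}
```

Splitting on whitespace glues the semicolon to its neighbor, so "quantity;" and "quantity" count as different words. Choosing the unit of comparison is a design decision in its own right.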
The best algorithm is not always the most precise one. It is the one that gives the most useful output for the context.
LCS vs Myers: The Mental Model
LCS: Find the longest part that stayed the same.
Myers: Find the shortest set of changes to get from old to new.
LCS gives you the intuition.
Myers gives Git an efficient practical algorithm.
When only insertions and deletions are allowed, these two views are mathematically connected:
edit distance = old length + new length - 2 × LCS length
So:
- If the LCS is long → fewer edits are needed
- If the LCS is short → more edits are needed
They measure the same underlying change from different directions.
Why This Is a Great Example of DSA in Real Life
Many beginners ask:
Where do we actually use DSA in real projects?
git diff is one of the best answers — because every developer runs it daily without thinking about it.
When you run git diff, you're using an algorithm.
When you review a pull request on GitHub, you're using an algorithm.
When you resolve merge conflicts, you're relying on algorithms that compare versions of files.
The algorithm is invisible behind a clean developer experience.
That's what good engineering looks like: the user sees red and green lines, and behind it is a carefully designed algorithmic solution built on decades of computer science research.
That's the beauty of DSA.
Not just for interviews.
Inside the tools you use every day.
Practical Commands to Try
# Try different algorithms on any repo
git diff --diff-algorithm=myers
git diff --diff-algorithm=patience
git diff --diff-algorithm=histogram
git diff --diff-algorithm=minimal
# Word-level diff, great for prose or config files
git diff --word-diff
# Set histogram as your permanent default
git config --global diff.algorithm histogram
Final Thoughts
When you first learn LCS, it may look like just another dynamic programming problem.
But the core idea is powerful:
Find what stayed the same so we can understand what changed.
That simple idea appears everywhere.
Git uses related sequence-comparison ideas to show file changes. Code review tools use similar techniques to help developers understand pull requests. Merge tools use them to combine work from different branches. Document editors use them to show version history.
So the next time you run:
git diff
remember that you are not just seeing red and green lines.
You are seeing dynamic programming intuition, graph search, and decades of algorithmic research — all compressed into one everyday developer command.