Currently, the differ only compares full lines, as
separated by line breaks. The problem is that in most
languages line breaks are just whitespace and should
be ignored rather than used as the unit of comparison.
So what I want is for the differ to divide the file
into tokens instead of lines.
Each token belongs to one of 3 groups:
-Text tokens: Any consecutive string of digits and
letters (both uppercase and lowercase). Two text
tokens are different if their strings are different.
-Whitespace tokens: Any consecutive string of spaces,
tabs and line breaks. If comment ignoring is enabled,
comments also count as whitespace and can be part of
the whitespace string (they do not split it into
several tokens). If whitespace comparing is enabled,
two whitespace tokens are different if their strings
are different; otherwise two whitespace tokens are
always identical.
-Special characters: Any single character (not a
string) that does not belong to either of the above
token types. Two special characters are identical only
if the character is identical.
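To make the comparison rules above concrete, here is a minimal sketch in Python. The names Token, tokens_equal and compare_whitespace are made up for illustration and are not part of the existing differ. It also assumes that tokens of different types never match, which the rules above imply but do not state.

```python
from dataclasses import dataclass

# Hypothetical labels for the three token groups described above.
TEXT, WHITESPACE, SPECIAL = "text", "whitespace", "special"


@dataclass(frozen=True)
class Token:
    kind: str  # one of TEXT, WHITESPACE, SPECIAL
    text: str  # the characters that make up the token


def tokens_equal(a: Token, b: Token, compare_whitespace: bool = False) -> bool:
    """Apply the per-type equality rules from the list above."""
    if a.kind != b.kind:
        # Assumption: tokens of different types are never identical.
        return False
    if a.kind == WHITESPACE and not compare_whitespace:
        # With whitespace comparing disabled, any two whitespace
        # tokens are identical regardless of their contents.
        return True
    # Text tokens, special characters, and whitespace tokens with
    # comparing enabled are identical only if their strings match.
    return a.text == b.text
```

With whitespace comparing disabled, Token(WHITESPACE, " ") and Token(WHITESPACE, "\n\t") compare as identical, so moving a brace to the next line would no longer show up as a change.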
Since the different token types don't share any
characters, it's easy to break them apart. Let me
explain this with pseudo code. If the current system is:
If (IsLineBreak(current_char)) {
    AddCharToLine(current_char);
    NewLine();
}
Else
    AddCharToLine(current_char);
Then with token parsing it would be:
If (IsText(current_char)) {
    If (last_token != text) {
        NewLine();
        last_token = text;
    }
    AddCharToLine(current_char);
}
Else If (IsWhiteSpace(current_char)) {
    If (last_token != whitespace) {
        NewLine();
        last_token = whitespace;
    }
    AddCharToLine(current_char);
}
Else {
    NewLine();
    last_token = none;
    AddCharToLine(current_char);
}
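For anyone who wants to try the idea, the token-parsing pseudo code above could be sketched in Python roughly like this. The function names classify and tokenize are hypothetical. The sketch assumes ASCII letters and digits for text tokens (so an underscore counts as a special character, per the definition above) and does not implement comment ignoring.

```python
def classify(ch: str) -> str:
    """Return the token group a single character belongs to."""
    if ch.isascii() and ch.isalnum():
        return "text"
    if ch in " \t\r\n":
        return "whitespace"
    return "special"


def tokenize(source: str) -> list[str]:
    """Split source into text, whitespace and special-character tokens."""
    tokens: list[str] = []
    last_kind = None
    for ch in source:
        kind = classify(ch)
        if kind != "special" and kind == last_kind:
            # Same text/whitespace run continues: extend the current token.
            tokens[-1] += ch
        else:
            # A new run starts, or this is a special character,
            # which always forms a token of its own.
            tokens.append(ch)
        last_kind = kind
    return tokens


print(tokenize("int x=1;\n"))
# prints ['int', ' ', 'x', '=', '1', ';', '\n']
```

Joining the tokens back together always reproduces the original input, so nothing is lost by diffing on tokens instead of lines.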
The only issue is displaying the result, since it's no
longer neatly divided into lines, but post-diff
line-break matching shouldn't be that hard. Just start
from the last identical token and count the line breaks
to get each matching line, or something like that.
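One possible way to do that line-break counting, as a rough sketch only: walk the token list once and record the line each token starts on. The helper name token_line_numbers is made up, and it assumes tokens are plain strings with "\n" as the line break.

```python
def token_line_numbers(tokens: list[str]) -> list[int]:
    """Return the 1-based line number on which each token starts."""
    lines = []
    line = 1
    for tok in tokens:
        lines.append(line)
        # Only whitespace tokens can contain line breaks, but counting
        # on every token keeps the loop simple.
        line += tok.count("\n")
    return lines


print(token_line_numbers(["a", "\n", "b", "\n\n", "c"]))
# prints [1, 1, 2, 2, 4]
```

Running this on both files would give each matched token a line number on either side, which the display code could use to line up the output.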
This would really help with diffing large C/C++
programs where the line-break convention isn't consistent.