New versions will be in C# because I want to learn that language and a real application is the best way to learn.
The original tokeniser splits on whitespace; this means that strings such as:
using System.Windows.Forms;
appear as two tokens instead of the expected four (or five if you count the semicolon, seven if you count the to dots). This can obscure some similarities but is so much faster and so much simpler to implement than a general parsing algorithm that it is probably the best compromise. The real problem comes in analysing files created in general purpose text editors. Such files do not always include spaces between tokens such as operators. For instance the MS VB IDE editor adds spaces around infix operators, suppresses them before opening brackets, adds them before end of line comments and so on; this has the effect of ensuring that operators are seen as tokens, however, such spacing is not required by the compiler.
For instance the following line is perfectly legal VB:
A=B+Cbut if you enter it using the VB IDE it will become
A = B + C
The first will be seen as one token and the second as five tokens. This is a serious limitation if we think in terms of detecting plagiarism because then we must assume that the programmer of at least one of the files is hostile to the process, on the other hand in professional contexts we are generally looking for sloppy programming that involves verbatim copying without any intent to conceal, only an intent to get the job done as quickly as possible. This means that copied source will tend to have the same spacing as the original even if the programmer has altered it a little to fit its new home.
The problems associated with the original tokenisation method areeasily avoided with the .NET framework regular expression class because it has a Split method that allows a regular expression to be used to identify the delimiters.
Speed is better after the change to the way the tokens were stored. Now it is time to add more features. The first question is which ones. Possible features include:
Dotplots are two dimensional charts. To produce such a chart two text files are tokenised. The values along the axes are the tokens; imagine the whole text of the file written along the axis.
For example:
The quick brown fox
The X
dirty
brown X
dog
Dots are placed at each intersection point where the tokens are the same. The same technique can look for repeated sequences within a file.
Example with case ignored:
The quick brown fox and the dirty brown dog
The X X
quick X
brown X X
fox X
and X
the X X
dirty X
brown X X
dog X
Notice that the diagonal line is completely filled and that there are diagonals in the upper right and lower left quadrants.
The purpose of such a plot is to make detection of duplicated code easier. Many academic citations of this method, other than those in bilogical sciences, tend to couple it with plagiarism detection. While it can of course be used to detect stolen code in software development course and in detecting stolen passages from novels and scientific reports my feeling is that the best use for the technique is to aid in software maintenance by making it easy to detect code that has simply been copied and edited in order to get a program up and running quickly. There is nothing wrong with cut and paste development so long as the developer stays awake and refactors the code when the duplication reaches a certain level (one copy is alright, two should make you make a mental note, three means it's time to refactor)