Look at the position SCO is in.
Linux luminary Eric S. Raymond is taking the fight with The SCO Group right back to the basics. He has developed a utility known as a comparator that looks for common code segments in large source trees; on a 1.8 GHz Athlon box, it has an effective comparison rate of over 55,000 lines per second.
…His comparator, the code for which can be downloaded here, uses a variant of an algorithm called "shred," which bears a resemblance to some techniques used for DNA sequencing.
The source trees get sliced into overlapping three-line shreds. The shreds then get turned into a list of 128-bit signatures (32 hexadecimal digits each) by a process called MD5 hashing; each signature keeps information about its file and line number range.
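As a rough illustration of the shredding step, here is a minimal Python sketch. It is not Raymond's code: the names `Signature` and `shred` are hypothetical, and the sketch assumes the file is already split into lines and does no normalization of whitespace or comments. An MD5 digest is 16 bytes, or 32 characters when written in hexadecimal.

```python
import hashlib
from typing import NamedTuple


class Signature(NamedTuple):
    """Hypothetical record: one hashed shred plus where it came from."""
    digest: bytes     # 16-byte MD5 digest of the shred text
    path: str         # file the shred was taken from
    first_line: int   # 1-based, inclusive
    last_line: int    # 1-based, inclusive


def shred(path: str, lines: list[str], size: int = 3) -> list[Signature]:
    """Slice lines into overlapping windows of `size` lines and hash each one."""
    signatures = []
    # A file shorter than `size` yields an empty range, so no shreds.
    for start in range(len(lines) - size + 1):
        window = "\n".join(lines[start:start + size])
        digest = hashlib.md5(window.encode("utf-8")).digest()
        signatures.append(Signature(digest, path, start + 1, start + size))
    return signatures
```

With the default size of three, a five-line file produces three shreds covering lines 1-3, 2-4, and 3-5, so a copied run of code is caught regardless of where it starts in either file.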
"If the MD5 signatures are different, then the shreds that they were made from are different. When they match, it is almost certain that the two shreds they were made from are the same, to within odds of eighteen quadrillion to one. MD5 is normally used for making unforgeable digital signatures, but the side effect I'm exploiting is that it gives you a fast way to compare texts for equality," Raymond told eWEEK on Monday.
So, once all the signatures from all the code trees have been fed into the comparator, all the "unique" signatures are thrown out, leaving a list of shreds with duplicate signatures, that is, the common code segments. From there it is just report generation, he said.
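The filtering step described above can be sketched in a few lines of Python. This is an assumption-laden illustration, not the comparator's actual implementation: `common_segments` and `report` are hypothetical names, each signature is taken to be a plain `(digest, path, first_line, last_line)` tuple, and a digest counts as a duplicate if it appears in more than one place, even within the same tree.

```python
from collections import defaultdict


def common_segments(signatures):
    """Group signature records by digest and drop digests seen only once."""
    by_digest = defaultdict(list)
    for digest, path, first_line, last_line in signatures:
        by_digest[digest].append((path, first_line, last_line))
    # Unique signatures are thrown out; what remains is shared code.
    return {d: locs for d, locs in by_digest.items() if len(locs) > 1}


def report(matches):
    """Render each group of matching locations as one line of text."""
    lines = []
    for locations in matches.values():
        lines.append(" == ".join(f"{p}:{a}-{b}" for p, a, b in sorted(locations)))
    return sorted(lines)
```

Because only digests are compared, the grouping is a single pass over the signature list with a dictionary lookup per shred, which is consistent with the high line rates quoted in the article. A fuller tool would presumably also merge adjacent matching shreds into longer ranges before reporting; this sketch does not.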