How to ignore whitespace and comments in the CPD code duplication tool


We are using the CPD tool to detect code duplication. CPD counts whitespace and comments as part of a duplicate. How can we exclude whitespace and comments so that only genuine duplicates are reported? For example, if we have 4 lines of duplicated code and 4 lines of comments, it reports 8 lines instead of 4.


There is 1 answer

Ira Baxter On

Which specific (copy-paste detector) CPD tool? There are many.

How a CPD detects duplicates depends on the primitive entities it compares. (I've built clone detectors.)

Some only operate on source lines; these pretty much cannot distinguish whitespace and comments from the programming language that you think you gave the tool. To such a tool, your code is just raw text. Nor can these tools discover that "code block A is a duplicate of code block B with regular changes (e.g., parameters)", which is what you really want to know. (I think this kind of CPD gives terrible answers, thus your question, but they have the advantage that they work on everything.)
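To make the line-based case concrete, here is a minimal sketch in Python. It is not how any particular CPD tool is implemented; the function names and the sample "files" are made up for illustration. It finds the longest run of identical consecutive lines in two texts, and shows why 4 duplicated code lines preceded by 4 duplicated comment lines get reported as 8.

```python
def longest_duplicate_run(lines_a, lines_b):
    """Length of the longest run of identical consecutive lines (raw text compare)."""
    best = 0
    prev = [0] * (len(lines_b) + 1)
    for i in range(1, len(lines_a) + 1):
        cur = [0] * (len(lines_b) + 1)
        for j in range(1, len(lines_b) + 1):
            if lines_a[i - 1] == lines_b[j - 1]:
                # Extend the run that ended at the previous pair of lines.
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def drop_noise_lines(lines):
    """Naive filter (an assumption for this sketch): drop blank and '#'-only lines.

    It cannot handle trailing comments, block comments, or '#' inside strings,
    which is exactly the limitation of treating code as raw text.
    """
    return [ln for ln in lines if ln.strip() and not ln.strip().startswith("#")]


# Hypothetical sample: 4 comment lines followed by 4 code lines, copied into two files.
shared = [
    "# Compute the order total.",
    "# Applies tax after the discount.",
    "# Rounds to two decimals.",
    "# Returns a float.",
    "subtotal = price * quantity",
    "discounted = subtotal - discount",
    "taxed = discounted * (1 + tax_rate)",
    "total = round(taxed, 2)",
]
file_a = ["import os"] + shared + ["print('a')"]
file_b = ["import sys"] + shared + ["print('b')"]

raw_length = longest_duplicate_run(file_a, file_b)
filtered_length = longest_duplicate_run(drop_noise_lines(file_a), drop_noise_lines(file_b))
```

On this sample the raw comparison counts the comment lines as part of the duplicate, while the filtered comparison counts only the code lines. The filter is a workaround bolted on from outside; a line-based detector has no notion of what a comment is.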

Some operate on language tokens, for the language(s) they happen to know. These tools tend to be pretty good about ignoring whitespace. Since they know comments are certain kinds of tokens, they typically can ignore comments, too, with some kind of command line switch. (Thus, "Which CPD tool?"). But they don't understand language structure, and thus think that the sequence

  }   {

is a clone of every other such sequence. Frankly, that's a stupid clone. Also, such token-based detectors can only detect parameters (places where the clones vary systematically) that are one token wide, typically the replacement of an identifier or a constant by another identifier or constant. Still, this is a big step up in usability from the line-oriented CPD tools.
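A rough sketch of the token-based idea, using Python's standard `tokenize` module on Python source. This is only an illustration, not the algorithm of any specific CPD tool; `code_tokens` and its `abstract` flag are hypothetical names. Comments and blank-line tokens are dropped, so layout and commentary no longer affect the comparison, and the optional abstraction replaces each identifier or number with a placeholder, which is the "one token wide" parameter described above.

```python
import io
import keyword
import tokenize

# Token kinds that carry no code: comments, blank-line newlines, end marker.
_SKIP = {tokenize.COMMENT, tokenize.NL, tokenize.ENDMARKER}
# Layout tokens are kept (indentation matters in Python) but their text is ignored.
_LAYOUT = {tokenize.NEWLINE, tokenize.INDENT, tokenize.DEDENT}


def code_tokens(source, abstract=False):
    """Return the token stream of `source` with comments and spacing removed.

    With abstract=True, identifiers and numbers become placeholders so that
    two fragments differing only in such single tokens compare equal.
    """
    out = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in _SKIP:
            continue
        if tok.type in _LAYOUT:
            out.append((tokenize.tok_name[tok.type], ""))
        elif abstract and tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            out.append(("NAME", "ID"))
        elif abstract and tok.type == tokenize.NUMBER:
            out.append(("NUMBER", "LIT"))
        else:
            out.append((tokenize.tok_name[tok.type], tok.string))
    return out


# Same code, different comments and spacing.
with_comments = "def area(w, h):\n    # width times height\n    return w * h  # product\n"
without_comments = "def area(w,h):\n\n        return w*h\n"
same_code = code_tokens(with_comments) == code_tokens(without_comments)

# Same shape, different identifiers and constants (one-token parameters).
first = "total = price * 2\n"
second = "amount = cost * 3\n"
exact_match = code_tokens(first) == code_tokens(second)
parameterized_match = code_tokens(first, abstract=True) == code_tokens(second, abstract=True)
```

Note that nothing here knows about nesting or statement boundaries, so a real detector built this way would still happily match a run of tokens that starts in the middle of one block and ends in the middle of another.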

Some very few operate on language structure, e.g., they use the grammar of the language to control matching (I happen to make one of these, CloneDR, see my bio). These can't make the mistake of the token-based CPD tools, so you get better detected clones. Further, they can detect parameters consisting of (structured) sequences of tokens, e.g., when an expression has replaced an identifier. IMHO (oops, opinion!) these give much better detected clones (which is why I build CloneDR).
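For a feel of what "operating on language structure" means, here is a small sketch using Python's standard `ast` module. To be clear, this is not how CloneDR works; it is a simplified illustration that only finds exact structural duplicates, and `structural_clones` and `min_nodes` are hypothetical names. Every statement subtree is serialized with `ast.dump`, and subtrees with identical dumps are grouped. Comments and spacing never reach the syntax tree, and a match is always a complete syntactic unit, so a fragment like a stray pair of braces cannot be reported.

```python
import ast
from collections import defaultdict


def structural_clones(source, min_nodes=5):
    """Group statements whose syntax trees are identical.

    Returns a sorted list of line-number lists, one per clone group.
    `min_nodes` skips trivially small statements.
    """
    tree = ast.parse(source)
    groups = defaultdict(list)
    for node in ast.walk(tree):
        if isinstance(node, ast.stmt):
            size = sum(1 for _ in ast.walk(node))
            if size >= min_nodes:
                # ast.dump omits line/column info by default, so position is ignored.
                groups[ast.dump(node)].append(node.lineno)
    return sorted(sorted(lines) for lines in groups.values() if len(lines) > 1)


sample = (
    "def f(a):\n"
    "    # first copy\n"
    "    if a > 0:\n"
    "        print(a + 1)\n"
    "\n"
    "def g(a):\n"
    "    if a > 0:   # second copy\n"
    "        print(a  +  1)\n"
)
clones = structural_clones(sample)
```

Here the `if` statements on lines 3 and 7 land in one group despite the different comments and spacing, while the two function definitions do not, because their names differ. Detecting that `f` and `g` are clones with the name as a parameter, or that an expression has replaced an identifier, takes considerably more machinery than this sketch has.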