Duplicate Code - The Pragmatic Approach

Remark: This key concept article is about the CodeImprover Duplication feature, which is available in a GitHub app.
Get started on this link

What is duplicate code?

Wikipedia defines duplicate code as follows:

Duplicate code is a computer programming term for a sequence of source code that occurs more than once, either within a program or across different programs owned or maintained by the same entity. Duplicate code is generally considered undesirable for several reasons.

To narrow this down a little: the “sequence of source code” is generally measured in lines of source code, and the smallest scope within which duplication is defined is typically a single version-control database.
Since Git governs more than half of the world’s source code (and counting), that version-control database is most often a Git repository.

Identifying code duplication is a challenge.
It is necessary to adopt some axioms.

How to identify duplicate code

There are at least three levels of “identical”:

  • Literally identical: compare every character, no matter what.
    Because nothing is parsed, this approach is language-agnostic and can be used on any programming language.
    It misses a lot of duplication, since even a single extra trailing space makes one line differ from another.
  • Whitespace-cleaned: remove whitespace (tabs and spaces) before comparing.
    This is almost language-agnostic and identifies duplication with greater precision.
  • Comment-stripped and parsed: remove comments and parse the language keywords before comparing.
    This is by nature language-bound. It finds more duplication, at the cost of general usability.

CodeImprover has chosen the middle ground: whitespace is removed, so it works on almost any language that follows the general conventions for whitespace in source code.
YAML is an example of a “language” that does not, since its leading whitespace affects how the source is interpreted.
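
The difference between the literal and the whitespace-cleaned comparison can be illustrated with a minimal Python sketch. It assumes that every space and tab is removed before lines are compared; the function name normalize is made up for this illustration and is not taken from CodeImprover.

```python
import re

def normalize(line: str) -> str:
    """Remove every space and tab so that only the visible characters remain.

    Assumption for this sketch: all spaces and tabs are dropped, not merely
    collapsed. The actual CodeImprover behaviour may differ in detail.
    """
    return re.sub(r"[ \t]+", "", line.rstrip("\r\n"))

a = "    total = price * quantity;"
b = "\ttotal=price*quantity;   "

print(a == b)                        # False: literal comparison
print(normalize(a) == normalize(b))  # True: whitespace-cleaned comparison

# The YAML caveat: these two lines can mean different things in YAML,
# yet they become identical once whitespace is removed.
print(normalize("key: value") == normalize("  key: value"))  # True
```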

How to find duplication

Determine the length of sequence

A key decision is how long an identical sequence must be before it counts as duplication.
This is probably the most debatable decision. Set too low, too much duplication is reported; set too high, a lot of duplication goes undetected.

The right length depends on many local parameters, including programming language, programming style, and quality goals.

The default value in CodeImprover Duplication is 7 lines of code, but it is configurable from as low as 5 lines of code and up to 100 lines of code. 
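
To make the effect of the sequence length concrete, the following Python sketch slides a window of min_lines whitespace-cleaned lines over each file and reports every window that occurs more than once. It only illustrates the idea and does not describe how CodeImprover is implemented; the names find_duplicate_windows and normalize are hypothetical, and the sketch treats neither blank lines nor overlapping matches specially.

```python
import re
from collections import defaultdict

MIN_LINES = 7  # default mentioned in the text; configurable from 5 to 100

def normalize(line):
    """Strip spaces and tabs (assumed normalization for this sketch)."""
    return re.sub(r"[ \t]+", "", line.rstrip("\r\n"))

def find_duplicate_windows(files, min_lines=MIN_LINES):
    """files maps a path to its list of lines.

    Returns a dict that maps each repeated window (a tuple of normalized
    lines) to the list of (path, first_line_number) where it occurs.
    """
    seen = defaultdict(list)
    for path, lines in files.items():
        cleaned = [normalize(line) for line in lines]
        for start in range(len(cleaned) - min_lines + 1):
            window = tuple(cleaned[start:start + min_lines])
            seen[window].append((path, start + 1))  # 1-based line numbers
    return {w: locs for w, locs in seen.items() if len(locs) > 1}

block = [f"step_{i}();" for i in range(7)]
files = {
    "a.c": ["int a;"] + block + ["return 0;"],
    "b.c": ["  " + line for line in block],  # same block, indented
}

for locations in find_duplicate_windows(files).values():
    print(locations)  # [('a.c', 2), ('b.c', 1)]

# With a threshold of 8 the 7-line copy is no longer reported.
print(find_duplicate_windows(files, min_lines=8))  # {}
```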

At the Language level

Many languages include almost ritual-like sequences of code where good code quality encourages some duplication, especially in file prologues.
An example is the C# “using” statements, which in well-structured applications may be numerous and repeated across many source code files.

CodeImprover Duplication is configurable to exclude lines identified by regular expressions.
The CodeImprover home repo contains sample configuration files (codeimprover.yaml) that show how to exclude all lines starting with a “using” statement.
See the Microsoft regular expressions reference for the regular expression syntax.
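
As an illustration of line exclusion, the Python sketch below drops every line that matches an exclusion pattern before any comparison takes place. The pattern and the names EXCLUDE_LINE_PATTERNS and filter_lines are assumptions made for this example; they are not copied from the sample codeimprover.yaml files. Python's re module is used here, and its syntax differs in some details from the Microsoft (.NET) flavor, although this simple pattern is valid in both.

```python
import re

# Hypothetical exclusion pattern: a line that starts with "using",
# optionally preceded by indentation.
EXCLUDE_LINE_PATTERNS = [re.compile(r"^\s*using\s")]

def filter_lines(lines, patterns=EXCLUDE_LINE_PATTERNS):
    """Return (line_number, text) pairs for the lines that no pattern matches."""
    return [
        (number, text)
        for number, text in enumerate(lines, start=1)
        if not any(p.search(text) for p in patterns)
    ]

source = [
    "using System;",
    "using System.Collections.Generic;",
    "",
    "namespace Shop",
    "{",
]

print(filter_lines(source))
# [(3, ''), (4, 'namespace Shop'), (5, '{')]
```

In this sketch the patterns are applied to the raw lines, before whitespace is removed, so that a pattern can still rely on the space after “using”. Whether CodeImprover applies its patterns before or after whitespace removal is not stated here.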

At the repository level 

Some repos include duplicate subtrees that are copies of the important paths.
Comparing these copies to the originals is meaningless.
Likewise, a lot of source code may be generated by tools, and duplication in generated code is not an issue.

CodeImprover Duplication is configurable to exclude specific paths and/or file extensions.
Excluded paths are identified by regular expressions.
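
A path filter of this kind might look like the following Python sketch. The patterns, the extensions, and the names EXCLUDE_PATH_PATTERNS, EXCLUDE_EXTENSIONS, and is_excluded are hypothetical and serve only to show the principle; the actual configuration keys are found in the sample codeimprover.yaml files.

```python
import re

# Hypothetical settings, chosen only for this illustration.
EXCLUDE_PATH_PATTERNS = [
    re.compile(r"^backup/"),     # a copied subtree at the repository root
    re.compile(r"/generated/"),  # any folder holding tool-generated code
]
EXCLUDE_EXTENSIONS = (".min.js", ".designer.cs")

def is_excluded(path: str) -> bool:
    """True if the repository-relative path should be skipped by the analysis."""
    if path.lower().endswith(EXCLUDE_EXTENSIONS):
        return True
    return any(p.search(path) for p in EXCLUDE_PATH_PATTERNS)

paths = [
    "src/app/Order.cs",
    "backup/src/app/Order.cs",
    "src/generated/Api.cs",
    "web/site.min.js",
]

print([p for p in paths if not is_excluded(p)])  # ['src/app/Order.cs']
```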

Configuring CodeImprover Duplication

The public CodeImprover/CodeImprover GitHub repository, also called the CodeImprover home repo, contains sample files for different languages.

Place a copy of the best-suited codeimprover.yaml file in your repository root, and adapt it to your needs.

Every duplication analysis reads its configuration values from this file, provided the file is present in the repository root on the branch being analyzed.
If no codeimprover.yaml configuration file is found, CodeImprover Duplication falls back on the hardcoded default configuration values listed in the CodeImprover Configuration Reference.

Error correction must be done in every relevant duplicate code-block