What is duplicate code?
Duplicate code is a sequence of program source code copied from somewhere else, most often using copy/paste.
When the same program logic is needed more than once, the recommended approach is to isolate a piece of code in a method, procedure, or subprogram and call that method whenever that specific piece of business logic is needed.
Duplicate code is the antithesis of this approach. It is not recommended, and it often indicates that the programmers are inexperienced or overstressed.
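The contrast between the two approaches can be sketched in a few lines of Python. The discount rule, the threshold, and all function names below are hypothetical and serve only as an illustration.

```python
# Hypothetical business rule: orders above 1000 get a 10% discount.

# Duplicated: the same rule is pasted into two places.
def invoice_total_duplicated(amount: float) -> float:
    if amount > 1000:
        amount = amount * 0.9
    return round(amount, 2)


def quote_total_duplicated(amount: float) -> float:
    if amount > 1000:
        amount = amount * 0.9
    return round(amount, 2)


# Recommended: the rule lives in one function and is called where needed.
def apply_volume_discount(amount: float) -> float:
    """Single home for the (hypothetical) discount rule."""
    if amount > 1000:
        return amount * 0.9
    return amount


def invoice_total(amount: float) -> float:
    return round(apply_volume_discount(amount), 2)


def quote_total(amount: float) -> float:
    return round(apply_volume_discount(amount), 2)
```

With the second variant, a change to the discount rule is made in exactly one place and reaches both callers.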
Why is duplicate code an agility killer?
The short version:
Agile software, meaning software that is easy to change, must follow several of the SOLID principles.
The “Single-responsibility” principle is especially vital, and duplicated code is a gross violation of this principle of agility.
The longer version:
Agility is by definition: “The ability to move quickly and easily”.
Hence an agile codebase is quick and easy to change, while a non-agile codebase is hard and slow to change.
What makes the difference between an agile and a non-agile codebase?
Program source code consists of text just like a book and must be easy to read for the human programmer. The text must be structured in logical chapters and paragraphs like a legal document, to ensure the unambiguity of the compiled program.
Without strict order, the combined complexity of programming languages, compilers, and program states almost always leads to some level of unpredictability of the IT-System.
In other words – messy code equals erroneous IT-systems.
To make matters worse, violations of structure and order also jeopardize the programmer’s ability to figure out how to make meaningful changes to the code.
Duplication is poison for structure and order (and agility).
It starts out in all innocence. Ten lines copied here, fifteen lines copied there, just because we are in a hurry. It is convenient in the context, and it is very, very easy to do.
The legal document parable
Imagine a long legal document where a complicated exception applies to §15, §24, §457, and §6675778. If the text of the exception is repeated in each of those paragraphs, that is duplication.
Now imagine that the exception (business logic and text) must for some reason be changed in the context of §24. What are the chances that the editor (the programmer) remembers to consider whether the change also applies to the copies in §15, §457, and §6675778?
The most likely outcome is that the business logic for §24 is modernized and corrected, while the logic executed for §15, §457, and §6675778 is left unchanged, even though the business analyst probably intended the change to apply there as well.
Over time, repeated injections of this duplicate poison add up to a mess: self-contradiction, random behavior, and ambiguity, also known as program errors. Cleaning up this kind of mess is hard and cumbersome. First, you need to find the problems.
Remember – they might be duplicated.
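In code, the parable might look like the following sketch. The late-fee rule, the day limits, and the function names are invented for illustration; the point is only that one copy was updated and the other was forgotten.

```python
# Hypothetical rule that was originally pasted into two "paragraphs".

def late_fee_section_15(days_late: int) -> int:
    # Stale copy: still uses the old 30-day grace period.
    return 0 if days_late <= 30 else 50


def late_fee_section_24(days_late: int) -> int:
    # Updated copy: the grace period was shortened to 14 days.
    return 0 if days_late <= 14 else 50


# The same input now gets two different answers depending on the path taken.
print(late_fee_section_15(20), late_fee_section_24(20))  # prints: 0 50
```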
How to find duplicate code
In the above legal document parable, it might be quite easy to search for the well-defined text of the exception. In the world of programming, this is not so. A piece of copied business logic is most likely not a well-defined paragraph of text.
So finding duplicate code means that every instance of an arbitrary number of source code lines must be searched for – across the entire codebase.
“Searching” and “arbitrary” in the same sentence sounds complicated. Let us break it down in more detail.
How many lines of code does it take to define a duplicate?
The first step is to define how many lines of code must be equal before they are considered duplication. This is very difficult. The ideal number (hereafter called N) depends on many considerations, such as the programming language and the desired programming style, to mention just two.
Set N too low and you will be overwhelmed by duplication detections. Set N too high and you will miss important instances of duplication.
Search for duplication
The next step is to execute a search for any N lines of code in the entire codebase. The theoretical maximum number of searches needed adds up to one search for almost every line of source code in the entire codebase. Sometimes it might be necessary to repeat the series of searches with different values of N.
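A minimal sketch of such a search, written in Python, is shown here. It is not how any particular tool works: instead of running one search per line, it indexes every window of N consecutive lines in a single pass and reports the windows seen more than once. The function name `find_duplicates`, the whitespace normalization, and the sample files are all assumptions made for illustration.

```python
from collections import defaultdict


def find_duplicates(files: dict[str, str], n: int) -> dict[tuple[str, ...], list[tuple[str, int]]]:
    """Return every window of n consecutive non-blank lines that occurs more than once.

    Keys are the windows; values are (file name, 1-based start line) pairs.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    seen = defaultdict(list)
    for name, text in files.items():
        # Normalize: strip surrounding whitespace and skip blank lines,
        # but remember each line's original line number.
        lines = [
            (number, line.strip())
            for number, line in enumerate(text.splitlines(), start=1)
            if line.strip()
        ]
        # Slide a window of n lines over the file and index each window.
        for start in range(len(lines) - n + 1):
            window = tuple(content for _, content in lines[start:start + n])
            seen[window].append((name, lines[start][0]))
    return {window: places for window, places in seen.items() if len(places) > 1}


# Two tiny, made-up files that share three consecutive lines.
files = {
    "a.py": "x = load()\ny = clean(x)\nz = score(y)\nsave(z)\n",
    "b.py": "log('start')\nx = load()\ny = clean(x)\nz = score(y)\nprint(z)\n",
}

for n in (2, 3, 4):
    print(n, len(find_duplicates(files, n)))
# prints:
# 2 2
# 3 1
# 4 0
```

The small run also shows the sensitivity to N: the same three shared lines produce two detections at N = 2, one at N = 3, and none at N = 4.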
Searching for Duplicate Code is obviously not a job suited for manual execution.
The better approach is to avoid duplication in the first place
To do that, either your developers must be very disciplined, or the development environment must implement an early warning system. The ideal warning system must be automatic and configurable, since not all duplications are created equal.
Many programming languages have ritual sequences of code that are not considered duplication, even if the same (say seven) lines of code occur at the beginning of several source code files.
Some source code is generated by tools and is never meant to be read or edited by programmers. Duplication checking probably does not make sense on machine-generated source code.
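What "configurable" could mean in practice is sketched below in Python. This is not the configuration format of any real tool: the `CONFIG` keys, the file patterns, and the helper names are hypothetical. The sketch skips generated files by name and drops ritual lines (here: imports and comments) before any duplication check would run.

```python
import re
from fnmatch import fnmatch

# Hypothetical configuration; the keys and patterns are illustrative only.
CONFIG = {
    "min_lines": 7,  # the N discussed above
    "exclude_files": ["*_pb2.py", "generated/*", "*.min.js"],
    "ignore_line_patterns": [r"^\s*(import|from)\s", r"^\s*#"],
}


def is_excluded(path: str, config: dict = CONFIG) -> bool:
    """True if the path matches one of the configured file patterns."""
    return any(fnmatch(path, pattern) for pattern in config["exclude_files"])


def significant_lines(text: str, config: dict = CONFIG) -> list[str]:
    """Return the stripped, non-blank lines that match no ignore pattern."""
    patterns = [re.compile(p) for p in config["ignore_line_patterns"]]
    return [
        line.strip()
        for line in text.splitlines()
        if line.strip() and not any(p.search(line) for p in patterns)
    ]
```

A checker built on these helpers would call `is_excluded` before reading a file and compare only the output of `significant_lines`.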
The CodeImprover Duplication GitHub app is a configurable automatic duplicate code warning system.
It does only this one thing – it does it well!
- Read about the pragmatic approach to duplication here
- How to configure the CodeImprover (anti)Duplication app here
- Get started using it here
- Or go directly to the GitHub marketplace to find CodeImprover apps