“Duplicate code is a computer programming term for a sequence of source code that occurs more than once, either within a program or across different programs owned or maintained by the same entity. Duplicate code is generally considered undesirable for a number of reasons”.
This quote from that almighty source of truth, Wikipedia, says it all – almost.
We believe that duplicate code is a lot worse than merely “undesirable”.
For one thing, we believe that it poses a major threat to maintainability, and this article argues why.
The short version of undesirability.
Fixing errors and making changes is risky because one never knows how many places should be corrected or changed.
Finding the duplications can be tricky because it is difficult to settle the scope of duplication: how many lines of code should be included in the search for duplets?
And when a duplet is identified, it can be difficult to determine whether the change should be applied to it. To determine this, the programmer has to take some time to grasp at least some of the context of each duplet found.
At the end of the day, duplet searching is often skipped, resulting in accumulated technical debt and increasing risk.
The longer version – the payroll example.
Say you implemented a piece of code that computes the payroll for doctors. This goes well, and you also get the job of computing the payroll for nurses. Even though the tasks are quite similar, there are significant differences.
The fastest way forward is to make a copy of the doctors’ payroll software and adjust the copy to the nurses’ payroll business logic.
After some time, a business rule that affects all employees changes. Implementing this change means adjusting both the doctors’ and the nurses’ payroll software.
You adjust the doctors’ payroll software and make the same adjustment to the source code of the nurses’ payroll. This is almost, but not quite, double work, since the “figuring out” was done the first time.
In a real-world scenario, it is very easy to forget that you also made a source code copy for the janitors’ payroll software, so the janitors are forgotten. The standard testing procedure catches this, and the problem is easily fixed.
Even so, the business case of duplication might still hold for a while.
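To make the copy-and-adjust approach concrete, here is a minimal sketch. The function names, the 1.5x overtime rule, and the night-shift bonus are invented for illustration; the point is only that the copied rule now lives in two places.

```python
# Hypothetical payroll sketch. All names and business rules are
# invented for illustration; they are not from any real system.

def doctor_payroll(base_salary: float, overtime_hours: float) -> float:
    """Original payroll: base salary plus overtime at 1.5x the hourly rate."""
    hourly = base_salary / 160  # assumed 160 working hours per month
    return base_salary + overtime_hours * hourly * 1.5  # overtime rule, copy #1


def nurse_payroll(base_salary: float, overtime_hours: float,
                  night_shifts: int) -> float:
    """Copied from doctor_payroll, then adjusted with a night-shift bonus."""
    hourly = base_salary / 160
    pay = base_salary + overtime_hours * hourly * 1.5  # same rule, copy #2
    return pay + night_shifts * 40.0  # nurse-specific adjustment


# If the overtime factor changes from 1.5 to 2.0, BOTH functions
# must be edited - and nothing reminds you of the second copy.
```

Extracting the shared overtime rule into one function would make the later business rule change a single edit instead of two (or three, counting the janitors).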
However, the problem mounts up in maintenance.
A bug is discovered in the nurses’ payroll software and you fix it. The good question is: did you fix the error in an area of the source code that was copied from the doctors’ payroll, or in source code that only exists in the nurses’ payroll?
The bug was found only in the nurses’ payroll system, so it is a good guess that the error originated in the source code used only for the nurses’ payroll.
In many real-world situations, there is no systematic search for copies of the erroneous code.
Bear with me for a short detour on the nature of bugs.
Most bugs are found and corrected during tests, automated or human.
The surviving bugs that make it into production have, in other words, been able to hide in some combination of use case and state that has not been tested. The number of combinations is endless, and there is no such thing as error-free source code or 100% code (test) coverage.
The combination that triggered the error in the nurses’ system might never show up in the doctors’ payroll – but when it does, the error causes confusion, because according to the error tracking, that error has already been corrected. It is easily forgotten that the correction was made in another copy of the code.
You never know when to expect a visit from the last erroneous duplicate.
So on top of “enjoying the benefit” of finding the same error in different copies of the code at different times, duplication also causes considerable confusion and undermines trust in the IT system in general, and in the maintainers in particular.
Finding duplicate code is more complicated than one might think.
First, you have to decide how many lines of code define a duplication. This number is hereafter called N. Usually, N is a number between 5 and 15, but the right value depends on the codebase.
Second, you have to take every window of N consecutive lines of source code and search for it in the entire codebase. The result is almost as many searches as there are lines of code in the codebase.
Then you realize that N was set too high or too low. You adjust N and start all over.
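The N-line search described above can be sketched in a few lines. This is a naive illustration, not how production duplicate detectors work: it assumes the whole codebase fits in memory and that stripping leading whitespace is normalization enough. A dictionary keyed by window content turns the "one search per line" cost into a single pass.

```python
# Naive sketch of an N-line duplet search. Real tools normalize
# tokens, hash windows, and scale across files; this only shows the idea.
from collections import defaultdict


def find_duplets(lines: list[str], n: int) -> dict[str, list[int]]:
    """Return {window_text: [1-based start lines]} for every N-line
    window that occurs more than once."""
    seen: dict[str, list[int]] = defaultdict(list)
    for i in range(len(lines) - n + 1):
        # Normalize lightly so indentation differences do not hide copies.
        window = "\n".join(line.strip() for line in lines[i:i + n])
        seen[window].append(i + 1)
    return {w: starts for w, starts in seen.items() if len(starts) > 1}
```

For example, `find_duplets(source_lines, 5)` with N = 5 reports every 5-line block that appears twice or more; rerunning with a different N is as simple as changing the argument, which is exactly the "adjust N and start all over" step above.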
Obviously, this is a cumbersome job that is best avoided by never letting duplication come into existence in the first place.
Fighting the creation of duplicates is an ongoing struggle, just like housecleaning. In the busy world of software development, it is easy to copy a little bit here or there, just for testing, with every intention of doing the necessary housecleaning (called refactoring in programmer lingo) before checking in.
It goes without saying that without some sort of guardrail, duplicate code grows like weeds.
The CodeInprover Duplication GitHub app is such a guardrail. It is an automatic and configurable watchdog that either warns against duplication or prevents it, depending on your local configuration settings.