Do good metrics equal good code quality?


These days we have a lot of tools for creating and viewing metrics. Dashboards, bar charts, pie charts and PowerPoint slides are everywhere. Big organizations especially, but really people in general, want to quantify most activities and characteristics. We want to maximize certain values and minimize others, especially when it comes to rating the work of employees. In this article we will focus on metrics in Python.

Let’s dive into the main question. Does having good code metrics make our code good?

First, let’s see what kinds of metrics we can gather.

  1. How many defects do we have in the code base?
  2. Do we comply with the coding style?
  3. How much code is covered by the tests?
  4. How fast is the code?
  5. How much of the code do we really use?
  6. How much of the code base is actual code and how much space does documentation occupy?
  7. How complex is the code? Can a new developer on the team dive into the project easily?
  8. How many tests fail, and how often? Does it change over time?
  9. Is it well documented?
  10.  .. and many more ..

Most of the metrics mentioned above are statistically measurable. Every one of them is probably important to every developer/student/programmer at some point in his or her career.

We will briefly go through the tools that help us gather these metrics for Python specifically. Keep in mind that most of the ideas apply to other technologies as well; you should quite easily find equivalent tools for your stack.

Pylint package

Pylint is a tool that checks for errors in Python code. It helps to enforce a Python coding standard and looks for so-called code smells.
By default it enforces PEP 8, which I recommend everybody become familiar with. Most if not every technology has its own standard, so be sure to get to know yours.
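As a quick illustration, here is a short made-up module with the kind of issues Pylint reports (the message codes are shown in the comments):

```python
"""A small example module with issues Pylint would complain about."""

def Add_Five(x):  # C0103: function name should be snake_case (e.g. add_five)
    unused = 42   # W0612: unused variable 'unused'
    return x + 5

print(Add_Five(10))
```

Running `pylint` on such a file lists each message with its code, line number and a short description, so the fixes are easy to locate.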

Radon package

It is a reporting tool which analyzes the structure of the code and provides scores for metrics such as the maintainability index (MI), cyclomatic complexity (CC), or raw metrics like lines of code, comments and others (RAW). For CC, the lower the score the better: it means the code is less complex and easier to maintain. For MI it is the opposite: a higher index means more maintainable code. The RAW metrics are just for information purposes. On their own, they do not tell much, except how big the code base is.
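To get a feel for cyclomatic complexity, consider this hypothetical function. CC is the number of decision points (each if, elif, for and so on) plus one, which is roughly the number radon reports per function:

```python
def describe(numbers):
    """Label each number; 3 decision points (for, if, elif) give CC = 4."""
    labels = []
    for n in numbers:        # +1 decision point
        if n < 0:            # +1 decision point
            labels.append("negative")
        elif n == 0:         # +1 decision point
            labels.append("zero")
        else:                # else adds no decision point of its own
            labels.append("positive")
    return labels            # +1 for the function itself
```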

Unittest or pytest packages

These are used for running the tests. They can provide statistics on how many tests passed, failed or were skipped. When a test fails, they also show the traceback, so it’s easier to locate and fix the problem.
Remember that having 100 tests might not be better than having only 5. Artificially writing many tests that duplicate each other does not make sense, and it definitely does not give our code base better quality. For the purpose of this article we should remember that the quantity of tests does not say anything about their usefulness or helpfulness, although it is better to have them than to have nothing at all.

It is impossible not to mention Test Driven Development (TDD) when talking about tests and code quality. I would say that using the TDD approach in a company ensures better quality, if used right.

Coverage package

The coverage package tells us how much of the code is tested or used, depending on how we use it.

1. Checking test coverage

In this case we run the tests and the package marks the lines that were called via our test methods. The report is usually in XML or HTML format. It is also good to run branch coverage. When running coverage we can keep separate statistics for different types of tests. For example:

  • Unit test coverage: 85%
  • Integration tests coverage: 70%
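Branch coverage matters because line coverage alone can mislead. In this made-up example, the first test on its own yields 100% line coverage while one branch (the path where the if-body is skipped) is never exercised:

```python
def clamp_negative_to_zero(x):
    if x < 0:
        x = 0
    return x

def test_negative():
    # This single test executes every line (100% line coverage),
    # but the branch where x >= 0 and the if-body is skipped is never taken.
    assert clamp_negative_to_zero(-3) == 0

def test_non_negative():
    # Adding this test covers the remaining branch.
    assert clamp_negative_to_zero(7) == 7
```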

Keep in mind that the percentage of test coverage does not say anything about the test quality. For all we know, the ‘tests’ may not test anything, just call the methods.
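A contrived example of that trap: both ‘tests’ below execute the function, so coverage reports 100%, but only the second one actually verifies anything.

```python
def divide(a, b):
    return a / b

def test_divide_just_calls():
    # Runs the code (counts toward coverage) but asserts nothing.
    divide(10, 2)

def test_divide_checks_result():
    # Actually verifies the behavior.
    assert divide(10, 2) == 5.0
```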

2. Checking the coverage for normal usage of the software

In my experience this is not used often, but I found it useful when we had a big code base and were refactoring it. Some of the code was left behind and not deleted, because someone thought “Maybe I will use it later”. Although that might be true for some time, at the end of the day we never used it, because the code was obsolete.

The gist of it is to look for what is called “dead code”: code that is not used in production, methods that are never called and are just hanging around in our repository. There are sometimes good reasons not to delete such code, for example backward compatibility, but if code is deprecated it should be removed from the code base.


Profiling packages

There are a few use cases for using a profiler. We want to know:

  1. how many times a method is called
  2. how many times a certain line is called
  3. how much time is spent in a method (the sum and on a single call of the method)
  4. how much memory is used by a method
  5. how much memory is used over time

For Python the most commonly used profiler is cProfile, which provides this information. We can visualize its output using GProf2Dot. To time a single method which we want to call, for example, 100000 times, a good tool is the timeit package.
For watching memory usage we can use the memory_profiler package.
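A minimal sketch of both approaches, using only the standard library (the 100000-call count mirrors the example above):

```python
import cProfile
import timeit

def add_5(a):
    return a + 5

# cProfile: how many calls were made and where the time went.
profiler = cProfile.Profile()
profiler.enable()
for i in range(100000):
    add_5(i)
profiler.disable()
profiler.print_stats("cumulative")

# timeit: total wall time for 100000 executions of a single statement.
elapsed = timeit.timeit("add_5(10)", globals=globals(), number=100000)
print(f"100000 calls took {elapsed:.4f} s")
```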

Let’s code!

It is relatively easy to quantify the mentioned metrics, but do they relate to the quality of the code?
In the section below let us focus on the code and how the quality changes based on what we do.

First method

Let us start with one of the easiest methods that we could think of.

def f(a):
    return a + 5

Does this code have good quality?


  • It is pretty straightforward
  • easy to read
  • no fluff
  • complexity is low


  • no documentation
  • no tests
  • inadequate name

First refactor – naming

If we only change the name of the method, it helps a lot! We don’t need to see the source code of the method to have an idea what it does, and it looks a lot better now.

def add_5(a):
    return a + 5

Is the quality higher? IMHO it is, and probably most people would say so, because it is more descriptive and user friendly.
But is the quality high? That is another question.

PEP8 compliant – docstring

Let’s write the code according to the PEP 8 standard. We will add a docstring.

def add_5(a):
    """Add 5 to the provided number.

    Parameters
    ----------
    a : int
        Number to add 5 to.

    Returns
    -------
    int
        Number incremented by 5.
    """
    return a + 5

Is the quality higher because of the docstring? I will let you decide, but we just added 12 lines of comments to 2 lines of code.
For many people it is not so easy to read now, because there is too much text and we have to look for the actual code.

Adding type hints

Let us use type hints, which were introduced in Python 3.5 and are gaining popularity right now.

def add_5(a: int) -> int:
    return a + 5

We just added 2 type hints, which live in the code itself, not in additional docs. We didn’t add any more lines to the code. Of course, not everyone knows type hints, but it is still pretty straightforward and easy to read: variable a is of type int and the method also returns an int. Type hints are also useful when using an IDE, because it hints what types of variables you should pass to the method when calling it.

Writing tests

We do not have any tests written for our code! Let’s write one!

def test_add_5_to_0():
    assert add_5(0) == 5 # True

We have our first test written and it passes. We also run the test coverage and it says 100%.

Writing some more tests

But what will happen if we want to add something to 3 or 8? Will the method still work? Let’s test it.

def test_add_5_to_0():
    assert add_5(0) == 5 # True

def test_add_5_to_3():
    assert add_5(3) == 8 # True

def test_add_5_to_8(): 
    assert add_5(8) == 13 # True

Did writing 2 more tests make our code quality higher? We are testing the same use case for our method, so in fact it didn’t change anything. What we should test is the functionality of the method, not the ability of our computer to add two numbers together.

What to do when the CC metric is high and MI is low (which is not good)?
The easiest way is to try to reduce if statements and for loops and, if possible, divide the method into smaller methods with less functionality (see the SOLID principles). The more nesting and conditions we put into a method, the harder it is to maintain and refactor the code.
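As a sketch of this kind of refactor (the shipping example is made up): the nested ifs in the first version each add to CC, while the refactored version collapses the branching into a lookup in a small helper.

```python
# Before: three decision points, CC = 4
def shipping_cost(weight, express, international):
    if international:
        if express:
            return weight * 10
        return weight * 5
    if express:
        return weight * 3
    return weight

# After: the branching becomes a table lookup in a small helper
RATES = {(True, True): 10, (True, False): 3,
         (False, True): 5, (False, False): 1}

def rate(express, international):
    return RATES[(express, international)]

def shipping_cost_refactored(weight, express, international):
    return weight * rate(express, international)
```

Both versions behave the same, but each function in the second version is trivial on its own, which is exactly what the CC and MI scores reward.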



We can easily fall into the trap of “wanting to have the best metrics”. We all want to brag about how good our metrics are, but is it really that important?

I think everybody should find the way that is best and most suitable for their own project. Some just want to see if the tests pass and don’t care about the history. Others will build a Continuous Integration (CI) system that checks everything on every commit and saves the results. Sometimes we just do a little project that we will need next week and will never go back to again.

One of the most common answers in the software development world is – It depends.

This is the same case. The best advice I can give you is to consider what your purpose for these metrics is. If you collect all the metrics possible (which is awesome!) but do not study them and act based on your analysis, it doesn’t help much. In that case you should just focus on development and not waste time.

If you collect metrics GO ALL THE WAY !!!

Code -> Collect -> Analyze -> Learn -> Code again

