All about Python Performance Optimization
Oct 2, 2020
Python Optimization
ETA: 6min(s)

Just like the title, this post is all about the Python performance optimization. For data science, handling large amount of data is inevitable. Although recently many tools have been developed for developers to accelerate their data science program, achieving good performance is still challenging. For one could easily fall into a performance rabbit hole by just writing one line inappropriately or troubling with squeeze out all the hardware performance. In this post, I will introduce a few handful tips to help you boost your Python data science programs and achieve 10X or more performance improvement.

๐Ÿงต๐Ÿงต๐Ÿงต Quick Boost with Multi-threading

Modern computer CPU are build with multiple cores while Python program could only use one core by default. For some task, replace a for loop with multi-threading code will result in performance improvement. Let's take an example:

import glob

files = glob.glob('./data/*/*.npy')
# files = [
#   './data/0/train.npy',
#   './data/0/label.npy',
#   './data/1/train.npy',
#   './data/1/label.npy',
#   ...
# ]

for f in files:
    output = some_operation(f)
    # ...

The above code apply a function some_operation to some files. While, if you have a large amount of files need to be process, the above code could be slow, a very likely performance bottleneck is the single thread Python feature. While, to apply multi threading to improve your code performance, it requires a little bit change on the original code:

import glob
from multiprocessing import Pool

def processor(f):
    output = some_operation(f)

    # ...

    # If you need the output, you can return it.
    # And the returned output will be ordered
    # as the input.
    return output

files = glob.glob('./data/*/*.npy')

p = Pool(8) # Number of CPU cores
outputs =, files)

The above code removed the for loop and replaced it loop body with a function processor, then map all files into multi threads processing through the map function. Here you wanna maximize the utilization of your CPU resources by assign as many threads as your CPU cores. The multiprocessing is a built-in module that came with most python distribution, which means you are likely to use it without pip install.

Apparently, you might want to ask a question: when should I use multi-threading? And it is an important question. To answer this question, we would best to understand why multi threading can improve performance. While if you are not interested in this part or hurry in time, please scroll down to the conclusion of this section.

๐Ÿ˜ด The boring part...

For the first place, multi threading is like assigning a huge task to multiple person. How could the original task be broke down is an important factor to the feasibility of multi threading. If the original task can be break down into multiple small and independent subtasks, we are likely to use multi threading. Like the above example, the processing of each file is independent to each other and much more lightweight comparing to the whole task.

I keep using the word likely because small and independent subtasks is just not enough to release the power of multi threading. You also need to know what takes most time of a single subtask, if the time is taken by CPU as doing some kind of calculation, e.g., do Fourier transform on a audio file, use multiple CPU will greatly improve the performance. While, if most time is consumed on I/O, e.g., reading a huge file, you might get even worse performance by using multi threading, because the psychical storage device need to schedule access order for multiple I/O jobs.

But what if my task is not independent to each other. Fortunately it might not be the end of world to the multi threading. A very commonly seen example is stitching multiple images, in such case, your operation function might take multiple input and generate only one output. This is also a perfect use case for multi threading, because even you have dependencies in your data, the operation do not depend on other operation results. This is an important feature, because during multi threading execution, the order of execution to each subtask can not be predicted easily, because if you assign a specific order to your subtasks execution, you will certainly lost some performance, because some subtask need to wait others, or even convert your multi threading code into a single threading one. An appropriate example would be the calculation on time serial data, because your current calculation might depend on previous time step. In this case, you are likely going to move your focus from parallelize to parallelize the calculation on each time step, or optimize the calculation itself.

A good way to monitor your multi threading program is using the command line tool htop. In the most ideal situation, you will see your CPU reaches 100% utilization and being occupied by your program!

Conclusion. If your task can be break down into small, independent and CPU bound subtasks, uses the multi threading feature. This method suits in most cases of doing CPU calculation, like feature extraction, image processing. As you can see, this method is easy to use, and can be applied to a lot of scenarios.

๐Ÿ”ข Accelerating Matrix Operation

Although multi threading boost your code easily, it is not the best way for accelerating matrix operation which is very common in linear algebra tasks. Here we use a intuitive example to explain the reason behind it:

import numpy as np

m = np.ones((10000, 10000))

for v in range(len(m.shape[0])):
    for u in range(len(m.shape[1])):
        m[v, u] = some_operation(m[v, u])

You may try to use multi threading for optimizing the above code, and you will get some performance improvement, but it might not be much even you used all your CPU cores. This is because the major performance bottleneck here is the array access time. Every array access was triggered in the Python environment and reached data with low-level C++ code.

So, is there any good way to speed up the access? Unfortunately, this could be very difficult if you don't want to do lots of modifications on your existing Python code. But if we rethink the problem, instead of trying to eliminate the performance bottleneck between Python and C++, we can turn everything into fast binary code!

This idea sounds super crazy, but could be super simple if we use right tools. For example, Numba provides JIT (just-in-time compilation) to turn Python code into compiled binary code. And it has deeply integrated with Numpy ecosystem, that could automatically optimize array access during compilation. Let's look at an example:

import numpy as np
from numba import njit

def optimized_func(m):
    for v in range(len(m.shape[0])):
        for u in range(len(m.shape[1])):
            m[v, u] = some_operation(m[v, u])

arr = np.ones((10000, 10000))

This example turns the above python code into an optimized Numba code. As you can see, we use the @njit decorator to optimize a function, and the optimized_func will be complied into binary code before execution.

There is always a "but". But since JIT compilation require strict and strong variable type, optimize a large Python program will not be easy. On the other hand, you shouldn't have bottle neck everywhere in your program, they should mostly located in some heavy computation functions. And every external function in njit decorated function should also be decorated.

We will not go into details of Numba optimization in this article, please refer to its documents for more comprehensive explanations. Besides Numba, Cython is also a choice. But personally I would not recommend Cython since it provide less IDE support.

๐ŸŽ Unleash the Power of GPU

For most time, you should be satisfied after using Numba in your code. However, there might be chance that your program has huge amount of linear algebra computation, e.g., a neural network. In this scenario, you could use GPU to boost your performance.

Before talking about GPU, we should take a look back to our existing optimization skills. A obvious technique that plays an important part is parallel, and it is also the most important feature of GPU. Unlike CPU have few but strong cores, GPU has many but week cores. If you can break down your task into small pieces that can be executed in parallel, execute them on GPU would be much faster than CPU. And this kind of task break can often happens on linear algebra programs that widely used in computer graphics and machine learning.

Fortunately, you can still use Numbda to compile your Python code into CUDA GPU code. And another choice is CuPy.

๐Ÿ”ฅ Nirvana Rebirth in C++

To be continued....