Optimizing Python Program Performance for Data Science Projects
Oct 2, 2020
Python Optimization
Reading time: 3 min

In this article, I will introduce some handy Python program optimization tips for data science projects, covering the most commonly seen scenarios. In particular, although the rapid evolution of the data science toolkit has brought promising performance to developers, there are still many hidden rabbit holes that can cause significant performance issues. This article provides insights and examples to help you avoid such mistakes in your own programs. It is written to be very readable and does not assume a strong computer systems/architecture background.

🧵🧵🧵 Instant Boost with Multi-threading

Modern CPUs commonly have multiple cores; however, a Python program uses only one core by default. For some tasks, simply replacing a for loop with multi-threaded code can yield a pretty decent performance improvement. Let's take a look at an example:

import glob

files = glob.glob('./data/*/*.npy')
# files = [
#   './data/0/train.npy',
#   './data/0/label.npy',
#   './data/1/train.npy',
#   './data/1/label.npy',
#   ...
# ]

for f in files:
    output = some_operation(f)
    # ...

The above code snippet applies an external function some_operation to a series of NumPy files loaded from the file system. If you have a large number of files to process, this code can be really slow: a very likely bottleneck is that Python runs the loop on a single core, one file at a time. To leverage parallelism and improve performance, we only need a small change to the original code:

import glob
import os
from multiprocessing import Pool

def processor(f):
    output = some_operation(f)

    # ...

    # If you need the output, you can return it.
    # Pool.map returns results in the same order
    # as the input.
    return output

if __name__ == '__main__':
    files = glob.glob('./data/*/*.npy')

    # One worker per CPU core
    with Pool(os.cpu_count()) as p:
        outputs = p.map(processor, files)

The above code moves the for-loop body into a function processor, then distributes all files across parallel workers through the map function. To maximize the utilization of your CPU resources, assign as many workers as you have CPU cores (os.cpu_count() reports this number). multiprocessing is a built-in module that ships with most Python distributions, which means you can likely use it without a pip install.

Note: Python's "multi-threading" here is actually implemented as multiprocessing, which means it spawns multiple processes for your program instead of running multiple threads in a single process. For more technical details, see here.
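A quick way to convince yourself that Pool workers really are separate processes rather than threads is to have each worker report its own process ID. This is a minimal sketch; `report` is a hypothetical helper written just for this demonstration:

```python
import os
from multiprocessing import Pool

def report(_):
    # Each worker process has its own PID
    return os.getpid()

if __name__ == '__main__':
    with Pool(4) as p:
        pids = set(p.map(report, range(16)))
    # With 4 workers, you will typically see up to 4 distinct PIDs,
    # none of which is the parent's PID while the pool is running
    print(len(pids), 'distinct worker processes')
```

With true threads, every call would report the same PID; with processes, the work is genuinely spread across independent interpreters, each with its own GIL.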

Naturally, you might ask: when should I use this kind of parallelism? That is a very good question. For compute-bound tasks, e.g. numerical calculation, multiprocessing typically works well, and you are likely to see a near-linear performance increase. For I/O-bound tasks, e.g. file reads/writes, it does not help much, because the major performance bottleneck typically comes from disk I/O speed rather than the CPU. If your task can be broken down into small, independent, CPU-bound subtasks, use multiprocessing.
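The distinction above can be sketched in code. In this example, `cpu_task` and `io_task` are hypothetical stand-ins for your real workloads: a process pool (`multiprocessing.Pool`) suits the compute-bound task, while a thread pool (`multiprocessing.dummy.Pool`, which shares the same API but uses threads) is usually enough for the I/O-bound one, since the GIL is released while a thread waits on I/O:

```python
import time
from multiprocessing import Pool
from multiprocessing.dummy import Pool as ThreadPool  # thread-based, same API

def cpu_task(n):
    # CPU-bound: pure computation, benefits from multiple processes
    return sum(i * i for i in range(n))

def io_task(path):
    # I/O-bound: simulated blocking wait (e.g. a disk or network read)
    time.sleep(0.01)
    return path.upper()

if __name__ == '__main__':
    with Pool(4) as p:
        cpu_results = p.map(cpu_task, [10_000] * 8)

    with ThreadPool(4) as tp:
        io_results = tp.map(io_task, ['a.npy', 'b.npy', 'c.npy'])

    print(io_results)  # ['A.NPY', 'B.NPY', 'C.NPY']
```

Threads are cheaper to start than processes and can share memory freely, so when the bottleneck is waiting rather than computing, the thread pool is often the simpler choice.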

I'm rewriting the whole article; please stay tuned.

🔢 Tensor Operation Broadcasting

🐎 Unleash the Power of GPU

🔥 Rebirth in C++