In this article, I will introduce some handy optimization tips for Python data science programs, covering the most commonly seen scenarios in data science projects. Although the rapid evolution of the data science toolkit has brought promising performance to developers, there are still many hidden rabbit holes that can cause significant performance issues. This article provides insights and examples to help you avoid such mistakes in your own programs. It is meant to be very readable and does not assume a strong computer systems/architecture background.
🧵🧵🧵 Instant Boost with Multi-threading
Modern computer CPUs commonly have multiple cores; however, a Python program uses only one core by default. For some tasks, simply replacing a `for` loop with parallel code can result in a pretty decent performance improvement. Let's take a look at an example:
```python
import glob

files = glob.glob('./data/*/*.npy')
# files = [
#     './data/0/train.npy',
#     './data/0/label.npy',
#     './data/1/train.npy',
#     './data/1/label.npy',
#     ...
# ]

for f in files:
    output = some_operation(f)
    # ...
```
The above code snippet applies an external function `some_operation` to a series of NumPy arrays loaded from the file system. If you have a large number of files to process, this code can be quite slow, and a very likely bottleneck is that Python runs in a single thread. To leverage multiple cores and improve performance, we only need a small change to the original code:
```python
import glob
from multiprocessing import Pool

def processor(f):
    output = some_operation(f)
    # ...
    # If you need the output, you can return it.
    # The returned outputs will be ordered
    # the same as the inputs.
    return output

files = glob.glob('./data/*/*.npy')
p = Pool(8)  # number of workers, ideally matching your CPU core count
outputs = p.map(processor, files)
```
The above code moves the `for` loop body into a function `processor`, then distributes all the files across multiple workers through the `map` function. To maximize the utilization of your CPU resources, assign as many workers as you have CPU cores. `multiprocessing` is a built-in module that ships with most Python distributions, which means you can likely use it without a `pip install`.
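Rather than hard-coding the worker count, you can ask Python how many cores the machine has; a small sketch:

```python
# A small sketch: query the core count instead of hard-coding Pool(8).
import multiprocessing

n_workers = multiprocessing.cpu_count()  # logical cores visible to the OS
print(n_workers)
```

Passing `n_workers` to `Pool(...)` (or simply calling `Pool()` with no argument, which defaults to the core count) keeps the code portable across machines.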
Note: although this article speaks of "multi-threading", the feature used here is actually multi-processing: `Pool` spawns multiple processes for your program instead of running multiple threads in a single process, which is how it sidesteps Python's Global Interpreter Lock (GIL). For more technical details, refer to the official `multiprocessing` documentation.
Naturally, you might ask: when should I use this technique? This is a very good question. Typically, for computation-bound tasks, e.g. numerical calculation, it works well, and you are likely to see a near-linear performance increase. However, for I/O-bound tasks, e.g. file reads/writes, it does not help much, because the major performance bottleneck is typically disk I/O speed rather than CPU. If your task can be broken down into small, independent, CPU-bound subtasks, use multiprocessing.
I'm rewriting the whole article, please stay tuned.