Multi-processing in Python
Useful tips to eke out all possible performance in Python using the multiprocessing library
Overview
Sometimes Python is slow. By default, a Python program runs in a single process on a single thread. To speed up processing, code can run in multiple processes. This parallelization distributes work across all available CPU cores. When running on multiple cores, long-running jobs can be broken down into smaller, manageable chunks. Once the individual chunks have run in parallel, the results are returned, and the total processing time can be cut down substantially. Multi-processing in Python is an effective way to speed up long-running functions.
Multi-processing
Python has multiprocessing built into its standard library. With a simple import statement:
import multiprocessing
we can run functions in parallel. The package offers several strategies for doing so. In this post we will highlight some fast and easy implementations using multiprocessing to speed up long-running code.
Pool
The multiprocessing package includes the Pool class, which allows for the creation of a pool of workers. Once the pool is allocated, we have a set of worker processes that can process tasks in parallel. This usually looks like the code below:
from multiprocessing import Pool

number_of_workers = 10

if __name__ == "__main__":
    with Pool(number_of_workers) as p:
        # Do something with pool here
        pass
Now that the pool is allocated, workers can be given tasks.
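As a rough sketch of sizing the pool, the worker count can be derived from the machine instead of being hard-coded. The helper names default_worker_count and worker_pid below are hypothetical, chosen for illustration; os.cpu_count() and os.getpid() are standard library calls. The sketch hands out tasks with map, which is covered in the next section, and reports which processes handled them.

```python
import os
from multiprocessing import Pool


def worker_pid(_):
    """Return the ID of the process that handled this task."""
    return os.getpid()


def default_worker_count():
    """One worker per CPU core, falling back to 1 if the count is unknown."""
    return os.cpu_count() or 1


if __name__ == "__main__":
    number_of_workers = default_worker_count()
    with Pool(number_of_workers) as p:
        pids = set(p.map(worker_pid, range(100)))
    # The tasks were handled by worker processes, not by the parent process
    print(f"parent: {os.getpid()}, workers used: {sorted(pids)}")
```

More workers than cores rarely helps for CPU-bound work, so the core count is a reasonable starting point rather than a rule.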
Map
Now that the pool is allocated, work may be done in parallel. Using map, we can split a job across multiple processes that run at the same time. In the example below, we use multiprocessing to square and print a large array of numbers in parallel.
def do_something(number):
    return number ** 2

if __name__ == "__main__":
    # Smaller than the original 100000000000, which would not fit in memory as a list
    array_of_numbers = list(range(1_000_000))
    with Pool(number_of_workers) as p:
        print(p.map(do_something, array_of_numbers))
If this were done serially (without parallelization), it would take quite some time. By splitting the job into pieces, we can share it among the different CPU cores to speed up the task.
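To check whether parallelizing actually pays off for a given function, one option is to time both versions. The sketch below is an illustration under assumptions: slow_square, run_serial, and run_parallel are hypothetical names, and the busy loop stands in for real CPU-heavy work. Timings depend on the machine, so no particular speedup is implied; for very cheap functions, the cost of sending data between processes can outweigh the gain.

```python
import time
from multiprocessing import Pool


def slow_square(number):
    """Square a number after some artificial CPU work (a stand-in for a real job)."""
    total = 0
    for i in range(200_000):
        total += i % 7
    return number ** 2


def run_serial(numbers):
    """Process every item in the current process, one after another."""
    return [slow_square(n) for n in numbers]


def run_parallel(numbers, workers):
    """Process the items with a pool of worker processes."""
    with Pool(workers) as p:
        return p.map(slow_square, numbers)


if __name__ == "__main__":
    numbers = list(range(200))

    start = time.perf_counter()
    serial_results = run_serial(numbers)
    serial_seconds = time.perf_counter() - start

    start = time.perf_counter()
    parallel_results = run_parallel(numbers, workers=4)
    parallel_seconds = time.perf_counter() - start

    # Both approaches must produce the same answers
    assert serial_results == parallel_results
    print(f"serial: {serial_seconds:.2f}s, parallel: {parallel_seconds:.2f}s")
```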
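Under the hood, map pickles the function and the input items, sends them in chunks to the worker processes, and collects every result into a single list. Sometimes this leads to memory issues. The workers are separate processes, and on platforms that start them by forking, each one begins as a copy of the current Python process. For example, if the current process size in memory is 4GB and the code is using Pool(4) on a four-core machine, each of the 4 workers can end up holding its own copy of that 4GB. In the worst case, this can increase memory usage by up to 4GB * 4 workers = 16GB.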
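Because tasks reach the workers by pickling, whatever is passed through map has to be picklable. A minimal sketch of checking this ahead of time, assuming a hypothetical helper named is_picklable: a function defined at module level in a script can be pickled, while a lambda cannot, which is why the examples in this post use def.

```python
import pickle


def do_something(number):
    return number ** 2


def is_picklable(obj):
    """Return True if obj survives pickle.dumps, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


if __name__ == "__main__":
    # A function defined with def at module level
    print("def function picklable:", is_picklable(do_something))
    # A lambda has no importable name, so pickling it fails
    print("lambda picklable:", is_picklable(lambda n: n ** 2))
    # Arguments are pickled by value, so large inputs mean more data to send
    print("bytes for 1000 ints:", len(pickle.dumps(list(range(1000)))))
```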
Imap
A more memory-friendly method is imap. It is a lazier version of map: it uses the same pool of workers, but it returns an iterator that yields results in order as they become available, instead of a completed list. Because the full result sequence never has to sit in memory at once, imap can use less memory.
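def do_something(number):
    return number ** 2

if __name__ == "__main__":
    # Smaller than the original 100000000000, which would not fit in memory as a list
    array_of_numbers = list(range(1_000_000))
    with Pool(number_of_workers) as p:
        # imap returns an iterator, so loop over it to get the results
        for result in p.imap(do_something, array_of_numbers):
            print(result)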
imap produces the same values in the same order as map, but it can reduce memory usage because the results do not have to be held in a list.
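Since imap hands back results one at a time, they can be folded into a running value without ever building a list of them. The sketch below assumes a hypothetical sum_of_squares helper; the chunksize argument is a real imap parameter that batches several inputs per task, and the value 1000 is an arbitrary choice.

```python
from multiprocessing import Pool


def do_something(number):
    return number ** 2


def sum_of_squares(count, workers=2, chunksize=1000):
    """Sum the squares of 0..count-1 while holding only a running total."""
    total = 0
    with Pool(workers) as p:
        # The iterator must be consumed while the pool is still open
        for result in p.imap(do_something, range(count), chunksize=chunksize):
            total += result
    return total


if __name__ == "__main__":
    print(sum_of_squares(10))  # 285
```

A larger chunksize usually reduces the communication overhead per item, at the cost of holding more items in flight at once.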
One thing to note is that imap and map can only pass one parameter to the function being parallelized.
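When only one argument varies between calls, one common workaround is to fix the remaining arguments with functools.partial and still use map or imap. A minimal sketch, where power and cubes are hypothetical names used for illustration:

```python
from functools import partial
from multiprocessing import Pool


def power(number, exponent):
    return number ** exponent


def cubes(numbers, workers=2):
    """Raise every number to the third power using map with a fixed exponent."""
    # partial pins exponent=3, leaving a one-parameter callable for map
    cube = partial(power, exponent=3)
    with Pool(workers) as p:
        return p.map(cube, numbers)


if __name__ == "__main__":
    print(cubes(range(5)))  # [0, 1, 8, 27, 64]
```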
Starmap
Another method, starmap, behaves like map in both functionality and memory usage. The difference is that starmap allows for multiple arguments: each item in the iterable is a tuple that is unpacked into the function's parameters.
def do_something(number, another_number):
    return number ** 2 + another_number ** 2

if __name__ == "__main__":
    # Smaller than the original 100000000000, which would not fit in memory as a list
    array_of_number_tuple = [(x, x + 1) for x in range(1_000_000)]
    with Pool(number_of_workers) as p:
        print(p.starmap(do_something, array_of_number_tuple))
In the code example above, we show how starmap differs from map and imap. Instead of a single parameter, multiple parameters are passed to the function that is being run in parallel.
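If the arguments start out in separate sequences, zip can pair them up into the tuples that starmap expects. A small sketch, assuming a hypothetical helper sum_squares_pairwise that wraps the pool call:

```python
from multiprocessing import Pool


def do_something(number, another_number):
    return number ** 2 + another_number ** 2


def sum_squares_pairwise(first, second, workers=2):
    """Apply do_something to matching items of two sequences in parallel."""
    with Pool(workers) as p:
        # zip yields (first[i], second[i]) tuples; starmap unpacks each one
        return p.starmap(do_something, zip(first, second))


if __name__ == "__main__":
    print(sum_squares_pairwise([1, 2, 3], [4, 5, 6]))  # [17, 29, 45]
```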
Conclusion
The Python package multiprocessing allows for faster execution of long-running jobs. There are more complex ways to use the package that aren't detailed in this post; these are covered in the Python documentation. Using tools from the multiprocessing library, you can cut down your processing time substantially, in some cases from days to hours.
Finally, if using Python is exciting and getting the most out of your code sounds like fun, we’re always looking for experienced Python developers at Apteo. Feel free to reach out at info@apteo.co