A joblib sprint for better parallelization in Python
Contributing to the whole python ecosystem is crucial and has always been a strong will for this team. Many team members of the consortium participated in a
joblib sprint from February 2 to 4. For the record,
joblib is the parallel-computing engine used by scikit-learn. Improving it provides better capabilities of parallelization for the whole python community. If you want to know more about
joblib and its amazing functionalities, go and check this page.
During this development sprint, we spent some time discussing the aim of the library and where it should go from now on. These are some of our take-aways:
- In the long term, we hope we will be able to stop relying on process-based backend (loky in particular). With the recently published experimental nogil branch of CPython , we believe that the
threadingbackend can become much more suitable for single-machine parallelism as it has less overhead (no need for pickling and inter-process communication), it is much faster to start and stop (no need to reimport packages in workers), and has reduced memory usage (a single Python process, no need to duplicate or memory map large data arguments).
- Also, joblib’s aim is not to be the most flexible parallel computing backend, with more control on the tasks and callback. This complicates the API and other backends are already good for such tasks (
concurrent.futuresin CPython stdlib,
rayfor distributed computing). Also, for advanced usage, there is a need for a scheduler, which is out of scope for joblib.
- But there is a need for a simple API to unite them all and allow easy transition from one another. Moreover, there are several small utility tools that can be added to make the life of users easier.
Therefore, the Goal of
joblib is to provide the simplest API for embarrassingly parallel computing with fault-tolerance tools and backend switching, to go from sequential execution on a laptop to distributed runs on a cluster.
- Improve the documentation: This is probably the most important direction: clarify the scope of the library and how to use it in the doc. In particular, the
Memoryobject is often overlooked and can be hard to debug.
- Benchmark for the adoption of the nogil python if possible: Many of the problems that exist with parallel processing come from our use of process based parallelism instead of thread based one due to the GIL. If somehow, thread based parallelism can be made efficient for python code, it would open the door to simplifying joblib code to only use the
threadingbackend. As a parallel computation community, we should push to make some large benchmarks to evaluate how efficient the approach recently proposed in CPython is and report back to its developpers.
- Reducing operation: For now, the
joblibAPI does not allow to reduce the results before the end of the computations. This poses some memory overflow issues. We are working on the
return_generatorfeature, that would allow to consume the results as they are ready. We also discussed more advanced reducing operations, that would minimize the communication overhead that can be explored in the future, with the notion of reducer on each worker.
- Improve nested parallelism support: as the computing resources grow, it is becoming more and more urgent to better support multi-level parallel computing (when each task in a parallel pipeline also uses parallel computing). This raises the question on how to efficiently balance the load in each level, to avoid both workers starving and over-subscription. As a community, we should pursue the development of indication to help the parallelization system take the right decisions.
Memorydebugging tools: the disk caching can be hard to setup when discovering the
jobliblibrary. In particular, this is true for cache-miss triggered by passing almost identical numerical arrays. Improving the debugging tools for
Memorywould help the adoption of such pattern in the community. We proposed several tools:
- Debug mode: using
JOBLIB_DEBUG=1stores the list of
memory.cachecalls and a second run compares the hash and make a diff of the args.
- LSH: use locally sensitive hashing to detect near cache-miss.
- Debug mode: using
- Add checkpointing utils to
Memory: checkpointing is a classical pattern to improve fault tolerance, it could be interesting to add it to
During the sprint, we concretely worked on the following issues and PRs:
- Dropping python2.7-3.6 support in
loky(joblib/loky#304 – P. Maevskikh, P. Glaser, G. Le Maitre, J. Du BoisBerranger, O. Grisel): the code of
lokywas very complex due to numerous backports of functionalities and workarounds necessary to support both recent CPython versions to the oldest Python 2.7. As we drop this support, we could greatly simplify the code base and reduce the CI maintenance complexity.
pyodideto run sequential computations (joblib/joblib#1256 – H. Chatham, G. Varoquaux. This illustrates the mission of joblib: the same code running from the browser, to the datacenter).
- Investigate bug and craft a fix for non memory-aligned memmapped pickles (joblib/joblib#1254 – L. Esteve, O. Grisel)
- Integrating a
Parallelfor asynchronous reduction operations (joblib/joblib#588 – F. Charras, T. Moreau).
- Start to benchmark a parallel grid-search of a typical scikit-learn pipeline using the
nogilbranch of CPython and to debug the observed race-conditions with
gdbso as to be able to report them in upstream projects once we can craft minimal reproducers (work in progress – J. Jerphanion).
- Improve the dask dashboard rendering when displaying tasks spawned from nested
Parallelcalls when running with the dask backend (dask/distributed#5757 – P. Glaser)
- Numerous CI fixes (everyone).
Thanks for reading, we leave you with some green buttons.