A joblib sprint for better parallelization in Python – Scikit-Learn Consortium

Contributing to the whole python ecosystem is crucial and has always been a strong will for this team. Many team members of the consortium participated in a joblib sprint from February 2 to 4. For the record, joblib is the parallel-computing engine used by scikit-learn. Improving it provides better capabilities of parallelization for the whole python community. If you want to know more aboutjoblib and its amazing functionalities, go and check this page.

During this development sprint, we spent some time discussing the aim of the library and where it should go from now on. These are some of our take-aways:

In the long term, we hope we will be able to stop relying on process-based backend (loky in particular). With the recently published experimental nogil branch of CPython , we believe that the threading backend can become much more suitable for single-machine parallelism as it has less overhead (no need for pickling and inter-process communication), it is much faster to start and stop (no need to reimport packages in workers), and has reduced memory usage (a single Python process, no need to duplicate or memory map large data arguments).
Also, joblib’s aim is not to be the most flexible parallel computing backend, with more control on the tasks and callback. This complicates the API and other backends are already good for such tasks (concurrent.futures in CPython stdlib, dask and ray for distributed computing). Also, for advanced usage, there is a need for a scheduler, which is out of scope for joblib.
But there is a need for a simple API to unite them all and allow easy transition from one another. Moreover, there are several small utility tools that can be added to make the life of users easier.

scikit-learn consortium team members working for the joblib sprint in INRIA’s premises.

Therefore, the Goal of joblib is to provide the simplest API for embarrassingly parallel computing with fault-tolerance tools and backend switching, to go from sequential execution on a laptop to distributed runs on a cluster.

Improve the documentation: This is probably the most important direction: clarify the scope of the library and how to use it in the doc. In particular, the Memory object is often overlooked and can be hard to debug.
Benchmark for the adoption of the nogil python if possible: Many of the problems that exist with parallel processing come from our use of process based parallelism instead of thread based one due to the GIL. If somehow, thread based parallelism can be made efficient for python code, it would open the door to simplifying joblib code to only use the threading backend. As a parallel computation community, we should push to make some large benchmarks to evaluate how efficient the approach recently proposed in CPython is and report back to its developpers.
Reducing operation: For now, the joblib API does not allow to reduce the results before the end of the computations. This poses some memory overflow issues. We are working on the return_generator feature, that would allow to consume the results as they are ready. We also discussed more advanced reducing operations, that would minimize the communication overhead that can be explored in the future, with the notion of reducer on each worker.
Improve nested parallelism support: as the computing resources grow, it is becoming more and more urgent to better support multi-level parallel computing (when each task in a parallel pipeline also uses parallel computing). This raises the question on how to efficiently balance the load in each level, to avoid both workers starving and over-subscription. As a community, we should pursue the development of indication to help the parallelization system take the right decisions.
Improve Memory debugging tools: the disk caching can be hard to setup when discovering the joblib library. In particular, this is true for cache-miss triggered by passing almost identical numerical arrays. Improving the debugging tools for Memory would help the adoption of such pattern in the community. We proposed several tools:
- Debug mode: usingJOBLIB_DEBUG=1 stores the list of memory.cache calls and a second run compares the hash and make a diff of the args.
- LSH: use locally sensitive hashing to detect near cache-miss.
Add checkpointing utils to Memory: checkpointing is a classical pattern to improve fault tolerance, it could be interesting to add it to Memory.
A sprint high-time: the 4 years-old PR to add return_generator is finally green!!

During the sprint, we concretely worked on the following issues and PRs:

Dropping python2.7-3.6 support in loky (joblib/loky#304 – P. Maevskikh, P. Glaser, G. Le Maitre, J. Du BoisBerranger, O. Grisel): the code of loky was very complex due to numerous backports of functionalities and workarounds necessary to support both recent CPython versions to the oldest Python 2.7. As we drop this support, we could greatly simplify the code base and reduce the CI maintenance complexity.
Make joblib importable on pyodide to run sequential computations (joblib/joblib#1256 – H. Chatham, G. Varoquaux. This illustrates the mission of joblib: the same code running from the browser, to the datacenter).
Investigate bug and craft a fix for non memory-aligned memmapped pickles (joblib/joblib#1254 – L. Esteve, O. Grisel)
Integrating a return_generator=True in Parallel for asynchronous reduction operations (joblib/joblib#588 – F. Charras, T. Moreau).
Start to benchmark a parallel grid-search of a typical scikit-learn pipeline using the nogil branch of CPython and to debug the observed race-conditions withgdbso as to be able to report them in upstream projects once we can craft minimal reproducers (work in progress – J. Jerphanion).
Improve the dask dashboard rendering when displaying tasks spawned from nestedParallelcalls when running with the dask backend (dask/distributed#5757 – P. Glaser)
Numerous CI fixes (everyone).

Thanks for reading, we leave you with some green buttons.