What is a scikit-learn sprint, you may ask? The scikit-learn sprint is a hands-on “hackathon” where we work on issues in the scikit-learn GitHub repository and learn to contribute to open source. This sprint included an introductory, practical workshop about contributing to open source software.
So cool to see this community gather to contribute to data scientists’ companion @scikit_learn #pariswimlds #DataScience #MachineLearning #OpenSource
— Giulia Bianchi (@Giuliabianchl) March 12, 2022
Under the guidance of no less than five people from the scikit-learn team, the participants set up their environments, learned how to use tools ranging from conda and git to black and pytest, and finally made their first pull requests!
If you don’t feel like solving issues and submitting pull requests, another way to contribute to scikit-learn as a user is simply to open issues when you run into one! By the way, a good way of doing so is to include a minimal reproducer. Note that there are many other ways to contribute to scikit-learn, organizing such an event being one of them. Feel free to contact me if you would like to do so.
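For illustration, a minimal reproducer for an issue report might look like the following sketch: self-contained, with synthetic data and the exact call that misbehaves (the estimator and data here are arbitrary placeholders, not from any actual report):

```python
# Sketch of a minimal reproducer for a scikit-learn issue report.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)  # fixed seed so the report is reproducible
X = rng.normal(size=(20, 3))
y = rng.randint(0, 2, size=20)

clf = LogisticRegression().fit(X, y)
print(clf.score(X, y))  # in a real report: state observed vs. expected output
```

The point is that a maintainer can copy-paste the snippet and see the problem immediately, without needing your private data.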
For a full replay of the event, you can check Chloé Azencott’s Twitter page and the #pariswimlds hashtag: she documented every step, from the tee-shirt distribution in the morning to the successful pull requests, not forgetting learning how to use VS Code and of course… the pizza! But there were also fruits, tea and fruit juice; we were not here to conform to stereotypes.
There’s one more thing developers need after pizza, before they can spend their Saturday afternoon working on pull requests at the @WiMLDS_Paris @scikit_learn sprint! #pariswimlds pic.twitter.com/8eZPb981xB
— Dr Chloé Azencott (@cazencott) March 12, 2022
A couple of numbers from this sprint:
Happy to take part in the @scikit_learn sprint with #wimlds members. Thanks @CybelAngel for hosting this nice event. pic.twitter.com/3x5ZDWDYHX
— Victoire Louis (@VictoireLouisC) March 12, 2022
If you want to do this at home, here are a couple of links to guide you in starting to contribute to scikit-learn:
All the setup and guidelines were explained in a dedicated GitHub repository that you can find here. It was made to be crystal clear and guides you step by step: you can rely on it if you want to start. During this sprint, a couple of issues, and more exactly meta-issues (issues listing a problem in plenty of different places, to be fixed individually), were listed on a dedicated board. Although we made some progress, they are still open if you want to have a look! The maintainers and core contributors of scikit-learn label some issues as “good first issue”. It’s a label to encourage people to step in with these easier issues: #21350, #22406. We also worked on #11000, which is labeled as “hard”, but some participants successfully tackled it.
If you are interested, scikit-learn also has a webpage explaining how to contribute.
Some participants were interviewed during the sprint…
… And you can find their reactions to this sprint here.
We would like to thank our mentors Olivier Grisel, Adrin Jalali, Maren Westermann, Béa Hernandez, and Gaël Varoquaux. Thank you for coming, sometimes from quite far, for being so pedagogical, and for reviewing and accepting the participants’ pull requests so quickly!
The organization of this sprint was made possible by WiMLDS Paris and especially Chloé Azencott, who was instrumental in making the connection between the participants and us, the scikit-learn community team.
Finally, thank you to CybelAngel for hosting us, to Giulia Bianchi for presenting CybelAngel, and to Marie Sacksick for the perfect logistics of this event.
Longer term: Big picture tasks which require more thinking
Community: on the community side:
We organized a `joblib` sprint from February 2 to 4. For the record, `joblib` is the parallel-computing engine used by scikit-learn. Improving it provides better parallelization capabilities for the whole Python community. If you want to know more about `joblib` and its amazing functionalities, go and check this page.
During this development sprint, we spent some time discussing the aim of the library and where it should go from now on. These are some of our takeaways:
- The `threading` backend can become much more suitable for single-machine parallelism, as it has less overhead (no need for pickling and inter-process communication), it is much faster to start and stop (no need to re-import packages in workers), and has reduced memory usage (a single Python process, no need to duplicate or memory-map large data arguments).
- Other tools already cover neighboring use cases (`concurrent.futures` in the CPython stdlib, `dask` and `ray` for distributed computing). Also, for advanced usage, there is a need for a scheduler, which is out of scope for joblib.

Therefore, the goal of `joblib` is to provide the simplest API for embarrassingly parallel computing, with fault-tolerance tools and backend switching, to go from sequential execution on a laptop to distributed runs on a cluster.
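That embarrassingly parallel API with backend switching can be sketched as follows (a minimal illustrative example, not code from the sprint):

```python
# Minimal sketch of joblib's Parallel/delayed API with backend switching.
from math import sqrt
from joblib import Parallel, delayed, parallel_backend

# Default process-based backend (loky):
results = Parallel(n_jobs=2)(delayed(sqrt)(i) for i in range(4))

# The same code, switched to the threading backend discussed above:
with parallel_backend("threading", n_jobs=2):
    results_threads = Parallel()(delayed(sqrt)(i) for i in range(4))
```

The computation itself is untouched; only the context manager decides how it is executed, which is exactly the "laptop to cluster" switch described above.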
Some pain points and ideas also emerged:

- The `Memory` object is often overlooked and can be hard to debug.
- The approach recently proposed in CPython to remove the GIL would directly benefit the `threading` backend. As a parallel-computation community, we should push to make some large benchmarks to evaluate how efficient this approach is, and report back to its developers.
- The `joblib` API does not allow reducing the results before the end of the computations, which poses some memory-overflow issues. We are working on the `return_generator` feature, which would allow consuming the results as they become ready. We also discussed more advanced reducing operations that would minimize the communication overhead, to be explored in the future, with the notion of a reducer on each worker.
- `Memory` debugging tools: the disk caching can be hard to set up when discovering the `joblib` library, in particular for cache misses triggered by passing almost identical numerical arrays. Improving the debugging tools for `Memory` would help the adoption of this pattern in the community. We proposed several tools: an environment variable such as `JOBLIB_DEBUG=1` that stores the list of `memory.cache` calls, so that a second run can compare the hashes and make a diff of the arguments; and checkpointing in `Memory`: checkpointing is a classical pattern to improve fault tolerance, and it could be interesting to add it to `Memory`.
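As an illustration of the disk-caching pattern discussed above, here is a minimal `Memory` sketch (the cache directory is a temporary one chosen for this example):

```python
# Sketch of joblib.Memory disk caching on a temporary directory.
import tempfile
from joblib import Memory

cachedir = tempfile.mkdtemp()
memory = Memory(cachedir, verbose=0)

@memory.cache
def square(x):
    # An expensive computation would go here.
    return x * x

first = square(3)   # computed and written to disk
second = square(3)  # served from the on-disk cache
```

Cache misses happen when the hash of the arguments changes, which is precisely why almost-identical numerical arrays can be confusing to debug.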
During the sprint, we concretely worked on the following issues and PRs:
- Dropping Python 2.7 support in `loky` (joblib/loky#304 – P. Maevskikh, P. Glaser, G. Le Maitre, J. Du Boisberranger, O. Grisel): the code of `loky` was very complex due to numerous backports of functionalities and workarounds necessary to support everything from recent CPython versions down to the old Python 2.7. As we drop this support, we can greatly simplify the code base and reduce the CI maintenance complexity.
- Making `joblib` importable on `pyodide` to run sequential computations (joblib/joblib#1256 – H. Chatham, G. Varoquaux). This illustrates the mission of joblib: the same code running everywhere, from the browser to the data center.
- `return_generator=True` in `Parallel` for asynchronous reduction operations (joblib/joblib#588 – F. Charras, T. Moreau).
- Running joblib on the `nogil` branch of CPython and debugging the observed race conditions with `gdb`, so as to be able to report them to upstream projects once we can craft minimal reproducers (work in progress – J. Jerphanion).
- Nested `Parallel` calls when running with the dask backend (dask/distributed#5757 – P. Glaser).

Thanks for reading; we leave you with some green buttons.
June 2, 2021
9:10 am – 10 am: Presentation of the technical achievements and ongoing work by O. Grisel
10 am – 12 pm: Feedback and exposition from each partner of the consortium
1 pm – 2 pm: Collaborative drafting session of the updated roadmap
2 pm – 5 pm: Afternoon discussions on Discord
Scikit-learn @ Fondation Inria: Alexandre Gramfort (Inria, advisory committee of the consortium), Olivier Grisel (consortium engineer), Guillaume Lemaître (consortium engineer), Jérémie Du Boisberranger (consortium engineer), Chiara Marmo (consortium COO), Loic Esteve (Inria engineer), Mathis Batoul (Inria intern), Julien Jerphanion (Inria engineer), Gaël Varoquaux (consortium director)
Consortium partners: AXA, Dataiku, Fujitsu, Microsoft, BNP Paribas Cardif
Scikit-learn community: Adrin Jalali (community member of the advisory board, at Zalando), Joel Nothman (community member of the advisory board, at the University of Sydney)
From the discussion during the technical committee, the scikit-learn Consortium at Inria defined the following list of priorities for the coming year:
On the community side:
Exposure of partners’ comments and priorities
Fujitsu:
AXA:
BNPParibasCardiff:
Microsoft:
Dataiku:
The members of the Consortium provide their financial support without any service counterpart. The definition of a development and general-activities roadmap is an important step in building trust between the members of the Consortium. It represents our effort to focus together on the right path for the success of the library, in terms of impact on the software market and sustainability for the community. The roadmap sets our direction.
The roadmap only binds the developers staffed at the Consortium. It should also take into account the needs of the community, in order to avoid conflicts of interest or wasted effort.
The Technical Committee of the scikitlearn Consortium is composed of:
They have the responsibility to elaborate the roadmap.
During Technical Committees, the technical staff of the Consortium summarizes the work done throughout the previous months. Then, the members describe how they use scikit-learn and which features they would like to see improved or added to the library. Discussions follow, aimed at prioritizing the proposed items and finding common strategies to address them. When a feature is already in the scikit-learn roadmap, it is very likely to be spotlighted. Sometimes our sponsors are able to propose a contributor to start work on proofs of concept: this helps in moving forward with a proposed feature.
The Consortium roadmap contains the recommended technical features: when relevant, links to issues or pull requests already open in the issue tracker are listed. The roadmap also contains non-technical suggestions, such as community support and release frequency. Consortium members are invited to propose topics for development sprints and workshops. For instance, the Consortium recently organized an event around interpretability and explainability issues of machine learning models.
Like real life, real-world data is most often far from normality. Still, data is often assumed, sometimes implicitly, to follow a Normal (or Gaussian) distribution. The two most important assumptions made when choosing a Normal distribution or squared error for regression tasks are^{1}:
On top of that, the squared error is well known to be sensitive to outliers. Here, we want to point out that potentially better alternatives are available.
Typical instances of data that are not normally distributed are counts (discrete) or frequencies (counts per some unit). For these, the simple Poisson distribution might be much better suited. A few examples that come to mind are:
In what follows, we have chosen the diamonds dataset to show the non-normality of such targets and the convenience of GLMs in modelling them.
The diamonds dataset consists of the prices of over 50,000 round-cut diamonds, with a few explanatory variables, also called features X, such as ‘carat’, ‘color’, ‘cut quality’, ‘clarity’ and so forth. We start with a plot of the (marginal) cumulative distribution function (CDF) of the target variable price and compare it to a fitted Normal and a fitted Gamma distribution, each of which has two parameters.
These plots show clearly that the Gamma distribution might be a better fit to the marginal distribution of Y than the Normal distribution.
Let’s start with a more theoretical intermezzo in the next section; we will return to the diamonds dataset afterwards.
GLMs are statistical models for regression tasks that aim to estimate and predict the conditional expectation of a target variable Y, i.e. E[Y|X]. They unify many different target types under one framework: Ordinary Least Squares, Logistic, Probit and multinomial models, Poisson regression, Gamma and many more. GLMs were formalized by John Nelder and Robert Wedderburn in 1972, long after artificial neural networks!
The basic assumptions for an instance or data row i are

μ_{i} = E[Y_{i} | x_{i}] = h(x_{i}^{T} β),   Var[Y_{i} | x_{i}] ∝ v(μ_{i}),

where μ_{i} is the mean of the conditional distribution of Y given x_{i}, h is the inverse link function and v is the variance function.
One needs to specify:
Note that the choice of the loss or distribution function or, equivalently, the variance function is crucial. It should, at least, reflect the domain of Y.
Some typical example combinations are: Normal distribution with identity link (Ordinary Least Squares), Bernoulli distribution with logit link (Logistic Regression), Poisson distribution with log link (Poisson regression), and Gamma distribution with log link.
Once you have chosen the first four points, what remains is to find a good feature matrix X. Unlike other machine learning algorithms such as boosted trees, there are very few hyperparameters to tune. A typical hyperparameter is the regularization strength when penalties are applied. Therefore, the biggest leverage to improve your GLM is manual feature engineering of X. This includes, among others, feature selection, encoding schemes for categorical features, interaction terms, and non-linear terms like x^{2}.
The new GLM regressors are available as:

```python
from sklearn.linear_model import PoissonRegressor
from sklearn.linear_model import GammaRegressor
from sklearn.linear_model import TweedieRegressor
```
The `TweedieRegressor` has a parameter `power`, which corresponds to the exponent of the variance function v(μ) ∼ μ^{p}. For ease of use in the most common cases, `PoissonRegressor` and `GammaRegressor` are the same as `TweedieRegressor(power=1)` and `TweedieRegressor(power=2)`, respectively, with a built-in log link. All of them also support an L2 penalty on the coefficients by setting the penalization strength `alpha`. The underlying optimization problem is solved via the L-BFGS solver of scipy. Note that scikit-learn release 0.23 also introduced the Poisson loss for the histogram gradient boosting regressor as `HistGradientBoostingRegressor(loss='poisson')`.
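As a hedged illustration of this API, here is a sketch fitting a Poisson GLM on synthetic count data (the data-generating coefficients are arbitrary, chosen only for the example):

```python
# Sketch: fitting a Poisson GLM with built-in log link on synthetic counts.
import numpy as np
from sklearn.linear_model import PoissonRegressor

rng = np.random.RandomState(0)
X = rng.uniform(size=(200, 2))
# Counts generated consistently with a log link: E[y] = exp(1 + 2*x0)
y = rng.poisson(np.exp(1 + 2 * X[:, 0]))

model = PoissonRegressor(alpha=1e-4).fit(X, y)
preds = model.predict(X)  # predictions are always strictly positive
```

The log link guarantees positive predictions, which is exactly the "reflect the domain of Y" requirement mentioned above for count data.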
After all this theory, it is time to come back to our real world dataset: diamonds.
Although, in the first section, we were analysing the marginal distribution of Y and not the conditional (on the features X) distribution, we take the plot as a hint to fit a Gamma GLM with log link, i.e. h(x) = exp(x). Furthermore, we split the data, textbook-like, into an 80% training set and a 20% test set^{2}, and use `ColumnTransformer` to handle columns differently. Our feature engineering consists of selecting only the four columns ‘carat’, ‘clarity’, ‘color’ and ‘cut’, log-transforming ‘carat’, and one-hot-encoding the other three. Fitting a Gamma distribution and predicting on the test sample gives us the plot below.
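The workflow just described can be sketched on synthetic data standing in for diamonds (the column names mirror the post; the dataset itself is not loaded here and the data-generating process is invented for the example):

```python
# Sketch of the described pipeline: log-transform 'carat', one-hot encode the
# categoricals, then fit a Gamma GLM with log link.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import GammaRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder

rng = np.random.RandomState(42)
n = 500
df = pd.DataFrame({
    "carat": rng.uniform(0.2, 3.0, n),
    "cut": rng.choice(["Fair", "Good", "Ideal"], n),
    "color": rng.choice(["D", "E", "F"], n),
    "clarity": rng.choice(["I1", "SI2", "VS1"], n),
})
# Positive target with multiplicative Gamma-like noise (invented coefficients):
df["price"] = np.exp(1.5 * np.log(df["carat"]) + 8) * rng.gamma(10, 0.1, n)

preprocessor = ColumnTransformer([
    ("log_carat", FunctionTransformer(np.log), ["carat"]),
    ("onehot", OneHotEncoder(handle_unknown="ignore"),
     ["cut", "color", "clarity"]),
])
model = Pipeline([
    ("prep", preprocessor),
    ("glm", GammaRegressor(alpha=1e-3, max_iter=1000)),
])

X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns="price"), df["price"], test_size=0.2, random_state=0)
model.fit(X_train, y_train)
preds = model.predict(X_test)  # strictly positive, thanks to the log link
```

Log-transforming ‘carat’ inside the pipeline keeps the whole preprocessing reproducible at prediction time.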
Note that fitting Ordinary Least Squares on log(‘price’) also works quite well. This is to be expected, as the Log-Normal and Gamma are very similar distributions, both with Var[Y] ∼ E[Y]^{2} = μ^{2}.
There are several open issues and pull requests for improving GLMs and the fitting of non-normal data. Some of them have already been implemented in scikit-learn 0.24; let’s hope the others will be merged in the near future:
By Christian Lorentzen
^{1} Algorithms and estimation methods are often well able to deal with some deviation from the Normal distribution. In addition, the central limit theorem justifies a Normal distribution when considering averages or means, and the Gauss–Markov theorem is a cornerstone for usage of least squares with linear estimators (linear in the target Y).
^{2} Rows in the diamonds dataset seem to be highly correlated, as there are many rows with the same values for carat, cut, color, clarity and price, while the values for x, y and z seem to be permuted. Therefore, we define a new group variable that is unique for ‘carat’, ‘cut’, ‘color’, ‘clarity’ and ‘price’.
Then, we split by group, i.e. using a `GroupShuffleSplit`. Having correlated train and test sets invalidates the independent and identically distributed (i.i.d.) assumption and may render test scores too optimistic.
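A minimal sketch of such a grouped split (the groups here are illustrative stand-ins for the duplicated-row groups described above):

```python
# Sketch: GroupShuffleSplit keeps all rows of a group on the same side.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

X = np.arange(10).reshape(-1, 1)
groups = np.array([0, 0, 1, 1, 2, 2, 3, 3, 4, 4])  # duplicated rows share a group

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, groups=groups))
# No group appears in both the train and test sets:
assert set(groups[train_idx]).isdisjoint(groups[test_idx])
```

This is what prevents near-duplicate rows from leaking between train and test and inflating the score.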
Questions and comments:
Fujitsu
Fujitsu actively participates in the Consortium remote events. Fujitsu would be glad to increase contributions to scikit-learn from Japan. Fujitsu suggests organizing a sprint in the Japanese time zone, and starting a discussion about good practices for organizing online sprints with the team there.
Microsoft
More information about the MOOC is requested.
The training activity is an Inria initiative: the MOOC consists of 20 hours of training, focusing on general issues in data analysis and machine learning; formalism is not developed. The language used is English. All the material will also be available outside the MOOC (GitHub, YouTube…).
People working on the MOOC also give part of their time to work on the code base of scikit-learn.
AXA
AXA is glad to participate in the interpretability workshop that will take place in the afternoon.
AXA is designing an online training series that will be launched in the fall; in principle it is dedicated to AXA employees. It could be interesting to share content with the Inria platform and offering.
BNP Paribas
BNP is starting a similar initiative about machine learning training: the same comment stands, it could be really interesting to share content with the Consortium.
November 5, 2020
Presentation of the technical achievements and ongoing work (O. Grisel)
Priority list for the consortium at Inria, year 2020–2021
From the discussion during the technical committee, the scikit-learn Consortium at Inria defined the following list of priorities for the coming year:
On the community side:
Before describing the optimization details, let’s recall the principles of the algorithm. The goal is to group data points into clusters, based on their distance from the cluster centers. We start with a set of data points and a set of centers. First, the distances between all points and all centers are computed, and for each point the closest center is identified: during this step, a label is attached to each point. Then, each center is updated to be the barycenter of its assigned data points.
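The two steps above can be sketched in NumPy (an illustrative simplification; the actual scikit-learn implementation is in Cython):

```python
# Minimal sketch of one Lloyd iteration: assignment, then center update.
import numpy as np

def lloyd_step(X, centers):
    # Assignment: label each point with the index of its closest center.
    dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    labels = dist.argmin(axis=1)
    # Update: each center becomes the barycenter of its assigned points.
    new_centers = np.array([X[labels == k].mean(axis=0)
                            for k in range(centers.shape[0])])
    return labels, new_centers

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 2))
labels, centers = lloyd_step(X, X[:3])  # first 3 points as initial centers
```

This naive version materializes the full (n_samples, n_clusters) distance array, which is exactly the memory/speed trade-off discussed next.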
A benchmark comparison with daal4py, the Python wrapper for Intel’s DAAL library, showed that a significant speedup could be hoped for, both in sequential and in parallel runs (the discussion, initiated by François Fayard, started here). Furthermore, a preliminary profiling showed that the computation of the distances is the critical part of the algorithm, but finding the labels and updating the centers is also not negligible and would quickly become the bottleneck once the first part is well optimized.
The previous implementation exposed a parameter, called precompute_distances, aimed at switching between memory and speed optimization. Favoring speed means that all distances are computed at once and stored in a distance matrix of shape (n_samples, n_clusters); labels are then computed on this distance matrix. It’s fast, especially because it can be done using a BLAS (Basic Linear Algebra Subprograms) library, which is optimized for the different families of CPUs. The drawback is that it creates a potentially large temporary array which can take a lot of memory. On the other hand, favoring memory efficiency means that distances are computed one at a time and labels are updated on the fly. There is no temporary array, but it’s slower, because the distance computation cannot be vectorized.
Besides causing memory issues, a large temporary array does not provide optimal speed either. Indeed, moving data from the RAM into the CPU and vice versa is quite slow. If we need a variable several times for our computations but have to fetch it from the RAM each time, we waste a lot of time. This is what happens in the k-means algorithm: back and forth between point and center positions to update labels and distances. Ideally, we want the data to stay as close to the CPU as possible, meaning in the CPU cache, while it’s needed for the computations.
The solution we chose is to compute the distances for only a chunk of data at a time, creating a temporary array of shape (chunk_size, n_clusters).
Choosing the right chunk size is crucial. A CPU can perform the same operation on several variables at once in a single instruction (SIMD, for Single Instruction, Multiple Data). If the temporary array is too small, we don’t fully exploit the vectorization capabilities of the CPU. If it is too large, it does not fit in the CPU cache. This is clearly visible in the figure beside. We chose a chunk size of 256 (2⁸) samples: it guarantees that the temporary array fits in the CPU cache, which is typically a few MB, while keeping good vectorization capability.
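A NumPy sketch of the chunked label assignment described above (illustrative only; the real implementation is in Cython and uses BLAS for the distances):

```python
# Chunked label assignment: the temporary distance array has shape
# (chunk_size, n_clusters) instead of (n_samples, n_clusters).
import numpy as np

def assign_labels_chunked(X, centers, chunk_size=256):
    n_samples = X.shape[0]
    labels = np.empty(n_samples, dtype=np.intp)
    for start in range(0, n_samples, chunk_size):
        chunk = X[start:start + chunk_size]
        # Small temporary array, sized to stay in the CPU cache.
        dist = ((chunk[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels[start:start + chunk_size] = dist.argmin(axis=1)
    return labels

rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 5))
centers = X[:8]
labels = assign_labels_chunked(X, centers)
```

The results are identical to the full-matrix computation; only the size of the temporary buffer changes.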
Overall, this new implementation is faster than both previous ones and has a very small memory footprint (only a few MB). It also allowed us to simplify the API by deprecating the precompute_distances parameter. Single-core benchmarks are shown in the figure below: timing measurements on the left and the corresponding speedups on the right.
The new implementation also changed the parallelism scheme. Previously, a first level of parallelism, handled by the joblib library, was implemented at the outermost level. The n_jobs parameter was used to control the number of processes running the n_init complete runs of k-means (despite its name, n_init is actually about complete runs, not just the initialization). That meant that we couldn’t use the full capacity of a machine with more than n_init cores (the default is 10, and it is usually not useful to pick a bigger value). Another level of parallelism came from the BLAS library used in the computation of the distances. However, the other steps of the iteration loop are sequential, which prevents good scalability.
In version 0.23, we decided to move the outer parallelism to the data level. For one chunk of data, we can compute all distances between the points and the clusters, find the labels, and even compute a partial update of the centers. Here, the parallelism is implemented using the OpenMP library in Cython. Putting the parallelism at this level gives us much better scalability, and we can now fully benefit from all the cores of the CPU, even if the user decides to use n_init=1.
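The per-chunk partial center update mentioned above can be sketched as follows (illustrative NumPy, not the actual Cython/OpenMP code): each chunk contributes partial sums and counts, and the centers are finalized in a single pass at the end.

```python
# Sketch of per-chunk partial center updates for k-means.
import numpy as np

def partial_center_update(X, centers, chunk_size=256):
    n_clusters = centers.shape[0]
    sums = np.zeros_like(centers)
    counts = np.zeros(n_clusters)
    for start in range(0, X.shape[0], chunk_size):
        chunk = X[start:start + chunk_size]
        dist = ((chunk[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dist.argmin(axis=1)
        # Accumulate partial sums and counts for each cluster.
        for k in range(n_clusters):
            mask = labels == k
            sums[k] += chunk[mask].sum(axis=0)
            counts[k] += mask.sum()
    # Finalize, guarding against empty clusters.
    nonempty = counts > 0
    new_centers = centers.copy()
    new_centers[nonempty] = sums[nonempty] / counts[nonempty, None]
    return new_centers

rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 3))
new_centers = partial_center_update(X, X[:4])
```

Because each chunk's accumulation is independent, this is the step that can be distributed over OpenMP threads.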
The figures below show the time to fit a KMeans instance with n_init=1 (on a large dataset on the left and on a small dataset on the right) for various numbers of available cores. Green and blue curves concern scikit-learn 0.22. There is barely any scalability on a large dataset (time is only reduced by a factor of 2 between 1 and 16 cores) and no scalability at all on a small dataset. Red and orange curves concern scikit-learn 0.23. Scalability is much better, and near perfect on large datasets if we ignore the initialization (orange curve). We discuss the scalability issues of the initialization in the last section.
In this new implementation, the parallelism at the data level is able to fully exploit all the available cores of the CPU, which means that the parallelism from the distance computations can lead to thread oversubscription, i.e. more threads trying to run simultaneously than there are available cores. We had to find a way to disable this second level of parallelism coming from the BLAS library. This was the main challenge of this rework of KMeans. It led to the development of a new Python library, threadpoolctl, to dynamically control, at the Python level, the number of threads used by native C libraries like OpenMP and several BLAS libraries. Threadpoolctl is now a dependency of scikit-learn, and we hope that it will be used more in the wider Python ecosystem.
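A sketch of how threadpoolctl can cap BLAS threads around a computation (assuming threadpoolctl is installed; this mirrors the oversubscription fix described above, not scikit-learn's internal code):

```python
# Limit BLAS threads for the duration of a BLAS-backed computation.
import numpy as np
from threadpoolctl import threadpool_limits

a = np.random.RandomState(0).normal(size=(200, 200))
with threadpool_limits(limits=1, user_api="blas"):
    # The matrix product runs single-threaded inside this block,
    # leaving the outer parallelism in sole control of the cores.
    b = a @ a
```

Outside the context manager, the BLAS library's original thread count is restored automatically.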
Latest benchmarks still show that DAAL is faster than the 0.23 scikit-learn implementation, by a factor of up to two. Improving performance further will require optimizations, essentially regarding vectorization, that we cannot apply at the Cython level.
However, there’s still room for improvement regarding the initialization of the centers (k-means++). It still has poor scalability, and since it takes a significant proportion of a run of KMeans, the whole estimator does not scale in an optimal way, as shown in the figure above. Although we think that a rework of k-means++ might be possible, a simpler solution might be to run the initialization on a subset of the data (a discussion has been started here). We hope this would make the initialization take a negligible proportion of the whole run of KMeans, even if it does not solve the scalability issue.
February 3, 2020
Presentation of the technical achievements and ongoing work (O. Grisel)
Priority list for the consortium at Inria, year 2020–2021
From the discussion during the technical committee, the scikit-learn Consortium at Inria defined the following list of priorities for the coming year:
On the community side: