At Hugging Face, we’ve been putting a lot of effort into supporting deep learning, but we believe that machine learning as a whole can benefit from the tools we release. With statistical machine learning being essential in this field and scikit-learn dominating statistical ML, we’re excited to partner and move forward together.
As of September 2022, the Hugging Face Hub already hosts nearly 4,000 tabular classification and tabular regression model checkpoints, and we expect this number to keep growing.
Starting June 2022, Hugging Face is an official sponsor of the scikit-learn consortium. Through this support, Hugging Face actively promotes the development and sustainability of scikit-learn. As a sponsor of the scikit-learn consortium, hosted at the Inria foundation, we’ll now participate in the scikit-learn consortium technical committee.
To help sustain the development of the library, we’re happy to welcome Adrin Jalali and Benjamin Bossan to the Hugging Face team. Adrin is a core developer of scikit-learn as well as fairlearn, while Benjamin is the author of the skorch library and is now a contributor to scikit-learn.
Hugging Face is happy to support the development of scikit-learn through code contributions, issues, pull requests, reviews, and discussions.
Skops is the name of the framework being actively developed as the link between the scikit-learn and Hugging Face ecosystems. With Skops, we hope to facilitate essential workflows:
Working at the intersection of scikit-learn and the Hub brings challenges from both platforms. One of these challenges is secure persistence: the ability to serialize models in a safe, secure manner.
scikit-learn models (estimators, predictors, …) are usually saved using pickle, which is notorious for being an insecure format. Sharing scikit-learn models in this format exposes receivers to potentially malicious payloads that can execute arbitrary code when loaded.
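To illustrate the danger, here is a minimal, harmless sketch of the mechanism (the payload is hypothetical; a real attacker would return something like `os.system` instead of a benign `eval`):

```python
import pickle

class Malicious:
    # pickle calls __reduce__ when serializing: whatever callable it returns
    # is executed, with the given arguments, at *load* time.
    def __reduce__(self):
        # Harmless stand-in for attacker-chosen code.
        return (eval, ("1 + 1",))

payload = pickle.dumps(Malicious())
result = pickle.loads(payload)  # attacker-chosen code runs here
print(result)  # -> 2
```

Nothing in the pickled bytes hints that loading them runs code, which is why loading pickles from untrusted sources is unsafe by design.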
That’s where secure persistence comes in: as the Hugging Face Hub aims to provide a platform for models, the ability to share safe, secure objects is essential. We’ve been working on adding secure persistence for scikit-learn models in skops#128 and skops#145 (doc preview). Instead of serializing using pickle, the object’s contents are put into a zip file with an accompanying JSON schema file.
Read about the Skops library in the following blog post: Introducing Skops.
Skops is one example of integrating scikit-learn within our tools, but it is not the only one! We will strive to integrate with the rest of our ecosystem so that Hugging Face users may benefit from scikit-learn tools and vice versa.
An example is the `evaluate` library, dedicated to efficiently evaluating machine learning models and datasets. We aim for this tool to natively support scikit-learn metrics in its API.
—
Through these efforts, we hope to kickstart a lasting relationship between the two ecosystems and provide simple, efficient bridges to lower the barrier of entry. We believe that education and model sharing are the best way to foster inclusive machine learning from which all can benefit. We’re excited to partner with scikit-learn for this endeavor.
What is a scikit-learn sprint, you may ask? The scikit-learn sprint is a hands-on “hackathon” where we work on issues in the scikit-learn GitHub repository and learn to contribute to open source. This sprint included an introductory and practical workshop about contributing to open-source software.
So cool to see this community gather to contribute to data scientists’ companion @scikit_learn #pariswimlds #DataScience #MachineLearning #OpenSource
— Giulia Bianchi (@Giuliabianchl) March 12, 2022
Under the guidance of no fewer than five people from the scikit-learn team, the participants set up their environments, learned how to use tools ranging from conda to git, Black and pytest, and finally made their first pull requests!
If you don’t feel like solving issues and submitting pull requests, another way to contribute to scikit-learn as a user is simply to open issues when you run into one! By the way, a good way of doing so is to write a minimal reproducer. Note that there are many other ways to contribute to scikit-learn, organizing such an event being one of them. Feel free to contact me if you would like to do so.
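For instance, a minimal reproducer might look like the following (a hypothetical template, not a real bug report): it is self-contained, uses small synthetic data, and pins the random seed so maintainers can run it as-is.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Small, seeded, synthetic data: anyone can rerun this exact snippet.
rng = np.random.RandomState(0)
X = rng.normal(size=(20, 3))
y = rng.randint(0, 2, size=20)

clf = LogisticRegression().fit(X, y)
# In a real report, you would print the observed result here and state
# what you expected instead, along with your scikit-learn version.
print(clf.score(X, y))
```

The shorter and more self-contained the snippet, the faster a maintainer can confirm (or explain) the behaviour.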
For a full replay of the event, you can check Chloé Azencott’s Twitter page and the #pariswimlds hashtag: she documented every step, from the T-shirt distribution in the morning to the successful pull requests, without forgetting learning how to use VS Code and, of course, the pizza. But there was also fruit, tea and fruit juice; we were not there to conform to stereotypes.
There’s one more thing developers need after pizza, before they can spend their Saturday afternoon working on pull requests at the @WiMLDS_Paris @scikit_learn sprint! #pariswimlds pic.twitter.com/8eZPb981xB
— Dr Chloé Azencott (@cazencott) March 12, 2022
A couple of numbers from this sprint:
Happy to take part in the @scikit_learn sprint with #wimlds members. Thanks @CybelAngel for hosting this nice event. pic.twitter.com/3x5ZDWDYHX
— Victoire Louis (@VictoireLouisC) March 12, 2022
If you want to do this at home, here are a couple of links to guide you in starting to contribute to scikit-learn:
All the setup and guidelines were explained in a specific GitHub repository that you can find here. It was made to be crystal clear and guides you step by step: you can rely on it if you want to start. During this sprint, a couple of issues, or more exactly meta-issues (issues listing a problem in plenty of different places, to be fixed individually), were listed on a specific board. Although we made some progress, they are still open if you want to have a look! The maintainers and core contributors of scikit-learn label some issues as “good first issue”: it’s a label to encourage people to step in with these easier issues (#21350, #22406). We also worked on #11000, which is labeled as “hard”, but some participants successfully tackled it.
If you are interested, scikit-learn also has a webpage explaining how to contribute.
Some participants were interviewed during the sprint…
… And you can find their reactions to this sprint here.
We would like to thank our mentors Olivier Grisel, Adrin Jalali, Maren Westermann, Béa Hernandez, and Gaël Varoquaux. Thank you for coming, sometimes from quite far away, for being pedagogical, and for reviewing and accepting the pull requests so quickly!
The organization of this sprint was made possible by WiMLDS Paris and especially Chloé Azencott, who was instrumental in making the connection between the participants and us, the scikit-learn community team.
Finally, thank you to CybelAngel for hosting us, to Giulia Bianchi for presenting CybelAngel, and to Marie Sacksick for the perfect logistics of this event.
The joblib team held a development sprint from February 2 to 4. For the record, joblib is the parallel-computing engine used by scikit-learn. Improving it provides better parallelization capabilities for the whole Python community. If you want to know more about joblib and its amazing functionalities, go and check this page.
During this development sprint, we spent some time discussing the aim of the library and where it should go from now on. These are some of our takeaways:
- The threading backend can become much more suitable for single-machine parallelism: it has less overhead (no need for pickling and inter-process communication), it is much faster to start and stop (no need to re-import packages in workers), and it has reduced memory usage (a single Python process, no need to duplicate or memory-map large data arguments).
- Other tools already cover neighbouring needs (concurrent.futures in the CPython stdlib, dask and ray for distributed computing). Also, for advanced usage, there is a need for a scheduler, which is out of scope for joblib.
Therefore, the goal of joblib is to provide the simplest API for embarrassingly parallel computing, with fault-tolerance tools and backend switching, to go from sequential execution on a laptop to distributed runs on a cluster.
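That “simplest API” goal is easy to illustrate with joblib’s existing Parallel/delayed helpers:

```python
from math import sqrt
from joblib import Parallel, delayed

# Embarrassingly parallel map: every call is independent, so joblib can
# dispatch them to threads, processes, or a distributed backend without
# any change to the calling code.
results = Parallel(n_jobs=2)(delayed(sqrt)(i ** 2) for i in range(5))
print(results)  # -> [0.0, 1.0, 2.0, 3.0, 4.0]
```

Switching backends (e.g. to threads via `Parallel(n_jobs=2, prefer="threads")`) keeps this call site unchanged, which is exactly the property the sprint discussions aimed to preserve.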
- The Memory object is often overlooked and can be hard to debug.
- Regarding the threading backend: as a parallel-computation community, we should push to make some large benchmarks to evaluate how efficient the approach recently proposed in CPython is, and report back to its developers.
- The joblib API does not allow reducing the results before the end of the computations, which poses memory-overflow issues. We are working on the return_generator feature, which would allow consuming the results as they are ready. We also discussed more advanced reducing operations that would minimize the communication overhead, to be explored in the future, with the notion of a reducer on each worker.
- Memory debugging tools: the disk caching can be hard to set up when discovering the joblib library. In particular, this is true for cache misses triggered by passing almost identical numerical arrays. Improving the debugging tools for Memory would help the adoption of this pattern in the community. We proposed several tools, for instance a JOBLIB_DEBUG=1 mode that stores the list of memory.cache calls, so that a second run can compare the hashes and make a diff of the arguments.
- Checkpointing with Memory: checkpointing is a classical pattern to improve fault tolerance; it could be interesting to add it to Memory.

During the sprint, we concretely worked on the following issues and PRs:
- Simplifying loky (joblib/loky#304 – P. Maevskikh, P. Glaser, G. Le Maitre, J. Du Boisberranger, O. Grisel): the code of loky was very complex due to numerous backports of functionalities and workarounds necessary to support both recent CPython versions and the oldest Python 2.7. As we drop this support, we can greatly simplify the code base and reduce the CI maintenance complexity.
- Making joblib importable on pyodide to run sequential computations (joblib/joblib#1256 – H. Chatham, G. Varoquaux; this illustrates the mission of joblib: the same code running from the browser to the datacenter).
- return_generator=True in Parallel for asynchronous reduction operations (joblib/joblib#588 – F. Charras, T. Moreau).
- Running joblib on the nogil branch of CPython and debugging the observed race conditions with gdb, so as to be able to report them to upstream projects once we can craft minimal reproducers (work in progress – J. Jerphanion).
- Parallel calls when running with the dask backend (dask/distributed#5757 – P. Glaser).

Thanks for reading, we leave you with some green buttons.
June 2, 2021

Schedule:
- 9.10 am – 10 am: Presentation of the technical achievements and ongoing work, by O. Grisel
- 10 am – 12 pm: Feedback and exposition from each partner of the consortium
- 1 pm – 2 pm: Collaborative drafting session of the updated roadmap
- 2 pm – 5 pm: Afternoon discussions on Discord

Attendees:
- Scikit-learn @ Fondation Inria: Alexandre Gramfort (Inria, advisory committee of the consortium), Olivier Grisel (consortium engineer), Guillaume Lemaître (consortium engineer), Jérémie Du Boisberranger (consortium engineer), Chiara Marmo (consortium COO), Loïc Estève (Inria engineer), Mathis Batoul (Inria intern), Julien Jerphanion (Inria engineer), Gaël Varoquaux (consortium director)
- Consortium partners: AXA, Dataiku, Fujitsu, Microsoft, BNP Paribas Cardif
- Scikit-learn community: Adrin Jalali (community member of the advisory board, at Zalando), Joel Nothman (community member of the advisory board, at the University of Sydney)
From the discussion during the technical committee, the scikit-learn consortium at Inria defined the following list of priorities for the coming year:
On the community side:
Exposure of partners’ comments and priorities
Fujitsu:
AXA:
BNP Paribas Cardif:
Microsoft:
Dataiku:
Consortium members provide their financial support without any counterpart. Defining a roadmap for the consortium’s software development and, more generally, its activities is an important step in building a relationship of trust between the members of the consortium. It represents our joint effort to identify the path that leads to the success of the library, both in terms of its impact on the software market and of the stabilization of its community. The roadmap sets our direction.

The roadmap only commits the developers hired by the consortium. Moreover, it must take into account the needs of the community, in order to avoid conflicts of interest and needless duplication of effort.

The technical committee of the scikit-learn consortium is composed of:

They are responsible for drawing up the roadmap.

During the technical committee meeting, the consortium’s technical team details the work of the previous months. Then, each member describes how scikit-learn is used on their side and which features could be added to or improved in the library. The discussions that follow aim to prioritize the proposed features and to elaborate a common strategy to address them. When a feature is already on the scikit-learn roadmap, it is more easily pushed forward. Sometimes our funders can make developer time available to start working on a prototype: this is very useful to move a specific feature forward.

The consortium roadmap contains the list of features recommended by the technical committee; where applicable, links to the issues and pull requests already opened are also given. The roadmap also contains non-technical recommendations, for example about means of supporting the community or the release frequency of the package. Consortium members are invited to propose themes for the organization of development sprints or seminars. As an example, the consortium recently organized an event on the interpretability and explainability of models produced by machine-learning methods.
Like real life, real-world data is most often far from normal. Still, data is often assumed, sometimes implicitly, to follow a Normal (or Gaussian) distribution. The two most important assumptions made when choosing a Normal distribution, or equivalently the squared error, for regression tasks are^{1}:
On top of that, the squared error is well known to be sensitive to outliers. Here, we want to point out that potentially better alternatives are available.
Typical instances of data that is not normally distributed are counts (discrete) or frequencies (counts per some unit). For these, the simple Poisson distribution might be much better suited. A few examples that come to mind are:
In what follows, we have chosen the diamonds dataset to show the non-normality of such targets and the convenience of GLMs in modelling them.
The diamonds dataset consists of prices of over 50 000 round cut diamonds with a few explanatory variables, also called features X, such as ‘carat’, ‘color’, ‘cut quality’, ‘clarity’ and so forth. We start with a plot of the (marginal) cumulative distribution function (CDF) of the target variable price and compare it to fitted Normal and Gamma distributions, each of which has two parameters.
These plots show clearly that the Gamma distribution might be a better fit to the marginal distribution of Y than the Normal distribution.
Let’s start with a more theoretical intermezzo in the next section; we will return to the diamonds dataset afterwards.
GLMs are statistical models for regression tasks that aim to estimate and predict the conditional expectation of a target variable Y, i.e. E[Y|X]. They unify many different target types under one framework: Ordinary Least Squares, Logistic, Probit and multinomial models, Poisson regression, Gamma regression and many more. GLMs were formalized by John Nelder and Robert Wedderburn in 1972, long after artificial neural networks!
The basic assumptions for an instance or data row i are

μ_{i} = E[Y_{i}|x_{i}] = h(x_{i}^{T} w),

where μ_{i} is the mean of the conditional distribution of Y given x_{i}, h is the (inverse) link function, and w are the coefficients to be fitted.
One needs to specify:
Note that the choice of the loss or distribution function, or equivalently the variance function, is crucial. It should, at least, reflect the domain of Y.
Some typical example combinations are:
Once you have chosen the first four points, what remains is to find a good feature matrix X. Unlike other machine learning algorithms such as boosted trees, there are very few hyperparameters to tune. A typical hyperparameter is the regularization strength when penalties are applied. Therefore, the biggest leverage to improve your GLM is manual feature engineering of X. This includes, among others, feature selection, encoding schemes for categorical features, interaction terms, and non-linear terms like x^{2}.
The new GLM regressors are available as

from sklearn.linear_model import PoissonRegressor
from sklearn.linear_model import GammaRegressor
from sklearn.linear_model import TweedieRegressor

The TweedieRegressor has a parameter power, which corresponds to the exponent of the variance function v(μ) ∼ μ^{p}. For ease of the most common uses, PoissonRegressor and GammaRegressor are the same as TweedieRegressor(power=1) and TweedieRegressor(power=2), respectively, with a built-in log link. All of them also support an L2 penalty on the coefficients, set by the penalization strength alpha. The underlying optimization problem is solved via the lbfgs solver of scipy. Note that the scikit-learn release 0.23 also introduced the Poisson loss for the histogram gradient boosting regressor as HistGradientBoostingRegressor(loss='poisson').
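As a quick, self-contained illustration of the new estimators (toy count data made up for this sketch, not from the post):

```python
import numpy as np
from sklearn.linear_model import PoissonRegressor

# Synthetic count data: y holds non-negative integers, the natural domain
# for a Poisson GLM.
rng = np.random.RandomState(42)
X = rng.uniform(size=(200, 2))
y = rng.poisson(lam=np.exp(1.0 + X[:, 0]))

model = PoissonRegressor(alpha=1e-4)  # small L2 penalty
model.fit(X, y)

# Thanks to the built-in log link, predictions are always strictly positive,
# which plain least squares does not guarantee.
print(model.predict(X).min() > 0)  # -> True
```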
After all this theory, it is time to come back to our real world dataset: diamonds.
Although in the first section we were analysing the marginal distribution of Y rather than the conditional (on the features X) distribution, we take the plot as a hint to fit a Gamma GLM with log link, i.e. h(x) = exp(x). Furthermore, we split the data, textbook-like, into an 80% training set and a 20% test set^{2}, and use ColumnTransformer to handle columns differently. Our feature engineering consists of selecting only the four columns ‘carat’, ‘clarity’, ‘color’ and ‘cut’, log-transforming ‘carat’, and one-hot encoding the other three. Fitting a Gamma distribution and predicting on the test sample gives us the plot below.
Note that fitting Ordinary Least Squares on log(‘price’) also works quite well. This is to be expected, as the Log-Normal and Gamma distributions are very similar, both with Var[Y] ∼ E[Y]^{2} = μ^{2}.
There are several open issues and pull requests for improving GLMs and the fitting of non-normal data. Some of them have already been implemented in scikit-learn 0.24; let’s hope the others will be merged in the near future:
By Christian Lorentzen
^{1} Algorithms and estimation methods are often well able to deal with some deviation from the Normal distribution. In addition, the central limit theorem justifies a Normal distribution when considering averages or means, and the Gauss–Markov theorem is a cornerstone for usage of least squares with linear estimators (linear in the target Y).
^{2} Rows in the diamonds dataset seem to be highly correlated: there are many rows with the same values for carat, cut, color, clarity and price, while the values for x, y and z seem to be permuted. Therefore, we define a new group variable that is unique per combination of ‘carat’, ‘cut’, ‘color’, ‘clarity’ and ‘price’, and then split stratified by group, i.e. using a GroupShuffleSplit. Having correlated train and test sets would invalidate the independent and identically distributed assumption and might render test scores too optimistic.
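A minimal illustration of grouped splitting (toy group ids, not the actual diamonds grouping): rows sharing a group id never end up on both sides of the split, which prevents leakage from near-duplicate rows.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Four groups of two near-duplicate rows each.
groups = np.array([0, 0, 1, 1, 2, 2, 3, 3])
X = np.arange(16).reshape(8, 2)
y = np.arange(8)

splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=groups))

# No group appears in both the train and test sets.
print(set(groups[train_idx]) & set(groups[test_idx]))  # -> set()
```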
Questions and comments:
Fujitsu
Fujitsu actively participates in the consortium’s remote events. Fujitsu would be glad to increase contributions to scikit-learn from Japan. Fujitsu suggests organizing a sprint for the Japanese time zone, and starting a discussion with the team there about good practices for organizing online sprints.
Microsoft
More information about the MOOC is requested.

The training activity is an Inria initiative: the MOOC consists of 20 hours of training, focusing on general issues in data analysis and machine learning; the formalism is not developed. The language used is English. All the material will also be available outside the MOOC (GitHub, YouTube…).

People working on the MOOC also give part of their time to work on the scikit-learn code base.
AXA
AXA is glad to participate in the interpretability workshop that will take place in the afternoon.
AXA is designing an online training series that will be launched in the fall; in principle it is dedicated to AXA employees. It could be interesting to share content with the Inria platform and its offering.
BNP Paribas
BNP Paribas is starting a similar initiative about machine-learning training: the same comment stands, it could be really interesting to share content with the consortium.
November 5, 2020
Presentation of the technical achievements and ongoing work (O. Grisel)
Priority list for the consortium at Inria, year 2020–2021
From the discussion during the technical committee, the scikitlearn Consortium at Inria defined the following list of priorities for the coming year:
On the community side:
Before describing the optimization details, let’s recall the principles of the algorithm. The goal is to group data points into clusters, based on their distance to cluster centers. We start with a set of data points and a set of centers. First, the distances between all points and all centers are computed, and for each point the closest center is identified: during this step, a label is attached to each point. Then, each center is updated to be the barycenter of its assigned data points.
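The two steps described above can be sketched in plain NumPy (an illustrative reference implementation, not scikit-learn’s optimized Cython code):

```python
import numpy as np

def lloyd_step(X, centers):
    """One k-means iteration: label each point with its closest center,
    then move each center to the barycenter of its assigned points."""
    # Pairwise squared distances, shape (n_samples, n_clusters).
    distances = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    labels = distances.argmin(axis=1)
    # Keep an empty cluster's center in place instead of producing NaNs.
    new_centers = np.array([
        X[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
        for k in range(len(centers))
    ])
    return labels, new_centers

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 2))
centers = X[:3].copy()  # naive initialization for the sketch
for _ in range(10):
    labels, centers = lloyd_step(X, centers)
```

The `distances` array in this naive version has shape (n_samples, n_clusters); controlling the size and lifetime of exactly that temporary array is what the optimizations below are about.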
A benchmark comparison with daal4py, the Python wrapper for Intel’s DAAL library, showed that a significant speedup could be hoped for, both in sequential and in parallel runs (the discussion, initiated by François Fayard, started here). Furthermore, a preliminary profiling showed that the computation of the distances is the critical part of the algorithm, but finding the labels and updating the centers is not negligible either and would quickly become the bottleneck once the first part is well optimized.
The previous implementation exposed a parameter, called precompute_distances, intended to switch between memory and speed optimization. Favoring speed means that all distances are computed at once and stored in a distance matrix of shape (n_samples, n_clusters); labels are then computed on this distance matrix. It’s fast, especially because it can be done using a BLAS (Basic Linear Algebra Subprograms) library, which is optimized for the different families of CPUs. The drawback is that it creates a potentially large temporary array which can take a lot of memory. On the other hand, favoring memory efficiency means that distances are computed one at a time and labels are updated on the fly. There is no temporary array, but it’s slower, because the distance computation cannot be vectorized.
Besides causing memory issues, a large temporary array does not provide optimal speed either. Indeed, moving data between the RAM and the CPU is quite slow. If we need a variable several times for our computations but have to fetch it from RAM each time, we waste a lot of time. This is what happens in the k-means algorithm: back and forth between point and center positions to update labels and distances. Ideally, we want the data to stay as close to the CPU as possible, meaning in the CPU cache, while it is needed for the computations.
The solution we chose is to compute the distances for only a chunk of data at a time, creating a temporary array of shape (chunk_size, n_clusters).
Choosing the right chunk size is crucial. A CPU can apply the same operation to several variables at once in a single instruction (this is called SIMD, for Single Instruction Multiple Data). If the temporary array is too small, we don’t fully exploit the vectorization capabilities of the CPU. If it is too large, it does not fit in the CPU cache. We can clearly see this in the figure beside. We chose a chunk size of 256 (2⁸) samples: it guarantees that the temporary array fits in the CPU cache, which is typically a few MB, while keeping good vectorization capability.
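A simplified NumPy sketch of the chunking idea (scikit-learn’s actual implementation is in Cython and also accumulates partial center updates per chunk; here we only compute labels):

```python
import numpy as np

def chunked_labels(X, centers, chunk_size=256):
    """Assign each point to its closest center chunk by chunk, so the
    temporary distance array has shape (chunk_size, n_clusters) and is
    small enough to stay in the CPU cache."""
    labels = np.empty(X.shape[0], dtype=np.intp)
    # ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2; the ||x||^2 term is constant
    # per row and does not change the argmin, so it can be dropped.
    c_norms = (centers ** 2).sum(axis=1)
    for start in range(0, X.shape[0], chunk_size):
        chunk = X[start:start + chunk_size]
        # Small BLAS-backed matmul per chunk instead of one huge matrix.
        dist = c_norms - 2.0 * (chunk @ centers.T)
        labels[start:start + chunk_size] = dist.argmin(axis=1)
    return labels

rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 5))
labels = chunked_labels(X, X[:8])  # centers taken from the data for the demo
```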
Overall, this new implementation is faster than both previous implementations and has a very small memory footprint (only a few MB). Also, this allowed us to simplify the API by deprecating the precompute_distances parameter. Benchmarks on single core are shown in the figure below. Timing measurements are on the left and the corresponding speedups on the right.
The new implementation also changed the parallelism scheme. Previously, a first level of parallelism, handled by the joblib library, was implemented at the outermost level. The n_jobs parameter was used to control the number of processes running the n_init complete runs of k-means (despite its name, n_init is actually about complete runs, not just the initialization). That meant that we couldn’t use the full capacity of a machine with more than n_init cores (the default is 10, and it is usually not useful to pick a bigger value). Another level of parallelism came from the BLAS library used in the computation of the distances. However, the other steps of the iteration loop are sequential, which prevents good scalability.
In version 0.23, we decided to move the outer parallelism to the data level. For one chunk of data, we can compute all distances between the points and the clusters, find the labels, and even compute a partial update of the centers. Here, the parallelism is implemented using the OpenMP library in Cython. Putting the parallelism at this level gives us much better scalability, and we can now fully benefit from all the cores of the CPU, even if the user decides to use n_init=1.
The figures below show the time to fit a KMeans instance with n_init=1 (on a large dataset on the left and on a small dataset on the right) for various numbers of available cores. Green and blue curves concern scikit-learn 0.22. There is barely any scalability on a large dataset (time is only reduced by a factor of 2 between 1 and 16 cores) and no scalability at all on a small dataset. Red and orange curves concern scikit-learn 0.23. Scalability is much better, and near perfect on large datasets if we ignore the initialization (orange curve). We discuss the scalability issues of the initialization in the last section.
In this new implementation, the parallelism at the data level is able to fully exploit all the available cores of the CPU, which means that the parallelism from the distance computation can lead to a situation of thread oversubscription, i.e. more threads than available cores trying to run simultaneously. We had to find a way to disable this second level of parallelism coming from the BLAS library. This was the main challenge of this rework of KMeans. This challenge led to the development of a new Python library, threadpoolctl, to dynamically control, at the Python level, the number of threads used by native C libraries like OpenMP and several BLAS libraries. Threadpoolctl is now a dependency of scikit-learn, and we hope that it will be used more in the wider Python ecosystem.
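For example, threadpoolctl can be used directly to cap the threads of the BLAS backing NumPy (illustrative usage, not scikit-learn’s internal call site):

```python
import numpy as np
from threadpoolctl import threadpool_limits

# Temporarily cap the BLAS thread pool, e.g. to avoid oversubscription when
# BLAS-backed operations run inside an outer parallel loop.
a = np.random.default_rng(0).normal(size=(200, 200))
with threadpool_limits(limits=1, user_api="blas"):
    result = a @ a  # runs on a single BLAS thread inside this block
print(result.shape)  # -> (200, 200)
```

Outside the `with` block, the BLAS library goes back to its previous thread count, so the limit is scoped exactly to the section that needs it.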
Latest benchmarks still show that DAAL is faster than the 0.23 scikit-learn implementation, by a factor of up to two. Improving performance further will require optimizations, essentially regarding vectorization, that we cannot apply at the Cython level.
However, there is still room for improvement regarding the initialization of the centers (k-means++). It still has poor scalability, and since it takes a significant proportion of a KMeans run, the whole estimator does not scale optimally, as shown in the figure above. While we think a rework of k-means++ might be possible, a simpler solution might be to run the initialization on a subset of the data (a discussion has been started here). We hope this would make the initialization a negligible proportion of the whole KMeans run, even if it does not solve the scalability issue itself.