June 2, 2021
9.10 am – 10 am  Presentation of the technical achievements and ongoing work by O. Grisel 
10 am – 12 pm  Feedback and exposition of each partner of the consortium 
1 pm – 2 pm  Collaborative drafting session of the updated roadmap 
2 pm – 5 pm  Afternoon discussions on Discord 
scikit-learn @ Fondation Inria  Alexandre Gramfort (Inria, advisory committee of the consortium), Olivier Grisel (consortium engineer), Guillaume Lemaître (consortium engineer), Jérémie Du Boisberranger (consortium engineer), Chiara Marmo (consortium COO), Loic Esteve (Inria engineer), Mathis Batoul (Inria intern), Julien Jerphanion (Inria engineer), Gaël Varoquaux (Consortium Director) 
Consortium partners  AXA:
Dataiku:
Fujitsu:
Microsoft:
BNP Paribas Cardif:

scikit-learn community  Adrin Jalali (community member of the advisory board, at Zalando)
Joel Nothman (community member of the advisory board, at the University of Sydney) 
From the discussion during the technical committee, the scikit-learn Consortium at Inria defined the following list of priorities for the coming year:
On the community side:
Exposure of partners’ comments and priorities
Fujitsu:
AXA:
BNP Paribas Cardif:
Microsoft:
Dataiku:
The members of the Consortium provide their financial support without any direct compensation. Defining a roadmap for the software development and, more generally, the activities of the project is an important step in building a relationship of trust between the members of the Consortium. It represents our joint effort to identify the path that leads to the success of the library, both in terms of its impact on the software market and of the stabilization of its community. The roadmap sets our direction.
The roadmap only commits the developers hired by the Consortium. Moreover, it must take the needs of the community into account, in order to avoid conflicts of interest and needless duplication of effort.
The Technical Committee of the scikit-learn Consortium is composed of:
They are responsible for drafting the roadmap.
During the Technical Committee meeting, the Consortium's technical team details the work of the previous months. Then, each member describes how scikit-learn is used in their organization and which features could reasonably be added to or improved in the library. The discussions that follow aim to prioritize the proposed features and to devise a common strategy to address them. When a feature is already on the scikit-learn roadmap, it is more easily pushed forward. Sometimes our funders can make a developer's time available to start working on a prototype: this is very useful for moving a specific feature forward.
The Consortium roadmap contains the list of features recommended by the Technical Committee; where applicable, links to the issues and pull requests already opened are also given. The roadmap also contains non-technical recommendations, for example about ways to support the community or about the release frequency of the package. Consortium members are invited to propose themes for development sprints or seminars. As an example, the Consortium recently organized an event on the interpretability and explainability of models produced by machine-learning methods.
Like real life, real-world data is most often far from normal. Still, data is often assumed, sometimes implicitly, to follow a Normal (or Gaussian) distribution. The two most important assumptions made when choosing a Normal distribution or squared error for regression tasks are^{1}:
On top of that, the squared error is well known to be sensitive to outliers. Here, we want to point out that potentially better alternatives are available.
Typical instances of data that is not normally distributed are counts (discrete) or frequencies (counts per some unit). For these, the simple Poisson distribution might be much better suited. A few examples that come to mind are:
In what follows, we have chosen the diamonds dataset to show the non-normality of such targets and the convenience of GLMs in modelling them.
The diamonds dataset consists of prices of over 50 000 round-cut diamonds with a few explanatory variables, also called features X, such as ‘carat’, ‘color’, ‘cut quality’, ‘clarity’ and so forth. We start with a plot of the (marginal) cumulative distribution function (CDF) of the target variable price and compare it to fitted Normal and Gamma distributions, which both have two parameters.
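As a rough sketch of such a comparison (using a synthetic Gamma-distributed sample in place of the actual prices, and scipy instead of the plotting code behind the figure), the two candidate fits can be compared by log-likelihood:

```python
import numpy as np
from scipy import stats

# Synthetic, right-skewed "prices" standing in for the diamonds target
rng = np.random.RandomState(0)
prices = rng.gamma(shape=2.0, scale=2000.0, size=5000)

# Fit both two-parameter candidates (the Gamma location is fixed at 0)
norm_params = stats.norm.fit(prices)
gamma_params = stats.gamma.fit(prices, floc=0)

# A higher total log-likelihood indicates the better marginal fit
ll_norm = stats.norm.logpdf(prices, *norm_params).sum()
ll_gamma = stats.gamma.logpdf(prices, *gamma_params).sum()
```

On this synthetic sample the Gamma fit attains the higher log-likelihood, mirroring what the CDF plot suggests for the real prices.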
These plots show clearly that the Gamma distribution might be a better fit to the marginal distribution of Y than the Normal distribution.
Let’s continue with a more theoretical intermezzo in the next section; we will return to the diamonds dataset afterwards.
GLMs are statistical models for regression tasks that aim to estimate and predict the conditional expectation of a target variable Y, i.e. E[Y|X]. They unify many different target types under one framework: Ordinary Least Squares, Logistic, Probit and multinomial models, Poisson regression, Gamma regression and many more. GLMs were formalized by John Nelder and Robert Wedderburn in 1972, long after artificial neural networks!
The basic assumptions for an instance or data row i are
where μ_{i} is the mean of the conditional distribution of Y given x_{i}.
One needs to specify:
Note that the choice of the loss or distribution function or, equivalently, the variance function is crucial. It should, at least, reflect the domain of Y.
Some typical example combinations are:
Once you have chosen the first four points, what remains to do is to find a good feature matrix X. Unlike other machine learning algorithms such as boosted trees, there are very few hyperparameters to tune. A typical hyperparameter is the regularization strength when penalties are applied. Therefore, the biggest leverage to improve your GLM is manual feature engineering of X. This includes, among others, feature selection, encoding schemes for categorical features, interaction terms, nonlinear terms like x^{2}.
The new GLM regressors are available as

from sklearn.linear_model import PoissonRegressor
from sklearn.linear_model import GammaRegressor
from sklearn.linear_model import TweedieRegressor

The TweedieRegressor has a parameter power, which corresponds to the exponent of the variance function v(μ) ∼ μ^{p}. For ease of use in the most common cases, PoissonRegressor and GammaRegressor are the same as TweedieRegressor(power=1) and TweedieRegressor(power=2), respectively, with a built-in log link. All of them also support an L2 penalty on the coefficients, set via the penalization strength alpha. The underlying optimization problem is solved via the lbfgs solver of scipy. Note that the scikit-learn 0.23 release also introduced the Poisson loss for the histogram gradient boosting regressor as HistGradientBoostingRegressor(loss='poisson').
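A minimal usage sketch (with simulated count data; the value of alpha is an arbitrary choice for illustration):

```python
import numpy as np
from sklearn.linear_model import PoissonRegressor, TweedieRegressor

rng = np.random.RandomState(0)
X = rng.uniform(size=(500, 2))
# Simulate counts with a log-linear mean, the canonical Poisson setting
y = rng.poisson(np.exp(1.0 + 2.0 * X[:, 0]))

model = PoissonRegressor(alpha=1e-4).fit(X, y)
pred = model.predict(X)  # always positive thanks to the log link

# The same model expressed through the Tweedie family
model_tweedie = TweedieRegressor(power=1, alpha=1e-4).fit(X, y)
```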
After all this theory, it is time to come back to our real-world dataset: diamonds.
Although, in the first section, we were analysing the marginal distribution of Y and not the distribution conditional on the features X, we take the plot as a hint to fit a Gamma GLM with log link, i.e. h(x) = exp(x). Furthermore, we split the data, textbook-like, into an 80% training set and a 20% test set^{2} and use a ColumnTransformer to handle the columns differently. Our feature engineering consists of selecting only the four columns ‘carat’, ‘clarity’, ‘color’ and ‘cut’, log-transforming ‘carat’ and one-hot encoding the other three. Fitting a Gamma distribution and predicting on the test sample gives us the plot below.
Note that fitting Ordinary Least Squares on log(‘price’) also works quite well. This is to be expected, as the Log-Normal and Gamma are very similar distributions, both with Var[Y] ∼ E[Y]^{2} = μ^{2}.
There are several open issues and pull requests for improving GLMs and the fitting of non-normal data. Some of them have already been implemented in scikit-learn 0.24; let’s hope the others will be merged in the near future:
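A sketch of the pipeline described above (with a small synthetic stand-in for the diamonds table, since the dataset itself is not bundled with scikit-learn; the column names follow the real data):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import GammaRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder

# Tiny synthetic stand-in for the diamonds table (~54 000 rows in reality)
rng = np.random.RandomState(42)
df = pd.DataFrame({
    "carat": rng.uniform(0.2, 2.0, 300),
    "cut": rng.choice(["Fair", "Good", "Ideal"], 300),
    "color": rng.choice(["D", "E", "F"], 300),
    "clarity": rng.choice(["SI1", "VS2"], 300),
})
# Gamma-distributed prices whose mean grows with carat
mu = np.exp(7.0 + 1.5 * np.log(df["carat"]))
price = rng.gamma(shape=2.0, scale=mu / 2.0)

# Log-transform 'carat', one-hot encode the three categorical columns
preprocess = ColumnTransformer([
    ("log_carat", FunctionTransformer(np.log), ["carat"]),
    ("onehot", OneHotEncoder(), ["cut", "color", "clarity"]),
])
model = Pipeline([
    ("prep", preprocess),
    ("glm", GammaRegressor(alpha=1e-3, max_iter=1000)),
])
model.fit(df, price)
```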
By Christian Lorentzen
^{1} Algorithms and estimation methods are often well able to deal with some deviation from the Normal distribution. In addition, the central limit theorem justifies a Normal distribution when considering averages or means, and the Gauss–Markov theorem is a cornerstone for usage of least squares with linear estimators (linear in the target Y).
^{2} Rows in the diamonds dataset seem to be highly correlated, as there are many rows with the same values for carat, cut, color, clarity and price, while the values for x, y and z seem to be permuted. Therefore, we define a new group variable that is unique per combination of ‘carat’, ‘cut’, ‘color’, ‘clarity’ and ‘price’, and split stratified by group, i.e. using a GroupShuffleSplit. Having correlated train and test sets invalidates the independent and identically distributed assumption and may render test scores too optimistic.
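The grouped split can be sketched like this (toy data with duplicated rows standing in for the duplicates observed in diamonds):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy frame with duplicated rows, mimicking the duplicates in diamonds
df = pd.DataFrame({
    "carat": [0.3, 0.3, 0.3, 1.0, 1.0, 1.5, 1.5, 2.0],
    "price": [400, 400, 400, 5000, 5000, 9000, 9000, 15000],
})
# One group id per unique (carat, price) combination
groups = df.groupby(["carat", "price"]).ngroup().to_numpy()

splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(df, groups=groups))
# No group ends up in both the train and the test set
```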
Questions and comments:
Fujitsu
Fujitsu actively participates in the Consortium remote events. Fujitsu would be glad to increase contributions to scikit-learn from Japan. Fujitsu suggests organizing a sprint in the Japanese time zone, and starting a discussion about good practices for organizing online sprints with the team there.
Microsoft
More information about the MOOC is requested.
The training activity is an Inria initiative: the MOOC consists of a 20-hour course focusing on general issues in data analysis and Machine Learning; formalism is not developed. The language used is English. All the material will also be available outside the MOOC (GitHub, YouTube…).
The people working on the MOOC also devote part of their time to the scikit-learn code base.
AXA
AXA is glad to participate in the interpretability workshop that will take place in the afternoon.
AXA is designing an online training series that will be launched in the fall: in principle it is dedicated to AXA employees. It could be interesting to share content with the Inria platform and offering.
BNP Paribas Cardif
BNP is starting a similar initiative on machine-learning training: the same comment stands; it could be really interesting to share content with the Consortium.
November 5, 2020
Presentation of the technical achievements and ongoing work (O. Grisel)
Priority list for the consortium at Inria, year 2020–2021
From the discussion during the technical committee, the scikit-learn Consortium at Inria defined the following list of priorities for the coming year:
On the community side:
Before describing the optimization details, let’s recall the principles of the algorithm. The goal is to group data points into clusters, based on their distance from the cluster centers. We start with a set of data points and a set of centers. First, the distances between all points and all centers are computed and, for each point, the closest center is identified: during this step a label is attached to each point. Then, each center is updated to be the barycenter of its assigned data points.
A benchmark comparison with daal4py, the Python wrapper for Intel’s DAAL library, showed that a significant speedup could be hoped for, both in sequential and in parallel runs (the discussion, initiated by François Fayard, started here). Furthermore, a preliminary profiling showed that the computation of the distances is the critical part of the algorithm, but that finding the labels and updating the centers are also not negligible and would quickly become the bottleneck once the first part is well optimized.
The previous implementation exposed a parameter, called precompute_distances, aimed at switching between memory and speed optimization. Favoring speed means that all distances are computed at once and stored in a distance matrix of shape (n_samples, n_clusters). Labels are then computed on this distance matrix. It’s fast, especially because it can be done using a BLAS (Basic Linear Algebra Subprograms) library, which is optimized for the different families of CPUs. The drawback is that it creates a potentially large temporary array which can take a lot of memory. On the other hand, favoring memory efficiency means that distances are computed one at a time and labels are updated on the fly. There is no temporary array, but it’s slower, because the distance computation cannot be vectorized.
Besides causing memory issues, a large temporary array does not provide optimal speed either. Indeed, moving data from the RAM to the CPU and vice versa is quite slow. If we need a variable several times for our computations but have to fetch it from the RAM each time, we waste a lot of time. This is what happens in the k-means algorithm: back and forth between point and center positions to update labels and distances. Ideally, we want the data to stay as close to the CPU as possible, meaning in the CPU cache, while it’s needed for the computations.
The solution we chose is to compute the distances for only a chunk of data at a time, creating a temporary array of shape (chunk_size, n_clusters).
Choosing the right chunk size is crucial. A CPU can apply the same operation to several variables at once in a single instruction (this is SIMD: Single Instruction, Multiple Data). If the temporary array is too small, we don’t fully exploit the vectorization capabilities of the CPU. If the temporary array is too large, it does not fit in the CPU cache. We can clearly see this in the figure alongside. We chose a chunk size of 256 (2⁸) samples. It guarantees that the temporary array will fit in the CPU cache, which is typically a few MB, while keeping good vectorization capability.
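The chunked assignment step can be sketched in pure NumPy (a simplified illustration of the idea; the actual scikit-learn implementation is written in Cython):

```python
import numpy as np

def assign_labels_chunked(X, centers, chunk_size=256):
    """Assign each sample to its closest center, one chunk at a time.

    Only a (chunk_size, n_clusters) block of distances is materialized,
    so the working set stays small enough to fit in the CPU cache.
    """
    n_samples = X.shape[0]
    labels = np.empty(n_samples, dtype=np.int32)
    for start in range(0, n_samples, chunk_size):
        chunk = X[start:start + chunk_size]
        # Squared Euclidean distances via a BLAS-backed matrix product
        dist = (
            (chunk ** 2).sum(axis=1)[:, None]
            - 2.0 * chunk @ centers.T
            + (centers ** 2).sum(axis=1)[None, :]
        )
        labels[start:start + chunk_size] = dist.argmin(axis=1)
    return labels

rng = np.random.RandomState(0)
X = rng.rand(1000, 3)
centers = X[:4].copy()  # take 4 points as (fixed) centers
labels = assign_labels_chunked(X, centers)
```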
Overall, this new implementation is faster than both previous implementations and has a very small memory footprint (only a few MB). Also, this allowed us to simplify the API by deprecating the precompute_distances parameter. Benchmarks on single core are shown in the figure below. Timing measurements are on the left and the corresponding speedups on the right.
The new implementation also changed the parallelism scheme. Previously, a first level of parallelism, handled by the joblib library, was implemented at the outermost level. The n_jobs parameter was used to control the number of processes running the n_init complete runs of k-means (despite its name, n_init is actually about complete runs, not just the initialization). That meant that we couldn’t use the full capacity of a machine with more than n_init cores (the default is 10 and it is usually not useful to take a bigger value). Another level of parallelism came from the BLAS library used in the computation of the distances. However, the other steps of the iteration loop are sequential, which prevents good scalability.
In version 0.23, we decided to move the outer parallelism to the data level. For one chunk of data, we can compute all distances between the points and the clusters, find the labels, and even compute a partial update of the centers. Here, the parallelism is implemented using OpenMP in Cython. Putting the parallelism at this level gives us much better scalability, and we can now fully benefit from all the cores of the CPU, even if the user decides to use n_init=1.
The figures below show the time to fit a KMeans instance with n_init=1 (on a large dataset on the left and on a small dataset on the right) for various numbers of available cores. Green and blue curves concern scikit-learn 0.22. There is barely any scalability on a large dataset (time is only reduced by a factor of 2 between 1 and 16 cores) and no scalability at all on a small dataset. Red and orange curves concern scikit-learn 0.23. Scalability is much better, and near perfect on large datasets if we ignore the initialization (orange curve). We discuss the scalability issues of the initialization in the last section.
In this new implementation, the parallelism at the data level is able to fully exploit all the available cores of the CPU, which means that the parallelism from the distance computations can lead to a situation of thread oversubscription, i.e. more threads than available cores trying to run simultaneously. We had to find a way to disable this second level of parallelism coming from the BLAS library. This was the main challenge of this rework of KMeans, and it led to the development of a new Python library, threadpoolctl, to dynamically control, at the Python level, the number of threads used by native C libraries like OpenMP and several BLAS implementations. Threadpoolctl is now a dependency of scikit-learn, and we hope that it will be used more widely in the Python ecosystem.
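A small sketch of what threadpoolctl enables (threadpool_info and threadpool_limits are part of its public API):

```python
import numpy as np
from threadpoolctl import threadpool_info, threadpool_limits

# Inspect the native thread pools (BLAS, OpenMP) loaded in this process
info = threadpool_info()  # a list of dicts, one per detected library

# Temporarily cap the BLAS pool to one thread to avoid oversubscription
with threadpool_limits(limits=1, user_api="blas"):
    a = np.random.RandomState(0).rand(300, 300)
    b = a @ a  # this matrix product runs on a single BLAS thread
```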
The latest benchmarks still show that DAAL is faster than the 0.23 scikit-learn implementation, by a factor of up to two. Improving performance further will require optimizations, essentially regarding vectorization, that we cannot apply at the Cython level.
However, there’s still room for improvement regarding the initialization of the centers (k-means++). It still has poor scalability and, since it takes a significant proportion of a run of KMeans, the whole estimator does not scale in an optimal way, as shown in the figure above. While we think that a rework of k-means++ might be possible, a simpler solution might be to run the initialization on a subset of the data (a discussion has been started here). We hope this would make the initialization take a negligible proportion of the whole run of KMeans, even if it does not solve the scalability issue itself.
February 3, 2020
Presentation of the technical achievements and ongoing work (O. Grisel)
Priority list for the consortium at Inria, year 2020–2021
From the discussion during the technical committee, the scikit-learn Consortium at Inria defined the following list of priorities for the coming year:
On the community side:
We had three days of coding in excellent company, introduced by a beginners’ workshop aimed at lowering the entry cost of a first Pull Request on scikit-learn. As a side effect, the team had the opportunity to check the effectiveness of the documentation on scikit-learn installation… apparently OSes are like humans: no two are alike.
Almost 40 people joined the workshop, around 30 took part in the development sprint.
More than 40 Pull Requests have been merged thanks to new or confirmed contributors. The scikit-learn core developers did a great job in reviewing and following up all these contributions. Some of them attended the sprint in person, some were ready to take over remotely from the other side of the oceans (both Atlantic and Pacific!).
Contributions included, for example:
A dedicated introduction is really helpful for first-time contributors: thanks to the first-day introduction, they were able to start coding with a clean development environment and focus on their contribution, not on installation problems.
Having a number of core developers available on site is also an asset: preliminary discussions help to better understand the perimeter of an issue, making the contribution more relevant and speeding up the review process.
Mining the scikit-learn issues is still not an easy task for first-time contributors. A systematic effort in labelling and triaging the issues will probably help lower the barrier. However, this effort should be automated as much as possible, to avoid individual letdown and to guarantee consistency.
The first release of scikit-learn dates back to February 1st, 2010. The sprint had been scheduled to celebrate the anniversary of the project: maybe new core developers are hidden among all those contributors who participated in this sprint and the others, past and future.
Ten years of scikit-learn should be remembered for the constant effort of making Machine Learning accessible to as many people as possible. This is the key to its success.
Today, the goal of the Inria scikit-learn Consortium is to help secure the future of scikit-learn so that the work of all the contributors is not lost. Thanks to Inria and to all our partners, a first step has been taken to move scikit-learn from a common good to a public good.
The Consortium would like to help the community grow healthy and happy until adulthood… the teens are such a difficult age… Without the community there would be no future for scikit-learn: this sprint and all those that will follow are our way of testifying to that.
Models fitted by Machine Learning algorithms need to be interpreted and well understood if they are to be applied at large scale and trusted by users.
Visualisation is an important step of data analysis and an essential one for understanding your dataset.
It provides a first insight into the data and suggests which methods are suitable for deeper investigation.
The scikit-learn 0.22 release defines a simple API for visualisation.
The key feature of this API is to allow for quick plotting and visual adjustments without recalculation.
For each available plotting function, a corresponding object is defined that stores the information necessary for graphical rendering.
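As a sketch of this pattern with the ROC curve display object (the exact helper functions have evolved across releases, so consider this illustrative; here the display is built directly from precomputed values):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, purely for this sketch
import numpy as np
from sklearn.metrics import RocCurveDisplay, auc, roc_curve

y_true = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])

# Compute the curve once...
fpr, tpr, _ = roc_curve(y_true, y_score)
display = RocCurveDisplay(fpr=fpr, tpr=tpr, roc_auc=auc(fpr, tpr))

# ...then render it as many times as needed, without recomputation
display.plot()
display.plot()  # e.g. again, with different styling on another figure
```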
Interpretability defines the level of comprehension we have of a general model and of its application to a dataset.
Diving deeper into the interpretability of a fitted model makes Machine Learning more understandable.
This was also a recommendation from the Partners of the scikitlearn Consortium.
The 0.22 version improves the inspection module.
A new feature, the permutation feature importance, has been added. It measures how the score of a specific model decreases when a feature is not available.
The permutation importance is calculated on the training set to show how much the model relies on each feature during training.
Partial Dependence analysis has also been improved, in particular through increased interoperability with Pandas objects and the new plotting capabilities.
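A minimal sketch of the new permutation feature importance (synthetic data; the number of repeats is an arbitrary choice):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

X, y = make_regression(n_samples=300, n_features=4, n_informative=2,
                       random_state=0)
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Score drop when each feature is shuffled, averaged over 5 repeats
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)
```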
When dealing with large amounts of data, chances are just as large that some entries are incomplete.
Multiple reasons, from instrument failures to bad format conversions to human errors, can be the cause of missing values in a dataset.
Ideally, Machine Learning algorithms would know what to do with them. When this is not the case, a number of so-called imputation algorithms can be used to make assumptions about the missing data. Those imputers should be used carefully: good-quality imputation does not always imply good predictions; sometimes the lack of information is a predictor in itself.
In scikit-learn, version 0.22 enables the HistGradientBoosting estimators to manage missing data without the need for any imputation.
For those estimators that still need missing data to be imputed, the impute module now has a new k-Nearest Neighbors imputer, for which a Euclidean distance taking missing values into account has been defined in the metrics module.
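A small sketch of both the new imputer and the missing-value-aware distance:

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.metrics.pairwise import nan_euclidean_distances

X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [np.nan, 6.0],
              [8.0, 8.0]])

# The new metric ignores missing coordinates when comparing rows
dist = nan_euclidean_distances(X)

# Each NaN is replaced by the mean of the 2 nearest neighbours
X_filled = KNNImputer(n_neighbors=2).fit_transform(X)
```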
Large amounts of data need to be efficiently and accurately manipulated: interoperability is the key to safe data mining.
No matter which software you are using, format and structure manipulations need to be automated, and the user should not have to care about them.
Pandas is a principal actor of the PyData ecosystem: scikit-learn 0.22 improves input and output interoperability with pandas on a method-by-method basis.
In particular, fetch_openml can now return pandas DataFrames and thus properly handle datasets with heterogeneous data.
Sometimes a dataset cannot be modelled using just one predictor: different ranges of variables seem to obey different laws. Version 0.22 of scikit-learn provides the option of stacking the outputs of several learners into a single model, improving the final prediction performance.
Even if in Python there are no truly private objects and methods, this 0.22 version aims to clean up the public API space.
Be aware that this could change some of your imports.
Private APIs are not meant to be documented, and you should not rely on their stability.
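A minimal sketch of the new stacking estimator (the choice of base learners here is arbitrary):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Ridge, RidgeCV

X, y = make_regression(n_samples=200, n_features=5, random_state=0)

stack = StackingRegressor(
    estimators=[
        ("rf", RandomForestRegressor(n_estimators=25, random_state=0)),
        ("ridge", Ridge()),
    ],
    final_estimator=RidgeCV(),  # learns how to combine the base predictions
)
stack.fit(X, y)
```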
Managing logs is not an obvious task: whether you are in a production or development environment, whether you are managing a lot of dependencies or just running a small script, you may want to monitor different behaviours, looking for different levels of verbosity.
Python defines a standard behaviour for warnings, also defining the warning-filter level needed to silence them.
The scikit-learn approach has always been to make users aware of object deprecations, so that code can be updated as soon as possible to avoid future failures.
But this was done in a non-standard way, overriding user preferences in the __init__.py file.
Our Elves have received some coal in the past for this.
They are happy to share that scikit-learn 0.22 complies with the Python recommendations.
Deprecations are now identified with FutureWarning, always raised according to the Python scheme.
The 0.22 release comes with a lot more improvements and bug fixes.
Check the Changelog to see them at a glance.
As often, choices have to be made, compromising between the amazing features you would have been happy to see in the code and the time available to a community-based project: so please don’t be too upset if your Santa’s list is not completely covered.
The Elves are already working on the next step … to 0.23 …
and beyond!
For more information, the Inria and Fujitsu press releases are online.