Consortium members provide their financial support without any compensation in return. Defining a roadmap for its software development and, more generally, for its activities is an important step in building a relationship of trust among Consortium members. It represents our effort to identify together the path that leads to the success of the library, in terms of its impact on the software market and the stabilization of its Community. The roadmap sets our direction.

The roadmap commits only the developers hired by the Consortium. In addition, it must take into account the needs of the Community, to avoid conflicts of interest and needless duplication of effort.

The Technical Committee of the scikit-learn Consortium is composed of:

- representatives of the technical team hired by the Consortium,
- one representative for each Consortium member,
- up to as many representatives of the scikit-learn Community as there are Consortium members.

They are responsible for drawing up the roadmap.

During the Technical Committee meeting, the Consortium’s technical team details the work of the previous months. Then each Member describes how scikit-learn is used in their organization and which features could be added to or improved in the library. The discussions that follow aim to prioritize the proposed features and to develop a common strategy for tackling them. When a feature is already on the scikit-learn roadmap, it is more easily put forward. Sometimes our funders can make a developer’s time available to start working on a prototype: this is very useful for moving a specific feature forward.

The Consortium roadmap contains the list of features recommended by the Technical Committee; where applicable, links to the issues and *pull requests* already open are also given. The roadmap also contains non-technical recommendations, such as ways of supporting the community or the release frequency of the package. Consortium Members are invited to propose themes for development sprints or seminars. As an example, the Consortium recently organized an event on the interpretability and explainability of models produced by machine learning methods.

Like *real life*, real world data is most often far from *normality*. Still, data is often assumed, sometimes implicitly, to follow a Normal (or Gaussian) distribution. The two most important assumptions made when choosing a Normal distribution or squared error for regression tasks are^{1}:

- The data is distributed symmetrically around the expectation. Hence, expectation and median are the same.
- The variance of the data does not depend on the expectation.

On top of that, the squared error is well known to be sensitive to outliers. Here, we want to point out that alternatives, potentially better ones, are available.

Typical instances of data that is not normally distributed are counts (discrete) or frequencies (counts per some unit). For these, the simple Poisson distribution might be much better suited. A few examples that come to mind are:

- number of clicks per second in a Geiger counter
- number of patients per day in a hospital
- number of persons per day using their bike
- number of goals scored per game and player
- number of smiles per day and person … Would love to have those data! Think about making their distribution more normal!

In what follows, we have chosen the diamonds dataset to show the non-normality and the convenience of GLMs in modelling such targets.

The diamonds dataset consists of prices of over 50 000 round cut diamonds with a few explanatory variables, also called features *X*, such as *carat*, *color*, *cut quality*, *clarity* and so forth. We start with a plot of the (marginal) cumulative distribution function (CDF) of the target variable *price* and compare it to a fitted Normal and a fitted Gamma distribution, which both have two parameters.

These plots show clearly that the Gamma distribution might be a better fit to the marginal distribution of *Y* than the Normal distribution.
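The same comparison can be sketched numerically, without the plot. The snippet below is only an illustration: it assumes a synthetic right-skewed sample in place of the diamond prices (the sample and its parameters are made up), fits both distributions with `scipy.stats`, and measures the largest gap between the empirical and the fitted CDF (the Kolmogorov-Smirnov distance).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical stand-in for the price column: positive and right-skewed.
y = rng.gamma(shape=1.5, scale=2000.0, size=5000)

# Two-parameter fits: Normal (mean, std) and Gamma (shape, scale) with
# the location of the Gamma fixed at 0.
mu, sigma = stats.norm.fit(y)
shape, _, scale = stats.gamma.fit(y, floc=0)

# Kolmogorov-Smirnov distance: largest gap between empirical and fitted CDF.
ks_normal = stats.kstest(y, "norm", args=(mu, sigma)).statistic
ks_gamma = stats.kstest(y, "gamma", args=(shape, 0, scale)).statistic
print(f"KS distance  Normal: {ks_normal:.3f}  Gamma: {ks_gamma:.3f}")
```

A smaller distance means the fitted CDF follows the empirical one more closely; on the real data the curves would have to be inspected in the same way.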

Let’s start with a more theoretical intermezzo in the next section; we will return to the diamonds dataset afterwards.

GLMs are statistical models for regression tasks that aim to estimate and predict the conditional expectation of a target variable *Y*, i.e. *E[Y|X]*. They unify many different target types under one framework: Ordinary Least Squares, Logistic, Probit and multinomial model, Poisson regression, Gamma and many more. GLMs were formalized by John Nelder and Robert Wedderburn in 1972, long after artificial neural networks!

The basic assumptions for an instance or data row *i* are

*E[Y_{i} | x_{i}] = μ_{i} = h(x_{i} · β)*

*Var[Y_{i} | x_{i}] ∼ v(μ_{i}) / w_{i}*

where *μ_{i}* is the mean of the conditional distribution of *Y_{i}* given *x_{i}*.

One needs to specify:

- the target variable *Y_{i}*,
- the inverse link function *h*, which maps real numbers to the range of *Y* (or more precisely the range of *E[Y]*),
- optionally, sample weights *w_{i}*,
- the variance function *v(μ)*, which is equivalent to specifying a loss function or a specific distribution from the family of the exponential dispersion model (EDM),
- the feature matrix *X* with row vectors *x_{i}*,
- the coefficients or weights *β* to be estimated from the data.

Note that the choice of the loss or distribution function or, equivalently, the variance function is crucial. It should, at least, reflect the domain of *Y*.

Some typical example combinations are:

- measurement errors are described as real numbers, with a Normal distribution and an identity link function;
- insurance claims are represented by positive numbers, with a Gamma distribution and a log link function;
- Geiger counts are non-negative numbers, with a Poisson distribution and a log link function;
- the probability of success of a challenge is a number in the *[0, 1]* interval, with a Binomial distribution and a logit link function.

Once you have chosen the first four points, what remains to do is to find a good feature matrix *X*. Unlike other machine learning algorithms such as boosted trees, there are very few hyperparameters to tune. A typical hyperparameter is the regularization strength when penalties are applied. Therefore, the biggest leverage to improve your GLM is manual feature engineering of *X*. This includes, among others, feature selection, encoding schemes for categorical features, interaction terms, non-linear terms like *x ^{2}*.

Strengths of GLMs:

- Very well understood and established, proven over and over in practice, e.g. stability, see next point.
- Very stable: slight changes of training data do not alter the fitted model much (counter example: decision trees).
- Versatile as to model different targets with different link and loss functions. As an example, a log link gives a multiplicative structure and effects are interpreted on a relative scale. Together with a Poisson distribution, this still works even when some target values are exactly zero.
- Mathematically tractable, which means a good theoretical understanding and fast fitting even for large datasets.
- Ease of interpretation.
- As flexible as the building of the feature matrix *X*.
- Some losses, like Poisson loss, can handle a certain amount of excess of zeros.

Weaknesses of GLMs:

- Feature matrix *X* has to be built manually, in particular interaction terms and non-linear effects.
- Unbiasedness depends on (correct) specification of *X* and on the combination of link and loss function.
- Predictive performance is often worse than for boosted tree models or neural networks.

The new GLM regressors are available as

from sklearn.linear_model import PoissonRegressor

from sklearn.linear_model import GammaRegressor

from sklearn.linear_model import TweedieRegressor

The `TweedieRegressor` has a parameter `power`, which corresponds to the exponent of the variance function *v(μ) ∼ μ^{p}*. For ease of the most common use, `PoissonRegressor` and `GammaRegressor` are the same as `TweedieRegressor(power=1)` and `TweedieRegressor(power=2)`, respectively, with built-in log link. All of them also support an L2-penalty on the coefficients by setting the penalization strength `alpha`. The underlying optimization problem is solved via the l-bfgs solver of scipy. Note that the scikit-learn release 0.23 also introduced the Poisson loss for the histogram gradient boosting regressor as `HistGradientBoostingRegressor(loss='poisson')`.

After all this theory, it is time to come back to our real world dataset: diamonds.
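Before turning to the data, here is a small check of the stated equivalence between `PoissonRegressor` and `TweedieRegressor(power=1)` with a log link. It is a sketch on synthetic counts; the coefficients, the sample size and the value of `alpha` are arbitrary choices made for illustration.

```python
import numpy as np
from sklearn.linear_model import PoissonRegressor, TweedieRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
# Synthetic counts with a log-linear mean; the coefficients are arbitrary.
y = rng.poisson(np.exp(0.5 + X @ np.array([0.3, -0.2, 0.1])))

# Same loss, same link, same L2 penalty: the two fits should agree.
poisson = PoissonRegressor(alpha=0.01).fit(X, y)
tweedie = TweedieRegressor(power=1, link="log", alpha=0.01).fit(X, y)

print(poisson.coef_)
print(tweedie.coef_)
```

Because of the log link, predictions from either model are strictly positive even though some observed counts are exactly zero.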

Although, in the first section, we were analysing the marginal distribution of *Y* and not the conditional (on the features *X*) distribution, we take the plot as a hint to fit a Gamma GLM with log-link, i.e. *h(x) = exp(x)*. Furthermore, we split the data textbook-like into 80% training set and 20% test set^{2} and use `ColumnTransformer` to handle columns differently. Our feature engineering consists of selecting only the four columns *carat*, *clarity*, *color* and *cut*, log-transforming *carat* as well as one-hot-encoding the other three. Fitting a Gamma GLM and predicting on the test sample gives us the plot below.
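The preprocessing and model described above could look roughly like the following. This is a sketch, not the code behind the plot: it assumes a small synthetic table with the same four column names (the category levels, the price formula and the value of `alpha` are invented), and it skips the train/test split for brevity.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import GammaRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder

rng = np.random.default_rng(0)
n = 500
# Hypothetical stand-in for the diamonds table.
df = pd.DataFrame({
    "carat": rng.uniform(0.2, 2.0, n),
    "cut": rng.choice(["Fair", "Good", "Ideal"], n),
    "color": rng.choice(["D", "G", "J"], n),
    "clarity": rng.choice(["IF", "VS1", "SI2"], n),
})
mean_price = (4000 * df["carat"] ** 1.7).to_numpy()
price = rng.gamma(shape=5.0, scale=mean_price / 5.0)  # Gamma noise around the mean

# Log-transform carat, one-hot encode the three categorical columns.
preprocess = ColumnTransformer([
    ("log_carat", FunctionTransformer(np.log), ["carat"]),
    ("onehot", OneHotEncoder(handle_unknown="ignore"), ["cut", "color", "clarity"]),
])
model = make_pipeline(preprocess, GammaRegressor(alpha=1e-4, max_iter=1000))
model.fit(df, price)
pred = model.predict(df)
```

With the built-in log link, a coefficient on log(*carat*) acts as an exponent on *carat* in the predicted price, which is why the log transform pairs naturally with this model.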

Note that fitting Ordinary Least Squares on log(*price*) also works quite well. This is to be expected, as Log-Normal and Gamma are very similar distributions, both with *Var[Y] ∼ E[Y]^{2} = μ^{2}*.

There are several open issues and pull requests for improving GLMs and fitting of non-normal data. Some of them have already been implemented in scikit-learn 0.24; let’s hope the others will be merged in the near future:

- Poisson splitting criterion for decision trees (PR #17386) made it in v0.24.
- Spline Transformer (PR #18368) will be available in 1.0.
- L1 penalty and coordinate descent solver (Issue #16637),
- IRLS solver if benchmarks show improvement over l-bfgs (Issue #16634),
- Better support for interaction terms (Issue #15263),
- Native categorical support (Issue #18893),
- Feature names Enhancement Proposal (SLEP015) is under active discussion.

*By Christian Lorentzen*

^{1} Algorithms and estimation methods are often well able to deal with some deviation from the Normal distribution. In addition, the central limit theorem justifies a Normal distribution when considering averages or means, and the Gauss–Markov theorem is a cornerstone for usage of least squares with linear estimators (linear in the target *Y*).

^{2} Rows in the diamonds dataset seem to be highly correlated, as there are many rows with the same values for carat, cut, color, clarity and price, while the values for x, y and z seem to be permuted. Therefore, we define a new group variable that is unique for *carat*, *cut*, *color*, *clarity* and *price*. Then, we split by group, i.e. using a `GroupShuffleSplit`, so that all rows of a group end up on the same side of the split. Having correlated train and test sets invalidates the assumption of independent and identically distributed data and may render test scores too optimistic.
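The group split of this footnote can be sketched as follows, assuming a tiny made-up table in which two pairs of rows are duplicates on the five key columns. Only `groupby(...).ngroup()` and `GroupShuffleSplit` are library calls; the data are invented for illustration.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Made-up rows: 0/1 and 3/4 share carat, cut, color, clarity and price.
df = pd.DataFrame({
    "carat": [0.3, 0.3, 0.5, 0.7, 0.7, 1.0, 1.2, 1.5, 0.4, 0.9],
    "cut": ["Ideal", "Ideal", "Good", "Fair", "Fair",
            "Good", "Ideal", "Fair", "Good", "Ideal"],
    "color": ["E", "E", "G", "J", "J", "D", "E", "G", "J", "D"],
    "clarity": ["VS1", "VS1", "SI2", "IF", "IF",
                "VS1", "SI2", "IF", "VS1", "SI2"],
    "price": [500, 500, 1200, 2100, 2100, 4000, 6000, 9000, 800, 3500],
})

# One integer id per distinct combination of the five key columns.
keys = ["carat", "cut", "color", "clarity", "price"]
groups = df.groupby(keys).ngroup()

# Roughly 20% of the groups (not of the rows) go to the test set.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(df, groups=groups))
```

Duplicated rows can then never be split between the training and the test set.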

*Questions and comments*:

**Fujitsu**

Fujitsu actively participates in the Consortium remote events. Fujitsu would be glad to increase Japan contributions to scikit-learn. Fujitsu suggests organizing a sprint for Japan time zone, and starting a discussion about good practices to organize online sprints with the team there.

**Microsoft**

Microsoft asks for more information about the MOOC.

The training activity is an Inria initiative: the MOOC consists of 20 hours of training focusing on general issues in data analysis and Machine Learning, without developing the formalism. The language used is English. All the material will also be available outside the MOOC (GitHub, YouTube, etc.).

People working on the MOOC also give part of their time to work on the code base of scikit-learn.

**AXA**

AXA is glad to participate in the interpretability workshop that will take place in the afternoon.

AXA is designing an online training series that will be launched in the fall: in principle it is dedicated to AXA employees. It could be interesting to share content with the Inria platform and training offer.

**BNP Paribas**

BNP is starting a similar initiative on machine learning training: the same comment applies, it could be really interesting to share content with the Consortium.

Before describing the optimization details, let’s recall the principles of the algorithm. The goal is to group data points into clusters, based on their distance to the cluster centers. We start with a set of data points and a set of centers. First, the distances between all points and all centers are computed and, for each point, the closest center is identified: during this step each point is labelled with its cluster. Then, the center of each cluster is updated to be the barycenter of its assigned data points.
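One iteration of these two steps can be written in a few lines of NumPy. This is a plain textbook sketch for illustration, not the optimized scikit-learn code; the function name and the two synthetic blobs are assumptions.

```python
import numpy as np

def lloyd_iteration(X, centers):
    # Squared Euclidean distances between every point and every center,
    # shape (n_samples, n_clusters).
    distances = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    # Label each point with the index of its closest center.
    labels = distances.argmin(axis=1)
    # Move each center to the barycenter of the points assigned to it.
    new_centers = centers.copy()
    for k in range(centers.shape[0]):
        members = X[labels == k]
        if len(members) > 0:  # keep the old center if a cluster is empty
            new_centers[k] = members.mean(axis=0)
    return labels, new_centers

rng = np.random.default_rng(0)
# Two well-separated synthetic blobs of 100 points each.
X = np.vstack([rng.normal(0.0, 0.5, size=(100, 2)),
               rng.normal(5.0, 0.5, size=(100, 2))])
centers = X[[0, 100]].copy()  # one starting center taken from each blob
for _ in range(10):
    labels, centers = lloyd_iteration(X, centers)
```

Note the full `(n_samples, n_clusters)` distance array built at every iteration: this is the temporary array whose cost is discussed next.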

A benchmark comparison with daal4py, the Python wrappers for Intel’s DAAL library, showed that a significant speed-up could be hoped for, both in sequential and in parallel runs (the discussion, initiated by François Fayard, started here). Furthermore, a preliminary profiling showed that the computation of the distances is the critical part of the algorithm, but finding the labels and updating the centers is also not negligible and would quickly become the bottleneck once the first part is well optimized.

The previous implementation exposed a parameter, called *precompute_distances*, aimed to switch between memory and speed optimization. Favoring speed means that all distances are computed at once and stored in a distance matrix of shape *(n_samples, n_clusters).* Then labels are computed on this distance matrix. It’s fast, especially because it can be done using a BLAS (Basic Linear Algebra Subprogram) library which is optimized for the different families of CPU. The drawback is that it creates a potentially large temporary array which can take a lot of memory. On the other hand, favoring memory efficiency means that distances are computed one at a time and labels are updated on the fly. There is no temporary array but it’s slower, because distance computation cannot be vectorized.

Besides causing memory issues, **a large temporary array does not provide optimal speed** either. Indeed, moving data from the RAM into the CPU and vice versa is quite slow. If we need a variable several times for our computations but have to fetch it from the RAM each time, we are wasting a lot of time. This is what happens in the k-means algorithm: back and forth from point to center positions to update labels and distances. Ideally we want the data to stay as close to the CPU as possible, meaning in the CPU cache, while it’s needed for the computations.

The solution we chose is to **compute the distances for only a chunk of data at a time**, creating a temporary array of shape *(chunk_size, n_clusters).*

Choosing the right chunk size is crucial. A CPU can do the same operation on several variables at once in a single instruction (this is a SIMD CPU, for Single Instruction Multiple Data). If the temporary array is too small we don’t fully exploit the vectorization capabilities of the CPU. If the temporary array is too large it does not fit in the CPU cache. We can clearly see that in the figure beside. We chose a chunk size of 256 (2⁸) samples. It guarantees that the temporary array will fit in the CPU cache which is typically a few MB, while keeping a good vectorization capability.
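The chunking idea can be illustrated in NumPy, with the caveat that the actual implementation is written in Cython and that the function name and synthetic data below are only illustrative. Each chunk produces a temporary array of shape *(chunk_size, n_clusters)*, and the matrix product inside the loop is the part a BLAS library would handle.

```python
import numpy as np

def labels_by_chunk(X, centers, chunk_size=256):
    labels = np.empty(X.shape[0], dtype=np.intp)
    centers_sq = (centers ** 2).sum(axis=1)
    for start in range(0, X.shape[0], chunk_size):
        chunk = X[start:start + chunk_size]
        # Temporary array of shape (chunk_size, n_clusters). The term
        # ||x||^2 is left out: it is constant per row, so it does not
        # change which center is the closest.
        partial = centers_sq - 2.0 * (chunk @ centers.T)
        labels[start:start + chunk_size] = partial.argmin(axis=1)
    return labels

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))
centers = rng.normal(size=(5, 8))
chunked_labels = labels_by_chunk(X, centers)

# Reference: all distances at once, shape (n_samples, n_clusters).
full = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
full_labels = full.argmin(axis=1)
```

Both routes give the same labels; only the size of the temporary array differs.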

Overall, this new implementation is faster than both previous implementations and has a very small memory footprint (only a few MB). Also, this allowed us to simplify the API by deprecating the *precompute_distances* parameter. Benchmarks on single core are shown in the figure below. Timing measurements are on the left and the corresponding speed-ups on the right.

The new implementation also changed the parallelism scheme. Previously, a first level of parallelism, handled by the joblib library, was implemented at the outermost level. The *n_jobs* parameter was used to control the number of processes to run the *n_init* complete runs of k-means (despite its name, *n_init* is actually about complete runs, not just the initialization). That meant that we couldn’t use the full capacity of a machine with more than *n_init* cores (the default is 10 and it is usually not useful to take a bigger value). Another level of parallelism came from the BLAS library used in the computation of the distances. However, the other steps of the iteration loop were sequential, which prevented good scalability.

In version 0.23, we decided to move the outer parallelism to the data level. For one chunk of data, we can compute all distances between the points and the clusters, find the labels, and even compute a partial update of the centers. Here, the parallelism is implemented using the OpenMP library in Cython. Putting the parallelism at this level gives us much better scalability, and we can now fully benefit from all the cores of the CPU, even if the user decides to use *n_init=1*.
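To make the data-level scheme concrete, here is a sketch in which each chunk returns its partial sums and counts, and the partial results are combined into the new centers. It uses a Python thread pool purely as a stand-in for the OpenMP loop in Cython; the function names, chunk size and thread count are assumptions.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def chunk_update(chunk, centers):
    # Labels for this chunk, then the chunk's contribution to each center.
    dist = ((chunk[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = dist.argmin(axis=1)
    sums = np.zeros_like(centers)
    np.add.at(sums, labels, chunk)  # sums[k] += every point labelled k
    counts = np.bincount(labels, minlength=centers.shape[0])
    return sums, counts

def parallel_center_update(X, centers, chunk_size=256, n_threads=4):
    chunks = [X[i:i + chunk_size] for i in range(0, X.shape[0], chunk_size)]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        results = list(pool.map(lambda c: chunk_update(c, centers), chunks))
    # Reduce the partial results of all chunks.
    sums = sum(r[0] for r in results)
    counts = sum(r[1] for r in results)
    new_centers = centers.copy()
    nonempty = counts > 0
    new_centers[nonempty] = sums[nonempty] / counts[nonempty, None]
    return new_centers

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 4))
centers = X[:3].copy()
new_centers = parallel_center_update(X, centers)
```

Since every chunk is processed independently, the work can be spread over all cores regardless of the value of *n_init*.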

The figures below show the time to fit a KMeans instance with *n_init=1* (on a large dataset on the left and on a small dataset on the right) for various numbers of available cores. Green and blue curves concern scikit-learn 0.22. There is barely any scalability on a large dataset (time is reduced by a factor of 2 between using 1 and 16 cores) and no scalability at all on a small dataset. Red and orange curves concern scikit-learn 0.23. Scalability is much better and near perfect on large datasets if we ignore the initialization (orange curve). We discuss the scalability issues of the initialization in the last section.

In this new implementation, the parallelism at the data level is able to fully exploit all the available cores of the CPU, which means that the parallelism from the distances computation can lead to a situation of thread **oversubscription**, i.e. more threads than available cores are trying to run simultaneously. We had to find a way to disable this second level of parallelism coming from the BLAS library. This was the main challenge of this rework of KMeans. This challenge led to the **development of a new Python library, threadpoolctl, to dynamically control, at the Python level, the number of threads used by native C libraries** like OpenMP and several BLAS libraries. Threadpoolctl is now a dependency of scikit-learn, and we hope that it will be used more in the wider Python ecosystem.
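As a short illustration of what threadpoolctl offers, the snippet below limits the BLAS library used by NumPy to a single thread inside a `with` block. It assumes threadpoolctl is installed (it comes with scikit-learn) and that NumPy is linked against a BLAS library it can detect; the matrix size is arbitrary.

```python
import numpy as np
from threadpoolctl import threadpool_info, threadpool_limits

a = np.random.default_rng(0).normal(size=(500, 500))

# Inside the block, BLAS calls made by NumPy use a single thread ...
with threadpool_limits(limits=1, user_api="blas"):
    product = a @ a.T
    blas_threads = [lib["num_threads"] for lib in threadpool_info()
                    if lib["user_api"] == "blas"]
# ... and the previous limits are restored when the block exits.
print(blas_threads)
```

Wrapping only the BLAS-heavy section this way leaves an outer level of parallelism free to use all the cores.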

Latest benchmarks still show that DAAL is faster than the 0.23 scikit-learn implementation, by a factor of up to two. Improving the performances will require optimizations, essentially regarding vectorization, that we cannot apply at the Cython level.

However, there’s still room for improvement regarding the initialization of the centers (*k-means++*). It still scales poorly, and since it takes a significant proportion of a run of KMeans, the whole estimator does not scale in an optimal way, as shown in the figure above. Although we think that a rework of *k-means++* might be possible, a simpler solution might be to run the initialization on a subset of the data (a discussion has been started here). We hope this would make the initialization take a negligible proportion of the whole run of KMeans, even if this does not solve the scalability issue.

We had three days of coding in excellent company, introduced by a beginner’s workshop aimed to lower the entry cost to the first Pull Request on scikit-learn. As a side effect the team had the opportunity to check the effectiveness of the documentation on scikit-learn installation… apparently OSes are like humans: no one is alike.

Almost 40 people joined the workshop, around 30 took part in the development sprint.

More than 40 Pull Requests have been merged thanks to new or confirmed contributors. The scikit-learn core developers did a great job in reviewing and following up on all these contributions. Some of them attended the sprint in person, some were ready to take over remotely on the other side of the oceans (both Atlantic and Pacific!).

Contributions included, for example:

- documentation fixes and clarifications,
- filtering the warnings in the examples run during the documentation build,
- checking the consistency of attributes across submodules,
- replacing the Boston dataset with other datasets without ethical issues.

A dedicated introduction is really helpful for first time contributors: thanks to the first day introduction, they were able to start coding with a clean development environment and focus on their contribution, not on the installation problems.

Having a number of core developers available on site is also an asset: preliminary discussions help contributors better understand the scope of an issue, making the contribution more relevant and speeding up the review process.

Mining the scikit-learn issues is still not an easy task for first time contributors. A systematic effort in labelling and triaging the issues will probably help to lower the barrier. However, this effort should be automated as much as possible, to avoid individual let-down and to guarantee consistency.

The first release of scikit-learn dates back to February 1st, 2010. The sprint was scheduled to celebrate the anniversary of the project: maybe new core developers are hidden among all those contributors who participated in this sprint and in the others, past and future.

Ten years of scikit-learn should be remembered for the constant effort to make Machine Learning accessible to as many people as possible. This is the key to its success.

Today, the goal of the Inria scikit-learn Consortium is to help secure the future of scikit-learn so that the work of all the contributors is not lost. Thanks to Inria and to all our partners, a first step has been taken to move scikit-learn from common to public goods.

The Consortium would like to help the community grow healthy and happy until adulthood… the teenage years are such a difficult age… without the community there will be no future for scikit-learn: this sprint and all those that will follow are our way to testify to that.

Just a bit earlier than Santa visiting, this past month some special Elves have worked really hard to keep the target of releasing scikit-learn twice a year.

Come take a look at some of the many surprises this remarkable package contains.

Models fitted by Machine Learning algorithms need to be interpreted and well understood if they have to be applied at a large scale and trusted by users.

Visualisation is an important step of data analysis and an essential one to the understanding of your dataset.

It gives a first insight into the data and suggests which methods are suitable for a deeper investigation.

The 0.22 scikit-learn version defines a simple API for visualisation.

The key feature of this API is to allow for quick plotting and visual adjustments without recalculation.

For each available plotting function, a corresponding object is defined that stores the information needed for graphical rendering.

Interpretability defines the level of comprehension we have of a general model and of its application to a dataset.

Diving deeper into the interpretability of the fitted model makes Machine Learning more understandable.

This was also a recommendation from the Partners of the scikit-learn Consortium.

The 0.22 version improves the inspection module.

A new feature, the permutation feature importance, has been added. It measures how the score of a specific model decreases when a feature is not available.

The permutation importance is calculated on the training set to show how much the model relies on each feature during training.
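A minimal usage sketch of `permutation_importance` from `sklearn.inspection` follows. The synthetic data, in which only the first column drives the target, and the choice of a random forest are assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 3.0 * X[:, 0] + 0.1 * rng.normal(size=500)  # only column 0 matters

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Shuffle each column in turn (5 times) and record the drop in score.
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)
print(result.importances_mean)
```

The same call can be made on a held-out set to see which features matter for generalization rather than for the fit itself.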

Partial dependence analysis has also been improved, in particular through better interoperability with pandas objects and the new plotting capabilities.

When dealing with large amounts of data, chances are just as large that some entries are incomplete.

Multiple reasons, from instrument failures to bad format conversions to human errors, could be the causes of missing values in the dataset.

Ideally, Machine Learning algorithms would know what to do with them. When this is not the case, a number of so-called imputation algorithms can be used to make assumptions about the missing data. Those imputers should be used carefully: good quality imputation does not always imply good predictions, and sometimes the lack of information is a predictor itself.

For scikit-learn, version 0.22 lets the histogram-based gradient boosting estimators (`HistGradientBoostingClassifier` and `HistGradientBoostingRegressor`) handle missing data without any imputation.

For those estimators that still need missing data to be imputed, the impute module now has a new k-Nearest Neighbors imputer, for which a Euclidean distance taking missing values into account has been defined in the metrics module.

Large amounts of data need to be manipulated efficiently and accurately: interoperability is the key to safe data mining.

No matter which software you are using, format and structure manipulations need to be automated so that users do not have to care about them.

Pandas is a principal actor of the Pydata ecosystem: scikit-learn 0.22 improves input and output interoperability with pandas on a method by method basis.

In particular fetch_openml can now return pandas dataframes and thus properly handle datasets with heterogeneous data.

Sometimes a dataset cannot be modelled using just one predictor. Different ranges of variables seem to obey different laws. Version 0.22 of scikit-learn provides the option of stacking the outputs of several learners with a single model, improving the final prediction performance.
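A small usage sketch of stacking with `StackingClassifier`: the base learners, the final estimator and the synthetic dataset are arbitrary choices made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The cross-validated predictions of the base learners become the input
# features of the final estimator.
stack = StackingClassifier(
    estimators=[
        ("forest", RandomForestClassifier(n_estimators=50, random_state=0)),
        ("knn", KNeighborsClassifier()),
    ],
    final_estimator=LogisticRegression(),
)
stack.fit(X_train, y_train)
accuracy = stack.score(X_test, y_test)
print(f"test accuracy: {accuracy:.2f}")
```

`StackingRegressor` follows the same pattern for regression tasks.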

Even though there are no truly private objects and methods in Python, this 0.22 version aims to clean up the public API space.

Be aware that this could change some of your imports.

Private APIs are not meant to be documented and you should not rely on their stability.

Managing logs is not an obvious task: if you are in a production or development environment, if you are managing a lot of dependencies or just running a small script, you may want to monitor different behaviours, looking for different levels of verbosity.

Python defines a standard behaviour for warnings, defining also the level of the warning filter needed to avoid them.

The scikit-learn approach has always been to make the user aware of object deprecations, so that code can be updated as soon as possible to avoid future failures.

But this was done in a non standard way, overriding user preferences in the __init__.py file.

Our Elves have received some coal in the past for this.

They are happy to share that scikit-learn 0.22 is compliant with the Python recommendations.

Deprecations are now signalled with `FutureWarning`, which Python always displays by default.
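With the standard mechanism, users control these warnings through the filters of the `warnings` module. The sketch below uses a hypothetical deprecated function, `old_helper`, as a stand-in for a library API; only the `warnings` calls are real.

```python
import warnings

def old_helper():
    # Hypothetical deprecated function, standing in for a library API.
    warnings.warn("old_helper is deprecated; use new_helper instead.",
                  FutureWarning)
    return 42

# A user who wants silence opts out explicitly with a standard filter.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=FutureWarning)
    value = old_helper()

# A user who wants deprecations to fail loudly (for instance in a test
# suite) turns them into errors instead.
with warnings.catch_warnings():
    warnings.simplefilter("error", category=FutureWarning)
    try:
        old_helper()
        raised = False
    except FutureWarning:
        raised = True
```

The point of the change is that such user-defined filters are respected rather than overridden by the library.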

The 0.22 release comes with a lot more improvements and bug fixes.

Check the Changelog to see them at a glance.

As often, choices have to be made, and compromises found between the amazing features you would have been happy to see in the code and the time available in a community-based project: so, please don’t be too upset if your Santa’s list is not completely covered.

The Elves are already working on the next step … to 0.23 …

and beyond!

For more information, the Inria and Fujitsu press releases are online.

to invite you to the

**–** Welcome from 9h30 **–**

BNP Paribas Cardif – 8 rue du Port, Nanterre

Access map

Registration is now closed.

*[ All presentations will be in English ]*

Welcome address by **Stanislas Chevalet**, Deputy Chief Executive Officer – Head of Transformation & Development, **BNP Paribas Cardif**

10h05

**Gaël Varoquaux / Inria**

*Foreword: the achievements of the scikit-learn consortium to date*

10h20

**Roman Yurchak / Inria – Symerio**

*Tutorial: scikit-learn new features. Preprocessing and imputation methods, estimators for clustering, supervised learning*
[See the presentation and find the notebook]

11h40

**Xavier Dupré / Microsoft**

*ONNX: machine learning model persistence*
[See the presentation]

12h00

**Jérémie du Boisberranger / Inria**

*Tutorial: questions of performance in scikit-learn. Parallelism, memory, low-level optimizations*
[Find the notebook]

13h00

Lunch

14h00

**Xavier Renard / AXA**

*Whitening ML black boxes: where do we stand?*
[See the presentation]

14h20

**Guillaume Lemaître / Inria**

*Tutorial: scikit-learn interpretability, linear and tree-based models*
[Find notebook_1 and notebook_2]

15h40

**Peter Entschev / Nvidia**

Nvidia Distributed GPU Machine Learning with RAPIDS and Dask

[See the presentation]

16h00

**Tung Lam Dang / BNP Paribas Cardif**

*Scikit-learn: a tool for better model risk governance @ BNPP Cardif*
[See the presentation]

16h20

**Laurent Duhem / Intel**

*Speeding up scikit-learn on Intel architectures*
[See the presentation]

16h40

**Léo Dreyfus-Schmidt & Samuel Ronsin / Dataiku**

*LeaveNoOneOut: Building a ML platform for everyone*
[See the presentation]

17h00

**Anton Bossenbroek / BCG**

*Rapid and collaborative fair data science deployment in strategy consulting*

17h20

**Olivier Grisel / Inria**

*Wrap up and perspectives*
[See the presentation]

17h30

**Round-table: Open source, a model for AI and data science?**

18h00

Closing cocktail


Present:

- Chaouki Boutharouite – AXA
- Sébastien Conort – BNP Paribas Cardif
- Sylvain Duranton – BCG
- David Margery – Inria Foundation
- Olivier Trébucq – Inria
- Gaël Varoquaux – Inria

Excused:

- Josh Patterson and Guillaume Barat – Nvidia
- Laurent Duhem – Intel

Two engineers have already been recruited and the COO should be selected in the coming days.

Gaël introduced the reactions from the community to our roadmap:

https://docs.google.com/document/d/1m5Ijebe123sSaRKIfdBzh5YwJvcQimX0z8SB-vWm518/edit?usp=sharing

The event of September 17th was quite successful, in terms of both attendance and media impact.

This event and other information relating to the scikit-learn consortium are displayed on the website https://scikit-learn.fondation-inria.fr/

In 2019, two main types of meetings will be organized:

- Monthly sprints with partners, for addressing technical and implementation issues
- A yearly event for training and networking with the whole scikit-learn community

Both types of event could be hosted alternately by partners (AXA and BNP Paribas Cardif have already offered to make their auditoriums available).

The yearly event should take place before summer.

Two applications have been received after the September 17th event:

- One from Criteo (Elvis Dohmatob)
- One from H&M (Arti Zeighami)

Both companies have been informed of the conditions for joining the consortium.

All the participants have declared that they support these applications.

**October 10, 2018**

__Priority list for the consortium at Inria, year 2018–2019__

From the points discussed at the meeting, the Technical Committee is proposing the following list of priorities for the actions of the consortium, to be used by the management team for allocating consortium resources:

**Faster release cycle & dedicate resources for maintenance**.**Benchmark and compliance tests**(Intel & Nvidia): scikit-learn-benchmarks → collaboration on PR from Intel to accelerate the set of benchmarks to implement.

- Historical benchmark data of the scikit-learn code base will be used to detect performance regression or highlight improvements on the scikit-learn master branch. An action point across different actors is to collaboratively assemble and centralize benchmarks.
- The asv (air speed velocity) test suite will also be re-used to collect the metrics for several implementations (e.g. leveraging the alternative implementations in daal4py for intel or cuML for Nvidia), possibly on alternative hardware. The built-in reporting tool of asv does not appear to make it possible to contrast 2 runs of the same benchmark suite with different implementations / hardware but the data is stored in JSON so it should be easy enough to come with our reporting code for this task (e.g. bar plots with matplotlib).

Remark: We will require that, before code can be pushed to the benchmark dashboard, it is first run against the compatibility test and reports whether it passes or not (just check_estimator?)

**Tools to compare validity of model between scikit-learn versions** (retraining with the same parameters on the same dataset should, in general, yield the same fitted attributes and predictions on a validation set)

→ useful to check that two alternative implementations of the same model can behave as drop-in replacements. This will also help automate model lifecycle and productization: handling library upgrades safely (numpy, scipy, BLAS, scikit-learn) and checking the correctness of exported models (e.g. ONNX runtimes).
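
A minimal sketch of such a comparison, assuming the fitted attributes and validation predictions of each run have already been collected into plain dictionaries of arrays. The helper name `compare_fitted`, the dictionary keys and the tolerances are hypothetical choices for the example, not an existing scikit-learn tool.

```python
import numpy as np


def compare_fitted(attrs_a, attrs_b, rtol=1e-7, atol=1e-9):
    """Return the sorted names of entries that differ between two runs.

    An entry differs if it is missing on one side, has another shape,
    or is not numerically close within the given tolerances.
    """
    mismatches = []
    for name in sorted(set(attrs_a) | set(attrs_b)):
        if name not in attrs_a or name not in attrs_b:
            mismatches.append(name)
            continue
        a, b = np.asarray(attrs_a[name]), np.asarray(attrs_b[name])
        if a.shape != b.shape or not np.allclose(a, b, rtol=rtol, atol=atol):
            mismatches.append(name)
    return mismatches


# Made-up values standing in for two library versions.
old = {
    "coef_": np.array([0.5, -1.2]),
    "intercept_": np.array([0.1]),
    "predictions": np.array([1.0, 0.0, 1.0]),
}
new_ok = {**old, "coef_": old["coef_"] + 1e-12}      # rounding-level noise
new_bad = {**old, "coef_": np.array([0.5, -1.3])}    # a real change

print(compare_fitted(old, new_ok))
print(compare_fitted(old, new_bad))
```

Exact equality is deliberately not required here, since a different BLAS or numpy build can change the last few bits of a result without the model being meaningfully different.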

**Interpretability**:

- Improving the documentation: tutorial on common pitfalls and good practices.
- Make it explicit that interpretability can have different meanings in different contexts and for different purposes (ease of debugging and checking bias in the training set, transparency of the decision function, gaining knowledge on the underlying data generating process (scientific inference, business insights), explainability of individual decisions for end-users who are impacted by the model decisions…)
- Generic meta-estimator for model agnostic aggregate feature importance: remove one feature at a time and do all this in parallel -> permutation tests.
- Partial dependence plots for more (all?) models.
- Document available 3rd party tools for things that are outside of the scope of scikit-learn (open area for research): shap, LIME, yellowbrick, ELI5.
- Add an “invertible” option to hashing vectorizer, so that it is less black box
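
The model-agnostic feature importance item above could look roughly like the sketch below. It assumes a regression setting scored with mean squared error and a model exposed as a plain `predict` callable; the function name `permutation_importance`, its parameters and the synthetic data are illustrative choices, not a proposed scikit-learn API.

```python
import numpy as np


def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean increase in MSE when each column of X is shuffled in turn."""
    rng = np.random.default_rng(seed)
    baseline = np.mean((predict(X) - y) ** 2)
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        scores = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Shuffling one column breaks its link with the target.
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            scores.append(np.mean((predict(X_perm) - y) ** 2))
        importances[j] = np.mean(scores) - baseline
    return importances


# Synthetic data: the third feature has no effect on the target.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1]

# Any fitted model works; here, ordinary least squares.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
importances = permutation_importance(lambda A: A @ coef, X, y)
print(importances)
```

The iterations over features are independent of each other, which is what makes the parallel execution mentioned in the list straightforward.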

**Confidence interval for predictions** (in support) of the algorithms:

- For a model-agnostic solution: Bootstrap 632+ [Efron 1997] + cross_val_predict.
- In specific models it can be implemented more efficiently. Right now, the only API is return_std for some models, which may not be well suited.
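
To give a feel for the model-agnostic route, here is a sketch that uses a plain percentile bootstrap, which is simpler than the 632+ estimator cited above and is not a substitute for it. The `fit` callable, the function names and the synthetic data are assumptions made for the example.

```python
import numpy as np


def bootstrap_prediction_interval(fit, X, y, X_new, n_boot=200, alpha=0.1, seed=0):
    """Percentile interval of predictions over models refit on resampled data.

    `fit(X, y)` must return a callable that maps inputs to predictions.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    preds = np.empty((n_boot, X_new.shape[0]))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # sample rows with replacement
        predict = fit(X[idx], y[idx])
        preds[b] = predict(X_new)
    lower = np.quantile(preds, alpha / 2, axis=0)
    upper = np.quantile(preds, 1 - alpha / 2, axis=0)
    return lower, upper


def fit_least_squares(X, y):
    """Ordinary least squares, returned as a predict function."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return lambda A: A @ coef


# Synthetic data: intercept column plus one feature, with Gaussian noise.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(100), rng.uniform(-1, 1, size=100)])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.3, size=100)
X_new = np.array([[1.0, 0.0], [1.0, 0.9]])

lower, upper = bootstrap_prediction_interval(fit_least_squares, X, y, X_new)
print(lower, upper)
```

Note that this interval reflects the variability of the fitted model only. It does not include the noise of an individual new observation, which is one reason a dedicated API may be needed.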

**New GBT model**:

- Binning-based fast training algorithm
- Richer losses (Poisson, Gamma, GLM+GBRT).
- Quantile regression
- Assuming that the new code is simpler than our existing implementation, it would make it possible for advanced users to build business-specific regularization logic into the code of the trees (but with no guarantee of stability of the internal API).
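
For the quantile regression item, the loss usually minimized is the pinball (quantile) loss. The sketch below only illustrates that loss on a toy constant predictor; it is not the planned implementation, and the function name `pinball_loss` is an illustrative choice.

```python
import numpy as np


def pinball_loss(y_true, y_pred, quantile):
    """Mean pinball (quantile) loss for a target quantile in (0, 1)."""
    diff = np.asarray(y_true) - np.asarray(y_pred)
    # Under-predictions cost `quantile`, over-predictions cost `1 - quantile`.
    return np.mean(np.maximum(quantile * diff, (quantile - 1) * diff))


y = np.arange(1, 100, dtype=float)           # 1, 2, ..., 99
candidates = np.arange(0, 101, dtype=float)  # constant predictions to try

losses = [pinball_loss(y, c, quantile=0.9) for c in candidates]
best = candidates[int(np.argmin(losses))]
print(best)
```

The constant that minimizes the loss is the 0.9 quantile of the sample (90 here), which is why a linear or boosted model trained with this loss estimates a conditional quantile rather than a conditional mean.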

**Quantile regression** for linear and GBRT models (check reference on implicit quantile networks)

**Better missing data handling**.

- Special bin for missing data in KBinsDiscretizer
- A specific bin in decision trees
- More generally, attention to handling missing data here and there in scikit-learn
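
The "special bin" idea can be illustrated with a small standalone discretizer. This is a sketch under stated assumptions: the helper names `fit_bin_edges` and `transform_with_missing_bin` are hypothetical, quantile-based edges are one choice among several, and nothing here reflects how KBinsDiscretizer or the tree code would actually implement it.

```python
import numpy as np


def fit_bin_edges(column, n_bins=4):
    """Quantile-based inner bin edges computed from the non-missing values."""
    finite = column[~np.isnan(column)]
    inner_quantiles = np.linspace(0, 1, n_bins + 1)[1:-1]
    return np.quantile(finite, inner_quantiles)


def transform_with_missing_bin(column, edges):
    """Map values to bins 0..len(edges); NaN goes to bin len(edges) + 1."""
    missing = np.isnan(column)
    bins = np.searchsorted(edges, column, side="right")
    bins[missing] = len(edges) + 1  # dedicated bin for missing values
    return bins


x = np.array([0.1, 0.4, np.nan, 0.9, 0.35, np.nan, 0.7, 0.2])
edges = fit_bin_edges(x, n_bins=3)
binned = transform_with_missing_bin(x, edges)
print(edges, binned)
```

Because missing values end up in their own integer bin, a downstream model can learn a specific behaviour for them instead of relying on imputation.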

**Better categorical encoding**:

- Target encoding/Impact encoding.
- Stateless one-hot encoder using hashing.
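
Both encodings can be sketched in a few lines of plain Python. The function names, the additive smoothing toward the global mean and the choice of SHA-256 as the hash are assumptions made for this illustration, not decisions recorded in the minutes.

```python
import hashlib
from collections import defaultdict


def fit_target_encoding(categories, targets, smoothing=1.0):
    """Map each category to its mean target, shrunk toward the global mean."""
    global_mean = sum(targets) / len(targets)
    sums, counts = defaultdict(float), defaultdict(int)
    for cat, target in zip(categories, targets):
        sums[cat] += target
        counts[cat] += 1
    encoding = {
        cat: (sums[cat] + smoothing * global_mean) / (counts[cat] + smoothing)
        for cat in counts
    }
    return encoding, global_mean


def hashed_one_hot(category, n_features=8):
    """Stateless one-hot: the column index comes from a stable hash."""
    digest = hashlib.sha256(category.encode("utf-8")).digest()
    index = int.from_bytes(digest[:4], "little") % n_features
    row = [0] * n_features
    row[index] = 1
    return row


# Made-up data, for illustration only.
cities = ["paris", "paris", "lyon", "lyon", "lyon", "nice"]
clicked = [1, 0, 1, 1, 1, 0]

encoding, fallback = fit_target_encoding(cities, clicked)
print(encoding)
print(encoding.get("unseen city", fallback))  # unseen categories get the global mean
print(hashed_one_hot("paris"))
```

The hashed encoder needs no fitted vocabulary, at the price of possible collisions between categories. The target encoder does hold state, and it should be fitted on training folds only so that the target does not leak into validation data.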

**Callback and logging** (interruption) to monitor progress.

**General algorithm computational optimization**:

- Better Cython (using prange and direct BLAS calls through scipy).
- Continuous improvement of parallel single machine (e.g. better oversubscription handling when composing thread runtimes).
- Distributed computing: ongoing effort with dask / dask-ml developers to improve the ability to run scikit-learn efficiently on shared and distributed computing resources and progressively explore how best to deal with high data volume scenarios (out-of-core / cluster partitioned data)
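
Oversubscription typically arises when several parallel workers each start a multi-threaded routine (BLAS or OpenMP, for example), so that the total thread count exceeds the number of CPUs. One simple heuristic is to divide the CPUs between the workers. The sketch below illustrates that heuristic only; the helper name `threads_per_worker` is hypothetical and this is not a description of what scikit-learn or its dependencies will do.

```python
import os


def threads_per_worker(n_workers, n_cpus=None):
    """Split the available CPUs evenly between workers, at least 1 each."""
    if n_workers < 1:
        raise ValueError("n_workers must be >= 1")
    if n_cpus is None:
        n_cpus = os.cpu_count() or 1
    return max(1, n_cpus // n_workers)


# With 8 CPUs, the total thread count stays at 8 until the workers
# alone outnumber the CPUs.
for workers in (1, 4, 16):
    inner = threads_per_worker(workers, n_cpus=8)
    print(f"{workers} workers x {inner} threads = {workers * inner} threads")
```

The computed value would then have to be passed to each worker's thread runtime, for instance through an environment variable such as `OMP_NUM_THREADS` for OpenMP-based code.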

**Sample_props**.

**Feature names**: get_feature_name as added sugar to coefficient and feature importance.

**Fostering community and communication**: in parallel with the efforts listed above, fostering technical communication across the community, including developers in partner teams, is crucial. For this, the foundation should dedicate resources to organizing technical sprints, open to any contributor with enough technical expertise and time to contribute meaningfully. A **monthly open sprint** in Paris will be considered, with at least one of the foundation engineers present. Additionally, we will give **open technical training** (or tutorials) to develop a technically competent community. We will distinguish three kinds of sprints:

- **Technical contribution sprints**, where the goal is to contribute to the scikit-learn codebase. Participation in such a sprint will be limited to people with experience in contributing to Python data tools
- **Training sprints**,
- **Usecase sprints**,
