TECHNICAL COMMITTEE / October 11, 2022
From the discussion during the technical committee meeting, the scikit-learn Consortium at Inria defined the following list of priorities for the coming year:
- High priority: Continue the effort on project maintenance to keep the target of two releases per year (+ bugfix releases).
- 1.2 in progress (planned for November 2022)
- Refactor the common tests to avoid unintentionally missing estimators and to separate hard constraints from scikit-learn internal conventions
- High priority: Improve the feature names story:
- Better handling of feature names between meta- and base-estimators: https://github.com/scikit-learn/scikit-learn/issues/21599#issuecomment-976480476
- Use feature names to specify categorical variables in HistGradientBoostingClassifier/Regressor: related to https://github.com/scikit-learn/scikit-learn/issues/18894
- Feature names propagation embedded in the data structure returned by transformers in pipelines (pandas-in / pandas-out): https://github.com/scikit-learn/enhancement_proposals/pull/48
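As a point of reference for the pandas-in / pandas-out item above, a minimal sketch of the intended usage; it assumes the proposed `set_output` API (shipped in scikit-learn >= 1.2) and illustrative column names:

```python
# Sketch of feature-name propagation through a transformer; assumes the
# pandas-out API (set_output) proposed in enhancement_proposals#48 and
# available from scikit-learn 1.2 onwards.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

X = pd.DataFrame({
    "age": [25, 32, 47],          # illustrative columns
    "city": ["Paris", "Lille", "Lyon"],
})
preprocessor = ColumnTransformer([
    ("num", StandardScaler(), ["age"]),
    ("cat", OneHotEncoder(sparse_output=False), ["city"]),
])

# Already available: string feature names out of the fitted transformer.
preprocessor.fit(X)
print(preprocessor.get_feature_names_out())

# Pandas-in / pandas-out behaviour: transformers return a DataFrame that
# carries the feature names along the pipeline.
preprocessor.set_output(transform="pandas")
print(preprocessor.fit_transform(X).columns)
```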
- Finalize the implementation of Impact Coding in https://github.com/scikit-learn/scikit-learn/pull/17323 and better document the pros and cons of each approach in examples.
- Continue developments of the model inspection tools (and related features):
- Medium priority: Write a blog post comparing external feature importance methods: https://github.com/scikit-learn/blog/pull/139, https://github.com/glemaitre/eupython_2022/blob/main/notebooks/plot_shap.ipynb (or contribute doc / fixes to SHAP?)
- TODO: evaluate the feasibility of contributing to help maintenance of SHAP or whether the best technical solution is to reimplement a subset of SHAP in scikit-learn (as SHAP needs to access private attributes of estimators, which burdens its maintenance)
- Extend PDP tools to support categorical variables (almost ready): https://github.com/scikit-learn/scikit-learn/issues/14969, https://github.com/scikit-learn/scikit-learn/pull/18298
- Expose the computation of Mean Decrease in Impurity and decision effects in trees on test samples, to be able to efficiently compute both local and global explanations for trees (alternative to TreeSHAP and SAGE): see Local MDI (https://arxiv.org/abs/2111.02218) and Saabas' method (https://github.com/andosa/treeinterpreter)
- Cross-reference inspection tools from external libraries such as SHAP, ELI5, InterpretML, from the documentation and examples in scikit-learn.
- [Idea to be explored] Relate permutation importance and test-set MDI to SAGE for global explanations in an example or in the documentation to highlight pros / cons / pitfalls (assuming SAGE is easy / stable enough to install and be a dependency of our doc CI)
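For context on the inspection items above, a minimal sketch of the existing entry points (permutation importance and partial dependence plots); the dataset and feature choices are only illustrative:

```python
# Minimal usage sketch of the current inspection tools; categorical support
# for partial dependence plots is the part still under development.
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import PartialDependenceDisplay, permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X_train, y_train)

# Model-agnostic global importances, computed on held-out data.
result = permutation_importance(model, X_test, y_test, n_repeats=5, random_state=0)
print(sorted(zip(result.importances_mean, X.columns), reverse=True)[:3])

# Partial dependence plots for two numerical features.
PartialDependenceDisplay.from_estimator(model, X_test, features=["bmi", "s5"])
```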
- Improve documentation with extra examples and topic-based discussions:
- Operationalization of models / MLOps
- Other packages to use
- Good practices (we are an entry point of the community)
- Model serialization options (pickle vs skops vs ONNX); see the persistence sketch after this list
- Version control for reproducible retraining
- Automation (CI / CD)
- Long blocks of code should be in importable Python modules with tests, not in notebooks
- Add side boxes in our doc / examples on the production logic (it may differ from exploration mode)
- [Idea to be explored] Declarative construction of pipelines (pros / cons)
- Annotations for models' hyperparameters: https://github.com/scikit-learn/scikit-learn/pull/17929
- First step implemented via programmatic hyperparameter declaration.
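A minimal persistence sketch to contrast the serialization options above. Only the joblib/pickle path is shown as runnable code; the skops and ONNX calls in the comments refer to external projects and should be checked against their own documentation:

```python
# Sketch of the persistence options to contrast in the documentation.
# joblib/pickle is the built-in path; skops and ONNX are external projects.
import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Pickle-based persistence: simple, but only safe with trusted files and
# with the same library versions at load time.
joblib.dump(model, "model.joblib")
restored = joblib.load("model.joblib")

# skops: pickle-free persistence, e.g. skops.io.dump(model, "model.skops")
# ONNX: export a version-independent compute graph, e.g. via skl2onnx.to_onnx
```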
- Improving performance and scalability
- Comparative benchmark to identify how scikit-learn’s most popular models perform compared to other libraries (scikit-learn-intelex, XGBoost, lightgbm, ONNXRuntime, etc.)
- https://github.com/mbatoul/sklearn_benchmarks (in progress)
- Continue work on making scikit-learn estimators more scalable w.r.t. the number of CPUs, in particular with Cython optimizations for pairwise distances together with the argmin/argpartition pattern used for k-nearest neighbors and many other scikit-learn estimators
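A small sketch of the pairwise-distance + argmin pattern mentioned in the previous item, using the public `pairwise_distances_argmin_min` helper; the array sizes are illustrative:

```python
# The pattern the Cython back-end accelerates: for each query point, find the
# closest reference point without materializing the full distance matrix.
import numpy as np
from sklearn.metrics import pairwise_distances_argmin_min

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 50))   # query points
Y = rng.normal(size=(1_000, 50))    # reference points (e.g. cluster centers)

# Index of and distance to the closest row of Y for each row of X.
indices, distances = pairwise_distances_argmin_min(X, Y)
print(indices.shape, distances.shape)
```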
- DOC and tools: safer recommendation for the right metrics for a given y_test.
- Improve support for quantification of uncertainties in predictions and calibration measures
- High priority: Expected Calibration Error for classification: https://github.com/scikit-learn/scikit-learn/pull/11096
- Improve QuantileRegressor for linear models (sparse support, alternative penalties and related solvers; see the sketch after this list):
- Sparse input data (in progress) https://github.com/scikit-learn/scikit-learn/pull/21086
- Calibration reporting tool and documentation for quantile regression models (quantile loss in sklearn.metrics, average quantile coverage)
- Isotonic Calibration should not impact rank-based metrics: https://github.com/scikit-learn/scikit-learn/issues/16321
- New lead, centered isotonic regression: https://github.com/scikit-learn/scikit-learn/pull/21454
- Start to brainstorm on how to represent a distribution over predictions
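A minimal sketch of the quantile-regression and calibration-reporting items above, using `QuantileRegressor` and `mean_pinball_loss` (both already available); the synthetic heteroscedastic data is only illustrative:

```python
# Linear quantile regression plus the coverage / pinball-loss checks that a
# calibration report for quantile models could include.
import numpy as np
from sklearn.linear_model import QuantileRegressor
from sklearn.metrics import mean_pinball_loss

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 1))
y = 2.0 * X.ravel() + rng.normal(scale=1 + X.ravel(), size=500)  # heteroscedastic noise

# Fit one model per quantile to get a 10%-90% prediction interval.
models = {q: QuantileRegressor(quantile=q, alpha=0, solver="highs").fit(X, y)
          for q in (0.1, 0.5, 0.9)}
pred = {q: m.predict(X) for q, m in models.items()}

# Empirical coverage of the interval and pinball loss of the median model.
coverage = np.mean((pred[0.1] <= y) & (y <= pred[0.9]))
print(f"coverage of the 10-90% interval: {coverage:.2f}")
print(f"pinball loss at q=0.5: {mean_pinball_loss(y, pred[0.5], alpha=0.5):.2f}")
```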
- Compliance tests for ONNX exports from https://github.com/onnx/sklearn-onnx
- Improve the default solver in linear models:
- Investigate the possibility of automatically using a data-derived preconditioner for the solver, in particular for smooth solvers in LogisticRegression / PoissonRegressor, etc.: https://github.com/scikit-learn/scikit-learn/pull/15583
- Re-evaluate the choice of the default values of max_iter and tol if necessary.
- More flexible support for alternative input data container types:
- Move forward with Array API spec (https://data-apis.org) interoperability in more scikit-learn models (see the sketch after this list):
- PCA
- feature-wise preprocessing
- scoring metrics
- Evaluate impact of Array API spec on:
- handling in-device cross-validation / RNG
- handling lazy evaluation for dask and jax
- Extend the effort to the dataframe interop API? https://data-apis.org/blog/dataframe_protocol_rfc/
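A sketch of the intended Array API usage for the items above. The `array_api_dispatch` configuration flag is experimental (it may require the `array-api-compat` helper package) and estimator coverage such as PCA is still being extended, so treat this as the target usage rather than stable API:

```python
# Target usage of Array API interoperability: the same estimator code runs on
# any array library implementing the spec (NumPy here, CuPy / torch later).
import numpy as np
from sklearn import config_context
from sklearn.decomposition import PCA

X = np.random.default_rng(0).normal(size=(100, 20))

with config_context(array_api_dispatch=True):
    # With a NumPy input this behaves as usual; with e.g. a CuPy array the
    # computation would stay on the input's device.
    X_reduced = PCA(n_components=2, svd_solver="full").fit_transform(X)
print(X_reduced.shape)
```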
- Document the limitations of quantifying statistical association vs causation by introducing the concept of interventional distributions vs observational distributions
- Monotonicity constraints for Decision Trees (still in progress) https://github.com/scikit-learn/scikit-learn/pull/13649
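For reference, the `monotonic_cst` API already exists on the histogram gradient boosting estimators; the PR above extends the same idea to plain decision trees. A minimal sketch on synthetic data:

```python
# Existing monotonic-constraint API on HistGradientBoostingRegressor, which
# the decision-tree PR follows.
import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(1_000, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=1_000)

# +1: prediction must be non-decreasing in feature 0,
# -1: non-increasing in feature 1.
model = HistGradientBoostingRegressor(monotonic_cst=[1, -1]).fit(X, y)
print(model.score(X, y))
```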
- Dealing with imbalanced classification problems
- Document the pitfalls of different evaluation metrics and link to imbalanced-learn
- Implementation of Balanced Random Forest and Balanced Bagging Classifier: https://github.com/scikit-learn/scikit-learn/pull/13227. This is in progress; we are currently solving lower-level issues with the scorer API, and some internal refactoring is still needed: https://github.com/scikit-learn/scikit-learn/pull/18589. To move forward we need to compare with a classifier that can change its decision threshold: https://github.com/scikit-learn/scikit-learn/pull/16525. Once done, we should reconsider whether we need additional tools such as a generic resampler API (SLEP005).
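A small sketch of the decision-threshold point above: with an imbalanced target, moving the threshold away from the default 0.5 of `predict` can matter as much as the choice of model, which is why a tunable-threshold classifier is a prerequisite for the balanced ensembles:

```python
# Manual threshold tuning on an imbalanced problem; synthetic data only.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5_000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
clf = LogisticRegression(max_iter=1_000).fit(X_train, y_train)

proba = clf.predict_proba(X_test)[:, 1]
for threshold in (0.5, 0.2, 0.1):
    pred = (proba >= threshold).astype(int)
    print(threshold, balanced_accuracy_score(y_test, pred))
```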
- Quantification of fairness issues and potential mitigation
- Document fairness assessment metrics (a minimal group-wise sketch follows this list)
- Example to raise awareness on the issues of fairness, possibly adapted from https://github.com/TwsThomas/fairness/blob/master/fairness%20adult_sgd.ipynb
- Work on sample props (SLEP 0006) as a prerequisite to go further (under review + prototype implementation).
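A minimal group-wise sketch for the fairness-assessment item above, using a hypothetical binary sensitive feature; dedicated tooling such as fairlearn offers the same pattern with more structure:

```python
# Compare a metric across the values of a (hypothetical) sensitive feature.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

X, y = make_classification(n_samples=2_000, random_state=0)
sensitive = np.random.default_rng(0).integers(0, 2, size=len(y))  # illustrative group label
clf = LogisticRegression(max_iter=1_000).fit(X, y)
pred = clf.predict(X)

for group in (0, 1):
    mask = sensitive == group
    print(f"group {group}: recall = {recall_score(y[mask], pred[mask]):.2f}")
```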
- Developer API: make it easier for third-party developers by separating out the non-user-facing API that is nevertheless public (tested + backward compatible)
- Tutorial / guide on various strategies to assess certainty in predictions: impact of the choice of the loss function (e.g. MSE, pinball, Poisson), stability to resampling (e.g. using the bagging meta-estimators), Gaussian process regression with covariance estimation, and pointers to other external resources (e.g. conformal predictions, explicit Bayesian posterior modeling…)
- Programmatically defining good default values / starting points for hyperparameter grids.
- Consider whether survival analysis and training models on censored data should be tackled in scikit-learn:
- Organize a workshop to improve our understanding of the current ecosystem.
- Acknowledge in the documentation that this problem exists
- Maybe: an example to educate on the biases introduced by censoring and point to proper tools, such as the lifelines and scikit-survival projects.
- Consider contributing to the wider ecosystem on survival or make an example in scikit-learn using the “Poisson trick”
- Programmatic way to specify hyperparameter search without param name string mangling with `__`
- Discussed in core devs meeting
- A SLEP is required: https://github.com/scikit-learn/scikit-learn/pull/21784
- Related to `__sk_clone__` proposal: https://github.com/scikit-learn/scikit-learn/issues/21838
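For reference, the current string-based addressing that this item aims to replace: nested parameters are reached by mangling step and parameter names with `__`:

```python
# Current `__`-mangled parameter naming in hyperparameter search.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1_000))

# The step name "logisticregression" and the parameter "C" are glued with __.
param_grid = {"logisticregression__C": [0.1, 1.0, 10.0]}
search = GridSearchCV(pipe, param_grid, cv=3)

X, y = load_iris(return_X_y=True)
search.fit(X, y)
print(search.best_params_)
```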
- Callbacks and logging to monitor, interrupt, and checkpoint the fitting loop (a manual sketch of the pattern follows this list). One application would be to better integrate with (internal and external) hyper-parameter search strategies that can leverage checkpointing to make model selection more resource efficient.
- This can be useful for teaching and general UX (progress bars that work on parallel sub-tasks, even in notebooks)
- This can be important for MLOps (monitoring, snapshotting model for inspection, etc.)
- It can help us gain agility during development (write better tests for the impact of convergence criteria, easier convergence debugging)
- Related PR: https://github.com/scikit-learn/scikit-learn/pull/16925
New prototype in
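A manual version of the pattern a callback API would streamline: monitoring and checkpointing an iterative fit from the outside with `partial_fit`; file names and data are illustrative:

```python
# Manual monitoring + checkpointing loop; a callback API would let estimators
# expose these hooks without requiring partial_fit or external loops.
import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=10_000, random_state=0)
clf = SGDClassifier(random_state=0)

for epoch in range(10):
    clf.partial_fit(X, y, classes=[0, 1])
    score = clf.score(X, y)
    print(f"epoch {epoch}: training accuracy {score:.3f}")   # progress monitoring
    joblib.dump(clf, f"checkpoint_{epoch}.joblib")           # checkpointing
```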
- Consider allowing users to pass custom loss functions, in particular for Histogram Gradient-Boosting (maybe without guarantees on backward compat).
- A common loss module is a first step: https://github.com/scikit-learn/scikit-learn/pull/20811 (under review)
Longer term: Big picture tasks which require more thinking
- MLOps: Model auditing and data auditing
- Improve UX via an HTML repr for models with diagnostics and recorded fit-time warnings (see the sketch of the existing HTML repr after this list).
- Model auditing tools that output HTML reprs.
- Talk to various people to understand their needs and practices
- Recommendation for statistical tests for distribution drift
- Connect with skops for Model cards generation / documentation / template: https://github.com/skops-dev/skops
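A sketch of the existing HTML representation that the auditing ideas above would build on: today it renders the estimator structure, and the proposal is to enrich it with fit-time diagnostics. Column names are illustrative:

```python
# Existing HTML repr of a composite estimator.
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.utils import estimator_html_repr

model = make_pipeline(
    ColumnTransformer([
        ("num", StandardScaler(), ["age"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
    ]),
    LogisticRegression(),
)

# In a notebook the diagram is displayed automatically; this returns the
# underlying HTML snippet.
html = estimator_html_repr(model)
print(html[:200])
```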
- Decide whether survival analysis models (see https://data.princeton.edu/wws509/notes/c7.pdf) should be included in scikit-learn or if they should be considered out of the scope of the project.
- Survival analysis tools need to go beyond point-wise predictions, and this might be more generally useful in scikit-learn, possibly for uncertainty quantification in predictions.
- Time series forecasting
- Make an example of basic time-series windowed feature engineering and cross-validation based model evaluation, using Olivier’s tutorial
- https://github.com/ogrisel/euroscipy-2022-time-series
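A minimal sketch of the windowed feature engineering and time-aware cross-validation such an example would demonstrate; the series and lag choices are only illustrative:

```python
# Lagged / rolling features plus splits that never train on the future.
import pandas as pd
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# Illustrative hourly series with lagged values as features.
index = pd.date_range("2022-01-01", periods=500, freq="h")
demand = pd.Series(range(500), index=index, dtype=float)
features = pd.DataFrame({
    "lag_1": demand.shift(1),
    "lag_24": demand.shift(24),
    "rolling_mean_24": demand.shift(1).rolling(24).mean(),
}).dropna()
target = demand.loc[features.index]

cv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(HistGradientBoostingRegressor(), features, target, cv=cv)
print(scores.mean())
```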
- Explore API to simplify data wrangling (outside of scikit-learn)
Community: efforts on the community side
- Continue regular technical sprints and topic focused workshops (possibly by inviting past sprint contributors to try to foster a long term relationship and hopefully recruit new maintainers).
- Better preparation of issues
- Plan further in advance
- Fewer people in the sprints (to be able to provide better mentoring)
- Make the consortium meetings more transparent and inclusive:
- Invite Adrin and other advisory board people to the meetings
- Make the weekly tasking more visible
- Renew the organization of beginners’ workshops for prospective contributors, probably before a sprint.
- Organize a workshop on statistical topics (causal inference and calibration), possibly followed by a 2-day sprint
- Organize a workshop on our software-engineering practices, some ideas of topics:
- CI and CD practices, e.g.:
- optional testing on float32 and round-robin seed setting
- nightly build and version pinning rationales
- local development practices, e.g.:
- pre-commit config
- code review guidelines
- performance troubleshooting and improvements
- profiling, benchmarking
- Conduct a new edition of the 2013 survey among all scikit-learn users.