From the discussion during the technical committee meeting of December 3rd, 2021, the scikit-learn Consortium defined the following list of priorities for 2022:

  • Continue development of the model inspection tools (and related features)
  • Operationalization of models / MLOps
    • Other packages to use
    • Good practices (we are an entry point of the community)
      • Version control
      • Automate (CI / CD)
      • Long blocks of code should be in importable Python modules with tests, not in notebooks
    • [Idea to be explored] Declarative construction of pipelines (pros / cons)
    • Add side boxes in our documentation / examples about production logic (which may differ from exploration mode)
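The declarative-pipeline idea mentioned above could, for instance, look like the following minimal sketch. `SPEC` and `build_pipeline` are hypothetical names for illustration only; this is not an existing scikit-learn API:

```python
# Hypothetical sketch of declarative pipeline construction: the pipeline is
# described as plain data (which could live in YAML/JSON under version control)
# and instantiated by a small builder. SPEC and build_pipeline are made-up names.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

SPEC = [
    {"name": "scale", "class": StandardScaler, "params": {}},
    {"name": "clf", "class": LogisticRegression, "params": {"max_iter": 1000}},
]

def build_pipeline(spec):
    """Instantiate a scikit-learn Pipeline from the declarative step list."""
    return Pipeline([(s["name"], s["class"](**s["params"])) for s in spec])

pipe = build_pipeline(SPEC)
print([name for name, _ in pipe.steps])  # ['scale', 'clf']
```

One pro of this style is that the model definition becomes diffable, versionable configuration; one con is that it hides Python-level flexibility (custom objects, conditional logic).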
  • Annotation regarding models’ hyperparameters: https://github.com/scikit-learn/scikit-learn/pull/17929
  • Improving performance and scalability
  • Comparative benchmark to identify how scikit-learn’s most popular models perform compared to other libraries (scikit-learn-intelex, XGBoost, lightgbm, ONNXRuntime, etc.)
  • Continue work on making scikit-learn estimators more scalable with respect to the number of CPUs, in particular with Cython optimizations for the pairwise-distances + argmin/argpartition pattern used for k-nearest neighbors and many other scikit-learn estimators
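The pattern in question is visible in the existing public helper `pairwise_distances_argmin_min`, which fuses the distance computation with the argmin reduction instead of materializing the full distance matrix:

```python
# The pairwise-distances + argmin pattern: for each query point, find the index
# of its nearest candidate and the distance to it, without storing all pairs.
import numpy as np
from sklearn.metrics import pairwise_distances_argmin_min

X = np.array([[0.0, 0.0], [10.0, 10.0]])             # query points
Y = np.array([[1.0, 0.0], [9.0, 9.0], [5.0, 5.0]])   # candidate neighbors

idx, dist = pairwise_distances_argmin_min(X, Y)
print(idx)   # [0 1]: nearest candidates for each query point
print(dist)  # corresponding distances
```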
  • Documentation and tools: safer recommendations for choosing the right metrics for a given y_test.
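One possible building block for such a tool is inspecting the target type of y_test before suggesting metrics, using the existing `type_of_target` utility. The `suggest_metrics` function and the `SUGGESTED` mapping below are purely illustrative, not an official scikit-learn recommendation table:

```python
# Sketch: inspect the target type of y_test, then suggest plausible metrics.
# SUGGESTED and suggest_metrics are hypothetical names for illustration.
from sklearn.utils.multiclass import type_of_target

SUGGESTED = {  # illustrative defaults only, not an official mapping
    "binary": ["accuracy", "roc_auc", "average_precision"],
    "multiclass": ["accuracy", "balanced_accuracy", "f1_macro"],
    "continuous": ["r2", "neg_mean_absolute_error"],
}

def suggest_metrics(y):
    """Return the detected target type and a list of candidate metrics."""
    kind = type_of_target(y)
    return kind, SUGGESTED.get(kind, [])

print(suggest_metrics([0, 1, 1, 0]))      # ('binary', [...])
print(suggest_metrics([0.2, 1.7, 3.1]))   # ('continuous', [...])
```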
  • Improve support for quantification of uncertainties in predictions and calibration measures
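As a starting point for calibration measures, scikit-learn already exposes `calibration_curve`, which compares predicted probabilities to observed frequencies per bin:

```python
# Reliability-curve check: per probability bin, compare the mean predicted
# probability with the observed fraction of positives.
import numpy as np
from sklearn.calibration import calibration_curve

y_true = np.array([0, 0, 0, 1, 1, 1, 1, 1])
y_prob = np.array([0.1, 0.2, 0.3, 0.6, 0.7, 0.8, 0.9, 0.95])

prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=2)
print(prob_true)  # observed fraction of positives per bin
print(prob_pred)  # mean predicted probability per bin
```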
  • Improve the default solver in linear models:
  • Investigate the possibility of automatically using a data-derived preconditioner, in particular for smooth solvers in LogisticRegression / PoissonRegressor, etc.:
    https://github.com/scikit-learn/scikit-learn/pull/15583
  • Re-evaluate the choice of the default value of max_iter if necessary.
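A sketch of the kind of evidence such a re-evaluation relies on: detecting when the solver exhausts max_iter without converging (forced deliberately here by setting max_iter far too low):

```python
# Turn ConvergenceWarning into an error to detect non-convergence at fit time.
import warnings
from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

with warnings.catch_warnings():
    warnings.simplefilter("error", ConvergenceWarning)
    try:
        LogisticRegression(max_iter=2).fit(X, y)  # deliberately too low
        converged = True
    except ConvergenceWarning:
        converged = False

print("converged:", converged)  # False: 2 iterations are not enough here
```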
  • Quantification of fairness issues and potential mitigation
    • Document fairness assessment metrics
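A minimal sketch of what a fairness assessment metric could look like: comparing a score (here, accuracy) across subgroups defined by a sensitive attribute. `groupwise_accuracy` is a hypothetical helper for illustration, not an existing scikit-learn function:

```python
# Hypothetical fairness assessment sketch: per-group accuracy and the largest
# gap between groups. groupwise_accuracy is a made-up name for illustration.
import numpy as np

def groupwise_accuracy(y_true, y_pred, group):
    """Return accuracy per subgroup and the max gap between subgroups."""
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    scores = {
        g: float(np.mean(y_true[group == g] == y_pred[group == g]))
        for g in np.unique(group)
    }
    gap = max(scores.values()) - min(scores.values())
    return scores, gap

scores, gap = groupwise_accuracy(
    y_true=[1, 0, 1, 1, 0, 1],
    y_pred=[1, 0, 0, 1, 1, 1],
    group=["a", "a", "a", "b", "b", "b"],
)
print(scores, gap)  # both groups get 2/3 correct here, so the gap is 0.0
```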


Longer term: big-picture tasks that require more thinking

  • MLOps: Model auditing and data auditing
  • Improve UX via an HTML repr for models, with diagnostics including recorded fit-time warnings.
  • Model auditing tools that output HTML reprs.
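For context, such tools would extend the HTML repr that estimators already expose behind `set_config(display="diagram")`:

```python
# Existing HTML repr: with display="diagram", estimators render themselves as
# an HTML fragment, which is what notebooks display.
from sklearn import set_config
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

set_config(display="diagram")
pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])

html = pipe._repr_html_()  # the HTML fragment notebooks render
print(len(html) > 0)
```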
  • Talk to various people to understand their needs and practices
  • Recommendation for statistical tests for distribution drift
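One concrete test such a recommendation could point to is the two-sample Kolmogorov-Smirnov test from SciPy, applied per feature (shown here on synthetic data with a deliberate mean shift):

```python
# Two-sample KS test on one feature: compare its training-time distribution
# against its production-time distribution to flag drift.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
reference = rng.normal(loc=0.0, scale=1.0, size=1000)  # training-time feature
shifted = rng.normal(loc=0.5, scale=1.0, size=1000)    # production-time feature

stat, p_value = ks_2samp(reference, shifted)
drifted = p_value < 0.01
print(f"KS statistic={stat:.3f}, p={p_value:.2e}, drift suspected: {drifted}")
```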
  • Tool for Model cards generation / documentation / template
  • Consider whether survival analysis and training models on censored data should be tackled in scikit-learn:
    • Organize a workshop to improve understanding of the current ecosystem.
    • Mention in the documentation that this problem exists.
    • Maybe: an example to educate on the biases introduced by censoring and point to proper tools, such as the lifelines and scikit-survival projects.
    • Open a discussion on whether techniques such as survival forests / adapted loss functions for gradient boosting (need for sample props?) should be included in scikit-learn (see https://data.princeton.edu/wws509/notes/c7.pdf) or whether they should be considered out of the scope of the project.
    • Survival analysis tools need to go beyond point-wise predictions; this might be more generally useful in scikit-learn, possibly for uncertainty quantification in predictions.
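A tiny synthetic illustration of the censoring bias such an example could cover: treating censored durations as observed event times systematically underestimates the true mean survival time.

```python
# Censoring bias illustration: observed time = min(event time, censoring time),
# so a naive mean over observed times underestimates the true mean.
import numpy as np

rng = np.random.default_rng(42)
true_times = rng.exponential(scale=10.0, size=10_000)    # true event times
censor_times = rng.exponential(scale=10.0, size=10_000)  # end of follow-up

observed = np.minimum(true_times, censor_times)  # what a censored study records
naive_mean = observed.mean()                     # wrong: ignores censoring
true_mean = true_times.mean()                    # ~10 by construction

print(f"true mean ~{true_mean:.1f}, naive mean ~{naive_mean:.1f}")
```

Here the naive estimate is roughly half the true mean, since the minimum of two independent exponentials with scale 10 is exponential with scale 5.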
  • Time series forecasting
    • Organize a workshop to better understand the API and technical choices of the different Python libraries, to know which ones we should illustrate and point to in examples and documentation.
    • Maybe basic time-series windowed feature engineering and cross-validation based model evaluation
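A sketch of the basic windowed feature engineering and time-aware cross-validation mentioned above, built on the existing `TimeSeriesSplit` (the lag-feature construction is a toy illustration):

```python
# Lag features + time-ordered cross-validation: each training fold strictly
# precedes its test fold, so no information leaks from the future.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

series = np.arange(20, dtype=float)               # toy univariate series
X = np.column_stack([series[1:-1], series[:-2]])  # lags 1 and 2 of y[t]
y = series[2:]                                    # target: y[t]

for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X):
    assert train_idx.max() < test_idx.min()  # temporal order is respected
print("all folds respect temporal order")
```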


Community:

  • Continue regular technical sprints and topic-focused workshops (possibly by inviting past sprint contributors, to try to foster a long-term relationship and hopefully recruit new maintainers).
    • Better preparation for issues
    • Plan further in advance
    • Fewer people in the sprints (to be able to provide better mentoring)
  • Make the consortium meetings more transparent:
    • Invite Adrin and other advisory board people to the meetings
    • Make the weekly tasking more visible
  • Renew the organization of beginners’ workshops for prospective contributors, probably before a sprint.
  • Organize a workshop on uncertainty quantification and calibration, possibly followed by a two-day sprint
  • Organize a workshop on our software-engineering practices
  • Conduct a new edition of the 2013 survey of all scikit-learn users.