Technical Committee

July 4, 2019

 

Priority list for the consortium at Inria, year 2019–2020

 

Following the discussion during the technical committee meeting, the scikit-learn consortium at Inria defined the following list of priorities for the coming year:

  1. Continue the effort to help with project maintenance, so as to keep the target of two releases per year.
  2. Development of the “inspection” module (sklearn.inspection):
    1. Help finalize the pull requests for the newly introduced model inspection methods.
      https://github.com/scikit-learn/scikit-learn/issues/14969
    2. Better interoperability between pandas dataframes and scikit-learn pipelines, in particular regarding feature names.
      https://github.com/scikit-learn/scikit-learn/pull/14028
    3. Document the different methods with a User Guide section and examples that give intuitions, insights, and caveats.
    4. Cross-reference bleeding-edge inspection tools from external libraries such as SHAP and ELI5 in the scikit-learn documentation and examples. Possibly add an example comparing SHAP values to scikit-learn permutation importances (see the permutation importance / SHAP sketch after this list).
    5. Integrate guidelines on the interpretation and uncertainty of linear model coefficients as an example or tutorial in the main scikit-learn repository.
  3. Implementation of Poisson, Gamma, and Tweedie regression losses for linear models and gradient boosting trees (see the deviance sketch after this list).
    https://github.com/scikit-learn/scikit-learn/pull/14300
  4. Improvement of machine learning pipelines with support for feature names.
  5. Help finalize the missing features for the new implementation of Gradient Boosting Trees (i.e., native support for missing values, categorical data, and sparse data).
  6. Continue the effort on benchmarks and compliance tests:
    1. Integration of ONNX models (see the ONNX export sketch after this list).
    2. Make it easier to reuse the scikit-learn benchmark suite to compare with alternative implementations of the models from DAAL and RAPIDS.
  7. Develop a new resampler meta-estimator for imbalanced classification problems, by working with the community on SLEP005 (see the resampling pipeline sketch after this list).
  8. Propagation of feature names (e.g. within scikit-learn pipelines), by working with the community on SLEP008.
  9. Improve the documentation with extra examples: time series applications, model inspection, quantification of predictive uncertainty and of uncertainties on linear model coefficients, and “anti-pattern” examples.
    https://github.com/scikit-learn/scikit-learn/issues/14081
  10. Evaluate interoperability with other types of arrays (e.g. dask arrays and CuPy arrays) for some preprocessing methods, pipelines / column transformers, cross-validation, and parameter search, possibly by leveraging the newly introduced __array_function__ protocol (see the dask array sketch after this list).
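
Sketch (permutation importances vs. SHAP values): a minimal illustration of the comparison mentioned in the inspection item above, assuming the permutation_importance function from the new inspection module and the external shap package; the dataset, the model, and the use of mean absolute SHAP values as a global importance are arbitrary choices for the example.

    import numpy as np
    import shap
    from sklearn.datasets import load_diabetes
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.inspection import permutation_importance
    from sklearn.model_selection import train_test_split

    X, y = load_diabetes(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)

    # Permutation importance: drop in test score when one feature is shuffled.
    perm = permutation_importance(model, X_test, y_test, n_repeats=10,
                                  random_state=0)

    # SHAP values: per-sample additive attributions; their mean absolute value
    # gives a global importance comparable to the permutation importances.
    shap_values = shap.TreeExplainer(model).shap_values(X_test)
    mean_abs_shap = np.abs(shap_values).mean(axis=0)

    for i in np.argsort(perm.importances_mean)[::-1]:
        print(f"feature {i}: permutation={perm.importances_mean[i]:.3f}, "
              f"mean |SHAP|={mean_abs_shap[i]:.3f}")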
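
Sketch (Poisson and Gamma deviances): to make the loss choices concrete, a minimal numpy illustration of the deviances that such estimators minimize (the Tweedie deviance interpolates between them); this is only the textbook definition, not the implementation in the linked pull request.

    import numpy as np

    def poisson_deviance(y_true, y_pred):
        """Mean Poisson deviance for non-negative counts and positive predictions."""
        y_true = np.asarray(y_true, dtype=float)
        y_pred = np.asarray(y_pred, dtype=float)
        # Convention: y * log(y / mu) = 0 when y = 0.
        safe_y = np.where(y_true > 0, y_true, 1.0)
        return np.mean(2.0 * (y_true * np.log(safe_y / y_pred) - y_true + y_pred))

    def gamma_deviance(y_true, y_pred):
        """Mean Gamma deviance (Tweedie power 2) for strictly positive targets."""
        y_true = np.asarray(y_true, dtype=float)
        y_pred = np.asarray(y_pred, dtype=float)
        return np.mean(2.0 * (np.log(y_pred / y_true) + y_true / y_pred - 1.0))

    # A model predicting the mean count everywhere serves as the baseline.
    y = np.array([0, 1, 2, 4, 7])
    print(poisson_deviance(y, np.full_like(y, y.mean(), dtype=float)))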
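
Sketch (ONNX export): a minimal illustration of the ONNX integration mentioned in the benchmark item, assuming the external skl2onnx and onnxruntime packages; the model and dataset are arbitrary choices for the example.

    import numpy as np
    import onnxruntime as rt
    from skl2onnx import to_onnx
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)
    X = X.astype(np.float32)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    model.fit(X_train, y_train)

    # Convert: the training sample is only used to infer input types and shapes.
    onx = to_onnx(model, X_train[:1])

    # Run the converted model with onnxruntime and compare with scikit-learn.
    sess = rt.InferenceSession(onx.SerializeToString(),
                               providers=["CPUExecutionProvider"])
    input_name = sess.get_inputs()[0].name
    onnx_pred = sess.run(None, {input_name: X_test})[0]
    print("predictions identical:", np.array_equal(onnx_pred, model.predict(X_test)))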
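
Sketch (resampling pipeline): an illustration of the kind of resampler step discussed in SLEP005, using the existing prototype from the external imbalanced-learn package; the dataset and estimators are arbitrary choices for the example.

    from collections import Counter

    from imblearn.over_sampling import SMOTE
    from imblearn.pipeline import Pipeline  # pipeline that accepts sampler steps
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=2000, weights=[0.95, 0.05],
                               random_state=0)
    print("class counts:", Counter(y))

    # The resampler only acts on the training folds; cross-validation scores
    # are computed on untouched validation data.
    pipe = Pipeline([
        ("resample", SMOTE(random_state=0)),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    print(cross_val_score(pipe, X, y, scoring="balanced_accuracy", cv=5).mean())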
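
Sketch (dask arrays and __array_function__): a minimal illustration of the dispatch mechanism that the interoperability item could build on; this only shows the NumPy-level protocol, not any scikit-learn support.

    import dask.array as da
    import numpy as np

    # A chunked array that never needs to fit in memory at once.
    x = da.random.random((10_000, 100), chunks=(1_000, 100))

    # With NumPy >= 1.17, functions such as np.concatenate dispatch to the dask
    # implementation through __array_function__ instead of densifying the input.
    stacked = np.concatenate([x, x], axis=0)
    print(type(stacked))  # still a lazy dask array

    # Only .compute() materializes a concrete in-memory result.
    print(stacked.mean(axis=0).compute()[:5])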

We recall the list of priorities from the previous year, which will be considered for this year as well. Note that some of these points are ongoing work:

  1. Tools to compare the validity of models between scikit-learn versions.
  2. Confidence intervals for predictions: look at bootstrapping (see the bootstrap sketch after this list), and review the literature on conformal prediction (in collaboration with Léo Dreyfus-Schmidt at Dataiku, who will work on benchmarking the literature). See https://arxiv.org/abs/1604.04173 and https://cdsamii.github.io/cds-demos/conformal/conformal-tutorial.html for instance.
  3. Quantile regression for linear and GBRT models (see the quantile loss sketch after this list).
  4. Better missing data handling.
  5. Better categorical encoding.
  6. Callbacks and logging to monitor progress and allow interruption.
  7. Organise monthly technical sprints.
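
Sketch (bootstrap prediction intervals): a minimal illustration of the bootstrapping baseline mentioned in the confidence-interval item; conformal prediction, as in the linked references, would additionally calibrate the intervals on held-out residuals. The dataset, model, and number of resamples are arbitrary choices for the example.

    import numpy as np
    from sklearn.datasets import load_diabetes
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import train_test_split
    from sklearn.utils import resample

    X, y = load_diabetes(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Refit the model on bootstrap resamples of the training set and collect
    # the spread of its predictions on the test set.
    preds = []
    for b in range(200):
        Xb, yb = resample(X_train, y_train, random_state=b)
        preds.append(Ridge().fit(Xb, yb).predict(X_test))
    preds = np.array(preds)

    # 90% interval from the bootstrap distribution (this reflects model
    # variance only, not the irreducible noise in the targets).
    lower, upper = np.percentile(preds, [5, 95], axis=0)
    print("first test point:", lower[0], upper[0])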
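
Sketch (quantile loss): prediction intervals from quantile regression using the existing GradientBoostingRegressor; the priority covers extending this kind of loss to linear models and to the new histogram-based trees. The synthetic data and quantile levels are arbitrary choices for the example.

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor

    rng = np.random.RandomState(0)
    X = rng.uniform(0, 10, size=(500, 1))
    y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=500)

    # One model per target quantile: 5%, 50% (median), and 95%.
    models = {
        q: GradientBoostingRegressor(loss="quantile", alpha=q,
                                     random_state=0).fit(X, y)
        for q in (0.05, 0.5, 0.95)
    }

    X_new = np.array([[2.0], [5.0]])
    for q, m in models.items():
        print(f"q={q}: {m.predict(X_new)}")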