At the end of January 2020 a scikit-learn sprint took place in the Paris offices of Dataiku. Sonia and Léo from Dataiku deserve special thanks for taking such good care of us!

Credits @alonsosilva

We had three days of coding in excellent company, introduced by a beginner’s workshop aimed at lowering the entry cost of a first Pull Request to scikit-learn. As a side effect, the team had the opportunity to check how effective the scikit-learn installation documentation is… apparently OSes are like humans: no two are alike.

Almost 40 people joined the workshop, and around 30 took part in the development sprint.

Credits @marmochia

More than 40 Pull Requests have been merged thanks to new or confirmed contributors. The scikit-learn core developers did a great job reviewing and following up on all these contributions. Some of them attended the sprint in person; others were ready to take over remotely from the other side of the oceans (both Atlantic and Pacific!).

Contributions included, for example:

  • documentation fixes and clarifications,
  • filtering the warnings in the examples run during the documentation build,
  • checking the consistency of attributes across submodules,
  • replacing the Boston dataset with other datasets without ethical issues.
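
To give a feel for the second item, silencing a known warning in an example script can be sketched with Python’s standard `warnings` module. This is only a minimal sketch, not the actual change made during the sprint; `noisy_step` and `run_quietly` are hypothetical names, and `noisy_step` stands in for any library call that emits a warning.

```python
import warnings


def noisy_step():
    # Hypothetical stand-in for a library call that emits a warning.
    warnings.warn("this option is deprecated", FutureWarning)
    return 42


def run_quietly():
    # Suppress only FutureWarning, and only inside this block,
    # so unrelated warnings still surface elsewhere.
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", category=FutureWarning)
        return noisy_step()


result = run_quietly()
print(result)
```

Scoping the filter to a single category inside a `catch_warnings` block keeps the documentation build output clean without hiding warnings that might point to a real problem.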

What did we learn?

A dedicated introduction is really helpful for first-time contributors: thanks to the first-day introduction, they were able to start coding with a clean development environment and focus on their contribution rather than on installation problems.

Having a number of core developers available on site is also an asset: preliminary discussions help contributors better understand the scope of an issue, making the contribution more relevant and speeding up the review process.

Mining the scikit-learn issues is still not an easy task for first-time contributors. A systematic effort in labelling and triaging the issues will probably help to lower the barrier. However, this effort should be automated as much as possible, to avoid individual discouragement and to guarantee consistency.

Did you say “Sprint of the Decade”?

Credits @marmochia
The first release of scikit-learn dates back to February 1st, 2010. The sprint was scheduled to celebrate the anniversary of the project. Perhaps new core developers are hidden among all the contributors who participated in this sprint and in the others, past and future.

Ten years of scikit-learn should be remembered for the constant effort to make Machine Learning accessible to as many people as possible. This is the key to its success.

Credits @alonsosilva

Today, the goal of the Inria scikit-learn Consortium is to help secure the future of scikit-learn so that the work of all the contributors is not lost. Thanks to Inria and to all our partners, a first step has been taken to move scikit-learn from a common good to a public good.

The Consortium would like to help the community grow healthy and happy into adulthood… the teenage years are such a difficult age… Without the community there will be no future for scikit-learn: this sprint and all those that will follow are our way of affirming that.