Methodology for agents, by the scikit-learn maintainers.
probabl-ai/skills is a collection of thirteen skills that bring scikit-learn, skrub, and Skore methodology into any agentic coding tool. The agent doesn't just write a pipeline: it builds, evaluates, and iterates with the rigor your team would otherwise enforce by hand.
npx skills add github.com/probabl-ai/skills
Thirteen skills, organized the way the best data science teams already work, improved daily.
From data source to an evaluated, tested, audited learner.
The five skills that bound a single experiment end to end: declare it, evaluate it, prove the score is real, then read the report back — no leaky shortcuts.
build-ml-pipeline→Declare the pipeline from data source to predictor as a skrub DataOps graph. Stops at the declared object — no fit, split, tuning, or persistence.
evaluate-ml-pipeline→Evaluate a single sklearn-compatible learner: pick the right entry point (skore.evaluate first), the right cross-validator, and consume report metadata.
test-ml-pipeline→Router that owns the tests/ folder of an ML workspace and the experiment ↔ test pairing rule. Dispatches to a per-category subskill.
smoke-test-ml-pipeline→Diagnostic-by-construction pytest that catches the "load → featurize → split" anti-pattern by predicting on a disjoint, no-buffer slice of the real data source.
audit-ml-pipeline→Owns the audit/ folder: one # %% file per experiment that loads its skore report read-only and streams a markdown digest. Never calls evaluate or put.
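The "load → featurize → split" anti-pattern that smoke-test-ml-pipeline targets can be illustrated without any ML library. The sketch below is a minimal illustration, not the skill's actual test: the helper names (center, leaky_features, clean_features) and the toy data are hypothetical. It shows that learning a statistic before splitting lets held-out rows influence the training features.

```python
from statistics import mean


def center(values, reference):
    """Subtract the mean of `reference` from every value."""
    mu = mean(reference)
    return [v - mu for v in values]


def leaky_features(rows, n_train):
    # Anti-pattern: featurize on ALL rows, then split.
    # The held-out rows contribute to the mean used on the training rows.
    featurized = center(rows, reference=rows)
    return featurized[:n_train], featurized[n_train:]


def clean_features(rows, n_train):
    # Split first; the statistic is learned from the training slice only.
    train, test = rows[:n_train], rows[n_train:]
    return center(train, reference=train), center(test, reference=train)


# Hypothetical toy data: the last two rows play the role of unseen data.
rows = [1.0, 2.0, 3.0, 10.0, 20.0]
leaky_train, leaky_test = leaky_features(rows, n_train=3)
clean_train, clean_test = clean_features(rows, n_train=3)

# The training features differ as soon as held-out rows leak into the mean.
leak_detected = leaky_train != clean_train
```

A test built on this idea would fit on one slice and predict on a disjoint slice, failing whenever the featurization step cannot run without having seen the held-out rows.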
Source the next experiment, two ways.
A driver skill that owns journal/JOURNAL.md, plus two sourcing strategies. Pick where the next idea comes from: the audit digest, or the user.
iterate-ml-experiment→Drives the iteration loop on top of an ML workspace: owns journal/JOURNAL.md and per-experiment design notes, and dispatches to a sourcing strategy below.
iterate-from-skore→Source the next experiment by reading the audit digest at scratch/audit/<stem>/audit.md. Every issue / tip row drives a backlog item, and the agent follows the row's documentation link for the mitigation.
iterate-from-user→Source the next experiment from the user directly — free text, a scientific article URL, or a resource link (GitHub issue, spec, or reference repo).
Keep the repo clean and organized while the agent runs.
Where files live, how they're styled, which env manager is used, and the curated stack the maintainers actually reach for. The boring discipline that makes the rest possible.
organize-ml-workspace→Decide where files live: reusable code, per-experiment scripts (jupytext-style # %%), reports. One file per experiment.
python-code-style→Place the project's ruff.toml template and run ruff (lint + format) on touched files. numpydoc for docstrings.
python-env-manager→Detect the project's env manager (pixi / uv / poetry / hatch / conda / pip+venv) and issue the right install command. Defaults to pixi when bootstrapping.
data-science-python-stack→Opinionated one-library-per-job Python stack, organized into mandatory / user-choice / optional / transitive tiers.
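The detection step described for python-env-manager can be sketched as a lookup over marker files. This is a minimal sketch under stated assumptions, not the skill's actual logic: the marker files, their priority order, the chosen commands, and the function name detect_env_manager are all hypothetical. Only the fallback to pixi when nothing is found comes from the description above.

```python
from pathlib import Path

# Assumed marker files, checked in priority order (hypothetical mapping).
# Each entry: (manager name, marker files, install command).
MARKERS = [
    ("pixi", ("pixi.toml", "pixi.lock"), "pixi install"),
    ("uv", ("uv.lock",), "uv sync"),
    ("poetry", ("poetry.lock",), "poetry install"),
    ("hatch", ("hatch.toml",), "hatch env create"),
    ("conda", ("environment.yml",), "conda env create -f environment.yml"),
    ("pip+venv", ("requirements.txt",), "python -m pip install -r requirements.txt"),
]


def detect_env_manager(project_dir):
    """Return (manager, command) for the first marker file found."""
    root = Path(project_dir)
    for name, files, command in MARKERS:
        if any((root / f).exists() for f in files):
            return name, command
    # No marker found: treat the project as new and bootstrap with pixi.
    return "pixi", "pixi init"
```

Calling detect_env_manager(".") from a project root returns the manager name and the command to run. A real implementation would likely also inspect pyproject.toml, since several of these tools can be configured there.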
Any library, indexed on demand.
One skill that discovers the public API of any installed package — so the agent reads real signatures instead of hallucinating them.
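Reading real signatures from an installed package can be done with the standard library alone. The sketch below is one possible approach, not the skill's implementation: the function name public_signatures is hypothetical, and the standard json module stands in for any installed package.

```python
import importlib
import inspect


def public_signatures(module_name):
    """Map each public callable of a module to its signature string."""
    module = importlib.import_module(module_name)
    # Prefer the declared public API; otherwise fall back to non-underscore names.
    names = getattr(module, "__all__", None)
    if names is None:
        names = [n for n in dir(module) if not n.startswith("_")]
    result = {}
    for name in names:
        obj = getattr(module, name, None)
        if not callable(obj):
            continue
        try:
            result[name] = str(inspect.signature(obj))
        except (TypeError, ValueError):
            # Some builtins and C extensions expose no introspectable signature.
            result[name] = "(...)"
    return result


# The standard json module is used here only as a stand-in package.
api = public_signatures("json")
```

Each entry in api pairs a public name such as dumps with the parameter list reported by the installed version, so the agent can quote the signature instead of guessing it.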
Install the pack, then prompt the way you already do.
The skills are plain markdown files. The agent reads them as part of its session and reaches for the right one when the task fits.
Install the pack — One command into your agent’s skills directory: npx skills add github.com/probabl-ai/skills. BSD-3-Clause — fork it if you want.
Prompt the workflow, not the tool — “Build a churn model on this CSV and tell me what’s weakest.” The agent routes through the right skills.
Get a Skore report, not a notebook — Structured evaluation, fold-level diagnostics, per-slice metrics. The same report object whether you ran it or the agent did.
Iterate, audit, ship — Use the iterate-from-* skills to source the next experiment. Sync to Skore Hub or MLflow when it’s ready for production.
Agentic AI is fast. Methodology is what makes it trustworthy.
AI assistants ship scikit-learn pipelines in seconds — downloads doubled from 100M to 200M monthly in nine months. The bottleneck isn't compute, it's the absence of shared standards. Skills are how we fix that.
Want this wired into your team's workflow?
Probabl runs Forward Deployed Engineering engagements for teams putting agentic ML on rails. We'll audit your pipeline, integrate the skill pack alongside your existing stack, and pair with your team to ship production-grade methodology.