✦ YC Hackathon — Infrastructure Track Winner

Public bio data,
ready for agents.

The infrastructure layer that was missing for AI-driven biological discovery.


121M cells indexed
45 datasets
126k perturbations
15 papers vetted

The data exists. It just isn't usable.

Thousands of single-cell datasets, CRISPR screens, and perturbation studies are openly available — describing how cells respond to genetic knockouts and chemical compounds at unprecedented scale. The science is there. The access isn't.


It was never built for AI agents.

  • Fragmented — papers reference datasets loosely. Tracing a publication to its raw files requires expert knowledge most agents don't have.
  • Siloed — most datasets capture one layer (RNA, protein, or chromatin). Cross-omic connections are rarely established.
  • Unreconciled — overlapping experiments across labs aren't merged. Harmonizing them takes months of manual work.
  • Expert-gated — finding the right dataset requires knowing the right labs, databases, and search terms.
  • Error-compounding — long agent reasoning chains amplify bad data sourcing. Errors at step 1 degrade everything downstream.

Models are ready. The data infrastructure isn't.

AI models are getting dramatically more capable in biology — but intelligence alone isn't enough. The next wave of biological discovery needs data that is structured, vetted, and packaged for reasoning. Agents can now do the reconciliation labor that blocked this before — in minutes, not months.


Source, vet, preprocess, consume. In one workflow.

We built the infrastructure layer that makes public biological data usable by agents: new database paradigms, embedding-rich context, and structured links between publications and the datasets they describe.


Key capabilities.

01 — Data sourcing & derisking

Every paper is evaluated by 3 independent AI assessors in parallel — statistical rigor, biological relevance, data quality. A convergence check flags disagreement. Red flags reduce scores automatically. You know what you're building on before you build on it.
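The assessor logic above can be sketched in a few lines. This is a minimal illustration, not the production scoring code: the function name, the 0–10 scale, and the threshold and penalty values are all assumptions.

```python
from statistics import mean, pstdev

def aggregate_assessments(scores, red_flags, divergence_threshold=1.5, flag_penalty=0.5):
    """Combine scores from independent assessors (illustrative 0-10 scale).

    The convergence check flags assessor disagreement for review; each red
    flag reduces the final score. Threshold and penalty are hypothetical.
    """
    needs_review = pstdev(scores) > divergence_threshold  # convergence check
    score = mean(scores) - flag_penalty * len(red_flags)  # red flags reduce scores
    return max(score, 0.0), needs_review

# Three assessors roughly agree; one red flag lowers the mean score.
score, review = aggregate_assessments([8.0, 7.5, 8.5], red_flags=["small sample size"])
```

Running the three assessors in parallel and only aggregating afterward keeps each evaluation independent, so disagreement is a real signal rather than anchoring.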

02 — Callable data endpoints

After preprocessing, the system registers a live REST endpoint for the dataset — queryable by agents or scientists directly.

GET /api/datasets/query_vorinostat_cells?limit=50&is_negative_control=false
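An agent-side call might look like the following sketch. The base URL and helper name are hypothetical; only the endpoint path and parameters come from the example above.

```python
from urllib.parse import urlencode

# Hypothetical base URL; the per-dataset path is registered after preprocessing.
BASE = "https://api.example.com"

def dataset_query_url(endpoint: str, **params) -> str:
    """Build a query URL for a registered dataset endpoint."""
    return f"{BASE}/api/datasets/{endpoint}?{urlencode(params)}"

url = dataset_query_url("query_vorinostat_cells", limit=50, is_negative_control="false")
```

Because the endpoint is plain REST, the same URL works from an agent tool call, a notebook, or curl, with no SDK required.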

03 — Agent skills layer

21 composable skills loaded on demand — scanpy, scvi-tools, DESeq2, RNA velocity, UMAP, gene resolution, molecule lookup, and more. Some custom-built; others extend the life sciences knowledge embedded in Claude. The agent reasons with domain depth, not general knowledge.
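"Loaded on demand" suggests a lazy registry: skills register a loader, and the underlying library is imported only when an agent first requests the skill. A minimal sketch of that pattern, with entirely illustrative names and a toy skill in place of a real scanpy-backed one:

```python
from typing import Callable, Dict

_LOADERS: Dict[str, Callable[[], object]] = {}
_CACHE: Dict[str, object] = {}

def register_skill(name: str, loader: Callable[[], object]) -> None:
    """Register a skill by name without importing its dependencies yet."""
    _LOADERS[name] = loader

def get_skill(name: str):
    """Load the skill on first use, then reuse the cached instance."""
    if name not in _CACHE:
        _CACHE[name] = _LOADERS[name]()
    return _CACHE[name]

# Toy stand-in for a real skill such as gene symbol resolution.
register_skill("gene_resolution", lambda: {"resolve": lambda symbol: symbol.upper()})

skill = get_skill("gene_resolution")
```

Lazy loading matters here because heavyweight dependencies (scanpy, scvi-tools, DESeq2 bridges) should cost nothing until an agent actually composes them into a workflow.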

04 — Full preprocessing pipeline

QC → normalization → log transform → highly variable genes → PCA → UMAP → clustering → differential expression. Analysis-ready .h5ad matrices, clean metadata, delivered.
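The early stages of that pipeline reduce to a few array operations. A minimal NumPy sketch on toy count data, assuming the production pipeline uses scanpy equivalents (e.g. `sc.pp.normalize_total`, `sc.pp.highly_variable_genes`) rather than this hand-rolled version; all thresholds here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(200, 500)).astype(float)  # toy counts: cells x genes

# QC: drop cells with too few total counts (illustrative threshold).
keep = X.sum(axis=1) >= 200
X = X[keep]

# Normalize each cell to 10,000 total counts, then log-transform.
X = X / X.sum(axis=1, keepdims=True) * 1e4
X = np.log1p(X)

# Highly variable genes: keep the top 100 by variance.
hvg = np.argsort(X.var(axis=0))[-100:]
Xh = X[:, hvg]

# PCA via SVD on the centered matrix.
Xc = Xh - Xh.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
pcs = U[:, :20] * S[:20]  # cells x 20 principal components
```

From here, neighbors, UMAP, clustering, and differential expression run on `pcs`, and the whole object serializes to an analysis-ready `.h5ad` via anndata.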