✦ YC Hackathon — Infrastructure Track Winner

Public bio data,
ready for agents.

The infrastructure layer that was missing for AI-driven biological discovery.


121M cells indexed
45 datasets
126k perturbations
15 papers vetted

The data exists. It just isn't usable.

Thousands of single-cell datasets, CRISPR screens, and perturbation studies are openly available — describing how cells respond to genetic knockouts and chemical compounds at unprecedented scale. The science is there. The access isn't.


It was never built for AI agents.

  • Fragmented — papers reference datasets loosely. Tracing a publication to its raw files requires expert knowledge most agents don't have.
  • Siloed — most datasets capture one layer (RNA, protein, or chromatin). Cross-omic connections are rarely established.
  • Unreconciled — overlapping experiments across labs aren't merged. Harmonizing them takes months of manual work.
  • Expert-gated — finding the right dataset requires knowing the right labs, databases, and search terms.
  • Error-compounding — long agent reasoning chains amplify bad data sourcing. Errors at step 1 degrade everything downstream.

Models are ready. The data infrastructure isn't.

AI models are getting dramatically more capable in biology — but intelligence alone isn't enough. The next wave of biological discovery needs data that is structured, vetted, and packaged for reasoning. Agents can now do the reconciliation labor that blocked this before — in minutes, not months.


Source, vet, preprocess, consume. In one workflow.

We built the infrastructure layer that makes public biological data usable by agents: new database paradigms, embedding-rich context, and structured links between publications and the datasets they describe.


Key capabilities.

01 — Data sourcing & derisking

Every paper is evaluated by 3 independent AI assessors in parallel — statistical rigor, biological relevance, data quality. A convergence check flags disagreement. Red flags reduce scores automatically. You know what you're building on before you build on it.
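The assessor logic above can be sketched in a few lines. This is a minimal illustration, not the production scoring code: the function name, the 0–10 scale, and the threshold and penalty values are all assumptions.

```python
from statistics import mean, pstdev

def aggregate_assessments(scores, red_flags, divergence_threshold=1.5, flag_penalty=0.5):
    """Combine scores from independent assessors (illustrative 0-10 scale).

    The convergence check flags assessor disagreement for review; each red
    flag reduces the final score. Threshold and penalty are hypothetical.
    """
    needs_review = pstdev(scores) > divergence_threshold  # convergence check
    score = mean(scores) - flag_penalty * len(red_flags)  # red flags reduce scores
    return max(score, 0.0), needs_review

# Three assessors roughly agree; one red flag lowers the mean score.
score, review = aggregate_assessments([8.0, 7.5, 8.5], red_flags=["small sample size"])
```

Running the three assessors in parallel and only aggregating afterward keeps each evaluation independent, so disagreement is a real signal rather than anchoring.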

02 — Callable data endpoints

After preprocessing, the system registers a live REST endpoint for the dataset — queryable by agents or scientists directly.

GET /api/datasets/query_vorinostat_cells?limit=50&is_negative_control=false
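An agent-side call might look like the following sketch. The base URL and helper name are hypothetical; only the endpoint path and parameters come from the example above.

```python
from urllib.parse import urlencode

# Hypothetical base URL; the per-dataset path is registered after preprocessing.
BASE = "https://api.example.com"

def dataset_query_url(endpoint: str, **params) -> str:
    """Build a query URL for a registered dataset endpoint."""
    return f"{BASE}/api/datasets/{endpoint}?{urlencode(params)}"

url = dataset_query_url("query_vorinostat_cells", limit=50, is_negative_control="false")
```

Because the endpoint is plain REST, the same URL works from an agent tool call, a notebook, or curl, with no SDK required.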

03 — Agent skills layer

21 composable skills loaded on demand — scanpy, scvi-tools, DESeq2, RNA velocity, UMAP, gene resolution, molecule lookup, and more. Some custom-built; others extend the life sciences knowledge embedded in Claude. The agent reasons with domain depth, not general knowledge.
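"Loaded on demand" suggests a lazy registry: skills register a loader, and the underlying library is imported only when an agent first requests the skill. A minimal sketch of that pattern, with entirely illustrative names and a toy skill in place of a real scanpy-backed one:

```python
from typing import Callable, Dict

_LOADERS: Dict[str, Callable[[], object]] = {}
_CACHE: Dict[str, object] = {}

def register_skill(name: str, loader: Callable[[], object]) -> None:
    """Register a skill by name without importing its dependencies yet."""
    _LOADERS[name] = loader

def get_skill(name: str):
    """Load the skill on first use, then reuse the cached instance."""
    if name not in _CACHE:
        _CACHE[name] = _LOADERS[name]()
    return _CACHE[name]

# Toy stand-in for a real skill such as gene symbol resolution.
register_skill("gene_resolution", lambda: {"resolve": lambda symbol: symbol.upper()})

skill = get_skill("gene_resolution")
```

Lazy loading matters here because heavyweight dependencies (scanpy, scvi-tools, DESeq2 bridges) should cost nothing until an agent actually composes them into a workflow.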

04 — Full preprocessing pipeline

QC → normalization → log transform → highly variable genes → PCA → UMAP → clustering → differential expression. Analysis-ready .h5ad matrices, clean metadata, delivered.
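The early stages of that pipeline reduce to a few array operations. A minimal NumPy sketch on toy count data, assuming the production pipeline uses scanpy equivalents (e.g. `sc.pp.normalize_total`, `sc.pp.highly_variable_genes`) rather than this hand-rolled version; all thresholds here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(200, 500)).astype(float)  # toy counts: cells x genes

# QC: drop cells with too few total counts (illustrative threshold).
keep = X.sum(axis=1) >= 200
X = X[keep]

# Normalize each cell to 10,000 total counts, then log-transform.
X = X / X.sum(axis=1, keepdims=True) * 1e4
X = np.log1p(X)

# Highly variable genes: keep the top 100 by variance.
hvg = np.argsort(X.var(axis=0))[-100:]
Xh = X[:, hvg]

# PCA via SVD on the centered matrix.
Xc = Xh - Xh.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
pcs = U[:, :20] * S[:20]  # cells x 20 principal components
```

From here, neighbors, UMAP, clustering, and differential expression run on `pcs`, and the whole object serializes to an analysis-ready `.h5ad` via anndata.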