2026-02-21
A programmatically queryable CELLxGENE LaminDB instance
CZ CELLxGENE hosts one of the largest standardized collections of single-cell RNA-seq datasets.
Its Census provides efficient access via TileDB-SOMA, and individual datasets are available as .h5ad files on S3.
However, programmatically querying across datasets by arbitrary metadata combinations — cell types, tissues, diseases, assays, collections, donor information — has required writing custom data wrangling code.
We maintain laminlabs/cellxgene, a public LaminDB instance that mirrors CELLxGENE data with curated, queryable metadata.
It enables you to:
- Query across datasets using biological ontologies: filter `.h5ad` artifacts by cell type, tissue, disease, assay, organism, and more, all with a single API call.
- Access individual datasets: cache, load into memory, or stream array slices without downloading everything.
- Query the concatenated Census: slice the `tiledbsoma` store by metadata filters, directly from LaminDB.
- Train ML models on collections using `MappedCollection` or `tiledbsoma` PyTorch dataloaders.
- Integrate with in-house data: transfer CELLxGENE data into your own LaminDB instance and combine it with private datasets.
Here, we explain how we curate and maintain the instance, and how you can use it.
Connecting to the instance
Getting started takes two lines:
```python
import lamindb as ln

db = ln.DB("laminlabs/cellxgene")
```
The db object gives you access to all registries: artifacts, collections, genes, cell types, tissues, diseases, and more.
Querying across datasets
Every individual .h5ad file from CELLxGENE is registered as an artifact in the instance.
Each artifact is annotated with ontology-backed metadata parsed from its obs fields — cell types, tissues, diseases, assays, developmental stages, ethnicities, and organisms — all linked to Bionty registries.
This means you can query across all ~1,000 datasets with expressive filters:
```python
cell_types = db.bionty.CellType.lookup()
users = db.User.lookup()

db.Artifact.filter(
    suffix=".h5ad",
    description__contains="immune",
    size__gt=1e9,
    cell_types__name__in=["B cell", "T cell"],
    created_by=users.sunnyosun,
).order_by("created_at").to_dataframe(
    include=["cell_types__name", "created_by__handle"]
)
```
Under the hood, cell_types__name__in performs a join between the Artifact and bionty.CellType registries, matching on CellType.name.
This is the same Django ORM-style syntax used throughout LaminDB, which means queries compose naturally across any metadata dimension.
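To illustrate the convention itself, here is a toy parser (not LaminDB internals) showing how a double-underscore lookup path decomposes into a chain of relation/field hops plus an optional trailing operator:

```python
# Toy illustration of Django-style lookup paths (not LaminDB internals):
# a path like "cell_types__name__in" splits into a traversal
# ("cell_types" -> "name") plus a final operator ("in").
OPERATORS = {"in", "contains", "gt", "lt", "gte", "lte", "exact"}

def parse_lookup(path: str) -> tuple[list[str], str]:
    parts = path.split("__")
    if parts[-1] in OPERATORS:
        return parts[:-1], parts[-1]
    return parts, "exact"  # no trailing operator means exact match

print(parse_lookup("cell_types__name__in"))   # (['cell_types', 'name'], 'in')
print(parse_lookup("description__contains"))  # (['description'], 'contains')
print(parse_lookup("created_by"))             # (['created_by'], 'exact')
```

Because every segment before the operator is just another hop, filters compose across registries the same way they do within one.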
How we curate the instance
Each CELLxGENE Census LTS release (published every six months) triggers an update of laminlabs/cellxgene.
The curation process:
1. Register artifacts: Each `.h5ad` file from the Census release is registered as an artifact, pointing to its S3 location on `s3://cellxgene-data-public`. No data is copied; LaminDB references the original storage.
2. Parse and link metadata: For each artifact, we parse the `obs` fields and link them to ontology-backed registries. Cell types are linked to the Cell Ontology, tissues to Uberon, diseases to Mondo, assays to EFO, and so on. This is what enables cross-dataset queries.
3. Register collections: CELLxGENE organizes datasets into collections (typically corresponding to a publication). We mirror this structure: each CELLxGENE collection maps to a `Collection` in LaminDB, grouping the relevant artifacts. Collections are versioned across Census releases.
4. Register the Census store: The concatenated `tiledbsoma` array is registered as a single artifact, enabling direct queries via `artifact.open()`.
The scripts that perform this curation live in cellxgene-lamin, and the schema definition is part of lamindb.
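The reference-only idea behind step 1 can be sketched with a toy registry (plain Python, not the real lamindb API; the S3 key is hypothetical): a record holds the storage pointer and linked labels, never the bytes.

```python
from dataclasses import dataclass, field

# Toy sketch of reference-only registration (NOT the real lamindb API):
# a record stores the S3 location and linked ontology labels; the array
# bytes stay on s3://cellxgene-data-public and are never copied.
@dataclass
class ArtifactRecord:
    s3_path: str                                          # pointer to original storage
    suffix: str = ".h5ad"
    cell_types: list[str] = field(default_factory=list)   # ontology-backed labels

registry: list[ArtifactRecord] = []
registry.append(
    ArtifactRecord(
        s3_path="s3://cellxgene-data-public/example.h5ad",  # hypothetical key
        cell_types=["B cell", "T cell"],
    )
)

# Cross-dataset query: which registered artifacts contain B cells?
hits = [a.s3_path for a in registry if "B cell" in a.cell_types]
print(hits)
```

The real instance stores these links relationally, which is what makes the joins in the queries above cheap.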
Inspecting a dataset
You can inspect any artifact’s full metadata context:
artifact = db.Artifact.get(description="Mature kidney dataset: immune")
artifact.describe()
This shows the dataset features (20 obs columns, 2 var columns), linked ontology labels (cell types, tissues, diseases, developmental stages), external features (e.g., number of donors), storage path, and provenance (who created it, which script, when).
Accessing data
Three ways to access the underlying array data:
```python
# 1. Cache on disk and return the local path
path = artifact.cache()

# 2. Cache and load into memory
adata = artifact.load()

# 3. Stream via a cloud-backed accessor
with artifact.open() as adata_backed:
    adata_subset = adata_backed[adata_backed.obs.cell_type == "B cell"]
    adata_slice = adata_subset.to_memory()
```
All three run faster from within AWS us-west-2, where the data is hosted.
Querying within a collection
You can filter artifacts within a specific collection by combining metadata:
```python
organisms = db.bionty.Organism.lookup()
tissues = db.bionty.Tissue.lookup()
cell_types = db.bionty.CellType.lookup()
experimental_factors = db.bionty.ExperimentalFactor.lookup()
suspension_types = db.ULabel.filter(type__name="SuspensionType").lookup()

census = db.Collection.get(key="cellxgene-census", version="2025-01-30")
census.artifacts.filter(
    organisms=organisms.human,
    cell_types__in=[cell_types.dendritic_cell, cell_types.neutrophil],
    tissues=tissues.kidney,
    ulabels=suspension_types.cell,
    experimental_factors=experimental_factors.ln_10x_3_v2,
).order_by("size").to_dataframe()
```
Slicing the concatenated Census
For queries that span all datasets, you can slice the TileDB-SOMA store directly:
```python
features = db.Feature.lookup(return_field="name")
assays = db.bionty.ExperimentalFactor.lookup(return_field="name")

census_artifact = db.Artifact.get(key="cell-census/2025-01-30/soma")
with census_artifact.open() as store:
    cell_metadata = (
        store["census_data"]["homo_sapiens"]
        .obs.read(
            value_filter=(
                f'{features.tissue} == "brain"'
                f' and {features.cell_type} in ["microglial cell", "neuron"]'
                f' and {features.suspension_type} == "cell"'
                f' and {features.assay} == "{assays.ln_10x_3_v3}"'
            )
        )
        .concat()
        .to_pandas()
    )
```
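The value filter above follows SOMA's query-condition grammar (equality tests joined by `and`, with `in` for list membership). As a sketch, a hypothetical helper (not part of tiledbsoma or lamindb) can assemble such strings from a dict of constraints:

```python
# Hypothetical helper (not part of tiledbsoma or lamindb) that assembles
# a SOMA-style value_filter string from column -> value constraints.
def build_value_filter(constraints: dict) -> str:
    clauses = []
    for column, value in constraints.items():
        if isinstance(value, list):
            quoted = ", ".join(f'"{v}"' for v in value)
            clauses.append(f"{column} in [{quoted}]")  # list membership
        else:
            clauses.append(f'{column} == "{value}"')   # exact match
    return " and ".join(clauses)

vf = build_value_filter(
    {
        "tissue": "brain",
        "cell_type": ["microglial cell", "neuron"],
        "suspension_type": "cell",
    }
)
print(vf)
# tissue == "brain" and cell_type in ["microglial cell", "neuron"] and suspension_type == "cell"
```

Keeping filter construction in one place avoids quoting mistakes when the constraint set is built dynamically.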
Training ML models
On a collection of .h5ad files
Collection.mapped() creates a map-style PyTorch dataset that virtually concatenates the collection’s artifacts.
This supports weighted random sampling and scales across multiple GPUs (see our array loader benchmarks):
```python
from torch.utils.data import DataLoader

census_collection = db.Collection.get(key="cellxgene-census", version="2025-01-30")
dataset = census_collection.mapped(obs_keys=[features.cell_type], join="outer")
dataloader = DataLoader(dataset, batch_size=128, shuffle=True)
for batch in dataloader:
    pass  # training step goes here
dataset.close()
```
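What makes this work with `DataLoader` is just the map-style contract: `__len__` and `__getitem__` over a global index that spans all shards. A minimal stand-in with toy in-memory data (not the real `MappedCollection`, which reads from the collection's `.h5ad` artifacts) shows the idea:

```python
# Minimal map-style dataset illustrating the __len__/__getitem__ contract
# that Collection.mapped() fulfills. Toy in-memory data only; the real
# MappedCollection streams rows from the collection's .h5ad artifacts.
class ToyMappedCollection:
    def __init__(self, shards):
        # each shard mimics one artifact: rows of (expression vector, cell_type)
        self._shards = shards
        # flat global index: (shard number, row within shard)
        self._index = [
            (i, j) for i, shard in enumerate(shards) for j in range(len(shard))
        ]

    def __len__(self) -> int:
        return len(self._index)

    def __getitem__(self, idx: int):
        shard, row = self._index[idx]
        return self._shards[shard][row]

ds = ToyMappedCollection(
    [
        [([0.1, 0.0], "B cell"), ([0.0, 2.3], "T cell")],  # "artifact" 1
        [([1.5, 0.2], "neuron")],                          # "artifact" 2
    ]
)
print(len(ds))   # 3
print(ds[2][1])  # neuron
```

Because indexing is random-access across the virtual concatenation, samplers such as weighted random sampling and multi-GPU sharding come for free from PyTorch.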
On the concatenated tiledbsoma store
```python
import cellxgene_census.experimental.ml as census_ml
from tiledbsoma import AxisQuery

store = census_artifact.open()
experiment = store["census_data"]["homo_sapiens"]

experiment_datapipe = census_ml.ExperimentDataPipe(
    experiment,
    measurement_name="RNA",
    X_name="raw",
    obs_query=AxisQuery(value_filter=value_filter),
    obs_column_names=[features.cell_type],
    batch_size=128,
    shuffle=True,
    soma_chunk_size=10000,
)
dataloader = census_ml.experiment_dataloader(experiment_datapipe)
for batch in dataloader:
    pass  # training step goes here
store.close()
```
Curating your own data against the CELLxGENE schema
If you want to contribute data to CELLxGENE or simply ensure your datasets follow the same schema, you can validate and annotate your AnnData objects against the laminlabs/cellxgene registries.
See the curation guide for details.
Integrating with in-house data
A key advantage of the LaminDB instance over querying CELLxGENE directly is composability with private data.
You can transfer any artifact from laminlabs/cellxgene into your own instance without copying data, and then query across both public and private datasets using the same API.
Outlook
We update laminlabs/cellxgene with each Census LTS release.
Going forward, we plan to extend the instance with spatial transcriptomics datasets as they become available through Census.
Code & data availability
All code used in this blog post is free & open-source.
- CELLxGENE registration utility code: github.com/laminlabs/cellxgene-lamin
- CELLxGENE LaminDB schema code: github.com/laminlabs/lamindb
- The public instance: lamin.ai/laminlabs/cellxgene
- Documentation: docs.lamin.ai/cellxgene
Citation
Sun S, Heumos L & Wolf A (2026). A programmatically queryable CELLxGENE LaminDB instance. Lamin Blog.
https://blog.lamin.ai/cellxgene-lamindb