HubMAP: scRNA-seq

The HubMAP (Human BioMolecular Atlas Program) consortium is an initiative mapping human cells to create a comprehensive atlas, with its Data Portal serving as the platform where researchers can access, visualize, and download (single-cell) tissue data.

Lamin mirrors most of the datasets for simplified access here: laminlabs/hubmap.

If you use the data academically, please cite the original publication Jain et al. 2023.

Here, we show how the HubMAP instance is structured and how datasets can be queried and accessed.

HubMAP groups several data products, which are the individual raw datasets, into higher-level datasets. For example, the dataset HBM983.LKMP.544 has four data products:

  1. raw_expr.h5ad
  2. expr.h5ad
  3. secondary_analysis.h5ad
  4. scvelo_annotated.h5ad

The laminlabs/hubmap instance registers each of these data products as an ln.Artifact; together they form an ln.Collection.

Connect to the source instance:

# pip install 'lamindb[jupyter,bionty,wetlab]'
!lamin connect laminlabs/hubmap
 connected lamindb: laminlabs/hubmap


If you want to transfer artifacts or metadata into your own instance, use .using("laminlabs/hubmap") when accessing registries and then call .save() on the returned records (see the Transfer data guide).
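The transfer pattern described above can be sketched as a small helper. This is a minimal sketch, not an official recipe: `transfer_artifact` is a hypothetical name, and it assumes that the registry passed in (for example `ln.Artifact`) offers `.using(...)`, that the resulting query set offers `.get(key=...)`, and that calling `.save()` on the fetched record writes it to the instance you are currently connected to.

```python
def transfer_artifact(registry, key, source="laminlabs/hubmap"):
    """Fetch one record by key from `source` and save it to the current instance.

    `registry` is expected to behave like a lamindb registry such as ln.Artifact
    (assumption: it provides .using(), and the query set provides .get()).
    """
    record = registry.using(source).get(key=key)
    record.save()  # saving a record queried from another instance transfers it
    return record


# Usage, assuming you are connected to your own instance rather than laminlabs/hubmap:
# import lamindb as ln
# transfer_artifact(ln.Artifact, "f6eb890063d13698feb11d39fa61e45a/raw_expr.h5ad")
```

Passing the registry as an argument keeps the helper usable for other registries such as `ln.Collection`.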

import lamindb as ln
 connected lamindb: laminlabs/hubmap

Getting HubMAP datasets and data products

The key attribute of ln.Artifact and ln.Collection corresponds to the IDs used in the HubMAP URLs. For example, the ID in a dataset's URL is the key of the corresponding collection:

small_intenstine_collection = ln.Collection.get(key="20ee458e5ee361717b68ca72caf6044e")
Collection(uid='QjQSiso1qPlnX6iX0000', is_latest=True, key='20ee458e5ee361717b68ca72caf6044e', description='RNAseq data from the small intestine of a 67.0-year-old white female', hash='jF6aG3Nd4qQHBvY8v8Q8dg', space_id=1, created_by_id=3, run_id=11, created_at=2025-01-28 14:17:01 UTC)

We can list all data products associated with this collection:
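A minimal sketch of such a listing, assuming the collection exposes its linked artifacts through an `artifacts` relation with an `.all()` method, as lamindb collections do. `list_data_product_keys` is a hypothetical helper that returns only the keys; for a tabular view like the output that follows, `collection.artifacts.df()` is the likely call, although the method name may differ between lamindb versions.

```python
def list_data_product_keys(collection):
    """Return the sorted keys of all artifacts linked to `collection`."""
    return sorted(artifact.key for artifact in collection.artifacts.all())


# Usage with the collection queried above:
# list_data_product_keys(small_intenstine_collection)
```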

! no run & transform got linked, call `ln.track()` & re-run
id | uid | key | description | suffix | kind | otype | size | hash | n_files | n_observations | _hash_type | _key_is_virtual | _overwrite_versions | space_id | storage_id | schema_id | version | is_latest | run_id | created_at | created_by_id | _aux | _branch_code
28 | AzqCWQAKLMV3iTMA0000 | f6eb890063d13698feb11d39fa61e45a/raw_expr.h5ad | RNAseq data from the small intestine of a 67.0... | .h5ad | None | AnnData | 67867992 | of_TeLP6cet2JBj3o_kZmQ | None | 6000 | md5-etag | False | False | 1 | 2 | None | None | True | 11 | 2025-01-28 14:16:35.355582+00:00 | 3 | None | 1
29 | fWN781TxuZibkBOR0000 | f6eb890063d13698feb11d39fa61e45a/secondary_ana... | RNAseq data from the small intestine of a 67.0... | .h5ad | None | AnnData | 888111371 | ian3P5CN68AAvoDMC6sZLw | None | 5956 | md5-etag | False | False | 1 | 2 | None | None | True | 11 | 2025-01-28 14:16:39.348589+00:00 | 3 | None | 1
30 | enXVzwjw4voS8UCb0000 | f6eb890063d13698feb11d39fa61e45a/expr.h5ad | RNAseq data from the small intestine of a 67.0... | .h5ad | None | AnnData | 139737320 | kR476u81gwXI6rEbXzNBvQ | None | 6000 | md5-etag | False | False | 1 | 2 | None | None | True | 11 | 2025-01-28 14:16:43.385980+00:00 | 3 | None | 1

Note that the key of each of these three artifacts corresponds to its path in the HubMAP assets URL. For example, the key f6eb890063d13698feb11d39fa61e45a/expr.h5ad maps to the direct URL of the expr.h5ad data product.
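Because each artifact key has the form `<ID>/<file name>`, it can be split into its two parts with plain string handling. The helper below is a hypothetical illustration and does not depend on lamindb; it assumes keys always contain exactly one slash, as in the table above.

```python
def split_artifact_key(key):
    """Split an artifact key of the form '<ID>/<file name>' into (ID, file name)."""
    parts = key.split("/")
    if len(parts) != 2 or not all(parts):
        raise ValueError(f"expected '<ID>/<file name>', got {key!r}")
    return parts[0], parts[1]


print(split_artifact_key("f6eb890063d13698feb11d39fa61e45a/expr.h5ad"))
# → ('f6eb890063d13698feb11d39fa61e45a', 'expr.h5ad')
```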

Artifacts can be directly loaded:

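# Assumption: pick the raw_expr.h5ad data product by its key (see the table above)
small_intenstine_af = ln.Artifact.get(key="f6eb890063d13698feb11d39fa61e45a/raw_expr.h5ad")
adata = small_intenstine_af.load()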
AnnData object with n_obs × n_vars = 6000 × 98000
    var: 'hugo_symbol'
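Selecting one data product of a collection by file name and loading it can be wrapped in a helper. This is a sketch under assumptions: `load_data_product` is a hypothetical name, and the collection is assumed to expose `artifacts.all()` and each artifact a `key` attribute and a `.load()` method. Matching on `"/" + file_name` rather than on the bare file name matters here, because `raw_expr.h5ad` also ends in `expr.h5ad`.

```python
def load_data_product(collection, file_name):
    """Load the single artifact of `collection` whose key ends in '/<file_name>'."""
    suffix = "/" + file_name
    matches = [
        artifact
        for artifact in collection.artifacts.all()
        if artifact.key and artifact.key.endswith(suffix)
    ]
    if len(matches) != 1:
        raise LookupError(
            f"expected exactly one artifact for {file_name!r}, found {len(matches)}"
        )
    return matches[0].load()


# Usage with the collection queried above:
# adata = load_data_product(small_intenstine_collection, "expr.h5ad")
```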

Querying single-cell datasets

Currently, only the Artifacts of the raw_expr.h5ad data products are labeled with metadata. The available metadata includes ln.Reference, bt.Tissue, bt.Disease, bt.ExperimentalFactor, and many more. Please have a look at the instance for more details.

# Get one dataset with a specific type of heart failure
heart_failure_adata = (
    ln.Artifact.filter(diseases__name="heart failure with reduced ejection fraction")
    .first()
    .load()
)
AnnData object with n_obs × n_vars = 52534 × 60286
    obs: 'cell_id'
    var: 'hugo_symbol'
    layers: 'spliced', 'spliced_unspliced_sum', 'unspliced'
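A query like the one above fails with an unhelpful attribute error when nothing matches, because `.first()` returns `None` for an empty result. The following sketch adds an explicit check. `load_first_match` is a hypothetical helper; it assumes the registry passed in behaves like `ln.Artifact`, with `.filter(...)` returning a query set that supports `.first()`.

```python
def load_first_match(registry, **filters):
    """Load the first artifact matching `filters`, or raise if there is none."""
    artifact = registry.filter(**filters).first()
    if artifact is None:
        raise LookupError(f"no artifact matches {filters}")
    return artifact.load()


# Usage:
# import lamindb as ln
# adata = load_first_match(
#     ln.Artifact, diseases__name="heart failure with reduced ejection fraction"
# )
```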