Arc Virtual Cell Atlas: scRNA-seq
The Arc Virtual Cell Atlas hosts one of the biggest collections of scRNA-seq datasets.
Lamin mirrors the dataset for simplified access here: laminlabs/arc-virtual-cell-atlas.
If you use the data academically, please cite the original publications, Youngblut et al. (2025) and Zhang et al. (2025).
Connect to the source instance.
# pip install 'lamindb[jupyter,bionty,wetlab,gcp]'
!lamin connect laminlabs/arc-virtual-cell-atlas
→ connected lamindb: laminlabs/arc-virtual-cell-atlas
Note
If you want to transfer artifacts or metadata into your own instance, use .using("laminlabs/arc-virtual-cell-atlas") when accessing registries and then .save() (see Transfer data).
import lamindb as ln
import bionty as bt
import wetlab as wl
import pyarrow.compute as pc
import anndata as ad
→ connected lamindb: laminlabs/arc-virtual-cell-atlas
Tahoe-100M
project_tahoe = ln.Project.get(name="Tahoe-100M")
project_tahoe
Project(uid='H5MwZwyA62rG', name='Tahoe-100M', is_type=False, url='https://arcinstitute.org/tools/virtualcellatlas', space_id=1, created_by_id=1, created_at=2025-02-26 16:03:40 UTC)
# one collection in this project
project_tahoe.collections.df()
uid | key | description | hash | reference | reference_type | space_id | meta_artifact_id | version | is_latest | run_id | created_at | created_by_id | _aux | _branch_code | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | |||||||||||||||
1 | BpavRL4ntRTzWEE50000 | tahoe100 | None | GCLk4ZgQxgWspjmEUk3gIg | None | None | 1 | None | 2025-02-25 | True | 3 | 2025-02-26 13:51:22.787537+00:00 | 1 | None | 1 |
Every individual dataset in the atlas is an .h5ad file that is registered as an artifact in LaminDB.
Artifact-level metadata are registered and can be explored as follows:
# get the collection: https://lamin.ai/laminlabs/arc-virtual-cell-atlas/collection/BpavRL4ntRTzWEE5
collection_tahoe = ln.Collection.get(key="tahoe100")
# 14 artifacts in this collection, each corresponding to a plate
artifacts_tahoe = collection_tahoe.artifacts.distinct()
artifacts_tahoe.df()
uid | key | description | suffix | kind | otype | size | hash | n_files | n_observations | _hash_type | _key_is_virtual | _overwrite_versions | space_id | storage_id | schema_id | version | is_latest | run_id | created_at | created_by_id | _aux | _branch_code | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | |||||||||||||||||||||||
1362 | 56uA9lPPmJ4zLUcr0000 | 2025-02-25/h5ad/plate10_filt_Vevo_Tahoe100M_WS... | None | .h5ad | dataset | AnnData | 26536400717 | j1FXsX7hs7u+eBqnWnmNHw | None | 8044908 | md5 | False | False | 1 | 2 | 3 | None | True | 1 | 2025-02-25 23:22:17.849980+00:00 | 1 | None | 1 |
1365 | 9L9HZ55HqUL0aqaR0000 | 2025-02-25/h5ad/plate13_filt_Vevo_Tahoe100M_WS... | None | .h5ad | dataset | AnnData | 28071589885 | RKOiaay+CHvv+Ukk/N+28A | None | 8501658 | md5 | False | False | 1 | 2 | 3 | None | True | 1 | 2025-02-25 23:22:18.977981+00:00 | 1 | None | 1 |
1372 | aAHQ3zbD7n1asyYr0000 | 2025-02-25/h5ad/plate6_filt_Vevo_Tahoe100M_WSe... | None | .h5ad | dataset | AnnData | 28934897078 | NYvQEqVClziHm0ozWhOw1w | None | 7545393 | md5 | False | False | 1 | 2 | 3 | None | True | 1 | 2025-02-25 23:22:21.629962+00:00 | 1 | None | 1 |
1367 | aJIqo7bNyJAs9z0r0000 | 2025-02-25/h5ad/plate1_filt_Vevo_Tahoe100M_WSe... | None | .h5ad | dataset | AnnData | 19070623904 | 9iCNcouMqfNS3HA/2GUWOA | None | 5481420 | md5 | False | False | 1 | 2 | 3 | None | True | 1 | 2025-02-25 23:22:19.737995+00:00 | 1 | None | 1 |
1375 | BDttiuV3Te8VB0dU0000 | 2025-02-25/h5ad/plate9_filt_Vevo_Tahoe100M_WSe... | None | .h5ad | dataset | AnnData | 18791302576 | 4kHbVbmreg6akW6ZgsjxaA | None | 5866669 | md5 | False | False | 1 | 2 | 3 | None | True | 1 | 2025-02-25 23:22:22.759201+00:00 | 1 | None | 1 |
1374 | czC19UpUEszVH2bU0000 | 2025-02-25/h5ad/plate8_filt_Vevo_Tahoe100M_WSe... | None | .h5ad | dataset | AnnData | 30390935958 | ilAzEPIh4FlDeTFaJ1dILw | None | 8880979 | md5 | False | False | 1 | 2 | 3 | None | True | 1 | 2025-02-25 23:22:22.387666+00:00 | 1 | None | 1 |
1373 | DC5cacdJr1VoEXnl0000 | 2025-02-25/h5ad/plate7_filt_Vevo_Tahoe100M_WSe... | None | .h5ad | dataset | AnnData | 16514746341 | NOS4MY6eYYPOnAB8ViyWYg | None | 5692117 | md5 | False | False | 1 | 2 | 3 | None | True | 1 | 2025-02-25 23:22:22.009157+00:00 | 1 | None | 1 |
1371 | EZATJLC4jE7pmwo40000 | 2025-02-25/h5ad/plate5_filt_Vevo_Tahoe100M_WSe... | None | .h5ad | dataset | AnnData | 19763140865 | VMBKFzOI5cj7UC1UDENP4A | None | 6419498 | md5 | False | False | 1 | 2 | 3 | None | True | 1 | 2025-02-25 23:22:21.255154+00:00 | 1 | None | 1 |
1363 | omn7JStfJMzy8m6O0000 | 2025-02-25/h5ad/plate11_filt_Vevo_Tahoe100M_WS... | None | .h5ad | dataset | AnnData | 23230802756 | N2mzoYlMLEl6PdecaYyDvw | None | 7435869 | md5 | False | False | 1 | 2 | 3 | None | True | 1 | 2025-02-25 23:22:18.229629+00:00 | 1 | None | 1 |
1364 | S2h2rPLCaUhZAM9u0000 | 2025-02-25/h5ad/plate12_filt_Vevo_Tahoe100M_WS... | None | .h5ad | dataset | AnnData | 37495736876 | VjAkWVFGVpzAMi9Innusuw | None | 10487057 | md5 | False | False | 1 | 2 | 3 | None | True | 1 | 2025-02-25 23:22:18.600910+00:00 | 1 | None | 1 |
1370 | tKTeff0ugWqAm4P70000 | 2025-02-25/h5ad/plate4_filt_Vevo_Tahoe100M_WSe... | None | .h5ad | dataset | AnnData | 23292672278 | BkBXznbSovNWXtzPFITPcQ | None | 7004356 | md5 | False | False | 1 | 2 | 3 | None | True | 1 | 2025-02-25 23:22:20.879928+00:00 | 1 | None | 1 |
1366 | vn5cUJCHbjpPPsZx0000 | 2025-02-25/h5ad/plate14_filt_Vevo_Tahoe100M_WS... | None | .h5ad | dataset | AnnData | 22427932564 | FrnStRehP16siRGG35ou+g | None | 6518806 | md5 | False | False | 1 | 2 | 3 | None | True | 1 | 2025-02-25 23:22:19.357999+00:00 | 1 | None | 1 |
1369 | XVSrkq9pyF1OBLgG0000 | 2025-02-25/h5ad/plate3_filt_Vevo_Tahoe100M_WSe... | None | .h5ad | dataset | AnnData | 13173722269 | Jnrt7DaSUCGn8D8LS2itaw | None | 4705402 | md5 | False | False | 1 | 2 | 3 | None | True | 1 | 2025-02-25 23:22:20.497965+00:00 | 1 | None | 1 |
1368 | ZFeVfd0ugAHeWCxm0000 | 2025-02-25/h5ad/plate2_filt_Vevo_Tahoe100M_WSe... | None | .h5ad | dataset | AnnData | 29037152127 | usxviuqGbuw0RYnECCVCWw | None | 8064658 | md5 | False | False | 1 | 2 | 3 | None | True | 1 | 2025-02-25 23:22:20.113956+00:00 | 1 | None | 1 |
50 cell lines.
artifacts_tahoe.list("cell_lines__name")[:5]
['A-172', 'A-427', 'A498', 'A549', 'AN3 CA']
380 compounds.
artifacts_tahoe.list("compounds__name")[:5]
['18β-Glycyrrhetinic acid',
'4EGI-1',
'5-Azacytidine',
'5-Fluorouracil',
'8-Hydroxyquinoline']
1,138 perturbations.
artifacts_tahoe.list("compound_perturbations__name")[:5]
["[('18β-Glycyrrhetinic acid', 0.05, 'uM')]",
"[('18β-Glycyrrhetinic acid', 0.5, 'uM')]",
"[('18β-Glycyrrhetinic acid', 5.0, 'uM')]",
"[('4EGI-1', 0.05, 'uM')]",
"[('4EGI-1', 0.5, 'uM')]"]
# check the curated metadata of the first artifact
artifact1 = artifacts_tahoe[0]
artifact1.describe()
Artifact .h5ad/AnnData
├── General
│   ├── .uid = '56uA9lPPmJ4zLUcr0000'
│   ├── .key = '2025-02-25/h5ad/plate10_filt_Vevo_Tahoe100M_WServicesFrom_ParseGigalab.h5ad'
│   ├── .size = 26536400717
│   ├── .hash = 'j1FXsX7hs7u+eBqnWnmNHw'
│   ├── .n_observations = 8044908
│   ├── .path = gs://arc-ctc-tahoe100/2025-02-25/h5ad/plate10_filt_Vevo_Tahoe100M_WServicesFrom_ParseGigalab.h5ad
│   ├── .created_by = sunnyosun (Sunny Sun)
│   ├── .created_at = 2025-02-25 23:22:17
│   └── .transform = 'Register Tahoe-100M'
├── Dataset features/.feature_sets
│   ├── var • 62710 [bionty.Gene.stable_id]
│   │   TSPAN6      float
│   │   TNMD        float
│   │   DPM1        float
│   │   SCYL3       float
│   │   C1orf112    float
│   │   FGR         float
│   │   CFH         float
│   │   FUCA2       float
│   │   GCLC        float
│   │   NFYA        float
│   │   STPG1       float
│   │   NIPAL3      float
│   │   LAS1L       float
│   │   ENPP4       float
│   │   SEMA3F      float
│   │   CFTR        float
│   │   ANKIB1      float
│   │   CYP51A1     float
│   │   KRIT1       float
│   │   RAD52       float
│   └── obs • 16 [Feature]
│       cell_line            cat[bionty.CellLine.desc…   A-172, A-427, A498, A549, AN3 CA, AsPC-1…
│       cell_name            cat[bionty.CellLine]        A-172, A-427, A498, A549, AN3 CA, AsPC-1…
│       drug                 cat[wetlab.Compound]        5-Azacytidine, 5-Fluorouracil, Abiratero…
│       drugname_drugconc    cat[wetlab.CompoundPertu…   [('5-Azacytidine', 0.05, 'uM')], [('5-Fl…
│       pass_filter          cat[ULabel[PassFilter]]     full, minimal
│       phase                cat[ULabel[Phase]]          G1, G2M, S
│       plate                cat[ULabel[Plate]]          plate10
│       sample               cat[wetlab.Biosample]       smp_2359, smp_2360, smp_2361, smp_2362, …
│       gene_count           int
│       tscp_count           int
│       mread_count          int
│       pcnt_mito            float
│       S_score              float
│       G2M_score            float
│       sublibrary           str
│       BARCODE              str
└── Labels
    └── .references               Reference                   Tahoe-100M: A Giga-Scale Single-Cell Per…
        .projects                 Project                     Tahoe-100M
        .compounds                wetlab.Compound             Acetazolamide, Neratinib, Tazarotene, 5-…
        .compound_perturbations   wetlab.CompoundPerturbat…   [('5-Azacytidine', 0.05, 'uM')], [('Iver…
        .biosamples               wetlab.Biosample            smp_2430, smp_2365, smp_2360, smp_2369, …
        .organisms                bionty.Organism             human
        .cell_lines               bionty.CellLine             NCI-H1573, NCI-H460, hTERT-HPNE, SW48, H…
        .ulabels                  ULabel                      tahoe-100, plate10, G1, G2M, S, full, mi…
16 obs metadata features.
artifact1.features["obs"].df()
/tmp/ipykernel_3635/2428349911.py:1: FutureWarning: Use slots[slot].members instead of __getitem__, __getitem__ will be removed in the future.
artifact1.features["obs"].df()
uid | name | dtype | is_type | unit | description | array_rank | array_size | array_shape | proxy_dtype | synonyms | _expect_many | _curation | space_id | type_id | run_id | created_at | created_by_id | _aux | _branch_code | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | ||||||||||||||||||||
9 | bujDkB4Nd1S5 | S_score | float | None | None | Inferred S phase score | 0 | 0 | None | None | None | True | None | 1 | None | 3 | 2025-02-25 22:31:22.144135+00:00 | 1 | {'af': {'0': None, '1': True}} | 1 |
3 | PVpyJhciLdCQ | pass_filter | cat[ULabel[PassFilter]] | None | None | "Full" filters are more stringent on gene_coun... | 0 | 0 | None | None | None | True | None | 1 | None | 3 | 2025-02-25 22:25:30.918235+00:00 | 1 | {'af': {'0': None, '1': True}} | 1 |
7 | PZDiL36nJSFv | mread_count | int | None | None | Number of reads per cell | 0 | 0 | None | None | None | True | None | 1 | None | 3 | 2025-02-25 22:30:31.810331+00:00 | 1 | {'af': {'0': None, '1': True}} | 1 |
4 | vshELphl73qp | cell_line | cat[bionty.CellLine.description] | None | None | Cell line information (if applicable) | 0 | 0 | None | None | None | True | None | 1 | None | 3 | 2025-02-25 22:27:22.393997+00:00 | 1 | {'af': {'0': None, '1': True}} | 1 |
1 | YRSYWdIiesqL | plate | cat[ULabel[Plate]] | None | None | Plate identifier | 0 | 0 | None | None | None | True | None | 1 | None | 3 | 2025-02-25 22:03:51.786985+00:00 | 1 | {'af': {'0': None, '1': True}} | 1 |
19 | gQE1h3fIBiSf | sample | cat[wetlab.Biosample] | None | None | Unique treatment identifier, distinguishes rep... | 0 | 0 | None | None | None | True | None | 1 | None | 3 | 2025-02-26 10:59:36.743558+00:00 | 1 | {'af': {'0': None, '1': True}} | 1 |
5 | IjSP1lCY3Hyw | gene_count | int | None | None | Number of genes with at least one count | 0 | 0 | None | None | None | True | None | 1 | None | 3 | 2025-02-25 22:30:30.668750+00:00 | 1 | {'af': {'0': None, '1': True}} | 1 |
6 | LHUmmYKjIGPl | tscp_count | int | None | None | Number of transcripts, aka UMI count | 0 | 0 | None | None | None | True | None | 1 | None | 3 | 2025-02-25 22:30:31.236532+00:00 | 1 | {'af': {'0': None, '1': True}} | 1 |
18 | fLwdFKBUhBY9 | drugname_drugconc | cat[wetlab.CompoundPerturbation] | None | None | Drug name, concentration, and concentration unit | 0 | 0 | None | None | None | True | None | 1 | None | 3 | 2025-02-25 23:04:17.541812+00:00 | 1 | {'af': {'0': None, '1': True}} | 1 |
17 | Q0cj2JR5Juwn | drug | cat[wetlab.Compound] | None | None | Drug name, parsed out from the drugname_drugco... | 0 | 0 | None | None | None | True | None | 1 | None | 3 | 2025-02-25 23:02:05.717794+00:00 | 1 | {'af': {'0': None, '1': True}} | 1 |
15 | 3X4d0QEUuprp | sublibrary | str | None | None | Sublibrary ID (related to library prep and seq... | 0 | 0 | None | None | None | True | None | 1 | None | 3 | 2025-02-25 22:35:14.673178+00:00 | 1 | {'af': {'0': None, '1': True}} | 1 |
16 | dQELv2sIVnJX | BARCODE | str | None | None | Barcode ID | 0 | 0 | None | None | None | True | None | 1 | None | 3 | 2025-02-25 22:35:15.627971+00:00 | 1 | {'af': {'0': None, '1': True}} | 1 |
8 | X640W5tBUPOQ | pcnt_mito | float | None | None | Percentage of mitochondrial reads | 0 | 0 | None | None | None | True | None | 1 | None | 3 | 2025-02-25 22:31:21.581885+00:00 | 1 | {'af': {'0': None, '1': True}} | 1 |
10 | CF0O0e0WZxFz | G2M_score | float | None | None | Inferred G2M score | 0 | 0 | None | None | None | True | None | 1 | None | 3 | 2025-02-25 22:31:22.708895+00:00 | 1 | {'af': {'0': None, '1': True}} | 1 |
2 | QboQ1Q1Yxsjn | phase | cat[ULabel[Phase]] | None | None | Inferred cell cycle phase | 0 | 0 | None | None | None | True | None | 1 | None | 3 | 2025-02-25 22:21:56.935262+00:00 | 1 | {'af': {'0': None, '1': True}} | 1 |
11 | KPT70T8xJLIt | cell_name | cat[bionty.CellLine] | None | None | Commonly-used cell name (related to the cell_l... | 0 | 0 | None | None | None | True | None | 1 | None | 3 | 2025-02-25 22:32:56.082195+00:00 | 1 | {'af': {'0': None, '1': True}} | 1 |
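The drugname_drugconc feature stores each perturbation as a string such as "[('4EGI-1', 0.05, 'uM')]", which reads like a Python list of (compound, concentration, unit) tuples. If you need the parts separately, one option is to parse the string with `ast.literal_eval`. The following is a minimal sketch under that assumption; the helper name `parse_perturbation` is hypothetical and not part of LaminDB.

```python
import ast


def parse_perturbation(name: str) -> list[tuple[str, float, str]]:
    """Parse a perturbation name such as "[('4EGI-1', 0.05, 'uM')]"."""
    # literal_eval only evaluates Python literals, so no arbitrary code is executed
    parsed = ast.literal_eval(name)
    return [(str(drug), float(conc), str(unit)) for drug, conc, unit in parsed]


# two names copied from the listing of perturbations shown earlier
names = [
    "[('18β-Glycyrrhetinic acid', 0.05, 'uM')]",
    "[('4EGI-1', 0.5, 'uM')]",
]
parsed_names = [parse_perturbation(n) for n in names]
for entries in parsed_names:
    for drug, conc, unit in entries:
        print(f"{drug}: {conc} {unit}")
```

Because the value is a list, a name could in principle hold several tuples; the sketch returns all of them rather than assuming exactly one.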
Query artifacts of interest based on metadata
Since all metadata are registered in the SQL database, we can explore the datasets without opening the underlying files.
Let’s find which datasets contain A549 cells perturbed with Piroxicam.
# lookup objects give you pythonic access to the values
cell_lines = bt.CellLine.lookup("ontology_id")
drugs = wl.Compound.lookup()
artifacts_a549_piroxicam = artifacts_tahoe.filter(
    cell_lines=cell_lines.cvcl_0023, compounds=drugs.piroxicam
)
artifacts_a549_piroxicam.df()
uid | key | description | suffix | kind | otype | size | hash | n_files | n_observations | _hash_type | _key_is_virtual | _overwrite_versions | space_id | storage_id | schema_id | version | is_latest | run_id | created_at | created_by_id | _aux | _branch_code | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | |||||||||||||||||||||||
1362 | 56uA9lPPmJ4zLUcr0000 | 2025-02-25/h5ad/plate10_filt_Vevo_Tahoe100M_WS... | None | .h5ad | dataset | AnnData | 26536400717 | j1FXsX7hs7u+eBqnWnmNHw | None | 8044908 | md5 | False | False | 1 | 2 | 3 | None | True | 1 | 2025-02-25 23:22:17.849980+00:00 | 1 | None | 1 |
1363 | omn7JStfJMzy8m6O0000 | 2025-02-25/h5ad/plate11_filt_Vevo_Tahoe100M_WS... | None | .h5ad | dataset | AnnData | 23230802756 | N2mzoYlMLEl6PdecaYyDvw | None | 7435869 | md5 | False | False | 1 | 2 | 3 | None | True | 1 | 2025-02-25 23:22:18.229629+00:00 | 1 | None | 1 |
1364 | S2h2rPLCaUhZAM9u0000 | 2025-02-25/h5ad/plate12_filt_Vevo_Tahoe100M_WS... | None | .h5ad | dataset | AnnData | 37495736876 | VjAkWVFGVpzAMi9Innusuw | None | 10487057 | md5 | False | False | 1 | 2 | 3 | None | True | 1 | 2025-02-25 23:22:18.600910+00:00 | 1 | None | 1 |
You can download an .h5ad into your local cache:
artifact1.cache()
Or stream it:
artifact1.open()
Open the obs metadata parquet file as a PyArrow Dataset
Open the obs metadata file (2.29 GB) with PyArrow.Dataset.
obs_metadata = ln.Artifact.filter(
    key__endswith="obs_metadata.parquet", projects=project_tahoe
).one()
obs_metadata
Artifact(uid='y1TTR9wbrmZEwpOa0000', is_latest=True, key='2025-02-25/metadata/obs_metadata.parquet', suffix='.parquet', kind='dataset', otype='DataFrame', size=2293981573, hash='qEWOpGw9CmQVzaElyMWT1Q', n_observations=100648790, space_id=1, storage_id=2, run_id=1, created_by_id=1, created_at=2025-02-25 19:33:42 UTC)
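The n_observations of this parquet file (100648790) can be compared with the per-plate n_observations listed in the artifact table earlier. A small sketch of that arithmetic, with the numbers copied from the table, suggests the file holds one row per cell across all 14 plates:

```python
# n_observations per plate, copied from the artifact table of the tahoe100 collection
plate_n_observations = {
    "plate1": 5481420,
    "plate2": 8064658,
    "plate3": 4705402,
    "plate4": 7004356,
    "plate5": 6419498,
    "plate6": 7545393,
    "plate7": 5692117,
    "plate8": 8880979,
    "plate9": 5866669,
    "plate10": 8044908,
    "plate11": 7435869,
    "plate12": 10487057,
    "plate13": 8501658,
    "plate14": 6518806,
}

total_cells = sum(plate_n_observations.values())
print(total_cells)  # 100648790, the n_observations of obs_metadata.parquet
```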
obs_metadata_ds = obs_metadata.open()
obs_metadata_ds.schema
plate: string
BARCODE_SUB_LIB_ID: string
sample: string
gene_count: int64
tscp_count: int64
mread_count: int64
drugname_drugconc: string
drug: string
cell_line: dictionary<values=string, indices=int32, ordered=0>
sublibrary: string
BARCODE: string
pcnt_mito: float
S_score: double
G2M_score: double
phase: dictionary<values=string, indices=int32, ordered=0>
pass_filter: dictionary<values=string, indices=int32, ordered=0>
cell_name: dictionary<values=string, indices=int32, ordered=0>
__index_level_0__: int64
-- schema metadata --
pandas: '{"index_columns": ["__index_level_0__"], "column_indexes": [{"na' + 2487
Find the A549 cells that are perturbed with Piroxicam:
filter_expr = (pc.field("cell_name") == cell_lines.cvcl_0023.name) & (
    pc.field("drug") == drugs.piroxicam.name
)
obs_metadata_df = obs_metadata_ds.scanner(filter=filter_expr).to_table().to_pandas()
obs_metadata_df.value_counts("plate")
plate
plate12 2818
plate10 2812
plate11 2279
Name: count, dtype: int64
obs_metadata_df.head()
plate | BARCODE_SUB_LIB_ID | sample | gene_count | tscp_count | mread_count | drugname_drugconc | drug | cell_line | sublibrary | BARCODE | pcnt_mito | S_score | G2M_score | phase | pass_filter | cell_name | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
29314 | plate10 | 50_030_183-lib_1681 | smp_2408 | 644 | 863 | 1024 | [('Piroxicam', 0.05, 'uM')] | Piroxicam | CVCL_0023 | lib_1681 | 50_030_183 | 0.101970 | -0.282297 | -0.165568 | G1 | full | A549 |
29337 | plate10 | 50_035_135-lib_1681 | smp_2408 | 1130 | 1570 | 1827 | [('Piroxicam', 0.05, 'uM')] | Piroxicam | CVCL_0023 | lib_1681 | 50_035_135 | 0.077070 | -0.335042 | -0.280220 | G1 | full | A549 |
29338 | plate10 | 50_035_171-lib_1681 | smp_2408 | 1058 | 1534 | 1809 | [('Piroxicam', 0.05, 'uM')] | Piroxicam | CVCL_0023 | lib_1681 | 50_035_171 | 0.124511 | -0.402028 | -0.404579 | G1 | full | A549 |
29352 | plate10 | 50_038_157-lib_1681 | smp_2408 | 1265 | 1883 | 2240 | [('Piroxicam', 0.05, 'uM')] | Piroxicam | CVCL_0023 | lib_1681 | 50_038_157 | 0.147106 | -0.455343 | -0.311355 | G1 | full | A549 |
29355 | plate10 | 50_039_078-lib_1681 | smp_2408 | 1355 | 1914 | 2258 | [('Piroxicam', 0.05, 'uM')] | Piroxicam | CVCL_0023 | lib_1681 | 50_039_078 | 0.070010 | -0.349396 | 0.186264 | G2M | full | A549 |
Retrieve the corresponding cells from h5ad files.
plate_cells = obs_metadata_df.groupby("plate")["BARCODE_SUB_LIB_ID"].apply(list)
adatas = []
for artifact in artifacts_a549_piroxicam:
    plate = artifact.features.get_values()["plate"]
    idxs = plate_cells.get(plate)
    print(f"Loading {len(idxs)} cells from plate {plate}")
    with artifact.open() as store:
        adata = store[idxs].to_memory()  # can also subset genes here
        adatas.append(adata)
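The lookup structure used in the loop is a pandas Series that maps each plate to the list of cell identifiers on it. A minimal sketch on toy data shows its shape; the first two identifiers are taken from the table above, the third one and the `toy_*` names are invented for illustration.

```python
import pandas as pd

toy_obs = pd.DataFrame(
    {
        "plate": ["plate10", "plate10", "plate11"],
        "BARCODE_SUB_LIB_ID": [
            "50_030_183-lib_1681",
            "50_035_135-lib_1681",
            "01_001_001-lib_0000",  # invented identifier
        ],
    }
)

# one list of cell identifiers per plate, indexed by plate name
toy_plate_cells = toy_obs.groupby("plate")["BARCODE_SUB_LIB_ID"].apply(list)

print(toy_plate_cells.get("plate10"))
print(toy_plate_cells.get("plate99"))  # a plate without matching cells gives None
```

Since `Series.get` returns None for a missing key, it may be worth skipping an artifact when the lookup returns None before calling `len()` on the result.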
scBaseCamp
project_scbasecamp = ln.Project.get(name="scBaseCamp")
project_scbasecamp
Project(uid='vdK00t9DGwHP', name='scBaseCamp', is_type=False, url='https://arcinstitute.org/tools/virtualcellatlas', space_id=1, created_by_id=1, created_at=2025-02-26 16:04:08 UTC)
This project has 105 collections (21 organisms x 5 count features):
project_scbasecamp.collections.df()
uid | key | description | hash | reference | reference_type | space_id | meta_artifact_id | version | is_latest | run_id | created_at | created_by_id | _aux | _branch_code | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | |||||||||||||||
87 | QyeOMM8Qu2Yc637f0000 | scBaseCamp/Velocyto/Schistosoma_mansoni | None | 7XZzjMBlIJQMqrcOhYFQYQ | None | None | 1 | None | 2025-02-25 | True | 10 | 2025-03-03 11:07:36.194395+00:00 | 1 | None | 1 |
71 | rForlsvLjM8zEgbO0000 | scBaseCamp/GeneFull_ExonOverIntron/Oryza_sativa | None | SqNuN0qVtQskeDnAZPRLrQ | None | None | 1 | None | 2025-02-25 | True | 10 | 2025-03-03 11:06:15.137130+00:00 | 1 | None | 1 |
68 | wXctL2347aWNGnf90000 | scBaseCamp/Gene/Oryza_sativa | None | LTqCz0GuUi1CnbHM_zi9qw | None | None | 1 | None | 2025-02-25 | True | 10 | 2025-03-03 11:06:00.109765+00:00 | 1 | None | 1 |
51 | nJV1L9cV1nev1OmF0000 | scBaseCamp/GeneFull_ExonOverIntron/Heterocepha... | None | T6J_WY2k420oM5BE_I0rpA | None | None | 1 | None | 2025-02-25 | True | 10 | 2025-03-03 11:03:47.412575+00:00 | 1 | None | 1 |
80 | nBrtxyYP9yzufHe70000 | scBaseCamp/GeneFull_Ex50pAS/Pan_troglodytes | None | JF1_XDO5EFM13xRBxDCSaQ | None | None | 1 | None | 2025-02-25 | True | 10 | 2025-03-03 11:07:01.132150+00:00 | 1 | None | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
55 | BLamUQZhqBTnHG4K0000 | scBaseCamp/GeneFull_Ex50pAS/Homo_sapiens | None | SLBug97gNkMCZ3Gd2Bp1Aw | None | None | 1 | None | 2025-02-25 | True | 10 | 2025-03-03 11:04:28.695376+00:00 | 1 | None | 1 |
27 | 2wPZaiNxigodW7X60000 | scBaseCamp/Velocyto/Danio_rerio | None | ceCKmkcgKyk_bRHhjGodTQ | None | None | 1 | None | 2025-02-25 | True | 10 | 2025-03-03 11:01:45.771604+00:00 | 1 | None | 1 |
23 | kXjTL9XbRysx3A8P0000 | scBaseCamp/Gene/Danio_rerio | None | TOhVCAQMVTRO8VD27SF6WQ | None | None | 1 | None | 2025-02-25 | True | 10 | 2025-03-03 11:01:25.162863+00:00 | 1 | None | 1 |
58 | TMcFueJifRSFVrSq0000 | scBaseCamp/Gene/Macaca_mulatta | None | OuNCmFSkmfKiLjvGEbBVKw | None | None | 1 | None | 2025-02-25 | True | 10 | 2025-03-03 11:05:04.524140+00:00 | 1 | None | 1 |
8 | ttGkPgXxLDO4sSXF0000 | scBaseCamp/Gene/Bos_taurus | None | jn1Nhcdt0lpB1I3hQ4SgFw | None | None | 1 | None | 2025-02-25 | True | 10 | 2025-03-03 11:00:09.130314+00:00 | 1 | None | 1 |
105 rows × 15 columns
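Judging from the rows displayed above, the collection keys appear to follow the pattern scBaseCamp/&lt;count feature&gt;/&lt;organism&gt;. Assuming that pattern holds for all 105 collections, a short sketch can group organisms by count feature; the keys below are copied from the rows that are shown in full.

```python
from collections import defaultdict

# keys copied from the untruncated rows of the collection table
keys = [
    "scBaseCamp/Velocyto/Schistosoma_mansoni",
    "scBaseCamp/GeneFull_ExonOverIntron/Oryza_sativa",
    "scBaseCamp/Gene/Oryza_sativa",
    "scBaseCamp/GeneFull_Ex50pAS/Pan_troglodytes",
    "scBaseCamp/GeneFull_Ex50pAS/Homo_sapiens",
    "scBaseCamp/Velocyto/Danio_rerio",
    "scBaseCamp/Gene/Danio_rerio",
    "scBaseCamp/Gene/Macaca_mulatta",
    "scBaseCamp/Gene/Bos_taurus",
]

# assumed layout: <project>/<count feature>/<organism>
organisms_by_feature = defaultdict(set)
for key in keys:
    _, feature, organism = key.split("/")
    organisms_by_feature[feature].add(organism)

for feature, organisms_seen in sorted(organisms_by_feature.items()):
    print(feature, sorted(organisms_seen))
```

With the real table, the same grouping applied to the `key` column would be expected to give 5 features with 21 organisms each, matching the 105 collections.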
Query artifacts of interest based on metadata
Often you might not want to access all the h5ads in a collection, but rather filter them by metadata:
organisms = bt.Organism.lookup()
tissues = bt.Tissue.lookup()
efos = bt.ExperimentalFactor.lookup()
feature_counts = ln.ULabel.filter(type__name="STARsolo count features").lookup()
h5ads_brain = ln.Artifact.filter(
    suffix=".h5ad",
    projects=project_scbasecamp,
    organisms=organisms.human,
    ulabels=feature_counts.genefull_ex50pas,
    tissues=tissues.brain,
    experimental_factors=efos.single_cell,
    experiments__name__contains="CRISPRi",  # `perturbation` column is registered in `wetlab.Experiment`
).distinct()
h5ads_brain.df()
uid | key | description | suffix | kind | otype | size | hash | n_files | n_observations | _hash_type | _key_is_virtual | _overwrite_versions | space_id | storage_id | schema_id | version | is_latest | run_id | created_at | created_by_id | _aux | _branch_code | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
id | |||||||||||||||||||||||
104180 | 1AlmBH0wFzUqosGV0000 | 2025-02-25/h5ad/GeneFull_Ex50pAS/Homo_sapiens/... | None | .h5ad | dataset | AnnData | 3448668 | A0k605SWKyxecLUFjNqS8A | None | 6164 | None | False | True | 1 | 3 | 55 | None | True | 10 | 2025-02-28 16:46:25.771217+00:00 | 1 | None | 1 |
104186 | 24rg7gDQqP0EQRq30000 | 2025-02-25/h5ad/GeneFull_Ex50pAS/Homo_sapiens/... | None | .h5ad | dataset | AnnData | 35229865 | EA3jW7rwaZhIwtZpLLNCQQ | None | 7463 | None | False | True | 1 | 3 | 55 | None | True | 10 | 2025-02-28 16:46:25.771217+00:00 | 1 | None | 1 |
104204 | 2vZHojPycv8uPoXp0000 | 2025-02-25/h5ad/GeneFull_Ex50pAS/Homo_sapiens/... | None | .h5ad | dataset | AnnData | 35133716 | Ud5Je3ue2dQcG53leo1nhA | None | 4709 | None | False | True | 1 | 3 | 55 | None | True | 10 | 2025-02-28 16:46:25.771217+00:00 | 1 | None | 1 |
104174 | 3EbJEIJnCGqnEMUI0000 | 2025-02-25/h5ad/GeneFull_Ex50pAS/Homo_sapiens/... | None | .h5ad | dataset | AnnData | 5727864 | nddvJ0NRE3/rTAfQgyubow | None | 7376 | None | False | True | 1 | 3 | 55 | None | True | 10 | 2025-02-28 16:46:25.771217+00:00 | 1 | None | 1 |
104205 | 3JlzQ4PcN58pOxM50000 | 2025-02-25/h5ad/GeneFull_Ex50pAS/Homo_sapiens/... | None | .h5ad | dataset | AnnData | 35877513 | elUEIdXpHR1xfltqUYPBgw | None | 4718 | None | False | True | 1 | 3 | 55 | None | True | 10 | 2025-02-28 16:46:25.771217+00:00 | 1 | None | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
104197 | Wg6YBPWCwfU4Vr960000 | 2025-02-25/h5ad/GeneFull_Ex50pAS/Homo_sapiens/... | None | .h5ad | dataset | AnnData | 38354054 | JJCCXbqWTaIeV5vJvOllzw | None | 7627 | None | False | True | 1 | 3 | 55 | None | True | 10 | 2025-02-28 16:46:25.771217+00:00 | 1 | None | 1 |
104170 | YqiNrGCXc1cM9Dg90000 | 2025-02-25/h5ad/GeneFull_Ex50pAS/Homo_sapiens/... | None | .h5ad | dataset | AnnData | 5494309 | kMbDZo5QMSt3WzLKZjsdCg | None | 7383 | None | False | True | 1 | 3 | 55 | None | True | 10 | 2025-02-28 16:46:25.771217+00:00 | 1 | None | 1 |
104219 | zAxkTKnxCUEBAibd0000 | 2025-02-25/h5ad/GeneFull_Ex50pAS/Homo_sapiens/... | None | .h5ad | dataset | AnnData | 37935375 | D/xXUsmFZ14802xqd5cWaw | None | 7616 | None | False | True | 1 | 3 | 55 | None | True | 10 | 2025-02-28 16:46:25.771217+00:00 | 1 | None | 1 |
104206 | ZgGYpGntv2sF92Wg0000 | 2025-02-25/h5ad/GeneFull_Ex50pAS/Homo_sapiens/... | None | .h5ad | dataset | AnnData | 36858036 | fUND8GyVTUu3KrDEhmYYLg | None | 9128 | None | False | True | 1 | 3 | 55 | None | True | 10 | 2025-02-28 16:46:25.771217+00:00 | 1 | None | 1 |
104166 | ZmSJbhRC4WeK1nyA0000 | 2025-02-25/h5ad/GeneFull_Ex50pAS/Homo_sapiens/... | None | .h5ad | dataset | AnnData | 40518635 | gdcEf34j7wAVvxcUby9UDw | None | 7114 | None | False | True | 1 | 3 | 55 | None | True | 10 | 2025-02-28 16:46:25.771217+00:00 | 1 | None | 1 |
64 rows × 23 columns
Load the h5ad files with obs metadata
Load the h5ads as a single AnnData:
adatas = []
for artifact in h5ads_brain[:5]:  # only load the first 5 artifacts to save CI time
    adatas.append(artifact.load())
# the obs metadata is stored in the parquet files
adata_concat = ad.concat(adatas)
adata_concat
/opt/hostedtoolcache/Python/3.12.9/x64/lib/python3.12/site-packages/anndata/_core/anndata.py:1756: UserWarning: Observation names are not unique. To make them unique, call `.obs_names_make_unique`.
utils.warn_names_duplicates("obs")
AnnData object with n_obs × n_vars = 38206 × 36601
obs: 'gene_count', 'umi_count', 'SRX_accession'
Open the sample metadata:
sample_meta = ln.Artifact.filter(
    key__endswith="sample_metadata.parquet",
    projects=project_scbasecamp,
    organisms=organisms.human,
    ulabels=feature_counts.genefull_ex50pas,
).one()
sample_meta
Artifact(uid='WCHkcyWN8L6pDI4E0000', is_latest=True, key='2025-02-25/metadata/GeneFull_Ex50pAS/Homo_sapiens/sample_metadata.parquet', suffix='.parquet', kind='dataset', otype='DataFrame', size=531878, hash='4QrqW8DQVRl6bKNYiJhq3g', n_observations=16077, space_id=1, storage_id=3, run_id=2, created_by_id=1, created_at=2025-02-25 20:41:32 UTC)
sample_meta_dataset = sample_meta.open()
sample_meta_dataset.schema
entrez_id: int64
srx_accession: string
file_path: string
obs_count: int64
lib_prep: string
tech_10x: string
cell_prep: string
organism: string
tissue: string
disease: string
perturbation: string
cell_line: string
czi_collection_id: string
czi_collection_name: string
-- schema metadata --
pandas: '{"index_columns": [], "column_indexes": [], "columns": [{"name":' + 1755
Fetch corresponding sample metadata:
filter_expr = pc.field("srx_accession").isin(
    adata_concat.obs["SRX_accession"].astype(str)
)
df = sample_meta_dataset.scanner(filter=filter_expr).to_table().to_pandas()
Add the sample metadata to the AnnData:
adata_concat.obs = adata_concat.obs.merge(
    df, left_on="SRX_accession", right_on="srx_accession"
)
adata_concat
AnnData object with n_obs × n_vars = 38206 × 36601
obs: 'gene_count', 'umi_count', 'SRX_accession', 'entrez_id', 'srx_accession', 'file_path', 'obs_count', 'lib_prep', 'tech_10x', 'cell_prep', 'organism', 'tissue', 'disease', 'perturbation', 'cell_line', 'czi_collection_id', 'czi_collection_name'
adata_concat.obs.head()
gene_count | umi_count | SRX_accession | entrez_id | srx_accession | file_path | obs_count | lib_prep | tech_10x | cell_prep | organism | tissue | disease | perturbation | cell_line | czi_collection_id | czi_collection_name | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2748 | 5134.0 | SRX10606628 | 14083632 | SRX10606628 | gs://arc-ctc-scbasecamp/2025-02-25/h5ad/GeneFu... | 7641 | 10x_Genomics | 3_prime_gex | single_cell | Homo sapiens | brain | Down syndrome | CRISPR/Cas9, CRISPRi, or small-molecule inhibi... | DS1 | None | None |
1 | 2351 | 4639.0 | SRX10606628 | 14083632 | SRX10606628 | gs://arc-ctc-scbasecamp/2025-02-25/h5ad/GeneFu... | 7641 | 10x_Genomics | 3_prime_gex | single_cell | Homo sapiens | brain | Down syndrome | CRISPR/Cas9, CRISPRi, or small-molecule inhibi... | DS1 | None | None |
2 | 2184 | 4293.0 | SRX10606628 | 14083632 | SRX10606628 | gs://arc-ctc-scbasecamp/2025-02-25/h5ad/GeneFu... | 7641 | 10x_Genomics | 3_prime_gex | single_cell | Homo sapiens | brain | Down syndrome | CRISPR/Cas9, CRISPRi, or small-molecule inhibi... | DS1 | None | None |
3 | 2469 | 5307.0 | SRX10606628 | 14083632 | SRX10606628 | gs://arc-ctc-scbasecamp/2025-02-25/h5ad/GeneFu... | 7641 | 10x_Genomics | 3_prime_gex | single_cell | Homo sapiens | brain | Down syndrome | CRISPR/Cas9, CRISPRi, or small-molecule inhibi... | DS1 | None | None |
4 | 4144 | 9340.0 | SRX10606628 | 14083632 | SRX10606628 | gs://arc-ctc-scbasecamp/2025-02-25/h5ad/GeneFu... | 7641 | 10x_Genomics | 3_prime_gex | single_cell | Homo sapiens | brain | Down syndrome | CRISPR/Cas9, CRISPRi, or small-molecule inhibi... | DS1 | None | None |
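Note that `DataFrame.merge` returns a frame with a fresh integer index, which is why the rows above are labeled 0, 1, 2, ... rather than by cell barcode. If you want to keep the original obs names, one possible approach is to carry the index through the merge. The following is a sketch on toy data, not the procedure used to build the atlas; the barcodes, accessions and `toy_*` names are invented.

```python
import pandas as pd

# toy obs table indexed by (non-unique) cell barcodes
toy_obs = pd.DataFrame(
    {"SRX_accession": ["SRX1", "SRX1", "SRX2"]},
    index=["AAAC", "AAAG", "AAAC"],
)
toy_samples = pd.DataFrame(
    {"srx_accession": ["SRX1", "SRX2"], "tissue": ["brain", "brain"]}
)

# a plain merge discards the barcode index
merged = toy_obs.merge(
    toy_samples, left_on="SRX_accession", right_on="srx_accession", how="left"
)
print(list(merged.index))

# move the index into a column, merge, then restore it
kept = (
    toy_obs.rename_axis("barcode")
    .reset_index()
    .merge(toy_samples, left_on="SRX_accession", right_on="srx_accession", how="left")
    .set_index("barcode")
)
print(list(kept.index))
```

Using `how="left"` keeps every cell even when a sample has no metadata row, so the number of rows stays equal to the number of observations.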