lamindb.Feature¶

Bases: Record, CanCurate, TracksRun, TracksUpdates

Dataset dimensions.

A feature represents a dimension of a dataset, such as a column in a DataFrame. The Feature registry organizes metadata of features.

The Feature registry helps you organize and query datasets based on their features and corresponding label annotations. For instance, when working with a “T cell” label, it could be measured through different features such as "cell_type_by_expert" where an expert manually classified the cell, or "cell_type_by_model" where a computational model made the classification.

The two most important pieces of metadata of a feature are its name and its dtype. In addition to typical data types, LaminDB has a "num" dtype to concisely denote the union of all numerical types.

Parameters:

name – str Name of the feature, typically a column name.
dtype – FeatureDtype | Registry | list[Registry] | FieldAttr See FeatureDtype. For categorical types, can define from which registry values are sampled, e.g., ULabel or [ULabel, bionty.CellType].
unit – str | None = None Unit of measure, ideally SI ("m", "s", "kg", etc.) or "normalized" etc.
description – str | None = None A description.
synonyms – str | None = None Bar-separated synonyms.
nullable – bool = True Whether the feature can have null-like values (None, pd.NA, NaN, etc.), see nullable.
default_value – Any | None = None Default value for the feature.
cat_filters – dict[str, str] | None = None Subset a registry by additional filters to define valid categories.

Note

For more control, you can use bionty registries to manage simple biological entities like genes, proteins & cell markers, or define custom registries to manage high-level derived features like gene sets.

See also

from_df(): Create feature records from DataFrame.
features: Feature manager of an artifact or collection.
ULabel: Universal labels.
Schema: Feature sets.

Example

A simple "str" feature.

>>> ln.Feature(
...     name="sample_note",
...     dtype="str",
... ).save()

A dtype "cat[ULabel]" can be more easily passed as below.

>>> ln.Feature(
...     name="project",
...     dtype=ln.ULabel,
... ).save()

A dtype "cat[ULabel|bionty.CellType]" can be more easily passed as below.

>>> ln.Feature(
...     name="cell_type",
...     dtype=[ln.ULabel, bt.CellType],
... ).save()

Hint

Features and labels denote two ways of using entities to organize data:

A feature qualifies what is measured, i.e., a numerical or categorical random variable
A label is a measured value, i.e., a category

Consider annotating a dataset by the fact that it measured the expression of 30k genes: the genes relate to the dataset as feature identifiers through a feature set with 30k members. Now consider annotating the artifact by the fact that it measured the knock-out of 3 genes: here, the 3 genes act as labels of the dataset.

Re-shaping data can introduce ambiguity among features & labels. If this happens, ask yourself what the joint measurement was: a feature qualifies variables in a joint measurement. The canonical data matrix lists jointly measured variables in the columns.
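To make the distinction concrete, here is a minimal plain-Python sketch. It makes no LaminDB calls, and the toy table and its column names are made up for illustration. Column names play the role of features, while the categories found in the columns play the role of labels.

```python
# Toy data matrix: each row is one observation, each key a jointly measured variable.
rows = [
    {"cell_type_by_expert": "T cell", "cell_type_by_model": "T cell"},
    {"cell_type_by_expert": "B cell", "cell_type_by_model": "T cell"},
]

# Features qualify what is measured: here, the column names.
features = sorted(rows[0])

# Labels are measured values: here, the categories observed in the columns.
labels = sorted({value for row in rows for value in row.values()})

print(features)  # ['cell_type_by_expert', 'cell_type_by_model']
print(labels)    # ['B cell', 'T cell']
```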

Attributes¶

property default_value: Any¶

A default value that overwrites missing values (default None).

This takes effect when you call Curator.standardize().

If default_value = None, missing values like pd.NA or np.nan are kept.
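The semantics described above can be sketched roughly in plain Python. This is not LaminDB code: `fill_missing` and `is_missing` are hypothetical helpers that only mimic the documented behavior, with `None` and `float("nan")` standing in for values like pd.NA and np.nan.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing (a simplification of pd.NA / np.nan)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def fill_missing(values, default_value=None):
    """Hypothetical helper: replace missing entries unless default_value is None."""
    if default_value is None:
        return list(values)  # missing values are kept as they are
    return [default_value if is_missing(v) else v for v in values]


filled = fill_missing(["asthma", None, float("nan")], default_value="unknown")
print(filled)  # ['asthma', 'unknown', 'unknown']

kept = fill_missing(["asthma", None], default_value=None)
print(kept)  # ['asthma', None]
```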

property nullable: bool¶

Indicates whether the feature can have null values (default True).

Example:

import lamindb as ln
import pandas as pd

disease = ln.Feature(name="disease", dtype=ln.ULabel, nullable=False).save()
schema = ln.Schema(features=[disease]).save()
dataset = {"disease": pd.Categorical([pd.NA, "asthma"])}
df = pd.DataFrame(dataset)
curator = ln.curators.DataFrameCurator(df, schema)
try:
    curator.validate()
except ln.errors.ValidationError as e:
    assert str(e).startswith("non-nullable series 'disease' contains null values")

Simple fields¶

uid: str¶: Universal id, valid across DB instances.

name: str¶: Name of feature (hard unique constraint unique=True).

dtype: FeatureDtype | None¶

Data type (FeatureDtype).

For categorical types, can define from which registry values are sampled, e.g., 'cat[ULabel]' or 'cat[bionty.CellType]'. Unions are also allowed if the feature samples from two registries, e.g., 'cat[ULabel|bionty.CellType]'
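The string forms above follow a simple pattern. The sketch below builds them from registry names; `cat_dtype` is a hypothetical helper for illustration, not a LaminDB function.

```python
def cat_dtype(*registry_names):
    """Hypothetical helper: compose a categorical dtype string from registry names."""
    if not registry_names:
        raise ValueError("at least one registry name is required")
    return "cat[" + "|".join(registry_names) + "]"


single = cat_dtype("ULabel")
union = cat_dtype("ULabel", "bionty.CellType")
print(single)  # cat[ULabel]
print(union)   # cat[ULabel|bionty.CellType]
```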

is_type: bool¶: Distinguish types from instances of the type.

unit: str | None¶: Unit of measure, ideally SI (m, s, kg, etc.) or ‘normalized’ etc. (optional).

description: str | None¶: A description.

array_rank: int¶

Rank of feature.

Number of indices of the array: 0 for scalar, 1 for vector, 2 for matrix.

Is called .ndim in numpy and pytorch but shouldn’t be confused with the dimension of the feature space.

array_size: int¶

Number of elements of the feature.

Total number of elements (product of shape components) of the array.

A number or string (a scalar): 1
A 50-dimensional embedding: 50
A 25 x 25 image: 625

array_shape: list[int] | None¶

Shape of the feature.

A number or string (a scalar): [1]
A 50-dimensional embedding: [50]
A 25 x 25 image: [25, 25]

Is stored as a list rather than a tuple because it’s serialized as JSON.
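The relation between array_shape and array_size, and the reason for storing the shape as a list, can be checked with the standard library alone. This sketch does not touch LaminDB; the variable names are illustrative.

```python
import json
import math

shapes = {"scalar": [1], "embedding": [50], "image": [25, 25]}

# array_size is the product of the shape components.
sizes = {name: math.prod(shape) for name, shape in shapes.items()}
print(sizes)  # {'scalar': 1, 'embedding': 50, 'image': 625}

# JSON has no tuple type: a tuple is written as an array and read back as a list.
restored = json.loads(json.dumps((25, 25)))
print(restored)  # [25, 25]
```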

proxy_dtype: FeatureDtype | None¶

Proxy data type.

If the feature is an image it’s often stored via a path to the image file. Hence, while the dtype might be image with a certain shape, the proxy dtype would be str.

synonyms: str | None¶: Bar-separated (|) synonyms (optional).

created_at: datetime¶: Time of creation of record.

updated_at: datetime¶: Time of last update to record.

Relational fields¶

space: Space¶: The space in which the record lives.

created_by: User¶: Creator of record.

run: Run | None¶: Run that created record.

type: Feature | None¶

Type of feature (e.g., ‘Readout’, ‘Metric’, ‘Metadata’, ‘ExpertAnnotation’, ‘ModelPrediction’).

Allows grouping features by type, e.g., all readouts, all metrics, etc.

schemas: Schema¶: Feature sets linked to this feature.

records: Feature¶: Records of this type.

values: FeatureValue¶: Values for this feature.

projects¶

Accessor to the related objects manager on the forward and reverse sides of a many-to-many relation.


Class methods¶

classmethod df(include=None, features=False, limit=100)¶

Convert to pd.DataFrame.

By default, shows all direct fields, except updated_at.

Use the arguments include or features to include other data.

Parameters:

include (str | list[str] | None, default: None) – Related fields to include as columns. Takes strings of form "ulabels__name", "cell_types__name", etc. or a list of such strings.
features (bool | list[str], default: False) – If True, map all features of the Feature registry onto the resulting DataFrame. Only available for Artifact.
limit (int, default: 100) – Maximum number of rows to display from a Pandas DataFrame. Defaults to 100 to reduce database load.

Return type:

DataFrame

Examples

Include the name of the creator in the DataFrame:

>>> ln.ULabel.df(include="created_by__name")

Include display of features for Artifact:

>>> df = ln.Artifact.df(features=True)
>>> ln.view(df)  # visualize with type annotations

Only include select features:

>>> df = ln.Artifact.df(features=["cell_type_by_expert", "cell_type_by_model"])

classmethod filter(*queries, **expressions)¶

Query records.

Parameters:

queries – One or multiple Q objects.
expressions – Fields and values passed as Django query expressions.

Return type:

QuerySet

Returns:

A QuerySet.

See also

Guide: Query & search registries
Django documentation: Queries

Examples

>>> ln.ULabel(name="my label").save()
>>> ln.ULabel.filter(name__startswith="my").df()
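Django query expressions encode a field name and a lookup in a single keyword, separated by a double underscore. The sketch below emulates two lookups on plain dictionaries to show how `name__startswith="my"` is read. `matches` is a hypothetical helper; it supports far fewer lookups than Django and does not follow relations.

```python
def matches(record, **expressions):
    """Hypothetical emulation of a few Django-style lookups on a plain dict."""
    for key, expected in expressions.items():
        # "name__startswith" -> field "name", lookup "startswith"
        field, _, lookup = key.partition("__")
        value = record[field]
        if lookup in ("", "exact"):
            ok = value == expected
        elif lookup == "startswith":
            ok = value.startswith(expected)
        else:
            raise NotImplementedError(f"lookup {lookup!r} is not emulated here")
        if not ok:
            return False
    return True


records = [{"name": "my label"}, {"name": "other label"}]
hits = [r["name"] for r in records if matches(r, name__startswith="my")]
print(hits)  # ['my label']
```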

classmethod from_df(df, field=None)¶

Create Feature records for columns.

Return type: RecordList

classmethod from_values(values, field=None, create=False, organism=None, source=None, mute=False)¶

Bulk create validated records by parsing values for an identifier (such as a name or an id).

Parameters:

values (list[str] | Series | array) – A list of values for an identifier, e.g. ["name1", "name2"].
field (str | DeferredAttribute | None, default: None) – A Record field to look up, e.g., bt.CellMarker.name.
create (bool, default: False) – Whether to create records if they don’t exist.
organism (Record | str | None, default: None) – A bionty.Organism name or record.
source (Record | None, default: None) – A bionty.Source record to validate against when creating records.
mute (bool, default: False) – Whether to mute logging.

Return type:

RecordList

Returns:

A list of validated records. For bionty registries, also returns knowledge-coupled records.

Notes

For more info, see tutorial: Manage biological registries.

Examples

Bulk creating from non-validated values logs warnings and returns an empty list:

>>> ulabels = ln.ULabel.from_values(["benchmark", "prediction", "test"], field="name")
>>> assert len(ulabels) == 0

Bulk create records from validated values returns the corresponding existing records:

>>> ln.save([ln.ULabel(name=name) for name in ["benchmark", "prediction", "test"]])
>>> ulabels = ln.ULabel.from_values(["benchmark", "prediction", "test"], field="name")
>>> assert len(ulabels) == 3

Bulk create records from public reference:

>>> import bionty as bt
>>> records = bt.CellType.from_values(["T cell", "B cell"], field="name")
>>> records

classmethod get(idlike=None, **expressions)¶

Get a single record.

Parameters:

idlike (int | str | None, default: None) – Either a uid stub, uid or an integer id.
expressions – Fields and values passed as Django query expressions.

Return type:

Record

Returns:

A record.

Raises:

lamindb.errors.DoesNotExist – In case no matching record is found.

See also

Guide: Query & search registries
Django documentation: Queries

Examples

>>> ulabel = ln.ULabel.get("FvtpPJLJ")
>>> ulabel = ln.ULabel.get(name="my-label")

classmethod inspect(values, field=None, *, mute=False, organism=None, source=None, strict_source=False)¶

Inspect if values are mappable to a field.

Being mappable means that an exact match exists.

Parameters:

values (list[str] | Series | array) – Values that will be checked against the field.
field (str | DeferredAttribute | None, default: None) – The field of values. Examples are 'ontology_id' to map against the source ID or 'name' to map against the ontology's names.
mute (bool, default: False) – Whether to mute logging.
organism (str | Record | None, default: None) – An Organism name or record.
source (Record | None, default: None) – A bionty.Source record that specifies the version to inspect against.
strict_source (bool, default: False) – Determines the validation behavior against records in the registry.
- If False, validation will include all records in the registry, ignoring the specified source.
- If True, validation will only include records in the registry that are linked to the specified source.
Note: this parameter won’t affect validation against bionty/public sources.

Return type:

InspectResult

See also

validate()

Examples

>>> import bionty as bt
>>> bt.settings.organism = "human"
>>> ln.save(bt.Gene.from_values(["A1CF", "A1BG", "BRCA2"], field="symbol"))
>>> gene_symbols = ["A1CF", "A1BG", "FANCD1", "FANCD20"]
>>> result = bt.Gene.inspect(gene_symbols, field=bt.Gene.symbol)
>>> result.validated
['A1CF', 'A1BG']
>>> result.non_validated
['FANCD1', 'FANCD20']
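Since being mappable means that an exact match exists, the split into validated and non-validated values can be pictured as below. This is a plain-Python sketch with a hypothetical `inspect_values` helper and a toy set of known symbols, not the actual implementation.

```python
def inspect_values(values, known):
    """Hypothetical helper: split values by exact membership in `known`."""
    return {
        "validated": [v for v in values if v in known],
        "non_validated": [v for v in values if v not in known],
    }


known_symbols = {"A1CF", "A1BG", "BRCA2"}  # toy stand-in for saved gene records
result = inspect_values(["A1CF", "A1BG", "FANCD1", "FANCD20"], known_symbols)
print(result["validated"])      # ['A1CF', 'A1BG']
print(result["non_validated"])  # ['FANCD1', 'FANCD20']
```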

classmethod lookup(field=None, return_field=None)¶

Return an auto-complete object for a field.

Parameters:

field (str | DeferredAttribute | None, default: None) – The field to look up the values for. Defaults to first string field.
return_field (str | DeferredAttribute | None, default: None) – The field to return. If None, returns the whole record.

Return type:

NamedTuple

Returns:

A NamedTuple of lookup information of the field values with a dictionary converter.

See also

search()

Examples

>>> import bionty as bt
>>> bt.settings.organism = "human"
>>> bt.Gene.from_source(symbol="ADGB-DT").save()
>>> lookup = bt.Gene.lookup()
>>> lookup.adgb_dt
>>> lookup_dict = lookup.dict()
>>> lookup_dict['ADGB-DT']
>>> lookup_by_ensembl_id = bt.Gene.lookup(field="ensembl_gene_id")
>>> lookup_by_ensembl_id.ensg00000002745
>>> lookup_return_symbols = bt.Gene.lookup(field="ensembl_gene_id", return_field="symbol")
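The attribute name in the example (`adgb_dt` for "ADGB-DT") suggests that values are normalized into valid Python identifiers. The sketch below mimics that idea with `collections.namedtuple`. The normalization rule in `to_identifier` is an assumption made for illustration and may differ from what LaminDB does.

```python
import re
from collections import namedtuple


def to_identifier(value):
    """Assumed rule: replace runs of non-word characters with '_' and lowercase."""
    return re.sub(r"\W+", "_", value).strip("_").lower()


symbols = ["ADGB-DT", "A1BG"]
Lookup = namedtuple("Lookup", [to_identifier(s) for s in symbols])
lookup = Lookup(*symbols)

print(lookup.adgb_dt)  # ADGB-DT

# A dictionary keyed by the original values, analogous to lookup.dict() above.
lookup_dict = dict(zip(symbols, lookup))
print(lookup_dict["ADGB-DT"])  # ADGB-DT
```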

classmethod search(string, *, field=None, limit=20, case_sensitive=False)¶

Search.

Parameters:

string (str) – The input string to match against the field ontology values.
field (str | DeferredAttribute | None, default: None) – The field or fields to search. Search all string fields by default.
limit (int | None, default: 20) – Maximum number of top results to return.
case_sensitive (bool, default: False) – Whether the match is case sensitive.

Return type:

QuerySet

Returns:

A QuerySet of the top-matching records.

See also

filter()
lookup()

Examples

>>> ulabels = ln.ULabel.from_values(["ULabel1", "ULabel2", "ULabel3"], field="name")
>>> ln.save(ulabels)
>>> ln.ULabel.search("ULabel2")

classmethod standardize(values, field=None, *, return_field=None, return_mapper=False, case_sensitive=False, mute=False, public_aware=True, keep='first', synonyms_field='synonyms', organism=None, source=None, strict_source=False)¶

Maps input synonyms to standardized names.

Parameters:

values (Iterable) – Identifiers that will be standardized.
field (str | DeferredAttribute | None, default: None) – The field representing the standardized names.
return_field (str | DeferredAttribute | None, default: None) – The field to return. Defaults to field.
return_mapper (bool, default: False) – If True, returns {input_value: standardized_name}.
case_sensitive (bool, default: False) – Whether the mapping is case sensitive.
mute (bool, default: False) – Whether to mute logging.
public_aware (bool, default: True) – Whether to standardize from Bionty reference. Defaults to True for Bionty registries.
keep (Literal['first', 'last', False], default: 'first') –
When a synonym maps to multiple names, determines which duplicates to mark as pd.DataFrame.duplicated:
- "first": returns the first mapped standardized name
- "last": returns the last mapped standardized name
- False: returns all mapped standardized names.
When keep is False, the returned list of standardized names will contain nested lists in case of duplicates.

When a field is converted into return_field, keep marks which matches to keep when multiple return_field values map to the same field value.
synonyms_field (str, default: 'synonyms') – A field containing the concatenated synonyms.
organism (str | Record | None, default: None) – An Organism name or record.
source (Record | None, default: None) – A bionty.Source record that specifies the version to validate against.
strict_source (bool, default: False) – Determines the validation behavior against records in the registry.
- If False, validation will include all records in the registry, ignoring the specified source.
- If True, validation will only include records in the registry that are linked to the specified source.
Note: this parameter won’t affect validation against bionty/public sources.

Return type:

list[str] | dict[str, str]

Returns:

If return_mapper is False – a list of standardized names. Otherwise, a dictionary of mapped values with mappable synonyms as keys and standardized names as values.

See also

add_synonym(): Add synonyms.
remove_synonym(): Remove synonyms.

Examples

>>> import bionty as bt
>>> bt.settings.organism = "human"
>>> ln.save(bt.Gene.from_values(["A1CF", "A1BG", "BRCA2"], field="symbol"))
>>> gene_synonyms = ["A1CF", "A1BG", "FANCD1", "FANCD20"]
>>> standardized_names = bt.Gene.standardize(gene_synonyms)
>>> standardized_names
['A1CF', 'A1BG', 'BRCA2', 'FANCD20']
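How synonyms map to standardized names, and what `return_mapper` changes, can be sketched in plain Python. The helper `standardize_values` and the toy synonym table are hypothetical. Only the case where each synonym maps to a single name is modeled, so the `keep` parameter is not covered.

```python
def standardize_values(values, synonyms_by_name, return_mapper=False):
    """Hypothetical helper: map bar-separated synonyms to standardized names."""
    mapper = {}
    for name, synonyms in synonyms_by_name.items():
        # Skip empty entries so that an empty synonyms string adds nothing.
        for synonym in filter(None, synonyms.split("|")):
            mapper[synonym] = name
    if return_mapper:
        # Only inputs that are mappable synonyms appear as keys.
        return {v: mapper[v] for v in values if v in mapper}
    # Values without a known synonym mapping pass through unchanged.
    return [mapper.get(v, v) for v in values]


# Toy table: standardized name -> bar-separated synonyms.
table = {"A1CF": "", "A1BG": "", "BRCA2": "FANCD1"}
inputs = ["A1CF", "A1BG", "FANCD1", "FANCD20"]

names = standardize_values(inputs, table)
print(names)  # ['A1CF', 'A1BG', 'BRCA2', 'FANCD20']

mapping = standardize_values(inputs, table, return_mapper=True)
print(mapping)  # {'FANCD1': 'BRCA2'}
```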

classmethod using(instance)¶

Use a non-default LaminDB instance.

Parameters: instance (str | None) – An instance identifier of form “account_handle/instance_name”.
Return type: QuerySet

Examples

>>> ln.ULabel.using("account_handle/instance_name").search("ULabel7", field="name")
            uid    score
name
ULabel7  g7Hk9b2v  100.0
ULabel5  t4Jm6s0q   75.0
ULabel6  r2Xw8p1z   75.0

classmethod validate(values, field=None, *, mute=False, organism=None, source=None, strict_source=False)¶

Validate values against existing values of a string field.

Note that this is strict validation: it only asserts exact matches.

Parameters:

values (list[str] | Series | array) – Values that will be validated against the field.
field (str | DeferredAttribute | None, default: None) – The field of values. Examples are 'ontology_id' to map against the source ID or 'name' to map against the ontology's names.
mute (bool, default: False) – Whether to mute logging.
organism (str | Record | None, default: None) – An Organism name or record.
source (Record | None, default: None) – A bionty.Source record that specifies the version to validate against.
strict_source (bool, default: False) – Determines the validation behavior against records in the registry.
- If False, validation will include all records in the registry, ignoring the specified source.
- If True, validation will only include records in the registry that are linked to the specified source.
Note: this parameter won’t affect validation against bionty/public sources.

Return type:

ndarray

Returns:

A vector of booleans indicating if an element is validated.

See also

inspect()

Examples

>>> import bionty as bt
>>> bt.settings.organism = "human"
>>> ln.save(bt.Gene.from_values(["A1CF", "A1BG", "BRCA2"], field="symbol"))
>>> gene_symbols = ["A1CF", "A1BG", "FANCD1", "FANCD20"]
>>> bt.Gene.validate(gene_symbols, field=bt.Gene.symbol)
array([ True,  True, False, False])
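The effect of `strict_source` on registry records can be pictured with the following sketch. It uses plain dictionaries and a hypothetical `validate_values` helper; the source names are made up, and validation against public sources is not modeled.

```python
def validate_values(values, records, source=None, strict_source=False):
    """Hypothetical helper: exact-match validation against toy registry records."""
    if strict_source and source is not None:
        # Only consider records linked to the given source.
        records = [r for r in records if r["source"] == source]
    known = {r["symbol"] for r in records}
    return [v in known for v in values]


registry = [
    {"symbol": "A1CF", "source": "source_a"},
    {"symbol": "A1BG", "source": "source_b"},
]
symbols = ["A1CF", "A1BG", "FANCD1"]

lenient = validate_values(symbols, registry, source="source_a")
print(lenient)  # [True, True, False]

strict = validate_values(symbols, registry, source="source_a", strict_source=True)
print(strict)  # [True, False, False]
```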

Methods¶

add_synonym(synonym, force=False, save=None)¶

Add synonyms to a record.

Parameters:

synonym (str | list[str] | Series | array) – The synonyms to add to the record.
force (bool, default: False) – Whether to add synonyms even if they are already synonyms of other records.
save (bool | None, default: None) – Whether to save the record to the database.

See also

remove_synonym(): Remove synonyms.

Examples

>>> import bionty as bt
>>> bt.CellType.from_source(name="T cell").save()
>>> lookup = bt.CellType.lookup()
>>> record = lookup.t_cell
>>> record.synonyms
'T-cell|T lymphocyte|T-lymphocyte'
>>> record.add_synonym("T cells")
>>> record.synonyms
'T cells|T-cell|T-lymphocyte|T lymphocyte'
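Because synonyms are stored as one bar-separated string, adding a synonym amounts to splitting, extending, and re-joining. A minimal sketch with a hypothetical `add_to_synonyms` helper follows. It appends new entries at the end, so the ordering can differ from the output shown above.

```python
def add_to_synonyms(synonyms, new):
    """Hypothetical helper: add one or more synonyms to a bar-separated string."""
    current = [s for s in (synonyms or "").split("|") if s]
    additions = [new] if isinstance(new, str) else list(new)
    for synonym in additions:
        if "|" in synonym:
            raise ValueError("a synonym must not contain the '|' separator")
        if synonym not in current:  # skip duplicates
            current.append(synonym)
    return "|".join(current)


updated = add_to_synonyms("T-cell|T lymphocyte|T-lymphocyte", "T cells")
print(updated)  # T-cell|T lymphocyte|T-lymphocyte|T cells
```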

delete()¶

Delete.

Return type: None

remove_synonym(synonym)¶

Remove synonyms from a record.

Parameters: synonym (str | list[str] | Series | array) – The synonym values to remove.

See also

add_synonym(): Add synonyms.

Examples

>>> import bionty as bt
>>> bt.CellType.from_source(name="T cell").save()
>>> lookup = bt.CellType.lookup()
>>> record = lookup.t_cell
>>> record.synonyms
'T-cell|T lymphocyte|T-lymphocyte'
>>> record.remove_synonym("T-cell")
>>> record.synonyms
'T lymphocyte|T-lymphocyte'

save(*args, **kwargs)¶

Save.

Return type: Feature

set_abbr(value)¶

Set value for abbr field and add to synonyms.

Parameters: value (str) – A value for an abbreviation.

See also

add_synonym()

Examples

>>> import bionty as bt
>>> bt.ExperimentalFactor.from_source(name="single-cell RNA sequencing").save()
>>> scrna = bt.ExperimentalFactor.get(name="single-cell RNA sequencing")
>>> scrna.abbr
None
>>> scrna.synonyms
'single-cell RNA-seq|single-cell transcriptome sequencing|scRNA-seq|single cell RNA sequencing'
>>> scrna.set_abbr("scRNA")
>>> scrna.abbr
'scRNA'
>>> scrna.synonyms
'scRNA|single-cell RNA-seq|single cell RNA sequencing|single-cell transcriptome sequencing|scRNA-seq'
>>> scrna.save()