lamindb.Feature¶
- class lamindb.Feature(name: str, dtype: FeatureDtype | Registry | list[Registry] | FieldAttr, type: Feature | None = None, is_type: bool = False, unit: str | None = None, description: str | None = None, synonyms: str | None = None, nullable: bool = True, default_value: str | None = None, cat_filters: dict[str, str] | None = None)¶
Bases:
Record, CanCurate, TracksRun, TracksUpdates
Dataset dimensions.
A feature represents a dimension of a dataset, such as a column in a DataFrame. The Feature registry organizes metadata of features.
The Feature registry helps you organize and query datasets based on their features and corresponding label annotations. For instance, when working with a “T cell” label, it could be measured through different features such as "cell_type_by_expert", where an expert manually classified the cell, or "cell_type_by_model", where a computational model made the classification.
The two most important metadata of a feature are its name and the dtype. In addition to typical data types, LaminDB has a "num" dtype to concisely denote the union of all numerical types.
- Parameters:
name (str) – Name of the feature, typically a column name.
dtype (FeatureDtype | Registry | list[Registry] | FieldAttr) – See FeatureDtype. For categorical types, can define from which registry values are sampled, e.g., ULabel or [ULabel, bionty.CellType].
unit (str | None = None) – Unit of measure, ideally SI ("m", "s", "kg", etc.) or "normalized" etc.
description (str | None = None) – A description.
synonyms (str | None = None) – Bar-separated synonyms.
nullable (bool = True) – Whether the feature can have null-like values (None, pd.NA, NaN, etc.), see nullable.
default_value (Any | None = None) – Default value for the feature.
cat_filters (dict[str, str] | None = None) – Subset a registry by additional filters to define valid categories.
Note
For more control, you can use bionty registries to manage simple biological entities like genes, proteins & cell markers. Or you define custom registries to manage high-level derived features like gene sets.
Example
A simple "str" feature.
>>> ln.Feature(
...     name="sample_note",
...     dtype="str",
... ).save()
A dtype "cat[ULabel]" can be more easily passed as below.
>>> ln.Feature(
...     name="project",
...     dtype=ln.ULabel,
... ).save()
A dtype "cat[ULabel|bionty.CellType]" can be more easily passed as below.
>>> ln.Feature(
...     name="cell_type",
...     dtype=[ln.ULabel, bt.CellType],
... ).save()
Hint
Features and labels denote two ways of using entities to organize data:
A feature qualifies what is measured, i.e., a numerical or categorical random variable
A label is a measured value, i.e., a category
Consider annotating a dataset by the fact that it measured the expression of 30k genes: the genes relate to the dataset as feature identifiers through a feature set with 30k members. Now consider annotating the artifact by the fact that it measured the knock-out of 3 genes: here, the 3 genes act as labels of the dataset.
Re-shaping data can introduce ambiguity among features & labels. If this happens, ask yourself what the joint measurement was: a feature qualifies variables in a joint measurement. The canonical data matrix lists jointly measured variables in the columns.
Attributes¶
- property default_value: Any¶
A default value that overwrites missing values (default None).
This takes effect when you call Curator.standardize().
If default_value = None, missing values like pd.NA or np.nan are kept.
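The effect of a default value can be pictured with plain Python. This is a minimal sketch and not LaminDB's implementation: `is_missing` and `fill_missing` are hypothetical helpers, and treating only `None` and float NaN as missing is a simplifying assumption.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing (a simplification of pd.NA / np.nan)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def fill_missing(values, default_value=None):
    """Hypothetical helper: replace missing entries unless default_value is None."""
    if default_value is None:
        return list(values)  # no default set: missing values are kept as-is
    return [default_value if is_missing(v) else v for v in values]


column = ["asthma", None, float("nan")]
filled = fill_missing(column, default_value="unknown")
kept = fill_missing(column)
print(filled)
```

With a default set, both missing entries are overwritten; without one, the column is returned unchanged.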
- property nullable: bool¶
Indicates whether the feature can have nullable values (default True).
Example:
import lamindb as ln
import pandas as pd

disease = ln.Feature(name="disease", dtype=ln.ULabel, nullable=False).save()
schema = ln.Schema(features=[disease]).save()
dataset = {"disease": pd.Categorical([pd.NA, "asthma"])}
df = pd.DataFrame(dataset)
curator = ln.curators.DataFrameCurator(df, schema)
try:
    curator.validate()
except ln.errors.ValidationError as e:
    assert str(e).startswith("non-nullable series 'disease' contains null values")
Simple fields¶
- uid: str¶
Universal id, valid across DB instances.
- name: str¶
Name of feature (hard unique constraint unique=True).
- dtype: FeatureDtype | None¶
Data type (FeatureDtype).
For categorical types, can define from which registry values are sampled, e.g., 'cat[ULabel]' or 'cat[bionty.CellType]'. Unions are also allowed if the feature samples from two registries, e.g., 'cat[ULabel|bionty.CellType]'.
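The string form of a categorical dtype follows a simple pattern: registry names joined by `|` inside `cat[...]`. The sketch below only illustrates that pattern; `cat_dtype` and `parse_cat_dtype` are hypothetical helpers and not part of LaminDB.

```python
def cat_dtype(*registry_names):
    """Hypothetical helper: build a categorical dtype string from registry names."""
    if not registry_names:
        raise ValueError("at least one registry name is required")
    return f"cat[{'|'.join(registry_names)}]"


def parse_cat_dtype(dtype):
    """Hypothetical helper: recover the registry names from a categorical dtype string."""
    if not (dtype.startswith("cat[") and dtype.endswith("]")):
        raise ValueError(f"not a categorical dtype: {dtype!r}")
    return dtype[len("cat["):-1].split("|")


single = cat_dtype("ULabel")
union = cat_dtype("ULabel", "bionty.CellType")
print(single, union, parse_cat_dtype(union))
```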
- is_type: bool¶
Distinguish types from instances of the type.
- unit: str | None¶
Unit of measure, ideally SI (m, s, kg, etc.) or ‘normalized’ etc. (optional).
- description: str | None¶
A description.
- array_rank: int¶
Rank of feature.
Number of indices of the array: 0 for scalar, 1 for vector, 2 for matrix.
Is called .ndim in numpy and pytorch but shouldn’t be confused with the dimension of the feature space.
- array_size: int¶
Number of elements of the feature.
Total number of elements (product of shape components) of the array.
A number or string (a scalar): 1
A 50-dimensional embedding: 50
A 25 x 25 image: 625
- array_shape: list[int] | None¶
Shape of the feature.
A number or string (a scalar): [1]
A 50-dimensional embedding: [50]
A 25 x 25 image: [25, 25]
Is stored as a list rather than a tuple because it’s serialized as JSON.
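The relation between array_shape and array_size, and the reason for storing the shape as a list, can be checked with the standard library. This is an illustrative sketch using the three examples listed above; the variable names are assumptions.

```python
import json
import math

# Shapes as listed for array_shape above.
examples = {
    "scalar": [1],
    "embedding": [50],
    "image": [25, 25],
}

# array_size is the product of the shape components.
sizes = {name: math.prod(shape) for name, shape in examples.items()}

# JSON has no tuple type: a tuple is written as a JSON array and read back as a list.
round_tripped = json.loads(json.dumps((25, 25)))
print(sizes, round_tripped)
```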
- proxy_dtype: FeatureDtype | None¶
Proxy data type.
If the feature is an image it’s often stored via a path to the image file. Hence, while the dtype might be image with a certain shape, the proxy dtype would be str.
- synonyms: str | None¶
Bar-separated (|) synonyms (optional).
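A bar-separated synonyms string can be handled with ordinary string operations. The sketch below assumes nothing about how LaminDB stores or orders synonyms internally; `split_synonyms` and `append_synonym` are hypothetical helpers.

```python
def split_synonyms(synonyms):
    """Split a bar-separated synonyms string; None or '' yields an empty list."""
    return synonyms.split("|") if synonyms else []


def append_synonym(synonyms, new):
    """Hypothetical helper: append a synonym if it is not already present."""
    if "|" in new:
        raise ValueError("a synonym must not contain the separator '|'")
    current = split_synonyms(synonyms)
    if new not in current:
        current.append(new)
    return "|".join(current)


updated = append_synonym("T-cell|T lymphocyte", "T cells")
unchanged = append_synonym(updated, "T-cell")
print(updated)
```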
- created_at: datetime¶
Time of creation of record.
- updated_at: datetime¶
Time of last update to record.
Relational fields¶
- type: Feature | None¶
Type of feature (e.g., ‘Readout’, ‘Metric’, ‘Metadata’, ‘ExpertAnnotation’, ‘ModelPrediction’).
Allows grouping features by type, e.g., all readouts, all metrics, etc.
- values: FeatureValue¶
Values for this feature.
- projects¶
Accessor to the related objects manager on the forward and reverse sides of a many-to-many relation.
In the example:
class Pizza(Model):
    toppings = ManyToManyField(Topping, related_name='pizzas')
Pizza.toppings and Topping.pizzas are ManyToManyDescriptor instances.
Most of the implementation is delegated to a dynamically defined manager class built by create_forward_many_to_many_manager() defined below.
Class methods¶
- classmethod df(include=None, features=False, limit=100)¶
Convert to pd.DataFrame.
By default, shows all direct fields, except updated_at.
Use arguments include or features to include other data.
- Parameters:
include (str | list[str] | None, default: None) – Related fields to include as columns. Takes strings of form "ulabels__name", "cell_types__name", etc. or a list of such strings.
features (bool | list[str], default: False) – If True, map all features of the Feature registry onto the resulting DataFrame. Only available for Artifact.
limit (int, default: 100) – Maximum number of rows to display from a Pandas DataFrame. Defaults to 100 to reduce database load.
- Return type:
DataFrame
Examples
Include the name of the creator in the DataFrame:
>>> ln.ULabel.df(include="created_by__name")
Include display of features for Artifact:
>>> df = ln.Artifact.df(features=True)
>>> ln.view(df)  # visualize with type annotations
Only include select features:
>>> df = ln.Artifact.df(features=["cell_type_by_expert", "cell_type_by_model"])
- classmethod filter(*queries, **expressions)¶
Query records.
- Parameters:
queries – One or multiple Q objects.
expressions – Fields and values passed as Django query expressions.
- Return type:
- Returns:
A QuerySet.
See also
Guide: Query & search registries
Django documentation: Queries
Examples
>>> ln.ULabel(name="my label").save()
>>> ln.ULabel.filter(name__startswith="my").df()
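Django query expressions encode a field name and a lookup in the keyword itself, separated by a double underscore. The sketch below imitates that convention on plain dictionaries to show how an expression such as name__startswith="my" reads. It is not Django's or LaminDB's implementation: `LOOKUPS` and `matches` are hypothetical, only three lookups are covered, and relation traversal (e.g. created_by__name) is deliberately ignored.

```python
# A few lookup names and their meaning, expressed as predicates.
LOOKUPS = {
    "exact": lambda value, target: value == target,
    "startswith": lambda value, target: str(value).startswith(target),
    "contains": lambda value, target: target in str(value),
}


def matches(record, **expressions):
    """Return True if the record satisfies every field__lookup=value expression."""
    for key, target in expressions.items():
        field, _, lookup = key.partition("__")
        # A bare field name means an exact match.
        if not LOOKUPS[lookup or "exact"](record[field], target):
            return False
    return True


records = [{"name": "my label"}, {"name": "other"}, {"name": "my project"}]
hits = [r["name"] for r in records if matches(r, name__startswith="my")]
print(hits)
```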
- classmethod from_df(df, field=None)¶
Create Feature records for columns.
- Return type:
- classmethod from_values(values, field=None, create=False, organism=None, source=None, mute=False)¶
Bulk create validated records by parsing values for an identifier such as a name or an id.
- Parameters:
values (list[str] | Series | array) – A list of values for an identifier, e.g. ["name1", "name2"].
field (str | DeferredAttribute | None, default: None) – A Record field to look up, e.g., bt.CellMarker.name.
create (bool, default: False) – Whether to create records if they don’t exist.
organism (Record | str | None, default: None) – A bionty.Organism name or record.
source (Record | None, default: None) – A bionty.Source record to validate against to create records for.
mute (bool, default: False) – Whether to mute logging.
- Return type:
- Returns:
A list of validated records. For bionty registries, also returns knowledge-coupled records.
Notes
For more info, see tutorial: Manage biological registries.
Examples
Bulk create from non-validated values logs warnings and returns an empty list:
>>> ulabels = ln.ULabel.from_values(["benchmark", "prediction", "test"], field="name")
>>> assert len(ulabels) == 0
Bulk create records from validated values returns the corresponding existing records:
>>> ln.save([ln.ULabel(name=name) for name in ["benchmark", "prediction", "test"]])
>>> ulabels = ln.ULabel.from_values(["benchmark", "prediction", "test"], field="name")
>>> assert len(ulabels) == 3
Bulk create records from public reference:
>>> import bionty as bt
>>> records = bt.CellType.from_values(["T cell", "B cell"], field="name")
>>> records
- classmethod get(idlike=None, **expressions)¶
Get a single record.
- Parameters:
idlike (int | str | None, default: None) – Either a uid stub, uid or an integer id.
expressions – Fields and values passed as Django query expressions.
- Return type:
- Returns:
A record.
- Raises:
lamindb.errors.DoesNotExist – In case no matching record is found.
See also
Guide: Query & search registries
Django documentation: Queries
Examples
>>> ulabel = ln.ULabel.get("FvtpPJLJ")
>>> ulabel = ln.ULabel.get(name="my-label")
- classmethod inspect(values, field=None, *, mute=False, organism=None, source=None, strict_source=False)¶
Inspect if values are mappable to a field.
Being mappable means that an exact match exists.
- Parameters:
values (list[str] | Series | array) – Values that will be checked against the field.
field (str | DeferredAttribute | None, default: None) – The field of values. Examples are 'ontology_id' to map against the source ID or 'name' to map against the ontologies field names.
mute (bool, default: False) – Whether to mute logging.
organism (str | Record | None, default: None) – An Organism name or record.
source (Record | None, default: None) – A bionty.Source record that specifies the version to inspect against.
strict_source (bool, default: False) – Determines the validation behavior against records in the registry.
- If False, validation will include all records in the registry, ignoring the specified source.
- If True, validation will only include records in the registry that are linked to the specified source.
Note: this parameter won’t affect validation against bionty/public sources.
- Return type:
Examples
>>> import bionty as bt
>>> bt.settings.organism = "human"
>>> ln.save(bt.Gene.from_values(["A1CF", "A1BG", "BRCA2"], field="symbol"))
>>> gene_symbols = ["A1CF", "A1BG", "FANCD1", "FANCD20"]
>>> result = bt.Gene.inspect(gene_symbols, field=bt.Gene.symbol)
>>> result.validated
['A1CF', 'A1BG']
>>> result.non_validated
['FANCD1', 'FANCD20']
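The exact-match rule behind inspect can be illustrated without a database. The sketch below is an assumption-laden stand-in, not the library's code: `InspectSketch` and `inspect_exact` are hypothetical names, and the known symbols are passed in as a plain list instead of being read from a registry.

```python
from typing import NamedTuple


class InspectSketch(NamedTuple):
    """Hypothetical result container with the two lists shown in the example above."""
    validated: list
    non_validated: list


def inspect_exact(values, known):
    """Split values into exact matches and everything else, preserving input order."""
    known = set(known)
    return InspectSketch(
        validated=[v for v in values if v in known],
        non_validated=[v for v in values if v not in known],
    )


result = inspect_exact(
    ["A1CF", "A1BG", "FANCD1", "FANCD20"],
    known=["A1CF", "A1BG", "BRCA2"],
)
print(result.validated, result.non_validated)
```

Note that "FANCD1" ends up in non_validated here because only exact matches count; mapping synonyms is the job of standardize.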
- classmethod lookup(field=None, return_field=None)¶
Return an auto-complete object for a field.
- Parameters:
field (str | DeferredAttribute | None, default: None) – The field to look up the values for. Defaults to first string field.
return_field (str | DeferredAttribute | None, default: None) – The field to return. If None, returns the whole record.
- Return type:
NamedTuple
- Returns:
A NamedTuple of lookup information of the field values with a dictionary converter.
Examples
>>> import bionty as bt
>>> bt.settings.organism = "human"
>>> bt.Gene.from_source(symbol="ADGB-DT").save()
>>> lookup = bt.Gene.lookup()
>>> lookup.adgb_dt
>>> lookup_dict = lookup.dict()
>>> lookup_dict['ADGB-DT']
>>> lookup_by_ensembl_id = bt.Gene.lookup(field="ensembl_gene_id")
>>> lookup_by_ensembl_id.ensg00000002745
>>> lookup_return_symbols = bt.Gene.lookup(field="ensembl_gene_id", return_field="symbol")
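The attribute names in the example above (adgb_dt for "ADGB-DT") suggest that values are lower-cased and non-word characters become underscores. The sketch below reproduces that idea with a namedtuple. The normalization rule is an assumption inferred from the example, and `make_lookup` is a hypothetical helper, not LaminDB's implementation.

```python
import re
from collections import namedtuple


def make_lookup(values):
    """Hypothetical helper: expose values as attributes of a namedtuple.

    Assumes the normalized keys are unique and do not start with a digit;
    otherwise namedtuple raises a ValueError.
    """
    keys = [re.sub(r"\W+", "_", v.lower()).strip("_") for v in values]
    Lookup = namedtuple("Lookup", keys)
    return Lookup(*values)


lookup = make_lookup(["ADGB-DT", "BRCA2"])
print(lookup.adgb_dt, lookup.brca2)
```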
- classmethod search(string, *, field=None, limit=20, case_sensitive=False)¶
Search.
- Parameters:
string (str) – The input string to match against the field ontology values.
field (str | DeferredAttribute | None, default: None) – The field or fields to search. Search all string fields by default.
limit (int | None, default: 20) – Maximum amount of top results to return.
case_sensitive (bool, default: False) – Whether the match is case sensitive.
- Return type:
- Returns:
A sorted DataFrame of search results with a score in column score. If return_queryset is True, a QuerySet.
Examples
>>> ulabels = ln.ULabel.from_values(["ULabel1", "ULabel2", "ULabel3"], field="name")
>>> ln.save(ulabels)
>>> ln.ULabel.search("ULabel2")
- classmethod standardize(values, field=None, *, return_field=None, return_mapper=False, case_sensitive=False, mute=False, public_aware=True, keep='first', synonyms_field='synonyms', organism=None, source=None, strict_source=False)¶
Maps input synonyms to standardized names.
- Parameters:
values (Iterable) – Identifiers that will be standardized.
field (str | DeferredAttribute | None, default: None) – The field representing the standardized names.
return_field (str | DeferredAttribute | None, default: None) – The field to return. Defaults to field.
return_mapper (bool, default: False) – If True, returns {input_value: standardized_name}.
case_sensitive (bool, default: False) – Whether the mapping is case sensitive.
mute (bool, default: False) – Whether to mute logging.
public_aware (bool, default: True) – Whether to standardize from Bionty reference. Defaults to True for Bionty registries.
keep (Literal['first', 'last', False], default: 'first') – When a synonym maps to multiple names, determines which duplicates to mark as pd.DataFrame.duplicated:
- "first": returns the first mapped standardized name
- "last": returns the last mapped standardized name
- False: returns all mapped standardized names.
When keep is False, the returned list of standardized names will contain nested lists in case of duplicates.
When a field is converted into return_field, keep marks which matches to keep when multiple return_field values map to the same field value.
synonyms_field (str, default: 'synonyms') – A field containing the concatenated synonyms.
organism (str | Record | None, default: None) – An Organism name or record.
source (Record | None, default: None) – A bionty.Source record that specifies the version to validate against.
strict_source (bool, default: False) – Determines the validation behavior against records in the registry.
- If False, validation will include all records in the registry, ignoring the specified source.
- If True, validation will only include records in the registry that are linked to the specified source.
Note: this parameter won’t affect validation against bionty/public sources.
- Return type:
list[str] | dict[str, str]
- Returns:
If return_mapper is False, a list of standardized names. Otherwise, a dictionary of mapped values with mappable synonyms as keys and standardized names as values.
See also
add_synonym(): Add synonyms.
remove_synonym(): Remove synonyms.
Examples
>>> import bionty as bt
>>> bt.settings.organism = "human"
>>> ln.save(bt.Gene.from_values(["A1CF", "A1BG", "BRCA2"], field="symbol"))
>>> gene_synonyms = ["A1CF", "A1BG", "FANCD1", "FANCD20"]
>>> standardized_names = bt.Gene.standardize(gene_synonyms)
>>> standardized_names
['A1CF', 'A1BG', 'BRCA2', 'FANCD20']
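The mapping from synonyms to standardized names can be sketched with a dictionary. This is a toy model under stated assumptions, not the library's algorithm: `standardize_sketch` is a hypothetical function, the registry is a plain dict of name to bar-separated synonyms, the only synonym used is the one implied by the example above (FANCD1 for BRCA2), and only the keep="first" behavior is modeled.

```python
def standardize_sketch(values, synonyms_by_name, case_sensitive=False, return_mapper=False):
    """Map synonyms to standardized names; unmapped values are returned unchanged."""
    norm = (lambda s: s) if case_sensitive else str.lower
    index = {}
    for name, synonyms in synonyms_by_name.items():
        for synonym in (synonyms.split("|") if synonyms else []):
            index.setdefault(norm(synonym), name)  # first mapping wins
    for name in synonyms_by_name:
        index[norm(name)] = name  # a standardized name always maps to itself
    mapper = {
        v: index[norm(v)]
        for v in values
        if norm(v) in index and index[norm(v)] != v
    }
    if return_mapper:
        return mapper
    return [mapper.get(v, v) for v in values]


# Toy registry: standardized name -> bar-separated synonyms (or None).
registry = {"A1CF": None, "A1BG": None, "BRCA2": "FANCD1"}
gene_synonyms = ["A1CF", "A1BG", "FANCD1", "FANCD20"]
standardized = standardize_sketch(gene_synonyms, registry)
mapper = standardize_sketch(gene_synonyms, registry, return_mapper=True)
print(standardized, mapper)
```

As in the example above, a value with no known mapping ("FANCD20") passes through unchanged, and the mapper only contains the values that were actually rewritten.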
- classmethod using(instance)¶
Use a non-default LaminDB instance.
- Parameters:
instance (str | None) – An instance identifier of form “account_handle/instance_name”.
- Return type:
Examples
>>> ln.ULabel.using("account_handle/instance_name").search("ULabel7", field="name")
              uid  score
name
ULabel7  g7Hk9b2v  100.0
ULabel5  t4Jm6s0q   75.0
ULabel6  r2Xw8p1z   75.0
- classmethod validate(values, field=None, *, mute=False, organism=None, source=None, strict_source=False)¶
Validate values against existing values of a string field.
Note that this is strict validation: it only asserts exact matches.
- Parameters:
values (list[str] | Series | array) – Values that will be validated against the field.
field (str | DeferredAttribute | None, default: None) – The field of values. Examples are 'ontology_id' to map against the source ID or 'name' to map against the ontologies field names.
mute (bool, default: False) – Whether to mute logging.
organism (str | Record | None, default: None) – An Organism name or record.
source (Record | None, default: None) – A bionty.Source record that specifies the version to validate against.
strict_source (bool, default: False) – Determines the validation behavior against records in the registry.
- If False, validation will include all records in the registry, ignoring the specified source.
- If True, validation will only include records in the registry that are linked to the specified source.
Note: this parameter won’t affect validation against bionty/public sources.
- Return type:
ndarray
- Returns:
A vector of booleans indicating if an element is validated.
Examples
>>> import bionty as bt
>>> bt.settings.organism = "human"
>>> ln.save(bt.Gene.from_values(["A1CF", "A1BG", "BRCA2"], field="symbol"))
>>> gene_symbols = ["A1CF", "A1BG", "FANCD1", "FANCD20"]
>>> bt.Gene.validate(gene_symbols, field=bt.Gene.symbol)
array([ True,  True, False, False])
Methods¶
- add_synonym(synonym, force=False, save=None)¶
Add synonyms to a record.
- Parameters:
synonym (str | list[str] | Series | array) – The synonyms to add to the record.
force (bool, default: False) – Whether to add synonyms even if they are already synonyms of other records.
save (bool | None, default: None) – Whether to save the record to the database.
See also
remove_synonym(): Remove synonyms.
Examples
>>> import bionty as bt
>>> bt.CellType.from_source(name="T cell").save()
>>> lookup = bt.CellType.lookup()
>>> record = lookup.t_cell
>>> record.synonyms
'T-cell|T lymphocyte|T-lymphocyte'
>>> record.add_synonym("T cells")
>>> record.synonyms
'T cells|T-cell|T-lymphocyte|T lymphocyte'
- delete()¶
Delete.
- Return type:
None
- remove_synonym(synonym)¶
Remove synonyms from a record.
- Parameters:
synonym (str | list[str] | Series | array) – The synonym values to remove.
See also
add_synonym(): Add synonyms.
Examples
>>> import bionty as bt
>>> bt.CellType.from_source(name="T cell").save()
>>> lookup = bt.CellType.lookup()
>>> record = lookup.t_cell
>>> record.synonyms
'T-cell|T lymphocyte|T-lymphocyte'
>>> record.remove_synonym("T-cell")
>>> record.synonyms
'T lymphocyte|T-lymphocyte'
- set_abbr(value)¶
Set value for abbr field and add to synonyms.
- Parameters:
value (str) – A value for an abbreviation.
Examples
>>> import bionty as bt
>>> bt.ExperimentalFactor.from_source(name="single-cell RNA sequencing").save()
>>> scrna = bt.ExperimentalFactor.get(name="single-cell RNA sequencing")
>>> scrna.abbr
None
>>> scrna.synonyms
'single-cell RNA-seq|single-cell transcriptome sequencing|scRNA-seq|single cell RNA sequencing'
>>> scrna.set_abbr("scRNA")
>>> scrna.abbr
'scRNA'
>>> scrna.synonyms
'scRNA|single-cell RNA-seq|single cell RNA sequencing|single-cell transcriptome sequencing|scRNA-seq'
>>> scrna.save()