sakura.dataset.rna_count.SCRNASeqCountData
- class sakura.dataset.rna_count.SCRNASeqCountData(gene_csv_path, pheno_csv_path, gene_meta_json_path=None, pheno_meta_json_path=None, pheno_df_dtype=None, pheno_df_na_filter=True, gene_meta=None, pheno_meta=None, mode='all', verbose=False)
Bases: Dataset
General scRNA-Seq count dataset class for SAKURA inputs
- Parameters:
gene_csv_path (str) – Path to the gene csv file
pheno_csv_path (str) – Path to the phenotype csv file
gene_meta_json_path (str, optional) – Path to the gene meta JSON file
pheno_meta_json_path (str, optional) – Path to the phenotype meta JSON file
pheno_df_dtype (dtype or dict of {Hashable : dtype}, optional) – Pandas dtype applied to the phenotype data, either to the whole dataframe or to individual columns
pheno_df_na_filter (bool, optional) – Detect missing value markers (empty strings and the value of na_values), defaults to True
gene_meta (dict, optional) – A configuration dictionary related to gene data processing
pheno_meta (dict[str, Any], optional) – A dictionary containing the definitions and configurations of phenotype data
mode (str, optional) – Data export option of the dataset: ‘all’, ‘key’, or any other value; defaults to ‘all’.
verbose (bool, optional) – Whether to enable verbose console logging, defaults to False
Expected inputs:
- gene_csv:
Assuming rows are genes, columns are samples (or cells)
rownames are gene identifiers (gene names, or Ensembl IDs)
colnames are sample identifiers (cell names)
- genotype_meta_csv:
A JSON file that configures gene data processing
pre_procedure: transformations applied when the dataset is loaded
post_procedure: transformations applied when requested samples are exported
- phenotype_csv:
Assuming rows are samples, columns are metadata contents
rownames are sample identifiers (cell names)
- phenotype_meta_csv:
A JSON file that defines the Type, Range, and Order of phenotype columns, along with phenotype configurations for SAKURA model training
Storage entity is a dict
Type: ‘categorical’, ‘numeric’, ‘ordinal’ (tbd)
Range (for ‘categorical’): ordered array of possible values
pre_procedure: transformations applied when the dataset is loaded
post_procedure: transformations applied when requested samples are exported
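The two CSV layouts described above can be sketched with the standard library alone. The gene names, cell names, and phenotype columns below are made up for illustration. The check that both files refer to the same cells is an assumption about how the files are meant to line up, not something this page states.

```python
import csv
import io

# Hypothetical gene CSV: rows are genes, columns are cells.
gene_csv_text = (
    ",cell_1,cell_2\n"
    "GeneA,0,5\n"
    "GeneB,3,0\n"
    "GeneC,1,2\n"
)

# Hypothetical phenotype CSV: rows are cells, columns are metadata.
pheno_csv_text = (
    ",cell_type,age\n"
    "cell_1,T,31\n"
    "cell_2,B,45\n"
)


def read_table(text):
    """Split a CSV whose first column holds row names into names and values."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    col_names = header[1:]  # first header cell is the empty rowname slot
    row_names, rows = [], []
    for record in reader:
        row_names.append(record[0])
        rows.append(record[1:])
    return row_names, col_names, rows


genes, gene_cells, counts = read_table(gene_csv_text)
pheno_cells, pheno_columns, pheno_rows = read_table(pheno_csv_text)

# Assumption: column names of the gene CSV match row names of the phenotype CSV.
print(gene_cells == pheno_cells)
```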
- Modes:
‘all’: export both raw and processed data, together with names/keys of cells
‘key’: export only names/keys of cells
otherwise: export only processed data
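As a rough illustration of the three modes, the sketch below dispatches on the mode string. The function name export_item and the dictionary keys "key", "raw", and "processed" are hypothetical; the actual structure returned by the class is not documented on this page.

```python
def export_item(cell_key, raw, processed, mode="all"):
    """Hypothetical helper that mirrors the three documented modes."""
    if mode == "all":
        # Raw and processed data together with the cell key.
        return {"key": cell_key, "raw": raw, "processed": processed}
    if mode == "key":
        # Only the cell key.
        return {"key": cell_key}
    # Any other mode value: processed data only.
    return {"processed": processed}


print(sorted(export_item("cell_1", [0, 5], [0.0, 1.0], mode="all")))
print(sorted(export_item("cell_1", [0, 5], [0.0, 1.0], mode="key")))
print(sorted(export_item("cell_1", [0, 5], [0.0, 1.0], mode="train")))
```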
- Transformations:
ToTensor: convert input data into a PyTorch tensor; input type should be ‘gene’ or ‘pheno’
ToOneHot: transform categorical data to one-hot encoding; an order of classes should be specified, otherwise sorted labels are used, with the range of labels derived from the input data
ToOrdinal: convert categorical data into ordinal (integer) encoding; each unique category is assigned a unique integer value, which can be useful for models that require numerical input
ToKBins: transform continuous data into k bins; quantile-based binning is applied to convert continuous features into categorical features
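The encodings named above can be sketched in plain Python to show what each one produces. These are standalone illustrations, not the SAKURA implementations: the function names, the optional order argument of to_ordinal, and the use of statistics.quantiles (Python 3.8+) for the cut points are all assumptions.

```python
from statistics import quantiles


def to_one_hot(labels, order=None):
    """One-hot encode labels; fall back to sorted unique labels if no order is given."""
    classes = list(order) if order is not None else sorted(set(labels))
    index = {c: i for i, c in enumerate(classes)}
    return [[1 if index[label] == i else 0 for i in range(len(classes))]
            for label in labels]


def to_ordinal(labels, order=None):
    """Map each unique category to a unique integer."""
    classes = list(order) if order is not None else sorted(set(labels))
    index = {c: i for i, c in enumerate(classes)}
    return [index[label] for label in labels]


def to_k_bins(values, k):
    """Quantile-based binning: assign each value a bin index in 0..k-1."""
    edges = quantiles(values, n=k)  # k - 1 cut points
    # The bin index is the number of cut points the value exceeds.
    return [sum(v > e for e in edges) for v in values]


print(to_one_hot(["B", "A", "B"]))
print(to_ordinal(["low", "high", "mid"], order=["low", "mid", "high"]))
print(to_k_bins([1, 2, 3, 4, 5, 6, 7, 8], k=4))
```

Passing an explicit order keeps the encoding stable across datasets that happen to contain different subsets of labels, which is presumably why the documentation recommends specifying one for ToOneHot.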
Note
For more details of the transformations, see utils.data_transformations().
<gene_meta> example:
{ "all": { "gene_list": "*", "pre_procedure": [], "post_procedure": [{ "type": "ToTensor" }] } }
<pheno_meta>: For more details of the JSON structure, see utils.data_transformations().
<na_filter>: For phenotype data without any NA values, passing <na_filter>=False can improve the performance of reading a large file.
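A minimal sketch of how the gene_meta example might be used: it parses the JSON shown in the note and passes the resulting dictionary to the constructor. Only keyword arguments from the documented signature appear; the helper name build_dataset and the decision to pass pheno_df_na_filter=False are illustrative assumptions, and the import requires SAKURA to be installed.

```python
import json

# The gene_meta example from the note, parsed into a dictionary.
gene_meta = json.loads("""
{ "all": { "gene_list": "*", "pre_procedure": [], "post_procedure": [{ "type": "ToTensor" }] } }
""")

# Inspect each top-level entry and the transformation types it lists.
for name, config in gene_meta.items():
    steps = [step["type"] for step in config["post_procedure"]]
    print(name, config["gene_list"], steps)


def build_dataset(gene_csv_path, pheno_csv_path, pheno_meta):
    """Hypothetical helper: construct the dataset from in-memory metadata."""
    # Imported lazily so the JSON part of this sketch runs without SAKURA.
    from sakura.dataset.rna_count import SCRNASeqCountData

    return SCRNASeqCountData(
        gene_csv_path=gene_csv_path,
        pheno_csv_path=pheno_csv_path,
        gene_meta=gene_meta,
        pheno_meta=pheno_meta,
        pheno_df_na_filter=False,  # only sensible if the phenotype CSV has no NA values
        mode="all",
    )
```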
Methods
Export a batch of data based on the specified items and expression data flags.
Pre-slices the expression matrix based on gene metadata.