sakura.dataset.rna_count.SCRNASeqCountData

class sakura.dataset.rna_count.SCRNASeqCountData(gene_csv_path, pheno_csv_path, gene_meta_json_path=None, pheno_meta_json_path=None, pheno_df_dtype=None, pheno_df_na_filter=True, gene_meta=None, pheno_meta=None, mode='all', verbose=False)

Bases: Dataset

General scRNA-Seq count dataset class for SAKURA inputs

Parameters:

gene_csv_path (str) – Path to the gene csv file
pheno_csv_path (str) – Path to the phenotype csv file
gene_meta_json_path (str, optional) – Path to the gene meta JSON file
pheno_meta_json_path (str, optional) – Path to the phenotype meta JSON file
pheno_df_dtype (dtype or dict of {Hashable : dtype}, optional) – Pandas dtype applied to the phenotype data, either to the whole dataframe or to individual columns
pheno_df_na_filter (bool, optional) – Whether to detect missing value markers (empty strings and the value of na_values), defaults to True
gene_meta (dict, optional) – A configuration dictionary related to gene data processing
pheno_meta (dict[str, Any], optional) – A dictionary containing the definitions and configurations of phenotype data
mode (str, optional) – Data export mode of the dataset (‘all’, ‘key’, or any other value), defaults to ‘all’
verbose (bool, optional) – Whether to enable verbose console logging, defaults to False

Expected inputs:

gene_csv:

Assuming rows are genes, columns are samples (or cells)
rownames are gene identifiers (gene names, or Ensembl IDs)
colnames are sample identifiers (cell names)

gene_meta_json:

A JSON file related to gene data processing
pre_procedure: transformations applied when the dataset is loaded
post_procedure: transformations applied when the requested samples are exported

phenotype_csv:

Assuming rows are samples, columns are metadata contents
rownames are sample identifiers (cell names)
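
As a rough illustration of the gene_csv and phenotype_csv layouts described above, the sketch below writes two toy files with the standard library. All gene names, cell names, and phenotype columns are made up, and both the empty first header cell and the check that gene_csv columns match phenotype_csv rows are assumptions rather than documented requirements.

```python
import csv
import tempfile
from pathlib import Path

# Toy data: 3 genes x 2 cells (all names are invented for illustration).
cells = ["cell_1", "cell_2"]
counts = {"GeneA": [5, 0], "GeneB": [2, 7], "GeneC": [0, 1]}
pheno = {
    "cell_1": {"cell_type": "T", "age": 31},
    "cell_2": {"cell_type": "B", "age": 45},
}

out_dir = Path(tempfile.mkdtemp())
gene_csv_path = out_dir / "gene.csv"
pheno_csv_path = out_dir / "pheno.csv"

# gene_csv: one row per gene, one column per cell.
with open(gene_csv_path, "w", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow([""] + cells)        # header row: sample identifiers
    for gene, row in counts.items():
        writer.writerow([gene] + row)    # first field: gene identifier

# phenotype_csv: one row per cell, one column per metadata field.
with open(pheno_csv_path, "w", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow(["", "cell_type", "age"])
    for cell in cells:
        writer.writerow([cell, pheno[cell]["cell_type"], pheno[cell]["age"]])

# Assumed sanity check: gene_csv column names line up with phenotype_csv row names.
with open(gene_csv_path, newline="") as fh:
    gene_cols = next(csv.reader(fh))[1:]
with open(pheno_csv_path, newline="") as fh:
    pheno_rows = [row[0] for row in list(csv.reader(fh))[1:]]
```

The two resulting paths are the kind of values that gene_csv_path and pheno_csv_path are expected to point to.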

pheno_meta_json:

A JSON file that defines the Type, Range, and Order of phenotype columns, together with the phenotype configurations used for SAKURA model training
Storage entity is a dict
Type: ‘categorical’, ‘numeric’, ‘ordinal’ (tbd)
Range (for ‘categorical’): an ordered array of possible values
pre_procedure: transformations applied when the dataset is loaded
post_procedure: transformations applied when the requested samples are exported

Modes:

‘all’: export both raw and processed data, together with names/keys of cells
‘key’: export only names/keys of cells
otherwise: export only processed data
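
The three modes above can be pictured as a simple dispatch. The function below is a hypothetical sketch, not SAKURA's implementation: the name export_sketch and the field names "key", "raw", and "processed" are assumptions chosen only to mirror the wording of the list.

```python
def export_sketch(mode, key, raw, processed):
    """Hypothetical illustration of the three export modes (not SAKURA's code)."""
    if mode == "all":
        # Raw and processed data, together with the cell names/keys.
        return {"key": key, "raw": raw, "processed": processed}
    if mode == "key":
        # Only the cell names/keys.
        return {"key": key}
    # Any other value: only the processed data.
    return {"processed": processed}


batch = export_sketch("key", key=["cell_1"], raw=[[5, 2, 0]], processed=[[5.0, 2.0, 0.0]])
```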

Transformations:

ToTensor: convert input data into a PyTorch tensor; input type should be ‘gene’ or ‘pheno’
ToOneHot: transform categorical data to one-hot encoding; an order of classes should be specified, otherwise the sorted labels are used, on the assumption that the range of labels can be derived from the input data
ToOrdinal: convert categorical data into ordinal (integer) encoding; each unique category is assigned a unique integer value, which can be useful for models that require numerical input
ToKBins: transform continuous data into k bins; quantile-based binning is applied to convert continuous features into categorical features

Note

For more details of the transformations, see utils.data_transformations().
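
To make the encodings concrete, here is a minimal pure-Python sketch of what one-hot, ordinal, and quantile-based k-bin transformations compute. It is not the code behind ToOneHot, ToOrdinal, or ToKBins: the function names, the fallback to sorted labels for the ordinal case, and the choice of the "inclusive" quantile method are assumptions made for illustration.

```python
from bisect import bisect_right
from statistics import quantiles


def _class_index(labels, order=None):
    # Use the given class order, otherwise fall back to the sorted labels.
    order = sorted(set(labels)) if order is None else list(order)
    return order, {label: i for i, label in enumerate(order)}


def to_ordinal(labels, order=None):
    """Map each category to a unique integer code."""
    _, index = _class_index(labels, order)
    return [index[label] for label in labels]


def to_one_hot(labels, order=None):
    """Map each category to a 0/1 vector with a single 1."""
    order, index = _class_index(labels, order)
    return [[1 if index[label] == j else 0 for j in range(len(order))]
            for label in labels]


def to_k_bins(values, k):
    """Assign each value to one of k quantile-based bins (0 .. k-1)."""
    edges = quantiles(values, n=k, method="inclusive")  # k - 1 cut points
    return [bisect_right(edges, v) for v in values]


ordinal = to_ordinal(["b", "a", "b"])
one_hot = to_one_hot(["b", "a"], order=["a", "b", "c"])
bins = to_k_bins([1, 2, 3, 4, 5, 6, 7, 8], k=4)
```

Passing an explicit order to to_one_hot keeps the vector width fixed even when a class (here "c") is absent from the batch, which is the reason an order of classes is recommended.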

<gene_meta> example:

{
    "all": {
        "gene_list": "*",
        "pre_procedure": [],
        "post_procedure": [{
            "type": "ToTensor"
        }]
    }
}
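
The same configuration can be built as a Python dict or written to a JSON file. The sketch below does both with the standard library; the file name gene_meta.json is arbitrary, and treating gene_meta and gene_meta_json_path as two alternative ways to supply the same content is an assumption based on the constructor signature.

```python
import json
import tempfile
from pathlib import Path

# The <gene_meta> example from above, expressed as a Python dict.
gene_meta = {
    "all": {
        "gene_list": "*",
        "pre_procedure": [],
        "post_procedure": [{"type": "ToTensor"}],
    }
}

# Write it to a JSON file (arbitrary file name in a temporary directory).
gene_meta_json_path = Path(tempfile.mkdtemp()) / "gene_meta.json"
with open(gene_meta_json_path, "w") as fh:
    json.dump(gene_meta, fh, indent=4)

# Read it back to confirm the file round-trips to the same structure.
with open(gene_meta_json_path) as fh:
    reloaded = json.load(fh)

# Either form could then be handed to the constructor (not executed here,
# since it needs SAKURA installed and real CSV inputs):
# SCRNASeqCountData(gene_csv_path, pheno_csv_path, gene_meta=gene_meta)
# SCRNASeqCountData(gene_csv_path, pheno_csv_path,
#                   gene_meta_json_path=str(gene_meta_json_path))
```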

<pheno_meta>: For more details of the JSON structure, see utils.data_transformations().

<pheno_df_na_filter>: For phenotype data without any NA values, passing <pheno_df_na_filter>=False can improve the performance of reading a large file.
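
The wording of this option matches the na_filter argument of pandas.read_csv, so the sketch below assumes the flag is forwarded to that function; this forwarding is an assumption, not something stated above. It shows the effect on a tiny made-up phenotype table with one empty field.

```python
import io

import pandas as pd

# Made-up phenotype table; cell_2 has an empty cell_type field.
csv_text = "cell,cell_type,batch\ncell_1,T,b1\ncell_2,,b2\n"

# Default behaviour: empty fields are detected and stored as NaN.
with_filter = pd.read_csv(io.StringIO(csv_text), index_col=0, na_filter=True)

# na_filter=False: missing-value detection is skipped, so the empty field
# stays an empty string. Only safe when the data really has no NA values.
without_filter = pd.read_csv(io.StringIO(csv_text), index_col=0, na_filter=False)

detected_na = pd.isna(with_filter.loc["cell_2", "cell_type"])
kept_empty = without_filter.loc["cell_2", "cell_type"]
```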

Methods

`export_data`	Export a batch of data based on the specified items and expression data flags.
`expr_set_pre_slice`	Pre-slices the expression matrix based on gene metadata.