sakura.dataset.rna_count_dask.SCRNASeqCountDataDask

class sakura.dataset.rna_count_dask.SCRNASeqCountDataDask(gene_csv_path, pheno_csv_path, gene_meta_json_path=None, pheno_meta_json_path=None, gene_meta=None, pheno_meta=None, mode='all', verbose=False)

Bases: Dataset

Dask version of scRNA-seq count dataset class for SAKURA inputs

This class is suited to datasets with a very large number of cells.

Parameters:

gene_csv_path (str) – Path to the gene csv file
pheno_csv_path (str) – Path to the phenotype csv file
gene_meta_json_path (str, optional) – Path to the genotype meta JSON file
pheno_meta_json_path (str, optional) – Path to the phenotype meta JSON file
gene_meta (dict[str, Any], optional) – A configuration dictionary related to gene data processing
pheno_meta (dict[str, Any], optional) – A dictionary containing the definitions and configurations of phenotype data
mode (str, optional) – Data export option of the dataset: ‘all’, ‘key’, or any other value (see Modes); defaults to ‘all’
verbose (bool, optional) – Whether to enable verbose console logging, defaults to False

Expected inputs: Unlike rna_count, which directly accepts Seurat-compatible datasheets (i.e. rows are genes, columns are cells), this class expects the transposed layout, with cells as rows and genes as columns.

gene_csv:

Assuming rows are cells (or samples), columns are genes
rownames are sample identifiers (cell names)
colnames are gene identifiers (gene names, or Ensembl IDs)
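
A table exported in the Seurat-style orientation (rows are genes, columns are cells) would need to be transposed before it matches the layout above. The following is a minimal sketch using only the Python standard library; the helper name transpose_csv and the tiny sample table are hypothetical, and the whole table is held in memory, so for very large files a chunked or Dask-based approach would likely be preferable.

```python
import csv
import io


def transpose_csv(text: str) -> str:
    """Swap the rows and columns of a CSV table given as text."""
    rows = list(csv.reader(io.StringIO(text)))
    # zip(*rows) assumes every row has the same number of fields.
    transposed = zip(*rows)
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(transposed)
    return out.getvalue()


# Hypothetical genes-by-cells table (rows: genes, columns: cells).
seurat_style = "gene,cell_1,cell_2\nGeneA,3,0\nGeneB,1,5\n"

# After transposition, rows are cells and columns are genes.
cells_by_genes = transpose_csv(seurat_style)
print(cells_by_genes)
```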

genotype_meta_csv:

A JSON file related to gene data processing
pre_procedure: transformations applied when the dataset is loaded
post_procedure: transformations applied when the requested samples are exported

phenotype_csv:

Assuming rows are cells (or samples), columns are metadata features
rownames are sample identifiers (cell names)

phenotype_meta_csv:

A JSON file that defines the Type, Range, and Order of phenotype columns, together with phenotype-related configurations for SAKURA model training
Storage entity is a dict
Type: ‘categorical’, ‘numeric’, ‘ordinal’ (tbd)
Range (for ‘categorical’): an ordered array of possible values
pre_procedure: transformations applied when the dataset is loaded
post_procedure: transformations applied when the requested samples are exported

Modes:

‘all’: export both raw and processed data, together with names/keys of cells
‘key’: export only names/keys of cells
otherwise: export only processed data
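
The three modes can be read as a simple selection over what each exported sample contains. The sketch below only illustrates that logic; export_sample, its arguments, and the dictionary keys are hypothetical and are not taken from the SAKURA source, whose actual return structure is not specified here.

```python
from typing import Any, Dict


def export_sample(mode: str, key: str, raw: Any, processed: Any) -> Dict[str, Any]:
    """Hypothetical helper mirroring the three documented export modes."""
    if mode == "all":
        # Raw and processed data, together with the cell name/key.
        return {"key": key, "raw": raw, "processed": processed}
    if mode == "key":
        # Only the cell name/key.
        return {"key": key}
    # Any other mode: only the processed data.
    return {"processed": processed}


print(export_sample("key", "cell_1", raw=[3, 0], processed=[1.0, 0.0]))
```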

Transformations:

ToTensor: convert input data into a PyTorch tensor; input type should be ‘gene’ or ‘pheno’
ToOneHot: transform categorical data to one-hot encoding; an order of classes should be specified, otherwise the sorted labels are used, on the assumption that the range of labels can be derived from the input data
ToOrdinal: convert categorical data into ordinal (integer) encoding; each unique category is assigned a unique integer value, which can be useful for models that require numerical input
ToKBins: transform continuous data into k bins; quantile-based binning is applied to convert continuous features into categorical features
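
To make the encodings concrete, here is a small pure-Python sketch of the ideas behind ToOrdinal, ToOneHot, and ToKBins. These functions are illustrative stand-ins, not the SAKURA implementations: the names to_ordinal, to_one_hot, and to_k_bins are hypothetical, and the quantile cut points come from the standard library's statistics.quantiles, which may differ from the binning SAKURA actually uses.

```python
from bisect import bisect_right
from statistics import quantiles
from typing import List, Optional, Sequence


def to_ordinal(labels: Sequence[str], order: Optional[Sequence[str]] = None) -> List[int]:
    """Map each category to an integer; fall back to sorted labels if no order is given."""
    classes = list(order) if order is not None else sorted(set(labels))
    index = {c: i for i, c in enumerate(classes)}
    return [index[label] for label in labels]


def to_one_hot(labels: Sequence[str], order: Optional[Sequence[str]] = None) -> List[List[int]]:
    """One-hot encode categories using the given class order (or sorted labels)."""
    classes = list(order) if order is not None else sorted(set(labels))
    return [[1 if label == c else 0 for c in classes] for label in labels]


def to_k_bins(values: Sequence[float], k: int) -> List[int]:
    """Assign each value to one of k quantile-based bins (indices 0 .. k-1)."""
    cuts = quantiles(values, n=k)  # returns k - 1 cut points
    return [bisect_right(cuts, v) for v in values]


print(to_ordinal(["b", "a", "c"]))
print(to_one_hot(["b", "a", "b"], order=["b", "a"]))
print(to_k_bins([1, 2, 3, 4, 5, 6, 7, 8], k=4))
```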

Note

For more details of the transformations, see utils.data_transformations().

<gene_meta> example:

{
    "all": {
        "gene_list": "*",
        "pre_procedure": [],
        "post_procedure": [{
            "type": "ToTensor"
        }]
    }
}
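
As a sketch of how such a configuration might be prepared, the snippet below builds the same structure as a Python dictionary and writes it to a JSON file. The file name gene_meta.json and the temporary directory are assumptions made for illustration only.

```python
import json
import os
import tempfile

# Same structure as the <gene_meta> example above.
gene_meta = {
    "all": {
        "gene_list": "*",
        "pre_procedure": [],
        "post_procedure": [{"type": "ToTensor"}],
    }
}

# Hypothetical output location; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), "gene_meta.json")
with open(path, "w", encoding="utf-8") as fh:
    json.dump(gene_meta, fh, indent=4)

# Read the file back to confirm it round-trips unchanged.
with open(path, encoding="utf-8") as fh:
    reloaded = json.load(fh)
print(reloaded == gene_meta)
```

Going by the class signature, the resulting path would be a candidate value for gene_meta_json_path, while the dictionary itself could be passed as gene_meta.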

<pheno_meta>: For more details of the JSON structure, see utils.data_transformations().

<na_filter>: For phenotype data without any NA values, passing <na_filter>=False can improve the performance of reading a large file.

Methods