sakura.dataset.rna_count_dask.SCRNASeqCountDataDask

class sakura.dataset.rna_count_dask.SCRNASeqCountDataDask(gene_csv_path, pheno_csv_path, gene_meta_json_path=None, pheno_meta_json_path=None, gene_meta=None, pheno_meta=None, mode='all', verbose=False)

Bases: Dataset

Dask version of scRNA-seq count dataset class for SAKURA inputs

This class is suited to datasets with a very large number of cells.

Parameters:
  • gene_csv_path (str) – Path to the gene csv file

  • pheno_csv_path (str) – Path to the phenotype csv file

  • gene_meta_json_path (str, optional) – Path to the genotype meta JSON file

  • pheno_meta_json_path (str, optional) – Path to the phenotype meta JSON file

  • gene_meta (dict[str, Any], optional) – A configuration dictionary for gene data processing

  • pheno_meta (dict[str, Any], optional) – A dictionary containing the definitions and configuration of phenotype data

  • mode (str, optional) – Data export option of the dataset: ‘all’, ‘key’, or any other value; defaults to ‘all’.

  • verbose (bool, optional) – Whether to enable verbose console logging, defaults to False

Expected inputs: Unlike rna_count, which directly accepts Seurat-compatible datasheets (i.e. rows are genes, columns are cells), this class expects rows to be cells and columns to be genes.

gene_csv:
  • Assuming rows are cells (or samples), columns are genes

  • rownames are sample identifiers (cell names)

  • colnames are gene identifiers (gene names, or Ensembl IDs)
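If a count table is stored in the Seurat-style orientation (genes as rows, cells as columns), it has to be transposed to match the layout described above. The following is a minimal sketch using only the standard library; the helper names and file paths are hypothetical and not part of SAKURA, and for very large matrices a chunked or Dask-based transpose would likely be more practical.

```python
import csv


def transpose_rows(rows):
    """Swap rows and columns of a rectangular table given as a list of lists."""
    # zip(*rows) pairs up the i-th entry of every row, i.e. it yields columns.
    # Note: zip silently truncates ragged rows to the shortest row.
    return [list(column) for column in zip(*rows)]


def transpose_csv(src_path, dst_path):
    """Read a genes-x-cells CSV and write it back as cells-x-genes."""
    with open(src_path, newline="") as src:
        rows = list(csv.reader(src))
    with open(dst_path, "w", newline="") as dst:
        csv.writer(dst).writerows(transpose_rows(rows))


# Tiny in-memory illustration: the header row holds cell names,
# the first column holds gene names.
seurat_style = [
    ["", "cell_1", "cell_2"],
    ["GeneA", "3", "0"],
    ["GeneB", "1", "5"],
]
cells_by_genes = transpose_rows(seurat_style)
```

After the transpose, each row starts with a cell name and the header row lists the gene identifiers.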

genotype_meta_csv:
  • A JSON file that configures gene data processing

  • pre_procedure: transformations performed when the dataset is loaded

  • post_procedure: transformations performed when requested samples are exported

phenotype_csv:
  • Assuming rows are cells (or samples), columns are metadata features

  • rownames are sample identifiers (cell names)

phenotype_meta_csv:
  • A JSON file that defines Type, Range, and Order for phenotype columns, along with phenotype-related configuration for SAKURA model training

  • Storage entity is a dict

  • Type: ‘categorical’, ‘numeric’, ‘ordinal’ (tbd)

  • Range (for ‘categorical’): ordered array of possible values

  • pre_procedure: transformations performed when the dataset is loaded

  • post_procedure: transformations performed when requested samples are exported

Modes:
  • ‘all’: export both raw and processed data, together with names/keys of cells

  • ‘key’: export only names/keys of cells

  • otherwise: export only processed data
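The three modes can be pictured as a simple dispatch on the mode string. The sketch below only illustrates the behaviour listed above; the function name and the dictionary keys are assumptions made for this example and do not reflect the actual structure returned by SCRNASeqCountDataDask.

```python
def export_item(mode, cell_key, raw, processed):
    """Hypothetical illustration of the export modes (key names are assumed)."""
    if mode == "all":
        # Raw and processed data, together with the cell name/key.
        return {"cell_key": cell_key, "raw": raw, "processed": processed}
    if mode == "key":
        # Only the cell name/key.
        return {"cell_key": cell_key}
    # Any other mode value: processed data only.
    return {"processed": processed}


item_all = export_item("all", "cell_1", [3, 0], [1.39, 0.0])
item_key = export_item("key", "cell_1", [3, 0], [1.39, 0.0])
item_other = export_item("processed_only", "cell_1", [3, 0], [1.39, 0.0])
```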

Transformations:
  • ToTensor: convert input data into a PyTorch tensor; input type should be ‘gene’ or ‘pheno’

  • ToOneHot: transform categorical data to one-hot encoding; the class order should be specified, otherwise sorted labels are used, with the label range derived from the input data

  • ToOrdinal: convert categorical data into ordinal (integer) encoding; each unique category is assigned a unique integer value, which can be useful for models that require numerical input

  • ToKBins: transform continuous data into k bins; quantile-based binning is applied to convert continuous features into categorical features

Note

For more details of the transformations, see utils.data_transformations().
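To make the encodings concrete, here is a small standard-library sketch of what ordinal encoding, one-hot encoding, and quantile-based binning do to a list of values. These functions are illustrative stand-ins written for this example, not SAKURA's implementations; in particular, the cut-point convention of `statistics.quantiles` is an assumption and may differ from the binning SAKURA applies. ToTensor is omitted because it requires PyTorch.

```python
import bisect
import statistics


def to_ordinal(labels, order=None):
    """Map each category to an integer; fall back to sorted labels if no order is given."""
    order = sorted(set(labels)) if order is None else list(order)
    index = {label: i for i, label in enumerate(order)}
    return [index[label] for label in labels]


def to_one_hot(labels, order=None):
    """Encode each category as a 0/1 vector with a single 1 at its ordinal position."""
    order = sorted(set(labels)) if order is None else list(order)
    codes = to_ordinal(labels, order)
    return [[1 if j == code else 0 for j in range(len(order))] for code in codes]


def to_k_bins(values, k):
    """Assign each value to one of k quantile-based bins, numbered 0 .. k-1."""
    cuts = statistics.quantiles(values, n=k)  # k - 1 cut points
    return [bisect.bisect_right(cuts, v) for v in values]


ordinal = to_ordinal(["b", "a", "b"])               # sorted labels: a -> 0, b -> 1
one_hot = to_one_hot(["b", "a"], order=["b", "a"])  # explicit class order
bins = to_k_bins([1, 2, 3, 4, 5, 6, 7, 8], k=4)
```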

<gene_meta> example:

{
    "all": {
        "gene_list": "*",
        "pre_procedure": [],
        "post_procedure": [{
            "type": "ToTensor"
        }]
    }
}
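The same configuration can be written as a Python dictionary and either passed through the gene_meta parameter or serialized to a JSON file for gene_meta_json_path. The sketch below assumes the two routes are interchangeable, which the text above does not state explicitly. The `build_dataset` helper and its arguments are hypothetical; only the constructor parameters come from the class signature.

```python
import json

# Python equivalent of the <gene_meta> example above.
gene_meta = {
    "all": {
        "gene_list": "*",
        "pre_procedure": [],
        "post_procedure": [{"type": "ToTensor"}],
    }
}

# Serialized form, suitable for writing to a gene meta JSON file.
gene_meta_json = json.dumps(gene_meta, indent=4)


def build_dataset(gene_csv_path, pheno_csv_path, pheno_meta_json_path):
    """Hypothetical helper: construct the dataset from user-supplied paths."""
    # Imported lazily so this snippet can be read and run without SAKURA installed.
    from sakura.dataset.rna_count_dask import SCRNASeqCountDataDask

    return SCRNASeqCountDataDask(
        gene_csv_path,
        pheno_csv_path,
        gene_meta=gene_meta,
        pheno_meta_json_path=pheno_meta_json_path,
        mode="all",
    )
```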

<pheno_meta>: For more details of the JSON structure, see utils.data_transformations().

<na_filter>: For phenotype data without any NA values, passing <na_filter>=False can improve the performance of reading a large file.

Methods