sakura.dataset.rna_count_sparse.SCRNASeqCountDataSparse

class sakura.dataset.rna_count_sparse.SCRNASeqCountDataSparse(gene_MM_path, gene_name_csv_path, cell_name_csv_path, pheno_csv_path, pheno_df_dtype=None, pheno_df_na_filter=True, gene_meta_json_path=None, pheno_meta_json_path=None, gene_meta=None, pheno_meta=None, mode='all', verbose=False)

Bases: Dataset

Sparse version of scRNA-Seq count dataset class for SAKURA inputs

Accepts a Matrix Market (MM) file (for example, exported from a dgCMatrix in R) as the underlying data. The entire matrix is still loaded into memory, but it is stored as a sparse matrix.
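As a rough illustration of what loading an MM file into a sparse matrix looks like, the sketch below writes a toy count matrix with SciPy and reads it back. The file name and toy data are made up for the example, and this is not necessarily how the class reads its input internally.

```python
import os
import tempfile

import numpy as np
from scipy import io as sio
from scipy import sparse

# Toy 3 x 4 count matrix; most entries are zero, as in typical scRNA-Seq counts.
dense = np.array([[0, 2, 0, 0],
                  [1, 0, 0, 3],
                  [0, 0, 0, 5]])

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file name, used only for this example.
    mtx_path = os.path.join(tmp, "toy_counts.mtx")
    sio.mmwrite(mtx_path, sparse.coo_matrix(dense))

    # mmread returns a COO sparse matrix for coordinate-format files;
    # converting to CSR makes row slicing efficient.
    counts = sparse.csr_matrix(sio.mmread(mtx_path))

# Only the non-zero entries are stored.
print(counts.shape, counts.nnz)
```

The matrix stays in memory as a whole, but only its non-zero entries are kept, which is the trade-off the description above refers to.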

Parameters:
  • gene_MM_path (str) – Path to the gene MM file

  • gene_name_csv_path (str) – Path to the gene name csv file

  • cell_name_csv_path (str) – Path to the cell name csv file

  • pheno_csv_path (str) – Path to the phenotype csv file

  • pheno_df_dtype (dtype or dict of {Hashable dtype}, optional) – Pandas dtype applied to phenotype data, either the whole dataframe or individual columns

  • pheno_df_na_filter (bool, optional) – Whether to detect missing value markers (empty strings and the value of na_values), defaults to True

  • gene_meta_json_path (str, optional) – Path to the genotype meta JSON file

  • pheno_meta_json_path (str, optional) – Path to the phenotype meta JSON file

  • gene_meta (dict[str, Any], optional) – A configuration dictionary related to gene data processing

  • pheno_meta (dict[str, Any], optional) – A dictionary containing the definitions and configurations of phenotype data

  • mode (str, optional) – Data export option of the dataset: ‘all’, ‘key’, or any other value; defaults to ‘all’

  • verbose (bool, optional) – Whether to enable verbose console logging, defaults to False

Expected inputs:

gene_MM: gene expression matrix in Matrix Market (.mtx) format

gene_name_csv: gene identifiers (gene names, or Ensembl IDs) of the gene expression matrix

cell_name_csv: cell names (or sample identifiers) of the gene expression matrix

genotype_meta_json:
  • A JSON file related to gene data processing

  • pre_procedure: transformations performed when the dataset is loaded

  • post_procedure: transformations performed when the requested samples are exported

phenotype_csv:
  • Rows are assumed to be samples, and columns are metadata fields

  • Row names are sample identifiers (cell names)

phenotype_meta_json:
  • A JSON file that defines the Type, Range, and Order of phenotype columns, together with phenotype configurations for SAKURA model training

  • Storage entity is a dict

  • Type: ‘categorical’, ‘numeric’, ‘ordinal’ (tbd)

  • Range for ‘categorical’ columns: an ordered array of possible values

  • pre_procedure: transformations performed when the dataset is loaded

  • post_procedure: transformations performed when the requested samples are exported

Modes:
  • ‘all’: export both raw and processed data, together with names/keys of cells

  • ‘key’: export only names/keys of cells

  • otherwise: export only processed data
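The three modes can be pictured as a simple dispatch on the mode string. The function below is only a toy sketch: the function name, argument names, and dictionary keys are hypothetical and do not reflect the actual return structure of the dataset.

```python
def export_by_mode(mode, cell_keys, raw, processed):
    """Toy illustration of the three export modes (not SAKURA's code)."""
    if mode == "all":
        # Raw and processed data, plus the names/keys of the cells.
        return {"cell_key": cell_keys, "raw": raw, "processed": processed}
    if mode == "key":
        # Only the names/keys of the cells.
        return {"cell_key": cell_keys}
    # Any other value: processed data only.
    return {"processed": processed}


print(sorted(export_by_mode("all", ["c1"], [[0, 2]], [[0.0, 1.1]])))
print(sorted(export_by_mode("key", ["c1"], [[0, 2]], [[0.0, 1.1]])))
print(sorted(export_by_mode("train", ["c1"], [[0, 2]], [[0.0, 1.1]])))
```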

Transformations:
  • ToTensor: convert input data into a PyTorch tensor; input type should be ‘gene’ or ‘pheno’

  • ToOneHot: transform categorical data to one-hot encoding; the order of classes should be specified, otherwise sorted labels are used, with the range of labels derived from the input data

  • ToOrdinal: convert categorical data into ordinal (integer) encoding; each unique category is assigned a unique integer value, which can be useful for models that require numerical input

  • ToKBins: transform continuous data into k bins; quantile-based binning is applied to convert continuous features into categorical features

Note

For more details of the transformations, see utils.data_transformations().
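To make the encodings above concrete, here is a small sketch of one-hot, ordinal, and quantile-bin encoding written with NumPy and pandas. It mirrors the ideas behind ToOneHot, ToOrdinal, and ToKBins but is not the SAKURA implementation; the labels, values, and variable names are invented for the example.

```python
import numpy as np
import pandas as pd

labels = pd.Series(["T", "B", "NK", "B"])

# Ordinal encoding with an explicit class order: B=0, T=1, NK=2.
order = ["B", "T", "NK"]
codes = pd.Categorical(labels, categories=order).codes

# One-hot encoding: pick rows of the identity matrix by ordinal code.
one_hot = np.eye(len(order), dtype=np.float32)[codes]

# Without an explicit order, a fallback is to sort the labels seen in the data.
fallback_order = sorted(labels.unique())

# Quantile-based binning of a continuous feature into k=3 bins.
values = pd.Series([0.1, 0.4, 0.5, 0.9, 1.3, 2.0])
bins = pd.qcut(values, q=3, labels=False)

print(codes.tolist())
print(one_hot)
print(fallback_order)
print(bins.tolist())
```

Note that the explicit order and the sorted fallback assign different integers to the same label, which is why specifying the class order is recommended.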

<gene_meta> example:

{
    "all": {
        "gene_list": "*",
        "pre_procedure": [],
        "post_procedure": [{
            "type": "ToTensor"
        }]
    }
}
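A configuration like this can be parsed with the standard json module and inspected before it is handed to the dataset, either as a dictionary (gene_meta) or as a file (gene_meta_json_path). The sketch below only reads the structure shown above; the variable names are illustrative.

```python
import json

gene_meta_text = """
{
    "all": {
        "gene_list": "*",
        "pre_procedure": [],
        "post_procedure": [{"type": "ToTensor"}]
    }
}
"""

gene_meta = json.loads(gene_meta_text)

# Collect the transformation types listed under post_procedure per entry.
post_steps = {
    name: [step["type"] for step in cfg["post_procedure"]]
    for name, cfg in gene_meta.items()
}

for name, cfg in gene_meta.items():
    print(name, cfg["gene_list"], post_steps[name])
```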

<pheno_meta>: For more details of the JSON structure, see utils.data_transformations().

<na_filter>: For phenotype data without any NA values, passing pheno_df_na_filter=False can improve performance when reading a large file.
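The wording of this option matches the na_filter argument of pandas.read_csv. Assuming the phenotype CSV is read with pandas and the flag is forwarded (an assumption, not confirmed here), the effect can be seen on a toy table; the column names and values are invented for the example.

```python
import io

import pandas as pd

# Toy phenotype table; cell c2 has an empty cell_type field.
text = "cell,cell_type,age\nc1,T,31\nc2,,45\n"

# Default behaviour: empty strings are detected and turned into NaN.
df_default = pd.read_csv(io.StringIO(text), index_col=0)

# With na_filter=False the detection pass is skipped, so the empty
# string is kept as-is; dtype pins the column type explicitly.
df_nofilter = pd.read_csv(
    io.StringIO(text),
    index_col=0,
    na_filter=False,
    dtype={"cell_type": str},
)

print(df_default.loc["c2", "cell_type"])
print(repr(df_nofilter.loc["c2", "cell_type"]))
```

Turning the filter off is only safe when the file truly contains no missing values, since any empty field would otherwise pass through as an ordinary string.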

Methods

collate_fn

Assemble individual data samples from a batch of samples.

export_data

Export a batch of data based on the specified items and expression data flags.

expr_set_pre_slice

Pre-slice the expression matrix based on gene metadata.
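The idea of slicing a sparse expression matrix by a gene list can be sketched as follows. This is a hedged illustration only: the helper select_genes is hypothetical, the genes-by-cells orientation is assumed, and treating "*" as "all genes" is an interpretation of the gene_meta example rather than documented behaviour.

```python
import numpy as np
from scipy import sparse

gene_names = np.array(["GeneA", "GeneB", "GeneC", "GeneD"])

# Assumed orientation: rows are genes, columns are cells.
expr = sparse.csr_matrix(np.array([[0, 2, 0],
                                   [1, 0, 0],
                                   [0, 0, 5],
                                   [4, 0, 0]]))


def select_genes(expr, gene_names, gene_list):
    """Hypothetical helper: keep only the rows named in gene_list."""
    if gene_list == "*":
        # Assumed meaning of the wildcard: keep every gene.
        return expr
    # Look up the row index of each requested gene, preserving list order.
    idx = [int(np.flatnonzero(gene_names == g)[0]) for g in gene_list]
    return expr[idx, :]


sliced = select_genes(expr, gene_names, ["GeneC", "GeneA"])
print(sliced.toarray())
```

Slicing once up front, rather than on every sample request, avoids repeating the same index lookup during training.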