Tutorial: SCENIC+ step-by-step in the human cerebellum

[1]:
%matplotlib inline
import scenicplus
scenicplus.__version__
[1]:
'0.1.dev373+geb5f14e'

In this tutorial we describe the minimum steps to generate a SCENIC+ object and build e-GRNs. For more details on how the data was processed with pycisTopic and pycistarget, check the human cerebellum tutorials available at https://pycistopic.readthedocs.io/ and https://pycistarget.readthedocs.io/, respectively.

Data used in the tutorial is freely accessible at: https://www.10xgenomics.com/resources/datasets/frozen-human-healthy-brain-tissue-3-k-1-standard-1-0-0

Processed loom files are available at: https://scope.aertslab.org/#/scenic-v2

A UCSC session is available at: https://genome-euro.ucsc.edu/s/cbravo/SCENIC%2B_cerebellum

1. Create SCENIC+ object

For generating a SCENIC+ you will require:

  • scRNA-seq annData object (e.g. scanpy)

  • scATAC-seq cisTopic object

  • Pycistarget motif enrichment dictionary

[1]:
# Load functions
from scenicplus.scenicplus_class import SCENICPLUS, create_SCENICPLUS_object
from scenicplus.preprocessing.filtering import *

First we will load the scRNA-seq and the scATAC-seq data. We make sure that names match between them.

[2]:
# Load data
## ATAC - cisTopic object
import pickle
infile = open('/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/10x_multiome_brain_cisTopicObject_noDBL.pkl', 'rb')
cistopic_obj = pickle.load(infile)
infile.close()
## Precomputed imputed data
import pickle
infile = open('/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/DARs/Imputed_accessibility.pkl', 'rb')
imputed_acc_obj = pickle.load(infile)
infile.close()
## RNA - Create Anndata
from loomxpy.loomxpy import SCopeLoom
from pycisTopic.loom import *
import itertools
import anndata
path_to_loom = '/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/rna/seurat/10x_multiome_brain_Seurat.loom'
loom = SCopeLoom.read_loom(path_to_loom)
metadata = get_metadata(loom)
# Fix names
metadata['barcode'] = [x.split('-')[0] for x in metadata.index.tolist()]
metadata['barcode'] = metadata['barcode'] + '-1'
metadata.index = metadata['barcode'] + '-10x_multiome_brain'
expr_mat = loom.ex_mtx
expr_mat.index = metadata['barcode'] + '-10x_multiome_brain'
rna_anndata = anndata.AnnData(X=expr_mat)
rna_anndata.obs = metadata

If you have generated your cisTopic object with an old version of pycisTopic, it is possible that your region data was affected by a previous bug. You can fix it with the code below:

[3]:
# Fix region data (bug in old pycistopic versions)
from pycisTopic.utils import region_names_to_coordinates
fragment_matrix = cistopic_obj.fragment_matrix
binary_matrix = cistopic_obj.binary_matrix
region_data = region_names_to_coordinates(cistopic_obj.region_names)
region_data['Width'] = abs(region_data.End -region_data.Start).astype(np.int32)
region_data['cisTopic_nr_frag'] = np.array(
fragment_matrix.sum(axis=1)).flatten()
region_data['cisTopic_log_nr_frag'] = np.log10(
region_data['cisTopic_nr_frag'])
region_data['cisTopic_nr_acc'] = np.array(
binary_matrix.sum(axis=1)).flatten()
region_data['cisTopic_log_nr_acc'] = np.log10(
region_data['cisTopic_nr_acc'])
cistopic_obj.region_data = region_data

Next we load the motif enrichment results into a dictionary. We can load motif results from the different methods in pycistarget (e.g. cisTarget, DEM) and different region sets (e.g. topics, DARs, MACS bdgdiff peaks). In this tutorial we will use both cisTarget and DEM peaks from topics and DARs. Note that if you have used the pycistarget wrapper function, the dictionary in menr.pkl can be directly pass as menr in the create_SCENICPLUS_object function.

[4]:
# Load cistarget and DEM motif enrichment results
motif_enrichment_dict={}
import pickle
from pycistarget.motif_enrichment_dem import *
path = '/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistarget/'
infile = open(path +'topics/topic_cistarget_dict.pkl', 'rb')
motif_enrichment_dict['CTX_Topics_All'] = pickle.load(infile)
infile.close()
infile = open(path +'topics/topic_DEM_dict.pkl', 'rb')
motif_enrichment_dict['DEM_Topics_All'] = pickle.load(infile)
infile.close()
infile = open(path +'DARs/DARs_cistarget_dict.pkl', 'rb')
motif_enrichment_dict['CTX_DARs_All'] = pickle.load(infile)
infile.close()
infile = open(path +'DARs/DARs_DEM_dict.pkl', 'rb')
motif_enrichment_dict['DEM_DARs_All'] = pickle.load(infile)
infile.close()

Now we can create the SCENIC+ object:

[5]:
scplus_obj = create_SCENICPLUS_object(
        GEX_anndata = rna_anndata,
        cisTopic_obj = cistopic_obj,
        imputed_acc_obj = imputed_acc_obj,
        menr = motif_enrichment_dict,
        ACC_prefix = 'ACC_',
        GEX_prefix = 'GEX_',
        bc_transform_func = lambda x: x,
        normalize_imputed_acc = False)
[6]:
print(scplus_obj)
SCENIC+ object with n_cells x n_genes = 1736 x 26399 and n_cells x n_regions = 1736 x 422146
        metadata_regions:'Chromosome', 'Start', 'End', 'Width', 'cisTopic_nr_frag', 'cisTopic_log_nr_frag', 'cisTopic_nr_acc', 'cisTopic_log_nr_acc'
        metadata_cell:'GEX_VSN_cell_type', 'GEX_VSN_leiden_res0.3', 'GEX_VSN_leiden_res0.6', 'GEX_VSN_leiden_res0.9', 'GEX_VSN_leiden_res1.2', 'GEX_VSN_sample_id', 'GEX_Seurat_leiden_res0.6', 'GEX_Seurat_leiden_res1.2', 'GEX_Seurat_cell_type', 'GEX_barcode', 'ACC_cisTopic_nr_frag', 'ACC_cisTopic_log_nr_acc', 'ACC_barcode', 'ACC_VSN_leiden_res0.6', 'ACC_FRIP', 'ACC_TSS_enrichment', 'ACC_cisTopic_nr_acc', 'ACC_VSN_leiden_res0.9', 'ACC_VSN_RNA+ATAC_leiden_100_2', 'ACC_Dupl_rate', 'ACC_sample_id', 'ACC_pycisTopic_leiden_10_0.6', 'ACC_pycisTopic_leiden_10_1.2', 'ACC_Seurat_leiden_res0.6', 'ACC_Total_nr_frag', 'ACC_Total_nr_frag_in_regions', 'ACC_VSN_leiden_res1.2', 'ACC_VSN_cell_type', 'ACC_Unique_nr_frag_in_regions', 'ACC_Seurat_cell_type', 'ACC_Predicted_doublets_fragments', 'ACC_Dupl_nr_frag', 'ACC_cisTopic_log_nr_frag', 'ACC_VSN_leiden_res0.3', 'ACC_Log_total_nr_frag', 'ACC_Unique_nr_frag', 'ACC_Doublet_scores_fragments', 'ACC_VSN_sample_id', 'ACC_Log_unique_nr_frag', 'ACC_Seurat_leiden_res1.2', 'ACC_Seurat_RNA+ATAC_leiden_100_2', 'ACC_pycisTopic_ingest_cell_type', 'ACC_pycisTopic_harmony_cell_type', 'ACC_pycisTopic_bbknn_cell_type', 'ACC_pycisTopic_scanorama_cell_type', 'ACC_pycisTopic_cca_cell_type'
        menr:'CTX_Topics_All', 'DEM_Topics_All', 'CTX_DARs_All', 'DEM_DARs_All'
        dr_cell:'ACC_UMAP', 'ACC_tSNE', 'ACC_VSN_RNA+ATAC_UMAP', 'ACC_Seurat_RNA+ATAC_UMAP'

You can also filter low accessible regions and low expressed genes. This recommended to avoid getting false relationships with these regions and genes.

[7]:
filter_genes(scplus_obj, min_pct = 0.5)
filter_regions(scplus_obj, min_pct = 0.5)
2022-01-04 17:24:16,750 Preprocessing INFO     Going from 26399 genes to 20448 genes.
2022-01-04 17:24:50,192 Preprocessing INFO     Going from 422146 regions to 375805 regions.

2. Generate cistromes

The next step is to generate cistromes. By default, all targets assigned to a TF across the motif enrichment dictionaries will be taken, and overlapped with regions in the SCENIC+ object. However, it is possible to also subset for regions accessible in certain cell type as well, to generate cell type specific cistromes. This approach is described in the tutorial cistrome_pruning_advanced.ipynb.

[8]:
# Merge cistromes (all)
from scenicplus.cistromes import *
import time
start_time = time.time()
merge_cistromes(scplus_obj)
time = time.time()-start_time
print(time/60)
7.251678760846456

3. Infer enhancer to gene relationships

To infer enhancer-to-gene relationships, we exploit correlation between region accessibility and gene expression. In addition, to assess no non-linear relationships, we also use Gradient Boosting Machines. More details on these steps can be found in the tutorial r2g_advanced.ipynb.

A. Get search space

We first need to define the search space around the gene. Here we will use 150kb upstream/downtream the gene, but TAD boundaries can be also used. WARNING: Make sure that the specified biomart_host matches your genome assembly.

[29]:
import pickle
infile = open('/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/scenicplus/scplus_obj.pkl', 'rb')
scplus_obj = pickle.load(infile)
infile.close()
[30]:
from scenicplus.enhancer_to_gene import get_search_space, calculate_regions_to_genes_relationships, GBM_KWARGS
get_search_space(scplus_obj,
                 biomart_host = 'http://www.ensembl.org',
                 species = 'hsapiens',
                 assembly = 'hg38',
                 upstream = [1000, 150000],
                 downstream = [1000, 150000])
2022-01-04 17:50:07,795 R2G          INFO     Downloading gene annotation from biomart dataset: hsapiens_gene_ensembl
/opt/venv/lib/python3.8/site-packages/scenicplus/enhancer_to_gene.py:197: DtypeWarning:

Columns (0) have mixed types.Specify dtype option on import or set low_memory=False.

2022-01-04 17:50:40,505 R2G          INFO     Downloading chromosome sizes from: http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes
2022-01-04 17:50:42,099 R2G          INFO     Extending promoter annotation to 10 bp upstream and 10 downstream
2022-01-04 17:50:44,632 R2G          INFO     Extending search space to:
                                                        150000 bp downstream of the end of the gene.
                                                        150000 bp upstream of the start of the gene.
2022-01-04 17:51:07,232 R2G          INFO     Intersecting with regions.
join: Strand data from other will be added as strand data to self.
If this is undesired use the flag apply_strand_suffix=False.
To turn off the warning set apply_strand_suffix to True or False.
2022-01-04 17:51:08,592 R2G          INFO     Calculating distances from region to gene
2022-01-04 17:55:43,084 R2G          INFO     Imploding multiple entries per region and gene
2022-01-04 17:58:01,862 R2G          INFO     Done!

B. Enhancer-to-gene models

Enhancer-to-gene models can be done using correlation, random forest (RF) or Gradient Boosting Machines (GBM). GBMs are a much faster alternative to RFs.

[31]:
calculate_regions_to_genes_relationships(scplus_obj,
                    ray_n_cpu = 20,
                    _temp_dir = '/scratch/leuven/313/vsc31305/ray_spill',
                    importance_scoring_method = 'GBM',
                    importance_scoring_kwargs = GBM_KWARGS)
2022-01-04 17:59:17,845 R2G          INFO     Calculating region to gene importances, using GBM method
2022-01-04 17:59:30,930 INFO services.py:1338 -- View the Ray dashboard at http://127.0.0.1:8267
initializing:  22%|██▏       | 3225/14548 [04:27<15:17, 12.33it/s]
(score_regions_to_single_gene_ray pid=34847)
initializing:  27%|██▋       | 3905/14548 [05:23<14:24, 12.30it/s]
(score_regions_to_single_gene_ray pid=34848)
initializing:  27%|██▋       | 3953/14548 [05:27<14:34, 12.11it/s]
(score_regions_to_single_gene_ray pid=34840)
initializing:  54%|█████▍    | 7836/14548 [11:03<09:18, 12.01it/s]  (score_regions_to_single_gene_ray pid=34844)
initializing:  68%|██████▊   | 9874/14548 [13:49<06:23, 12.20it/s]
(score_regions_to_single_gene_ray pid=34855)
initializing: 100%|██████████| 14548/14548 [20:26<00:00, 11.86it/s]
Running using 20 cores: 100%|██████████| 14548/14548 [00:35<00:00, 408.74it/s]
2022-01-04 18:20:38,987 R2G          INFO     Took 1281.1407558918 seconds
2022-01-04 18:20:38,990 R2G          INFO     Calculating region to gene correlation, using SR method
2022-01-04 18:20:53,862 INFO services.py:1338 -- View the Ray dashboard at http://127.0.0.1:8267
initializing:  68%|██████▊   | 9871/14548 [13:15<06:17, 12.39it/s]  (score_regions_to_single_gene_ray pid=20985)
initializing:  84%|████████▍ | 12185/14548 [16:17<03:03, 12.88it/s](score_regions_to_single_gene_ray pid=20982)
initializing:  98%|█████████▊| 14275/14548 [19:01<00:21, 12.94it/s](score_regions_to_single_gene_ray pid=20988)
initializing: 100%|██████████| 14548/14548 [19:23<00:00, 12.51it/s]
Running using 20 cores: 100%|██████████| 14548/14548 [00:50<00:00, 288.45it/s]
2022-01-04 18:41:13,309 R2G          INFO     Took 1234.3179178237915 seconds
2022-01-04 18:41:26,108 R2G          INFO     Done!
[32]:
# Save
import pickle
with open('/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/scenicplus/scplus_obj.pkl', 'wb') as f:
  pickle.dump(scplus_obj, f)

4. Infer TF to gene relationships

The next step is to infer relationships between TFs and target genes based on expression. We will use similar approaches as for the enhancer-to-gene relationships (GBM/RF and correlation).

[33]:
import pickle
infile = open('/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/scenicplus/scplus_obj.pkl', 'rb')
scplus_obj = pickle.load(infile)
infile.close()
[ ]:
from scenicplus.TF_to_gene import *
tf_file = '/staging/leuven/stg_00002/lcb/cflerin/resources/allTFs_hg38.txt'
calculate_TFs_to_genes_relationships(scplus_obj,
                    tf_file = tf_file,
                    ray_n_cpu = 20,
                    method = 'GBM',
                    _temp_dir = '/scratch/leuven/313/vsc31305/ray_spill',
                    key= 'TF2G_adj')
[ ]:
# Save
import pickle
with open('/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/scenicplus/scplus_obj.pkl', 'wb') as f:
  pickle.dump(scplus_obj, f)

If you have run SCENIC before in the gene expression matrix, it is possible to directly load adjancencies from that pipeline:

[ ]:
load_TF2G_adj_from_file(scplus_obj,
                        f_adj = '/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/rna/vsn/single_sample_scenic_HQ/out/scenic/10x_multiome_brain_HQ/arboreto_with_multiprocessing/10x_multiome_brain_HQ__adj.tsv',
                        inplace = True,
                        key= 'TF2G_adj')

5. Build eGRNs

The last step is to build the eGRNs using a recovery approach (GSEA). The ranking to use will be based on the TF-2-gene importances, while gene sets will be derived with different thresholding methods on the enhancer-to-gene relationships and the unfiltered cistromes.

[1]:
import pickle
infile = open('/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/scenicplus/scplus_obj.pkl', 'rb')
scplus_obj = pickle.load(infile)
infile.close()
[ ]:
# Load functions
from scenicplus.grn_builder.gsea_approach import build_grn

build_grn(scplus_obj,
         min_target_genes = 10,
         adj_pval_thr = 1,
         min_regions_per_gene = 0,
         quantiles = (0.85, 0.90, 0.95),
         top_n_regionTogenes_per_gene = (5, 10, 15),
         top_n_regionTogenes_per_region = (),
         binarize_using_basc = True,
         rho_dichotomize_tf2g = True,
         rho_dichotomize_r2g = True,
         rho_dichotomize_eregulon = True,
         rho_threshold = 0.05,
         keep_extended_motif_annot = True,
         merge_eRegulons = True,
         order_regions_to_genes_by = 'importance',
         order_TFs_to_genes_by = 'importance',
         key_added = 'eRegulons_importance',
         cistromes_key = 'Unfiltered',
         disable_tqdm = False, #If running in notebook, set to True
         ray_n_cpu = 20,
         _temp_dir = '/scratch/leuven/313/vsc31305/ray_spill')

To access the eGRNs:

[ ]:
import dill
with open('/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/scenicplus/scplus_obj.pkl', 'wb') as f:
  dill.dump(scplus_obj, f)

6. Exploring SCENIC+ results

A. Generate eRegulon metadata

As a first step, we will format the eGRN metadata. This will integrate the results from the inference of enhancer-to-gene and TF-to-gene relationships and the eRegulon construction in one pandas dataframe that can be used for further exploration.

[1]:
import dill
infile = open('/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/scenicplus/scplus_obj.pkl', 'rb')
scplus_obj = dill.load(infile)
infile.close()
[ ]:
from scenicplus.utils import format_egrns
format_egrns(scplus_obj, eregulons_key = 'eRegulons_importance', TF2G_key = 'TF2G_adj', key_added = 'eRegulon_metadata')

The eRegulon metadata will look as:

[4]:
scplus_obj.uns['eRegulon_metadata'][0:10]
[4]:
Region_signature_name Gene_signature_name TF is_extended Region Gene R2G_importance R2G_rho R2G_importance_x_rho R2G_importance_x_abs_rho TF2G_importance TF2G_regulation TF2G_rho TF2G_importance_x_abs_rho TF2G_importance_x_rho
0 ACO1_+_+_(24r) ACO1_+_+_(23g) ACO1 False chr1:244918651-244919151 HNRNPU 0.021839 0.085878 0.001876 0.001876 0.236510 1 0.122128 0.028884 0.028884
1 ACO1_+_+_(24r) ACO1_+_+_(23g) ACO1 False chr10:88991114-88991614 FAS 0.023378 0.215477 0.005037 0.005037 0.457794 1 0.162854 0.074554 0.074554
2 ACO1_+_+_(24r) ACO1_+_+_(23g) ACO1 False chr10:88990599-88991099 FAS 0.031784 0.212687 0.006760 0.006760 0.457794 1 0.162854 0.074554 0.074554
3 ACO1_+_+_(24r) ACO1_+_+_(23g) ACO1 False chrX:107676906-107677406 MID2 0.021056 0.098110 0.002066 0.002066 0.663257 1 0.165609 0.109842 0.109842
4 ACO1_+_+_(24r) ACO1_+_+_(23g) ACO1 False chr16:1556214-1556714 IFT140 0.015752 0.098841 0.001557 0.001557 0.336581 1 0.095788 0.032240 0.032240
5 ACO1_+_+_(24r) ACO1_+_+_(23g) ACO1 False chr10:103692953-103693453 SH3PXD2A 0.004005 0.146456 0.000587 0.000587 4.731638 1 0.200345 0.947960 0.947960
6 ACO1_+_+_(24r) ACO1_+_+_(23g) ACO1 False chr19:57527137-57527637 ZNF549 0.078793 0.083988 0.006618 0.006618 0.522762 1 0.154494 0.080764 0.080764
7 ACO1_+_+_(24r) ACO1_+_+_(23g) ACO1 False chr22:36323698-36324198 TXN2 0.020686 0.073909 0.001529 0.001529 1.298719 1 0.160698 0.208702 0.208702
8 ACO1_+_+_(24r) ACO1_+_+_(23g) ACO1 False chr6:15365250-15365750 JARID2 0.025076 0.174829 0.004384 0.004384 0.678238 1 0.231925 0.157300 0.157300
9 ACO1_+_+_(24r) ACO1_+_+_(23g) ACO1 False chr10:126525383-126525883 ADAM12 0.027865 0.067947 0.001893 0.001893 0.233867 1 0.090295 0.021117 0.021117

The first sign in the eRegulon name indicates the relationship between TF and gene; while the second indicates the relationship between region and gene. Additional columns can be added by the user, for example the enrichment of the TF motif in the regions.

B. Assesing eGRN enrichment in cells

Next, we can score the eRegulons in the cells to assess how enriched target regions and target genes are enriched in each cell. We will score the region-based regulons on the scATAC-seq layer, while gene-based eRegulons will be scored using the transcriptome layer.

[40]:
# Format eRegulons
from scenicplus.eregulon_enrichment import *
get_eRegulons_as_signatures(scplus_obj, eRegulon_metadata_key='eRegulon_metadata', key_added='eRegulon_signatures')
[6]:
## Score chromatin layer
# Region based raking
from scenicplus.cistromes import *
import time
start_time = time.time()
region_ranking = make_rankings(scplus_obj, target='region')
# Score region regulons
score_eRegulons(scplus_obj,
                ranking = region_ranking,
                eRegulon_signatures_key = 'eRegulon_signatures',
                key_added = 'eRegulon_AUC',
                enrichment_type= 'region',
                auc_threshold = 0.05,
                normalize = False,
                n_cpu = 1)
time = time.time()-start_time
print(time/60)
100%|██████████| 832/832 [00:26<00:00, 31.16it/s]
2.8363844593365988
[7]:
## Score transcriptome layer
# Gene based raking
from scenicplus.cistromes import *
import time
start_time = time.time()
gene_ranking = make_rankings(scplus_obj, target='gene')
# Score gene regulons
score_eRegulons(scplus_obj,
                gene_ranking,
                eRegulon_signatures_key = 'eRegulon_signatures',
                key_added = 'eRegulon_AUC',
                enrichment_type = 'gene',
                auc_threshold = 0.05,
                normalize= False,
                n_cpu = 1)
time = time.time()-start_time
print(time/60)
100%|██████████| 832/832 [00:07<00:00, 104.80it/s]
0.21863729159037273

C. Assessing TF-eGRN relationships

Next we can assess the relationship between the TF expression and the eRegulons; in other words, whether genes expressed/repressed and regions accessible/closed when the TF is present. Due to the amount of drop-outs, and the variability in cell types proportions, using directly the AUC cistrome matrix can result in noisy correlations. Here, we use pseudobulks, in which we sample a number of cells per cell type. In this example, we merge 5 cells per pseudobulk and generate 100 pseudobulks per cell type.

[8]:
# Generate pseudobulks
import time
start_time = time.time()
generate_pseudobulks(scplus_obj,
                         variable = 'ACC_Seurat_cell_type',
                         auc_key = 'eRegulon_AUC',
                         signature_key = 'Gene_based',
                         nr_cells = 5,
                         nr_pseudobulks = 100,
                         seed=555)
generate_pseudobulks(scplus_obj,
                         variable = 'ACC_Seurat_cell_type',
                         auc_key = 'eRegulon_AUC',
                         signature_key = 'Region_based',
                         nr_cells = 5,
                         nr_pseudobulks = 100,
                         seed=555)
time = time.time()-start_time
print(time/60)
1.9503313183784485
[9]:
# Correlation between TF and eRegulons
import time
start_time = time.time()
TF_cistrome_correlation(scplus_obj,
                        variable = 'ACC_Seurat_cell_type',
                        auc_key = 'eRegulon_AUC',
                        signature_key = 'Gene_based',
                        out_key = 'ACC_Seurat_cell_type_eGRN_gene_based')
TF_cistrome_correlation(scplus_obj,
                        variable = 'ACC_Seurat_cell_type',
                        auc_key = 'eRegulon_AUC',
                        signature_key = 'Region_based',
                        out_key = 'ACC_Seurat_cell_type_eGRN_region_based')
time = time.time()-start_time
print(time/60)
0.032043548425038655

We can plot eRegulon enrichment versus TF expression for each pseudobulk.

[11]:
# Region based
%matplotlib inline
import seaborn as sns
sns.set_style("white")
colors = ["#E9842C","#F8766D", "#BC9D00", "#00C0B4", "#9CA700", "#6FB000", "#00B813", "#00BD61", "#00C08E", "#00BDD4",
           "#00A7FF", "#7F96FF", "#E26EF7", "#FF62BF", "#D69100", "#BC81FF"]
categories = sorted(set(scplus_obj.metadata_cell['ACC_Seurat_cell_type']))
color_dict = dict(zip(categories, colors[0:len(categories)]))
prune_plot(scplus_obj,
           'SOX5_-_+',
           pseudobulk_variable = 'ACC_Seurat_cell_type',
           show_dot_plot = True,
           show_line_plot = False,
           color_dict = color_dict,
           use_pseudobulk = True,
           auc_key = 'eRegulon_AUC',
           signature_key = 'Region_based',
           seed=555)
_images/Scenicplus_step_by_step-RTD_62_0.png
[12]:
# Gene based
%matplotlib inline
sns.set_style("white")
colors = ["#E9842C","#F8766D", "#BC9D00", "#00C0B4", "#9CA700", "#6FB000", "#00B813", "#00BD61", "#00C08E", "#00BDD4",
           "#00A7FF", "#7F96FF", "#E26EF7", "#FF62BF", "#D69100", "#BC81FF"]
categories = sorted(set(scplus_obj.metadata_cell['ACC_Seurat_cell_type']))
color_dict = dict(zip(categories, colors[0:len(categories)]))
prune_plot(scplus_obj,
           'SOX5_-_+',
           pseudobulk_variable = 'ACC_Seurat_cell_type',
           show_dot_plot = True,
           show_line_plot = False,
           color_dict = color_dict,
           use_pseudobulk = True,
           auc_key = 'eRegulon_AUC',
           signature_key = 'Gene_based',
           seed=555)
_images/Scenicplus_step_by_step-RTD_63_0.png

D. Identification of high quality regulons

We will select a subset of regulons based on the correlation between the region based and the gene based regulons. We will only use the extended eRegulon if there is not a direct eRegulon available.

[46]:
# Correlation between region based regulons and gene based regulons
import pandas
df1 = scplus_obj.uns['eRegulon_AUC']['Gene_based'].copy()
df2 = scplus_obj.uns['eRegulon_AUC']['Region_based'].copy()
df1.columns = [x.split('_(')[0] for x in df1.columns]
df2.columns = [x.split('_(')[0] for x in df2.columns]
correlations = df1.corrwith(df2, axis = 0)
correlations = correlations[abs(correlations) > 0.6]
# Kepp only R2G +
keep = [x for x in correlations.index if '+_+' in x] + [x for x in correlations.index if '-_+' in x]
# Keep extended if not direct
extended = [x for x in keep if 'extended' in x]
direct = [x for x in keep if not 'extended' in x]
keep_extended = [x for x in extended if not x.replace('extended_', '') in direct]
keep = direct + keep_extended
# Keep regulons with more than 10 genes
keep_gene = [x for x in scplus_obj.uns['eRegulon_AUC']['Gene_based'].columns if x.split('_(')[0] in keep]
keep_gene = [x for x in keep_gene if (int(x.split('_(')[1].replace('g)', '')) > 10)]
keep_all = [x.split('_(')[0] for x in keep_gene]
keep_region = [x for x in scplus_obj.uns['eRegulon_AUC']['Region_based'].columns if x.split('_(')[0] in keep]
scplus_obj.uns['selected_eRegulons'] = {}
scplus_obj.uns['selected_eRegulons']['Gene_based'] = keep_gene
scplus_obj.uns['selected_eRegulons']['Region_based'] = keep_region
[47]:
len(keep_gene)
[47]:
114

E. Overlap between eRegulons

In addition, to assess which eRegulons tend to be enriched in the same group of cells we can generate a correlation plot as well.

[49]:
from scenicplus.plotting.correlation_plot import *
correlation_heatmap(scplus_obj,
                    auc_key = 'eRegulon_AUC',
                    signature_keys = ['Gene_based'],
                    selected_regulons = scplus_obj.uns['selected_eRegulons']['Gene_based'],
                    fcluster_threshold = 0.1,
                    fontsize = 3)