Tutorial: SCENIC+ step-by-step in the human cerebellum
Contents
Tutorial: SCENIC+ step-by-step in the human cerebellum
[1]:
%matplotlib inline
import scenicplus
scenicplus.__version__
[1]:
'0.1.dev373+geb5f14e'
In this tutorial we describe the minimum steps to generate a SCENIC+ object and build e-GRNs. For more details on how the data was processed with pycisTopic and pycistarget, check the human cerebellum tutorials available at https://pycistopic.readthedocs.io/ and https://pycistarget.readthedocs.io/, respectively.
Data used in the tutorial is freely accessible at: https://www.10xgenomics.com/resources/datasets/frozen-human-healthy-brain-tissue-3-k-1-standard-1-0-0
Processed loom files are available at: https://scope.aertslab.org/#/scenic-v2
A UCSC session is available at: https://genome-euro.ucsc.edu/s/cbravo/SCENIC%2B_cerebellum
1. Create SCENIC+ object
For generating a SCENIC+ you will require:
scRNA-seq annData object (e.g. scanpy)
scATAC-seq cisTopic object
Pycistarget motif enrichment dictionary
[1]:
# Load functions
from scenicplus.scenicplus_class import SCENICPLUS, create_SCENICPLUS_object
from scenicplus.preprocessing.filtering import *
First we will load the scRNA-seq and the scATAC-seq data. We make sure that names match between them.
[2]:
# Load data
## ATAC - cisTopic object
import pickle
infile = open('/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/10x_multiome_brain_cisTopicObject_noDBL.pkl', 'rb')
cistopic_obj = pickle.load(infile)
infile.close()
## Precomputed imputed data
import pickle
infile = open('/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistopic/DARs/Imputed_accessibility.pkl', 'rb')
imputed_acc_obj = pickle.load(infile)
infile.close()
## RNA - Create Anndata
from loomxpy.loomxpy import SCopeLoom
from pycisTopic.loom import *
import itertools
import anndata
path_to_loom = '/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/rna/seurat/10x_multiome_brain_Seurat.loom'
loom = SCopeLoom.read_loom(path_to_loom)
metadata = get_metadata(loom)
# Fix names
metadata['barcode'] = [x.split('-')[0] for x in metadata.index.tolist()]
metadata['barcode'] = metadata['barcode'] + '-1'
metadata.index = metadata['barcode'] + '-10x_multiome_brain'
expr_mat = loom.ex_mtx
expr_mat.index = metadata['barcode'] + '-10x_multiome_brain'
rna_anndata = anndata.AnnData(X=expr_mat)
rna_anndata.obs = metadata
If you have generated your cisTopic object with an old version of pycisTopic, it is possible that your region data was affected by a previous bug. You can fix it with the code below:
[3]:
# Fix region data (bug in old pycistopic versions)
from pycisTopic.utils import region_names_to_coordinates
fragment_matrix = cistopic_obj.fragment_matrix
binary_matrix = cistopic_obj.binary_matrix
region_data = region_names_to_coordinates(cistopic_obj.region_names)
region_data['Width'] = abs(region_data.End -region_data.Start).astype(np.int32)
region_data['cisTopic_nr_frag'] = np.array(
fragment_matrix.sum(axis=1)).flatten()
region_data['cisTopic_log_nr_frag'] = np.log10(
region_data['cisTopic_nr_frag'])
region_data['cisTopic_nr_acc'] = np.array(
binary_matrix.sum(axis=1)).flatten()
region_data['cisTopic_log_nr_acc'] = np.log10(
region_data['cisTopic_nr_acc'])
cistopic_obj.region_data = region_data
Next we load the motif enrichment results into a dictionary. We can load motif results from the different methods in pycistarget (e.g. cisTarget, DEM) and different region sets (e.g. topics, DARs, MACS bdgdiff peaks). In this tutorial we will use both cisTarget and DEM peaks from topics and DARs. Note that if you have used the pycistarget wrapper function, the dictionary in menr.pkl can be directly pass as menr
in the create_SCENICPLUS_object
function.
[4]:
# Load cistarget and DEM motif enrichment results
motif_enrichment_dict={}
import pickle
from pycistarget.motif_enrichment_dem import *
path = '/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/atac/pycistarget/'
infile = open(path +'topics/topic_cistarget_dict.pkl', 'rb')
motif_enrichment_dict['CTX_Topics_All'] = pickle.load(infile)
infile.close()
infile = open(path +'topics/topic_DEM_dict.pkl', 'rb')
motif_enrichment_dict['DEM_Topics_All'] = pickle.load(infile)
infile.close()
infile = open(path +'DARs/DARs_cistarget_dict.pkl', 'rb')
motif_enrichment_dict['CTX_DARs_All'] = pickle.load(infile)
infile.close()
infile = open(path +'DARs/DARs_DEM_dict.pkl', 'rb')
motif_enrichment_dict['DEM_DARs_All'] = pickle.load(infile)
infile.close()
Now we can create the SCENIC+ object:
[5]:
scplus_obj = create_SCENICPLUS_object(
GEX_anndata = rna_anndata,
cisTopic_obj = cistopic_obj,
imputed_acc_obj = imputed_acc_obj,
menr = motif_enrichment_dict,
ACC_prefix = 'ACC_',
GEX_prefix = 'GEX_',
bc_transform_func = lambda x: x,
normalize_imputed_acc = False)
[6]:
print(scplus_obj)
SCENIC+ object with n_cells x n_genes = 1736 x 26399 and n_cells x n_regions = 1736 x 422146
metadata_regions:'Chromosome', 'Start', 'End', 'Width', 'cisTopic_nr_frag', 'cisTopic_log_nr_frag', 'cisTopic_nr_acc', 'cisTopic_log_nr_acc'
metadata_cell:'GEX_VSN_cell_type', 'GEX_VSN_leiden_res0.3', 'GEX_VSN_leiden_res0.6', 'GEX_VSN_leiden_res0.9', 'GEX_VSN_leiden_res1.2', 'GEX_VSN_sample_id', 'GEX_Seurat_leiden_res0.6', 'GEX_Seurat_leiden_res1.2', 'GEX_Seurat_cell_type', 'GEX_barcode', 'ACC_cisTopic_nr_frag', 'ACC_cisTopic_log_nr_acc', 'ACC_barcode', 'ACC_VSN_leiden_res0.6', 'ACC_FRIP', 'ACC_TSS_enrichment', 'ACC_cisTopic_nr_acc', 'ACC_VSN_leiden_res0.9', 'ACC_VSN_RNA+ATAC_leiden_100_2', 'ACC_Dupl_rate', 'ACC_sample_id', 'ACC_pycisTopic_leiden_10_0.6', 'ACC_pycisTopic_leiden_10_1.2', 'ACC_Seurat_leiden_res0.6', 'ACC_Total_nr_frag', 'ACC_Total_nr_frag_in_regions', 'ACC_VSN_leiden_res1.2', 'ACC_VSN_cell_type', 'ACC_Unique_nr_frag_in_regions', 'ACC_Seurat_cell_type', 'ACC_Predicted_doublets_fragments', 'ACC_Dupl_nr_frag', 'ACC_cisTopic_log_nr_frag', 'ACC_VSN_leiden_res0.3', 'ACC_Log_total_nr_frag', 'ACC_Unique_nr_frag', 'ACC_Doublet_scores_fragments', 'ACC_VSN_sample_id', 'ACC_Log_unique_nr_frag', 'ACC_Seurat_leiden_res1.2', 'ACC_Seurat_RNA+ATAC_leiden_100_2', 'ACC_pycisTopic_ingest_cell_type', 'ACC_pycisTopic_harmony_cell_type', 'ACC_pycisTopic_bbknn_cell_type', 'ACC_pycisTopic_scanorama_cell_type', 'ACC_pycisTopic_cca_cell_type'
menr:'CTX_Topics_All', 'DEM_Topics_All', 'CTX_DARs_All', 'DEM_DARs_All'
dr_cell:'ACC_UMAP', 'ACC_tSNE', 'ACC_VSN_RNA+ATAC_UMAP', 'ACC_Seurat_RNA+ATAC_UMAP'
You can also filter low accessible regions and low expressed genes. This recommended to avoid getting false relationships with these regions and genes.
[7]:
filter_genes(scplus_obj, min_pct = 0.5)
filter_regions(scplus_obj, min_pct = 0.5)
2022-01-04 17:24:16,750 Preprocessing INFO Going from 26399 genes to 20448 genes.
2022-01-04 17:24:50,192 Preprocessing INFO Going from 422146 regions to 375805 regions.
2. Generate cistromes
The next step is to generate cistromes. By default, all targets assigned to a TF across the motif enrichment dictionaries will be taken, and overlapped with regions in the SCENIC+ object. However, it is possible to also subset for regions accessible in certain cell type as well, to generate cell type specific cistromes. This approach is described in the tutorial cistrome_pruning_advanced.ipynb
.
[8]:
# Merge cistromes (all)
from scenicplus.cistromes import *
import time
start_time = time.time()
merge_cistromes(scplus_obj)
time = time.time()-start_time
print(time/60)
7.251678760846456
3. Infer enhancer to gene relationships
To infer enhancer-to-gene relationships, we exploit correlation between region accessibility and gene expression. In addition, to assess no non-linear relationships, we also use Gradient Boosting Machines. More details on these steps can be found in the tutorial r2g_advanced.ipynb
.
A. Get search space
We first need to define the search space around the gene. Here we will use 150kb upstream/downtream the gene, but TAD boundaries can be also used. WARNING: Make sure that the specified biomart_host matches your genome assembly.
[29]:
import pickle
infile = open('/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/scenicplus/scplus_obj.pkl', 'rb')
scplus_obj = pickle.load(infile)
infile.close()
[30]:
from scenicplus.enhancer_to_gene import get_search_space, calculate_regions_to_genes_relationships, GBM_KWARGS
get_search_space(scplus_obj,
biomart_host = 'http://www.ensembl.org',
species = 'hsapiens',
assembly = 'hg38',
upstream = [1000, 150000],
downstream = [1000, 150000])
2022-01-04 17:50:07,795 R2G INFO Downloading gene annotation from biomart dataset: hsapiens_gene_ensembl
/opt/venv/lib/python3.8/site-packages/scenicplus/enhancer_to_gene.py:197: DtypeWarning:
Columns (0) have mixed types.Specify dtype option on import or set low_memory=False.
2022-01-04 17:50:40,505 R2G INFO Downloading chromosome sizes from: http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes
2022-01-04 17:50:42,099 R2G INFO Extending promoter annotation to 10 bp upstream and 10 downstream
2022-01-04 17:50:44,632 R2G INFO Extending search space to:
150000 bp downstream of the end of the gene.
150000 bp upstream of the start of the gene.
2022-01-04 17:51:07,232 R2G INFO Intersecting with regions.
join: Strand data from other will be added as strand data to self.
If this is undesired use the flag apply_strand_suffix=False.
To turn off the warning set apply_strand_suffix to True or False.
2022-01-04 17:51:08,592 R2G INFO Calculating distances from region to gene
2022-01-04 17:55:43,084 R2G INFO Imploding multiple entries per region and gene
2022-01-04 17:58:01,862 R2G INFO Done!
B. Enhancer-to-gene models
Enhancer-to-gene models can be done using correlation, random forest (RF) or Gradient Boosting Machines (GBM). GBMs are a much faster alternative to RFs.
[31]:
calculate_regions_to_genes_relationships(scplus_obj,
ray_n_cpu = 20,
_temp_dir = '/scratch/leuven/313/vsc31305/ray_spill',
importance_scoring_method = 'GBM',
importance_scoring_kwargs = GBM_KWARGS)
2022-01-04 17:59:17,845 R2G INFO Calculating region to gene importances, using GBM method
2022-01-04 17:59:30,930 INFO services.py:1338 -- View the Ray dashboard at http://127.0.0.1:8267
initializing: 22%|██▏ | 3225/14548 [04:27<15:17, 12.33it/s]
(score_regions_to_single_gene_ray pid=34847)
initializing: 27%|██▋ | 3905/14548 [05:23<14:24, 12.30it/s]
(score_regions_to_single_gene_ray pid=34848)
initializing: 27%|██▋ | 3953/14548 [05:27<14:34, 12.11it/s]
(score_regions_to_single_gene_ray pid=34840)
initializing: 54%|█████▍ | 7836/14548 [11:03<09:18, 12.01it/s] (score_regions_to_single_gene_ray pid=34844)
initializing: 68%|██████▊ | 9874/14548 [13:49<06:23, 12.20it/s]
(score_regions_to_single_gene_ray pid=34855)
initializing: 100%|██████████| 14548/14548 [20:26<00:00, 11.86it/s]
Running using 20 cores: 100%|██████████| 14548/14548 [00:35<00:00, 408.74it/s]
2022-01-04 18:20:38,987 R2G INFO Took 1281.1407558918 seconds
2022-01-04 18:20:38,990 R2G INFO Calculating region to gene correlation, using SR method
2022-01-04 18:20:53,862 INFO services.py:1338 -- View the Ray dashboard at http://127.0.0.1:8267
initializing: 68%|██████▊ | 9871/14548 [13:15<06:17, 12.39it/s] (score_regions_to_single_gene_ray pid=20985)
initializing: 84%|████████▍ | 12185/14548 [16:17<03:03, 12.88it/s](score_regions_to_single_gene_ray pid=20982)
initializing: 98%|█████████▊| 14275/14548 [19:01<00:21, 12.94it/s](score_regions_to_single_gene_ray pid=20988)
initializing: 100%|██████████| 14548/14548 [19:23<00:00, 12.51it/s]
Running using 20 cores: 100%|██████████| 14548/14548 [00:50<00:00, 288.45it/s]
2022-01-04 18:41:13,309 R2G INFO Took 1234.3179178237915 seconds
2022-01-04 18:41:26,108 R2G INFO Done!
[32]:
# Save
import pickle
with open('/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/scenicplus/scplus_obj.pkl', 'wb') as f:
pickle.dump(scplus_obj, f)
4. Infer TF to gene relationships
The next step is to infer relationships between TFs and target genes based on expression. We will use similar approaches as for the enhancer-to-gene relationships (GBM/RF and correlation).
[33]:
import pickle
infile = open('/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/scenicplus/scplus_obj.pkl', 'rb')
scplus_obj = pickle.load(infile)
infile.close()
[ ]:
from scenicplus.TF_to_gene import *
tf_file = '/staging/leuven/stg_00002/lcb/cflerin/resources/allTFs_hg38.txt'
calculate_TFs_to_genes_relationships(scplus_obj,
tf_file = tf_file,
ray_n_cpu = 20,
method = 'GBM',
_temp_dir = '/scratch/leuven/313/vsc31305/ray_spill',
key= 'TF2G_adj')
[ ]:
# Save
import pickle
with open('/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/scenicplus/scplus_obj.pkl', 'wb') as f:
pickle.dump(scplus_obj, f)
If you have run SCENIC before in the gene expression matrix, it is possible to directly load adjancencies from that pipeline:
[ ]:
load_TF2G_adj_from_file(scplus_obj,
f_adj = '/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/rna/vsn/single_sample_scenic_HQ/out/scenic/10x_multiome_brain_HQ/arboreto_with_multiprocessing/10x_multiome_brain_HQ__adj.tsv',
inplace = True,
key= 'TF2G_adj')
5. Build eGRNs
The last step is to build the eGRNs using a recovery approach (GSEA). The ranking to use will be based on the TF-2-gene importances, while gene sets will be derived with different thresholding methods on the enhancer-to-gene relationships and the unfiltered cistromes.
[1]:
import pickle
infile = open('/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/scenicplus/scplus_obj.pkl', 'rb')
scplus_obj = pickle.load(infile)
infile.close()
[ ]:
# Load functions
from scenicplus.grn_builder.gsea_approach import build_grn
build_grn(scplus_obj,
min_target_genes = 10,
adj_pval_thr = 1,
min_regions_per_gene = 0,
quantiles = (0.85, 0.90, 0.95),
top_n_regionTogenes_per_gene = (5, 10, 15),
top_n_regionTogenes_per_region = (),
binarize_using_basc = True,
rho_dichotomize_tf2g = True,
rho_dichotomize_r2g = True,
rho_dichotomize_eregulon = True,
rho_threshold = 0.05,
keep_extended_motif_annot = True,
merge_eRegulons = True,
order_regions_to_genes_by = 'importance',
order_TFs_to_genes_by = 'importance',
key_added = 'eRegulons_importance',
cistromes_key = 'Unfiltered',
disable_tqdm = False, #If running in notebook, set to True
ray_n_cpu = 20,
_temp_dir = '/scratch/leuven/313/vsc31305/ray_spill')
To access the eGRNs:
[ ]:
import dill
with open('/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/scenicplus/scplus_obj.pkl', 'wb') as f:
dill.dump(scplus_obj, f)
6. Exploring SCENIC+ results
A. Generate eRegulon metadata
As a first step, we will format the eGRN metadata. This will integrate the results from the inference of enhancer-to-gene and TF-to-gene relationships and the eRegulon construction in one pandas dataframe that can be used for further exploration.
[1]:
import dill
infile = open('/staging/leuven/stg_00002/lcb/cbravo/Multiomics_pipeline/analysis/10x_multiome_brain/output/scenicplus/scplus_obj.pkl', 'rb')
scplus_obj = dill.load(infile)
infile.close()
[ ]:
from scenicplus.utils import format_egrns
format_egrns(scplus_obj, eregulons_key = 'eRegulons_importance', TF2G_key = 'TF2G_adj', key_added = 'eRegulon_metadata')
The eRegulon metadata will look as:
[4]:
scplus_obj.uns['eRegulon_metadata'][0:10]
[4]:
Region_signature_name | Gene_signature_name | TF | is_extended | Region | Gene | R2G_importance | R2G_rho | R2G_importance_x_rho | R2G_importance_x_abs_rho | TF2G_importance | TF2G_regulation | TF2G_rho | TF2G_importance_x_abs_rho | TF2G_importance_x_rho | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ACO1_+_+_(24r) | ACO1_+_+_(23g) | ACO1 | False | chr1:244918651-244919151 | HNRNPU | 0.021839 | 0.085878 | 0.001876 | 0.001876 | 0.236510 | 1 | 0.122128 | 0.028884 | 0.028884 |
1 | ACO1_+_+_(24r) | ACO1_+_+_(23g) | ACO1 | False | chr10:88991114-88991614 | FAS | 0.023378 | 0.215477 | 0.005037 | 0.005037 | 0.457794 | 1 | 0.162854 | 0.074554 | 0.074554 |
2 | ACO1_+_+_(24r) | ACO1_+_+_(23g) | ACO1 | False | chr10:88990599-88991099 | FAS | 0.031784 | 0.212687 | 0.006760 | 0.006760 | 0.457794 | 1 | 0.162854 | 0.074554 | 0.074554 |
3 | ACO1_+_+_(24r) | ACO1_+_+_(23g) | ACO1 | False | chrX:107676906-107677406 | MID2 | 0.021056 | 0.098110 | 0.002066 | 0.002066 | 0.663257 | 1 | 0.165609 | 0.109842 | 0.109842 |
4 | ACO1_+_+_(24r) | ACO1_+_+_(23g) | ACO1 | False | chr16:1556214-1556714 | IFT140 | 0.015752 | 0.098841 | 0.001557 | 0.001557 | 0.336581 | 1 | 0.095788 | 0.032240 | 0.032240 |
5 | ACO1_+_+_(24r) | ACO1_+_+_(23g) | ACO1 | False | chr10:103692953-103693453 | SH3PXD2A | 0.004005 | 0.146456 | 0.000587 | 0.000587 | 4.731638 | 1 | 0.200345 | 0.947960 | 0.947960 |
6 | ACO1_+_+_(24r) | ACO1_+_+_(23g) | ACO1 | False | chr19:57527137-57527637 | ZNF549 | 0.078793 | 0.083988 | 0.006618 | 0.006618 | 0.522762 | 1 | 0.154494 | 0.080764 | 0.080764 |
7 | ACO1_+_+_(24r) | ACO1_+_+_(23g) | ACO1 | False | chr22:36323698-36324198 | TXN2 | 0.020686 | 0.073909 | 0.001529 | 0.001529 | 1.298719 | 1 | 0.160698 | 0.208702 | 0.208702 |
8 | ACO1_+_+_(24r) | ACO1_+_+_(23g) | ACO1 | False | chr6:15365250-15365750 | JARID2 | 0.025076 | 0.174829 | 0.004384 | 0.004384 | 0.678238 | 1 | 0.231925 | 0.157300 | 0.157300 |
9 | ACO1_+_+_(24r) | ACO1_+_+_(23g) | ACO1 | False | chr10:126525383-126525883 | ADAM12 | 0.027865 | 0.067947 | 0.001893 | 0.001893 | 0.233867 | 1 | 0.090295 | 0.021117 | 0.021117 |
The first sign in the eRegulon name indicates the relationship between TF and gene; while the second indicates the relationship between region and gene. Additional columns can be added by the user, for example the enrichment of the TF motif in the regions.
B. Assesing eGRN enrichment in cells
Next, we can score the eRegulons in the cells to assess how enriched target regions and target genes are enriched in each cell. We will score the region-based regulons on the scATAC-seq layer, while gene-based eRegulons will be scored using the transcriptome layer.
[40]:
# Format eRegulons
from scenicplus.eregulon_enrichment import *
get_eRegulons_as_signatures(scplus_obj, eRegulon_metadata_key='eRegulon_metadata', key_added='eRegulon_signatures')
[6]:
## Score chromatin layer
# Region based raking
from scenicplus.cistromes import *
import time
start_time = time.time()
region_ranking = make_rankings(scplus_obj, target='region')
# Score region regulons
score_eRegulons(scplus_obj,
ranking = region_ranking,
eRegulon_signatures_key = 'eRegulon_signatures',
key_added = 'eRegulon_AUC',
enrichment_type= 'region',
auc_threshold = 0.05,
normalize = False,
n_cpu = 1)
time = time.time()-start_time
print(time/60)
100%|██████████| 832/832 [00:26<00:00, 31.16it/s]
2.8363844593365988
[7]:
## Score transcriptome layer
# Gene based raking
from scenicplus.cistromes import *
import time
start_time = time.time()
gene_ranking = make_rankings(scplus_obj, target='gene')
# Score gene regulons
score_eRegulons(scplus_obj,
gene_ranking,
eRegulon_signatures_key = 'eRegulon_signatures',
key_added = 'eRegulon_AUC',
enrichment_type = 'gene',
auc_threshold = 0.05,
normalize= False,
n_cpu = 1)
time = time.time()-start_time
print(time/60)
100%|██████████| 832/832 [00:07<00:00, 104.80it/s]
0.21863729159037273
C. Assessing TF-eGRN relationships
Next we can assess the relationship between the TF expression and the eRegulons; in other words, whether genes expressed/repressed and regions accessible/closed when the TF is present. Due to the amount of drop-outs, and the variability in cell types proportions, using directly the AUC cistrome matrix can result in noisy correlations. Here, we use pseudobulks, in which we sample a number of cells per cell type. In this example, we merge 5 cells per pseudobulk and generate 100 pseudobulks per cell type.
[8]:
# Generate pseudobulks
import time
start_time = time.time()
generate_pseudobulks(scplus_obj,
variable = 'ACC_Seurat_cell_type',
auc_key = 'eRegulon_AUC',
signature_key = 'Gene_based',
nr_cells = 5,
nr_pseudobulks = 100,
seed=555)
generate_pseudobulks(scplus_obj,
variable = 'ACC_Seurat_cell_type',
auc_key = 'eRegulon_AUC',
signature_key = 'Region_based',
nr_cells = 5,
nr_pseudobulks = 100,
seed=555)
time = time.time()-start_time
print(time/60)
1.9503313183784485
[9]:
# Correlation between TF and eRegulons
import time
start_time = time.time()
TF_cistrome_correlation(scplus_obj,
variable = 'ACC_Seurat_cell_type',
auc_key = 'eRegulon_AUC',
signature_key = 'Gene_based',
out_key = 'ACC_Seurat_cell_type_eGRN_gene_based')
TF_cistrome_correlation(scplus_obj,
variable = 'ACC_Seurat_cell_type',
auc_key = 'eRegulon_AUC',
signature_key = 'Region_based',
out_key = 'ACC_Seurat_cell_type_eGRN_region_based')
time = time.time()-start_time
print(time/60)
0.032043548425038655
We can plot eRegulon enrichment versus TF expression for each pseudobulk.
[11]:
# Region based
%matplotlib inline
import seaborn as sns
sns.set_style("white")
colors = ["#E9842C","#F8766D", "#BC9D00", "#00C0B4", "#9CA700", "#6FB000", "#00B813", "#00BD61", "#00C08E", "#00BDD4",
"#00A7FF", "#7F96FF", "#E26EF7", "#FF62BF", "#D69100", "#BC81FF"]
categories = sorted(set(scplus_obj.metadata_cell['ACC_Seurat_cell_type']))
color_dict = dict(zip(categories, colors[0:len(categories)]))
prune_plot(scplus_obj,
'SOX5_-_+',
pseudobulk_variable = 'ACC_Seurat_cell_type',
show_dot_plot = True,
show_line_plot = False,
color_dict = color_dict,
use_pseudobulk = True,
auc_key = 'eRegulon_AUC',
signature_key = 'Region_based',
seed=555)

[12]:
# Gene based
%matplotlib inline
sns.set_style("white")
colors = ["#E9842C","#F8766D", "#BC9D00", "#00C0B4", "#9CA700", "#6FB000", "#00B813", "#00BD61", "#00C08E", "#00BDD4",
"#00A7FF", "#7F96FF", "#E26EF7", "#FF62BF", "#D69100", "#BC81FF"]
categories = sorted(set(scplus_obj.metadata_cell['ACC_Seurat_cell_type']))
color_dict = dict(zip(categories, colors[0:len(categories)]))
prune_plot(scplus_obj,
'SOX5_-_+',
pseudobulk_variable = 'ACC_Seurat_cell_type',
show_dot_plot = True,
show_line_plot = False,
color_dict = color_dict,
use_pseudobulk = True,
auc_key = 'eRegulon_AUC',
signature_key = 'Gene_based',
seed=555)

D. Identification of high quality regulons
We will select a subset of regulons based on the correlation between the region based and the gene based regulons. We will only use the extended eRegulon if there is not a direct eRegulon available.
[46]:
# Correlation between region based regulons and gene based regulons
import pandas
df1 = scplus_obj.uns['eRegulon_AUC']['Gene_based'].copy()
df2 = scplus_obj.uns['eRegulon_AUC']['Region_based'].copy()
df1.columns = [x.split('_(')[0] for x in df1.columns]
df2.columns = [x.split('_(')[0] for x in df2.columns]
correlations = df1.corrwith(df2, axis = 0)
correlations = correlations[abs(correlations) > 0.6]
# Kepp only R2G +
keep = [x for x in correlations.index if '+_+' in x] + [x for x in correlations.index if '-_+' in x]
# Keep extended if not direct
extended = [x for x in keep if 'extended' in x]
direct = [x for x in keep if not 'extended' in x]
keep_extended = [x for x in extended if not x.replace('extended_', '') in direct]
keep = direct + keep_extended
# Keep regulons with more than 10 genes
keep_gene = [x for x in scplus_obj.uns['eRegulon_AUC']['Gene_based'].columns if x.split('_(')[0] in keep]
keep_gene = [x for x in keep_gene if (int(x.split('_(')[1].replace('g)', '')) > 10)]
keep_all = [x.split('_(')[0] for x in keep_gene]
keep_region = [x for x in scplus_obj.uns['eRegulon_AUC']['Region_based'].columns if x.split('_(')[0] in keep]
scplus_obj.uns['selected_eRegulons'] = {}
scplus_obj.uns['selected_eRegulons']['Gene_based'] = keep_gene
scplus_obj.uns['selected_eRegulons']['Region_based'] = keep_region
[47]:
len(keep_gene)
[47]:
114
E. Overlap between eRegulons
In addition, to assess which eRegulons tend to be enriched in the same group of cells we can generate a correlation plot as well.
[49]:
from scenicplus.plotting.correlation_plot import *
correlation_heatmap(scplus_obj,
auc_key = 'eRegulon_AUC',
signature_keys = ['Gene_based'],
selected_regulons = scplus_obj.uns['selected_eRegulons']['Gene_based'],
fcluster_threshold = 0.1,
fontsize = 3)