Tutorial: 10x multiome pbmc
Contents
Tutorial: 10x multiome pbmc
The data consists of PBMC from a Healthy Donor - Granulocytes Removed Through Cell Sorting (3k) which is freely available from 10x Genomics (click here, some personal information needs to be provided before you can gain access to the data). This is a multi-ome dataset.
Note:
In this notebook we will only show the minimal steps needed for running the SCENIC+ analysis. For more information on analysing scRNA-seq data and scATAC-seq data we refer the reader to other tutorials (e.g. Scanpy and pycisTopic in python or Seurat and cisTopic or Signac in R).
Set-up environment and download data
We will first create a directory to store the data and results
[1]:
#supress warnings
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
import sys
import os
_stderr = sys.stderr
null = open(os.devnull,'wb')
[43]:
!mkdir -p pbmc_tutorial/data
!wget -O pbmc_tutorial/data/pbmc_granulocyte_sorted_3k_filtered_feature_bc_matrix.h5 https://cf.10xgenomics.com/samples/cell-arc/2.0.0/pbmc_granulocyte_sorted_3k/pbmc_granulocyte_sorted_3k_filtered_feature_bc_matrix.h5
!wget -O pbmc_tutorial/data/pbmc_granulocyte_sorted_3k_atac_fragments.tsv.gz https://cf.10xgenomics.com/samples/cell-arc/2.0.0/pbmc_granulocyte_sorted_3k/pbmc_granulocyte_sorted_3k_atac_fragments.tsv.gz
--2022-08-04 13:20:33-- https://cf.10xgenomics.com/samples/cell-arc/2.0.0/pbmc_granulocyte_sorted_3k/pbmc_granulocyte_sorted_3k_filtered_feature_bc_matrix.h5
Resolving cf.10xgenomics.com (cf.10xgenomics.com)... 2606:4700::6812:ad, 2606:4700::6812:1ad, 104.18.1.173, ...
Connecting to cf.10xgenomics.com (cf.10xgenomics.com)|2606:4700::6812:ad|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 38844318 (37M) [binary/octet-stream]
Saving to: ‘pbmc_tutorial/data/pbmc_granulocyte_sorted_3k_filtered_feature_bc_matrix.h5’
100%[======================================>] 38,844,318 16.1MB/s in 2.3s
2022-08-04 13:20:36 (16.1 MB/s) - ‘pbmc_tutorial/data/pbmc_granulocyte_sorted_3k_filtered_feature_bc_matrix.h5’ saved [38844318/38844318]
--2022-08-04 13:20:36-- https://cf.10xgenomics.com/samples/cell-arc/2.0.0/pbmc_granulocyte_sorted_3k/pbmc_granulocyte_sorted_3k_atac_fragments.tsv.gz
Resolving cf.10xgenomics.com (cf.10xgenomics.com)... 2606:4700::6812:ad, 2606:4700::6812:1ad, 104.18.1.173, ...
Connecting to cf.10xgenomics.com (cf.10xgenomics.com)|2606:4700::6812:ad|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 467587065 (446M) [text/tab-separated-values]
Saving to: ‘pbmc_tutorial/data/pbmc_granulocyte_sorted_3k_atac_fragments.tsv.gz’
100%[======================================>] 467,587,065 47.9MB/s in 18s
2022-08-04 13:20:55 (24.2 MB/s) - ‘pbmc_tutorial/data/pbmc_granulocyte_sorted_3k_atac_fragments.tsv.gz’ saved [467587065/467587065]
[111]:
import os
work_dir = 'pbmc_tutorial'
scRNA-seq preprocessing using Scanpy
First we preprocess the scRNA-seq side of the mutliome datasets. Most importantly we will use this side of the data to annotate celltypes.
For this we will make use of Scanpy.
Note:
You may also use Seurat (or any other tool in fact) to preprocess your data, however this will require some extra steps to import the data in python.
Note:
Further on in the actual SCENIC+ analysis the raw count matrix will be used.
[112]:
import scanpy as sc
#set some figure parameters for nice display inside jupyternotebooks.
%matplotlib inline
sc.settings.set_figure_params(dpi=80, frameon=False, figsize=(5, 5), facecolor='white')
#make a directory for to store the processed scRNA-seq data.
if not os.path.exists(os.path.join(work_dir, 'scRNA')):
os.makedirs(os.path.join(work_dir, 'scRNA'))
Read in the scRNA-seq count matrix into AnnData object.
[113]:
adata = sc.read_10x_h5(os.path.join(work_dir, 'data/pbmc_granulocyte_sorted_3k_filtered_feature_bc_matrix.h5'))
adata.var_names_make_unique()
adata
/user/leuven/330/vsc33053/sdewin/Programs/anaconda3/envs/scenicplus/lib/python3.8/site-packages/anndata/_core/anndata.py:1830: UserWarning: Variable names are not unique. To make them unique, call `.var_names_make_unique`.
[113]:
AnnData object with n_obs × n_vars = 2711 × 36601
var: 'gene_ids', 'feature_types', 'genome'
Basic quality control
Only keep cells with at least 200 genes expressed and only keep genes which are expressed in at least 3 cells.
[ ]:
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
Optionally, predict and filter out doublets using Scrublet.
[ ]:
sc.external.pp.scrublet(adata) #estimates doublets
/user/leuven/330/vsc33053/sdewin/Programs/anaconda3/envs/scenicplus/lib/python3.8/site-packages/scanpy/preprocessing/_normalization.py:155: UserWarning: Revieved a view of an AnnData. Making a copy.
view_to_actual(adata)
Could not find library "annoy" for approx. nearest neighbor search
Automatically set threshold at doublet score = 0.23
Detected doublet rate = 2.4%
Estimated detectable doublet fraction = 61.1%
Overall doublet rate:
Expected = 5.0%
Estimated = 3.9%
[ ]:
adata = adata[adata.obs['predicted_doublet'] == False] #do the actual filtering
adata
View of AnnData object with n_obs × n_vars = 2635 × 21255
obs: 'n_genes', 'doublet_score', 'predicted_doublet'
var: 'gene_ids', 'feature_types', 'genome', 'n_cells'
uns: 'scrublet'
Filter based on mitochondrial counts and total counts.
[ ]:
adata.var['mt'] = adata.var_names.str.startswith('MT-') # annotate the group of mitochondrial genes as 'mt'
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True)
import matplotlib.pyplot as plt
mito_filter = 25
n_counts_filter = 4300
fig, axs = plt.subplots(ncols = 2, figsize = (8,4))
sc.pl.scatter(adata, x='total_counts', y='pct_counts_mt', ax = axs[0], show=False)
sc.pl.scatter(adata, x='total_counts', y='n_genes_by_counts', ax = axs[1], show = False)
#draw horizontal red lines indicating thresholds.
axs[0].hlines(y = mito_filter, xmin = 0, xmax = max(adata.obs['total_counts']), color = 'red', ls = 'dashed')
axs[1].hlines(y = n_counts_filter, xmin = 0, xmax = max(adata.obs['total_counts']), color = 'red', ls = 'dashed')
fig.tight_layout()
plt.show()
/local_scratch/tmp-vsc33053/ipykernel_34327/3790233708.py:1: ImplicitModificationWarning: Trying to modify attribute `.var` of view, initializing view as actual.
adata.var['mt'] = adata.var_names.str.startswith('MT-') # annotate the group of mitochondrial genes as 'mt'

[ ]:
adata = adata[adata.obs.n_genes_by_counts < n_counts_filter, :]
adata = adata[adata.obs.pct_counts_mt < mito_filter, :]
adata
View of AnnData object with n_obs × n_vars = 2606 × 21255
obs: 'n_genes', 'doublet_score', 'predicted_doublet', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt'
var: 'gene_ids', 'feature_types', 'genome', 'n_cells', 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts'
uns: 'scrublet'
Data normalization
Note:
Below the data will be normalized and scaled. This is only for visualization purposes. For the actual SCENIC+ analysis we will use the raw count matrix. For this reason we save the non-normalized and non-scaled AnnData object in the raw slot before carying on.
[ ]:
adata.raw = adata
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
adata = adata[:, adata.var.highly_variable]
sc.pp.scale(adata, max_value=10)
/user/leuven/330/vsc33053/sdewin/Programs/anaconda3/envs/scenicplus/lib/python3.8/site-packages/scanpy/preprocessing/_simple.py:843: UserWarning: Revieved a view of an AnnData. Making a copy.
view_to_actual(adata)
Cell type annotation
Here we use a pre-annotated dataset as reference (this is actually the processed data from the main scanpy tutorial: https://scanpy-tutorials.readthedocs.io/en/latest/pbmc3k.html) to transfer labels to unannotated dataset using ingest.
This is an important step of the preprocessing part because these annotations will be used in pycisTopic to generate pseudobulk ATAC profiles and call peaks.
[ ]:
adata_ref = sc.datasets.pbmc3k_processed() #use the preprocessed data from the Scanpy tutorial as reference
var_names = adata_ref.var_names.intersection(adata.var_names) #use genes which are present in both assays
adata_ref = adata_ref[:, var_names]
adata = adata[:, var_names]
sc.pp.pca(adata_ref) #calculate PCA embedding
sc.pp.neighbors(adata_ref) #calculate neighborhood graph
sc.tl.umap(adata_ref) #calculate umap embedding
sc.tl.ingest(adata, adata_ref, obs='louvain') #run label transfer
adata.obs.rename({'louvain': 'ingest_celltype_label'}, inplace = True, axis = 1)
Let’s visualize the labels
[ ]:
sc.tl.pca(adata, svd_solver='arpack')
sc.pl.pca_variance_ratio(adata, log=True)

[ ]:
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=10)
sc.tl.umap(adata)
sc.pl.umap(adata, color = 'ingest_celltype_label')

We can clean the annotation a bit by running a clustering and assigning clusters to cell types based on maximum overlap.
[ ]:
sc.tl.leiden(adata, resolution = 0.8, key_added = 'leiden_res_0.8')
sc.pl.umap(adata, color = 'leiden_res_0.8')

[ ]:
tmp_df = adata.obs.groupby(['leiden_res_0.8', 'ingest_celltype_label']).size().unstack(fill_value=0)
tmp_df = (tmp_df / tmp_df.sum(0)).fillna(0)
leiden_to_annotation = tmp_df.idxmax(1).to_dict()
leiden_to_annotation
{'0': 'CD4 T cells',
'1': 'CD4 T cells',
'2': 'CD4 T cells',
'3': 'CD14+ Monocytes',
'4': 'CD14+ Monocytes',
'5': 'CD8 T cells',
'6': 'CD14+ Monocytes',
'7': 'B cells',
'8': 'FCGR3A+ Monocytes',
'9': 'NK cells',
'10': 'Dendritic cells',
'11': 'B cells'}
Note, there are two clusters annotated as B cells (cluster 7 and cluster 11). Let’s call these cells B cells 1 and B cells 2.
Let’s also remove the spaces from the cell type labels.
[ ]:
leiden_to_annotation['7'] = 'B cells 1'
leiden_to_annotation['11'] = 'B cells 2'
leiden_to_annotation = {cluster: leiden_to_annotation[cluster].replace(' ', '_') for cluster in leiden_to_annotation.keys()}
leiden_to_annotation
{'0': 'CD4_T_cells',
'1': 'CD4_T_cells',
'2': 'CD4_T_cells',
'3': 'CD14+_Monocytes',
'4': 'CD14+_Monocytes',
'5': 'CD8_T_cells',
'6': 'CD14+_Monocytes',
'7': 'B_cells_1',
'8': 'FCGR3A+_Monocytes',
'9': 'NK_cells',
'10': 'Dendritic_cells',
'11': 'B_cells_2'}
Add the annotation to the AnnData object.
[ ]:
adata.obs['celltype'] = [leiden_to_annotation[cluster_id] for cluster_id in adata.obs['leiden_res_0.8']]
del(leiden_to_annotation)
del(tmp_df)
[ ]:
sc.pl.umap(adata, color = 'celltype')

Save results
[ ]:
adata.write(os.path.join(work_dir, 'scRNA/adata.h5ad'), compression='gzip')
We now have preprocessed the scRNA-seq side of the multiome data.
In particular we have:
fitlered the data to only contain high quality cells.
annotated cells to cell types.
We also did some preliminary visualization of the data for which we needed to normalize the gene expression counts. Note that SCENIC+ uses the raw gene expression counts (i.e. without normalization and scaling). We have kept this raw data in adata.raw
.
Now that we have clusters of annotated cells we can continue with preprocessing the scATAC-seq data. There we will use the annotated clusters of cells to generate pseudobulk ATAC-seq profiles per cell type which will be used for peak calling.
scATAC-seq preprocessing using pycisTopic
Now we will preprocess the scATAC-seq side of the multiome data.
Most importantly we will:
generate pseudobulk ATAC-seq profiles per cell type and call peaks
merge these peaks into a consensus peak-set
do quality control on the scATAC-seq barcodes
run topic modeling to find sets of co-accessible regions and to impute chromatin accessibility resolving the issue of drop outs
For this we will use the python package pycisTopic.
Note:
pycisTopic can also be used for independent analysis of scATAC-seq data and has many more features which will not be demonstrated here. For more information see the read the docs page of pycisTopic
[3]:
import os
work_dir = 'pbmc_tutorial'
import pycisTopic
#set some figure parameters for nice display inside jupyternotebooks.
%matplotlib inline
#make a directory for to store the processed scRNA-seq data.
if not os.path.exists(os.path.join(work_dir, 'scATAC')):
os.makedirs(os.path.join(work_dir, 'scATAC'))
tmp_dir = '/scratch/leuven/330/vsc33053/'
Specify the location of the ATAC fragments file, this is the main input into pycisTopic.
[4]:
fragments_dict = {'10x_pbmc': os.path.join(work_dir, 'data/pbmc_granulocyte_sorted_3k_atac_fragments.tsv.gz')}
Generate pseudobulk ATAC-seq profiles, call peaks and generate a consensus peak set
We will use the cell type labels from the scRNA-seq side of the data to generate pseudobulk profiles per cell type. These pseudobulk profiles will be used to call peaks which will be merged in one consensus peak set.
The advantage of calling peaks per cell type (instead of calling peaks on the whole bulk profile at once) is that, for rare cell types, there is a risk that the ATAC signal of these cells might be confused for noise by the peak caller when viewed in the whole bulk profile. Calling peaks per cell type helps resolving these rare peaks and thus increases the resolution of the data.
We will first load the cell type annotation we generated in the scRNA-seq analysis above.
[5]:
import scanpy as sc
adata = sc.read_h5ad(os.path.join(work_dir, 'scRNA/adata.h5ad'))
cell_data = adata.obs
cell_data['sample_id'] = '10x_pbmc'
cell_data['celltype'] = cell_data['celltype'].astype(str) # set data type of the celltype column to str, otherwise the export_pseudobulk function will complain.
del(adata)
Next, we will generate the pseudobulk profiles.
This will generate two sets of files:
pseudobulk_bed_files: pseudobulk profiles stored in bed format.
pseudobulk_bw_files: pseudobulk profiles stored in BigWig format.
The BigWig files are useful for visualization in IGV or UCSC genome browser.
[6]:
# Get chromosome sizes (for hg38 here)
import pyranges as pr
import requests
import pandas as pd
target_url='http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes'
chromsizes=pd.read_csv(target_url, sep='\t', header=None)
chromsizes.columns=['Chromosome', 'End']
chromsizes['Start']=[0]*chromsizes.shape[0]
chromsizes=chromsizes.loc[:,['Chromosome', 'Start', 'End']]
# Exceptionally in this case, to agree with CellRangerARC annotations
chromsizes['Chromosome'] = [chromsizes['Chromosome'][x].replace('v', '.') for x in range(len(chromsizes['Chromosome']))]
chromsizes['Chromosome'] = [chromsizes['Chromosome'][x].split('_')[1] if len(chromsizes['Chromosome'][x].split('_')) > 1 else chromsizes['Chromosome'][x] for x in range(len(chromsizes['Chromosome']))]
chromsizes=pr.PyRanges(chromsizes)
[7]:
from pycisTopic.pseudobulk_peak_calling import export_pseudobulk
bw_paths, bed_paths = export_pseudobulk(input_data = cell_data,
variable = 'celltype', # variable by which to generate pseubulk profiles, in this case we want pseudobulks per celltype
sample_id_col = 'sample_id',
chromsizes = chromsizes,
bed_path = os.path.join(work_dir, 'scATAC/consensus_peak_calling/pseudobulk_bed_files/'), # specify where pseudobulk_bed_files should be stored
bigwig_path = os.path.join(work_dir, 'scATAC/consensus_peak_calling/pseudobulk_bw_files/'),# specify where pseudobulk_bw_files should be stored
path_to_fragments = fragments_dict, # location of fragment fiels
n_cpu = 8, # specify the number of cores to use, we use ray for multi processing
normalize_bigwig = True,
remove_duplicates = True,
_temp_dir = os.path.join(tmp_dir, 'ray_spill'),
split_pattern = '-')
2022-08-09 17:56:46,100 cisTopic INFO Reading fragments from pbmc_tutorial/data/pbmc_granulocyte_sorted_3k_atac_fragments.tsv.gz
2022-08-09 17:57:36,579 INFO services.py:1470 -- View the Ray dashboard at http://127.0.0.1:8265
(raylet) cut: write error: Broken pipe
(export_pseudobulk_ray pid=13683) 2022-08-09 17:57:43,021 cisTopic INFO Creating pseudobulk for B_cells_1
(export_pseudobulk_ray pid=13679) 2022-08-09 17:57:43,404 cisTopic INFO Creating pseudobulk for B_cells_2
(export_pseudobulk_ray pid=13677) 2022-08-09 17:57:43,888 cisTopic INFO Creating pseudobulk for CD14_Monocytes
(export_pseudobulk_ray pid=13681) 2022-08-09 17:57:44,305 cisTopic INFO Creating pseudobulk for CD4_T_cells
(export_pseudobulk_ray pid=13682) 2022-08-09 17:57:44,706 cisTopic INFO Creating pseudobulk for CD8_T_cells
(export_pseudobulk_ray pid=13680) 2022-08-09 17:57:45,230 cisTopic INFO Creating pseudobulk for Dendritic_cells
(export_pseudobulk_ray pid=13684) 2022-08-09 17:57:45,726 cisTopic INFO Creating pseudobulk for FCGR3A_Monocytes
(export_pseudobulk_ray pid=13678) 2022-08-09 17:57:46,225 cisTopic INFO Creating pseudobulk for NK_cells
(export_pseudobulk_ray pid=13679) 2022-08-09 17:57:52,695 cisTopic INFO B_cells_2 done!
(export_pseudobulk_ray pid=13680) 2022-08-09 17:58:00,289 cisTopic INFO Dendritic_cells done!
(export_pseudobulk_ray pid=13678) 2022-08-09 17:58:01,561 cisTopic INFO NK_cells done!
(export_pseudobulk_ray pid=13684) 2022-08-09 17:58:08,569 cisTopic INFO FCGR3A_Monocytes done!
(export_pseudobulk_ray pid=13682) 2022-08-09 17:58:24,903 cisTopic INFO CD8_T_cells done!
(export_pseudobulk_ray pid=13683) 2022-08-09 17:58:29,227 cisTopic INFO B_cells_1 done!
(export_pseudobulk_ray pid=13677) 2022-08-09 18:00:49,939 cisTopic INFO CD14_Monocytes done!
Save location to bed and bigwig files for later access.
[ ]:
import pickle
pickle.dump(bed_paths,
open(os.path.join(work_dir, 'scATAC/consensus_peak_calling/pseudobulk_bed_files/bed_paths.pkl'), 'wb'))
pickle.dump(bw_paths,
open(os.path.join(work_dir, 'scATAC/consensus_peak_calling/pseudobulk_bed_files/bw_paths.pkl'), 'wb'))
Call peaks per pseudobulk profile
[8]:
import pickle
bed_paths = pickle.load(open(os.path.join(work_dir, 'scATAC/consensus_peak_calling/pseudobulk_bed_files/bed_paths.pkl'), 'rb'))
bw_paths = pickle.load(open(os.path.join(work_dir, 'scATAC/consensus_peak_calling/pseudobulk_bed_files/bw_paths.pkl'), 'rb'))
from pycisTopic.pseudobulk_peak_calling import peak_calling
macs_path='macs2'
# Run peak calling
narrow_peaks_dict = peak_calling(macs_path,
bed_paths,
os.path.join(work_dir, 'scATAC/consensus_peak_calling/MACS/'),
genome_size='hs',
n_cpu=8,
input_format='BEDPE',
shift=73,
ext_size=146,
keep_dup = 'all',
q_value = 0.05,
_temp_dir = os.path.join(tmp_dir, 'ray_spill'))
2022-08-09 18:03:22,921 INFO services.py:1470 -- View the Ray dashboard at http://127.0.0.1:8265
(raylet) cut: write error: Broken pipe
(macs_call_peak_ray pid=29402) 2022-08-09 18:03:28,395 cisTopic INFO Calling peaks for CD8_T_cells with macs2 callpeak --treatment pbmc_tutorial/scATAC/consensus_peak_calling/pseudobulk_bed_files/CD8_T_cells.bed.gz --name CD8_T_cells --outdir pbmc_tutorial/scATAC/consensus_peak_calling/MACS/ --format BEDPE --gsize hs --qvalue 0.05 --nomodel --shift 73 --extsize 146 --keep-dup all --call-summits --nolambda
(macs_call_peak_ray pid=29408) 2022-08-09 18:03:28,439 cisTopic INFO Calling peaks for B_cells_2 with macs2 callpeak --treatment pbmc_tutorial/scATAC/consensus_peak_calling/pseudobulk_bed_files/B_cells_2.bed.gz --name B_cells_2 --outdir pbmc_tutorial/scATAC/consensus_peak_calling/MACS/ --format BEDPE --gsize hs --qvalue 0.05 --nomodel --shift 73 --extsize 146 --keep-dup all --call-summits --nolambda
(macs_call_peak_ray pid=29405) 2022-08-09 18:03:28,394 cisTopic INFO Calling peaks for B_cells_1 with macs2 callpeak --treatment pbmc_tutorial/scATAC/consensus_peak_calling/pseudobulk_bed_files/B_cells_1.bed.gz --name B_cells_1 --outdir pbmc_tutorial/scATAC/consensus_peak_calling/MACS/ --format BEDPE --gsize hs --qvalue 0.05 --nomodel --shift 73 --extsize 146 --keep-dup all --call-summits --nolambda
(macs_call_peak_ray pid=29403) 2022-08-09 18:03:28,504 cisTopic INFO Calling peaks for NK_cells with macs2 callpeak --treatment pbmc_tutorial/scATAC/consensus_peak_calling/pseudobulk_bed_files/NK_cells.bed.gz --name NK_cells --outdir pbmc_tutorial/scATAC/consensus_peak_calling/MACS/ --format BEDPE --gsize hs --qvalue 0.05 --nomodel --shift 73 --extsize 146 --keep-dup all --call-summits --nolambda
(macs_call_peak_ray pid=29406) 2022-08-09 18:03:28,471 cisTopic INFO Calling peaks for CD14_Monocytes with macs2 callpeak --treatment pbmc_tutorial/scATAC/consensus_peak_calling/pseudobulk_bed_files/CD14_Monocytes.bed.gz --name CD14_Monocytes --outdir pbmc_tutorial/scATAC/consensus_peak_calling/MACS/ --format BEDPE --gsize hs --qvalue 0.05 --nomodel --shift 73 --extsize 146 --keep-dup all --call-summits --nolambda
(macs_call_peak_ray pid=29404) 2022-08-09 18:03:28,471 cisTopic INFO Calling peaks for Dendritic_cells with macs2 callpeak --treatment pbmc_tutorial/scATAC/consensus_peak_calling/pseudobulk_bed_files/Dendritic_cells.bed.gz --name Dendritic_cells --outdir pbmc_tutorial/scATAC/consensus_peak_calling/MACS/ --format BEDPE --gsize hs --qvalue 0.05 --nomodel --shift 73 --extsize 146 --keep-dup all --call-summits --nolambda
(macs_call_peak_ray pid=29407) 2022-08-09 18:03:28,547 cisTopic INFO Calling peaks for CD4_T_cells with macs2 callpeak --treatment pbmc_tutorial/scATAC/consensus_peak_calling/pseudobulk_bed_files/CD4_T_cells.bed.gz --name CD4_T_cells --outdir pbmc_tutorial/scATAC/consensus_peak_calling/MACS/ --format BEDPE --gsize hs --qvalue 0.05 --nomodel --shift 73 --extsize 146 --keep-dup all --call-summits --nolambda
(macs_call_peak_ray pid=29401) 2022-08-09 18:03:28,558 cisTopic INFO Calling peaks for FCGR3A_Monocytes with macs2 callpeak --treatment pbmc_tutorial/scATAC/consensus_peak_calling/pseudobulk_bed_files/FCGR3A_Monocytes.bed.gz --name FCGR3A_Monocytes --outdir pbmc_tutorial/scATAC/consensus_peak_calling/MACS/ --format BEDPE --gsize hs --qvalue 0.05 --nomodel --shift 73 --extsize 146 --keep-dup all --call-summits --nolambda
(macs_call_peak_ray pid=29408) 2022-08-09 18:03:46,105 cisTopic INFO B_cells_2 done!
(macs_call_peak_ray pid=29403) 2022-08-09 18:03:53,171 cisTopic INFO NK_cells done!
(macs_call_peak_ray pid=29404) 2022-08-09 18:03:57,035 cisTopic INFO Dendritic_cells done!
(macs_call_peak_ray pid=29402) 2022-08-09 18:03:59,414 cisTopic INFO CD8_T_cells done!
(macs_call_peak_ray pid=29405) 2022-08-09 18:04:06,158 cisTopic INFO B_cells_1 done!
(macs_call_peak_ray pid=29401) 2022-08-09 18:04:08,467 cisTopic INFO FCGR3A_Monocytes done!
(macs_call_peak_ray pid=29406) 2022-08-09 18:04:53,000 cisTopic INFO CD14_Monocytes done!
(macs_call_peak_ray pid=29407) 2022-08-09 18:04:59,539 cisTopic INFO CD4_T_cells done!
[ ]:
pickle.dump(narrow_peaks_dict,
open(os.path.join(work_dir, 'scATAC/consensus_peak_calling/MACS/narrow_peaks_dict.pkl'), 'wb'))
Merge peaks into consensus peak set, for more info see pyCistopic read the docs.
[9]:
from pycisTopic.iterative_peak_calling import *
# Other param
peak_half_width = 250
path_to_blacklist= '../../pycisTopic/blacklist/hg38-blacklist.v2.bed'
# Get consensus peaks
consensus_peaks=get_consensus_peaks(narrow_peaks_dict, peak_half_width, chromsizes=chromsizes, path_to_blacklist=path_to_blacklist)
2022-08-09 18:05:09,046 cisTopic INFO Extending and merging peaks per class
2022-08-09 18:05:46,751 cisTopic INFO Normalizing peak scores
2022-08-09 18:05:46,984 cisTopic INFO Merging peaks
Warning! Start and End columns now have different dtypes: int64 and int32
2022-08-09 18:06:31,373 cisTopic INFO Done!
[10]:
consensus_peaks.to_bed(
path = os.path.join(work_dir, 'scATAC/consensus_peak_calling/consensus_regions.bed'),
keep=True,
compression='infer',
chain=False)
Quality control
Next we will calculate sample level and cell-barcode level quality control statistics.
Barcode level QC stats include (these stats will be used to filter good quality cell barcodes from bad quality ones):
Log number of unique fragments per cell barcode.
FRIP per cell barcode.
TSS enrichment per cell barcode.
Duplication rate per cell barcode.
[11]:
import pybiomart as pbm
dataset = pbm.Dataset(name='hsapiens_gene_ensembl', host='http://www.ensembl.org')
annot = dataset.query(attributes=['chromosome_name', 'transcription_start_site', 'strand', 'external_gene_name', 'transcript_biotype'])
annot['Chromosome/scaffold name'] = annot['Chromosome/scaffold name'].to_numpy(dtype = str)
filter = annot['Chromosome/scaffold name'].str.contains('CHR|GL|JH|MT')
annot = annot[~filter]
annot['Chromosome/scaffold name'] = annot['Chromosome/scaffold name'].str.replace(r'(\b\S)', r'chr\1')
annot.columns=['Chromosome', 'Start', 'Strand', 'Gene', 'Transcript_type']
annot = annot[annot.Transcript_type == 'protein_coding']
from pycisTopic.qc import *
path_to_regions = {'10x_pbmc':os.path.join(work_dir, 'scATAC/consensus_peak_calling/consensus_regions.bed')}
metadata_bc, profile_data_dict = compute_qc_stats(
fragments_dict = fragments_dict,
tss_annotation = annot,
stats=['barcode_rank_plot', 'duplicate_rate', 'insert_size_distribution', 'profile_tss', 'frip'],
label_list = None,
path_to_regions = path_to_regions,
n_cpu = 1,
valid_bc = None,
n_frag = 100,
n_bc = None,
tss_flank_window = 1000,
tss_window = 50,
tss_minimum_signal_window = 100,
tss_rolling_window = 10,
remove_duplicates = True,
_temp_dir = os.path.join(tmp_dir + 'ray_spill'))
if not os.path.exists(os.path.join(work_dir, 'scATAC/quality_control')):
os.makedirs(os.path.join(work_dir, 'scATAC/quality_control'))
pickle.dump(metadata_bc,
open(os.path.join(work_dir, 'scATAC/quality_control/metadata_bc.pkl'), 'wb'))
pickle.dump(profile_data_dict,
open(os.path.join(work_dir, 'scATAC/quality_control/profile_data_dict.pkl'), 'wb'))
2022-08-09 18:07:07,053 cisTopic INFO Reading 10x_pbmc
2022-08-09 18:07:44,742 cisTopic INFO Computing barcode rank plot for 10x_pbmc
2022-08-09 18:07:44,745 cisTopic INFO Counting fragments
2022-08-09 18:07:48,420 cisTopic INFO Marking barcodes with more than 100
2022-08-09 18:07:48,501 cisTopic INFO Returning plot data
2022-08-09 18:07:48,503 cisTopic INFO Returning valid barcodes
2022-08-09 18:07:51,836 cisTopic INFO Computing duplicate rate plot for 10x_pbmc
2022-08-09 18:07:55,910 cisTopic INFO Return plot data
2022-08-09 18:07:56,112 cisTopic INFO Computing insert size distribution for 10x_pbmc
2022-08-09 18:07:56,114 cisTopic INFO Counting fragments
2022-08-09 18:07:57,522 cisTopic INFO Returning plot data
2022-08-09 18:08:17,834 cisTopic INFO Computing TSS profile for 10x_pbmc
2022-08-09 18:08:20,165 cisTopic INFO Formatting annnotation
2022-08-09 18:08:20,235 cisTopic INFO Creating coverage matrix
2022-08-09 18:09:51,432 cisTopic INFO Coverage matrix done
2022-08-09 18:09:54,418 cisTopic INFO Returning normalized TSS coverage matrix per barcode
2022-08-09 18:09:56,789 cisTopic INFO Returning normalized sample TSS enrichment data
2022-08-09 18:09:56,978 cisTopic INFO Computing FRIP profile for 10x_pbmc
2022-08-09 18:09:57,718 cisTopic INFO Counting fragments
2022-08-09 18:10:02,780 cisTopic INFO Intersecting fragments with regions
2022-08-09 18:10:17,481 cisTopic INFO Sample 10x_pbmc done!
Filter cell barcodes.
[16]:
#[min, #max]
QC_filters = {
'Log_unique_nr_frag': [3.3 , None],
'FRIP': [0.45, None],
'TSS_enrichment': [5 , None],
'Dupl_rate': [None, None]
}
# Return figure to plot together with other metrics, and cells passing filters. Figure will be saved as pdf.
from pycisTopic.qc import *
FRIP_NR_FRAG_fig, FRIP_NR_FRAG_filter=plot_barcode_metrics(metadata_bc['10x_pbmc'],
var_x='Log_unique_nr_frag',
var_y='FRIP',
min_x=QC_filters['Log_unique_nr_frag'][0],
max_x=QC_filters['Log_unique_nr_frag'][1],
min_y=QC_filters['FRIP'][0],
max_y=QC_filters['FRIP'][1],
return_cells=True,
return_fig=True,
plot=False)
# Return figure to plot together with other metrics, and cells passing filters
TSS_NR_FRAG_fig, TSS_NR_FRAG_filter=plot_barcode_metrics(metadata_bc['10x_pbmc'],
var_x='Log_unique_nr_frag',
var_y='TSS_enrichment',
min_x=QC_filters['Log_unique_nr_frag'][0],
max_x=QC_filters['Log_unique_nr_frag'][1],
min_y=QC_filters['TSS_enrichment'][0],
max_y=QC_filters['TSS_enrichment'][1],
return_cells=True,
return_fig=True,
plot=False)
# Return figure to plot together with other metrics, but not returning cells (no filter applied for the duplication rate per barcode)
DR_NR_FRAG_fig=plot_barcode_metrics(metadata_bc['10x_pbmc'],
var_x='Log_unique_nr_frag',
var_y='Dupl_rate',
min_x=QC_filters['Log_unique_nr_frag'][0],
max_x=QC_filters['Log_unique_nr_frag'][1],
min_y=QC_filters['Dupl_rate'][0],
max_y=QC_filters['Dupl_rate'][1],
return_cells=False,
return_fig=True,
plot=False,
plot_as_hexbin = True)
# Plot barcode stats in one figure
fig=plt.figure(figsize=(10,10))
plt.subplot(1, 3, 1)
img = fig2img(FRIP_NR_FRAG_fig)
plt.imshow(img)
plt.axis('off')
plt.subplot(1, 3, 2)
img = fig2img(TSS_NR_FRAG_fig)
plt.imshow(img)
plt.axis('off')
plt.subplot(1, 3, 3)
img = fig2img(DR_NR_FRAG_fig)
plt.imshow(img)
plt.axis('off')
plt.show()

[ ]:
bc_passing_filters = {'10x_pbmc':[]}
bc_passing_filters['10x_pbmc'] = list((set(FRIP_NR_FRAG_filter) & set(TSS_NR_FRAG_filter)))
pickle.dump(bc_passing_filters,
open(os.path.join(work_dir, 'scATAC/quality_control/bc_passing_filters.pkl'), 'wb'))
print(f"{len(bc_passing_filters['10x_pbmc'])} barcodes passed QC stats")
2521 barcodes passed QC stats
Creating a cisTopic object and topic modeling
Now that we have good quality barcodes we will generate a binary count matrix of ATAC-seq fragments over consensus peaks. This matrix, along with metadata, will be stored in a cisTopic object and be used for topic modeling.
We will start by reading cell metadata from the scRNA-seq side of the analysis. For SCENIC+ we will only keep cells which passed quality metrics in both assays.
Note:
For independent scATAC-seq analysis you probably want to keep all cells (not only the cells passing the scRNA-seq filters).
[ ]:
import scanpy as sc
adata = sc.read_h5ad(os.path.join(work_dir, 'scRNA/adata.h5ad'))
scRNA_bc = adata.obs_names
cell_data = adata.obs
cell_data['sample_id'] = '10x_pbmc'
cell_data['celltype'] = cell_data['celltype'].astype(str) # set data type of the celltype column to str, otherwise the export_pseudobulk function will complain.
del(adata)
Load scATAC-seq data
[ ]:
import pickle
fragments_dict = {'10x_pbmc': os.path.join(work_dir, 'data/pbmc_granulocyte_sorted_3k_atac_fragments.tsv.gz')}
path_to_regions = {'10x_pbmc':os.path.join(work_dir, 'scATAC/consensus_peak_calling/consensus_regions.bed')}
path_to_blacklist= '../../pycisTopic/blacklist/hg38-blacklist.v2.bed'
metadata_bc = pickle.load(open(os.path.join(work_dir, 'scATAC/quality_control/metadata_bc.pkl'), 'rb'))
bc_passing_filters = pickle.load(open(os.path.join(work_dir, 'scATAC/quality_control/bc_passing_filters.pkl'), 'rb'))
[ ]:
print(f"{len(list(set(bc_passing_filters['10x_pbmc']) & set(scRNA_bc)))} cell barcodes pass both scATAC-seq and scRNA-seq based filtering")
2415 cell barcodes pass both scATAC-seq and scRNA-seq based filtering
Create cisTopic object
[ ]:
from pycisTopic.cistopic_class import *
key = '10x_pbmc'
cistopic_obj = create_cistopic_object_from_fragments(
path_to_fragments=fragments_dict[key],
path_to_regions=path_to_regions[key],
path_to_blacklist=path_to_blacklist,
metrics=metadata_bc[key],
valid_bc=list(set(bc_passing_filters[key]) & set(scRNA_bc)),
n_cpu=1,
project=key,
split_pattern='-')
cistopic_obj.add_cell_data(cell_data, split_pattern='-')
print(cistopic_obj)
2022-08-04 09:16:19,076 cisTopic INFO Reading data for 10x_pbmc
2022-08-04 09:17:01,729 cisTopic INFO metrics provided!
2022-08-04 09:17:03,660 cisTopic INFO valid_bc provided, selecting barcodes!
2022-08-04 09:17:05,593 cisTopic INFO Counting fragments in regions
2022-08-04 09:17:18,288 cisTopic INFO Creating fragment matrix
2022-08-04 09:17:35,403 cisTopic INFO Converting fragment matrix to sparse matrix
2022-08-04 09:17:41,083 cisTopic INFO Removing blacklisted regions
2022-08-04 09:17:41,878 cisTopic INFO Creating CistopicObject
2022-08-04 09:17:42,871 cisTopic INFO Done!
Columns ['sample_id'] will be overwritten
CistopicObject from project 10x_pbmc with n_cells × n_regions = 2415 × 194062
Save object.
[ ]:
pickle.dump(cistopic_obj,
open(os.path.join(work_dir, 'scATAC/cistopic_obj.pkl'), 'wb'))
Run topic modeling. The purpose of this is twofold:
To find sets of co-accessible regions (topics), this will be used downstream as candidate enhancers (together with Differentially Accessible Regions (DARs)).
To impute dropouts.
Note:
scATAC-seq data is very sparse. This is because of the nature of the technique. It profiles chromatin accessibility in single cells and each cell only has two copies (two alleles) of each genomic region, also the genome is realtively large (so the number of regions which can be measured is large). For this reason drop outs are a real problem, the chance of measuring a specific genomic region (out of many potential regions) which only has two copies per cell is relatively small. Compare this to scRNA-seq analysis, here the number of genes which can be measure is relatively small (compared to the number of genomic regions) and each cell has potentially hundres of copies of each transcript. To account for drop outs in the scATAC-seq assay imputation techniques are often used, in this case we make use of topic modeling, making use of the fact that the data often contains cells which are quite similar to each other but might have slightly different features measured.
Before running the topic modeling we are not sure what the best number of topics will be to explain the data. Analog to PCA where you also often don’t know before hand what the best number of principle components is. For this reason we will generate models with increasing numbers of topics and after the fact choose the model with the optimal amount of topics. For demonstration purposes we will only try a few amount of models, you might want to explore a larger number of topics.
Warning:
Topic modeling can be computationaly intense!
[ ]:
import pickle
cistopic_obj = pickle.load(open(os.path.join(work_dir, 'scATAC/cistopic_obj.pkl'), 'rb'))
from pycisTopic.cistopic_class import *
models=run_cgs_models(cistopic_obj,
n_topics=[2,4,10,16,32,48],
n_cpu=5,
n_iter=500,
random_state=555,
alpha=50,
alpha_by_topic=True,
eta=0.1,
eta_by_topic=False,
save_path=None,
_temp_dir = os.path.join(tmp_dir + 'ray_spill'))
(run_cgs_model pid=13803) 2022-08-04 09:39:09,402 cisTopic INFO Running model with 10 topics
(run_cgs_model pid=13804) 2022-08-04 09:39:09,404 cisTopic INFO Running model with 16 topics
(run_cgs_model pid=13805) 2022-08-04 09:39:09,402 cisTopic INFO Running model with 32 topics
(run_cgs_model pid=13806) 2022-08-04 09:39:09,402 cisTopic INFO Running model with 2 topics
(run_cgs_model pid=13802) 2022-08-04 09:39:09,402 cisTopic INFO Running model with 4 topics
(run_cgs_model pid=13806) 2022-08-04 09:45:30,724 cisTopic INFO Model with 2 topics done!
(run_cgs_model pid=13806) 2022-08-04 09:45:30,797 cisTopic INFO Running model with 48 topics
(run_cgs_model pid=13802) 2022-08-04 09:49:16,771 cisTopic INFO Model with 4 topics done!
(run_cgs_model pid=13803) 2022-08-04 09:59:39,401 cisTopic INFO Model with 10 topics done!
(run_cgs_model pid=13804) 2022-08-04 10:08:35,989 cisTopic INFO Model with 16 topics done!
(run_cgs_model pid=13805) 2022-08-04 10:36:04,467 cisTopic INFO Model with 32 topics done!
(run_cgs_model pid=13806) 2022-08-04 11:14:06,391 cisTopic INFO Model with 48 topics done!
Save results
[ ]:
if not os.path.exists(os.path.join(work_dir, 'scATAC/models')):
os.makedirs(os.path.join(work_dir, 'scATAC/models'))
pickle.dump(models,
open(os.path.join(work_dir, 'scATAC/models/10x_pbmc_models_500_iter_LDA.pkl'), 'wb'))
Analyze models.
We will make use of four quality metrics to select the model with the optimal amount of topics: 1. Arun et al. 2010 2. Cao & Juan et al. 2009 3. Mimno et al. 2011 4. Log likelihood
For more information on these metrics see publications (linked above) and the read the docs page of pycisTopic.
[ ]:
models = pickle.load(open(os.path.join(work_dir, 'scATAC/models/10x_pbmc_models_500_iter_LDA.pkl'), 'rb'))
cistopic_obj = pickle.load(open(os.path.join(work_dir, 'scATAC/cistopic_obj.pkl'), 'rb'))
from pycisTopic.lda_models import *
model = evaluate_models(models,
select_model=16,
return_model=True,
metrics=['Arun_2010','Cao_Juan_2009', 'Minmo_2011', 'loglikelihood'],
plot_metrics=False)

The metrics seem to stabelise with a model using 16 topics, so let’s choose that model.
[ ]:
cistopic_obj.add_LDA_model(model)
pickle.dump(cistopic_obj,
open(os.path.join(work_dir, 'scATAC/cistopic_obj.pkl'), 'wb'))
Visualization
We can use the cell-topic probabilities to generate dimensionality reductions.
[ ]:
from pycisTopic.clust_vis import *
run_umap(cistopic_obj, target = 'cell', scale=True)
plot_metadata(cistopic_obj, reduction_name = 'UMAP', variables = ['celltype'])
2022-08-04 11:30:48,959 cisTopic INFO Running UMAP

We can also plot the cell-topic probabilities on the UMAP, to visualize their cell type specifiticy.
[ ]:
plot_topic(cistopic_obj, reduction_name = 'UMAP', num_columns = 4)

For further analysis see the pycisTopic read the docs page
Inferring candidate enhancer regions
Next we will infer candidate enhancer regions by:
binarization of region-topic probabilites.
calculation differentially accessibile regions (DARs) per cell type.
These regions will be used as input for the next step, pycistarget, in which we will look which motifs are enriched in these regions.
First we will binarize the topics using the otsu method and by taking the top 3k regions per topic.
[ ]:
from pycisTopic.topic_binarization import *
region_bin_topics_otsu = binarize_topics(cistopic_obj, method='otsu')
region_bin_topics_top3k = binarize_topics(cistopic_obj, method='ntop', ntop = 3000)
<Figure size 460.8x345.6 with 0 Axes>
<Figure size 460.8x345.6 with 0 Axes>
Next we will calculate DARs per cell type
[ ]:
from pycisTopic.diff_features import *
imputed_acc_obj = impute_accessibility(cistopic_obj, selected_cells=None, selected_regions=None, scale_factor=10**6)
normalized_imputed_acc_obj = normalize_scores(imputed_acc_obj, scale_factor=10**4)
variable_regions = find_highly_variable_features(normalized_imputed_acc_obj, plot = False)
markers_dict = find_diff_features(cistopic_obj, imputed_acc_obj, variable='celltype', var_features=variable_regions, split_pattern = '-')
2022-08-04 11:47:45,338 cisTopic INFO Imputing drop-outs
2022-08-04 11:47:46,586 cisTopic INFO Scaling
2022-08-04 11:47:47,670 cisTopic INFO Keep non zero rows
2022-08-04 11:47:49,654 cisTopic INFO Imputed accessibility sparsity: 0.43567783880139876
2022-08-04 11:47:49,655 cisTopic INFO Create CistopicImputedFeatures object
2022-08-04 11:47:49,656 cisTopic INFO Done!
2022-08-04 11:47:49,657 cisTopic INFO Normalizing imputed data
2022-08-04 11:47:57,846 cisTopic INFO Done!
2022-08-04 11:47:57,857 cisTopic INFO Calculating mean
2022-08-04 11:47:58,377 cisTopic INFO Calculating variance
2022-08-04 11:48:04,153 cisTopic INFO Done!
2022-08-04 11:48:04,820 cisTopic INFO Formatting data for B_cells_1
2022-08-04 11:48:06,159 cisTopic INFO Computing p-value for B_cells_1
2022-08-04 11:48:37,356 cisTopic INFO Computing log2FC for B_cells_1
2022-08-04 11:48:38,629 cisTopic INFO B_cells_1 done!
2022-08-04 11:48:38,635 cisTopic INFO Formatting data for B_cells_2
2022-08-04 11:48:39,976 cisTopic INFO Computing p-value for B_cells_2
2022-08-04 11:49:12,303 cisTopic INFO Computing log2FC for B_cells_2
2022-08-04 11:49:13,489 cisTopic INFO B_cells_2 done!
2022-08-04 11:49:13,495 cisTopic INFO Formatting data for CD14+_Monocytes
2022-08-04 11:49:14,866 cisTopic INFO Computing p-value for CD14+_Monocytes
2022-08-04 11:49:46,613 cisTopic INFO Computing log2FC for CD14+_Monocytes
2022-08-04 11:49:47,810 cisTopic INFO CD14+_Monocytes done!
2022-08-04 11:49:47,816 cisTopic INFO Formatting data for CD4_T_cells
2022-08-04 11:49:49,169 cisTopic INFO Computing p-value for CD4_T_cells
2022-08-04 11:50:20,299 cisTopic INFO Computing log2FC for CD4_T_cells
2022-08-04 11:50:21,531 cisTopic INFO CD4_T_cells done!
2022-08-04 11:50:21,538 cisTopic INFO Formatting data for CD8_T_cells
2022-08-04 11:50:22,935 cisTopic INFO Computing p-value for CD8_T_cells
2022-08-04 11:50:55,451 cisTopic INFO Computing log2FC for CD8_T_cells
2022-08-04 11:50:57,142 cisTopic INFO CD8_T_cells done!
2022-08-04 11:50:57,148 cisTopic INFO Formatting data for Dendritic_cells
2022-08-04 11:50:58,499 cisTopic INFO Computing p-value for Dendritic_cells
2022-08-04 11:51:31,104 cisTopic INFO Computing log2FC for Dendritic_cells
2022-08-04 11:51:32,403 cisTopic INFO Dendritic_cells done!
2022-08-04 11:51:32,411 cisTopic INFO Formatting data for FCGR3A+_Monocytes
2022-08-04 11:51:33,795 cisTopic INFO Computing p-value for FCGR3A+_Monocytes
2022-08-04 11:52:07,187 cisTopic INFO Computing log2FC for FCGR3A+_Monocytes
2022-08-04 11:52:08,510 cisTopic INFO FCGR3A+_Monocytes done!
2022-08-04 11:52:08,521 cisTopic INFO Formatting data for NK_cells
2022-08-04 11:52:10,423 cisTopic INFO Computing p-value for NK_cells
2022-08-04 11:52:45,449 cisTopic INFO Computing log2FC for NK_cells
2022-08-04 11:52:46,906 cisTopic INFO NK_cells done!
<Figure size 432x288 with 0 Axes>
Save results
[ ]:
if not os.path.exists(os.path.join(work_dir, 'scATAC/candidate_enhancers')):
os.makedirs(os.path.join(work_dir, 'scATAC/candidate_enhancers'))
import pickle
pickle.dump(region_bin_topics_otsu, open(os.path.join(work_dir, 'scATAC/candidate_enhancers/region_bin_topics_otsu.pkl'), 'wb'))
pickle.dump(region_bin_topics_top3k, open(os.path.join(work_dir, 'scATAC/candidate_enhancers/region_bin_topics_top3k.pkl'), 'wb'))
pickle.dump(markers_dict, open(os.path.join(work_dir, 'scATAC/candidate_enhancers/markers_dict.pkl'), 'wb'))
We now completed all the mininal scATAC-seq preprocessing steps.
In particular we:
generated a set of consensus peaks
performed quality control steps, only keeping cell barcods which passed QC metrics in both the scRNA-seq and scATAC-seq assay
performed topic modeling
inferred candidate enhancer regions by binarizing the region-topic probabilities and DARs per cell type
In the next step we will perform motif enrichment analysis on these candidate enhancer regions using the python package, pycistarget. For this a precomputed motif-score database is needed. A sample specific database can be generated by scoring the consensus peaks with motifs or a general pre-scored database can be used as well.
Motif enrichment analysis using pycistarget
After having identified candidate enhancer regions we will use pycistarget to find which motifs are enriched in these regions.
Cistarget databases
In order to run pycistarget one needs a precomputed database containing motif scores for genomic regions.
You can choose to compute this database yourself by scoring the consensus peaks generated in the scATAC-seq analysis using a set of motifs. The advantage of creating a sample specific database is that you can potentially pick up more target regions, given that only regions included/overlappig with regions in the cistarget database will be used for the SCENIC+ analysis. For more information checkout the create_cisTarget_databases repo on github.
We also provide several precomputed databases containing regions covering many experimentally defined candidate cis-regulatory elements. These databases are available on: https://resources.aertslab.org/cistarget/.
For this analysis we will use a precomputed database using screen regions.
Next to a precomputed motif database we also need a motif-to-tf annotation database. This is also available on https://resources.aertslab.org/cistarget/.
Load candidate enhancer regions identified in previous step.
[3]:
import pickle
region_bin_topics_otsu = pickle.load(open(os.path.join(work_dir, 'scATAC/candidate_enhancers/region_bin_topics_otsu.pkl'), 'rb'))
region_bin_topics_top3k = pickle.load(open(os.path.join(work_dir, 'scATAC/candidate_enhancers/region_bin_topics_top3k.pkl'), 'rb'))
markers_dict = pickle.load(open(os.path.join(work_dir, 'scATAC/candidate_enhancers/markers_dict.pkl'), 'rb'))
Convert to dictionary of pyranges objects.
[4]:
import pyranges as pr
from pycistarget.utils import region_names_to_coordinates
region_sets = {}
region_sets['topics_otsu'] = {}
region_sets['topics_top_3'] = {}
region_sets['DARs'] = {}
for topic in region_bin_topics_otsu.keys():
regions = region_bin_topics_otsu[topic].index[region_bin_topics_otsu[topic].index.str.startswith('chr')] #only keep regions on known chromosomes
region_sets['topics_otsu'][topic] = pr.PyRanges(region_names_to_coordinates(regions))
for topic in region_bin_topics_top3k.keys():
regions = region_bin_topics_top3k[topic].index[region_bin_topics_top3k[topic].index.str.startswith('chr')] #only keep regions on known chromosomes
region_sets['topics_top_3'][topic] = pr.PyRanges(region_names_to_coordinates(regions))
for DAR in markers_dict.keys():
regions = markers_dict[DAR].index[markers_dict[DAR].index.str.startswith('chr')] #only keep regions on known chromosomes
region_sets['DARs'][DAR] = pr.PyRanges(region_names_to_coordinates(regions))
[5]:
for key in region_sets.keys():
print(f'{key}: {region_sets[key].keys()}')
topics_otsu: dict_keys(['Topic1', 'Topic2', 'Topic3', 'Topic4', 'Topic5', 'Topic6', 'Topic7', 'Topic8', 'Topic9', 'Topic10', 'Topic11', 'Topic12', 'Topic13', 'Topic14', 'Topic15', 'Topic16'])
topics_top_3: dict_keys(['Topic1', 'Topic2', 'Topic3', 'Topic4', 'Topic5', 'Topic6', 'Topic7', 'Topic8', 'Topic9', 'Topic10', 'Topic11', 'Topic12', 'Topic13', 'Topic14', 'Topic15', 'Topic16'])
DARs: dict_keys(['B_cells_1', 'B_cells_2', 'CD14+_Monocytes', 'CD4_T_cells', 'CD8_T_cells', 'Dendritic_cells', 'FCGR3A+_Monocytes', 'NK_cells'])
Define rankings, score and motif annotation database.
The ranking database is used for running the cistarget analysis and the scores database is used for running the DEM analysis. For more information see the pycistarget read the docs page
[6]:
db_fpath = "/staging/leuven/stg_00002/lcb/icistarget/data/make_rankings/v10_clust/CTX_hg38"
motif_annot_fpath = "/staging/leuven/stg_00002/lcb/cbravo/cluster_motif_collection_V10_no_desso_no_factorbook/snapshots"
[7]:
rankings_db = os.path.join(db_fpath, 'cluster_SCREEN.regions_vs_motifs.rankings.v2.feather')
scores_db = os.path.join(db_fpath, 'cluster_SCREEN.regions_vs_motifs.scores.v2.feather')
motif_annotation = os.path.join(motif_annot_fpath, 'motifs-v10-nr.hgnc-m0.00001-o0.0.tbl')
Next we will run pycistarget using the run_pycistarget
wrapper function.
This function will run cistarget based and DEM based motif enrichment analysis with or without promoter regions.
[8]:
if not os.path.exists(os.path.join(work_dir, 'motifs')):
os.makedirs(os.path.join(work_dir, 'motifs'))
[9]:
from scenicplus.wrappers.run_pycistarget import run_pycistarget
run_pycistarget(
region_sets = region_sets,
species = 'homo_sapiens',
save_path = os.path.join(work_dir, 'motifs'),
ctx_db_path = rankings_db,
dem_db_path = scores_db,
path_to_motif_annotations = motif_annotation,
run_without_promoters = True,
n_cpu = 8,
_temp_dir = os.path.join(tmp_dir, 'ray_spill'),
annotation_version = 'v10nr_clust',
)
2022-08-05 08:53:16,277 pycisTarget_wrapper INFO pbmc_tutorial/motifs folder already exists.
2022-08-05 08:53:17,650 pycisTarget_wrapper INFO Loading cisTarget database for topics_otsu
2022-08-05 08:53:17,653 cisTarget INFO Reading cisTarget database
2022-08-05 09:13:51,198 pycisTarget_wrapper INFO Running cisTarget for topics_otsu
(ctx_internal_ray pid=17049) 2022-08-05 09:14:38,091 cisTarget INFO Running cisTarget for Topic1 which has 4760 regions
(ctx_internal_ray pid=17050) 2022-08-05 09:14:38,555 cisTarget INFO Running cisTarget for Topic2 which has 6170 regions
(ctx_internal_ray pid=17047) 2022-08-05 09:14:38,925 cisTarget INFO Running cisTarget for Topic3 which has 4559 regions
(ctx_internal_ray pid=17040) 2022-08-05 09:14:39,376 cisTarget INFO Running cisTarget for Topic4 which has 3273 regions
(ctx_internal_ray pid=17043) 2022-08-05 09:14:39,819 cisTarget INFO Running cisTarget for Topic5 which has 2115 regions
(ctx_internal_ray pid=17044) 2022-08-05 09:14:40,172 cisTarget INFO Running cisTarget for Topic6 which has 3547 regions
(ctx_internal_ray pid=17046) 2022-08-05 09:14:40,636 cisTarget INFO Running cisTarget for Topic7 which has 5380 regions
(ctx_internal_ray pid=17048) 2022-08-05 09:14:41,022 cisTarget INFO Running cisTarget for Topic8 which has 7775 regions
2022-08-05 09:16:01,882 cisTarget INFO Done!
2022-08-05 09:16:01,884 pycisTarget_wrapper INFO pbmc_tutorial/motifs/CTX_topics_otsu_All folder already exists.
2022-08-05 09:16:02,212 pycisTarget_wrapper INFO Running cisTarget without promoters for topics_otsu
2022-08-05 09:16:13,911 INFO services.py:1470 -- View the Ray dashboard at http://127.0.0.1:8266
2022-08-05 09:17:31,790 cisTarget INFO Done!
2022-08-05 09:17:31,793 pycisTarget_wrapper INFO pbmc_tutorial/motifs/CTX_topics_otsu_No_promoters folder already exists.
2022-08-05 09:17:32,015 pycisTarget_wrapper INFO Running DEM for topics_otsu
2022-08-05 09:17:32,017 DEM INFO Reading DEM database
Let’s explore some of the results. Below we show the motifs found for topic 8 (specific to B-cells) using DEM.
[3]:
import dill
menr = dill.load(open(os.path.join(work_dir, 'motifs/menr.pkl'), 'rb'))
[8]:
menr['DEM_topics_otsu_All'].DEM_results('Topic8')
[8]:
Logo | Contrast | Direct_annot | Orthology_annot | Log2FC | Adjusted_pval | Mean_fg | Mean_bg | Motif_hit_thr | Motif_hits | |
---|---|---|---|---|---|---|---|---|---|---|
metacluster_70.24 | ![]() |
Topic8 | POU3F2 | NaN | 2.033431 | 0.007407 | 0.417515 | 0.101988 | 3.0 | 436.0 |
metacluster_70.25 | ![]() |
Topic8 | POU3F4 | NaN | 1.87856 | 0.010579 | 0.431751 | 0.117417 | 3.0 | 439.0 |
metacluster_70.20 | ![]() |
Topic8 | POU1F1 | NaN | 1.808349 | 0.02941 | 0.44035 | 0.125728 | 3.0 | 420.0 |
metacluster_70.14 | ![]() |
Topic8 | POU2F3 | NaN | 1.583922 | 0.002572 | 0.538298 | 0.179562 | 3.0 | 516.0 |
metacluster_142.5 | ![]() |
Topic8 | POU2F2 | NaN | 1.422996 | 0.001821 | 0.559761 | 0.208755 | 3.0 | 462.0 |
taipale_tf_pairs__POU5F1_NATATGCTAATKN_HT | ![]() |
Topic8 | POU5F1 | NaN | 1.315945 | 0.011452 | 0.563511 | 0.226341 | 3.0 | 533.0 |
homer__ATATGCAAAT_Oct2 | ![]() |
Topic8 | NaN | POU2F2 | 1.298874 | 0.005947 | 0.731667 | 0.297381 | 3.0 | 656.0 |
metacluster_70.15 | ![]() |
Topic8 | POU2F2 | NaN | 1.275756 | 0.005288 | 0.622926 | 0.257274 | 3.0 | 577.0 |
tfdimers__MD00527 | ![]() |
Topic8 | ZEB1, IRF1, IRF7, IRF6, IRF8, IRF4, IRF5, IRF2, IRF3 | NaN | 1.247605 | 0.032119 | 0.495943 | 0.208865 | 3.0 | 495.0 |
metacluster_70.4 | ![]() |
Topic8 | POU3F2 | NaN | 1.11287 | 0.001088 | 0.638315 | 0.29514 | 3.0 | 572.0 |
tfdimers__MD00466 | ![]() |
Topic8 | SPI1, MYB | NaN | 1.075531 | 0.001821 | 0.936467 | 0.44435 | 3.0 | 947.0 |
metacluster_70.12 | ![]() |
Topic8 | POU1F1 | NaN | 1.072198 | 0.011736 | 0.558699 | 0.265714 | 3.0 | 486.0 |
tfdimers__MD00276 | ![]() |
Topic8 | IRF1, STAT6 | NaN | 1.002905 | 0.009414 | 0.778363 | 0.388399 | 3.0 | 737.0 |
tfdimers__MD00456 | ![]() |
Topic8 | OVOL2, E2F6 | NaN | 0.999106 | 0.016108 | 0.556238 | 0.278291 | 3.0 | 490.0 |
tfdimers__MD00336 | ![]() |
Topic8 | STAT1, IRF8 | NaN | 0.946475 | 0.002001 | 0.782795 | 0.406191 | 3.0 | 787.0 |
transfac_pro__M04722 | ![]() |
Topic8 | SPI1 | NaN | 0.922754 | 0.001821 | 1.099434 | 0.579953 | 3.0 | 882.0 |
tfdimers__MD00015 | ![]() |
Topic8 | SPI1, E2F1 | NaN | 0.906524 | 0.000034 | 1.706551 | 0.910392 | 3.0 | 1745.0 |
metacluster_70.7 | ![]() |
Topic8 | NaN | POU3F3 | 0.861957 | 0.008258 | 0.941491 | 0.518014 | 3.0 | 868.0 |
tfdimers__MD00026 | ![]() |
Topic8 | IRF1, MYB | NaN | 0.854851 | 0.002572 | 0.931829 | 0.515229 | 3.0 | 948.0 |
tfdimers__MD00176 | ![]() |
Topic8 | IRF3, E2F1 | NaN | 0.845236 | 0.000239 | 1.215198 | 0.676403 | 3.0 | 1238.0 |
tfdimers__MD00598 | ![]() |
Topic8 | TBP, TAF6, NKX3-2 | NaN | 0.844324 | 0.034463 | 0.694534 | 0.386836 | 3.0 | 658.0 |
tfdimers__MD00056 | ![]() |
Topic8 | ZEB1, EP300 | NaN | 0.839267 | 0.028728 | 0.806163 | 0.450587 | 3.0 | 778.0 |
tfdimers__MD00581 | ![]() |
Topic8 | ATF5, NKX3-2 | NaN | 0.81537 | 0.007569 | 0.89239 | 0.507112 | 3.0 | 912.0 |
metacluster_70.3 | ![]() |
Topic8 | POU3F1 | NaN | 0.811904 | 0.001787 | 0.811364 | 0.462177 | 3.0 | 686.0 |
metacluster_70.5 | ![]() |
Topic8 | POU3F3 | NaN | 0.782496 | 0.008123 | 0.641249 | 0.372797 | 3.0 | 542.0 |
tfdimers__MD00217 | ![]() |
Topic8 | LTF, MZF1 | NaN | 0.782231 | 0.007569 | 0.914582 | 0.531799 | 3.0 | 914.0 |
tfdimers__MD00138 | ![]() |
Topic8 | MYB, IRF1, IRF7, IRF8, IRF9, IRF4, IRF5, IRF2, IRF3 | NaN | 0.774719 | 0.016218 | 0.761009 | 0.444812 | 3.0 | 748.0 |
tfdimers__MD00299 | ![]() |
Topic8 | ETS1, SRY | NaN | 0.756815 | 0.043195 | 0.758165 | 0.448683 | 3.0 | 781.0 |
tfdimers__MD00414 | ![]() |
Topic8 | IRF1, IRF7, IRF8, IRF9, POU5F1, IRF4, IRF5, IRF2, IRF3 | NaN | 0.747596 | 0.010667 | 1.073022 | 0.639087 | 3.0 | 1097.0 |
metacluster_142.4 | ![]() |
Topic8 | POU3F3, POU2F3, TBP, POU5F1B, POU5F1, POU2F1, POU2AF1, POU4F1, POU3F1, POU3F2, POU2F2 | CCDC160, POU3F3, POU3F4, POU5F2, POU5F1B, POU5F1, POU3F1, POU2F2 | 0.737862 | 0.001903 | 0.992548 | 0.595159 | 3.0 | 787.0 |
metacluster_70.26 | ![]() |
Topic8 | POU2F2, POU2F1 | NaN | 0.737029 | 0.001821 | 0.958818 | 0.575265 | 3.0 | 759.0 |
tfdimers__MD00097 | ![]() |
Topic8 | PAX4, GABPA | NaN | 0.731601 | 0.004411 | 0.951365 | 0.572946 | 3.0 | 994.0 |
swissregulon__hs__POU5F1 | ![]() |
Topic8 | POU5F1 | NaN | 0.729873 | 0.033279 | 0.954035 | 0.575242 | 3.0 | 763.0 |
taipale_tf_pairs__TEAD4_GSC2_RGWATGTTAATCCS_CAP_repr | ![]() |
Topic8 | TEAD4, GSC2 | NaN | 0.724203 | 0.039349 | 0.382943 | 0.231807 | 3.0 | 198.0 |
tfdimers__MD00489 | ![]() |
Topic8 | IRF1, NR3C1 | NaN | 0.720851 | 0.024492 | 0.706047 | 0.428387 | 3.0 | 634.0 |
transfac_pro__M01239 | ![]() |
Topic8 | RELA | NaN | 0.68609 | 0.005642 | 1.022739 | 0.635669 | 3.0 | 920.0 |
transfac_pro__M04814 | ![]() |
Topic8 | RELA | NaN | 0.682833 | 0.038922 | 0.772691 | 0.481341 | 3.0 | 610.0 |
tfdimers__MD00007 | ![]() |
Topic8 | IRF8, E2F1 | NaN | 0.674146 | 0.011452 | 0.691277 | 0.433225 | 3.0 | 678.0 |
tfdimers__MD00238 | ![]() |
Topic8 | MAFK, MAFB, NFE2, NFE2L2, E2F1, BACH1, MAFF, NFE2L1, BACH2, NFE2L3, MAFG, MAF | NaN | 0.672486 | 0.008258 | 0.943005 | 0.591665 | 3.0 | 872.0 |
transfac_pro__M07078 | ![]() |
Topic8 | NaN | BHLHE40 | 0.668836 | 0.01581 | 0.680823 | 0.428247 | 3.0 | 691.0 |
tfdimers__MD00018 | ![]() |
Topic8 | ELK3, ELF1, FLI1, GABPA, ERG, SPIB, ETV4, ETS2, GABPB1, ELK4, TBP, TAF6, ETV7, ETV6, ELK1, ETS1, ELF4, ELF2, SPI1 | NaN | 0.666655 | 0.001638 | 1.399046 | 0.881351 | 3.0 | 1464.0 |
tfdimers__MD00196 | ![]() |
Topic8 | E2F6, EP300 | NaN | 0.666044 | 0.011452 | 1.073766 | 0.676722 | 3.0 | 1075.0 |
metacluster_2.6 | ![]() |
Topic8 | ZNF426, ZNF71, IRF1, IRF7, STAT1, IRF8, IRF6, IRF9, IRF4, IRF5, IRF2, IRF3, STAT2 | IRF1, IRF3, IRF8 | 0.657586 | 0.000014 | 1.710341 | 1.08425 | 3.0 | 1723.0 |
tfdimers__MD00208 | ![]() |
Topic8 | IRF1, LTF | NaN | 0.65134 | 0.020797 | 0.746346 | 0.47519 | 3.0 | 740.0 |
metacluster_70.8 | ![]() |
Topic8 | NaN | POU3F1 | 0.646561 | 0.00305 | 1.035916 | 0.661744 | 3.0 | 858.0 |
transfac_pro__M04915 | ![]() |
Topic8 | TBP | NaN | 0.644529 | 0.002572 | 0.900932 | 0.576328 | 3.0 | 789.0 |
metacluster_70.9 | ![]() |
Topic8 | POU3F1 | NaN | 0.643741 | 0.004689 | 0.982858 | 0.629079 | 3.0 | 735.0 |
metacluster_32.23 | ![]() |
Topic8 | ZNF235 | NaN | 0.641357 | 0.024492 | 0.658841 | 0.422389 | 3.0 | 602.0 |
tfdimers__MD00240 | ![]() |
Topic8 | TCF7L1, E2F1 | NaN | 0.635128 | 0.016108 | 1.270305 | 0.817928 | 3.0 | 1306.0 |
tfdimers__MD00215 | ![]() |
Topic8 | STAT1, LEF1 | NaN | 0.634386 | 0.002403 | 1.319929 | 0.850318 | 3.0 | 1345.0 |
tfdimers__MD00376 | ![]() |
Topic8 | KLF4, HOXA13 | NaN | 0.632878 | 0.044401 | 0.615301 | 0.3968 | 3.0 | 555.0 |
tfdimers__MD00542 | ![]() |
Topic8 | E2F1, EBF1 | NaN | 0.625781 | 0.02964 | 0.67408 | 0.43685 | 3.0 | 608.0 |
jaspar__MA1960.1 | ![]() |
Topic8 | MGA | NaN | 0.611138 | 0.024492 | 0.406037 | 0.265824 | 3.0 | 254.0 |
tfdimers__MD00174 | ![]() |
Topic8 | STAT6, PDX1 | NaN | 0.610751 | 0.01581 | 0.914349 | 0.598766 | 3.0 | 915.0 |
taipale_tf_pairs__TEAD4_SPIB_RGAATGCGGAAGTN_CAP_1 | ![]() |
Topic8 | SPIB, TEAD4 | NaN | 0.607447 | 0.038911 | 0.729044 | 0.478513 | 3.0 | 744.0 |
tfdimers__MD00131 | ![]() |
Topic8 | STAT6, TBP | NaN | 0.60643 | 0.005947 | 0.983335 | 0.645874 | 3.0 | 1028.0 |
tfdimers__MD00177 | ![]() |
Topic8 | STAT1, TCF7L1 | NaN | 0.604101 | 0.011938 | 1.222177 | 0.804048 | 3.0 | 1242.0 |
tfdimers__MD00603 | ![]() |
Topic8 | YY1, NKX3-2 | NaN | 0.589888 | 0.024374 | 0.727956 | 0.48365 | 3.0 | 665.0 |
tfdimers__MD00204 | ![]() |
Topic8 | GABPA, NKX2-1 | NaN | 0.582448 | 0.010123 | 1.496493 | 0.999402 | 3.0 | 1569.0 |
metacluster_154.2 | ![]() |
Topic8 | PAX2, PAX5, PAX9, PAX8, PAX1, PAX6, PAX4 | PAX5, PAX6, PAX1 | 0.574503 | 0.001257 | 0.68085 | 0.457202 | 3.0 | 429.0 |
tfdimers__MD00105 | ![]() |
Topic8 | IRF8, PURA | NaN | 0.569096 | 0.001128 | 1.249945 | 0.842511 | 3.0 | 1192.0 |
metacluster_70.1 | ![]() |
Topic8 | POU5F1 | NaN | 0.565803 | 0.009173 | 0.942012 | 0.636403 | 3.0 | 778.0 |
tfdimers__MD00145 | ![]() |
Topic8 | TCF7L2, ELK4, ETS1, FLI1, ELF2, ERG, ERF, ETV7, ELF1, ELK1, ETS2 | NaN | 0.565548 | 0.035655 | 1.092264 | 0.738041 | 3.0 | 1115.0 |
tfdimers__MD00220 | ![]() |
Topic8 | POU5F1, ZBTB7A | NaN | 0.560358 | 0.037101 | 0.898604 | 0.609374 | 3.0 | 788.0 |
tfdimers__MD00398 | ![]() |
Topic8 | ETS1, PURA | NaN | 0.551847 | 0.028058 | 1.288485 | 0.878935 | 3.0 | 1303.0 |
metacluster_2.7 | ![]() |
Topic8 | STAT1, IRF1, IRF7, IRF8, IRF6, IRF9, PRDM1, IRF4, IRF5, IRF2, IRF3, STAT2 | PRDM1 | 0.55154 | 0.000034 | 1.693823 | 1.155681 | 3.0 | 1653.0 |
tfdimers__MD00012 | ![]() |
Topic8 | ETS1, YY1 | NaN | 0.550368 | 0.001155 | 1.628644 | 1.112113 | 3.0 | 1691.0 |
hocomoco__PO2F3_HUMAN.H11MO.0.D | ![]() |
Topic8 | POU2F3 | NaN | 0.547829 | 0.001821 | 0.796571 | 0.544894 | 3.0 | 583.0 |
metacluster_70.13 | ![]() |
Topic8 | POU2F1 | NaN | 0.543429 | 0.005947 | 0.907679 | 0.622793 | 3.0 | 644.0 |
metacluster_32.27 | ![]() |
Topic8 | ZNF235, ZFP28 | NaN | 0.543187 | 0.011736 | 0.682661 | 0.468479 | 3.0 | 629.0 |
transfac_pro__M04960 | ![]() |
Topic8 | POU5F1 | NaN | 0.540541 | 0.023501 | 0.481716 | 0.331186 | 3.0 | 367.0 |
tfdimers__MD00263 | ![]() |
Topic8 | STAT1, TCF7L1 | NaN | 0.536461 | 0.005655 | 1.155959 | 0.796988 | 3.0 | 1203.0 |
metacluster_2.9 | ![]() |
Topic8 | IRF2, PRDM1 | PRDM1 | 0.532792 | 0.001207 | 1.32202 | 0.913802 | 3.0 | 1281.0 |
tfdimers__MD00235 | ![]() |
Topic8 | STAT1, DBP | NaN | 0.532413 | 0.017536 | 1.173989 | 0.811693 | 3.0 | 1104.0 |
tfdimers__MD00044 | ![]() |
Topic8 | GABPB1, ELK4, ETS1, ELF4, ELF2, ETV6, FLI1, ELK3, ETV4, GABPA, ERG, SPI1, SIRT6, SPIB, ETV7, ELF1, ELK1, ETS2 | NaN | 0.528101 | 0.024492 | 1.352706 | 0.938057 | 3.0 | 1377.0 |
transfac_pro__M03842 | ![]() |
Topic8 | POU3F1 | NaN | 0.526425 | 0.008124 | 0.656542 | 0.45582 | 3.0 | 488.0 |
metacluster_172.20 | ![]() |
Topic8 | EBF1, ZFP2, EBF2, EBF3 | EBF2, ZFP2, EBF4, EBF3, EBF1 | 0.518482 | 0.011736 | 1.012248 | 0.706656 | 3.0 | 878.0 |
metacluster_167.2 | ![]() |
Topic8 | ZNF41, IKZF2, BCL11A, IRF8, ZBED1, PRDM1, IRF4, EP300, SPI1, TBL1XR1 | IKZF1, IRF4, SPI1 | 0.510972 | 0.000129 | 2.955833 | 2.074254 | 3.0 | 2949.0 |
hocomoco__PO2F1_HUMAN.H11MO.0.C | ![]() |
Topic8 | POU2F1 | NaN | 0.500929 | 0.00067 | 0.862948 | 0.609804 | 3.0 | 629.0 |
We now have completed all the steps necessary for starting the SCENIC+ analysis 😅.
In particalular, we have
preprocessed the scRNA-seq side of the data, selecting high quality cells and annotation these cells.
preprocessed the scATAC-seq side of the data, selecting high quality cells, performing topic modeling and identifying candidate enhacer regions.
looked for enriched motifs in candidate enhancer regions.
In the next section we will combine all these analysis and run SCENIC+
inferring enhancer-driven Gene Regulatory Networks (eGRNs) using SCENIC+
We now have completed all the steps for running the SCENIC+ analysis.
We will start by creating a scenicplus object containing all the analysis we have done up to this point.
For this we will need to load:
the AnnData object containing the scRNA-seq side of the analysis.
the cisTopic object containing the scATAC-seq side of the analysis.
the motif enrichment dictionary containing the motif enrichment results.
[6]:
import dill
import scanpy as sc
import os
import warnings
warnings.filterwarnings("ignore")
import pandas
import pyranges
# Set stderr to null to avoid strange messages from ray
import sys
_stderr = sys.stderr
null = open(os.devnull,'wb')
work_dir = 'pbmc_tutorial'
tmp_dir = '/scratch/leuven/330/vsc33053/'
adata = sc.read_h5ad(os.path.join(work_dir, 'scRNA/adata.h5ad'))
cistopic_obj = dill.load(open(os.path.join(work_dir, 'scATAC/cistopic_obj.pkl'), 'rb'))
menr = dill.load(open(os.path.join(work_dir, 'motifs/menr.pkl'), 'rb'))
Create the SCENIC+ object. It will store both the gene expression and chromatin accessibility along with motif enrichment results and cell/region/gene metadata.
Cell metadata comming from the cistopic_obj will be prefixed with the string ACC_
and metadata comming from the adata object will be prefixed with the string GEX_
.
[7]:
from scenicplus.scenicplus_class import create_SCENICPLUS_object
import numpy as np
scplus_obj = create_SCENICPLUS_object(
GEX_anndata = adata.raw.to_adata(),
cisTopic_obj = cistopic_obj,
menr = menr,
bc_transform_func = lambda x: f'{x}-10x_pbmc' #function to convert scATAC-seq barcodes to scRNA-seq ones
)
scplus_obj.X_EXP = np.array(scplus_obj.X_EXP.todense())
scplus_obj
2022-08-09 21:06:11,028 cisTopic INFO Imputing drop-outs
2022-08-09 21:06:19,958 cisTopic INFO Scaling
2022-08-09 21:06:21,803 cisTopic INFO Keep non zero rows
2022-08-09 21:06:24,861 cisTopic INFO Imputed accessibility sparsity: 0.43567783880139876
2022-08-09 21:06:24,863 cisTopic INFO Create CistopicImputedFeatures object
2022-08-09 21:06:24,864 cisTopic INFO Done!
[7]:
SCENIC+ object with n_cells x n_genes = 2415 x 21255 and n_cells x n_regions = 2415 x 191245
metadata_regions:'Chromosome', 'Start', 'End', 'Width', 'cisTopic_nr_frag', 'cisTopic_log_nr_frag', 'cisTopic_nr_acc', 'cisTopic_log_nr_acc'
metadata_genes:'gene_ids', 'feature_types', 'genome', 'n_cells', 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts'
metadata_cell:'GEX_n_genes', 'GEX_doublet_score', 'GEX_predicted_doublet', 'GEX_n_genes_by_counts', 'GEX_total_counts', 'GEX_total_counts_mt', 'GEX_pct_counts_mt', 'GEX_ingest_celltype_label', 'GEX_leiden_res_0.8', 'GEX_celltype', 'ACC_barcode', 'ACC_cisTopic_nr_frag', 'ACC_cisTopic_log_nr_acc', 'ACC_Total_nr_frag_in_regions', 'ACC_cisTopic_nr_acc', 'ACC_TSS_enrichment', 'ACC_Log_total_nr_frag', 'ACC_cisTopic_log_nr_frag', 'ACC_Unique_nr_frag', 'ACC_Dupl_nr_frag', 'ACC_FRIP', 'ACC_Log_unique_nr_frag', 'ACC_Unique_nr_frag_in_regions', 'ACC_Dupl_rate', 'ACC_Total_nr_frag', 'ACC_n_genes', 'ACC_doublet_score', 'ACC_predicted_doublet', 'ACC_n_genes_by_counts', 'ACC_total_counts', 'ACC_total_counts_mt', 'ACC_pct_counts_mt', 'ACC_ingest_celltype_label', 'ACC_leiden_res_0.8', 'ACC_celltype', 'ACC_sample_id'
menr:'CTX_topics_otsu_All', 'CTX_topics_otsu_No_promoters', 'DEM_topics_otsu_All', 'DEM_topics_otsu_No_promoters', 'CTX_topics_top_3_All', 'CTX_topics_top_3_No_promoters', 'DEM_topics_top_3_All', 'DEM_topics_top_3_No_promoters', 'CTX_DARs_All', 'CTX_DARs_No_promoters', 'DEM_DARs_All', 'DEM_DARs_No_promoters'
dr_cell:'GEX_X_pca', 'GEX_X_umap', 'GEX_rep'
Note:
the scenicplus package contains many function, if you need help with any of them just run the help()
function. For example see below:
[7]:
from scenicplus.scenicplus_class import create_SCENICPLUS_object
help(create_SCENICPLUS_object)
Help on function create_SCENICPLUS_object in module scenicplus.scenicplus_class:
create_SCENICPLUS_object(GEX_anndata: anndata._core.anndata.AnnData, cisTopic_obj: pycisTopic.cistopic_class.CistopicObject, menr: Mapping[str, Mapping[str, Any]], multi_ome_mode: bool = True, nr_metacells: Union[int, Mapping[str, int]] = None, nr_cells_per_metacells: Union[int, Mapping[str, int]] = 10, meta_cell_split: str = '_', key_to_group_by: str = None, imputed_acc_obj: pycisTopic.diff_features.CistopicImputedFeatures = None, imputed_acc_kwargs: Mapping[str, Any] = {'scale_factor': 1000000}, normalize_imputed_acc: bool = False, normalize_imputed_acc_kwargs: Mapping[str, Any] = {'scale_factor': 10000}, cell_metadata: pandas.core.frame.DataFrame = None, region_metadata: pandas.core.frame.DataFrame = None, gene_metadata: pandas.core.frame.DataFrame = None, bc_transform_func: Callable = None, ACC_prefix: str = 'ACC_', GEX_prefix: str = 'GEX_') -> scenicplus.scenicplus_class.SCENICPLUS
Function to create instances of :class:`SCENICPLUS`
Parameters
----------
GEX_anndata: sc.AnnData
An instance of :class:`~sc.AnnData` containing gene expression data and metadata.
cisTopic_obj: CistopicObject
An instance of :class:`pycisTopic.cistopic_class.CistopicObject` containing chromatin accessibility data and metadata.
menr: dict
A dict mapping annotations to motif enrichment results
multi_ome_mode: bool
A boolean specifying wether data is multi-ome (i.e. combined scATAC-seq and scRNA-seq from the same cell) or not
default: True
nr_metacells: int
For non multi_ome_mode, use this number of meta cells to link scRNA-seq and scATAC-seq
If this is a single integer the same number of metacells will be used for all annotations.
This can also be a mapping between an annotation and the number of metacells per annotation.
default: None
nr_cells_per_metacells: int
For non multi_ome_mode, use this number of cells per metacell to link scRNA-seq and scATAC-seq.
If this is a single integer the same number of cells will be used for all annotations.
This can also be a mapping between an annotation and the number of cells per metacell per annotation.
default: 10
meta_cell_split: str
Character which is used as seperator in metacell names
default: '_'
key_to_group_by: str
For non multi_ome_mode, use this cell metadata key to generate metacells from scRNA-seq and scATAC-seq.
Key should be common in scRNA-seq and scATAC-seq side
default: None
imputed_acc_obj: CistopicImputedFeatures
An instance of :class:`~pycisTopic.diff_features.CistopicImputedFeatures` containing imputed chromatin accessibility.
default: None
imputed_acc_kwargs: dict
Dict with keyword arguments for imputed chromatin accessibility.
default: {'scale_factor': 10**6}
normalize_imputed_acc: bool
A boolean specifying wether or not to normalize imputed chromatin accessbility.
default: False
normalize_imputed_acc_kwargs: dict
Dict with keyword arguments for normalizing imputed accessibility.
default: {'scale_factor': 10 ** 4}
cell_metadata: pd.DataFrame
An instance of :class:`~pd.DataFrame` containing extra cell metadata
default: None
region_metadata: pd.DataFrame
An instance of :class:`~pd.DataFrame` containing extra region metadata
default: None
gene_metadata: pd.DataFrame
An instance of :class:`~pd.DataFrame` containing extra gene metadata
default: None
bc_transform_func: func
A function used to transform gene expression barcode layout to chromatin accessbility layout.
default: None
ACC_prefix: str
String prefix to add to cell metadata coming from cisTopic_obj
GEX_prefix: str
String prefix to add to cell metadata coming from GEX_anndata
Examples
--------
>>> scplus_obj = create_SCENICPLUS_object(GEX_anndata = rna_anndata, cisTopic_obj = cistopic_obj, menr = motif_enrichment_dict)
Before running SCENIC+ it is important to check with which biomart host the gene names used in your analysis match best. Biomart will be used to find transcription starting sites of each gene. The names of genes (symbols) change quite often, so it is important to select the biomart host with the largest overlap, otherwise a lot of genes can potentially be lost.
Below we show an example on how to select the optimal host.
[21]:
ensembl_version_dict = {'105': 'http://www.ensembl.org',
'104': 'http://may2021.archive.ensembl.org/',
'103': 'http://feb2021.archive.ensembl.org/',
'102': 'http://nov2020.archive.ensembl.org/',
'101': 'http://aug2020.archive.ensembl.org/',
'100': 'http://apr2020.archive.ensembl.org/',
'99': 'http://jan2020.archive.ensembl.org/',
'98': 'http://sep2019.archive.ensembl.org/',
'97': 'http://jul2019.archive.ensembl.org/',
'96': 'http://apr2019.archive.ensembl.org/',
'95': 'http://jan2019.archive.ensembl.org/',
'94': 'http://oct2018.archive.ensembl.org/',
'93': 'http://jul2018.archive.ensembl.org/',
'92': 'http://apr2018.archive.ensembl.org/',
'91': 'http://dec2017.archive.ensembl.org/',
'90': 'http://aug2017.archive.ensembl.org/',
'89': 'http://may2017.archive.ensembl.org/',
'88': 'http://mar2017.archive.ensembl.org/',
'87': 'http://dec2016.archive.ensembl.org/',
'86': 'http://oct2016.archive.ensembl.org/',
'80': 'http://may2015.archive.ensembl.org/',
'77': 'http://oct2014.archive.ensembl.org/',
'75': 'http://feb2014.archive.ensembl.org/',
'54': 'http://may2009.archive.ensembl.org/'}
import pybiomart as pbm
def test_ensembl_host(scplus_obj, host, species):
dataset = pbm.Dataset(name=species+'_gene_ensembl', host=host)
annot = dataset.query(attributes=['chromosome_name', 'transcription_start_site', 'strand', 'external_gene_name', 'transcript_biotype'])
annot.columns = ['Chromosome', 'Start', 'Strand', 'Gene', 'Transcript_type']
annot['Chromosome'] = annot['Chromosome'].astype('str')
filter = annot['Chromosome'].str.contains('CHR|GL|JH|MT')
annot = annot[~filter]
annot.columns=['Chromosome', 'Start', 'Strand', 'Gene', 'Transcript_type']
gene_names_release = set(annot['Gene'].tolist())
ov=len([x for x in scplus_obj.gene_names if x in gene_names_release])
print('Genes recovered: ' + str(ov) + ' out of ' + str(len(scplus_obj.gene_names)))
return ov
n_overlap = {}
for version in ensembl_version_dict.keys():
print(f'host: {version}')
try:
n_overlap[version] = test_ensembl_host(scplus_obj, ensembl_version_dict[version], 'hsapiens')
except:
print('Host not reachable')
v = sorted(n_overlap.items(), key=lambda item: item[1], reverse=True)[0][0]
print(f"version: {v} has the largest overlap, use {ensembl_version_dict[v]} as biomart host")
host: 105
Genes recovered: 16357 out of 21255
host: 104
Genes recovered: 16486 out of 21255
host: 103
Genes recovered: 20757 out of 21255
host: 102
/user/leuven/330/vsc33053/sdewin/Programs/anaconda3/envs/scenicplus/lib/python3.8/site-packages/pybiomart/dataset.py:269: DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False.
result = pd.read_csv(StringIO(response.text), sep='\t')
Genes recovered: 20803 out of 21255
host: 101
Genes recovered: 20877 out of 21255
host: 100
Genes recovered: 20978 out of 21255
host: 99
Genes recovered: 21027 out of 21255
host: 98
Genes recovered: 21231 out of 21255
host: 97
Genes recovered: 20989 out of 21255
host: 96
Genes recovered: 20407 out of 21255
host: 95
Genes recovered: 20292 out of 21255
host: 94
Genes recovered: 20231 out of 21255
host: 93
Genes recovered: 20089 out of 21255
host: 92
Genes recovered: 20003 out of 21255
host: 91
Genes recovered: 19862 out of 21255
host: 90
Genes recovered: 19837 out of 21255
host: 89
Genes recovered: 19759 out of 21255
host: 88
Genes recovered: 16160 out of 21255
host: 87
Host not reachable
host: 86
Host not reachable
host: 80
Genes recovered: 15943 out of 21255
host: 77
Genes recovered: 15714 out of 21255
host: 75
Host not reachable
host: 54
Host not reachable
version: 98 has the largest overlap, use http://sep2019.archive.ensembl.org/ as biomart host
Once the object is created we can run SCENIC+. The whole analysis workflow can be run using a single wrapper function or each step can be run individually.
Note:
Here we will use the wrapper function. For more information on the different steps, see the following tutorial: TO BE ADDED.
[9]:
biomart_host = "http://sep2019.archive.ensembl.org/"
Before running we will also download a list of known human TFs from the human transcription factors database.
[26]:
!wget -O pbmc_tutorial/data/utoronto_human_tfs_v_1.01.txt http://humantfs.ccbr.utoronto.ca/download/v_1.01/TF_names_v_1.01.txt
--2022-08-05 13:59:20-- http://humantfs.ccbr.utoronto.ca/download/v_1.01/TF_names_v_1.01.txt
Resolving humantfs.ccbr.utoronto.ca (humantfs.ccbr.utoronto.ca)... 142.150.52.218
Connecting to humantfs.ccbr.utoronto.ca (humantfs.ccbr.utoronto.ca)|142.150.52.218|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 11838 (12K) [text/plain]
Saving to: ‘pbmc_tutorial/data/utoronto_human_tfs_v_1.01.txt’
100%[======================================>] 11,838 53.9KB/s in 0.2s
2022-08-05 13:59:21 (53.9 KB/s) - ‘pbmc_tutorial/data/utoronto_human_tfs_v_1.01.txt’ saved [11838/11838]
We will also download a the program bedToBigBed this will be used to generate files which can be uploaded to the UCSC genome browser
[27]:
!wget -O pbmc_tutorial/bedToBigBed http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/bedToBigBed
!chmod +x pbmc_tutorial/bedToBigBed
--2022-08-05 14:01:55-- http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/bedToBigBed
Resolving hgdownload.soe.ucsc.edu (hgdownload.soe.ucsc.edu)... 128.114.119.163
Connecting to hgdownload.soe.ucsc.edu (hgdownload.soe.ucsc.edu)|128.114.119.163|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 9573136 (9.1M)
Saving to: ‘pbmc_tutorial/bedToBigBed’
100%[======================================>] 9,573,136 3.95MB/s in 2.3s
2022-08-05 14:01:58 (3.95 MB/s) - ‘pbmc_tutorial/bedToBigBed’ saved [9573136/9573136]
[18]:
#only keep the first two columns of the PCA embedding in order to be able to visualize this in SCope
scplus_obj.dr_cell['GEX_X_pca'] = scplus_obj.dr_cell['GEX_X_pca'].iloc[:, 0:2]
scplus_obj.dr_cell['GEX_rep'] = scplus_obj.dr_cell['GEX_rep'].iloc[:, 0:2]
Now we are ready to run the analysis.
Warning:
Running SCENIC+ can be computationaly expensive. We don’t recommend to run it on your local machine.
[25]:
from scenicplus.wrappers.run_scenicplus import run_scenicplus
try:
run_scenicplus(
scplus_obj = scplus_obj,
variable = ['GEX_celltype'],
species = 'hsapiens',
assembly = 'hg38',
tf_file = 'pbmc_tutorial/data/utoronto_human_tfs_v_1.01.txt',
save_path = os.path.join(work_dir, 'scenicplus'),
biomart_host = biomart_host,
upstream = [1000, 150000],
downstream = [1000, 150000],
calculate_TF_eGRN_correlation = True,
calculate_DEGs_DARs = True,
export_to_loom_file = True,
export_to_UCSC_file = True,
path_bedToBigBed = 'pbmc_tutorial',
n_cpu = 12,
_temp_dir = os.path.join(tmp_dir, 'ray_spill'))
except Exception as e:
#in case of failure, still save the object
dill.dump(scplus_obj, open(os.path.join(work_dir, 'scenicplus/scplus_obj.pkl'), 'wb'), protocol=-1)
raise(e)
2022-08-10 00:06:08,510 SCENIC+_wrapper INFO pbmc_tutorial/scenicplus folder already exists.
2022-08-09 21:07:44,695 SCENIC+_wrapper INFO Merging cistromes
2022-08-09 21:15:49,444 SCENIC+_wrapper INFO Getting search space
2022-08-09 21:15:50,331 R2G INFO Downloading gene annotation from biomart dataset: hsapiens_gene_ensembl
2022-08-09 21:16:08,074 R2G INFO Downloading chromosome sizes from: http://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes
2022-08-09 21:16:09,168 R2G INFO Extending promoter annotation to 10 bp upstream and 10 downstream
2022-08-09 21:16:11,431 R2G INFO Extending search space to:
150000 bp downstream of the end of the gene.
150000 bp upstream of the start of the gene.
2022-08-09 21:16:30,513 R2G INFO Intersecting with regions.
join: Strand data from other will be added as strand data to self.
If this is undesired use the flag apply_strand_suffix=False.
To turn off the warning set apply_strand_suffix to True or False.
2022-08-09 21:16:31,403 R2G INFO Calculating distances from region to gene
2022-08-09 21:18:05,997 R2G INFO Imploding multiple entries per region and gene
2022-08-09 21:20:31,736 R2G INFO Done!
2022-08-09 21:20:32,158 SCENIC+_wrapper INFO Inferring region to gene relationships
2022-08-09 21:20:32,330 R2G INFO Calculating region to gene importances, using GBM method
2022-08-09 21:20:37,215 INFO services.py:1470 -- View the Ray dashboard at http://127.0.0.1:8265
initializing: 0%| | 10/14676 [00:00<14:51, 16.45it/s](raylet) cut: write error: Broken pipe
initializing: 100%|██████████████████████████████████████████████████████| 14676/14676 [13:29<00:00, 18.13it/s]
Running using 12 cores: 100%|███████████████████████████████████████████| 14676/14676 [00:29<00:00, 493.40it/s]
2022-08-09 21:34:41,966 R2G INFO Took 849.634777545929 seconds
2022-08-09 21:34:41,968 R2G INFO Calculating region to gene correlation, using SR method
2022-08-09 21:34:46,611 INFO services.py:1470 -- View the Ray dashboard at http://127.0.0.1:8265
initializing: 0%| | 10/14676 [00:00<15:20, 15.94it/s](raylet) cut: write error: Broken pipe
initializing: 1%|▋ | 164/14676 [00:09<14:13, 17.00it/s](_score_regions_to_single_gene_ray pid=22224) /user/leuven/330/vsc33053/sdewin/Programs/anaconda3/envs/scenicplus/lib/python3.8/site-packages/scipy/stats/_stats_py.py:4878: ConstantInputWarning: An input array is constant; the correlation coefficient is not defined.
(_score_regions_to_single_gene_ray pid=22224) warnings.warn(stats.ConstantInputWarning(warn_msg))
initializing: 1%|▋ | 176/14676 [00:10<14:09, 17.06it/s](_score_regions_to_single_gene_ray pid=22235) /user/leuven/330/vsc33053/sdewin/Programs/anaconda3/envs/scenicplus/lib/python3.8/site-packages/scipy/stats/_stats_py.py:4878: ConstantInputWarning: An input array is constant; the correlation coefficient is not defined.
(_score_regions_to_single_gene_ray pid=22235) warnings.warn(stats.ConstantInputWarning(warn_msg))
initializing: 2%|█ | 285/14676 [00:16<13:01, 18.41it/s](_score_regions_to_single_gene_ray pid=22224) /user/leuven/330/vsc33053/sdewin/Programs/anaconda3/envs/scenicplus/lib/python3.8/site-packages/scipy/stats/_stats_py.py:4878: ConstantInputWarning: An input array is constant; the correlation coefficient is not defined.
(_score_regions_to_single_gene_ray pid=22224) warnings.warn(stats.ConstantInputWarning(warn_msg))
initializing: 3%|█▍ | 367/14676 [00:20<12:56, 18.44it/s](_score_regions_to_single_gene_ray pid=22224) /user/leuven/330/vsc33053/sdewin/Programs/anaconda3/envs/scenicplus/lib/python3.8/site-packages/scipy/stats/_stats_py.py:4878: ConstantInputWarning: An input array is constant; the correlation coefficient is not defined.
(_score_regions_to_single_gene_ray pid=22224) warnings.warn(stats.ConstantInputWarning(warn_msg))
initializing: 3%|█▌ | 399/14676 [00:22<12:44, 18.69it/s](_score_regions_to_single_gene_ray pid=22224) /user/leuven/330/vsc33053/sdewin/Programs/anaconda3/envs/scenicplus/lib/python3.8/site-packages/scipy/stats/_stats_py.py:4878: ConstantInputWarning: An input array is constant; the correlation coefficient is not defined.
(_score_regions_to_single_gene_ray pid=22224) warnings.warn(stats.ConstantInputWarning(warn_msg))
initializing: 3%|█▌ | 403/14676 [00:22<12:41, 18.73it/s](_score_regions_to_single_gene_ray pid=22224) /user/leuven/330/vsc33053/sdewin/Programs/anaconda3/envs/scenicplus/lib/python3.8/site-packages/scipy/stats/_stats_py.py:4878: ConstantInputWarning: An input array is constant; the correlation coefficient is not defined.
(_score_regions_to_single_gene_ray pid=22224) warnings.warn(stats.ConstantInputWarning(warn_msg))
initializing: 100%|██████████████████████████████████████████████████████| 14676/14676 [13:08<00:00, 18.62it/s]
Running using 12 cores: 100%|███████████████████████████████████████████| 14676/14676 [00:28<00:00, 506.97it/s]
2022-08-09 21:48:29,014 R2G INFO Took 827.04399061203 seconds
2022-08-09 21:48:40,327 R2G INFO Done!
2022-08-09 21:48:40,715 SCENIC+_wrapper INFO Inferring TF to gene relationships
2022-08-09 21:48:48,960 INFO services.py:1470 -- View the Ray dashboard at http://127.0.0.1:8265
2022-08-09 21:48:48,960 INFO services.py:1470 -- View the Ray dashboard at http://127.0.0.1:8265
initializing: 0%| | 8/21255 [00:00<21:51, 16.20it/s](raylet) cut: write error: Broken pipe
initializing: 10%|█████▌ | 2168/21255 [01:15<20:16, 15.69it/s](raylet) Spilled 2279 MiB, 32 objects, write throughput 192 MiB/s. Set RAY_verbose_spill_logs=0 to disable this message.
(raylet) Spilled 4145 MiB, 58 objects, write throughput 260 MiB/s.
initializing: 10%|█████▌ | 2168/21255 [01:30<20:16, 15.69it/s](raylet) Spilled 8290 MiB, 118 objects, write throughput 247 MiB/s.
initializing: 11%|█████▉ | 2304/21255 [01:54<05:49, 54.20it/s](raylet) Spilled 29029 MiB, 419 objects, write throughput 580 MiB/s.
initializing: 11%|██████ | 2325/21255 [01:56<12:34, 25.09it/s](raylet) Spilled 35639 MiB, 513 objects, write throughput 686 MiB/s.
initializing: 12%|██████▍ | 2476/21255 [02:05<06:54, 45.27it/s](raylet) Spilled 67571 MiB, 976 objects, write throughput 1070 MiB/s.
initializing: 13%|███████▎ | 2834/21255 [02:35<31:01, 9.90it/s](raylet) Spilled 137837 MiB, 1993 objects, write throughput 1532 MiB/s.
initializing: 16%|████████▊ | 3385/21255 [03:16<08:56, 33.33it/s](raylet) Spilled 264286 MiB, 3824 objects, write throughput 1966 MiB/s.
initializing: 22%|████████████▏ | 4731/21255 [04:46<05:33, 49.50it/s](raylet) Spilled 532074 MiB, 7699 objects, write throughput 2404 MiB/s.
initializing: 34%|██████████████████▊ | 7256/21255 [07:45<05:40, 41.09it/s](raylet) Spilled 1055640 MiB, 15276 objects, write throughput 2626 MiB/s.
initializing: 58%|███████████████████████████████▎ | 12318/21255 [13:46<02:48, 53.00it/s](raylet) Spilled 2104256 MiB, 30454 objects, write throughput 2761 MiB/s.
initializing: 100%|██████████████████████████████████████████████████████| 21255/21255 [24:12<00:00, 14.63it/s]
Running using 12 cores: 100%|███████████████████████████████████████████| 21255/21255 [01:16<00:00, 278.59it/s]
2022-08-09 22:15:31,375 TF2G INFO Took 1599.8746180534363 seconds
2022-08-09 22:15:31,379 TF2G INFO Adding correlation coefficients to adjacencies.
2022-08-09 22:16:06,074 TF2G INFO Warning: adding TFs as their own target to adjecencies matrix. Importance values will be max + 1e-05
2022-08-09 22:16:17,002 TF2G INFO Adding importance x rho scores to adjacencies.
2022-08-09 22:16:17,053 TF2G INFO Took 45.67378568649292 seconds
2022-08-09 22:16:17,482 SCENIC+_wrapper INFO Build eGRN
2022-08-09 22:16:17,486 GSEA INFO Thresholding region to gene relationships
2022-08-09 22:16:40,569 ERROR services.py:1488 -- Failed to start the dashboard: Failed to start the dashboard
The last 10 lines of /scratch/leuven/330/vsc33053/ray_spill/session_2022-08-09_22-16-17_762354_24083/logs/dashboard.log:
2022-08-09 22:16:36,145 INFO utils.py:99 -- Get all modules by type: DashboardHeadModule
2022-08-09 22:16:40,575 ERROR services.py:1489 -- Failed to start the dashboard
The last 10 lines of /scratch/leuven/330/vsc33053/ray_spill/session_2022-08-09_22-16-17_762354_24083/logs/dashboard.log:
2022-08-09 22:16:36,145 INFO utils.py:99 -- Get all modules by type: DashboardHeadModule
Traceback (most recent call last):
File "/user/leuven/330/vsc33053/sdewin/Programs/anaconda3/envs/scenicplus/lib/python3.8/site-packages/ray/_private/services.py", line 1465, in start_dashboard
raise Exception(err_msg + last_log_str)
Exception: Failed to start the dashboard
The last 10 lines of /scratch/leuven/330/vsc33053/ray_spill/session_2022-08-09_22-16-17_762354_24083/logs/dashboard.log:
2022-08-09 22:16:36,145 INFO utils.py:99 -- Get all modules by type: DashboardHeadModule
0%| | 0/14 [00:00<?, ?it/s]
50%|█████████████████████████████████████▌ | 7/14 [09:37<11:05, 95.14s/it] 2.43it/s](raylet) cut: write error: Broken pipe
100%|██████████████████████████████████████████████████████████████████████████| 14/14 [19:19<00:00, 82.82s/it] 47.61it/s]IOStream.flush timed out
2022-08-09 22:36:04,805 GSEA INFO Subsetting TF2G adjacencies for TF with motif.
2022-08-09 22:36:11,708 INFO services.py:1470 -- View the Ray dashboard at http://127.0.0.1:8266
2022-08-09 22:36:15,228 GSEA INFO Running GSEA...
initializing: 0%| | 0/30873 [00:00<?, ?it/s](raylet) cut: write error: Broken pipe
initializing: 100%|██████████████████████████████████████████████████████| 30873/30873 [13:12<00:00, 38.97it/s]
Running using 12 cores: 100%|███████████████| 30584/30584 [06:18<00:00, 80.79it/s]
2022-08-09 22:55:52,226 GSEA INFO Subsetting on adjusted pvalue: 1, minimal NES: 0 and minimal leading edge genes 10
2022-08-09 22:56:00,036 GSEA INFO Merging eRegulons
2022-08-09 22:56:02,199 GSEA INFO Storing eRegulons in .uns[eRegulons].
2022-08-09 22:56:30,771 SCENIC+_wrapper INFO Formatting eGRNs
2022-08-09 23:12:13,820 SCENIC+_wrapper INFO Converting eGRNs to signatures
2022-08-09 23:13:50,922 SCENIC+_wrapper INFO Calculating eGRNs AUC
2022-08-09 23:13:50,926 SCENIC+_wrapper INFO Calculating region ranking
2022-08-09 23:16:00,602 SCENIC+_wrapper INFO Calculating eGRNs region based AUC
2022-08-09 23:16:37,138 SCENIC+_wrapper INFO Calculating gene ranking
2022-08-09 23:16:47,672 SCENIC+_wrapper INFO Calculating eGRNs gene based AUC
2022-08-09 23:17:09,299 SCENIC+_wrapper INFO Calculating TF-eGRNs AUC correlation
2022-08-09 23:17:28,778 SCENIC+_wrapper INFO Binarizing eGRNs AUC
2022-08-09 23:30:18,591 SCENIC+_wrapper INFO Making eGRNs AUC UMAP
2022-08-09 23:30:40,377 SCENIC+_wrapper INFO Making eGRNs AUC tSNE
2022-08-09 23:32:21,091 SCENIC+_wrapper INFO Calculating eRSS
2022-08-09 23:32:37,443 SCENIC+_wrapper INFO Calculating DEGs/DARs
2022-08-09 23:32:37,447 SCENIC+ INFO Calculating DEGs for variable GEX_celltype
2022-08-09 23:32:39,198 SCENIC+ INFO There are 4085 variable features
... storing 'ACC_celltype' as categorical
... storing 'ACC_sample_id' as categorical
2022-08-09 23:32:41,037 SCENIC+ INFO Finished calculating DEGs for variable GEX_celltype
2022-08-09 23:32:41,039 SCENIC+ INFO Calculating DARs for variable GEX_celltype
2022-08-09 23:33:04,892 SCENIC+ INFO There are 54507 variable features
... storing 'ACC_celltype' as categorical
... storing 'ACC_sample_id' as categorical
2022-08-09 23:33:32,449 SCENIC+ INFO Finished calculating DARs for variable GEX_celltype
2022-08-09 23:33:32,454 SCENIC+_wrapper INFO Exporting to loom file
2022-08-09 23:33:32,456 SCENIC+ INFO Formatting data
2022-08-09 23:33:32,478 SCENIC+ INFO The number of regulons is more than > 900. keep_direct_and_extended_if_not_direct is set to True
2022-08-09 23:34:22,821 SCENIC+ INFO Creating minimal loom
2022-08-09 23:57:22,659 SCENIC+ INFO Adding annotations
2022-08-09 23:57:24,056 SCENIC+ INFO Adding clusterings
2022-08-09 23:57:24,080 SCENIC+ INFO Adding markers
2022-08-09 23:57:24,376 SCENIC+ INFO Exporting
2022-08-09 23:57:31,189 SCENIC+ INFO Formatting data
2022-08-09 23:57:31,208 SCENIC+ INFO The number of regulons is more than > 900. keep_direct_and_extended_if_not_direct is set to True
2022-08-09 23:58:46,626 SCENIC+ INFO Creating minimal loom
2022-08-10 00:00:06,487 SCENIC+ INFO Adding annotations
2022-08-10 00:00:20,221 SCENIC+ INFO Adding clusterings
2022-08-10 00:00:20,270 SCENIC+ INFO Adding markers
2022-08-10 00:00:23,423 SCENIC+ INFO Exporting
2022-08-10 00:01:28,000 SCENIC+_wrapper INFO Exporting to UCSC
2022-08-10 00:01:29,552 R2G INFO Downloading gene annotation from biomart, using dataset: hsapiens_gene_ensembl
2022-08-10 00:01:50,588 R2G INFO Formatting data ...
2022-08-10 00:02:21,529 R2G INFO Writing data to: pbmc_tutorial/scenicplus/r2g.rho.bed
2022-08-10 00:02:23,374 R2G INFO Writing data to: pbmc_tutorial/scenicplus/r2g.rho.bb
2022-08-10 00:08:13,419 R2G INFO Downloading gene annotation from biomart, using dataset: hsapiens_gene_ensembl
2022-08-10 00:08:13,929 R2G INFO Formatting data ...
2022-08-10 00:08:46,006 R2G INFO Writing data to: pbmc_tutorial/scenicplus/r2g.importance.bed
2022-08-10 00:08:48,167 R2G INFO Writing data to: pbmc_tutorial/scenicplus/r2g.importance.bb
2022-08-10 00:09:01,893 SCENIC+_wrapper INFO Saving object
Note on the output of SCENIC+
After running the SCENIC+ analysis the scplus_obj
will be populated with the results of several analysis. Below we will describe how you can explore these.
For structuring this data we took inspiration from the AnnData format.
[3]:
scplus_obj
[3]:
SCENIC+ object with n_cells x n_genes = 2415 x 21255 and n_cells x n_regions = 2415 x 191245
metadata_regions:'Chromosome', 'Start', 'End', 'Width', 'cisTopic_nr_frag', 'cisTopic_log_nr_frag', 'cisTopic_nr_acc', 'cisTopic_log_nr_acc'
metadata_genes:'gene_ids', 'feature_types', 'genome', 'n_cells', 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts'
metadata_cell:'GEX_n_genes', 'GEX_doublet_score', 'GEX_predicted_doublet', 'GEX_n_genes_by_counts', 'GEX_total_counts', 'GEX_total_counts_mt', 'GEX_pct_counts_mt', 'GEX_ingest_celltype_label', 'GEX_leiden_res_0.8', 'GEX_celltype', 'ACC_barcode', 'ACC_cisTopic_nr_frag', 'ACC_cisTopic_log_nr_acc', 'ACC_Total_nr_frag_in_regions', 'ACC_cisTopic_nr_acc', 'ACC_TSS_enrichment', 'ACC_Log_total_nr_frag', 'ACC_cisTopic_log_nr_frag', 'ACC_Unique_nr_frag', 'ACC_Dupl_nr_frag', 'ACC_FRIP', 'ACC_Log_unique_nr_frag', 'ACC_Unique_nr_frag_in_regions', 'ACC_Dupl_rate', 'ACC_Total_nr_frag', 'ACC_n_genes', 'ACC_doublet_score', 'ACC_predicted_doublet', 'ACC_n_genes_by_counts', 'ACC_total_counts', 'ACC_total_counts_mt', 'ACC_pct_counts_mt', 'ACC_ingest_celltype_label', 'ACC_leiden_res_0.8', 'ACC_celltype', 'ACC_sample_id'
menr:'CTX_topics_otsu_All', 'CTX_topics_otsu_No_promoters', 'DEM_topics_otsu_All', 'DEM_topics_otsu_No_promoters', 'CTX_topics_top_3_All', 'CTX_topics_top_3_No_promoters', 'DEM_topics_top_3_All', 'DEM_topics_top_3_No_promoters', 'CTX_DARs_All', 'CTX_DARs_No_promoters', 'DEM_DARs_All', 'DEM_DARs_No_promoters'
dr_cell:'GEX_X_pca', 'GEX_X_umap', 'GEX_rep', 'eRegulons_UMAP', 'eRegulons_tSNE'
Gene expression and chromatin accessibility data
Both the raw gene expression counts and chromatin accessibility data are stored in the scplus_obj
and can be accessed by running, scplus_obj.to_df('EXP')
or scplus_obj.to_df('ACC')
.
[ ]:
scplus_obj.to_df('EXP').head()
AL627309.1 | AL627309.5 | AL627309.4 | AL669831.2 | LINC01409 | LINC01128 | LINC00115 | FAM41C | AL645608.2 | SAMD11 | ... | MT-ND6 | MT-CYB | AC145212.1 | MAFIP | AC011043.1 | AL354822.1 | AL592183.1 | AC240274.1 | AC004556.3 | AC007325.4 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
CCCTCACCACTAAGTT-1-10x_pbmc | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 13.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
GAGCTGCTCCTGTTCA-1-10x_pbmc | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 10.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
TCTCGCCCACCTCAGG-1-10x_pbmc | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 29.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
GTCAATATCCCATAAA-1-10x_pbmc | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
TCGTTAGCATGAATAG-1-10x_pbmc | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 21.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
5 rows × 21255 columns
[14]:
scplus_obj.to_df('ACC').head()
[14]:
CCCTCACCACTAAGTT-1-10x_pbmc | GAGCTGCTCCTGTTCA-1-10x_pbmc | TCTCGCCCACCTCAGG-1-10x_pbmc | GTCAATATCCCATAAA-1-10x_pbmc | TCGTTAGCATGAATAG-1-10x_pbmc | GATCAGTTCTCTAGCC-1-10x_pbmc | GTTTGTTTCCGGTATG-1-10x_pbmc | CCGCACACAGTTTCTC-1-10x_pbmc | AGTCAAGAGTCATTAG-1-10x_pbmc | AAGCGTTTCATTGTCT-1-10x_pbmc | ... | CCATATTTCCTCCTAA-1-10x_pbmc | GAGGTACAGGACCGCT-1-10x_pbmc | GCGCTAGGTCAAAGGG-1-10x_pbmc | AAATGCCTCCAGGAAA-1-10x_pbmc | TATTCGTTCTTGCAAA-1-10x_pbmc | AAGGCCCTCTTGATGA-1-10x_pbmc | ACTCGCTTCATGAAGG-1-10x_pbmc | CCTACTGGTCAAGTGC-1-10x_pbmc | GTCCGTAAGTTACCGG-1-10x_pbmc | GAAGCTAAGTTAGAGG-1-10x_pbmc | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
GL000194.1:101175-101675 | 0 | 7 | 0 | 0 | 6 | 0 | 0 | 0 | 1 | 2 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 9 |
GL000194.1:114712-115212 | 1 | 2 | 1 | 1 | 1 | 2 | 2 | 1 | 1 | 1 | ... | 2 | 2 | 1 | 2 | 1 | 2 | 1 | 1 | 2 | 1 |
GL000195.1:22453-22953 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
GL000195.1:30601-31101 | 32 | 25 | 33 | 35 | 24 | 32 | 19 | 28 | 24 | 21 | ... | 20 | 22 | 28 | 19 | 27 | 18 | 29 | 31 | 33 | 24 |
GL000195.1:32233-32733 | 12 | 14 | 14 | 13 | 12 | 16 | 10 | 12 | 13 | 11 | ... | 11 | 12 | 13 | 12 | 12 | 12 | 13 | 17 | 14 | 12 |
5 rows × 2415 columns
Cell, region and gene metadata
Cell metatdata is stored in the .metadata_cell
slot. Data comming from the gene expression side is prefixed with the string GEX_
ad data comming from the chromatin accessibility side is prefixed with the string ACC_
.
Region metadata is stored in the .metadata_region
slot.
Gene metadata is stored in the .metadata_gene
slot.
[16]:
scplus_obj.metadata_cell.head()
[16]:
GEX_n_genes | GEX_doublet_score | GEX_predicted_doublet | GEX_n_genes_by_counts | GEX_total_counts | GEX_total_counts_mt | GEX_pct_counts_mt | GEX_ingest_celltype_label | GEX_leiden_res_0.8 | GEX_celltype | ... | ACC_doublet_score | ACC_predicted_doublet | ACC_n_genes_by_counts | ACC_total_counts | ACC_total_counts_mt | ACC_pct_counts_mt | ACC_ingest_celltype_label | ACC_leiden_res_0.8 | ACC_celltype | ACC_sample_id | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
CCCTCACCACTAAGTT-1-10x_pbmc | 1684 | 0.046595 | False | 1683 | 3535.0 | 219.0 | 6.195191 | CD4 T cells | 0 | CD4_T_cells | ... | 0.046595 | False | 1683 | 3535.0 | 219.0 | 6.195191 | CD4 T cells | 0 | CD4_T_cells | 10x_pbmc |
GAGCTGCTCCTGTTCA-1-10x_pbmc | 1377 | 0.013972 | False | 1374 | 2356.0 | 222.0 | 9.422750 | NK cells | 9 | NK_cells | ... | 0.013972 | False | 1374 | 2356.0 | 222.0 | 9.422750 | NK cells | 9 | NK_cells | 10x_pbmc |
TCTCGCCCACCTCAGG-1-10x_pbmc | 1517 | 0.064877 | False | 1517 | 2746.0 | 285.0 | 10.378733 | CD4 T cells | 0 | CD4_T_cells | ... | 0.064877 | False | 1517 | 2746.0 | 285.0 | 10.378733 | CD4 T cells | 0 | CD4_T_cells | 10x_pbmc |
GTCAATATCCCATAAA-1-10x_pbmc | 1157 | 0.018142 | False | 1157 | 2984.0 | 84.0 | 2.815013 | CD4 T cells | 0 | CD4_T_cells | ... | 0.018142 | False | 1157 | 2984.0 | 84.0 | 2.815013 | CD4 T cells | 0 | CD4_T_cells | 10x_pbmc |
TCGTTAGCATGAATAG-1-10x_pbmc | 1589 | 0.008130 | False | 1589 | 3111.0 | 275.0 | 8.839602 | CD8 T cells | 5 | CD8_T_cells | ... | 0.008130 | False | 1589 | 3111.0 | 275.0 | 8.839602 | CD8 T cells | 5 | CD8_T_cells | 10x_pbmc |
5 rows × 36 columns
[17]:
scplus_obj.metadata_regions.head()
[17]:
Chromosome | Start | End | Width | cisTopic_nr_frag | cisTopic_log_nr_frag | cisTopic_nr_acc | cisTopic_log_nr_acc | |
---|---|---|---|---|---|---|---|---|
GL000194.1:101175-101675 | GL000194.1 | 101175 | 101675 | 500 | 27 | 1.431364 | 24 | 1.380211 |
GL000194.1:114712-115212 | GL000194.1 | 114712 | 115212 | 500 | 44 | 1.643453 | 44 | 1.643453 |
GL000195.1:22453-22953 | GL000195.1 | 22453 | 22953 | 500 | 7 | 0.845098 | 7 | 0.845098 |
GL000195.1:30601-31101 | GL000195.1 | 30601 | 31101 | 500 | 608 | 2.783904 | 525 | 2.720159 |
GL000195.1:32233-32733 | GL000195.1 | 32233 | 32733 | 500 | 278 | 2.444045 | 258 | 2.411620 |
[19]:
scplus_obj.metadata_genes.head()
[19]:
gene_ids | feature_types | genome | n_cells | mt | n_cells_by_counts | mean_counts | pct_dropout_by_counts | total_counts | |
---|---|---|---|---|---|---|---|---|---|
AL627309.1 | ENSG00000238009 | Gene Expression | GRCh38 | 28 | False | 27 | 0.010247 | 98.975332 | 27.0 |
AL627309.5 | ENSG00000241860 | Gene Expression | GRCh38 | 145 | False | 140 | 0.060342 | 94.686907 | 159.0 |
AL627309.4 | ENSG00000241599 | Gene Expression | GRCh38 | 9 | False | 9 | 0.003416 | 99.658444 | 9.0 |
AL669831.2 | ENSG00000229905 | Gene Expression | GRCh38 | 4 | False | 4 | 0.001518 | 99.848197 | 4.0 |
LINC01409 | ENSG00000237491 | Gene Expression | GRCh38 | 143 | False | 140 | 0.058824 | 94.686907 | 155.0 |
Motif enrichment data
Motif enrichment data is stored in the .menr
slot.
[20]:
scplus_obj.menr.keys()
[20]:
dict_keys(['CTX_topics_otsu_All', 'CTX_topics_otsu_No_promoters', 'DEM_topics_otsu_All', 'DEM_topics_otsu_No_promoters', 'CTX_topics_top_3_All', 'CTX_topics_top_3_No_promoters', 'DEM_topics_top_3_All', 'DEM_topics_top_3_No_promoters', 'CTX_DARs_All', 'CTX_DARs_No_promoters', 'DEM_DARs_All', 'DEM_DARs_No_promoters'])
Dimensionality reduction data
Dimensionality reductions of the cells are stored in the .dr_cell
slot. Reductions comming from the gene expression side of the data are prefixed with the string GEX_
and those comming from the chromatin accessibility side with ACC_
.
IF region based dimensionality reductions were calculated then those will be stored in the .dr_region
slote (not the case in this example).
[21]:
scplus_obj.dr_cell.keys()
[21]:
dict_keys(['GEX_X_pca', 'GEX_X_umap', 'GEX_rep', 'eRegulons_UMAP', 'eRegulons_tSNE'])
Unstructured data
Additional unstructured data will be stored in the .uns
slot. After running the standard scenicplus wrapper function this slot will contain the following entries:
Cistromes
: this contains TFs together with target regions based on the motif enrichment analysis (i.e. prior to running SCENIC+)search_space
: this is a dataframe containing the search space for each gene.region_to_gene
: this is a dataframe containing region to gene links prior to running SCENIC+ (i.e unfiltered/raw region to gene importance scores and correlation coefficients).TF2G_adj
: this is a datafram containing TF to gene links prior to running SCENIC+ (i.e unfiltered/raw TF to gene importance scores and correlation coefficients).
These four slots contain all the data necessary for running the SCENIC+ analysis. The following entries are produced by the SCENIC+ analysis:
eRegulons
: this is the raw output from the SCENIC+ analysis. We will go into a bit more detail for these below.eRegulon_metadata
: this is a dataframe containing the same information aseRegulons
bit in a format which is a bit easier to parse for a human.eRegulon_signatures
: this is a dictionary with target regions and genes for each eReguloneRegulon_AUC
: this slot contains dataframes with eRegulon enrichment scores calculated using AUCell (see below).pseudobulk
: contains pseudobulked gene expression and chromatin accessibility data, this is used to calculated TF to eRegulon correlation values.TF_cistrome_correlation
: contains correlation values between TF expression and eRegulon enrichment scores (seperate entries for target gene and target region based scores).eRegulon_AUC_thresholds
: contains thresholds on the AUC values (eRegulon enrichment scores), this is necessary to be able to visualize the results in SCopeRSS
: contains eRegulon Specificity Scores (RSS), a measure on how cell type specific an eRegulon is.DEGs
: contains Differentially Expressed Genes.DARs
: contains Differentially Accessibile Regions.
[23]:
scplus_obj.uns.keys()
[23]:
dict_keys(['Cistromes', 'search_space', 'region_to_gene', 'TF2G_adj', 'eRegulons', 'eRegulon_metadata', 'eRegulon_signatures', 'eRegulon_AUC', 'Pseudobulk', 'TF_cistrome_correlation', 'eRegulon_AUC_thresholds', 'RSS', 'DEGs', 'DARs'])
The eRegulons entry
The main output of SCENIC+ are eRegulons.
This is initially stored in a list of eRegulon
classes as depicted below.
[12]:
scplus_obj.uns['eRegulons'][0:5]
[12]:
[eRegulon for TF ARID3A in context frozenset({'0.95 quantile', 'Top 15 region-to-gene links per gene', 'BASC binarized', 'positive r2g', '0.9 quantile', '0.85 quantile', 'positive tf2g', 'Cistromes_Unfiltered', 'Top 10 region-to-gene links per gene'}).
This eRegulon has 364 target regions and 278 target genes.,
eRegulon for TF ARID5B in context frozenset({'Top 10 region-to-gene links per gene', '0.95 quantile', 'Top 15 region-to-gene links per gene', 'BASC binarized', 'positive r2g', '0.9 quantile', '0.85 quantile', 'positive tf2g', 'Cistromes_Unfiltered', 'Top 5 region-to-gene links per gene'}).
This eRegulon has 42 target regions and 30 target genes.,
eRegulon for TF ARNT in context frozenset({'0.95 quantile', 'positive r2g', '0.9 quantile', 'positive tf2g', 'Cistromes_Unfiltered'}).
This eRegulon has 29 target regions and 28 target genes.,
eRegulon for TF ARNTL in context frozenset({'Top 10 region-to-gene links per gene', '0.95 quantile', 'Top 15 region-to-gene links per gene', 'BASC binarized', 'positive r2g', '0.9 quantile', '0.85 quantile', 'positive tf2g', 'Cistromes_Unfiltered', 'Top 5 region-to-gene links per gene'}).
This eRegulon has 48 target regions and 42 target genes.,
eRegulon for TF ARNTL2 in context frozenset({'Top 10 region-to-gene links per gene', 'BASC binarized', 'positive r2g', 'Top 5 region-to-gene links per gene', 'positive tf2g', 'Cistromes_Unfiltered', 'Top 15 region-to-gene links per gene'}).
This eRegulon has 15 target regions and 13 target genes.]
each eRegulon has the following information (attributes):
cistrome_name
: name of the cistrome (fromscenicplus.uns['Cistromes']
) from which this eRegulon was created.context
: specifies the binarization method(s) used for binarizing region to gene relationships and wether positive/negative region-to-gene and TF-to-gene relationships were used.gsea_adj_pval
/gsea_enrichment_score
/gsea_pval
/in_leading_edge
: are internal parameters used for generating the eRegulons. The values are lost when generating the final eRegulons because results from several analysis (different binarization methods) are combined.is_extended
: specifies wether extended (i.e. non-direct) motif-to-TF annotations were used.n_target_genes
: number of target genes.n_target_regions
: number of target regions.regions2genes
: region to gene links after running SCENIC+.target_genes
: target genes of the eRegulontarget_regions
: target regions of the eRegulontranscription_factor
: TF name
[40]:
for attr in dir(scplus_obj.uns['eRegulons'][0]):
if not attr.startswith('_'):
print(f"{attr}: {getattr(scplus_obj.uns['eRegulons'][0], attr) if not type(getattr(scplus_obj.uns['eRegulons'][0], attr)) == list else getattr(scplus_obj.uns['eRegulons'][0], attr)[0:5]}")
cistrome_name: ARID3A_(7058r)
context: frozenset({'0.95 quantile', 'Top 15 region-to-gene links per gene', 'BASC binarized', 'positive r2g', '0.9 quantile', '0.85 quantile', 'positive tf2g', 'Cistromes_Unfiltered', 'Top 10 region-to-gene links per gene'})
gsea_adj_pval: None
gsea_enrichment_score: None
gsea_pval: None
in_leading_edge: None
is_extended: False
n_target_genes: 278
n_target_regions: 364
regions2genes: [r2g(region='chr3:185354177-185354677', target='EHHADH', importance=0.12225935425619862, rho=0.09670705521191818, importance_x_rho=0.011823342122227663, importance_x_abs_rho=0.011823342122227663), r2g(region='chr5:78541071-78541571', target='LHFPL2', importance=0.03653514743313229, rho=0.17765184287927935, importance_x_rho=0.006490536271362124, importance_x_abs_rho=0.006490536271362124), r2g(region='chr9:87556784-87557284', target='DAPK1', importance=0.04033415953652628, rho=0.6153329549640674, importance_x_rho=0.024818937573602835, importance_x_abs_rho=0.024818937573602835), r2g(region='chr4:70691607-70692107', target='JCHAIN', importance=0.012896043016525846, rho=0.0755891581507452, importance_x_rho=0.0009748010350949854, importance_x_abs_rho=0.0009748010350949854), r2g(region='chr6:116465231-116465731', target='DSE', importance=0.01788022660248461, rho=0.3107757139522671, importance_x_rho=0.005556740188015474, importance_x_abs_rho=0.005556740188015474)]
subset_leading_edge: <bound method eRegulon.subset_leading_edge of eRegulon for TF ARID3A in context frozenset({'0.95 quantile', 'Top 15 region-to-gene links per gene', 'BASC binarized', 'positive r2g', '0.9 quantile', '0.85 quantile', 'positive tf2g', 'Cistromes_Unfiltered', 'Top 10 region-to-gene links per gene'}).
This eRegulon has 364 target regions and 278 target genes.>
target_genes: ['TFEC', 'DOCK4', 'PDGFC', 'ZNF518A', 'GAS6']
target_regions: ['chr3:185354177-185354677', 'chr10:80443601-80444101', 'chr19:12796456-12796956', 'chr11:34613371-34613871', 'chr3:9650043-9650543']
transcription_factor: ARID3A
The information of all eRegulons is combined in the eRegulon_metadata
dataframe.
[24]:
scplus_obj.uns['eRegulon_metadata'].head()
[24]:
Region_signature_name | Gene_signature_name | TF | is_extended | Region | Gene | R2G_importance | R2G_rho | R2G_importance_x_rho | R2G_importance_x_abs_rho | TF2G_importance | TF2G_regulation | TF2G_rho | TF2G_importance_x_abs_rho | TF2G_importance_x_rho | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ARID3A_+_+_(364r) | ARID3A_+_+_(278g) | ARID3A | False | chr3:185354177-185354677 | EHHADH | 0.122259 | 0.096707 | 0.011823 | 0.011823 | 1.147479 | 1 | 0.114595 | 0.131495 | 0.131495 |
1 | ARID3A_+_+_(364r) | ARID3A_+_+_(278g) | ARID3A | False | chr3:185362861-185363361 | EHHADH | 0.049054 | 0.070493 | 0.003458 | 0.003458 | 1.147479 | 1 | 0.114595 | 0.131495 | 0.131495 |
2 | ARID3A_+_+_(364r) | ARID3A_+_+_(278g) | ARID3A | False | chr3:185280570-185281070 | EHHADH | 0.096006 | 0.057384 | 0.005509 | 0.005509 | 1.147479 | 1 | 0.114595 | 0.131495 | 0.131495 |
3 | ARID3A_+_+_(364r) | ARID3A_+_+_(278g) | ARID3A | False | chr5:78541071-78541571 | LHFPL2 | 0.036535 | 0.177652 | 0.006491 | 0.006491 | 4.805953 | 1 | 0.256177 | 1.231173 | 1.231173 |
4 | ARID3A_+_+_(364r) | ARID3A_+_+_(278g) | ARID3A | False | chr5:78483650-78484150 | LHFPL2 | 0.040108 | 0.215308 | 0.008635 | 0.008635 | 4.805953 | 1 | 0.256177 | 1.231173 | 1.231173 |
For the eRegulon names we use the following convetion:
<TF NAME>_<TF-TO-GENE RELATIONSHIP (+/-)>_<REGION-TO-GENE RELATIONSHIP (+/-)>_(NUMBER OF TARGET REGIONS(r)/GENES(g))
For example the name: ARID3A_+_+_(364r)
and ARID3A_+_+_(278g)
indicates that the we found an eRegulon for the TF ARID3A
which has 364
target regions and 278
target genes, that expression of the TF correlates positively with the expression of all the target genes (first +
sign) and that the accessibility of all target regions correlates positively with the expression of all target regions (seconf +
sign).
Given this convention we can have at maximum 4 (actually 8, see note below) different eRegulons for each TF, however not all four combinations are always found.
The table below describes a possible biological interpretation of each combination:
combinations of signs |
TF-to-gene correlation |
region-to-gene correlation |
interpretation |
---|---|---|---|
“+ +” |
positive |
positive |
This eRegulon is an activator which opens the chromatin of the target regions and induces gene expression of the target genes. |
“+ -” |
positive |
negative |
This eRegulon is an activator which closes the chromatin of the target regions and induces gene expression of the target genes (this is biologically quite implausible). |
“- +” |
negative |
positive |
This eRegulon is a repressor which closes the chromatin of the target regions and represses the expression of the target genes. |
“- -” |
negative |
negative |
This eRegulon is a repressor which opens the chromatin of the target regions and represses the expression of the target genes (this is biologically quite implausible). |
table of possible eRegulon names
Warning:
We suggest to mainly focus on activator eRegulons. Repressor eRegulons contain more false predicitons because it is more difficult to find negative correlations (it relies on the abscence of data instead of the prescence) and they are often found because of the prescence of a different TF from the same familly (i.e. one which has the same/similar DNA binding domain and thus DNA motif) for which the expression is anti-correlated. You can never trust predictions from SCENIC+ blindly but certainly the repressor eRegulons you should handle with caution.
Note:
Each eRegulon can be derived from motifs which are annotated directly to the TF (for example the annotation comes from a ChIP-seq experiment in a cell line of the same species) or the annotation can be extended (the annotation is infered based on orthology or motif similarity). The direct annotations are of higher quality, for that reason the we add the suffix _extended
to eRegulons derived from extended annotations. Because both annotations can exists for the same TF we can have a
maximum of 8 eRegulons per TF (4 from all the combinations described above each of which can be either extended or direct). We will handle all these types below and simplify the output.
Downstream analysis
We have finally finished the SCENIC+ analysis 🎉. Now we can start analysing the results, we will show some example analysis below but really the sky is the limit so feel free to be creative with your newly acquired eRegulons.
[1]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
import sys
_stderr = sys.stderr
null = open(os.devnull,'wb')
[2]:
import dill
work_dir = 'pbmc_tutorial'
scplus_obj = dill.load(open(os.path.join(work_dir, 'scenicplus/scplus_obj.pkl'), 'rb'))
Simplifying and filtering SCENIC+ output
Given the multitude of eRegulons that can be generated for each TF (see above) we will first simplify the result by: 1. Only keeping eRegulons with an extended annotation if there is no direct annotation available (given that the confidence of direct motif annotations is in genral higher). 2. Discarding eRegulons for which the region-to-gene correlation is negative (these are often noisy). 3. Renaming the eRegulons so that eRegulons with the suffix TF_+_+
become TF_+
and those with
TF_-_+
become TF_-
.
[4]:
from scenicplus.preprocessing.filtering import apply_std_filtering_to_eRegulons
apply_std_filtering_to_eRegulons(scplus_obj)
Only keeping positive R2G
Only keep extended if not direct
Getting signatures...
Simplifying eRegulons ...
This will create two new entries in the scenicplus object: scplus_obj.uns['eRegulon_metadata_filtered']
and scplus_obj.uns['eRegulon_signatures_filtered']
containing the simplified results. We will use these for downstream analysis.
[5]:
scplus_obj.uns['eRegulon_metadata_filtered'].head()
[5]:
Region_signature_name | Gene_signature_name | TF | is_extended | Region | Gene | R2G_importance | R2G_rho | R2G_importance_x_rho | R2G_importance_x_abs_rho | TF2G_importance | TF2G_regulation | TF2G_rho | TF2G_importance_x_abs_rho | TF2G_importance_x_rho | Consensus_name | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ARID3A_+_(364r) | ARID3A_+_(278g) | ARID3A | False | chr3:185354177-185354677 | EHHADH | 0.122259 | 0.096707 | 0.011823 | 0.011823 | 1.147479 | 1 | 0.114595 | 0.131495 | 0.131495 | ARID3A_+_+ |
1 | ARID3A_+_(364r) | ARID3A_+_(278g) | ARID3A | False | chr3:185362861-185363361 | EHHADH | 0.049054 | 0.070493 | 0.003458 | 0.003458 | 1.147479 | 1 | 0.114595 | 0.131495 | 0.131495 | ARID3A_+_+ |
2 | ARID3A_+_(364r) | ARID3A_+_(278g) | ARID3A | False | chr3:185280570-185281070 | EHHADH | 0.096006 | 0.057384 | 0.005509 | 0.005509 | 1.147479 | 1 | 0.114595 | 0.131495 | 0.131495 | ARID3A_+_+ |
3 | ARID3A_+_(364r) | ARID3A_+_(278g) | ARID3A | False | chr5:78541071-78541571 | LHFPL2 | 0.036535 | 0.177652 | 0.006491 | 0.006491 | 4.805953 | 1 | 0.256177 | 1.231173 | 1.231173 | ARID3A_+_+ |
4 | ARID3A_+_(364r) | ARID3A_+_(278g) | ARID3A | False | chr5:78483650-78484150 | LHFPL2 | 0.040108 | 0.215308 | 0.008635 | 0.008635 | 4.805953 | 1 | 0.256177 | 1.231173 | 1.231173 | ARID3A_+_+ |
eRegulon enrichment scores
We can score the enrichment of eRegulons using the AUCell function. This function takes as input a gene or region based ranking (ranking of genes/regions based on the expression/accessibility per cell) and a list of eRegulons.
These values were already calculated in the wrapper function but let’s recalculate them using the filtered output.
[8]:
from scenicplus.eregulon_enrichment import score_eRegulons
region_ranking = dill.load(open(os.path.join(work_dir, 'scenicplus/region_ranking.pkl'), 'rb')) #load ranking calculated using the wrapper function
gene_ranking = dill.load(open(os.path.join(work_dir, 'scenicplus/gene_ranking.pkl'), 'rb')) #load ranking calculated using the wrapper function
score_eRegulons(scplus_obj,
ranking = region_ranking,
eRegulon_signatures_key = 'eRegulon_signatures_filtered',
key_added = 'eRegulon_AUC_filtered',
enrichment_type= 'region',
auc_threshold = 0.05,
normalize = False,
n_cpu = 5)
score_eRegulons(scplus_obj,
gene_ranking,
eRegulon_signatures_key = 'eRegulon_signatures_filtered',
key_added = 'eRegulon_AUC_filtered',
enrichment_type = 'gene',
auc_threshold = 0.05,
normalize= False,
n_cpu = 5)
eRegulon dimensionality reduction
Based on the enrichment scores calculated above we can generate dimensionality reductions (e.g. tSNE and UMAP).
To calculate these dimensionality reductions we use both the regions and gene based enrichment scores.
[9]:
from scenicplus.dimensionality_reduction import run_eRegulons_tsne, run_eRegulons_umap
run_eRegulons_umap(
scplus_obj = scplus_obj,
auc_key = 'eRegulon_AUC_filtered',
reduction_name = 'eRegulons_UMAP', #overwrite previously calculated UMAP
)
run_eRegulons_tsne(
scplus_obj = scplus_obj,
auc_key = 'eRegulon_AUC_filtered',
reduction_name = 'eRegulons_tSNE', #overwrite previously calculated tSNE
)
Let’s visualize the UMAP and tSNE stored respectively in eRegulons_UMAP
and eRegulons_tSNE
, these are calculated based on the combined region and gene AUC values described above.
Let’s also add some nice colours by specifying a color_dictionary.
[10]:
from scenicplus.dimensionality_reduction import plot_metadata_given_ax
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
#specify color_dictionary
color_dict = {
'B_cells_1': "#065143",
'B_cells_2': "#70B77E",
'CD4_T_cells': "#E0A890",
'CD8_T_cells': "#F56476",
'NK_cells': "#CE1483",
'Dendritic_cells': "#053C5E" ,
'FCGR3A+_Monocytes': "#38A3A5",
'CD14+_Monocytes': "#80ED99"
}
fig, axs = plt.subplots(ncols=2, figsize = (16, 8))
plot_metadata_given_ax(
scplus_obj=scplus_obj,
ax = axs[0],
reduction_name = 'eRegulons_UMAP',
variable = 'GEX_celltype', #note the GEX_ prefix, this metadata originated from the gene expression metadata (on which we did the cell type annotation before)
color_dictionary={'GEX_celltype': color_dict}
)
plot_metadata_given_ax(
scplus_obj=scplus_obj,
ax = axs[1],
reduction_name = 'eRegulons_tSNE',
variable = 'GEX_celltype', #note the GEX_ prefix, this metadata originated from the gene expression metadata (on which we did the cell type annotation before)
color_dictionary={'GEX_celltype': color_dict}
)
fig.tight_layout()
sns.despine(ax = axs[0]) #remove top and right edge of axis border
sns.despine(ax = axs[1]) #remove top and right edge of axis border
plt.show()

plot the activity / expression of an eRegulon on the dimensionality reduction
Nex we visualize the gene expression and target gene and region activity of some eRegulons on the tSNE.
[101]:
from scenicplus.dimensionality_reduction import plot_eRegulon
plot_eRegulon(
scplus_obj = scplus_obj,
reduction_name = 'eRegulons_tSNE',
selected_regulons = ['EOMES_+', 'GATA3_+', 'TCF7_+', 'CEBPA_+', 'PAX5_+'],
scale = True,
auc_key = 'eRegulon_AUC_filtered')

We can also plot only the activity of an eRegulon
[104]:
from scenicplus.dimensionality_reduction import plot_AUC_given_ax
fig, ax = plt.subplots(figsize = (8,8))
plot_AUC_given_ax(
scplus_obj = scplus_obj,
reduction_name = 'eRegulons_tSNE',
feature = 'PAX5_+_(119g)',
ax = ax,
auc_key = 'eRegulon_AUC_filtered',
signature_key = 'Gene_based')
sns.despine(ax = ax)
plt.show()

dotplot-heatmap
For eRegulons it is often usefull to visualize both information on the TF/target genes expression and region accessibility at the same time.
A dotplot-heatmap is a useful way to visualize this. Here the color of the heatmap can be used to visualize one aspect of the eRegulon (for example TF expression) and the size of the dot can be used to visualize another aspect (for example the enrichment (AUC value) of eRegulon target regions).
Before we plot the the dotplot-heatmap let’s first select some high quality eRegulons to limit the amount of space we need for the plot. One metric which can be used for selecting eRegulons is the correlation between TF expression and target region enrichment scores (AUC values). Let’s (re)calculate this value based on the simplified eRegulons
We first generate pseudobulk gene expression and region accessibility data, per celltype, to limit the amount of noise for the correlation calculation.
[15]:
from scenicplus.cistromes import TF_cistrome_correlation, generate_pseudobulks
generate_pseudobulks(
scplus_obj = scplus_obj,
variable = 'GEX_celltype',
auc_key = 'eRegulon_AUC_filtered',
signature_key = 'Gene_based')
generate_pseudobulks(
scplus_obj = scplus_obj,
variable = 'GEX_celltype',
auc_key = 'eRegulon_AUC_filtered',
signature_key = 'Region_based')
TF_cistrome_correlation(
scplus_obj,
use_pseudobulk = True,
variable = 'GEX_celltype',
auc_key = 'eRegulon_AUC_filtered',
signature_key = 'Gene_based',
out_key = 'filtered_gene_based')
TF_cistrome_correlation(
scplus_obj,
use_pseudobulk = True,
variable = 'GEX_celltype',
auc_key = 'eRegulon_AUC_filtered',
signature_key = 'Region_based',
out_key = 'filtered_region_based')
[24]:
scplus_obj.uns['TF_cistrome_correlation']['filtered_region_based'].head()
[24]:
TF | Cistrome | Rho | P-value | Adjusted_p-value | |
---|---|---|---|---|---|
0 | MXD1 | MXD1_-_(10r) | -0.448720 | 6.872141e-41 | 1.583580e-40 |
1 | RREB1 | RREB1_-_(91r) | -0.700170 | 6.991867e-119 | 3.407531e-118 |
2 | SREBF2 | SREBF2_+_(191r) | 0.252339 | 4.369639e-13 | 6.939052e-13 |
3 | ELF3 | ELF3_+_(217r) | 0.107363 | 2.360010e-03 | 2.834687e-03 |
4 | KLF8 | KLF8_+_(324r) | 0.005392 | 8.789742e-01 | 8.894632e-01 |
Let’s visualize these correlations in a scatter plot and select eRegulons for which the correlaiton coefficient is above 0.70 or below -0.75
[41]:
import numpy as np
n_targets = [int(x.split('(')[1].replace('r)', '')) for x in scplus_obj.uns['TF_cistrome_correlation']['filtered_region_based']['Cistrome']]
rho = scplus_obj.uns['TF_cistrome_correlation']['filtered_region_based']['Rho'].to_list()
adj_pval = scplus_obj.uns['TF_cistrome_correlation']['filtered_region_based']['Adjusted_p-value'].to_list()
thresholds = {
'rho': [-0.75, 0.70],
'n_targets': 0
}
import seaborn as sns
fig, ax = plt.subplots(figsize = (10, 5))
sc = ax.scatter(rho, n_targets, c = -np.log10(adj_pval), s = 5)
ax.set_xlabel('Correlation coefficient')
ax.set_ylabel('nr. target regions')
#ax.hlines(y = thresholds['n_targets'], xmin = min(rho), xmax = max(rho), color = 'black', ls = 'dashed', lw = 1)
ax.vlines(x = thresholds['rho'], ymin = 0, ymax = max(n_targets), color = 'black', ls = 'dashed', lw = 1)
ax.text(x = thresholds['rho'][0], y = max(n_targets), s = str(thresholds['rho'][0]))
ax.text(x = thresholds['rho'][1], y = max(n_targets), s = str(thresholds['rho'][1]))
sns.despine(ax = ax)
fig.colorbar(sc, label = '-log10(adjusted_pvalue)', ax = ax)
plt.show()
/local_scratch/tmp-vsc33053/ipykernel_31660/1907631034.py:12: RuntimeWarning: divide by zero encountered in log10

[44]:
selected_cistromes = scplus_obj.uns['TF_cistrome_correlation']['filtered_region_based'].loc[
np.logical_or(
scplus_obj.uns['TF_cistrome_correlation']['filtered_region_based']['Rho'] > thresholds['rho'][1],
scplus_obj.uns['TF_cistrome_correlation']['filtered_region_based']['Rho'] < thresholds['rho'][0]
)]['Cistrome'].to_list()
selected_eRegulons = [x.split('_(')[0] for x in selected_cistromes]
selected_eRegulons_gene_sig = [
x for x in scplus_obj.uns['eRegulon_signatures_filtered']['Gene_based'].keys()
if x.split('_(')[0] in selected_eRegulons]
selected_eRegulons_region_sig = [
x for x in scplus_obj.uns['eRegulon_signatures_filtered']['Region_based'].keys()
if x.split('_(')[0] in selected_eRegulons]
#save the results in the scenicplus object
scplus_obj.uns['selected_eRegulon'] = {'Gene_based': selected_eRegulons_gene_sig, 'Region_based': selected_eRegulons_region_sig}
print(f'selected: {len(selected_eRegulons_gene_sig)} eRegulons')
selected: 76 eRegulons
Let’s save these changes we have made to the scenicplus_obj
[68]:
dill.dump(scplus_obj, open(os.path.join(work_dir, 'scenicplus/scplus_obj.pkl'), 'wb'), protocol=-1)
Let’s plot the heatmap-dotplot
[48]:
from scenicplus.plotting.dotplot import heatmap_dotplot
heatmap_dotplot(
scplus_obj = scplus_obj,
size_matrix = scplus_obj.uns['eRegulon_AUC_filtered']['Region_based'], #specify what to plot as dot sizes, target region enrichment in this case
color_matrix = scplus_obj.to_df('EXP'), #specify what to plot as colors, TF expression in this case
scale_size_matrix = True,
scale_color_matrix = True,
group_variable = 'GEX_celltype',
subset_eRegulons = scplus_obj.uns['selected_eRegulon']['Gene_based'],
index_order = ['B_cells_1', 'B_cells_2', 'CD4_T_cells', 'CD8_T_cells', 'NK_cells', 'Dendritic_cells', 'FCGR3A+_Monocytes', 'CD14+_Monocytes'],
figsize = (5, 20),
orientation = 'vertical')

[48]:
<ggplot: (2961768578286)>
overlap of predicted target regions
An interesting aspect of gene regulation is transcription factor cooperativity (i.e. multiple TFs cobinding the same enhancer together driving gene expression).
By looking at the overlap of predicted target regions of TFs we can infer potential cooperativity events.
Let’s look at the overlap of target regions of the top 5 TFs per cell type based on the Regulon Specificity Score (RSS).
First we calculate the RSS for the target regions of the selected eRegulons.
[51]:
from scenicplus.RSS import *
regulon_specificity_scores(
scplus_obj,
variable = 'GEX_celltype',
auc_key = 'eRegulon_AUC_filtered',
signature_keys = ['Region_based'],
selected_regulons = [x for x in scplus_obj.uns['selected_eRegulon']['Region_based'] if '-' not in x],
out_key_suffix = '_filtered')
Let’s visualize the RSS values using a scatter plot
[109]:
plot_rss(scplus_obj, 'GEX_celltype_filtered', num_columns=2, top_n=10, figsize = (5, 10))

Next we select the top 10 eRegulons per cell type
[65]:
flat_list = lambda t: [item for sublist in t for item in sublist]
selected_markers = list(set(flat_list(
[scplus_obj.uns['RSS']['GEX_celltype_filtered'].loc[celltype].sort_values(ascending = False).head(10).index.to_list()
for celltype in scplus_obj.uns['RSS']['GEX_celltype_filtered'].index])))
[67]:
from scenicplus.plotting.correlation_plot import *
region_intersetc_data, Z = jaccard_heatmap(
scplus_obj,
method = 'intersect',
gene_or_region_based = 'Region_based',
use_plotly = False,
selected_regulons = selected_markers,
signature_key = 'eRegulon_signatures_filtered',
figsize = (10, 10), return_data = True, vmax = 0.5, cmap = 'plasma')

Plotting a network
eRegulons can also be visualized in a network. Simple plots can be made using python. For more complicated plots (i.e. containing many nodes and edges) we suggest exporting your network to cytoscape.
Let’s create a very simple network for B cells. We will use the top 1000 highly variable regions and genes in this plot. If you want to use more feautures please export your nework to cytoscape.
[91]:
from pycisTopic.diff_features import find_highly_variable_features
hvr = find_highly_variable_features(scplus_obj.to_df('ACC').loc[list(set(scplus_obj.uns['eRegulon_metadata_filtered']['Region']))], n_top_features=1000, plot = False)
hvg = find_highly_variable_features(scplus_obj.to_df('EXP')[list(set(scplus_obj.uns['eRegulon_metadata_filtered']['Gene']))].T, n_top_features=1000, plot = False)
2022-08-08 17:02:54,730 cisTopic INFO Calculating mean
2022-08-08 17:02:54,783 cisTopic INFO Calculating variance
2022-08-08 17:02:56,192 cisTopic INFO Done!
2022-08-08 17:02:56,327 cisTopic INFO Calculating mean
2022-08-08 17:02:56,340 cisTopic INFO Calculating variance
2022-08-08 17:02:56,503 cisTopic INFO Done!
<Figure size 432x288 with 0 Axes>
<Figure size 432x288 with 0 Axes>
First we format the eRegulons into a table which can be used to create a network using the package networkx
[92]:
from scenicplus.networks import create_nx_tables, create_nx_graph, plot_networkx, export_to_cytoscape
nx_tables = create_nx_tables(
scplus_obj = scplus_obj,
eRegulon_metadata_key ='eRegulon_metadata_filtered',
subset_eRegulons = ['PAX5', 'EBF1', 'POU2AF1'],
subset_regions = hvr,
subset_genes = hvg,
add_differential_gene_expression = True,
add_differential_region_accessibility = True,
differential_variable = ['GEX_celltype'])
/user/leuven/330/vsc33053/sdewin/Programs/anaconda3/envs/scenicplus/lib/python3.8/site-packages/anndata/compat/_overloaded_dict.py:106: ImplicitModificationWarning: Trying to modify attribute `._uns` of view, initializing view as actual.
/user/leuven/330/vsc33053/sdewin/Programs/anaconda3/envs/scenicplus/lib/python3.8/site-packages/anndata/compat/_overloaded_dict.py:106: ImplicitModificationWarning: Trying to modify attribute `._uns` of view, initializing view as actual.
/user/leuven/330/vsc33053/sdewin/Programs/anaconda3/envs/scenicplus/lib/python3.8/site-packages/anndata/compat/_overloaded_dict.py:106: ImplicitModificationWarning: Trying to modify attribute `._uns` of view, initializing view as actual.
Next we layout the graph.
[93]:
G, pos, edge_tables, node_tables = create_nx_graph(nx_tables,
use_edge_tables = ['TF2R','R2G'],
color_edge_by = {'TF2R': {'variable' : 'TF', 'category_color' : {'PAX5': 'Orange', 'EBF1': 'Purple', 'POU2AF1': 'Red'}},
'R2G': {'variable' : 'R2G_rho', 'continuous_color' : 'viridis', 'v_min': -1, 'v_max': 1}},
transparency_edge_by = {'R2G': {'variable' : 'R2G_importance', 'min_alpha': 0.1, 'v_min': 0}},
width_edge_by = {'R2G': {'variable' : 'R2G_importance', 'max_size' : 1.5, 'min_size' : 1}},
color_node_by = {'TF': {'variable': 'TF', 'category_color' : {'PAX5': 'Orange', 'EBF1': 'Purple', 'POU2AF1': 'Red'}},
'Gene': {'variable': 'GEX_celltype_Log2FC_B_cells_1', 'continuous_color' : 'bwr'},
'Region': {'variable': 'GEX_celltype_Log2FC_B_cells_1', 'continuous_color' : 'viridis'}},
transparency_node_by = {'Region': {'variable' : 'GEX_celltype_Log2FC_B_cells_1', 'min_alpha': 0.1},
'Gene': {'variable' : 'GEX_celltype_Log2FC_B_cells_1', 'min_alpha': 0.1}},
size_node_by = {'TF': {'variable': 'fixed_size', 'fixed_size': 30},
'Gene': {'variable': 'fixed_size', 'fixed_size': 15},
'Region': {'variable': 'fixed_size', 'fixed_size': 10}},
shape_node_by = {'TF': {'variable': 'fixed_shape', 'fixed_shape': 'ellipse'},
'Gene': {'variable': 'fixed_shape', 'fixed_shape': 'ellipse'},
'Region': {'variable': 'fixed_shape', 'fixed_shape': 'diamond'}},
label_size_by = {'TF': {'variable': 'fixed_label_size', 'fixed_label_size': 20.0},
'Gene': {'variable': 'fixed_label_size', 'fixed_label_size': 10.0},
'Region': {'variable': 'fixed_label_size', 'fixed_label_size': 0.0}},
layout='kamada_kawai_layout',
scale_position_by=250)
Finally we can visualize the network.
In this network diamond shapes represent regions and they are color coded by their log2fc value in B cells target genes and TFs are visualized using circles and are labeled.
[94]:
plt.figure(figsize=(10,10))
plot_networkx(G, pos)

We can also export this network to a format which can be opened in Cytoscape.
[110]:
export_to_cytoscape(G, pos, out_file = os.path.join(work_dir, 'scenicplus/network_B_cells.cys'))
This network can be imported using file -> import -> Network from file ...
Also make sure to import the SCENIC+ network layout using file -> import -> Styles from file ...
.
This layout is available under cytoscape_styles/SCENIC+.xml.
Conclusion
This concludes the SCENIC+ tutorial on PBMCs.
In this tutorial we started from a fragments file and a gene expression matrix, which are publicly available on the 10x website.
Based on the gene expression side of the data we annotated cell types and used these annotations for generating pseudobulk ATAC-seq profiles on which we called consensus peask.
This consensus peaks set was used as features for running pycisTopic with which we looked for enhancer candidates and using pycistarget we found which motifs were enriched in these candidates.
Finally, we combined all these results in order to run SCENIC+ for which we did some downstream exploration.
Below is an overview of all the files generated in this tutorial.
[2]:
!tree pbmc_tutorial/
pbmc_tutorial/
|-- bedToBigBed
|-- data
| |-- pbmc_granulocyte_sorted_3k_atac_fragments.tsv.gz
| |-- pbmc_granulocyte_sorted_3k_filtered_feature_bc_matrix.h5
| `-- utoronto_human_tfs_v_1.01.txt
|-- motifs
| |-- CTX_DARs_All
| | |-- B_cells_1.html
| | |-- B_cells_2.html
| | |-- CD14+_Monocytes.html
| | |-- CD4_T_cells.html
| | |-- CD8_T_cells.html
| | |-- Dendritic_cells.html
| | |-- FCGR3A+_Monocytes.html
| | `-- NK_cells.html
| |-- CTX_DARs_No_promoters
| | |-- B_cells_1.html
| | |-- B_cells_2.html
| | |-- CD14+_Monocytes.html
| | |-- CD4_T_cells.html
| | |-- CD8_T_cells.html
| | |-- Dendritic_cells.html
| | |-- FCGR3A+_Monocytes.html
| | `-- NK_cells.html
| |-- CTX_topics_otsu_All
| | |-- Topic1.html
| | |-- Topic10.html
| | |-- Topic11.html
| | |-- Topic12.html
| | |-- Topic13.html
| | |-- Topic14.html
| | |-- Topic15.html
| | |-- Topic16.html
| | |-- Topic2.html
| | |-- Topic3.html
| | |-- Topic4.html
| | |-- Topic5.html
| | |-- Topic6.html
| | |-- Topic7.html
| | |-- Topic8.html
| | `-- Topic9.html
| |-- CTX_topics_otsu_No_promoters
| | |-- Topic1.html
| | |-- Topic10.html
| | |-- Topic11.html
| | |-- Topic12.html
| | |-- Topic13.html
| | |-- Topic14.html
| | |-- Topic15.html
| | |-- Topic16.html
| | |-- Topic2.html
| | |-- Topic3.html
| | |-- Topic4.html
| | |-- Topic5.html
| | |-- Topic6.html
| | |-- Topic7.html
| | |-- Topic8.html
| | `-- Topic9.html
| |-- CTX_topics_top_3_All
| | |-- Topic1.html
| | |-- Topic10.html
| | |-- Topic11.html
| | |-- Topic12.html
| | |-- Topic13.html
| | |-- Topic14.html
| | |-- Topic15.html
| | |-- Topic16.html
| | |-- Topic2.html
| | |-- Topic3.html
| | |-- Topic4.html
| | |-- Topic5.html
| | |-- Topic6.html
| | |-- Topic7.html
| | |-- Topic8.html
| | `-- Topic9.html
| |-- CTX_topics_top_3_No_promoters
| | |-- Topic1.html
| | |-- Topic10.html
| | |-- Topic11.html
| | |-- Topic12.html
| | |-- Topic13.html
| | |-- Topic14.html
| | |-- Topic15.html
| | |-- Topic16.html
| | |-- Topic2.html
| | |-- Topic3.html
| | |-- Topic4.html
| | |-- Topic5.html
| | |-- Topic6.html
| | |-- Topic7.html
| | |-- Topic8.html
| | `-- Topic9.html
| |-- DEM_DARs_All
| | |-- B_cells_1.html
| | |-- B_cells_2.html
| | |-- CD14+_Monocytes.html
| | |-- CD4_T_cells.html
| | |-- CD8_T_cells.html
| | |-- Dendritic_cells.html
| | |-- FCGR3A+_Monocytes.html
| | `-- NK_cells.html
| |-- DEM_DARs_No_promoters
| | |-- B_cells_1.html
| | |-- B_cells_2.html
| | |-- CD14+_Monocytes.html
| | |-- CD4_T_cells.html
| | |-- CD8_T_cells.html
| | |-- Dendritic_cells.html
| | |-- FCGR3A+_Monocytes.html
| | `-- NK_cells.html
| |-- DEM_topics_otsu_All
| | |-- Topic1.html
| | |-- Topic10.html
| | |-- Topic11.html
| | |-- Topic12.html
| | |-- Topic13.html
| | |-- Topic14.html
| | |-- Topic15.html
| | |-- Topic16.html
| | |-- Topic2.html
| | |-- Topic3.html
| | |-- Topic4.html
| | |-- Topic5.html
| | |-- Topic6.html
| | |-- Topic7.html
| | |-- Topic8.html
| | `-- Topic9.html
| |-- DEM_topics_otsu_No_promoters
| | |-- Topic1.html
| | |-- Topic10.html
| | |-- Topic11.html
| | |-- Topic12.html
| | |-- Topic13.html
| | |-- Topic14.html
| | |-- Topic15.html
| | |-- Topic16.html
| | |-- Topic2.html
| | |-- Topic3.html
| | |-- Topic4.html
| | |-- Topic5.html
| | |-- Topic6.html
| | |-- Topic7.html
| | |-- Topic8.html
| | `-- Topic9.html
| |-- DEM_topics_top_3_All
| | |-- Topic1.html
| | |-- Topic10.html
| | |-- Topic11.html
| | |-- Topic12.html
| | |-- Topic13.html
| | |-- Topic14.html
| | |-- Topic15.html
| | |-- Topic16.html
| | |-- Topic2.html
| | |-- Topic3.html
| | |-- Topic4.html
| | |-- Topic5.html
| | |-- Topic6.html
| | |-- Topic7.html
| | |-- Topic8.html
| | `-- Topic9.html
| |-- DEM_topics_top_3_No_promoters
| | |-- Topic1.html
| | |-- Topic10.html
| | |-- Topic11.html
| | |-- Topic12.html
| | |-- Topic13.html
| | |-- Topic14.html
| | |-- Topic15.html
| | |-- Topic16.html
| | |-- Topic2.html
| | |-- Topic3.html
| | |-- Topic4.html
| | |-- Topic5.html
| | |-- Topic6.html
| | |-- Topic7.html
| | |-- Topic8.html
| | `-- Topic9.html
| `-- menr.pkl
|-- scATAC
| |-- candidate_enhancers
| | |-- markers_dict.pkl
| | |-- region_bin_topics_otsu.pkl
| | `-- region_bin_topics_top3k.pkl
| |-- cistopic_obj.pkl
| |-- consensus_peak_calling
| | |-- MACS
| | | |-- B_cells_1_peaks.narrowPeak
| | | |-- B_cells_1_peaks.xls
| | | |-- B_cells_1_summits.bed
| | | |-- B_cells_2_peaks.narrowPeak
| | | |-- B_cells_2_peaks.xls
| | | |-- B_cells_2_summits.bed
| | | |-- CD14_Monocytes_peaks.narrowPeak
| | | |-- CD14_Monocytes_peaks.xls
| | | |-- CD14_Monocytes_summits.bed
| | | |-- CD4_T_cells_peaks.narrowPeak
| | | |-- CD4_T_cells_peaks.xls
| | | |-- CD4_T_cells_summits.bed
| | | |-- CD8_T_cells_peaks.narrowPeak
| | | |-- CD8_T_cells_peaks.xls
| | | |-- CD8_T_cells_summits.bed
| | | |-- Dendritic_cells_peaks.narrowPeak
| | | |-- Dendritic_cells_peaks.xls
| | | |-- Dendritic_cells_summits.bed
| | | |-- FCGR3A_Monocytes_peaks.narrowPeak
| | | |-- FCGR3A_Monocytes_peaks.xls
| | | |-- FCGR3A_Monocytes_summits.bed
| | | |-- NK_cells_peaks.narrowPeak
| | | |-- NK_cells_peaks.xls
| | | |-- NK_cells_summits.bed
| | | `-- narrow_peaks_dict.pkl
| | |-- consensus_regions.bed
| | |-- pseudobulk_bed_files
| | | |-- B_cells_1.bed.gz
| | | |-- B_cells_2.bed.gz
| | | |-- CD14_Monocytes.bed.gz
| | | |-- CD4_T_cells.bed.gz
| | | |-- CD8_T_cells.bed.gz
| | | |-- Dendritic_cells.bed.gz
| | | |-- FCGR3A_Monocytes.bed.gz
| | | |-- NK_cells.bed.gz
| | | |-- bed_paths.pkl
| | | `-- bw_paths.pkl
| | `-- pseudobulk_bw_files
| | |-- B_cells_1.bw
| | |-- B_cells_2.bw
| | |-- CD14_Monocytes.bw
| | |-- CD4_T_cells.bw
| | |-- CD8_T_cells.bw
| | |-- Dendritic_cells.bw
| | |-- FCGR3A_Monocytes.bw
| | `-- NK_cells.bw
| |-- models
| | `-- 10x_pbmc_models_500_iter_LDA.pkl
| `-- quality_control
| |-- bc_passing_filters.pkl
| |-- metadata_bc.pkl
| `-- profile_data_dict.pkl
|-- scRNA
| `-- adata.h5ad
`-- scenicplus
|-- SCENIC+_gene_based.loom
|-- SCENIC+_region_based.loom
|-- eRegulons.bb
|-- eRegulons.bed
|-- gene_ranking.pkl
|-- interact.as
|-- interact.bed.tmp
|-- network_B_cells.cys
|-- r2g.importance.bb
|-- r2g.importance.bed
|-- r2g.rho.bb
|-- r2g.rho.bed
|-- region_ranking.pkl
`-- scplus_obj.pkl
24 directories, 232 files
The files in the folder scATAC/consensus_peak_calling/pseudobulk_bw_files/*
and scATAC/consensus_peak_calling/MACS/*.narrowPeak
are very useful for visualising cell type specific region accessibility in IGV or the UCSC genomebrowser. You can combine this (only in the UCSC genome browser) with scenicplus/eRegulons.bb
, scenicplus/r2g.importance.bb
and/or scenicplus/r2g.rho.bb
to get a more complete view including region to gene links and predicted eRegulon bindin sites.
The files scenicplus/SCENIC+_gene_based.loom
and scenicplus/SCENIC+_region_based.loom
can be visualized in SCope.
The file: scenicplus/network_B_cells.cys
can be visualised in Cytoscape.