Creating a custom cistarget database#

In this tutorial we will create a custom cistarget database using consensus peaks.

This involves precomputing scores for all the motifs in our motif collection on a predefined set of regions.

We provide precomputed databases for human, mouse and fly, computed on regulatory regions spanning the genome. Feel free to use them. For the best results, however, we recommend generating a custom database, because the precomputed databases most likely do not cover all the regions in your consensus peak set.

Download create_cisTarget_databases#

We will start by cloning the create_cisTarget_databases repository.

[2]:
cd /staging/leuven/stg_00002/lcb/sdewin/PhD/python_modules/scenicplus_development_tutorial/ctx_db
source /staging/leuven/stg_00002/mambaforge/vsc33053/etc/profile.d/conda.sh
conda activate scenicplus_development_tutorial
[6]:
git clone https://github.com/aertslab/create_cisTarget_databases
Cloning into 'create_cisTarget_databases'...
remote: Enumerating objects: 552, done.
remote: Counting objects: 100% (552/552), done.
remote: Compressing objects: 100% (268/268), done.
remote: Total 552 (delta 332), reused 467 (delta 247), pack-reused 0
Receiving objects: 100% (552/552), 179.97 KiB | 4.00 MiB/s, done.
Resolving deltas: 100% (332/332), done.
create_cistarget_databases_dir='/lustre1/project/stg_00002/lcb/sdewin/PhD/python_modules/scenicplus_development_tutorial/ctx_db'

Download cluster-buster#

Cluster-Buster will be used to score the regions with our motif collection. We provide a precompiled Cluster-Buster binary.

[7]:
wget https://resources.aertslab.org/cistarget/programs/cbust
chmod a+x cbust
--2024-03-06 14:39:06--  https://resources.aertslab.org/cistarget/programs/cbust
Resolving resources.aertslab.org (resources.aertslab.org)... 134.58.50.9
Connecting to resources.aertslab.org (resources.aertslab.org)|134.58.50.9|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3209632 (3.1M)
Saving to: ‘cbust’

cbust               100%[===================>]   3.06M  14.1MB/s    in 0.2s

2024-03-06 14:39:07 (14.1 MB/s) - ‘cbust’ saved [3209632/3209632]
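
Before moving on, it can help to confirm that the downloaded binary is really executable. The sketch below is only a suggestion: check_executable is a hypothetical helper, and extending PATH only helps if the database script looks up cbust by name, which is an assumption you should verify against the script's --help output.

```shell
# Hypothetical helper: succeed only if the given path is an executable regular file.
check_executable() {
    if [ -f "$1" ] && [ -x "$1" ]; then
        echo "OK: $1 is executable"
    else
        echo "ERROR: $1 is missing or not executable" >&2
        return 1
    fi
}

# Check the binary downloaded above (an error message is printed if it is absent).
check_executable "${PWD}/cbust" || true

# Assumption: the database script finds cbust through PATH.
export PATH="${PWD}:${PATH}"
```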

Download motif collection#

Next, we will download the motif collection.

[8]:
mkdir -p aertslab_motif_colleciton
wget -O aertslab_motif_colleciton/v10nr_clust_public.zip https://resources.aertslab.org/cistarget/motif_collections/v10nr_clust_public/v10nr_clust_public.zip
--2024-03-06 14:42:21--  https://resources.aertslab.org/cistarget/motif_collections/v10nr_clust_public/v10nr_clust_public.zip
Resolving resources.aertslab.org (resources.aertslab.org)... 134.58.50.9
Connecting to resources.aertslab.org (resources.aertslab.org)|134.58.50.9|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 89706219 (86M) [application/zip]
Saving to: ‘aertslab_motif_colleciton/v10nr_clust_public.zip’

aertslab_motif_coll 100%[===================>]  85.55M   109MB/s    in 0.8s

2024-03-06 14:42:22 (109 MB/s) - ‘aertslab_motif_colleciton/v10nr_clust_public.zip’ saved [89706219/89706219]

[15]:
cd aertslab_motif_colleciton; unzip -q v10nr_clust_public.zip
cd ..

These are the motif-to-TF annotations for:

  • chicken: motifs-v10-nr.chicken-m0.00001-o0.0.tbl

  • fly: motifs-v10-nr.flybase-m0.00001-o0.0.tbl

  • human: motifs-v10-nr.hgnc-m0.00001-o0.0.tbl

  • mouse: motifs-v10-nr.mgi-m0.00001-o0.0.tbl

[4]:
ls aertslab_motif_colleciton/v10nr_clust_public/snapshots/
motifs-v10-nr.chicken-m0.00001-o0.0.tbl  motifs-v10-nr.hgnc-m0.00001-o0.0.tbl
motifs-v10-nr.flybase-m0.00001-o0.0.tbl  motifs-v10-nr.mgi-m0.00001-o0.0.tbl

Here are some example motifs. They are stored in cb format.

[3]:
ls -l aertslab_motif_colleciton/v10nr_clust_public/singletons | head
total 42412
-rw-rw-r--+ 1 vsc33053 vsc33053   163 Jan 27  2022 bergman__Adf1.cb
-rw-rw-r--+ 1 vsc33053 vsc33053    75 Jan 27  2022 bergman__Aef1.cb
-rw-rw-r--+ 1 vsc33053 vsc33053    75 Jan 27  2022 bergman__Hr46.cb
-rw-rw-r--+ 1 vsc33053 vsc33053   113 Jan 27  2022 bergman__Kr.cb
-rw-rw-r--+ 1 vsc33053 vsc33053    86 Jan 27  2022 bergman__Su_H_.cb
-rw-rw-r--+ 1 vsc33053 vsc33053    83 Jan 27  2022 bergman__TFAM.cb
-rw-rw-r--+ 1 vsc33053 vsc33053    77 Jan 27  2022 bergman__ap.cb
-rw-rw-r--+ 1 vsc33053 vsc33053   398 Jan 27  2022 bergman__bcd.cb
-rw-rw-r--+ 1 vsc33053 vsc33053    84 Jan 27  2022 bergman__bin.cb
[2]:
cat aertslab_motif_colleciton/v10nr_clust_public/singletons/bergman__Adf1.cb
>bergman__Adf1
0       0       100     0
0       100     0       0
0       0       0       100
0       50      50      0
0       100     0       0
0       50      0       50
0       0       50      50
0       100     0       0
0       50      0       50
0       0       100     0
0       50      0       50
33.33333333     33.33333333     0       33.33333333
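
As the example shows, a cb file holds a header line starting with ">" followed by one row of four values per motif position. A minimal sketch for summarising such files is given below; cb_motif_lengths is a hypothetical helper name, and it only counts rows, so it makes no assumption about which column corresponds to which nucleotide.

```shell
# Hypothetical helper: print "<motif name><TAB><number of positions>"
# for every motif found in a .cb file.
cb_motif_lengths() {
    awk '
        /^>/    { if (name != "") print name "\t" n   # flush the previous motif
                  name = substr($1, 2); n = 0; next }
        NF == 4 { n++ }                               # one matrix row per position
        END     { if (name != "") print name "\t" n }
    ' "$1"
}

# Apply it to the motif shown above, if the collection has been unpacked here.
motif="aertslab_motif_colleciton/v10nr_clust_public/singletons/bergman__Adf1.cb"
if [ -f "${motif}" ]; then
    cb_motif_lengths "${motif}"
fi
```

For the motif printed above this reports 12 positions for bergman__Adf1.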

Prepare fasta from consensus regions#

Next, we will get the sequences for all the consensus peaks. We will also add 1 kb of padding to each region, which Cluster-Buster uses as background sequence. Adding this padding is completely optional; we have noticed that it has little effect on the analyses.

[5]:
module load cluster/wice/bigmem
module load BEDTools/2.30.0-GCC-10.3.0

REGION_BED="/staging/leuven/stg_00002/lcb/sdewin/PhD/python_modules/pycisTopic_polars_tutorial/outs/consensus_peak_calling/consensus_regions.bed"
GENOME_FASTA="/staging/leuven/res_00001/genomes/homo_sapiens/hg38_ucsc/fasta/hg38.fa"
CHROMSIZES="/staging/leuven/res_00001/genomes/homo_sapiens/hg38_ucsc/fasta/hg38.chrom.sizes"
DATABASE_PREFIX="10x_brain_1kb_bg_with_mask"
SCRIPT_DIR="/staging/leuven/stg_00002/lcb/sdewin/PhD/python_modules/scenicplus_development_tutorial/ctx_db/create_cisTarget_databases"

${SCRIPT_DIR}/create_fasta_with_padded_bg_from_bed.sh \
        ${GENOME_FASTA} \
        ${CHROMSIZES} \
        ${REGION_BED} \
        hg38.10x_brain.with_1kb_bg_padding.fa \
        1000 \
        yes

Lmod is automatically replacing "cluster/genius/dedicated_big_bigmem" with
"cluster/wice/bigmem".


Inactive Modules:
  1) GCCcore/6.4.0                    3) ncurses/6.0-GCCcore-6.4.0
  2) libevent/2.1.8-GCCcore-6.4.0     4) tmux


Activating Modules:
  1) GCCcore/10.3.0
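
To make the effect of the padding on coordinates concrete, here is a small sketch of the arithmetic: each region is extended by a fixed number of base pairs on both sides and clipped to the chromosome boundaries. It is an illustration only, not the implementation of create_fasta_with_padded_bg_from_bed.sh; pad_bed is a hypothetical helper and the four-column output layout is our own choice.

```shell
# Hypothetical helper: extend every BED region by a fixed padding on both sides,
# clipping at 0 and at the chromosome length.
#   $1 = BED file, $2 = chromosome sizes file (name<TAB>length), $3 = padding in bp
# Output: chrom, padded start, padded end, original region as chrom:start-end
pad_bed() {
    awk -v pad="$3" 'BEGIN { OFS = "\t" }
        NR == FNR { size[$1] = $2; next }      # first file: chromosome sizes
        ($1 in size) {
            new_start = $2 - pad
            if (new_start < 0) new_start = 0
            new_end = $3 + pad
            if (new_end > size[$1]) new_end = size[$1]
            print $1, new_start, new_end, $1 ":" $2 "-" $3
        }' "$2" "$1"
}

# Example call with the variables defined in the cell above:
#   pad_bed "${REGION_BED}" "${CHROMSIZES}" 1000 | head
```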

[11]:
head -n 2 hg38.10x_brain.with_1kb_bg_padding.fa
>chr1:818570-819070
TGATTGTAAAGCACGGAATGACTCTTAGAAACTGGGCGTCATTCTTTGTGGTTTTCCAAGCTTGGTCTCTGATGATACTCCAGGTCTTAGGAGACATGCTGAATATTTATTATGCTTACATTCAAGCAACATTAACCCTTAAGGTTGATGTAGCTCCCCGTCTTTTTTTCCCAGAAGGAGGAGCACTGAAGGAACACTTTTCCAGTATGGATTCTTTCCAGCTCCGAGAAGCTGGAGGCACACGGATCCCTCGGCCAGCTCTCATCTATGGACGTGCTGTAGTCACAAGGACTGTGACTAAGGCTCAGTCCCTGAGGACTGCCTTGGCATGGGCTGCTTTAGGCTGTAAACACCCAGTTTTATCCACTTTATGTGAAGAAAGCCAACAAGGGGCATGGAGTGAGTTCCGCAGGTTTTAGCGGCTGCGGCGGCTGGTGCTCAGTGGGGATGATGGCGGGAAGGCGCCTCCctctgtgggccccgaggtctgtgcgggaatcagctctgcagctgtgtccaggggcagccgtagaccacacacggcaggctcacagctctgttccatgagaactttatacacaaaagcagacgggctgggcttggcctctggatcataatctgctgacccctgGGTAAGAAATTTTAAATATTTACTTATTTCTGTTCAACAGAAGGGGTGATATACTGAGGAGTGAATAATGGGAAAGATCTGATTCGGCTGTATCAGGAAGGACTGGTGTAAATTCAACTTATTAACTGAATTCACAGTATTCGTGTTTTATGCCTTTAGGGGTTAAAAATGGGTCACACACGAGCAGCATGCACTTCACTGGCGTGGCAGGGCACCTCAGTGTTTACATGTGTGGTTCCCATGCTTACCAGGGCTGGAGGCCCCTGTGAGTAGTGAAGTGCATGTGGAGTTCTGGATACTTTTCCTGGCTTTCTCTATTTGTGTGAGCTTGTGCAGTTAGAGGTTTGGGCTGAATTTGGGTAGAAATGGGTGGCTCACAGGCTGCAAAAGTTCTGTGGACACTTTTTCCCCCAGCTGATTaatgttgtaaatattagaatattgttacataaaagtctggatttttagtttctttcacattggaatagctgccaacattgggcctgcattcatctctctagggcaacgtcggctgcagctgagatggctgctccccggtggggtgtgtgctcggcctgcagtccccgccctccGGACTCCATTCGCCTCCACTCTCAGGTTTGCACCTCGTCATTGTCTTCTAATTTTGCATCCCTGGACTGCGTGACCTACAAGGCTCTCAGCACAACAAGACTCTATGATTCTGTCTATTGGAACAAAAAGCCAGTGAGGCAAGTGTATCATCCTGTTGATGAATTCACAGCATTAACTCTGGGAGTTGGGGACAGTGTGTATTCTTCCTCCAGACACTCTCTGTTTCTCCTGGATGGAAAGGTTCTGCTACTTGTCCCGTGGTCAGGCCCAgccaatggaacggaatggaagtgactctgccccttattggcagaaactttaaaagccgcacaacgttcctgcaccctcccctctgccatgagcctggcagtgctcaggatgggaaaattatctcacctgggcctgaggatacaggagctacccccagcctgcagtggaagagaagcatggacaagtgattaaactttgtgttttcaagccacagaggttttttgaagttgtttgctacCATGCTTTGTCCCTACAAACACAGTCATGGAGAAGGCCAGTGGCAGAGCCTGAGCCGTTCGCGCATCTGTTCACCAGTATCCAGAATAACAATAGATTTTTGAAACATTCCTGAGAAAATTCTGGGAGTTGCATACCGGCCAGTCTTATTCTCTAAAGTTGTTCCTTCTAAAGGGTGTGATGACCGAAAATTTCAGAAAAGCAAACCACCGCTGAAAGGCAACGTTATTTCTGTTGGCAGAAGGCGGCCTGAGCAATCTAGATTTTC
CACGGTTCACCAACTAGTTTTTAAGGAAATATGGCTGTGagaggaataaaacatgattcctacctttaaggaactcagagAAGTGAATTAAAGGAAGTCACAGATCAGACAACCAACCACACAAAGTTTCTAAGAGCAAACTGTTCAGGTCGGCAAGTCActcttatccactgttttgccttctaaggtttcagttactctcagtcagtcatggtccaaaaacattaaatgaaaaattccagaaataaacaatacacacgtgttaaatcatgtttcattctgagtagcttgatgaagtctcatgccgtcccactcagccccacctggggtgtgacacctccctctgtcgagcagatccaccctgtctatactacctgcTTTTCCAGGAGATCCACCCTGTCTAGACTACCTGCGTGGCCAGCAGATCCACCCTATCTACACTACCTGCTTTTCCAGCAGATCCACCCTGTCTACACTACCTGCCTGTCCAGCAGATCAAC
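
A quick sanity check at this point is to compare the number of FASTA records with the number of regions in the BED file. The sketch below assumes that the FASTA contains exactly one record per BED region, which is our assumption rather than something guaranteed by the script; check_fasta_matches_bed is a hypothetical helper.

```shell
# Hypothetical helper: compare the number of FASTA records with the number of
# BED regions. Returns non-zero when the two counts differ.
#   $1 = FASTA file, $2 = BED file
check_fasta_matches_bed() {
    fasta_n=$(awk '/^>/ { n++ } END { print n + 0 }' "$1")
    # Skip comment, track and browser lines that some BED files carry.
    bed_n=$(awk 'NF >= 3 && $1 !~ /^(#|track|browser)/ { n++ } END { print n + 0 }' "$2")
    echo "FASTA records: ${fasta_n}; BED regions: ${bed_n}"
    [ "${fasta_n}" -eq "${bed_n}" ]
}

# Example call with the files used in this section:
#   check_fasta_matches_bed hg38.10x_brain.with_1kb_bg_padding.fa "${REGION_BED}"
```

The record count should also agree with the number of regions reported when the database script initialises its score matrix.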

Create cistarget databases#

Now we can create the ranking and score databases. This step takes some time, so we recommend running it as a job (i.e. not in a Jupyter notebook).
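
How you submit a job depends on your cluster. As a rough sketch, assuming a Slurm scheduler, the command from this section could be wrapped in a batch script like the one written below. The file name, job name, time limit and log pattern are placeholders chosen for illustration; adapt them to your own system.

```shell
# Write a skeleton Slurm batch script (all resource values are placeholders).
cat > create_ctx_db.slurm <<'EOF'
#!/bin/bash
#SBATCH --job-name=ctx_db
#SBATCH --cpus-per-task=20
#SBATCH --time=24:00:00
#SBATCH --output=ctx_db.%j.log

set -euo pipefail

# Activate your environment here, then paste the variable definitions and the
# create_cistarget_motif_databases.py command from this section.
# Keep the value of -t equal to --cpus-per-task.
EOF

# Check the script for shell syntax errors before submitting it.
bash -n create_ctx_db.slurm && echo "Job script syntax OK"

# Submit with: sbatch create_ctx_db.slurm
```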

[3]:
ls aertslab_motif_colleciton/v10nr_clust_public/singletons > motifs.txt
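
Since the motif list drives the whole run, it is worth checking that every entry has a matching cb file before starting a long job. The following is a minimal sketch with a hypothetical helper, check_motif_list; it accepts entries written with or without the .cb extension, so it makes no assumption about which form the database script expects.

```shell
# Hypothetical helper: verify that every entry of a motif list has a .cb file.
#   $1 = directory with .cb files, $2 = motif list (one entry per line)
check_motif_list() {
    missing=0
    while IFS= read -r entry; do
        [ -z "${entry}" ] && continue          # ignore empty lines
        if [ ! -f "$1/${entry}" ] && [ ! -f "$1/${entry}.cb" ]; then
            echo "No .cb file for list entry: ${entry}" >&2
            missing=$((missing + 1))
        fi
    done < "$2"
    total=$(awk 'NF { n++ } END { print n + 0 }' "$2")
    echo "entries: ${total}; missing: ${missing}"
    [ "${missing}" -eq 0 ]
}

# Example call with the files used in this tutorial:
#   check_motif_list aertslab_motif_colleciton/v10nr_clust_public/singletons motifs.txt
```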
[ ]:
# SCRIPT_DIR and DATABASE_PREFIX were defined in the FASTA preparation step above.
OUT_DIR="${PWD}"
CBDIR="${OUT_DIR}/aertslab_motif_colleciton/v10nr_clust_public/singletons"
FASTA_FILE="${OUT_DIR}/hg38.10x_brain.with_1kb_bg_padding.fa"
MOTIF_LIST="${OUT_DIR}/motifs.txt"

# --bgpadding must match the padding used when the FASTA file was created (1000 bp).
"${SCRIPT_DIR}/create_cistarget_motif_databases.py" \
    -f "${FASTA_FILE}" \
    -M "${CBDIR}" \
    -m "${MOTIF_LIST}" \
    -o "${OUT_DIR}/${DATABASE_PREFIX}" \
    --bgpadding 1000 \
    -t 20
Initialize dataframe (436234 regions x 10249 motifs) for storing CRM scores for each regions per motif.
Adding Cluster-Buster CRM scores (1 of 10249) for motif "metacluster_146.2" took 0.204300 seconds.
Adding Cluster-Buster CRM scores (2 of 10249) for motif "metacluster_116.3" took 0.161770 seconds.
Adding Cluster-Buster CRM scores (3 of 10249) for motif "metacluster_157.2" took 0.115518 seconds.
Adding Cluster-Buster CRM scores (4 of 10249) for motif "metacluster_120.1" took 0.216778 seconds.
Adding Cluster-Buster CRM scores (5 of 10249) for motif "metacluster_112.2" took 0.139475 seconds.
Adding Cluster-Buster CRM scores (6 of 10249) for motif "metacluster_166.4" took 0.115991 seconds.
Adding Cluster-Buster CRM scores (7 of 10249) for motif "metacluster_177.3" took 0.146787 seconds.
Adding Cluster-Buster CRM scores (8 of 10249) for motif "metacluster_148.1" took 0.119116 seconds.
Adding Cluster-Buster CRM scores (9 of 10249) for motif "metacluster_46.4" took 0.197134 seconds.
Adding Cluster-Buster CRM scores (10 of 10249) for motif "metacluster_13.3" took 0.118657 seconds.
Adding Cluster-Buster CRM scores (11 of 10249) for motif "metacluster_111.4" took 0.107039 seconds.
Adding Cluster-Buster CRM scores (12 of 10249) for motif "metacluster_121.1" took 0.099567 seconds.
Adding Cluster-Buster CRM scores (13 of 10249) for motif "metacluster_164.1" took 0.121122 seconds.
Adding Cluster-Buster CRM scores (14 of 10249) for motif "metacluster_151.1" took 0.112270 seconds.
Adding Cluster-Buster CRM scores (15 of 10249) for motif "metacluster_124.2" took 0.112771 seconds.
Adding Cluster-Buster CRM scores (16 of 10249) for motif "metacluster_1.9" took 0.114057 seconds.
Adding Cluster-Buster CRM scores (17 of 10249) for motif "metacluster_118.1" took 0.163274 seconds.
Adding Cluster-Buster CRM scores (18 of 10249) for motif "metacluster_57.3" took 0.262836 seconds.
Adding Cluster-Buster CRM scores (19 of 10249) for motif "metacluster_150.6" took 0.119764 seconds.
Adding Cluster-Buster CRM scores (20 of 10249) for motif "metacluster_0.2" took 0.115105 seconds.
Adding Cluster-Buster CRM scores (21 of 10249) for motif "metacluster_149.1" took 0.114256 seconds.
Adding Cluster-Buster CRM scores (22 of 10249) for motif "metacluster_173.2" took 0.110355 seconds.
Adding Cluster-Buster CRM scores (23 of 10249) for motif "metacluster_137.2" took 0.104703 seconds.
Adding Cluster-Buster CRM scores (24 of 10249) for motif "metacluster_115.1" took 0.121014 seconds.
Adding Cluster-Buster CRM scores (25 of 10249) for motif "metacluster_125.2" took 0.106094 seconds.
Adding Cluster-Buster CRM scores (26 of 10249) for motif "metacluster_169.2" took 0.127284 seconds.
Adding Cluster-Buster CRM scores (27 of 10249) for motif "metacluster_101.6" took 0.100963 seconds.
Adding Cluster-Buster CRM scores (28 of 10249) for motif "metacluster_128.2" took 0.119298 seconds.
Adding Cluster-Buster CRM scores (29 of 10249) for motif "metacluster_163.1" took 0.180142 seconds.
Adding Cluster-Buster CRM scores (30 of 10249) for motif "metacluster_123.5" took 0.098548 seconds.
Adding Cluster-Buster CRM scores (31 of 10249) for motif "metacluster_156.3" took 0.211007 seconds.
Adding Cluster-Buster CRM scores (32 of 10249) for motif "metacluster_133.2" took 0.103186 seconds.
Adding Cluster-Buster CRM scores (33 of 10249) for motif "metacluster_176.1" took 0.120749 seconds.
Adding Cluster-Buster CRM scores (34 of 10249) for motif "metacluster_136.3" took 0.172081 seconds.
Adding Cluster-Buster CRM scores (35 of 10249) for motif "metacluster_152.3" took 0.124086 seconds.