Preprocessing the scRNA-seq data

Preprocessing the scRNA-seq data#

In this tutorial we will perform some very basic preprocessing steps.

Download data#

The data used for this tutorial is freely available, and can be downloaded from the 10x genomics website.

[87]:
import os
os.chdir("/staging/leuven/stg_00002/lcb/sdewin/PhD/python_modules/scenicplus_development_tutorial/scRNA_seq_pp")
[3]:
!mkdir -p data
!wget -O data/filtered_feature_bc_matrix.tar.gz https://cf.10xgenomics.com/samples/cell-arc/2.0.0/human_brain_3k/human_brain_3k_filtered_feature_bc_matrix.tar.gz
--2024-03-06 13:46:50--  https://cf.10xgenomics.com/samples/cell-arc/2.0.0/human_brain_3k/human_brain_3k_filtered_feature_bc_matrix.tar.gz
Resolving cf.10xgenomics.com (cf.10xgenomics.com)... 2606:4700::6812:ad, 2606:4700::6812:1ad, 104.18.0.173, ...
Connecting to cf.10xgenomics.com (cf.10xgenomics.com)|2606:4700::6812:ad|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 170112442 (162M) [application/x-tar]
Saving to: ‘data/filtered_feature_bc_matrix.tar.gz’

data/filtered_featu 100%[===================>] 162.23M  28.5MB/s    in 8.5s

2024-03-06 13:47:00 (19.1 MB/s) - ‘data/filtered_feature_bc_matrix.tar.gz’ saved [170112442/170112442]

[8]:
!cd data; tar -xzf filtered_feature_bc_matrix.tar.gz; cd ..
[45]:
!wget -O data/cell_data.tsv https://raw.githubusercontent.com/aertslab/pycisTopic/polars/data/cell_data_human_cerebellum.tsv
--2024-03-06 14:07:48--  https://raw.githubusercontent.com/aertslab/pycisTopic/polars/data/cell_data_human_cerebellum.tsv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8001::154, 2606:50c0:8002::154, 2606:50c0:8003::154, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8001::154|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 309204 (302K) [text/plain]
Saving to: ‘data/cell_data.tsv’

data/cell_data.tsv  100%[===================>] 301.96K  --.-KB/s    in 0.04s

2024-03-06 14:07:48 (7.38 MB/s) - ‘data/cell_data.tsv’ saved [309204/309204]

Preprocessing#

We will do some very basic preprocessing steps, for more information we refer the reader to the Scanpy tutorials.

[9]:
import scanpy as sc
[50]:
adata = sc.read_10x_mtx(
    "data/filtered_feature_bc_matrix/",
    var_names = "gene_symbols"
)
[51]:
adata.var_names_make_unique()
[52]:
adata
[52]:
AnnData object with n_obs × n_vars = 3233 × 36601
    var: 'gene_ids', 'feature_types'

We have already annotated this dataset, this is beyond the scope of this tutorial. We will load this annotation here.

[75]:
import pandas as pd
cell_data = pd.read_table("data/cell_data.tsv", index_col = 0)
cell_data
[75]:
VSN_cell_type VSN_leiden_res0.3 VSN_leiden_res0.6 VSN_leiden_res0.9 VSN_leiden_res1.2 VSN_sample_id Seurat_leiden_res0.6 Seurat_leiden_res1.2 Seurat_cell_type
AAACAGCCATTATGCG-1-10x_multiome_brain MOL_B MOL_B (0) MOL_B_1 (0) MOL_B_1 (1) MOL_B_3 (6) 10x_multiome_brain NFOL (1) MOL (1) MOL
AAACCAACATAGACCC-1-10x_multiome_brain MOL_B MOL_B (0) MOL_B_1 (0) MOL_B_3 (5) MOL_B_4 (4) 10x_multiome_brain NFOL (1) NFOL (3) NFOL
AAACCGAAGATGCCTG-1-10x_multiome_brain INH_VIP INH_VIP (6) INH_VIP (8) INH_VIP (8) INH_VIP (10) 10x_multiome_brain INH_VIP (7) INH_VIP (6) INH_VIP
AAACCGAAGTTAGCTA-1-10x_multiome_brain MOL_A MOL_A (1) MOL_A_2 (1) MOL_A_1 (0) MOL_A_2 (0) 10x_multiome_brain NFOL (1) NFOL (3) NFOL
AAACCGCGTTAGCCAA-1-10x_multiome_brain MGL MGL (7) MGL (10) MGL (10) MGL (12) 10x_multiome_brain MGL (8) MGL (9) MGL
... ... ... ... ... ... ... ... ... ...
TTTGTGAAGGGTGAGT-1-10x_multiome_brain INH_VIP INH_VIP (6) INH_VIP (8) INH_VIP (8) INH_VIP (10) 10x_multiome_brain INH_SST (5) INH_SST (8) INH_SST
TTTGTGAAGTCAGGCC-1-10x_multiome_brain AST_CER AST_CER (2) AST_CER (2) AST_CER (2) AST_CER_1 (7) 10x_multiome_brain BG (2) BG (2) BG
TTTGTGGCATGCTTAG-1-10x_multiome_brain MOL_B MOL_B (0) MOL_B_1 (0) MOL_B_1 (1) MOL_B_1 (1) 10x_multiome_brain MOL (0) MOL (1) MOL
TTTGTTGGTGATCAGC-1-10x_multiome_brain MOL_A MOL_A (1) MOL_A_2 (1) MOL_A_1 (0) MOL_A_1 (11) 10x_multiome_brain NFOL (1) NFOL (3) NFOL
TTTGTTGGTGATTTGG-1-10x_multiome_brain INH_SST INH_SST (5) INH_SST (7) INH_SST (7) INH_SST (9) 10x_multiome_brain INH_SST (5) INH_SST (8) INH_SST

2392 rows × 9 columns

We modify the index of this cell type annotation dataframe so that the cell barcode names match with those in the AnnData object.

[76]:
cell_data.index = [cb.rsplit("-", 1)[0] for cb in cell_data.index]
[64]:
adata = adata[list(set(adata.obs_names) & set(cell_data.index))].copy()
[65]:
adata.obs = cell_data.loc[adata.obs_names]
[81]:
adata.var["mt"] = adata.var_names.str.startswith("MT-")
sc.pp.calculate_qc_metrics(
    adata, qc_vars=["mt"], percent_top=None, log1p=False, inplace=True
)

Data normalization#

It’s important to save the non normalized and non scaled matrix in the raw slot!

[66]:
adata.raw = adata
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
adata = adata[:, adata.var.highly_variable]
sc.pp.scale(adata, max_value=10)
/lustre1/project/stg_00002/mambaforge/vsc33053/envs/scenicplus_development_tutorial/lib/python3.11/site-packages/scanpy/preprocessing/_simple.py:843: UserWarning: Revieved a view of an AnnData. Making a copy.
  view_to_actual(adata)
[69]:
adata.obs
[69]:
VSN_cell_type VSN_leiden_res0.3 VSN_leiden_res0.6 VSN_leiden_res0.9 VSN_leiden_res1.2 VSN_sample_id Seurat_leiden_res0.6 Seurat_leiden_res1.2 Seurat_cell_type
ACCTGTTGTGGATTCA-1 MOL_A MOL_A (1) MOL_A_2 (1) MOL_A_1 (0) MOL_A_2 (0) 10x_multiome_brain MOL (0) MOL (0) MOL
CTAAAGCTCCCGCCTA-1 INH_VIP INH_VIP (6) INH_VIP (8) INH_VIP (8) INH_VIP (10) 10x_multiome_brain INH_VIP (7) INH_VIP (6) INH_VIP
AAGGAAGCACATAACT-1 AST_CER AST_CER (2) AST_CER (2) AST_CER (2) AST_CER_2 (5) 10x_multiome_brain BG (2) BG (2) BG
GAGCATGCAATATGGA-1 MOL_A MOL_A (1) MOL_A_2 (1) MOL_A_1 (0) MOL_A_2 (0) 10x_multiome_brain MOL (0) MOL (0) MOL
GCTCACAAGACAAGTG-1 MOL_A MOL_A (1) MOL_A_2 (1) MOL_A_1 (0) MOL_A_2 (0) 10x_multiome_brain MOL (0) MOL (1) MOL
... ... ... ... ... ... ... ... ... ...
ACATTGCAGCGGATTT-1 MOL_B MOL_B (0) MOL_B_1 (0) MOL_B_1 (1) MOL_B_3 (6) 10x_multiome_brain NFOL (1) NFOL (3) NFOL
ACGCCTTTCGACCTGA-1 AST AST+ENDO (9) AST+ENDO (6) AST+ENDO (13) AST (15) 10x_multiome_brain AST+ENDO (6) AST (7) AST
CTCAGGATCCACCTGT-1 MOL_B MOL_B (0) MOL_B_1 (0) MOL_B_1 (1) MOL_B_1 (1) 10x_multiome_brain NFOL (1) MOL (1) MOL
CCAACATAGATAACCC-1 MOL_A MOL_A (1) MOL_A_1 (9) MOL_A_1 (9) MOL_A_1 (11) 10x_multiome_brain NFOL (1) COP (10) COP
ATGTAAGCACTTAGGC-1 MOL_B MOL_B (0) MOL_B_1 (0) MOL_B_1 (1) MOL_B_1 (1) 10x_multiome_brain MOL (0) MOL (0) MOL

2313 rows × 9 columns

[70]:
sc.tl.pca(adata)
sc.pl.pca(adata, color = "Seurat_cell_type")
/lustre1/project/stg_00002/mambaforge/vsc33053/envs/scenicplus_development_tutorial/lib/python3.11/site-packages/scanpy/plotting/_tools/scatterplots.py:378: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
_images/human_cerebellum_scRNA_pp_21_1.png
[77]:
sc.pp.neighbors(adata)
[78]:
sc.tl.umap(adata)
[84]:
sc.pl.umap(adata, color = "Seurat_cell_type")
/lustre1/project/stg_00002/mambaforge/vsc33053/envs/scenicplus_development_tutorial/lib/python3.11/site-packages/scanpy/plotting/_tools/scatterplots.py:378: UserWarning: No data for colormapping provided via 'c'. Parameters 'cmap' will be ignored
  cax = scatter(
_images/human_cerebellum_scRNA_pp_24_1.png
[88]:
adata.write("adata.h5ad")