Skip to contents

Introduction

HPAclusteR is an R package designed by the Human Protein Atlas to streamline the process of gene clustering from transcriptomics data. It provides a modular pipeline for performing PCA, calculating distances, constructing shared nearest neighbor (SNN) graphs, consensus clustering, UMAP, functional annotation, and different publication-ready visualizations. The package is designed to work with AnnDatR objects, making it easy to integrate into workflows for transcriptomics data analysis.

Getting Started

To begin, load the HPAclusteR package:

We will use the built-in example_adata dataset for this vignette. This dataset is a toy example designed to demonstrate the functionality of the package.

💡 HPAclusteR can work with AnnDatR objects created using the AnnDatR package, as long as the data is transposed so that genes are the observations (rows) and samples are the variables (columns). Alternatively, you can use the hc_initialize() function to create a properly formatted object directly from your data frames.

Clustering Pipeline

💡 This part demonstrates the step-by-step usage of the HPAclusteR pipeline. For users who prefer simplicity, the hc_auto_cluster() function can be used to run the entire pipeline in one call.

Step 1: Principal Component Analysis (PCA)

PCA is the first step in the pipeline, used to reduce the dimensionality of the data while retaining the most important features.

adata_res <- hc_pca(example_adata, components = 40)
head(adata_res$obsm$X_pca[, 1:5])  # PCA scores
#>                        PC1       PC2        PC3        PC4        PC5
#> ENSG00000002745  2.5250898 -1.245267  4.9538170 -2.9612872  0.5522311
#> ENSG00000004660 -3.9791370 -4.237506  1.0128910 -2.3799084  0.1714757
#> ENSG00000006047 -3.3357725  5.230900  0.2779654  0.8539579 -1.0689779
#> ENSG00000006059  0.3849729  1.662731  4.2587010 -1.7739603  0.3751746
#> ENSG00000006453  4.3272116  2.237562  2.0914290  0.3764680 -5.5500274
#> ENSG00000006740 -3.4995078 -4.826965 -1.0965416  0.6870355 -3.5294865
adata_res$uns$pca  # Explained variance
#> nipals calculated PCA
#> Importance of component(s):
#>                  PC1    PC2     PC3     PC4     PC5     PC6     PC7     PC8
#> R2            0.2353 0.1895 0.09884 0.07956 0.06474 0.05336 0.02484 0.02105
#> Cumulative R2 0.2353 0.4248 0.52366 0.60323 0.66797 0.72133 0.74617 0.76721
#>                  PC9    PC10    PC11    PC12    PC13     PC14     PC15     PC16
#> R2            0.0169 0.01463 0.01307 0.01153 0.01092 0.009315 0.008165 0.007873
#> Cumulative R2 0.7841 0.79874 0.81180 0.82333 0.83426 0.843573 0.851738 0.859611
#>                   PC17     PC18    PC19    PC20     PC21     PC22     PC23
#> R2            0.007254 0.007048 0.00658 0.00588 0.005541 0.005381 0.005207
#> Cumulative R2 0.866864 0.873913 0.88049 0.88637 0.891914 0.897295 0.902502
#>                   PC24     PC25     PC26     PC27     PC28     PC29     PC30
#> R2            0.004988 0.004932 0.004718 0.004567 0.004482 0.004177 0.003912
#> Cumulative R2 0.907490 0.912421 0.917139 0.921706 0.926188 0.930364 0.934277
#>                   PC31     PC32     PC33     PC34     PC35     PC36     PC37
#> R2            0.003712 0.003605 0.003258 0.003229 0.003022 0.002936 0.002783
#> Cumulative R2 0.937988 0.941593 0.944851 0.948080 0.951102 0.954038 0.956821
#>                   PC38     PC39     PC40
#> R2            0.002728 0.002528 0.002346
#> Cumulative R2 0.959549 0.962078 0.964423
#> 78   Variables
#> 981  Samples
#> 111  NAs ( 0.145 %)
#> 40   Calculated component(s)
#> Data was mean centered before running PCA 
#> Data was NOT scaled before running PCA 
#> Scores structure:
#> [1] 981  40
#> Loadings structure:
#> [1] 78 40

The PCA step reduces the dimensionality of the data to 40 components, which can be adjusted based on the dataset and analysis goals.

Step 2: Distance Calculation

Next, we calculate the pairwise distances between samples using the PCA-reduced data. The hc_distance() function supports multiple distance metrics, such as “euclidean” and “spearman”. We will use the number of components determined by Kaiser’s rule.

adata_res <- hc_distance(
  adata_res, 
  components = hc_kaisers_rule(adata_res), 
  method = "spearman"
)
#> [1] "Kaiser's rule is above 80% variation. Success"
head(adata_res$uns$distance)
#> [1] 0.9545455 1.3545455 0.4727273 0.8636364 1.1818182 1.3909091

Step 3: Shared Nearest Neighbor (SNN) Graph Construction

The SNN graph is constructed to identify clusters of similar samples. This step uses the distance matrix calculated in the previous step.

adata_res <- hc_snn(adata_res, neighbors = 15, prune = 1 / 15)
adata_res$uns$neighbors$snn
#> A Graph object containing 981 cells

The neighbors parameter controls the number of nearest neighbors considered, while prune adjusts the sparsity of the graph.

Step 4: Consensus Clustering

Consensus clustering is performed to identify robust clusters in the data. This step aggregates clustering results from multiple runs, in this case 100.

adata_res <- hc_cluster_consensus(
  adata_res, 
  resolution = 8, 
  method = "louvain",
  n_seeds = 100
)

head(adata_res$obs$cluster)
#> [1] "30" "14" "4"  "11" "21" "14"

The resolution parameter controls the granularity of the clustering. Higher values result in more clusters.

Step 5: UMAP Visualization

UMAP is used to visualize the clusters in a low-dimensional space. This step provides an intuitive way to explore the clustering results. It uses the constructed SNN graph.

adata_res <- hc_umap(adata_res, verbose = FALSE)
head(adata_res$obsm$X_umap)
#>                     UMAP_1     UMAP_2
#> ENSG00000002745  0.2972319  0.5904026
#> ENSG00000004660 -0.9859446 -0.4686754
#> ENSG00000006047  0.9013946 -0.4792866
#> ENSG00000006059  0.1685648  0.4550808
#> ENSG00000006453 -0.4699026 -0.3761350
#> ENSG00000006740 -0.9678060 -0.4816415

Step 6: Cluster Hulls (Optional)

Cluster hulls are calculated to visualize the boundaries of each cluster in the UMAP plot.

adata_res <- hc_cluster_hulls(adata_res, poly_smoothing = 4)
head(adata_res$uns$UMAP_hulls$hulls)
#> # A tibble: 6 × 7
#>   cluster sub_cluster sub_type landmass      X      Y polygon_id
#>   <chr>         <dbl> <chr>       <dbl>  <dbl>  <dbl> <chr>     
#> 1 14                1 primary         1 -0.980 -0.502 14_1_1    
#> 2 14                1 primary         1 -1.00  -0.502 14_1_1    
#> 3 14                1 primary         1 -1.00  -0.474 14_1_1    
#> 4 14                1 primary         1 -0.987 -0.46  14_1_1    
#> 5 14                1 primary         1 -0.959 -0.46  14_1_1    
#> 6 14                1 primary         1 -0.952 -0.495 14_1_1

Step 7: Visualization

Finally, we visualize the UMAP plot with clusters and hulls using the hc_plot_umap() function.

hc_plot_umap(adata_res, plot = "both")

The plot argument can be set to "points", "hulls", or "both" to customize the visualization.

Cluster Annotation

In this part we will annotate our clusters using the KEGG database. Users can use KEGG, GO or even other databases such as the Human Protein Atlas, Panglao, Trrust and Reactome. Check hc_annotate() for further details.

enrichment_res <- hc_annotate(
  adata_res,
  dbs = "KEGG",
  verbose = FALSE
)
#> Warning in clusterProfiler::bitr(unique(clustering_data[["gene"]]), fromType =
#> "ENSEMBL", : 0.82% of input gene IDs are fail to map...
head(enrichment_res$enrichment)
#> # A tibble: 6 × 10
#>   `Cluster ID` Database      `Term ID` Term          GeneRatio BgRatio `P-value`
#>   <chr>        <chr>         <chr>     <chr>         <chr>     <chr>       <dbl>
#> 1 11           KEGG pathways hsa04382  Cornified en… 13/24     40/349    8.68e-8
#> 2 11           KEGG pathways hsa04519  Cadherin sig… 8/24      24/349    5.37e-5
#> 3 11           KEGG pathways hsa04915  Estrogen sig… 4/24      12/349    5.92e-3
#> 4 12           KEGG pathways hsa01100  Metabolic pa… 9/15      67/349    4.12e-4
#> 5 13           KEGG pathways hsa04814  Motor protei… 3/11      11/349    3.37e-3
#> 6 17           KEGG pathways hsa03040  Spliceosome   5/20      13/349    3.53e-4
#> # ℹ 3 more variables: `Adjusted P-value` <dbl>, `Gene IDs` <chr>,
#> #   `Gene names` <chr>
enrichment_res$bubblemap_kegg

We can also perform gene classification within each cluster using the Human Protein Atlas (HPA) logic. This approach assigns specificity and distribution categories to genes based on their expression patterns across sample categories (e.g., tissues), providing insights into gene function and tissue specificity.

classify_res <- hc_classify(
  adata_res,
  sample_categories = "tissue_name"
)
head(classify_res$classification[["1"]])
#> # A tibble: 6 × 5
#>   ENSG            spec_category   spec_sample_categories   tau dist_category   
#>   <chr>           <chr>           <chr>                  <dbl> <chr>           
#> 1 ENSG00000069974 Enhanced        stomach;bone marrow     0.43 Detected in all 
#> 2 ENSG00000072501 Low specificity NA                      0.35 Detected in all 
#> 3 ENSG00000077147 Low specificity NA                      0.28 Detected in all 
#> 4 ENSG00000078177 Enhanced        thymus                  0.64 Detected in many
#> 5 ENSG00000081923 Enhanced        colon;rectum;stomach    0.59 Detected in many
#> 6 ENSG00000101868 Enhanced        thymus                  0.51 Detected in many
classify_res$pie_charts[["1"]]

We can visualize the gene expression patterns within each cluster using heatmaps. Both z-score and scaled expression heatmaps are generated for each cluster, allowing us to explore and compare gene expression profiles across different groups. The confidence of each cluster is also presented in the sidebar.

expr_heatmaps <- hc_plot_expression(adata_res)
expr_heatmaps$zscore[["1"]]

expr_heatmaps$scaled[["1"]]

Cluster Comparison

Finally, we can compare different clustering results using the hc_cluster_compare() function. Here, we will create a second clustering result with a different resolution and compare it to the first result. This time we will use the hc_auto_cluster() function for simplicity.

adata_res2 <- hc_auto_cluster(
  example_adata,
  cluster_resolution = 10,
  verbose = FALSE
)
#> [1] "Kaiser's rule is above 80% variation. Success"


comparison_res <- hc_cluster_compare(
  adata_res,
  adata_res2,
  graph_type = "bipartite"
)
#> Warning: The `size` argument of `element_line()` is deprecated as of ggplot2 3.4.0.
#>  Please use the `linewidth` argument instead.
#>  The deprecated feature was likely used in the HPAclusteR package.
#>   Please report the issue at
#>   <https://github.com/buenoalvezm/HPAclusteR/issues>.
#> This warning is displayed once per session.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.
comparison_res$heatmap

comparison_res$network

Conclusion

Explore the package further to uncover the full potential of your transcriptomics data!

💡 Remember that these data are an example toy-dataset with only a sample of the total protein-coding genes. The results in this guide should not be interpreted as real results. The purpose of this vignette is to show you how to use the package and its functions.

sessionInfo()
#> R version 4.5.3 (2026-03-11)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
#>  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
#>  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
#> [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
#> 
#> time zone: UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] HPAclusteR_1.0.0
#> 
#> loaded via a namespace (and not attached):
#>   [1] fs_2.0.1                matrixStats_1.5.0       spatstat.sparse_3.1-0  
#>   [4] enrichplot_1.30.5       httr_1.4.8              RColorBrewer_1.1-3     
#>   [7] prabclus_2.3-5          tools_4.5.3             sctransform_0.4.3      
#>  [10] utf8_1.2.6              R6_2.6.1                lazyeval_0.2.3         
#>  [13] uwot_0.2.4              withr_3.0.2             sp_2.2-1               
#>  [16] GGally_2.4.0            gridExtra_2.3           progressr_0.19.0       
#>  [19] cli_3.6.6               Biobase_2.70.0          textshaping_1.0.5      
#>  [22] factoextra_2.0.0        spatstat.explore_3.8-0  fastDummies_1.7.5      
#>  [25] network_1.20.0          scatterpie_0.2.6        slam_0.1-55            
#>  [28] labeling_0.4.3          sass_0.4.10             diptest_0.77-2         
#>  [31] Seurat_5.4.0            tm_0.7-18               S7_0.2.1               
#>  [34] robustbase_0.99-7       spatstat.data_3.1-9     readr_2.2.0            
#>  [37] askpass_1.2.1           ggridges_0.5.7          pbapply_1.7-4          
#>  [40] pkgdown_2.2.0           systemfonts_1.3.2       yulab.utils_0.2.4      
#>  [43] gson_0.1.0              DOSE_4.4.0              R.utils_2.13.0         
#>  [46] parallelly_1.46.1       RSQLite_2.4.6           treemap_2.4-4          
#>  [49] generics_0.1.4          gridGraphics_0.5-1      ica_1.0-3              
#>  [52] spatstat.random_3.4-5   dplyr_1.2.1             GO.db_3.22.0           
#>  [55] Matrix_1.7-4            S4Vectors_0.48.1        abind_1.4-8            
#>  [58] R.methodsS3_1.8.2       lifecycle_1.0.5         yaml_2.3.12            
#>  [61] qvalue_2.42.0           Rtsne_0.17              grid_4.5.3             
#>  [64] blob_1.3.0              promises_1.5.0          crayon_1.5.3           
#>  [67] miniUI_0.1.2            ggtangle_0.1.1          lattice_0.22-9         
#>  [70] cowplot_1.2.0           KEGGREST_1.50.0         sna_2.8                
#>  [73] pillar_1.11.1           knitr_1.51              fgsea_1.36.2           
#>  [76] fpc_2.2-14              future.apply_1.20.2     codetools_0.2-20       
#>  [79] fastmatch_1.1-8         glue_1.8.0              ggiraph_0.9.6          
#>  [82] V8_8.1.0                ggfun_0.2.0             spatstat.univar_3.1-7  
#>  [85] fontLiberation_0.1.0    pcaMethods_2.2.0        data.table_1.18.2.1    
#>  [88] vctrs_0.7.3             png_0.1-9               treeio_1.34.0          
#>  [91] spam_2.11-3             gtable_0.3.6            kernlab_0.9-33         
#>  [94] cachem_1.1.0            xfun_0.57               mime_0.13              
#>  [97] Seqinfo_1.0.0           coda_0.19-4.1           survival_3.8-6         
#> [100] pheatmap_1.0.13         fitdistrplus_1.2-6      ROCR_1.0-12            
#> [103] nlme_3.1-168            ggtree_4.0.5            bit64_4.6.0-1          
#> [106] fontquiver_0.2.1        RcppAnnoy_0.0.23        bslib_0.10.0           
#> [109] irlba_2.3.7             KernSmooth_2.23-26      otel_0.2.0             
#> [112] colorspace_2.1-2        rrvgo_1.22.0            BiocGenerics_0.56.0    
#> [115] DBI_1.3.0               nnet_7.3-20             tidyselect_1.2.1       
#> [118] bit_4.6.0               compiler_4.5.3          curl_7.0.0             
#> [121] xml2_1.5.2              NLP_0.3-2               desc_1.4.3             
#> [124] fontBitstreamVera_0.1.1 plotly_4.12.0           scales_1.4.0           
#> [127] DEoptimR_1.1-4          lmtest_0.9-40           rappdirs_0.3.4         
#> [130] stringr_1.6.0           digest_0.6.39           goftest_1.2-3          
#> [133] spatstat.utils_3.2-2    rmarkdown_2.31          XVector_0.50.0         
#> [136] htmltools_0.5.9         pkgconfig_2.0.3         umap_0.2.10.0          
#> [139] fastmap_1.2.0           rlang_1.2.0             htmlwidgets_1.6.4      
#> [142] shiny_1.13.0            farver_2.1.2            jquerylib_0.1.4        
#> [145] zoo_1.8-15              jsonlite_2.0.0          statnet.common_4.13.0  
#> [148] BiocParallel_1.44.0     mclust_6.1.2            GOSemSim_2.36.0        
#> [151] R.oo_1.27.1             magrittr_2.0.5          modeltools_0.2-24      
#> [154] ggplotify_0.1.3         wordcloud_2.6           dotCall64_1.2          
#> [157] patchwork_1.3.2         Rcpp_1.1.1              ape_5.8-1              
#> [160] ggnewscale_0.5.2        gdtools_0.5.0           reticulate_1.46.0      
#> [163] stringi_1.8.7           MASS_7.3-65             org.Hs.eg.db_3.22.0    
#> [166] plyr_1.8.9              flexmix_2.3-20          ggstats_0.13.0         
#> [169] parallel_4.5.3          listenv_0.10.1          ggrepel_0.9.8          
#> [172] deldir_2.0-4            Biostrings_2.78.0       splines_4.5.3          
#> [175] tensor_1.5.1            hms_1.1.4               igraph_2.2.3           
#> [178] spatstat.geom_3.7-3     RcppHNSW_0.6.0          reshape2_1.4.5         
#> [181] stats4_4.5.3            evaluate_1.0.5          SeuratObject_5.4.0     
#> [184] tzdb_0.5.0              tweenr_2.0.3            httpuv_1.6.17          
#> [187] openssl_2.3.5           RANN_2.6.2              tidyr_1.3.2            
#> [190] purrr_1.2.2             polyclip_1.10-7         future_1.70.0          
#> [193] clue_0.3-68             scattermore_1.2         ggplot2_4.0.2          
#> [196] gridBase_0.4-7          ggforce_0.5.0           xtable_1.8-8           
#> [199] RSpectra_0.16-2         tidytree_0.4.7          tidydr_0.0.6           
#> [202] later_1.4.8             viridisLite_0.4.3       class_7.3-23           
#> [205] ragg_1.5.2              tibble_3.3.1            clusterProfiler_4.18.4 
#> [208] aplot_0.2.9             memoise_2.0.1           AnnotationDbi_1.72.0   
#> [211] IRanges_2.44.0          cluster_2.1.8.2         globals_0.19.1         
#> [214] concaveman_1.2.0