Network inference with random forest

Author

Nathalie Vialaneix

Published

Monday, October 9, 2023

This practical’s aim is to perform network inference using GENIE3 (Huynh-Thu et al. 2010). The dataset used to illustrate this practical is an extract of the expression data and reconstructed network described in (Nicolas et al. 2012), that is related to the bacteria Bacillus subtilis. Note that gene and sample names have been anonymized.

Be also sure to have the proper R packages installed (usage of renv is strongly recommended):

Code

library("reshape2")
library("ggplot2")
library("GENIE3")
library("igraph")
library("PRROC")
library("rfPermute")

Data

Expression data

Expression data is included in the file ../data/expr.csv and are loaded with:

Code

expr <- read.table("../data/expr.csv", sep = "\t", header = TRUE)
dim(expr)

[1] 457 270

The data are organized with genes in rows and samples in columns, the first column being the gene name:

Code

head(expr[, 1:10])

A global visualization of the gene expression is available using a heatmap:

Code

df_heatmap <- melt(expr, id.vars = "Name")
names(df_heatmap) <- c("gene", "sample", "expression")

p <- ggplot(df_heatmap, aes(sample, gene)) +
  geom_tile(aes(fill = expression), color = "white") +
  scale_fill_gradient(low = "yellow", high = "red") +
  ylab("genes ") + xlab("samples") + theme(axis.text = element_blank()) +
  labs(fill = "Expression level")
p

Network inference

Network inference will be performed with GENIE3:

Code

?GENIE3

GENIE3 requires an expression matrix with genes in rows and gene names as row names:

Code

expr_matrix <- as.matrix(expr[, 2:ncol(expr)])
rownames(expr_matrix) <- expr$Name
expr_matrix <- expr_matrix
set.seed(1055)
res_GENIE3 <- GENIE3(expr_matrix, nTrees = 50, verbose = TRUE)

Note: This simulation is given as a mere illustration of the process. However, we can not expect much from the results: * 50 trees is probably not enough to obtain a good performance; * \(\sigma\) factors, which are major regulators in bacteria, are not specified in the original data (where they could have been).

Analysis of results

Loading and displaying the true network

The result is then compared with the ground true network, available in ../data/net.rds as an igraph object:

Code

ref_net <- readRDS("../data/net.rds")
ref_net

IGRAPH afc5784 DN-- 457 710 -- 
+ attr: name (v/c)
+ edges from afc5784 (vertex names):
 [1] W0285->W0285 W0285->F0297 W0285->C0769 W0285->O0665 W0285->X0991
 [6] W0285->K0548 W0285->Y0468 W0285->Z0620 W0285->X0574 W0285->P0189
[11] W0285->V0342 W0285->J0957 W0285->F0308 W0285->X0773 W0285->R0994
[16] W0285->Q0692 W0285->Z0954 W0285->D0661 W0285->S0907 W0285->Y0771
[21] W0285->A0935 W0285->O0136 W0285->V0123 W0285->G0870 W0285->J0696
[26] W0285->S0104 W0285->U0802 W0285->I0168 W0285->F0181 W0285->D0014
[31] W0285->V0861 W0285->N0624 W0285->X0018 W0285->D0148 W0285->Z0131
[36] W0285->P0275 W0285->F0851 W0285->Y0754 W0285->O0313 W0285->Z0109
+ ... omitted several edges

Code

edge_density(ref_net)

[1] 0.003407041

The true network can be displayed using:

Code

par(mar = rep(0, 4))
set.seed(1121)
plot(ref_net, vertex.size = 3, vertex.color = "lightgreen", 
     vertex.frame.color = "lightgreen", edge.color = "grey", 
     vertex.label = rep(NA, vcount(ref_net)))

For the sake of simplicity, we will further use the undirected version of the network. We also extract the corresponding adjacency matrix:

Code

undirected_ref <- as.undirected(ref_net, mode = "collapse")
ref_adj <- as_adj(undirected_ref, sparse = FALSE)
diag(ref_adj) <- 0

Threshold based on weight distribution

Weight distribution of the solution is explored to visually set a relevant threshold:

Code

all_weights <- c(res_GENIE3[upper.tri(res_GENIE3)], 
                 res_GENIE3[lower.tri(res_GENIE3)])
df <- data.frame("weights" = all_weights)
p <- ggplot(df, aes(x = weights)) + geom_histogram() + theme_bw() + 
  scale_x_log10() + geom_vline(xintercept = 1e-2, color = "darkred")
p

The corresponding (undirected) network is then deduced:

Code

net1 <- res_GENIE3
net1[res_GENIE3 < 1e-2] <- 0
net1[res_GENIE3 >= 1e-2] <- 1
net1 <- graph_from_adjacency_matrix(net1, mode = "max")
net1

IGRAPH 7138c93 UN-- 457 12142 -- 
+ attr: name (v/c)
+ edges from 7138c93 (vertex names):
 [1] W0285--C0769 W0285--P0255 W0285--S0077 W0285--K0840 W0285--Y0311
 [6] W0285--N0741 W0285--N0010 W0285--F0096 W0285--N0057 W0285--D0254
[11] W0285--U0658 W0285--Z0569 W0285--J0880 W0285--D0054 W0285--L0362
[16] W0285--E0486 W0285--E0301 W0285--Z0025 W0285--V0132 W0285--Y0083
[21] W0285--R0994 W0285--V0864 W0285--B0497 W0285--K0663 W0285--S0155
[26] W0285--R0764 W0285--O0113 W0285--V0123 W0285--X0082 W0285--C0834
[31] W0285--U0180 W0285--V0928 W0285--G0023 W0285--E0461 W0285--W0606
[36] W0285--N0763 W0285--X0912 W0285--P0487 W0285--E0983 W0285--F0717
+ ... omitted several edges

It has a density equal to:

Code

edge_density(net1)

[1] 0.1165304

which is larger than the density of the true network.

Visually, it also looks quite different:

Code

par(mar = rep(0, 4))
set.seed(1112)
plot(net1, vertex.size = 3, vertex.color = "lightgreen", 
     vertex.frame.color = "lightgreen", edge.color = "grey", 
     vertex.label = rep(NA, vcount(net1)))

It is mostly explained by a very different degree distribution (not using prior information on \(\sigma\) factor is probably instrumental in this difference):

Code

df <- data.frame("true" = degree(ref_net), "GENIE3" = degree(net1))
p <- ggplot(df, aes(x = true, y = GENIE3)) + geom_point(alpha = 0.4) +
  theme_bw()
p

Threshold based on realistic density

Another approach would find a threshold based on a realistic value of the network density (obtained from the true network, here):

Code

ref_density <- edge_density(undirected_ref)
max_weight <- pmax(res_GENIE3[upper.tri(res_GENIE3)],
                   t(res_GENIE3)[upper.tri(res_GENIE3)])
thresh <- quantile(max_weight, probs = 1 - ref_density)
thresh

 99.31955% 
0.05164564

This threshold is used to define a second network:

Code

net2 <- res_GENIE3
net2[res_GENIE3 < thresh] <- 0
net2[res_GENIE3 >= thresh] <- 1
net2 <- graph_from_adjacency_matrix(net2, mode = "max")
net2

IGRAPH ec7c922 UN-- 457 709 -- 
+ attr: name (v/c)
+ edges from ec7c922 (vertex names):
 [1] W0285--C0769 W0285--Y0311 W0285--N0010 W0285--V0123 W0285--F0717
 [6] W0285--T0776 F0297--R0154 C0769--X0082 C0769--L0809 C0769--G0023
[11] C0769--P0487 C0769--X0681 C0769--S0522 C0769--T0353 E0873--L0863
[16] P0255--I0595 P0255--B0357 P0255--V0583 P0255--C0521 O0665--N0927
[21] O0665--S0636 O0665--H0534 R0740--P0524 R0740--U0911 P0524--U0911
[26] J0702--Z0491 J0702--O0570 J0702--I0968 X0991--C0593 X0991--R0994
[31] X0991--R0578 X0991--K0643 D0934--Z0569 D0934--R0910 D0934--H0859
[36] T0196--L0743 T0196--V0933 Y0468--J0269 V0395--J0880 V0395--E0301
+ ... omitted several edges

This network has a density close to the true network, as expected:

Code

edge_density(net2)

[1] 0.006804484

But it still does not match the true network visually:

Code

par(mar = rep(0, 4))
set.seed(1112)
plot(net2, vertex.size = 3, vertex.color = "lightgreen", 
     vertex.frame.color = "lightgreen", edge.color = "grey", 
     vertex.label = rep(NA, vcount(net1)))

And, again, it has a very different degree distribution:

Code

df <- data.frame("true" = degree(ref_net), "GENIE3" = degree(net2))
p <- ggplot(df, aes(x = true, y = GENIE3)) + geom_point(alpha = 0.4) +
  theme_bw()
p

Global evaluation

ROC and PR curves are finally obtained, using the set of weights (maximum between the two weights for each edge) for positive (true) and negative (wrong) edges:

Code

true_edges <- res_GENIE3
true_edges[ref_adj == 0] <- 0
true_edges <- pmax(true_edges[upper.tri(true_edges)], 
                   t(true_edges)[upper.tri(true_edges)])
true_edges <- true_edges[true_edges != 0]

wrong_edges <- res_GENIE3
wrong_edges[ref_adj == 1] <- 0
wrong_edges <- pmax(wrong_edges[upper.tri(wrong_edges)], 
                    t(wrong_edges)[upper.tri(wrong_edges)])
wrong_edges <- wrong_edges[wrong_edges != 0]

roc <- roc.curve(scores.class0 = wrong_edges, scores.class1 = true_edges,
                 curve = TRUE)
roc


  ROC curve

    Area under curve:
     0.501703 

    Curve for scores from  1.636326e-06  to  0.1349187 
    ( can be plotted with plot(x) )

Code

plot(roc)

Code

pr <- pr.curve(scores.class0 = wrong_edges, scores.class1 = true_edges,
               curve = TRUE)
pr


  Precision-recall curve

    Area under curve (Integral):
     0.9929469 

    Area under curve (Davis & Goadrich):
     0.9929468 

    Curve for scores from  1.636326e-06  to  0.1349187 
    ( can be plotted with plot(x) )

Code

plot(pr)

Short illustration of the permutation based approach

rfPermute can be used to obtain a \(p\)-value describing the confidence of each edge. It requires to be run on each gene used as a target for the prediction, in turn, which would be too long for this practical. We just illustrate its use with one of the gene, with a very reduced (and not relevant) number of permutations:

Code

expr_df <- data.frame(t(expr_matrix))
set.seed(1241)
res_rfPermute <- rfPermute(W0285 ~ ., data = expr_df, num.rep = 10, 
                           p = round(sqrt(500)))

The results contain a \(p\)-value field that is summarized below:

Code

res_rfPermute

An rfPermute model

               Type of random forest: regression 
                     Number of trees: 500 
No. of variables tried at each split: 152 
       No. of permutation replicates: 10 
                          Start time: 2023-09-18 13:06:53 
                            End time: 2023-09-18 13:09:12 
                            Run time: 2.32 mins

Code

summary(res_rfPermute$pval[, 2, "scaled"])

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.09091 1.00000 1.00000 0.89872 1.00000 1.00000

Using a threshold of 10% on the \(p\)-value, we obtain the predictors of “W0285” that we can compare to the true neighborhood:

Code

ref_nei <- neighborhood(undirected_ref, order = 1, nodes = "W0285")[[1]]$name
cat("reference:", ref_nei, "\n")

reference: W0285 F0297 C0769 O0665 X0991 K0548 Y0468 Z0620 X0574 P0189 V0342 J0957 F0308 X0773 R0994 Q0692 Z0954 D0661 S0907 Y0771 A0935 O0136 V0123 G0870 J0696 S0104 U0802 I0168 F0181 D0014 V0861 N0624 X0018 D0148 Z0131 P0275 F0851 Y0754 O0313 Z0109 J0269 L0777 H0545 U0226 E0325 F0118 R0995 U0641 E0434 B0166 O0495 N0566 B0883 Z0078 W0606 J0233 N0278 Z0843 G0686 I0949 X0293 Y0904 N0763 N0896 S0139 N0094 Z0444 A0410 B0719 R0480 P0046 Q0690 R0906 C0200 X0912 W0547 C0837 D0649 M0953 U0978 F0161 M0393 H0637 K0305 O0093 O0789 B0300 U0423 A0241 A0437 P0206 S0759 P0407 S0176 R0578 T0352 Y0723 K0383 X0916 J0779 K0466 G0932 W0720 K0219 S0095 N0458 R0154 T0042 L0040 P0487 E0983 S0636 N0704 S0435 F0717 E0542 S0421 A0781 T0869 H0534 C0450 J0612 S0373 G0669 N0586 A0494 A0289 R0296

Code

pred_nei <- names(which(res_rfPermute$pval[, 2, "scaled"] < 0.1))
cat("predicted:", pred_nei, "\n")

predicted: S0077 K0840 Y0311 N0741 N0010 P0156 U0658 E0301 C0799 M0635 Y0083 K0563 V0123 X0082 C0834 E0325 G0023 L0040 F0717 N0671 J0732 B0594 K0756 T0776 S0522 N0430 L0129

The intersection between the two neighborhood is finally obtained with:

Code

intersect(ref_nei, pred_nei)

[1] "V0123" "E0325" "L0040" "F0717"

Note: Again, the previous analysis is not satisfactory. What would be sounder would be to use the results of GENIE3 as a prior list of interactions to test and to perform rfPermute only to edges that have been selected during this prior step.

Session information

Code

sessionInfo()

R version 4.3.1 (2023-06-16)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.3 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=fr_FR.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=fr_FR.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=fr_FR.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/Paris
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] rfPermute_2.5.2 PRROC_1.3.1     igraph_1.5.1    GENIE3_1.22.0  
[5] ggplot2_3.4.3   reshape2_1.4.4 

loaded via a namespace (and not attached):
 [1] randomForest_4.7-1.1 gtable_0.3.4         jsonlite_1.8.7      
 [4] dplyr_1.1.3          compiler_4.3.1       tidyselect_1.2.0    
 [7] Rcpp_1.0.11          stringr_1.5.0        swfscMisc_1.6.5     
[10] parallel_4.3.1       scales_1.2.1         yaml_2.3.7          
[13] fastmap_1.1.1        R6_2.5.1             plyr_1.8.8          
[16] labeling_0.4.3       generics_0.1.3       knitr_1.43          
[19] tibble_3.2.1         munsell_0.5.0        pillar_1.9.0        
[22] rlang_1.1.1          utf8_1.2.3           stringi_1.7.12      
[25] xfun_0.40            cli_3.6.1            withr_2.5.0         
[28] magrittr_2.0.3       digest_0.6.33        grid_4.3.1          
[31] rstudioapi_0.15.0    lifecycle_1.0.3      vctrs_0.6.3         
[34] evaluate_0.21        glue_1.6.2           farver_2.1.1        
[37] codetools_0.2-19     fansi_1.0.4          colorspace_2.1-0    
[40] rmarkdown_2.24       tools_4.3.1          pkgconfig_2.0.3     
[43] htmltools_0.5.6

References

Huynh-Thu, Vân Anh., Alexandre Irrthum, Louis Wehenkel, and Pierre Geurts. 2010. “Inferring Regulatory Networks from Expression Data Using Tree-Based Methods.” PLoS ONE 5 (9): e12776. https://doi.org/doi:10.1371/journal.pone.0012776.

Nicolas, Pierre, Ulrike Mäder, Etienne Dervyn, Tatiana Rochat, Aurélie Leduc, Nathalie Pigeonneau, Elena Bidnenko, et al. 2012. “Condition-Dependent Transcriptome Reveals High-Level Regulatory Architecture in Bacillus Subtilis.” Science 335 (6072): 1103–6. https://doi.org/10.1126/science.1206848.