Network-valued data are data in which the statistical unit is a network itself. This is the kind of data with which we can make inference on populations of networks from samples of networks. The nevada package proposes a specific nvd class to handle network-valued data. Inference from such samples is made possible through a 4-step procedure:

1. choose a suitable matrix representation of the networks;
2. choose a suitable distance to embed this representation in a metric space;
3. choose one or more test statistics to target the aspects of the distributions you want to compare;
4. compute a permutation-based approximation of the null distribution to obtain a p-value.
The package focuses for now on the two-sample testing problem and assumes that all networks from both samples share the same node structure. There are two types of questions that one can ask:

- Is there a difference between the distributions that generated the two observed samples?
- If so, can we localize the differences on the network structure?

The nevada package offers a dedicated function for answering each of these two questions:

- test2_global(); for more details, please see Lovato et al. (2020);
- test2_local(); for more details, please see Lovato et al. (2021).

nvd class for network-valued data
In nevada, network-valued data are stored in an object of class nvd, which is basically a list of igraph objects. We provide a constructor nvd() which allows the user to simulate samples of networks using some of the popular models from igraph, such as "gnp" (the Erdős–Rényi G(n, p) model) and "k_regular" (random k-regular graphs), which are the models used in the examples of this article. All networks simulated by the constructor have 25 nodes.
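For instance, a small sample of network-valued data can be simulated as follows; this is a minimal sketch mirroring the constructor calls used later in this article.

library(nevada)
x <- nvd(model = "gnp", n = 5, model_params = list(p = 1/3))
length(x)  # 5 igraph objects, each with 25 nodes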
There are currently 3 possible matrix representations for a network. Let G be a network with N nodes.
An N × N matrix W is an adjacency matrix for G if its element W_{ij} indicates whether there is an edge between vertex i and vertex j: $$ W_{ij}= \begin{cases} w_{i,j}, & \mbox{if } (i,j) \in E \mbox{ with weight } w_{i,j}\\ 0, & \mbox{otherwise.} \end{cases} $$
In nevada, this representation can be achieved with repr_adjacency().
The Laplacian matrix L of the network G is defined in the following way:
L = D(W) − W, where D(W) is the diagonal matrix whose i-th diagonal element is the degree of vertex i.
In nevada, this representation can be achieved with repr_laplacian().
The elements of the modularity matrix B are given by
$$ B_{ij} = W_{ij} - \frac{d_i d_j}{2m}, $$ where di and dj are the degrees of vertices i and j respectively, and m is the total number of edges in the network.
In nevada, this representation can be achieved with repr_modularity().
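To make these definitions concrete, here is a minimal sketch that computes the three representations of a single simulated network and checks them against the formulas above. It assumes that the repr_*() functions accept a single igraph object and return a base matrix (carrying a "representation" attribute, as in the output shown later), hence the checks ignore attributes.

library(nevada)
library(igraph)
g <- sample_gnp(25, 1/3)   # one network with 25 nodes
W <- repr_adjacency(g)     # adjacency representation
L <- repr_laplacian(g)     # Laplacian representation
B <- repr_modularity(g)    # modularity representation
d <- rowSums(W)            # vertex degrees
m <- sum(W) / 2            # total number of edges
# Both checks should return TRUE under the definitions above
all.equal(L, diag(d) - W, check.attributes = FALSE)
all.equal(B, W - outer(d, d) / (2 * m), check.attributes = FALSE)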
Representation of an nvd object

Instead of going through every single network in a sample to compute its representation, nevada provides the repr_nvd() function, which does exactly that for an object of class nvd.
x <- nvd(model = "gnp", n = 3, model_params = list(p = 1/3))
repr_nvd(x, representation = "laplacian")
#> [[1]]
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
#> [1,] 13 0 0 0 -1 -1 -1 -1 -1 -1 -1 0 0
#> [2,] 0 6 0 0 -1 0 0 0 0 0 0 -1 0
#> [3,] 0 0 3 0 -1 0 0 0 0 0 0 0 0
#> [4,] 0 0 0 7 0 0 0 -1 0 -1 0 0 0
#> [5,] -1 -1 -1 0 13 0 -1 0 0 0 -1 0 -1
#> [6,] -1 0 0 0 0 6 0 -1 0 0 -1 -1 0
#> [7,] -1 0 0 0 -1 0 8 0 0 0 -1 0 0
#> [8,] -1 0 0 -1 0 -1 0 10 -1 0 -1 0 0
#> [9,] -1 0 0 0 0 0 0 -1 7 0 0 0 0
#> [10,] -1 0 0 -1 0 0 0 0 0 9 -1 -1 0
#> [11,] -1 0 0 0 -1 -1 -1 -1 0 -1 9 0 0
#> [12,] 0 -1 0 0 0 -1 0 0 0 -1 0 7 0
#> [13,] 0 0 0 0 -1 0 0 0 0 0 0 0 3
#> [14,] -1 -1 -1 -1 -1 0 0 0 -1 0 0 -1 -1
#> [15,] -1 -1 0 -1 -1 0 0 0 -1 -1 0 0 -1
#> [16,] -1 0 0 -1 0 0 0 0 0 0 0 -1 0
#> [17,] 0 0 0 0 0 0 -1 0 0 0 -1 0 0
#> [18,] 0 0 0 0 0 0 0 -1 -1 -1 0 0 0
#> [19,] 0 0 0 0 0 -1 0 0 0 -1 -1 0 0
#> [20,] -1 0 0 -1 -1 0 -1 0 0 0 0 0 0
#> [21,] 0 0 0 -1 0 -1 -1 -1 0 0 0 -1 0
#> [22,] 0 -1 0 0 -1 0 0 -1 -1 0 -1 -1 0
#> [23,] -1 0 0 0 -1 0 -1 -1 0 -1 0 0 0
#> [24,] 0 0 0 0 -1 0 -1 -1 -1 -1 0 0 0
#> [25,] -1 -1 -1 0 -1 0 0 0 0 0 0 0 0
#> [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24] [,25]
#> [1,] -1 -1 -1 0 0 0 -1 0 0 -1 0 -1
#> [2,] -1 -1 0 0 0 0 0 0 -1 0 0 -1
#> [3,] -1 0 0 0 0 0 0 0 0 0 0 -1
#> [4,] -1 -1 -1 0 0 0 -1 -1 0 0 0 0
#> [5,] -1 -1 0 0 0 0 -1 0 -1 -1 -1 -1
#> [6,] 0 0 0 0 0 -1 0 -1 0 0 0 0
#> [7,] 0 0 0 -1 0 0 -1 -1 0 -1 -1 0
#> [8,] 0 0 0 0 -1 0 0 -1 -1 -1 -1 0
#> [9,] -1 -1 0 0 -1 0 0 0 -1 0 -1 0
#> [10,] 0 -1 0 0 -1 -1 0 0 0 -1 -1 0
#> [11,] 0 0 0 -1 0 -1 0 0 -1 0 0 0
#> [12,] -1 0 -1 0 0 0 0 -1 -1 0 0 0
#> [13,] -1 -1 0 0 0 0 0 0 0 0 0 0
#> [14,] 12 0 0 0 0 0 -1 0 -1 -1 0 -1
#> [15,] 0 10 0 0 0 -1 0 0 -1 -1 0 0
#> [16,] 0 0 6 0 -1 0 -1 -1 0 0 0 0
#> [17,] 0 0 0 5 -1 0 0 0 -1 -1 0 0
#> [18,] 0 0 -1 -1 8 -1 -1 0 0 0 0 -1
#> [19,] 0 -1 0 0 -1 7 -1 0 -1 0 0 0
#> [20,] -1 0 -1 0 -1 -1 9 0 -1 0 0 0
#> [21,] 0 0 -1 0 0 0 0 7 0 -1 0 0
#> [22,] -1 -1 0 -1 0 -1 -1 0 11 0 0 0
#> [23,] -1 -1 0 -1 0 0 0 -1 0 10 0 -1
#> [24,] 0 0 0 0 0 0 0 0 0 0 5 0
#> [25,] -1 0 0 0 -1 0 0 0 0 -1 0 7
#> attr(,"representation")
#> [1] "laplacian"
#>
#> [[2]]
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
#> [1,] 9 0 -1 -1 -1 0 -1 0 -1 -1 0 0 0
#> [2,] 0 7 0 0 -1 -1 -1 0 -1 0 0 0 -1
#> [3,] -1 0 8 0 0 -1 0 -1 0 0 0 -1 0
#> [4,] -1 0 0 8 0 0 0 -1 0 0 -1 0 0
#> [5,] -1 -1 0 0 7 -1 0 0 -1 0 0 0 0
#> [6,] 0 -1 -1 0 -1 10 0 0 0 -1 -1 -1 0
#> [7,] -1 -1 0 0 0 0 6 -1 0 0 0 -1 0
#> [8,] 0 0 -1 -1 0 0 -1 7 0 0 -1 -1 0
#> [9,] -1 -1 0 0 -1 0 0 0 7 0 0 -1 0
#> [10,] -1 0 0 0 0 -1 0 0 0 8 0 0 -1
#> [11,] 0 0 0 -1 0 -1 0 -1 0 0 9 -1 -1
#> [12,] 0 0 -1 0 0 -1 -1 -1 -1 0 -1 11 0
#> [13,] 0 -1 0 0 0 0 0 0 0 -1 -1 0 8
#> [14,] 0 0 0 -1 0 -1 0 0 0 0 0 0 0
#> [15,] 0 0 0 0 0 -1 0 0 0 -1 0 -1 -1
#> [16,] 0 0 0 -1 0 0 0 0 0 -1 -1 -1 0
#> [17,] 0 0 -1 0 0 0 0 0 0 0 0 0 -1
#> [18,] -1 0 0 0 -1 0 0 0 0 -1 0 -1 0
#> [19,] -1 0 0 0 0 0 0 -1 0 0 0 0 -1
#> [20,] 0 0 0 -1 -1 0 0 0 -1 0 -1 0 -1
#> [21,] 0 0 -1 0 0 0 -1 0 -1 0 -1 0 0
#> [22,] 0 -1 -1 -1 0 0 0 0 0 0 -1 -1 0
#> [23,] -1 -1 -1 0 -1 -1 0 -1 0 0 0 0 0
#> [24,] 0 0 0 -1 0 0 -1 0 0 -1 0 0 0
#> [25,] 0 0 0 0 0 -1 0 0 -1 -1 0 -1 -1
#> [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24] [,25]
#> [1,] 0 0 0 0 -1 -1 0 0 0 -1 0 0
#> [2,] 0 0 0 0 0 0 0 0 -1 -1 0 0
#> [3,] 0 0 0 -1 0 0 0 -1 -1 -1 0 0
#> [4,] -1 0 -1 0 0 0 -1 0 -1 0 -1 0
#> [5,] 0 0 0 0 -1 0 -1 0 0 -1 0 0
#> [6,] -1 -1 0 0 0 0 0 0 0 -1 0 -1
#> [7,] 0 0 0 0 0 0 0 -1 0 0 -1 0
#> [8,] 0 0 0 0 0 -1 0 0 0 -1 0 0
#> [9,] 0 0 0 0 0 0 -1 -1 0 0 0 -1
#> [10,] 0 -1 -1 0 -1 0 0 0 0 0 -1 -1
#> [11,] 0 0 -1 0 0 0 -1 -1 -1 0 0 0
#> [12,] 0 -1 -1 0 -1 0 0 0 -1 0 0 -1
#> [13,] 0 -1 0 -1 0 -1 -1 0 0 0 0 -1
#> [14,] 2 0 0 0 0 0 0 0 0 0 0 0
#> [15,] 0 7 0 -1 0 0 0 0 0 -1 0 -1
#> [16,] 0 0 8 -1 0 0 -1 -1 0 0 -1 0
#> [17,] 0 -1 -1 7 0 0 0 0 -1 -1 -1 0
#> [18,] 0 0 0 0 6 -1 -1 0 0 0 0 0
#> [19,] 0 0 0 0 -1 5 -1 0 0 0 0 0
#> [20,] 0 0 -1 0 -1 -1 9 0 0 0 -1 0
#> [21,] 0 0 -1 0 0 0 0 6 -1 0 0 0
#> [22,] 0 0 0 -1 0 0 0 -1 9 -1 0 -1
#> [23,] 0 -1 0 -1 0 0 0 0 -1 9 0 0
#> [24,] 0 0 -1 -1 0 0 -1 0 0 0 6 0
#> [25,] 0 -1 0 0 0 0 0 0 -1 0 0 7
#> attr(,"representation")
#> [1] "laplacian"
#>
#> [[3]]
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
#> [1,] 5 0 0 0 0 0 0 0 0 0 0 0 0
#> [2,] 0 5 0 0 0 0 0 0 0 -1 0 -1 0
#> [3,] 0 0 8 -1 -1 -1 -1 0 -1 0 0 0 0
#> [4,] 0 0 -1 10 -1 0 0 -1 -1 0 0 0 0
#> [5,] 0 0 -1 -1 4 0 0 0 0 0 0 0 0
#> [6,] 0 0 -1 0 0 7 0 -1 0 0 0 0 -1
#> [7,] 0 0 -1 0 0 0 7 0 0 0 0 -1 0
#> [8,] 0 0 0 -1 0 -1 0 8 -1 -1 0 0 0
#> [9,] 0 0 -1 -1 0 0 0 -1 8 -1 -1 -1 -1
#> [10,] 0 -1 0 0 0 0 0 -1 -1 5 0 0 0
#> [11,] 0 0 0 0 0 0 0 0 -1 0 4 -1 0
#> [12,] 0 -1 0 0 0 0 -1 0 -1 0 -1 8 0
#> [13,] 0 0 0 0 0 -1 0 0 -1 0 0 0 5
#> [14,] 0 0 0 0 0 -1 -1 -1 0 0 0 0 0
#> [15,] -1 0 0 0 0 -1 0 0 -1 0 0 0 0
#> [16,] -1 0 -1 -1 -1 0 0 0 0 -1 0 -1 -1
#> [17,] -1 -1 0 -1 0 -1 -1 0 0 0 0 -1 0
#> [18,] -1 -1 0 -1 0 -1 0 -1 0 0 -1 0 0
#> [19,] -1 0 0 -1 0 0 0 -1 0 0 0 0 -1
#> [20,] 0 0 0 -1 -1 0 -1 -1 0 -1 0 -1 -1
#> [21,] 0 -1 0 0 0 0 -1 0 0 0 0 0 0
#> [22,] 0 0 -1 0 0 0 -1 0 0 0 0 0 0
#> [23,] 0 0 -1 0 0 0 0 0 0 0 0 0 0
#> [24,] 0 0 0 -1 0 0 0 0 0 0 0 -1 0
#> [25,] 0 0 0 0 0 0 0 0 0 0 -1 0 0
#> [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24] [,25]
#> [1,] 0 -1 -1 -1 -1 -1 0 0 0 0 0 0
#> [2,] 0 0 0 -1 -1 0 0 -1 0 0 0 0
#> [3,] 0 0 -1 0 0 0 0 0 -1 -1 0 0
#> [4,] 0 0 -1 -1 -1 -1 -1 0 0 0 -1 0
#> [5,] 0 0 -1 0 0 0 -1 0 0 0 0 0
#> [6,] -1 -1 0 -1 -1 0 0 0 0 0 0 0
#> [7,] -1 0 0 -1 0 0 -1 -1 -1 0 0 0
#> [8,] -1 0 0 0 -1 -1 -1 0 0 0 0 0
#> [9,] 0 -1 0 0 0 0 0 0 0 0 0 0
#> [10,] 0 0 -1 0 0 0 -1 0 0 0 0 0
#> [11,] 0 0 0 0 -1 0 0 0 0 0 0 -1
#> [12,] 0 0 -1 -1 0 0 -1 0 0 0 -1 0
#> [13,] 0 0 -1 0 0 -1 -1 0 0 0 0 0
#> [14,] 3 0 0 0 0 0 0 0 0 0 0 0
#> [15,] 0 3 0 0 0 0 0 0 0 0 0 0
#> [16,] 0 0 10 0 -1 -1 0 0 0 -1 0 0
#> [17,] 0 0 0 9 -1 0 0 0 0 -1 0 -1
#> [18,] 0 0 -1 -1 10 0 0 0 -1 0 0 -1
#> [19,] 0 0 -1 0 0 6 0 -1 0 0 0 0
#> [20,] 0 0 0 0 0 0 10 -1 -1 0 0 -1
#> [21,] 0 0 0 0 0 -1 -1 4 0 0 0 0
#> [22,] 0 0 0 0 -1 0 -1 0 4 0 0 0
#> [23,] 0 0 -1 -1 0 0 0 0 0 4 -1 0
#> [24,] 0 0 0 0 0 0 0 0 0 -1 3 0
#> [25,] 0 0 0 -1 -1 0 -1 0 0 0 0 4
#> attr(,"representation")
#> [1] "laplacian"
It is possible to choose which distance to consider in the analysis. Let G and H be two networks with N nodes each, and suppose that X and Y are the matrix representations of G and H, respectively. The user can currently choose among 4 distances: Hamming, Frobenius, spectral and root-Euclidean.
The Hamming distance is defined as $$ \rho_H(G,H)=\frac{1}{N(N-1)}\sum_{i \neq j}^N \left| X_{i,j}-Y_{i,j} \right|. $$
In nevada, this distance can be computed with dist_hamming().
The Frobenius distance is defined as $$ \rho_F(G,H) = \left\| X - Y \right\|_F^2 = \sum_{i \neq j}^N \bigl( X_{i,j}-Y_{i,j} \bigr)^2. $$
In nevada, this distance can be computed with dist_frobenius().
The spectral distance is defined as $$ \rho_S(G,H)=\sum_{i \neq j}^N \bigl( \Lambda^X_{i,j}-\Lambda^Y_{i,j} \bigr)^2, $$ where Λ^X and Λ^Y are the diagonal matrices with eigenvalues on the diagonal given by the spectral decomposition of the matrix representations of G and H.
In nevada, this distance can be computed with dist_spectral().
The root-Euclidean distance is defined as $$ \rho_{RE}(G,H) = \left\| X^{1/2} - Y^{1/2} \right\|_F^2. $$ Note that this distance is not compatible with all matrix representations, as it requires the representation to be positive semi-definite.
In nevada, this distance can be computed with dist_root_euclidean().
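As an illustration, here is a hedged sketch computing these distances between two simulated networks. It assumes that the dist_*() functions accept two igraph objects together with a representation argument naming one of the matrix representations introduced above.

library(nevada)
library(igraph)
g1 <- sample_gnp(25, 1/3)
g2 <- sample_gnp(25, 1/3)
dist_hamming(g1, g2, representation = "adjacency")
dist_frobenius(g1, g2, representation = "laplacian")
dist_spectral(g1, g2, representation = "laplacian")
# The Laplacian is positive semi-definite, so it is a valid choice here
dist_root_euclidean(g1, g2, representation = "laplacian")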
Distance matrix for an nvd object

Pre-computing the matrix of pairwise distances for a sample of networks alleviates the computational burden of permutation testing. This is why nevada provides the convenient dist_nvd() function, which does exactly that for an object of class nvd.
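For example, assuming dist_nvd() exposes representation and distance arguments matching the choices above and returns an object of class dist:

library(nevada)
x <- nvd(model = "gnp", n = 10, model_params = list(p = 1/3))
D <- dist_nvd(x, representation = "laplacian", distance = "frobenius")
class(D)  # expected to be "dist": the pre-computed matrix of pairwise distances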
The nevada package has been designed to work well with the flipr package, which handles the permutation scheme once a suitable representation, distance and test statistic have been chosen. The most efficient approach to two-sample testing with network-valued data is to use statistics based on inter-point distances, that is, pairwise distances between observations.
A number of test statistics along this line have been proposed in the
literature, including ours (Lovato et al.
2020). As these test statistics rely on inter-point distances,
they are not specific to network-valued data. As such, they can be found
in flipr. We
adopt the naming convention that a test statistic function shall start
with the prefix stat_
. All statistics based on inter-point
distances are named with the suffix _ip
. Here is the list
of test statistics based on inter-point distances that are currently
available in flipr:
- stat_student_ip() and its alias stat_t_ip() implement a Student-like test statistic based on inter-point distances proposed by Lovato et al. (2020);
- stat_fisher_ip() and its alias stat_f_ip() implement a Fisher-like test statistic based on inter-point distances proposed by Lovato et al. (2020);
- stat_bg_ip() implements the statistic proposed by Biswas and Ghosh (2014);
- stat_energy_ip() implements the class of energy-based statistics proposed by Székely and Rizzo (2013);
- stat_cq_ip() implements the statistic proposed by S. X. Chen and Qin (2010);
- stat_mod_ip() implements a statistic that computes the mean of inter-point distances;
- stat_dom_ip() implements a statistic that computes the distance between the medoids of the two samples, possibly standardized by the pooled corresponding variances.

There are also 3 statistics proposed in H. Chen, Chen, and Su (2018) that are based on a similarity graph built on top of the distance matrix; one of them is the original edge count statistic, implemented in nevada and referred to below as "original_edge_count".
There are also Student-like statistics available only for the Frobenius distance, for which we can easily compute the Fréchet mean; they are referred to below by names such as "student_euclidean".
In addition to the test statistic functions already implemented in flipr and nevada, you can also implement your own function. Test statistic functions compatible with flipr should have at least two mandatory input arguments:

- data, which is either a concatenated list of size nx + ny regrouping the data points of both samples or a distance matrix of size (nx + ny) × (nx + ny) stored as an object of class dist;
- indices1, which is an integer vector of size nx storing the indices of the data points belonging to the first sample in the current permuted version of the data.

The flipr package provides a helper function use_stat(nsamples = 2, stat_name = ) which makes it easy for users to create their own test statistic ready to be used by nevada. This function creates and saves a .R file in the R/ folder of the current working directory and populates it with the following template:
#' Test Statistic for the Two-Sample Problem
#'
#' This function computes the test statistic...
#'
#' @param data A list storing the concatenation of the two samples from which
#' the user wants to make inference. Alternatively, a distance matrix stored
#' in an object of class \code{\link[stats]{dist}} of pairwise distances
#' between data points.
#' @param indices1 An integer vector that contains the indices of the data
#' points belong to the first sample in the current permuted version of the
#' data.
#'
#' @return A numeric value evaluating the desired test statistic.
#' @export
#'
#' @examples
#' # TO BE DONE BY THE DEVELOPER OF THE PACKAGE
stat_{{{name}}} <- function(data, indices1) {
n <- if (inherits(data, "dist"))
attr(data, "Size")
else if (inherits(data, "list"))
length(data)
else
stop("The `data` input should be of class either list or dist.")
indices2 <- seq_len(n)[-indices1]
x <- data[indices1]
y <- data[indices2]
# Here comes the code that computes the desired test
# statistic from input samples stored in lists x and y
}
For instance, a flipr-compatible version of the t-statistic with pooled variance would look like this:
stat_student <- function(data, indices1) {
n <- if (inherits(data, "dist"))
attr(data, "Size")
else if (inherits(data, "list"))
length(data)
else
stop("The `data` input should be of class either list or dist.")
indices2 <- seq_len(n)[-indices1]
x <- data[indices1]
y <- data[indices2]
# Here comes the code that computes the desired test
# statistic from input samples stored in lists x and y
x <- unlist(x)
y <- unlist(y)
stats::t.test(x, y, var.equal = TRUE)$statistic
}
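As another illustration, here is a hypothetical statistic (the name stat_mean_between is made up for this sketch) that works directly on the dist form of the data argument, using only inter-point distances between the two samples:

stat_mean_between <- function(data, indices1) {
  if (!inherits(data, "dist"))
    stop("This sketch only handles the `dist` form of the `data` input.")
  n <- attr(data, "Size")
  indices2 <- seq_len(n)[-indices1]
  D <- as.matrix(data)
  # Average distance between points of the first sample and points of the second one
  mean(D[indices1, indices2])
}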
Test statistics are passed to the functions test2_global() and test2_local() via the argument stats, which accepts a character vector in which:

- test statistics implemented in nevada are referred to by the name of the corresponding function stripped of the stat_ prefix (e.g. "original_edge_count" or "student_euclidean");
- test statistics implemented in flipr are referred to by the name of the corresponding function stripped of the stat_ prefix but with the flipr: prefix added (e.g. "flipr:student_ip");
- test statistics implemented in any other package pkg are referred to by the name of the corresponding function stripped of the stat_ prefix but with the pkg: prefix added.

For example:

x <- nvd(model = "gnp", n = 10, model_params = list(p = 1/3))
y <- nvd(model = "k_regular" , n = 10, model_params = list(k = 8L))
test2_global(
x = x,
y = y,
representation = "laplacian",
distance = "frobenius",
stats = c("flipr:student_ip", "flipr:fisher_ip"),
seed = 1234
)$pvalue
#> [1] 0.0009962984
Note that you can also refer to test statistic functions from nevada using the naming "nevada:original_edge_count", as you would do for test statistics from flipr. This is mandatory, for instance, if you have not yet loaded nevada into your environment via library(nevada).
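For instance, building on the previous example, the following call mixes a statistic from nevada (using the nevada: prefix) with one from flipr; the output is not reproduced here.

library(nevada)
x <- nvd(model = "gnp", n = 10, model_params = list(p = 1/3))
y <- nvd(model = "k_regular", n = 10, model_params = list(k = 8L))
test2_global(
  x = x,
  y = y,
  representation = "laplacian",
  distance = "frobenius",
  stats = c("nevada:original_edge_count", "flipr:student_ip"),
  seed = 1234
)$pvalue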
In permutation testing, the choice of a test statistic determines the
alternative hypothesis, while the null hypothesis is always that the
distributions that generated the observed samples are the same. This
means that if you were to use the Student statistic
stat_student_ip()
for instance, then what you would actually be testing is whether the means of the distributions are
different. If you’d rather be sensitive to differences in variances of
the distributions, then you should go with the Fisher statistic
stat_fisher_ip()
.
You can also be sensitive to multiple aspects of a distribution when
testing via the permutation framework. This is achieved under the hood
by the flipr
package which implements the so-called non-parametric
combination (NPC) approach proposed by Pesarin and Salmaso (2010) when you provide more
than one test statistic in the stats
argument. You can
read this
article to know more about its implementation in flipr.
The bottom line is that, for example, you can choose both the Student
and Fisher statistics to test simultaneously for differences in mean and
in variance.