Graph creation

Idealized random networks can be automatically generated by the application, or networks can be uploaded by the user for analysis. SpectralNET can automatically generate random Erdős-Rényi graphs [7], Barabási-Albert (scale-free) graphs [8], re-wiring Barabási-Albert graphs [9], Watts-Strogatz (small-world) graphs [10], or hierarchical graphs [11]. Each automatically generated graph type is customizable with algorithmic parameters. SpectralNET was designed with extensibility in mind, so users may request additional random graph types by submitting a succinct algorithm to the author, or may create their own.
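As an illustration of the graph families listed above, the following Python sketch generates comparable random graphs with the networkx library (SpectralNET is a standalone application; this is an analogue, not its own code):

```python
# Illustrative analogues of the random graph families SpectralNET can
# generate, built with the networkx library (not SpectralNET's own code).
import networkx as nx

n = 300  # number of nodes

er = nx.erdos_renyi_graph(n, p=0.02, seed=1)          # Erdos-Renyi: each edge present with probability p
ba = nx.barabasi_albert_graph(n, m=2, seed=1)         # Barabasi-Albert: preferential attachment, scale-free
ws = nx.watts_strogatz_graph(n, k=4, p=0.04, seed=1)  # Watts-Strogatz: ring lattice with random re-wiring

for g in (er, ba, ws):
    print(g.number_of_nodes(), g.number_of_edges())
```

Each generator exposes the algorithmic parameters of its family (edge probability, attachment count, re-wiring probability), mirroring the customization the application offers.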

Networks can be uploaded by the user in the form of a Pajek file [12] or a tab-delimited text file with one edge per line (see additional file 1: HumanPPI_nodenodeweight.txt for an example network definition file defining a network of human protein-protein interactions). Raw data files can also be uploaded to the application, where each line of data is represented as a labeled vertex. Vertices can be connected with edge weights equal to the square of the correlation of their associated input data, or according to their Euclidean distance as defined by the Eigenmap algorithm [13]. If raw data is uploaded by the user, principal component analysis (PCA) [14,15] can optionally be performed on the data before calculating edge weights.
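The squared-correlation weighting described above can be sketched in a few lines of numpy (a toy illustration, not SpectralNET's implementation):

```python
# Sketch of turning raw data rows into a weighted network: each row is a
# vertex, and the weight of the edge between two vertices is the square of
# the Pearson correlation of their associated data.
import numpy as np

data = np.array([[1.0, 2.0, 3.0, 4.0],
                 [2.0, 4.1, 5.9, 8.0],
                 [4.0, 3.0, 2.0, 1.0]])  # 3 vertices, 4 measurements each

r = np.corrcoef(data)            # pairwise Pearson correlations between rows
weights = r ** 2                 # squared correlation as edge weight
np.fill_diagonal(weights, 0.0)   # no self-loops

print(np.round(weights, 3))
```

Note that rows 0 and 2 are perfectly anticorrelated, so their edge weight is 1.0: squaring the correlation makes strong negative relationships as heavily weighted as strong positive ones.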

Graph analysis

After processing the input network, SpectralNET displays for the user a wide variety of graph-analytic metrics. For example, the degree and clustering coefficient are displayed for each vertex. The degree of a vertex is the number of edges incident upon that vertex; for weighted graphs, SpectralNET calculates this as the sum of those edges' weights. The clustering coefficient of a node represents the proportion of its neighbors that are connected to each other, and is calculated for a node *i* as:

C_{i} = 2*n*_{i} / (*k*_{i}(*k*_{i} − 1)),

where *n*_{i} denotes the number of edges connecting neighbors of node *i* to each other, and *k*_{i} denotes the number of neighbors of node *i* [16]. In addition, the minimum, average, and maximum distances of each vertex are displayed, which are defined as the shortest, average, and maximum distances, respectively, from the node to any other node in the graph. The components of the adjacency, Laplacian, and normalized Laplacian eigenvectors corresponding to the vertex are also shown, where the adjacency matrix is defined as the matrix *A* with the following elements:

A_{ij} = *w*(*e*) if an edge *e* connects nodes *i* and *j*, and A_{ij} = 0 otherwise;

the Laplacian matrix is defined as the matrix L with the following elements:

L_{ij} = *d*_{i} if *i* = *j*; L_{ij} = −*w*(*e*) if an edge *e* connects nodes *i* and *j*; and L_{ij} = 0 otherwise,

where *w*(*e*) denotes the weight of edge *e*; and the normalized Laplacian matrix is defined as the matrix ℒ with the following elements:

ℒ_{ij} = 1 if *i* = *j*; ℒ_{ij} = −*w*(*e*)/√(*d*_{i}*d*_{j}) if an edge *e* connects nodes *i* and *j*; and ℒ_{ij} = 0 otherwise,

where *d*_{i} denotes the degree of node *i* [5]. It should be noted that Chung defines the Laplacian matrix as the normalized form above, but we use the more commonly found definition (for an example, see Mohar [17]).
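The per-vertex metrics and matrices above can be computed directly from an adjacency matrix; the following numpy sketch (an illustration under the unweighted case, not SpectralNET's code) shows degree, the Laplacians, and the clustering coefficient:

```python
# Minimal sketch of the per-vertex metrics and matrices described above,
# for an unweighted graph given by its adjacency matrix A.
import numpy as np

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 0],
              [0, 1, 0, 0]], dtype=float)

d = A.sum(axis=1)        # degree d_i (sum of incident edge weights)
L = np.diag(d) - A       # Laplacian: L_ii = d_i, L_ij = -w(e)

# Normalized Laplacian: 1 on the diagonal, -w(e)/sqrt(d_i d_j) off it
Dinv = np.diag(1.0 / np.sqrt(d))
L_norm = Dinv @ L @ Dinv

# Clustering coefficient C_i = 2 n_i / (k_i (k_i - 1)); n_i counts edges
# among the neighbors of i (the diagonal of A^3 counts closed triangles twice)
triangles = np.diag(A @ A @ A) / 2.0
k = d
C = np.where(k > 1, 2 * triangles / (k * (k - 1)), 0.0)
print(C)
```

In this toy graph, node 0's two neighbors are connected to each other, so C_0 = 1, while node 1's three neighbors share only one edge, giving C_1 = 1/3.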

Many large networks derived from biological data are composed of multiple subgraphs that are not always connected together. SpectralNET computes many properties based on the selected or "active" connected component. For the active connected component, its size and average diameter are displayed in addition to graphs of degree distribution [18], clustering coefficient by degree, and average distance by degree [19]. Graphs of eigenvalues, eigenvectors, inverse participation ratios, and spectral densities of the three matrix types are also displayed. The inverse participation ratio is defined for each eigenvector as:

I_{j} = Σ_{k} [(*e*_{j})_{k}]^4,

where *e*_{j} represents the (unit-norm) eigenvector and the sum runs over its components. Spectral density, or the density of the eigenvalues, is plotted for each eigenvalue as *λ* on the horizontal axis and ρ(*λ*) on the vertical axis, with the function ρ defined on any eigenvalue as:

ρ(*λ*) = (1/N) Σ_{j} *δ*(*λ* − *λ*_{j}),

where *λ* is the eigenvalue, N is the number of nodes, and *δ* represents the delta function, implemented as described above [20]. Most graphs can be mouse-clicked to select the vertex corresponding to a desired data point, and eigenvalue graphs can be sorted by value or by vertex degree. All calculated graph metrics can be exported as a tab-delimited text file for further analysis.
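The spectral quantities above can be sketched as follows; here the delta function is approximated by simple histogram binning (SpectralNET's exact smoothing may differ), and the example matrix is a toy graph:

```python
# Eigendecomposition of the Laplacian, inverse participation ratio of each
# eigenvector, and a binned approximation of the spectral density.
import numpy as np

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 0],
              [0, 1, 0, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A

eigvals, eigvecs = np.linalg.eigh(L)  # symmetric solver; eigenvalues ascending

# Inverse participation ratio: sum of fourth powers of the unit-norm
# eigenvector components; near 1/N when delocalized, near 1 when localized
ipr = (eigvecs ** 4).sum(axis=0)

# Spectral density rho(lambda) = (1/N) sum_j delta(lambda - lambda_j),
# approximated here by a normalized histogram of the eigenvalues
density, bin_edges = np.histogram(eigvals, bins=4, density=True)
print(eigvals, ipr, density)
```

The smallest Laplacian eigenvalue is always 0 for a connected graph, which is why the first eigenvector is uninformative for embedding (see the next section).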

Visualization and dimensionality reduction

The main graph display window of SpectralNET offers two interactive graphical network displays that support zooming and allow vertex selection by mouse-click. The default view is the graph as processed by the Fruchterman-Reingold algorithm [21], which positions vertices by force-directed placement. The other available display is the network's Laplacian embedding, which locates vertices in two-dimensional Euclidean space using the corresponding second and third Laplacian eigenvector components (the first eigenvector of the Laplacian matrix is degenerate). Exporting the other Laplacian eigenvector components allows for visualization in higher dimensions.
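A Laplacian embedding of this kind can be sketched in a few lines; this toy example (not SpectralNET's implementation) embeds a 6-node ring, whose eigenvectors are discrete sines and cosines, so the vertices land on a circle:

```python
# Laplacian embedding: place each vertex in 2-D using the components of the
# second and third Laplacian eigenvectors (the first is constant, with
# eigenvalue 0, for a connected graph, so it carries no positional information).
import numpy as np

n = 6
A = np.zeros((n, n))
for i in range(n):                      # build a 6-node ring
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0

L = np.diag(A.sum(axis=1)) - A
eigvals, eigvecs = np.linalg.eigh(L)    # eigenvalues in ascending order

coords = eigvecs[:, 1:3]                # (x, y) per vertex
print(np.round(coords, 3))
```

Using columns beyond the third instead of exporting them would give the higher-dimensional views mentioned above.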

In conjunction with uploaded raw data, Laplacian embedding allows the user to see a reduced-dimensionality view of high-dimensionality input, once this input is converted into a network. If the user chooses to process input data using the Eigenmap algorithm, Laplacian embedding shows the reduced-dimensionality result [13]. Dimensionality reduction has proven to be a useful tool in computational chemistry and bioinformatics; for example, Agrafiotis [22] used multidimensional scaling (MDS) to reduce the dimensionality of combinatorial library descriptors, and Lin [15] used PCA to analyze single nucleotide polymorphisms from genomic data. We chose to implement Laplacian embedding rather than MDS or other algorithms in SpectralNET because of promising results in the field of machine learning [23]. Although dimensionality reduction is especially useful for analyzing high-dimensional data, Laplacian embedding is an elegant display choice for any input network (see the next section for an example using a scale-free biological network). For a simpler (linear) dimensionality-reduced view of the input data, SpectralNET also has the option of viewing the results of PCA (though this view is not available when a network definition file, such as a Pajek file, is used). Both Laplacian embedding and PCA can be viewed in three dimensions with a Virtual Reality Modeling Language (VRML) viewer.

Example analysis of a randomly-generated small-world network and a biological scale-free network

SpectralNET provides an easy-to-use interface for creating a randomly generated small-world network. The user supplies only the desired number of nodes, the number of neighbors to which each node is connected, and the probability that an edge is re-wired at random. For this example we create a network with 300 nodes in which each node is connected to four neighbors, and edges are re-wired with 4% probability.
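The same construction can be expressed with networkx (an illustrative analogue, not the application itself), and the hallmark small-world properties checked directly:

```python
# The 300-node, 4-neighbor, 4%-re-wiring small-world construction, sketched
# with networkx; high clustering with short paths is the small-world signature.
import networkx as nx

G = nx.watts_strogatz_graph(n=300, k=4, p=0.04, seed=0)

print(G.number_of_edges())          # ring lattice keeps n*k/2 edges
print(nx.average_clustering(G))     # stays high at low re-wiring probability
if nx.is_connected(G):
    print(nx.average_shortest_path_length(G))  # shortcuts keep this small
```

At p = 0.04 the clustering of the underlying ring lattice is largely preserved while the few re-wired shortcuts sharply reduce path lengths.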

The default view of the graph is its Fruchterman-Reingold display, which, as noted above, uses force-directed placement to draw graph nodes (Figure ). While the Fruchterman-Reingold display offers a quickly generated view of large networks, relatively little information about the global organization of the network is observable in the display of this small-world network (one cannot tell, for example, that the graph is a small-world network by its Fruchterman-Reingold display alone). In order to see the graph as drawn by the Laplacian eigenvector components of each node, the "Laplacian Embedding" radio button underneath the graph display is selected. In contrast to the Fruchterman-Reingold display, the Laplacian embedding of this small-world network (Figure ) conveys significantly more information about its topology. In this display, it is clear that the small-world network was generated by placing neighboring nodes next to each other in a ring-like fashion – the theoretical ring-structure is represented literally in the Laplacian embedding.

Real-world biological networks are also amenable to topological analysis using Laplacian embeddings. In order to generate a suitable biological network to analyze, the MIPS Mammalian Protein-Protein Interaction Database [26] was downloaded and parsed into a node-node-weight file for import into SpectralNET (see additional file 1: HumanPPI_nodenodeweight.txt). The Laplacian embedding of the largest connected component of the resulting graph (Figure ) shows a central hub of highly connected proteins joined to four branches. Spectral analysis similar to that performed below shows that the network is scale-free in nature, as is further evidenced by the fact that there are many more low-degree proteins than high-degree proteins, with the relationship between the number of proteins and protein degree following a power-law distribution (data not shown). The scale-free nature of this network suggests that highly connected proteins in the central hub may perform a coordinating role for the proteins in this interaction network. Examining the most highly connected protein in the central hub of the network (indicated in Figure ) shows that, indeed, it is the transcriptional co-activator SRC-1, which receives and augments signals from multiple pathways [27]. Readers with further interest in topological analysis of biological networks are encouraged to read Farkas et al. [28] for a global analysis of the transcriptional regulatory network of *S. cerevisiae* or Jeong et al. [29] for an analysis of the yeast protein interaction network.
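The power-law check described above can be sketched as follows; a Barabási-Albert graph stands in for the protein-interaction network, and the assumed procedure (a log-log linear fit to the degree histogram) is one common heuristic, not SpectralNET's exact method:

```python
# Hedged sketch of a scale-free check: count nodes at each degree and fit a
# line on log-log axes; for P(k) ~ k^-gamma, log(count) is linear in log(k).
import collections
import networkx as nx
import numpy as np

G = nx.barabasi_albert_graph(1000, 2, seed=0)   # stand-in scale-free network
degree_counts = collections.Counter(d for _, d in G.degree())

# Keep degrees with enough nodes for a stable count, then fit the exponent
ks = np.array(sorted(k for k in degree_counts if degree_counts[k] >= 5))
counts = np.array([degree_counts[k] for k in ks])
slope, intercept = np.polyfit(np.log(ks), np.log(counts), 1)
print(-slope)   # estimated power-law exponent
```

Many more low-degree than high-degree nodes, with a roughly linear log-log histogram, is the qualitative signature referred to in the text.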

In addition to the graphical display of networks, SpectralNET enables analysis of the spectral properties of input networks, which can shed light on graph topology. One way this can be achieved is to examine a small-world network similar, but not identical, to the randomly generated small-world network described above. This graph is a small-world network created by attaching complete subgraphs, varying in size from three to six nodes, to nodes arrayed in a ring (see additional file 2: Small-world_nodenodeweight for the network definition file, originally described by Comellas [24]) (Figure ). The spectral properties of this graph can be used to help identify the topology of the original graph, in this case by comparing its adjacency and Laplacian spectral densities to those of idealized graphs (Figure ) [5,20]. Spectral density measures the density of surrounding eigenvalues at each eigenvalue and serves as an especially useful metric of global graph topology. The plot of these values for the example network is most similar to the corresponding plots for a Watts-Strogatz network (here, 500 nodes connected to 6 neighbors, with a re-wiring probability of 1%), despite the fact that there are only 33 nodes in the example network. Thus, even when an example network has relatively few nodes, comparison of the spectral properties of the graph to those of idealized graphs can yield clues about the network's topology.

Dimensionality reduction of a real-world chemical dataset to analyze QSAR

In addition to performing spectral analysis of networks, SpectralNET can also perform dimensionality reduction on chemical datasets to analyze quantitative structure-activity relationships (QSAR). In this example, we upload a set of chemical descriptor data into SpectralNET and analyze it using the Laplacian Eigenmap algorithm originally developed by Belkin and Niyogi [13]. This dataset contains one small molecule per row of the input file, each created by the same diversity-oriented synthesis pathway [25]. Each column of the data represents a different *molecular descriptor* – a metric used to capture an aspect of the compound, such as volume, surface area, or number of rings.

The Laplacian Eigenmap algorithm in SpectralNET connects these small molecules to their *K*-nearest neighbors (measured by Euclidean distance), where *K* is an algorithmic parameter supplied by the user. In this example, we choose *K* = 7 to yield a reasonable number of edges in the resulting graph. Weights are assigned to each edge in one of two ways – every edge can have a weight of one, or weights can be assigned by the following formula:

*W*_{ij} = exp(−||*x*_{i} − *x*_{j}||² / *t*),

where *W*_{ij} represents the weight of the edge connecting nodes *i* and *j*, *x*_{i} and *x*_{j} are the corresponding data points, and *t* is an algorithmic parameter [13]. For the molecular descriptor dataset, edge weights of one were chosen (it should be noted that when applying the second method to this dataset, increasing values of *t* eventually resulted in convergence to the same result as this method around *t* = 20,000). SpectralNET also offers the choice of performing PCA on input data before performing the Laplacian Eigenmap algorithm; this option is enabled by default and remains enabled for this example.
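The graph-construction step just described can be sketched with numpy (toy data standing in for the descriptor matrix; not SpectralNET's implementation):

```python
# Laplacian Eigenmap graph construction: connect each point to its K nearest
# neighbors (Euclidean distance) and weight edges with the heat kernel
# exp(-||xi - xj||^2 / t); uniform weights of one are the other option.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))   # 20 compounds, 5 descriptors (toy data)
K, t = 3, 10.0

# Pairwise squared Euclidean distances
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)

W = np.zeros_like(sq)
for i in range(len(X)):
    nbrs = np.argsort(sq[i])[1:K + 1]        # K nearest neighbors, excluding i
    W[i, nbrs] = np.exp(-sq[i, nbrs] / t)    # heat-kernel weights
W = np.maximum(W, W.T)   # symmetrize: keep an edge if either endpoint chose it

print(W.shape)
```

The Laplacian embedding of the resulting weighted graph, as in the previous section, then gives the reduced-dimensionality coordinates.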

The resultant Laplacian embedding of the graph, which can be viewed by selecting the "Laplacian Embedding" radio button underneath the graph view pane, is the reduced dimensionality result of the Laplacian Eigenmap algorithm (Figure ). Like PCA, the Laplacian Eigenmap algorithm performs dimensionality reduction on an input dataset such that relationships among the data are captured by fewer dimensions. Unlike PCA, however, it is not a linear transformation of the data, and the resulting non-linear dimensionality reduction can offer a more powerful view of the data than does PCA.

Because Laplacian Eigenmaps is a local, rather than global, algorithm, it seeks to preserve local topological features of the data in its reduced-dimensionality space [13]. Thus, it is difficult to compare its performance relative to a linear, global algorithm like PCA without labeled features on which to classify the data and a rigorous comparison across multiple datasets and data types. However, visual inspection of points clustered together in the Laplacian Eigenmap result (the highlighted areas in Figure ) shows that they are structurally similar relative to a set of random compounds selected from the space as a whole (Figure ), and the two outlier groups visible in the original image are also chemically similar (data not shown). The same dataset plotted on its first two principal components (via PCA) yields no significant clustering comparable to that of Laplacian Eigenmaps; instead, one large diffuse cluster and a second smaller one are visible (Figure ). Additional support for nonlinear QSAR methods comes from Douali et al. [30], who found that a nonlinear QSAR approach using neural networks predicted activities very well, outperforming other methods found in the literature. A more rigorous comparison of these algorithms in the context of molecular descriptor data is ongoing.