**Manual**

**Note**: Program requires SBW 2.6.0 or later

Here I shall describe a new user interface for biochemical networks that has been designed to integrate both tasks of interest to biologists - namely, simulating a model and analyzing the data. The data simulation is carried out by a stochastic simulator, whose parameters such as simulation start and end times, as well as data or time sampling options can be set prior to starting the simulation. This being a tool designed for statistical analysis, users can specify the number of runs of the model that the simulation should generate. Once the data for the specified number of runs has been generated, the tool then computes correlations between various species, along with their power spectral densities and transfer functions. In particular the following capabilities are supported:

**Data Generation:** In the current version of the user interface, the user can generate data for a selected number of runs once the SBML model is loaded. The model queries the stochastic simulator for data between a starting time and ending time. There are two ways in which such data can be generated. The default option is a grid based data generation method, in which the simulator computes the species numbers on a uniform time grid. This is required in order to compute ensemble averages and power spectral densities. The grid interval is therefore the parameter that controls the resolution of the results. The second option for data generation is based on species sample sizes, where the solution at each time step is advanced until one or more species sizes reach the target sample size. (It can be seen that setting the sample size to 1 recovers the raw data for a run)

The user is also provided with an option to ignore the transient data when performing analysis on the generated data. This is controlled by means of a scale factor that multiplies the minimum time estimated by taking inverse of the fastest reaction rate. The true end time thus a sum of the desired end time and the time computed as needed to reach steady state. This can also be compared by plotting the deterministic results over the stochastic data. Further, the user can continue the selected run again, which increments the current time, and uses the species numbers at the end of the previous run to initialize the current run, or alternatively, the simulation can be reset to original initial conditions.

**Probability Density:** Each data set for a given run and a selected species has a probability density from which it is realized. The tool computes the histogram for each run as well for the entire population of runs for each species in the network. Users can look at the histograms of the data for each run, as they are being generated, or alternatively, the population histogram after all the runs have been generated. Often, structure in the time history of the data can be seen in the shape of the histogram, and is an indicator of the dynamics involved. Further, the bin size can be altered, leading to more informative distributions if there are large number of data points, as is the case when large numbers of species are involved.

**Ensemble Averages:** The ensemble averages for all the species participating in the network are computed upon completion of the data generation. These can be accessed in the Ensemble Averages tab, and show the ensemble average in the top plot and the ensemble standard deviation in the bottom plot. Additional details for the species of interest (selected in the display options panel) such as population mean, standard deviation, variance and the coefficient of variation are displayed in the panel below the plots. The coefficient of variation is an entity of particular interest to biologists, and can be used to infer spread in the standard deviation across the population. Further, it should be noted that, upon reaching steady state, the values of variance should roughly correspond to the values of the population mean, reflecting the underlying Possionian assumption the stochastic simulation.

**Auto and Cross Correlations:** Any statistical analysis on data should involve the study of correlations, as they contain information about how various participants in the network are coupled dynamically. For any given data run, one can construct auto-correlations of various species as well as the cross-correlations between them. These are computed for each run, and then averaged across the population to yield averaged correlation matrices. for a network with M species, there are in total M auto-correlation matrices, and M(M-1) cross-correlation matrices. The correlations are computed from the covariance matrix by subtracting the ensemble means, and scaling by the ensemble standard deviations. It can also be noted that the auto-correlations are symmetric about the origin, while the cross-correlations do not share this property. The user has the option to enable or disable plotting of auto-correlations for each run as they are computed.

**Power Spectral Densities:** The power spectral densities (PSDs) are a good indicator of the frequency content of the network. In other words, if a network has an oscillatory behavior, the PSDs have a peak at the frequency of oscillation. These are computed by taking the Fast Fourier Transform (FFT) of the auto or cross-correlation matrix. In general, the PSD is a complex valued entity, whose absolute magnitude can be obtained by taking the magnitude of each of its elements. In the case of auto-correlations, the symmetry involved results in a real valued power spectral density function, which has no imaginary values. The cross-spectrum, which is the FFT of the cross-correlation, can on the other hand, have complex entries, and hence also involves a phase angle. The tool computes the PSDs by calling the FFT routines available in the publicly available Exocortex.DSP library. The attached figure shows the power spectral densities for a linear chain model where a boundary node has been converted to a floating node.

It should be pointed out that the user has the option of iteratively refining the power spectra, as each correlation set is computed from each data run. This is a useful option when the data generated is large, and involve a large number of populations. and yields the broad picture of the power spectral densities with the first few data runs, with subsequent runs reducing noise at higher frequencies.

**Transfer Functions:** The transfer function for a given network can be defined as the gain between a selected input and output nodes. This is sometimes easy to compute analytically for small networks, but for larger networks, one has to take recourse to methods that generate the power spectrum and thereby compute the transfer function. Transfer functions have long been studied in biological systems, but only recently has it possible to study them at the level of gene protein networks. Much of the design of electrical circuits involves designing filters that behave according to a specified transfer function between input and output, and it is thus hoped that this tool will enable such a design process, especially, with the advent and growth of Synthetic Biology. The transfer functions for the presented example are shown in the attached figure. It can be seen that there is a fair amount of noise at high frequencies. This can be reduced further by using a larger population size (here only 50 runs have been used), as well as a finer grid size (here 1) to bring out finer details in the power spectra as well as the transfer functions.

**Injecting Noise:** Often events of interest happen at the boundaries of any network, and hence it is desirable to know the relationship between an interior node (where the output is measured) and the boundary node (which can be treated as input). This relationship is given by te transfer function, and is given a ratio involving the power spectral densities and cross-spectra at the input and output. However, in the case of stochastic simulation, the limitations imposed by SBML do not allow a boundary node to be treated directly as an internal species. A boundary node thus stays constant, and hence cannot have a power spectrum. We have therefore developed a simple SBW module (SBMLModifier) that interfaces to libSBML and converts one or more boundary nodes in a network to floating type, without affecting the states of any other nodes in the network. This is done by adding a fictitious boundary node, along with associated reactions to keep the converted boundary node at the same level. Additionally, the user has the option to control the mean level of noise at this converted node. The attached figure details how the SBML modification works on a simple example, and shows how the transfer functions can be computed for this example.

**Testing Application:** This is a companion user interface to test the exactness of the stochastic simulator being used. This is done by making use of the Discrete Stochastic Model Test Suite (DSMTS) that involves a set of models that extensively test the capabilities of the simulator. Documentation for this test suite can be obtained from Calibayes. This test suite has been implemented in a testing application module called edu.kgi.StochSimTest that is part of the Systems Biology Workbench. Current implementation is restricted to the Gillespie simulator bundled with SBW, but future version will allow users to test the exactness of their own simulators, and thereby identify problems.

A paper on this has been submitted to *Bioinformatics*. *Title: Stochastic Simulation GUI for Biochemical Networks.*

**Collaborators:** H.M. Sauro.