CSE: A prototype for a cell simulation environment

This page describes a prototypical environment for simulating the biochemistry of a cell at the molecular level. This system offers a web-based environment for defining simulations, and examining and archiving simulation results, as well as an extremely simple (and biochemically unrealistic) simulation "engine" that actually conducts the simulations.

Biochemical activity is determined by a list of reactions operating on a list of reactants to produce a list of products. At each step reactions occur with a user-defined probability attached to each reaction, and the number of molecules of each species is recorded for later display. (Note that this approach focuses on individual molecules, rather than molecule concentrations. As a result, this approach lies somewhere between a purely logical representation of biochemical networks and a classical chemical representation based on sets of differential equations.)

The web interface allows users to supply a set of reactions and an initial set of molecules upon which those reactions operate, and then start the simulation and examine simulation results in a variety of display formats. In fact, this work is focused more on the development of an effective environment for defining simulation data sets, and managing and displaying results than on the simulation process itself.

The following topics are presented in this document:

A simple example: producing mRNA for Protein X

For example, a set of reactions required to produce a molecule of mRNA for the protein product X, could look like:

Reaction
number
Reaction Probability/
rate
0 Join TFIID with the Transcription Binding Protein. .7
1 Bind to the TATA box of Gene.X, Segment 1. .7
2 Build the transcription initiation complex. .7
3 Build the transcription initiation complex under enhancement. .99
4 Phosphorylate PolymeraseII to activate it. .7
5 Transcribe the first mRNA segment for Gene X .7
6 Transcribe the second mRNA segment for Gene X .7
7 Transcribe the final mRNA segment for Gene X, add the termination complex, and release the transcript. .7
8 Break up the transcription complex. .7
9 Degrade mRNA to force an equilibrium. .7

The complete reaction list includes specific reactants, catalysts, and products.

Note that Gene X is assumed to be composed of 3 separate segments (X-1, X-2, X-3) and produces mRNA.X in a segmented fashion. This is done to allow multiple Polymerase II molecules to operate on the same Gene in the usual transcription "pipeline". There are other ways to simulate such parallel activity, but this one is relatively simple.

Please ignore the inaccuracies in this reaction set. The intent of this document is simply to convey the basic operating principles of the simulator. However, such errors do serve illustrate the ways in which the simulation process may or may not represent reality, and, indeed, one of the main thrusts of this work is to understand what aspects of real biochemistry can be and/or must be accurately represented to build useful simulation tools.

Now these reaction can be applied to any collections of molecules, and for this example, they were applied to the following set of cellular components:

Compound Number
of
Molecules
ATP 100
ADP 10
PO4 1
D 1
TBP 1
TATA:Gene.X-1 1
Gene.X-2 1
Gene.X-3 1
ABF 1
PolyII 1
EH 1
PO4 100
mRNAPrecursor 100
TermII 1
Gene.XActivator 0

Running a simulation will produce multiple intermediate and final products. The reactions and molecules listed above will produce the following products within 500 simulation steps:

D/TBP/TATA:Gene.X-1
D/TBP/TATA:Gene.X-1/ABF/PolyII/EH
D/TBP/TATA:Gene.X-1/ABF/PolyII/EH/PO4
EH
Gene.X-2
Gene.X-3
Gene.XActivator
PO4
PolyII/EH/PO4/Gene.X-2
PolyII/EH/PO4/Gene.X-3
PolyII/EH/PO4/Gene.X-3/TermII
TATA:Gene.X-1
mRNAPrecursor
mRNAX
mRNAX-1
mRNAX-1.2

The simulator can display the results of the simulation in a variety of formats. In fact, as of this writing, most project activity has focussed on prototype construction and exploration of various display formats.

One of the most useful formats is a table of simple 2-D graphs, each showing the molecule counts as a function of step number for every molecule species included in the simulation. For the example above, one such graph showing the counts of mRNA molecules coding for protein X, ATP, ADP, and the mRNA precursor would look like:

ADP
ATP
mRNAPrecursor
mRNAX

Step number

Note that the reaction sequence depends upon ATP and the production of mRNA for ProteinX ceases when the supply of ATP is exhausted, so the amount of mRNA for X (plotted in yellow) remains constant beyond that step.

Display options

Display options supported by the CSE provide support for

The CSE can create tables of graphs of molecule counts with respect to timestep of every molecule included in the simulation, or allow the user to choose a subset of molecule species to graph.

The CSE can also produce text tables (without graphs) showing the number of molecules of each type available at the end of each step. This format seems most useful for debugging, or studying activity in close detail.

The package also supports an Applet, called viewReactionHistories2, that shows a list of reactions and a list of molecules both of which change over time as the simulation proceeds from step to step. From step to step the font size of the name of each molecule grows or shrinks as a function of the number of molecules of that type in the molecule pool. The font size of each reaction name also changes as a function of the number of reactions of each type occurring during a step. This kind of animation attempts to give the user a sense of the overall activity induced by a reaction set. An example is shown later.

It is also possible to get phase space graphs of molecule counts as an alternative to the graphs of molecule counts with respect to time step. These graphs should help users distinguish between significant behavior differences and stochastic noise generated by multiple runs of the same reaction and molecule sets.

Correlograms of individual molecule histories can be selected to help evaluate the dynamic behavior of individual models. A "correlogram" plots autocorrelation values against observation lags. An autocorrelation is a simple correlation of a sequence of observations against a partial copy of itself. For example, for an autocorrelation with a lag of one, the sequence X(1) through X(n-1) will be correlated against the sequence X(2) through X(n), where "n" is the number of elements in the sequence. Each correlogram shows the autocorrelation value resulting from successive lag values of 0 (which produces an autocorrelation value of 1.0) through n/4.

In addition, individual molecule histories can be "differenced" to remove periodicity and the resulting data sequence can be displayed as a phase plot of differenced molecule counts. This allows closer inspection of system dynamics.

Several other display types are contemplated, as are modifications to help users deal with larger reaction sets. For example, the viewReactionHistories Applet could be modified to produce Kiviat graphs and modified bar graphs in place of the text-based displays.

More work is necessary to develop methods for depicting behavior variation. One approach is to conduct multiple runs with the same starting data, and graph a mean and confidence interval for each molecular species. In some cases this may obscure multimodal stochastic behavior as individual simulations select alternate developmental or behavior pathways. To avoid obscuring trends, it may be preferable to create a single graph for each molecule containing a separate line for the behavior of that molecule during each run. Aggregation with respect to time would probably stand out upon inspection.

Adding reactions to produce Protein X

A more complicated set of reactions can be used to simulate the production of a protein, say Protein X. In this example, reactions have been added to the previous reaction set to represent the creation of RNA used to build ribosomes and tRNA for use during simulated translation, as well as to allow a chaperone to fold the "straight protein," ubiquitin to tag it, and a proteasome to degrade it. (It would be possible to add reaction steps to represent post-transcriptional modification of mRNA and transport out of the nucleus, but such steps have been omitted for brevity. Such steps would be necessary when modeling expression pathways controlled by processes acting to inhibit or enhance such activities.)

Reaction
number
Reaction Probability/rate
10 Make a tRNA molecule. .7
11  Make a 5S rRNA. .7
13  Make the large transcript for rRNA5.8-18-28S. .7
14  Cut the large transcript into 5.8, 18 and 28S rRNAs .7
15  Make the large ribosomal unit .7
16  Make translation initiation factor 2. .7
17  Begin the translation initiation complex .7
18  Assemble eIF-4. .7
19  More translation initiation complex. .7
20  Add the Large Ribosome subunit to the initiation complex. .7
21  Begin adding amino acids. .7
22  Release the protein and break up the translation complex. .07
23  Fold Protein X into proper configuration. .7
24  Begin process to degrade Protein X molecule .2
25  Degrade a Protein X molecule .3

The following compounds were included to begin the simulation:

Compound Number
of
Molecules
  Compound Number
of
Molecules
ATP 50      ADP 10
PO4 1      D 1
TBP 1      TATA:Gene.X-1 1
Gene.X-2 1      Gene.X-3 1
ABF 1      PolyII 1
EH 1      PO4 100
mRNAPrecursor 20      TermII 1
Gene.XActivator 0      Gene.tRNA 1
PolyIII 1      tRNAPrecursor 50
rRNAPrecursor 50      Gene.rRNA5S 1
Gene.rRNA5.8-18-28S 1      PolyI 1
ribonucleaseMRP 1      eIF-2-Protein1 1
eIF-2-Protein2 1      eIF-2-Protein3 1
eIF-4A 1      eIF-4E 1
eIF-4G 1      ProteinPrecursor 50
tRNAmet 1      GTP 100
PADP 1      CRF 1
Chaperone 1      Ubiquitin 1
Proteasome 1            

Here is a graph showing the amount of some of the compounds produced by the simulation with the molecule complement shown above:

Protein.X
ProteinPrecursor
mRNA.X
mRNAPrecursor

Step number

Note that the number of molecules of protein X rises fairly consistently through the course of the simulation, although the amount of mRNA.X ebbs and flows as it is created, incorporated into complexes, and degraded over time.

Here is a chart from a run where two proteins are produced: proteins X and Y. The base-line transcription probabilities for the mRNAs of both proteins is .01 and the probabilities for production in the presence of an activator are both .7. In this run only the activator for Y is present:

Protein.X
Protein.Y
ProteinPrecursor
mRNA.X
mRNA.Y
mRNAPrecursor

In this run production of protein Y leveled off at around 85 molecules, and protein X at around 10 molecules, clearly due to the greater number of mRNA molecules present for protein Y. Another way to think about this is that at steady state the reaction set allocated about 85 percent of the available "protein precursor" to protein Y. Cells must manage their available resources in a similar fashion, "deciding" how much of their raw materials to distribute towards each life process.

Here the number of molecules of mRNA climbs to a plateau where it oscillates, as detailed in this phase space chart for mRNA.Y (taken from a later run with the same parameters):

Molecule Phase plot (Molecule count at step N, molecule count at step N+1)
mRNA.Y

Note that the data points in this graph have been slightly altered, or "perturbed," so as to distribute them closely around their actual point values. This allows readers to get a sense for the number of times each point is visited. (Non-perturbed graphs are also available.)

The correlogram for mRNA Y is not very interesting, so to show an example correlogram, here is the correlogram for the second (virtual) component of Gene Y from a similar run:

Molecule Correlogram for steps 0 through 501
(X axis: Autocorrelation lags from 0 to the smaller of N/4 or 2000)
Gene.Y-2

Here is a screen-capture of an example run creating Proteins X and Y simultaneously from another simulation run:

Note that both list windows are independently scollable, and the replay can proceed in either a continuous or stepwise fashion.

Note also that this "view" of a reaction history may be joined by other views in later versions of the script. For example, molecule counts can be presented as bar charts, pie charts, Kiviat graphs, etc. and lines can be added connecting individual reactions with reactants (in one color) and products (in another color), while retaining the independent window scrolling.

How to use the cell simulation environment

The Cell simulation system is accessible via http://people.cc.ku.edu/~grobe/cell-sim.html. To run a simulation you must define a set of reactions and a set of molecules upon which those reactions will operate.

You may choose these sets from a collection of test sets via pull-down lists, or upload your own sets from your desktop.

If you choose to upload your own reaction set you must create a text file (NOT a formatted file such as MS Word .doc files) containing several lines for each reaction you wish to define. Each line must begin with one of the following parameter types:

For example, if A and B react to form C in the presence of an enzyme E with probability .7, you might define the reaction as:

title: Combine A with B to make C.
reactants: A,B
catalysts: E
products: C
duration: 3               # i'm not sure about this duration.
probabilities: .7
comment:  This reaction takes three time steps to complete.

All information after any pound sign (#) on an input line will be considered as commentary and ignored by the program. If a line ends with a comma (,), the next line will be treated as an extension of the first line. This allows users to list each reactant, catalyst, product, etc. on a separate line, if desired. The example reactant line above COULD be written as:

reactants: A,
           B

If you choose to upload your own molecule set, you must create a text file containing a list of molecule species along with the number of molecules of each species to start with and, optionally, the time step after which those molecules will be injected into the model. For the sample reaction set above, you might use:

A=10
B=10
E=1        # start with just one molecule of E.
E=2,100    # add two more molecule after 100 time steps.

When you access the Cell simulation web page you will be given an opportunity to type in the names of your local files or "Browse" through your desktop file system to choose them by double-clicking on their icons. When you specify your other simulation parameters, and press the "Do it" button, your files will be uploaded to the host system, where they will be put in temporary files to be used by the Cell simulation engine. (The temporary files will be deleted after they are used.)

Simulation algorithm

To perform the simulation the simulation engine manipulates lists of molecule counts. Some of those lists include:

The initial molecule complement is copied into a list of current molecules available for reacting (or into the list of molecules already reacting, if a molecule is scheduled for deferred injection into the system).

There are three execution phases at each time step:

Since reactants are taken out of the list of molecules available for reacting while they are actually engaged in a reaction, there may not be enough reactant molecules available at during a time step to perform all the reactions predicted during the previous phase.

With the current version of this Cell simulation system, backward and forward versions of each reaction must be entered explicitly. That is, two reactions must be defined, one proceeding forward and the other backwards. For example, the backwards version of the reaction defined earlier will look like:

title: Make A and B from C under catalysis by E.
reactants: C
catalysts: E
products: A, B
duration: 1
probabilities: .2
comment:  This is the backwards version of the other reaction.

Since the specified probabilities are "probabilities of transition," they do not reach a steady state unless there are two or more acting in opposition. For the previous example, if the numbers of A and B molecules are identical, roughly 80% of them will be converted to C at steady state. (It would be possible to increase reaction rates as a function of the numbers of reactants available, but this has not yet been implemented.)

With the current software it appears that pairs of reverse reactions both specified with large probabilities, say .9 forward and .9 backwards, yield weird results because they keep large proportions of a molecule population reacting constantly. Pairs close to (.1, .1) seem marginal, and pairs with larger probabilities, but a ratio of 10 to 1, seem safe, and large reaction durations exacerbate this problematic behavior.

Caveat: This current simulation method will certainly not provide "realistic" results in the sense of accurately representing real biochemistry. It MAY, however, express users' intuitive notions of reaction processes, and capture enough of the dynamics of networks of reactions to help users understand the emergent, or overall, behavior of such networks.

Several researchers have observed that network structure appears to be more important than exact biochemical fidelity in this kind of modelling, but the degree of fidelity required for useful results remains to be seen. At any rate, enhancements to the current approach are already being considered. In particular, plans have been made to apply an approach inspired by S-systems (Voit, 2003) to computing reaction probabilities dynamically. This will inject non-linearity into the system, which should allow it to model events triggered by bursts of activity within regulatory networks.

Project software

This system involves programs written in both Perl and Java, running on both Unix and Windows operating systems. It employs the standard Web client-server CGI model, enhanced by Java 2 Applets, as well as the standard Java servlet model.

The main program is a 3000-line Perl program (cse.pl) that runs in both Unix and Windows CGI environments. This program collects form data from the user via a Web browser, conducts the simulation, archives results on disk for later display, and/or displays results immediately.

A separate 1100-line Perl program (display.pl) displays archived results as requested, and a 300-line Perl program (make-graph.pl) helps manage graph selection and display for both cse.pl and display.pl.

Detailed graphs are produced by two 400-line Perl programs (ping_graph.pl and phase-grapher.pl) running on either Unix or Windows and compressed (thumbnail) graphs are produced by a 500-line Java servlet (Grapher.java) running on a Windows-based Tomcat servlet processor. The graphing routine for auto correlations (auto-corr-grapher.pl) is a descendent of ping_graph.pl, and the routine for difference plots is a descendent of difference-grapher.pl.

There are several other utility routines that will not be discussed here (lister.pl, setup-setup-difference-grapher.pl, setup-difference-grapher.pl, make-difference-x-values.pl, etc.).

Simulation results can also be displayed by an Applet (viewReactionHistories2.java) composed of classes totaling about 1600 lines of Java.

These programs use both XML and Perl dump files to exchange data. XML has shown itself to be quite useful and appropriate for these applications, but data can be more easily moved between two Perl programs by using the Perl dump format.

All programs except grapher.pl were written by the author. grapher.pl was written by Boone Bradley on commission.

Other such projects

There have been quite a few attempts to develop models of cell biochemistry suitable for simulation. This work has been going on in fits and starts over the past 20, 30 or even 40 years, but seems to have become more popular over the past 5 years or so. In fact, there is even an effort underway to integrate multiple simulation engines of this kind within a single interface, so that users may derive benefits from each approach with minimal reworking of their system descriptions.

The approach described in this page is somewhat unique among simulations operating at the molecule-level in it's focus upon reactions transforming molecules, rather than upon individual molecules undergoing reactions.

For a (non-exhaustive) list of some other simulation projects, including two pointers to reviews of the field, see http://condor.cc.ku.edu/~grobe/cell-sim/projects.html.

Some more complex examples

This page has presented a short description of the prototype for a system that simulates some aspects of the biochemistry of the cell. It describes the data required to run a simulation, how to prepare data for the simulator, some of the display capabilities of the current version of the simulator, and gives a brief description of the simulation algorithm. It is hoped that such a system could help users understand the operation of gene expression pathways, particularly those intractable to human intuition.

For example, consider the 3 protein feed forward (FF) chain in which:

The dynamics of eight varieties of this chain are discussed in some detail in Mangan, S. and U. Alon,"Structure and Function of the feed-forward loop network motif", PNAS,100:21 [2003]. The authors show that this chain is relatively unresponsive to short periods of stimulation by an activator that enhances the transcription of protein X; that is, protein Z is only produced if X activation is persistent.

The Cell Simulation system can be used to illustrate this behavior (though its reaction kinetics are mostly bogus). To do that, a reaction set generating the three proteins was constucted using individual sequences similar to the examples above, modified to include 30 time-step delays while moving newly translated mRNA to the cytosol and newly translated activator protein back to the nucleus.

The following graph shows that only X and Y are produced when the activator is present for only 25 time steps.

Protein X
Protein Y
X activator

Step number

and the next table shows what happens when the activator is present for 200 time steps.

Protein X
Protein Y
Protein Z
X activator

Step number

Protein Z was produced during only 1 of 9 simulations with the brief stimulus, but was produced during all 10 simulations with the 200 time step stimulus.

These results seem quite intuitive, so that simulation merely reaffirms intuition. Consider, however, the same chain modified so that protein Z is activated by protein Y alone, but in binding-site competition with protein Z, a so-called "incoherent" feed forward loop. In this case, intuition seems less helpful, but Mangan, et al. have shown that the production of protein Z will pulse in ways that are probably not illustratable by the CSE, because they require more realistic (and non-linear) reaction kinetics which the CSE simulation engine does not presently represent.

As another example, consider the loop formed by activating the transcription of Protein X with Protein Z. Here is a graph of a single run.

To look at the dynamics of this simulation, the frequency graph of the count of Protein Z molecules was plotted on a log-log scale. In addition, the same data was subjected to Detrended Fluctuation Analysis.

DFA was first proposed in Peng C-K, Buldyrev SV, Havlin S, Simons M, Stanley HE, Goldberger AL, Mosaic organization of DNA nucleotides. Phys Rev E 1994;49:1685-1689, and was obtained from Physionet. According to Peng, et al. in " Quantification of scaling exponents and crossover phenomena in nonstationary heartbeat time series (CHAOS, Vol. 5, No. 1, 1995)", a straight line "indicates the presence of scaling," and the slope of the line indicates the nature of that scaling, and slopes of, or close to, 1.0 indicate a 1/f noise relationship, a series possessing dependencies at many, or all, time scales examined,

Summary

This page has shown some of the display capability and intended uses of a web-enabled simulation environment. The focus has been on developing a usable, intuitive interface, rather than on faithful recreation of cell biochemistry.

Scaling this approach to larger and more complex reaction sets will present a significant challenge, but most likely can make a significant contribution to understanding the systemic functioning of the cell.

However, this work is just beginning and much effort will be required to assess the accuracy and usefulness of this system and systems like it. One of the first steps will be to provide more realistic kinetics. This will probably require most reactions to be specified with forward and backwards reaction constants as well as an analog to the nonlinear behavior induced by reactant and product concentrations. The current plan is to treat reactions as controlled by reaction constants that essentially represent the probability of a reaction given a collision among reactant molecules, and consider the probability of collision as a function of the product of the number of molecules of each reactant involved in a reaction. (One alternative would be to use a single parameter for each reaction that is taken as the ratio of products to reactants at equilibrium, possibly along with a "velocity factor" specifying how many time steps to use to reach that equilibrium ratio.)

Note that this system can be used to simulate processes from other domains. For example, it has been used to simulate computer behavior

Michael Grobe
December 2002
October 13, 2003
November 3, 2003
December 2003
January 2006