Running grap-models v0.13.3

GRAP

🧑🏻‍💻 Design principles

In the development of GRAP, we tried to solve several issues in the field of GSMM reconstructions:

  1. Two ID systems (namespaces) are mainly used in GSMMs: SEED and BiGG. SEED can be uncomfortable to work with because it uses numerical codes as metabolite and reaction IDs, which are not human-readable. BiGG IDs, in contrast, resemble the metabolite / reaction they represent. For instance, glucose-1-phosphate is g1p in BiGG but 00089 in SEED. This makes a big difference when, for example, reading metabolic maps such as those drawn with Escher.

  2. BiGG is not a consistent biochemistry database; rather, it’s a collection of models assumed to use the same IDs. This has consequences: the same metabolite (same ID) can have multiple definitions of formula / charge; metabolites can be duplicated (several IDs for the same chemical entity); the same reaction (same ID) can be defined with different reversibility; reactions can be duplicated (eg when involving duplicated metabolites); and more.

  3. Given (2), BiGG-based manually curated reference GSMMs are not perfectly integrable with each other: they cannot really be merged to create a consistent universe. CarveMe followed this path, and indeed its universe has many mass / charge unbalanced reactions. The presence of unbalanced reactions can lead to violations of stoichiometric consistency.

  4. Current BiGG v1.6 has just 22 bacterial strains not part of the Escherichia/Shigella species complex, therefore genes stored in BiGG are scarce and biased toward model species. This means that reconstruction tools using the BiGG gene database will dramatically underrepresent gene diversity and thus metabolic diversity.

  5. Aggressive gap-filling algorithms are often used in tools that provide “simulation-ready” models, so the number of orphan reactions (ie, without associated genes) remaining in output GSMMs is often excessively high. The problem is usually accentuated when reconstructing far-from-model species and, given (4), it can be even worse for BiGG-based tools (eg CarveMe).

  6. Reconstruction tools usually don’t keep track of automatically gap-filled reactions, nor of why / at which step of the reconstruction process they were introduced. A known exception is gapseq, which has its own tracing system, although interpreting it requires some coding in R.

  7. The gold standard for manual curation is probably Thiele2010. To obtain a quality GSMM, automatically produced draft models have to be manually curated. This is true for both template-based and current universe-based methods. Considering (3), manual curation turns out to be highly organism- / project-specific. When starting a new project, or when changing the reference GSMM to model new strains, the manual curation process has to be started from scratch. It’s a rather repetitive job, and curation efforts are hardly ever recycled.

  8. The manual curation is eased when an Escher map is available for the strains to model. Considering (3), it’s not possible to have a one-map-fits-all. Therefore, like manual curation, the drawing of Escher maps is highly organism- / project-specific: another repetitive and disposable job to do.

  9. Given (4) and (5), and in general with closed / immutable universes, the accuracy of reconstruction can be compromised by the introduction of wrong reaction / transport mechanisms. This inaccuracy is often unseen when comparing predictions with experimental data: to give an example, it’s possible to have a true-positive match with Biolog® data even if the substrate is modeled to enter the cell with a proton symport instead of the expected ATP-binding cassette.

  10. Original enzymatic complex definitions can be violated: reactions can be introduced in the draft GSMM even if essential subunits are not found in the genome. This leads to inaccurate GPRs or, even worse, to the introduction of spurious reactions.

  11. Biolog® data are useful to curate the network topology of a GSMM, not just to validate it. An embedded, automated model curation driven by Biolog® data is therefore desirable.

  12. In genome-scale modeling, it's common to reuse biomass equations coming from a few well-studied species, because biomass data are scarce or because their integration in GSMMs is not so straightforward (despite efforts like BOFdat and Beck2018). Custom experimental biomass data, when available, should be easy to integrate.

  13. A reconstruction tool should be easy to install and use. Mandatory dependencies should not include paid software. Moreover, the tool should be adequately fast, even on small personal Windows / macOS laptops. In addition, to be more inclusive towards pure wet-lab colleagues, the tool should not require programming skills to be used.

  14. Online tools are convenient but usually they are not as flexible as command-line tools (eg they could prevent batch executions). Moreover, online tools may not respect privacy of data. The option to use the tool locally from the command-line should therefore be always available.
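To make issues (2) and (3) more concrete, here is a minimal sketch of a mass / charge balance check in plain Python. It is only an illustration, not GRAP's actual code: the function name `imbalance`, the data layout, and the formula / charge values are assumptions chosen for the example. It shows how a second, conflicting definition of the same metabolite (here a neutral form of g6p) turns an otherwise balanced reaction into an unbalanced one.

```python
from collections import Counter


def imbalance(stoichiometry, metabolites):
    """Return the net element / charge imbalance of a reaction.

    stoichiometry: {metabolite_id: coefficient}, negative = consumed
    metabolites:   {metabolite_id: (element_counts, charge)}
    An empty dict means the reaction is mass- and charge-balanced.
    """
    net = Counter()
    for met_id, coeff in stoichiometry.items():
        elements, charge = metabolites[met_id]
        for element, count in elements.items():
            net[element] += coeff * count
        net["charge"] += coeff * charge
    # Keep only the entries that do not cancel out.
    return {key: value for key, value in net.items() if value != 0}


# Illustrative reaction: g1p <-> g6p (an isomerisation).
reaction = {"g1p": -1, "g6p": 1}

# Consistent definitions (illustrative values): same formula, same charge.
consistent = {
    "g1p": ({"C": 6, "H": 11, "O": 9, "P": 1}, -2),
    "g6p": ({"C": 6, "H": 11, "O": 9, "P": 1}, -2),
}

# Conflicting definition of g6p (neutral form), as could happen when
# models with different conventions are merged under the same ID.
conflicting = dict(consistent)
conflicting["g6p"] = ({"C": 6, "H": 13, "O": 9, "P": 1}, 0)

balanced_result = imbalance(reaction, consistent)
unbalanced_result = imbalance(reaction, conflicting)
print(balanced_result)
print(unbalanced_result)
```

With the consistent definitions the returned dict is empty; with the conflicting one, the check reports two extra hydrogens and a charge difference of two on the product side.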


Considering the issues above, GRAP was developed with the following design principles:

  • Reconstructions are based on a single, compartment-agnostic universe. Unlike in other well-known tools such as CarveMe and gapseq, the universe was manually built from scratch, and it is still being expanded and manually curated with continuous sanity checks. By design, our universe and all GSMMs derived from it contain no unbalanced reactions, stoichiometric inconsistencies, or duplicated metabolites.

  • The universe uses human-readable IDs, BiGG-compliant where possible. If a metabolite or reaction is still missing from BiGG, a novel BiGG-like ID will be assigned.

  • The manual curation paradigm is shifted: curation is no longer applied to the output strain-specific GSMMs, but directly to the universe, and then propagated downwards. Moreover, the development of the universe is collaborative: anyone can contribute by adding new content or curating existing content. We designed a collaborative workflow based on Google collaborative services (Google Drive, Google Sheets). Universe expansion can be directed towards an organism of interest, accelerating the universe coverage for that organism.

  • The universe is based on KEGG, but users can easily extend it beyond KEGG contents. KEGG was chosen as our reference for two equally important reasons: (1) Compared to MetaCyc, KEGG is still free to access and use. (2) Reactions are conveniently linked to general orthologs, described by 'K' codes: something similar is currently missing in MetaCyc. This bidirectional link orthologs <-> reactions was exploited in GRAP, as described in the next point.

  • Reactions of the universe are linked to general orthologs, not to specific gene sequences as usual. Therefore, no manual curation effort is ever wasted: universal reactions linked to general orthologs are written once and for all, and they will be compatible with every strain of every phylum (including archaea and eukaryotes…). In other words, each curation effort is recycled endlessly.

  • The universe is continuously mirrored by a collection of hand-drawn Escher maps that adhere to a conventional layout of reactions and metabolites, resembling that of KEGG pathway maps. As with the development of the universe, the drawing of our Escher maps is a collaborative effort open to anyone. It's also possible to add fully custom maps, going beyond those shown in KEGG.

  • Gap-fillings are always conservative, and they are traced in dedicated Excel files. If needed, they can be skipped completely. Moreover, informative logs are produced during each step of the reconstruction.

  • The biomass equation is adapted automatically: (1) the presence of non-universal biomass precursors is deduced on-the-fly based on the gene content; (2) coefficients of biomass precursors are automatically adjusted based on provided experimental biomass data. Biomass consistency (ie, the sum of components matching exactly 1.0 gDW) is always guaranteed.

  • The network topology is automatically curated if experimental Biolog® (binarized) datasets are provided (plates PM1-PM4A).

  • Experimental data used to constrain GSMMs (media recipes, uptake/secretion rates, Biolog® data, biomass data) are stored in a shared space, queried by GRAP at each run. Users can add new experimental data, which will be available for everyone. Alternatively, users can work with their local, private dataset.

  • GRAP comes as a fast and platform-independent command-line interface (CLI). In addition, an online graphical user interface (GUI), this website, is also provided with a basic subset of features.
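The biomass consistency principle listed above (components summing to exactly 1.0 gDW) can be illustrated with a small Python sketch. This is a simplified assumption of how such a rescaling could look, not a description of GRAP's internal procedure: the function name `normalize_biomass`, the component names, and the measured fractions are all made up for the example.

```python
import math


def normalize_biomass(fractions):
    """Rescale mass fractions (g per gDW) so that they sum to 1.0."""
    total = sum(fractions.values())
    if total <= 0:
        raise ValueError("biomass fractions must sum to a positive value")
    return {name: value / total for name, value in fractions.items()}


# Hypothetical experimental fractions that do not add up to 1.0 gDW.
measured = {
    "protein": 0.50,
    "RNA": 0.20,
    "DNA": 0.03,
    "lipids": 0.09,
    "other": 0.10,
}

normalized = normalize_biomass(measured)
for name, value in normalized.items():
    print(f"{name}: {value:.4f}")
print("total:", round(sum(normalized.values()), 6))
```

Proportional rescaling keeps the relative ratios between components unchanged; a real tool may instead adjust only some components, so this design choice is itself an assumption.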