.. _workflow_overview:

=================
Workflow overview
=================

This guide can help you start working with the Cadbiom command line.


.. figure:: _static/demo_workflow/workflow.svg
   :scale: 85 %
   :alt:
   :align: center

   Workflow for the Cadbiom framework.
   Four steps can be identified:
   **Model building**, covered by the standalone tool named *biopax2cadbiom*;
   **Assisted query reconstruction**,
   **Causality search**,
   **Analysis and visualisation** covered by the Cadbiom package named *cadbiom-cmd*.


The analysis of a BioPAX model is performed according to **four main steps**.

The **first step** of the Cadbiom framework relies on the *biopax2cadbiom* package.
It allows querying BioPAX ontologies stored in public or local endpoints in order to interpret them into Cadbiom models.

The results of the conversions of some databases are available below: `Prebuilt models`_.


The **second step** brings together several commands under the name of *query design*.
This step is dedicated to query construction. A query is a boolean logical formula that describes the
combination of molecules for which regulators will be searched.

The characteristics of a Cadbiom model and its content (list of identifiers of biological compounds,
list of boundary compounds, and list of genes) can be exported. Among the essential data, the user
will find mappings between various databases thanks to the preservation of all cross-references with
public databases (such as HGNC, UniProt, ChEBI) provided in the BioPAX data.
Indeed, Cadbiom uses internal identifiers to guarantee the uniqueness of the biomolecules in its models.
A mapping of Cadbiom identifiers with the standard database identifiers is necessary to allow a user to build queries.

Examples of commands are described below: `Query the Cadbiom model`_.


The **third step** of the workflow is focused on *causality search*.
In this step, the user explores the dynamics of a Cadbiom model. They identify sets of biomolecules that
represent the initial conditions of the system, leading to the activation of the desired entities via controlled
biochemical transformations described in the model.
As detailed in the results section, the notion of activation is intrinsically linked to the non-deterministic
semantics associated with guarded transitions. This analysis therefore generates several families of controllers
associated with trajectories composed by the entities activated during a search.

Examples of commands are described below: `Search for molecules`_.


The **fourth step**, *visualization and analysis*, is designed to analyze the families
of controllers generated in the previous step. A matrix of occurrences details the presence of controllers in
the solutions obtained. Heatmaps allow estimating the diversity of the trajectories leading to the studied
phenotype (i.e. the boolean query). Trajectory graphs allow visualizing all the transformations undergone
by the intermediate molecules from the control molecules to the target molecules.

Examples of commands are described below: `Processing of the generated files`_.
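
To make the idea of an occurrence matrix more concrete, here is a minimal sketch in plain Python.
It is not the code used by *cadbiom-cmd*: the function name, the solutions and the entity names are
hypothetical, and the only assumption is that each solution can be seen as a set of controllers.

.. code-block:: python

    def occurrence_matrix(solutions):
        """Return the sorted controllers and one 0/1 row per solution."""
        # Columns: every controller seen in at least one solution
        controllers = sorted(set().union(*solutions)) if solutions else []
        # Rows: 1 if the controller belongs to the solution, 0 otherwise
        matrix = [[int(c in solution) for c in controllers] for solution in solutions]
        return controllers, matrix

    # Hypothetical solutions: each one is a set of controller names
    solutions = [{"A", "B"}, {"A", "C"}, {"B", "C", "D"}]
    controllers, matrix = occurrence_matrix(solutions)

    # Number of solutions in which each controller appears
    frequencies = {c: sum(row[i] for row in matrix) for i, c in enumerate(controllers)}

Summing a column gives the number of solutions in which a controller occurs, which is the kind of
information a heatmap can then display.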


Prebuilt models
===============

The majority of models are pre-generated from the PathwayCommons `archive <https://www.pathwaycommons.org/archives/PC2/>`_
and are made available on this page with their respective characteristics.

ACSN maps are preprocessed before their interrogation as RDF data.
The maps are downloadable on `acsn.curie.fr <https://acsn.curie.fr/ACSN2/downloads.html>`_.
The steps are explained on the `GitLab <https://gitlab.inria.fr/DYLISS/biopax2cadbiom/-/issues/6>`_.


Advanced users will be able to create their own models from a SPARQL endpoint as explained in the chapter
`Creation of a Cadbiom model from a BioPAX endpoint`_.

Obtaining these data from the models is also explained in the chapter `Query the Cadbiom model`_.


The following table gives the number of instances of each BioPAX type found in various databases:

.. csv-table::
    :header: Metrics - Databases    , PID           , Kegg      , Reactome      , ACSN          , CTD

    Entity                          ,               ,           ,               , 27426\*       ,
    Pathway                         , 745           , 122       , 2139          , 13            ,
    Gene                            ,               ,           ,               ,               ,
    PhysicalEntity                  ,               ,           , 1447          , 0/11922\*  ,
    Protein                         , 6194          , 1872      , 24676         , 6851          , 7244
    Complex                         , 4137          ,           , 11715         , 2323/2358\*   , 20167
    SmallMolecule                   , 173           , 1664      , 3574          , 554/0\*       , 9682
    Dna                             ,               ,           , 645           , 1030          , 193
    Rna                             , 22            ,           , 292           , 1164/0\*      , 35038
    Interaction                     ,               ,           ,               ,               ,
    BiochemicalReaction             , 1824          , 1782      , 10197         , 6863          , 3005
    ComplexAssembly                 , 2722          ,           ,               , 1743          , 20167
    TemplateReaction                , 1492          ,           , 39            ,               , 23401
    Transport                       , 312           ,           ,               , 699           , 1738
    MolecularInteraction            ,               , 4         ,               ,               ,
    Degradation                     ,               ,           , 17            ,               , 20531
    TransportWithBiochemicalReaction, 154           ,           ,               ,               ,
    Catalysis                       , 3800          , 1782      , 4929          , 6186          , 58972
    Control                         , 322           ,           , 1516          ,               , 121183
    TemplateReactionRegulation      , 2023          ,           , 43            ,               , 453724
    Modulation                      ,               ,           , 24            ,               , 13586

\*: Number before cleaning of incorrect/duplicate types in the triplestore
(i.e. the entire database, for the Entity type).

Here is a table presenting some metrics interpreted according to the raw BioPAX data obtained:

.. csv-table::
    :header: Metrics - Databases            , PID           , Kegg      , Reactome      , ACSN   , CTD

    Retrieved PhysicalEntities              , 10526         , 3536      , 42349         , 1716   , 78832
    Total of duplicated entities            , 699           , 135       , 2341          , 74     ,
    Nb of groups of duplicated entities     , 339           , 61        , 696           , 23     ,
    Classes                                 , 403           , 0         , 4229          , 0      , 13614
    Used classes                            , 304           , 0         , 1611          , 0      , 2494
    Nested classes                          , 23            , 0         , 574           , 0      ,
    Classes with ModificationFeatures       , 157           , 0         , 0             , 0      ,
    Classes/complexes                       , 66            , 0         , 669           , 0      ,
    Final number of entities (after processing) ,           ,           ,               ,        ,
                                                                                                 ,
    Retrieved Reactions                     , 6504          , 1786      , 10253         , 9305   , 68842
    Proteins involved as reactants          , 3304          , 0         , 4777          , 4840   , 4742
    SmallMolecules involved as reactants    , 128           , 1571      , 2604          , NA     , 4234
    Complexes involved as reactants         , 3733          , 0         , 7642          , 2217   , 9047
                                                                                                 ,
    Retrieved Controls                      , 6145          , 1782      , 6512          , 6186   , 647465
    Catalysis control                       , 3800          , 1782      , 4929          , 494    , 58972
    Reactions with similar entities as reactants and products , 50 , 934 , 94           , 333    , 789
    Controls of other controls              , 0             , 0         , 24            , 0      , 76521


Here are the characteristics of the generated models:

.. csv-table::
    :header: Metrics - Databases, PID  , Kegg     , Reactome  , ACSN    , CTD

    Cadbiom entities        , 9788     , 2604     , 22841        , 10313   , 69428
    Genes                   , 788      , 2        , 932        , 1035    , 25375
                            , :download:`csv <./_static/demo_files/models_mars_2020/genes_pid.csv>`/:download:`json <./_static/demo_files/models_mars_2020/model_summary_genes_pid.json>`, :download:`csv <./_static/demo_files/models_mars_2020/genes_kegg.csv>`/:download:`json <./_static/demo_files/models_mars_2020/model_summary_genes_kegg.json>`, :download:`csv <./_static/demo_files/models_mars_2020/genes_reactome.csv>`/:download:`json <./_static/demo_files/models_mars_2020/model_summary_genes_reactome.json>`, :download:`csv <./_static/demo_files/models_mars_2020/genes_acsn.csv>`/:download:`json <./_static/demo_files/models_mars_2020/model_summary_genes_acsn.json>`, :download:`csv <./_static/demo_files/models_mars_2020/genes_ctd.csv>`/:download:`json <./_static/demo_files/models_mars_2020/model_summary_genes_ctd.json>`
    Boundaries              , 3925     , 1420     , 6958        , 3693    , 34816
                            , :download:`csv <./_static/demo_files/models_mars_2020/boundaries_pid.csv>`/:download:`json <./_static/demo_files/models_mars_2020/model_summary_boundaries_pid.json>`, :download:`csv <./_static/demo_files/models_mars_2020/boundaries_kegg.csv>`/:download:`json <./_static/demo_files/models_mars_2020/model_summary_boundaries_kegg.json>`,  :download:`csv <./_static/demo_files/models_mars_2020/boundaries_reactome.csv>`/:download:`json <./_static/demo_files/models_mars_2020/model_summary_boundaries_reactome.json>`, :download:`csv <./_static/demo_files/models_mars_2020/boundaries_acsn.csv>`/:download:`json <./_static/demo_files/models_mars_2020/model_summary_boundaries_acsn.json>`, :download:`csv <./_static/demo_files/models_mars_2020/boundaries_ctd.csv>`/:download:`json <./_static/demo_files/models_mars_2020/model_summary_boundaries_ctd.json>`

    Transitions             , 11036    , 5220     , 53016        , 11394   , 43859
    Events                  , 7501     , 1570     , 57494        , 8819    , 45878
    Models                  , :download:`bcx file <./_static/demo_files/models_mars_2020/model_pid_without_scc.bcx>`, :download:`bcx file <./_static/demo_files/models_mars_2020/model_kegg_without_scc.bcx>`, :download:`bcx file <./_static/demo_files/models_mars_2020/model_reactome_without_scc.bcx>`, :download:`bcx file <./_static/demo_files/models_mars_2020/model_acsn_without_scc.bcx>` , :download:`bcx file <./_static/demo_files/models_mars_2020/model_ctd_without_scc.tar.xz>`


Note: If your browser tries to open the files in a new tab, right-click on the download links and use *"Save as"*.

..
    Physical Entities column:
    Number of entities or biomolecules extracted from the database in BioPAX format.
    See the standard.
    According to the BioPAX class hierarchy described in the standard, these entities can be proteins, complexes, small molecules, RNA or DNA sequences (the subclasses concerned are the following: Protein, SmallMolecule, Rna, Complex, Dna).

    CL, CL used and Nested CL columns

    These molecules are frequently gathered in generic BioPAX objects of various types to make
    biological phenomena easier to read and represent visually (and therefore to understand).

    For example, radically different entities are often grouped together in order to represent phenomena such as
    changes of cellular compartment.
    The integrator's intent is therefore not systematically to associate similar molecules with each other.

    Most of the classes formed gather entities that have roles in biochemical reactions (reactants);
    their number is displayed under the name "CL used". The difference between CL and CL used is explained by
    classes whose entities have no role in the described network.

    Some databases frequently use the notion of class; nested classes can then be observed.
    This phenomenon greatly hinders the "degenericization" process (TODO: Inria-centric term according to Google...)
    which consists in recovering the original data in order to produce reliable analyses.
    This issue is also present in work involving FBA techniques.
    Indeed, by shortening the natural path of the fluxes, the presence of classes leads to non-functional networks
    that are unable to produce biomass.

    The Nested CL column counts the number of classes having at least one member that is itself a class.


    PE inheritance of CL attributes column

    Entities sometimes carry post-translational modifications characterized by the addition of atoms or
    groups through covalent bonds (e.g. adenylation, phosphorylation, etc.).
    These modifications are specified through an entity attribute named ModificationsFeatures.
    It formalizes the set of activations and deactivations that biological molecules undergo.

    In biopax2cadbiom we strive to make the member entities of a class inherit the post-translational modifications of the parent classes to which they belong.
    This creates new entities from scratch when these attributes are not already held by the member entities.
    The "PE inheritance of CL attributes" column gives the number of entities in the model after this processing.

    Note: The cellular or extracellular locations of classes are also affected by this operation.

    PE duplicates-groups column

    Post-translational modifications help to identify redundant entities in the database.
    Indeed, although they implement data models in the form of ontologies (formal descriptions of data),
    these NoSQL databases (triplestores in our case) are particularly prone to redundant
    information appearing during update processes (automatic or manual).
    Normalization is clearly a weak point of these storage methods; the strong dependency
    between attributes and objects on the one hand, and data redundancy on the other hand,
    cause update anomalies and make it easier to introduce inconsistencies.

    TODO: entity = document in the NoSQL sense, instead of table in the SQL sense.
    A document can contain others and is self-sufficient, whereas the only way to rebuild the information in SQL
    is to perform joins between tables.

    Indeed, we observe for example that entities are created specifically for one reaction
    while entities that are identical in every respect are already present in the database
    (themselves used for unique/independent reactions).
    Moreover, any update of these entities must absolutely be simultaneous and global to guarantee data consistency.

    Note that it is unfortunate to observe these degradations of information quality when normalization
    and relational modeling processes, developed since the 1970s and widely proven, could be used.

    TODO: for these data (BioPAX/ontologies in the broad sense) which are nevertheless highly hierarchical.


    In biopax2cadbiom we consider the following uniqueness criteria:

        entityType
        entityRef
        name
        components_uris
        location_uri
        modificationFeatures

    The "PE duplicates-groups" column counts the number of entities considered as duplicates and the number of groups thus formed.
    The first entities of each group are arbitrarily kept. To keep this operation traceable,
    temporary files are generated; they list the entities of the model, their characteristics above, and their respective groups.
    The Cadbiom model produced also keeps the URIs of similar entities in each of its entities.

    See examples.html#virtual-case-11


    R column

    Interactions

    The Interactions class groups all biochemical phenomena involving biomolecules.
    Among these interactions are the classes Conversion, MolecularInteraction, Control, GeneticInteraction, TemplateReaction.
    The Conversion class is itself inherited by BiochemicalReaction, ComplexAssembly, Transport and Degradation.

    Objects deriving from the Conversion class are considered as biochemical reactions in the strict sense.
    They involve reactants and products and can be regulated by biomolecules.

    The TemplateReaction class is used (at least in the PID database)
    to model the expression of a gene (an entity absent from this database)
    from a DNA sequence (transcription) or from an RNA sequence (translation).
    We therefore create a fictitious entity derived from the product of this reaction, which we consider to be a gene.
    These fictitious entities are thus not produced by any reaction and are positioned "at the periphery" of the generated model.

    TODO: => count the occurrences in the other databases
    TODO: check the structure of this interaction across the various databases (never any reactant, as in PID?)


    The MolecularInteraction and GeneticInteraction classes are not currently supported;
    the first type of interaction is too imprecise to be taken into account:
    it consists of simple lists of participants (no notion of product or reactant).
    In addition, the direction of the chemical change is not specified.
    The second type has not been encountered in the data considered.

    TODO: => See the GeneticInteraction doc (before page 10), and the Gene type which is not yet used...


    About the direction of Interactions (interaction direction table):
    The direction of interactions is rarely mentioned; by default we assume it is from left to right
    (i.e. the reactants give the products); we do not currently support any other direction.

    In theory the direction of a reaction should be inferred from the thermodynamic data
    accompanying the reactions or via FBA. In practice the free enthalpy or Gibbs energy is rarely provided
    and we are forced to adopt a default behavior.


    "Proteins in R" and "SMLM in Reactions" columns
    These columns count the occurrences of proteins and small molecules as reactants in reactions.
    They give valuable information about the content of the database (oriented towards signaling or cellular metabolism).


    The Control class is itself inherited by Catalysis, TemplateReactionRegulation, and Modulation.
    The Catalysis class contains a controller element (a biomolecule) and a controlled element (a Conversion or TemplateReaction).
    The Modulation class exclusively regulates another object of type Control and is intended to model combinations and cascades
    of controls for a reaction.
    This last type of object is only encountered in the Reactome database and is not yet taken into account
    given its low occurrence.

    CL in CTRL column
    Objects derived from the Control type essentially define the conditions required for a reaction to take place.
    The information they provide is used exclusively to build the logical expressions found in the guards of the guarded transition formalism.
    Several Control objects can regulate the same reaction, and the controller attributes of these objects
    can be classes.
    In the latter case, the condition is completed with all the members of the class, linked by OR operators.
    If the object exerts an inhibition-type control, an AND NOT is added so as to include them.

    Generally speaking, we start from the premise that the presence of a single activator and the absence of any inhibitor
    are necessary conditions for a reaction to take place.

    See examples.html#virtual-case-5

    CTRL w/ many controllers column

    Note: The standard specifies that a Catalysis or Modulation class must have only one controller.
    Only Control and TemplateReactionRegulation objects "seem" to be allowed several.
    In any case, Catalysis objects in the Kegg database have several.

    ... and this is not currently supported.


    Entities on both sides of R column

    Some reactions have entities that are conserved on both sides of the equation;
    we choose to consider these entities as catalysts essential to the reaction.
    We remove these entities from the equation and add them as controllers of the biochemical transformation.


    Cadbiom models table:

    "Cadbiom entities" and "Cadbiom transitions" columns

    Number of entities and transitions generated by the conversion of the BioPAX data.
    The number of entities is increased by the transfer of class attributes to the member entities of these classes,
    but is decreased by the grouping of similar entities.


Query the Cadbiom model
=======================

What is in the model?
---------------------

..
   Starting from preselected databases => can generate a list of genes and their attributes in a csv.
   The csv file must be queried to build the logical formulas file.
   => Put the user back at the center




As mentioned above, a user can either build their own model from any triplestore hosting BioPAX data,
or use one of the prebuilt models provided earlier on this page.

They can then browse these models to extract data about all biomolecules,
genes or boundaries (elements at the periphery of the model).
Among the essential data, the user will find lists of mappings between various databases.

Indeed, Cadbiom uses internal identifiers to guarantee the uniqueness of biomolecules in its models.
A mapping between the Cadbiom identifiers and the standard database identifiers (such as Uniprot, HUGO, etc.)
is therefore necessary to allow the user to build their own queries of interest
(for more information on the search for molecules, see the chapter
`Search for molecules`_).
This often tedious mapping step is facilitated by the module described below:
`Get a mapping between Cadbiom identifiers and those from external databases`_.


.. _get_model_info:

Get model information
~~~~~~~~~~~~~~~~~~~~~

To get information about the biological entities in the model, the subcommand ``model_info`` can be used
(see `the documentation of the model_info command <./command_line_usage.html#model_info>`_):

.. code-block:: bash

   $ cadbiom_cmd model_info model_without_scc.bcx --all_entities --json --csv


**Arguments:**

* ``--all_entities`` or ``--boundaries`` or ``--genes``: Retrieve data for specific places/entities of the model.

* ``--json``: Create a JSON formatted file containing data about the previously filtered places/entities
  of the model, and a full summary of the model itself (boundaries, transitions, events, entities locations, entities types).

* ``--csv``: Create a CSV file containing data about the previously filtered places/entities of the model.

Example of JSON file for PID (:download:`model_summary_genes_pid.json <_static/demo_files/models_mars_2020/model_summary_genes_pid.json>`):

    .. code-block:: javascript

        {
            'modelFile': 'string',
            'modelName': 'string',
            'events': int,
            'entities': int,
            'boundaries': int,
            'transitions': int,
            'entitiesLocations': {
                'cellular_compartment_a': int,
                'cellular_compartment_b': int,
                ...
            },
            'entitiesTypes': {
                'biological_type_a': int,
                'biological_type_b': int,
                ...
            },
            'entitiesData': {
                [{
                    'cadbiomName': 'string',
                    'immediateSuccessors': ['string', ...],
                    'uri': 'string',
                    'entityType': 'string',
                    'entityRef': 'string',
                    'location': 'string',
                    'names': ['string', ...],
                    'xrefs': {
                        'external_database_a': ['string', ...],
                        'external_database_b': ['string', ...],
                        ...
                    }
                }],
                ...
            }
        }

Such a file can make visualization or identifier mapping easier,
because it centralizes most of the information about BioPAX entities in a standardized manner.

It can also be used to identify the entities of the model that one would like to remove
in a future translation (for example, energy metabolism molecules such as ATP, ADP, GTP, GDP).
Indeed, these molecules are ubiquitous and needlessly complicate the conditions under which
the reactions of the model take place, and thus its analysis by the solver.
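
As an illustration, the following sketch looks up the entities that carry a given cross-reference,
for example to locate ATP before excluding it. It assumes that ``entitiesData`` can be read as a list of
entity records with the keys shown above, and that the ``xrefs`` keys match the CSV column names
(``chebi``, ``uniprot knowledgebase``); check both against a real file. The function name is hypothetical,
and the inline sample stands in for ``json.load()`` on a real summary file.

.. code-block:: python

    import json

    def names_for_xref(summary, database, identifier):
        """Return the Cadbiom names of the entities cross-referenced to `identifier`."""
        return sorted(
            entity["cadbiomName"]
            for entity in summary["entitiesData"]
            if identifier in entity.get("xrefs", {}).get(database, [])
        )

    # Inline sample shaped like the structure above (assumption: a list of entities)
    summary = json.loads("""
    {"entitiesData": [
      {"cadbiomName": "VCAM1_integral_to_membrane",
       "xrefs": {"uniprot knowledgebase": ["P19320"]}},
      {"cadbiomName": "ATP",
       "xrefs": {"chebi": ["CHEBI:15422", "CHEBI:22249"]}}
    ]}
    """)

    atp_entities = names_for_xref(summary, "chebi", "CHEBI:15422")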


Simplified example of CSV file (:download:`genes_pid.csv <_static/demo_files/models_mars_2020/genes_pid.csv>`):

.. csv-table::
    :header: cadbiomName        , immediateSuccessors , names , uri               , entityType    , location              , uniprot knowledgebase , chebi

    VCAM1_integral_to_membrane  ,                     , VCAM1 , Protein_xxx       , Protein       , integral to membrane  , P19320                ,
    ATP                         , ADP                 , ATP   , SmallMolecule_xxx , SmallMolecule ,                       ,                       , CHEBI:15422|CHEBI:22249



**The most important field is probably** ``immediateSuccessors``, as it lists the immediate successors encountered
for each entity in the model. This column contains the Cadbiom identifiers of the entities that should be queried
by a user who would like to explore their regulation with the framework. These identifiers should make up
the queries (boolean logical formulas) used in the :ref:`search_molecules` section.
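
A minimal sketch of how this column could be read with the Python standard library is given below.
It assumes a comma-delimited file and a ``|`` separator between multiple values (as seen in the ``chebi``
column of the simplified example); both should be checked against the real file. The function name is
hypothetical and the inline sample replaces a real ``genes_pid.csv``.

.. code-block:: python

    import csv
    import io

    def successors_by_entity(csv_file, separator="|"):
        """Map each cadbiomName to the list of its immediate successors."""
        reader = csv.DictReader(csv_file)
        return {
            row["cadbiomName"]: [s for s in row["immediateSuccessors"].split(separator) if s]
            for row in reader
        }

    # Inline sample with a subset of the columns shown above
    sample = io.StringIO(
        "cadbiomName,immediateSuccessors,names\n"
        "VCAM1_integral_to_membrane,,VCAM1\n"
        "ATP,ADP,ATP\n"
    )
    successors = successors_by_entity(sample)

The values collected this way are candidate variables for the boolean formulas.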


Get information about the graph based on the model
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To build a graph based on the model and get information about it, the subcommand ``model_graph``
(see `the documentation of the model_graph command <./command_line_usage.html#model_graph>`_) can be used:

.. code-block:: bash

   $ cadbiom_cmd model_graph model_without_scc.bcx --graph --centralities --json

**Arguments:**

* ``--centralities``: Get centralities for each node of the graph (degree, in_degree, out_degree, closeness, betweenness).

* ``--graph``: Translate the model into a GraphML formatted file which can be opened in Cytoscape.

* ``--json``: Create a JSON formatted file containing a summary of the graph based on the model.


Example of JSON file (:download:`graph_summary_pid.json <_static/demo_files/models_mars_2020/graph_summary_pid.json>`):

    .. code-block:: javascript

        {
            'modelFile': 'string',
            'modelName': 'string',
            'events': int,
            'entities': int,
            'transitions': int,
            'graph_nodes': int,
            'graph_edges': int,
            'centralities': {
                'degree': {
                    'entity_1': float,
                    'entity_2': float
                },
                'in_degree': {
                    'entity_1': float,
                    'entity_2': float
                },
                'out_degree': {
                    'entity_1': float,
                    'entity_2': float
                },
                'betweenness': {
                    'entity_1': float,
                    'entity_2': float
                },
                'closeness': {
                    'entity_1': float,
                    'entity_2': float
                }
            }
        }
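
Once loaded, such a summary can be used to rank the entities by a given centrality. The sketch below
relies only on the ``centralities`` structure shown above; the function name and the scores are
hypothetical, and the inline dictionary stands in for ``json.load()`` on a real summary file.

.. code-block:: python

    def top_entities(graph_summary, measure="betweenness", n=10):
        """Return the n entities with the highest score for the given centrality."""
        scores = graph_summary["centralities"][measure]
        return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:n]

    # Hypothetical scores, shaped like the 'centralities' section above
    graph_summary = {
        "centralities": {
            "betweenness": {"entity_1": 0.12, "entity_2": 0.40, "entity_3": 0.05},
        }
    }
    ranking = top_entities(graph_summary, n=2)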

The corresponding GraphML file can be downloaded here: :download:`pid.graphml <_static/demo_files/models_mars_2020/pid.graphml>`.

Usage examples and a dedicated style for opening the GraphML file in `Cytoscape <https://cytoscape.org/>`_ are available
in the repository: `examples <https://gitlab.inria.fr/DYLISS/cadbiom/tree/master/examples>`_.

A module dedicated to viewing the models in `Gephi <https://gephi.org/>`_ is also available, as explained later.


Get a mapping between Cadbiom identifiers and those from external databases
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This command exports a CSV formatted file listing the known Cadbiom identifiers for each given
external identifier.

.. code-block:: bash

    $ cadbiom_cmd identifiers_mapping --external_identifiers P02452 COL1A1 P12830 CDH1 model_pid_without_scc.bcx


**Arguments:**

* ``--external_identifiers``: Multiple external identifiers to be mapped.


Example of CSV file (:download:`mapping.csv <_static/demo_files/mapping_examples.csv>`):

.. csv-table::
    :header: External identifiers, Cadbiom identifiers

    P12830,E_cadherin_early_endosome|E_cadherin_cytoplasm|E_cadherin_gene|...
    CDH1,E_cadherin_early_endosome|E_cadherin_cytoplasm|E_cadherin_gene|...
    P02452,COL1A1_gene|COL1A1
    COL1A1,COL1A1_gene|COL1A1
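
Such a file can be loaded into a dictionary for later use when building queries.
The following is a minimal sketch, assuming that the file is comma-delimited, has no header row,
and separates Cadbiom identifiers with ``|`` as in the table above;
the function name ``load_mapping`` is hypothetical.

.. code-block:: python

    import csv
    import io


    def load_mapping(handle):
        """Map each external identifier to its list of Cadbiom identifiers."""
        mapping = {}
        for row in csv.reader(handle):
            if len(row) < 2:
                continue  # Skip blank or malformed lines
            external_id, cadbiom_ids = row[0], row[1]
            mapping[external_id] = [name for name in cadbiom_ids.split("|") if name]
        return mapping


    # Sample rows copied from the table above
    sample = io.StringIO(
        "P02452,COL1A1_gene|COL1A1\n"
        "COL1A1,COL1A1_gene|COL1A1\n"
    )
    mapping = load_mapping(sample)
    print(mapping["P02452"])

With a real file, the same function can be called on a handle opened with ``open("mapping.csv", newline="")``.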


.. _search_molecules:

Search for molecules
====================

Given a model, Cadbiom can explain how to obtain an entity or a set of entities from boundaries
(elements at the periphery of the model). These sets of boundaries are called **Minimal Activation Conditions (MAC)**.

Ultimately, the software answers the question: *"Is it possible to find an initialization
such that the given state/property happens?"*

Searching for a combination of entities requires expressing the query
as a boolean formula with the names of the entities as variables.
The logical operators available are ``or``, ``and``, ``not``.
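
When many entities are involved, such a formula can be assembled programmatically.
The following is a minimal sketch that only uses the ``and`` and ``not`` operators;
the function name ``build_query`` is hypothetical and is not part of Cadbiom.

.. code-block:: python

    def build_query(required, forbidden=()):
        """Join entity names into a formula: all `required`, none of `forbidden`."""
        if not required:
            raise ValueError("at least one required entity is expected")
        terms = list(required) + ["not " + name for name in forbidden]
        return " and ".join(terms)


    query = build_query(["COL1A1", "COL3A1", "COL5A1", "MMP2"], forbidden=["PERP"])
    print(query)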

See the section :ref:`get_model_info` in order to find out which entities to query in the model.

The subcommand `solutions_search <./command_line_usage.html#solutions_search>`_ is designed to compute MACs.


Example:

We are looking for entities involved in the production of extracellular matrix molecules.
Some combinations of these entities are gathered in the following file: :download:`logical_formulas.txt <_static/demo_files/logical_formulas.txt>`.

This file is then loaded in the solver with the following command:

.. code-block:: bash

    $ cadbiom_cmd solutions_search model_without_scc.bcx --input_file logical_formulas.txt --continue


**Arguments:**

The number of steps needed to reach a solution is often large, and the computation time grows accordingly.
Fortunately, the number of steps can be limited with ``--steps``.
Moreover, an interrupted computation can be resumed later with ``--continue``.

* ``--input_file``: Multiple jobs can be launched in parallel if the user provides a file with one
  boolean formula per line. In this case, each processor core will be dedicated to
  the calculation of one boolean formula (within the limit of the number of available cores).

.. * ``--all_macs``: Solver will try to search all macs with 0 to the maximum of allowed steps.

* ``--continue``: Resume previous computations; if a MAC file from a previous run exists,
  the last frontier places/boundaries will be reloaded.

----

The program produces two categories of files, briefly described here
(for further information, please see the chapter :ref:`cadbiom_file_format_spec`):


* ``*mac.txt`` files: Each line contains a MAC solution. Here is an example taken from the file
  corresponding to the query ``COL1A1 and COL3A1 and COL5A1 and MMP2 and not PERP``
  (:download:`Mac file <_static/demo_files/model_without_scc_COL1A1 and COL3A1 and COL5A1 and MMP2 and not PERP_mac.txt>`).

  .. code-block:: text

    COL1A1_gene COL3A1_gene COL5A1_gene dNp63a__tetramer__nucleus_v2 ET1_extracellular_region ETA FOXM1B_nucleus GTF3A MMP2_gene
    COL1A1_gene COL3A1_gene COL5A1_gene dNp63a__tetramer__nucleus_v2 ET1_extracellular_region ETA Fra1_1O_2O_nucleus GTF3A JUN_2O_2O_nucleus MMP2_gene


* ``*mac_complete.txt`` files: Each MAC solution is followed by the succession of events
  fired at each step to obtain it. Here is an example taken from the file
  corresponding to the query ``COL1A1 and COL3A1 and COL5A1 and MMP2 and not PERP``
  (:download:`Mac complete file <_static/demo_files/model_without_scc_COL1A1 and COL3A1 and COL5A1 and MMP2 and not PERP_mac_complete.txt>`).

  .. code-block:: text

    COL1A1_gene COL3A1_gene COL5A1_gene dNp63a__tetramer__nucleus_v2 ET1_extracellular_region ETA FOXM1B_nucleus GTF3A MMP2_gene
    % _h_5848 _h_1070 _h_4563 _h_4049
    % _h_5603 _h_1051
    COL1A1_gene COL3A1_gene COL5A1_gene dNp63a__tetramer__nucleus_v2 ET1_extracellular_region ETA Fra1_1O_2O_nucleus GTF3A JUN_2O_2O_nucleus MMP2_gene
    % _h_5848 _h_1070 _h_4283
    % _h_4174 _h_4049 _h_1051


As we can see, these raw files are difficult to interpret as such, which is why the next chapter,
`Processing of the generated files`_,
discusses how to use the high-level features available to handle them.


Processing of the generated files
=================================

Visualize the trajectories of each solution
-------------------------------------------

In the same way that we can generate a graph of the model,
we can generate a graph explaining the path taken for each solution found by the solver.
We are therefore reconstructing the path between boundaries of the model (the components of the solutions)
and entities of interest in order to explain their production.

This command requires the model and solution files of type ``*mac_complete.txt``.
We will take the example of the file seen in the previous chapter.


Example:

   .. code-block:: bash

      $ cadbiom_cmd solutions_2_graph model_without_scc.bcx \
        "./result/model_without_scc_COL1A1 and COL3A1 and COL5A1 and MMP2 and not PERP_mac_complete.txt"

----

The program produces GraphML files in the folder ``./graphs/``. These files can be opened in Cytoscape.

.. figure:: _static/demo_files/04-26-26_0_COL1A1_gene_COL3A1_gene_COL5A1_gene_dNp63a__tetramer__nucleus_v2_ET1_extrac.svg
    :scale: 85 %
    :alt:
    :align: center

    Cytoscape screenshot of the graph of the first solution for the query ``COL1A1 and COL3A1 and COL5A1 and MMP2 and not PERP``.


.. figure:: _static/demo_files/04-26-26_1_COL1A1_gene_COL3A1_gene_COL5A1_gene_dNp63a__tetramer__nucleus_v2_ET1_extrac.svg
    :scale: 85 %
    :alt:
    :align: center

    Cytoscape screenshot of the graph of the second solution for the query ``COL1A1 and COL3A1 and COL5A1 and MMP2 and not PERP``.


We notice that the paths taken are relatively similar and independent for each of the 4 entities searched
(4 cliques are displayed). However, the solver has found an alternative path to the first solution in order to produce MMP2.

As more solutions are listed, they become more complex and require different boundaries.
We also note that the reactions most described in the literature, such as the translation of the MMP2 gene,
are the most likely to have a significant number of referenced modulators.
The complexity of the model is revealed around these reactions.

.. figure:: _static/demo_files/04-26-36_343_ATF2_1i_COL1A1_gene_COL3A1_gene_COL5A1_gene_dNp63a__tetramer__nucleus_v2_Er.svg
    :scale: 85 %
    :alt:
    :align: center

    Cytoscape screenshot of the graph of the 343rd solution for the query ``COL1A1 and COL3A1 and COL5A1 and MMP2 and not PERP``.


Global visualization of all paths found for a query
---------------------------------------------------

solutions_2_common_graph
    Create a GraphML-formatted file containing a unique representation of **all**
    trajectories corresponding to **all** solutions in each complete MAC file (\*mac_complete files).
    This function visualizes the paths taken by the solver from the boundaries to the entities of interest.

Example:

   .. code-block:: bash

      $ cadbiom_cmd queries_2_common_graph model_without_scc.bcx \
        "./result/model_without_scc_COL1A1 and COL3A1 and COL5A1 and MMP2 and not PERP_mac_complete.txt"


Generated graph (:download:`graph<_static/demo_files/05-59-30__COL1A1 and COL3A1 and COL5A1 and MMP2 and not PERP.graphml>`):


.. figure:: _static/demo_files/common_graph.svg
    :scale: 85 %
    :alt:
    :align: center

    Cytoscape screenshot of the common weighted and directed graph of solutions for the query ``COL1A1 and COL3A1 and COL5A1 and MMP2 and not PERP``.


Programmatic processing and decompilation of solution files
-----------------------------------------------------------

queries_2_json
    Create a JSON-formatted file containing **all** data from complete MAC files
    (\*mac_complete files). The file will contain frontier places/boundaries and decompiled
    steps with their respective events for each solution.
    This function makes it possible to quickly search all transition attributes involved in a solution.


Example:

   .. code-block:: bash

      $ cadbiom_cmd solutions_2_json model_without_scc.bcx \
        "./result/model_without_scc_COL1A1 and COL3A1 and COL5A1 and MMP2 and not PERP_mac_complete.txt"


Example of JSON file (:download:`decompiled_mac_complete<_static/demo_files/model_without_scc_COL1A1 and COL3A1 and COL5A1 and MMP2 and not PERP_mac_complete_decomp.json>`):

    .. code-block:: text

        [{
            "solution": "boundary_1 boundary_2",
            "steps": [
                [{
                    "event": "_h_1",
                    "transitions": [{
                        "ext": "place_x",
                        "ori": "boundary_1"
                    }]
                }],
            ]
        },
        ...
        ]
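
Once loaded (for example with ``json.load``), this structure can be walked to collect the transitions used by the solutions.
The following is a minimal sketch, assuming only the keys visible in the example above
(``steps``, ``transitions``, ``ori``, ``ext``); the function name ``count_transitions`` is hypothetical.

.. code-block:: python

    from collections import Counter


    def count_transitions(solutions):
        """Count how often each (ori, ext) transition appears in all solutions."""
        counts = Counter()
        for solution in solutions:
            for step in solution["steps"]:
                for event in step:
                    for transition in event["transitions"]:
                        counts[(transition["ori"], transition["ext"])] += 1
        return counts


    # Same content as the schematic example above, as Python data
    sample = [{
        "solution": "boundary_1 boundary_2",
        "steps": [
            [{
                "event": "_h_1",
                "transitions": [{"ext": "place_x", "ori": "boundary_1"}],
            }],
        ],
    }]
    counts = count_transitions(sample)
    print(counts[("boundary_1", "place_x")])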


Search of interactions between molecules in trajectories
--------------------------------------------------------

This command searches for interactions between the molecules present in the trajectories and the boundaries,
distinguishing genes from other stimuli (non-genes).

json_2_interaction_graph
    Make a weighted interaction graph based on the
    searched molecule of interest. Read decompiled
    solution files (*.json* files produced by the
    directive 'solutions_2_json') and make a graph of the
    relationships between one or more molecules of
    interest, the genes and other frontier
    places/boundaries found among all the solutions.

Example:

   .. code-block:: bash

      $ cadbiom_cmd json_2_interaction_graph model_without_scc.bcx \
        "./decompiled_solutions/model_without_scc_COL1A1 and COL3A1 and COL5A1 and MMP2 and not PERP_mac_complete_decomp.json" \
        PKC_1a
      INFO: Opening files...
      INFO: Files processed: 1
      INFO: Building graph...
      INFO: Graph generated in 0.0767540931702


Generated graph (:download:`interaction graph between PKC_1a and boundaries<_static/demo_files/interaction_graph.graphml>`):

.. figure:: _static/demo_files/interact_graph.svg
    :scale: 85 %
    :alt:
    :align: center

    Cytoscape screenshot of the interaction graph of the entity ``PKC_1a`` for the query ``COL1A1 and COL3A1 and COL5A1 and MMP2 and not PERP``.


Occurrences matrix
------------------

solutions_2_occcurrences_matrix
    Create a matrix of occurrences counting entities in
    the solutions found in \*mac.txt files in the given
    path.


Example:

   .. code-block:: bash

      $ cadbiom_cmd solutions_2_occcurrences_matrix output/pid_last_nov_model_without_scc.bcx \
        ./result/
      INFO: Files processed: 44


Generated files:

* :download:`occurrences matrix<_static/demo_files/occurrence_matrix.csv>`
* :download:`transposed occurrences matrix<_static/demo_files/occurrence_matrix_t.csv>`
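
The counting principle can be illustrated in a few lines of Python.
The following is a minimal sketch that counts in how many solutions each entity appears, assuming one solution per line as in the ``*mac.txt`` files;
it does not reproduce the exact layout of the matrix written by Cadbiom, and the function name ``count_occurrences`` is hypothetical.

.. code-block:: python

    from collections import Counter


    def count_occurrences(mac_lines):
        """Count in how many solutions each entity appears."""
        counts = Counter()
        for line in mac_lines:
            # One solution per line; a set avoids counting an entity twice
            counts.update(set(line.split()))
        return counts


    # The two solutions shown earlier for the example query
    mac_lines = [
        "COL1A1_gene COL3A1_gene COL5A1_gene dNp63a__tetramer__nucleus_v2 "
        "ET1_extracellular_region ETA FOXM1B_nucleus GTF3A MMP2_gene",
        "COL1A1_gene COL3A1_gene COL5A1_gene dNp63a__tetramer__nucleus_v2 "
        "ET1_extracellular_region ETA Fra1_1O_2O_nucleus GTF3A "
        "JUN_2O_2O_nucleus MMP2_gene",
    ]
    counts = count_occurrences(mac_lines)
    print(counts["MMP2_gene"], counts["FOXM1B_nucleus"])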


Clustermaps
-----------

We can visualize co-occurrences of the boundaries within the solutions obtained by
creating a ClusterMap (hierarchically-clustered heatmap) for boundaries
found in the solutions.


Example:

   .. code-block:: bash

      $ cadbiom_cmd queries_2_clustermap \
        "./result/model_without_scc_COL1A1 and COL3A1 and COL5A1 and MMP2 and not PERP_mac_complete.txt"


      pid_last_nov_model_without_scc_COL1A1 and COL3A1 and COL5A1 and MMP2 and not PERP_mac


Generated clustermap (:download:`clustermap.svg <_static/demo_files/clustermap.svg>`):

.. figure:: _static/demo_files/clustermap.png
    :scale: 90 %
    :alt:
    :align: center

    Hierarchically-clustered heatmap of the occurrences of the boundaries found in the solutions for the query
    ``COL1A1 and COL3A1 and COL5A1 and MMP2 and not PERP``.
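
The co-occurrence counts that such a heatmap summarizes can be sketched in plain Python.
The following is a minimal sketch that counts how many solutions contain each pair of boundaries;
it does not perform the hierarchical clustering itself, and the function name ``count_cooccurrences`` and the shortened sample solutions are hypothetical.

.. code-block:: python

    from collections import Counter
    from itertools import combinations


    def count_cooccurrences(solutions):
        """Count how many solutions contain each pair of boundaries."""
        counts = Counter()
        for boundaries in solutions:
            # Sorting gives each pair a single canonical orientation
            for pair in combinations(sorted(set(boundaries)), 2):
                counts[pair] += 1
        return counts


    # Shortened, hypothetical solutions built from boundary names seen above
    solutions = [
        ["ETA", "GTF3A", "FOXM1B_nucleus"],
        ["ETA", "GTF3A", "JUN_2O_2O_nucleus"],
    ]
    cooccurrences = count_cooccurrences(solutions)
    print(cooccurrences[("ETA", "GTF3A")])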


Advanced users - Creation of models
===================================

.. _create_from_biopax_endpoint:

Creation of a Cadbiom model from a BioPAX endpoint
--------------------------------------------------

*biopax2cadbiom* is a standalone module, also integrated in the Cadbiom GUI, that
converts BioPAX ontologies to Cadbiom models.
Full help on its installation and usage is available at `biopax2cadbiom <http://cadbiom.genouest.org/doc/biopax2cadbiom/index.html>`_.
Do not miss the chapter `How to make queries on an endpoint like Pathway Commons? <http://cadbiom.genouest.org/doc/biopax2cadbiom/tutorial.html#how-to-make-queries-on-an-endpoint-like-pathway-commons>`_.

Let's take the example of a `PID database (Pathway Interaction Database) <https://github.com/NCIP/pathway-interaction-database>`_ conversion with the following command:

.. code-block:: bash

    $ biopax2cadbiom model \
    --graph_uris http://pathwaycommons.org \
    --provenance_uri http://pathwaycommons.org/pc2/pid \
    --triplestore http://rdf.pathwaycommons.org/sparql/


**Quick explanations of arguments:**

* The parameter ``--graph_uris`` provides the URI of the graphs queried (and optionally
  of the BioPAX ontology if it is hosted separately).

* Since the queried graph can gather data from several databases (as Pathway Commons does),
  it is necessary to filter the RDF triples according to their origin with
  the optional parameter ``--provenance_uri``.
  When it is set, the program filters entities, reactions and pathways according to their ``dataSource`` BioPAX attribute.

* The URL of the endpoint is specified with ``--triplestore``.


**Result:**

The generated model will be placed in the ``./output`` folder;
we will focus on the model with specific alterations
intended to optimize the operation of Cadbiom and its solver: ``./output/model_without_scc.bcx``.

(:download:`pid_model_without_scc.bcx <./_static/demo_files/models_mars_2020/model_pid_without_scc.bcx>`)

.. note:: Small graphs can be queried and converted from the GUI of Cadbiom.

   .. figure:: _static/demo_workflow/gui_biopax_import.jpg
      :scale: 65 %
      :alt: Cadbiom GUI import BioPAX
      :align: center

      Screenshot of the import tool for BioPAX graphs in the GUI of Cadbiom.


.. |RDF| replace:: Resource Description Framework
.. |URI| replace:: Uniform Resource Identifier
.. |OWL| replace:: Web Ontology Language
.. |BioPAX| replace:: Biological Pathway Exchange
.. |MAC| replace:: Minimal Activation Conditions