Workflow overview¶

This guide can help you start working with the Cadbiom command line.

Workflow for the Cadbiom framework. Four steps can be identified: Model building, covered by the standalone tool named biopax2cadbiom; Assisted query reconstruction, Causality search, Analysis and visualisation covered by the Cadbiom package named cadbiom-cmd.

The analysis of a BioPAX model is performed according to four main steps.

The first step of the Cadbiom framework relies on the biopax2cadbiom package. It allows querying BioPAX ontologies stored in public or local endpoints in order to interpret them into Cadbiom models.

The results of the conversions of some databases are available below: Characteristics of biopax ressources.

The second step brings together several commands under the name of query design. This step is dedicated to query construction. A query is a boolean logical formula that describes the combination of molecules for which regulators will be searched.

The characteristics of a Cadbiom model and its content (list of identifiers of biological compounds, list of boundary compounds, and list of genes) can be exported. Among the essential data, the user will find mappings between various databases thanks to the preservation of all cross-references with public databases (such as HGCN, Uniprot, Chebi) provided in the BioPAX. Indeed, Cadbiom uses internal identifiers to guarantee the uniqueness of the biomolecules in its models. A mapping of Cadbiom identifiers with the standard database identifiers is necessary to allow a user to build queries.

Examples of commands are described below: Query the Cadbiom model.

The third step of the workflow is focused on causality search. In this step, the user explores the dynamics of a Cadbiom model. He identifies sets of biomolecules that represent the initial conditions of the system, leading to the activation of the desired entities via controlled biochemical transformations described in the model. As detailed in the results section, the notion of activation is intrinsically linked to the non-deterministic semantics associated with guarded transitions. This analysis therefore generates several families of controllers associated with trajectories composed by the entities activated during a search.

Examples of commands are described below: Search for molecules.

The fourth step, visualization and analysis is designed to analyze the families of controllers generated in the previous step. A matrix of occurrences details the presence of controllers in the solutions obtained. Heatmaps allow estimating the diversity of the trajectories leading to the studied phenotype (i.e. the boolean query). Trajectory graphs allow visualizing all the transformations undergone by the intermediate molecules from the control molecules to the target molecules.

Examples of commands are described below: Processing of the generated files.

Characteristics of biopax ressources¶

The majority of models are pre-generated from the PathwayCommons archive and are made available on this page with their respective characteristics.

ACSN maps are preprocessed before there interrogation as RDF data. The maps are downloadable on acsn.curie.fr. The steps are explained on the GitLab.

Advanced users will be able to create their own models from a SPARQL endpoint as explained in the chapter Creation of a Cadbiom model from a BioPAX endpoint.

Obtaining these data from the models is also explained in the chapter Query the Cadbiom model.

Here is a table of the number of the BioPAX types found in various databases:

Metrics - Databases	PID	Kegg	ACSN
Entity			27426*
Pathway	745	122	13
Gene
PhysicalEntity			0/11922*
Protein	6194	1872	6851
Complex	4137		2323/2358*
SmallMolecule	173	1664	554/0*
Dna			1030
Rna	22		1164/0*
Interaction
BiochemicalReaction	1824	1782	6863
ComplexAssembly	2722		1743
TemplateReaction	1492
Transport	312		699
MolecularInteraction		4
Degradation
TransportWithBiochemicalReaction	154
Catalysis	3800	1782	6186
Control	322
TemplateReactionRegulation	2023
Modulation

*: Number before cleaning of incorrect/duplicate types in triplestore. i.e. the entire database, for the Entity type.

Here is a table presenting some metrics interpreted according to the raw BioPAX data obtained:

Metrics - Databases	PID	Kegg	ACSN
Retrieved PhysicalEntities	10526	3536	1716
Total of duplicated entities	699	135	74
Nb of groups of duplicated entities	339	61	23
Classes	403	0	0
Used classes	304	0	0
Nested classes	23	0	0
Classes with ModificationFeatures	157	0	0
Classes/complexes	66	0	0
Final number of entities (after processing)

Retrieved Reactions	6504	1786	9305
Proteins involved as reactants	3304	0	4840
SmallMolecules involved as reactants	128	1571	NA
Complexes involved as reactants	3733	0	2217

Retrieved Controls	6145	1782	6186
Catalysis control	3800	1782	494
Reactions with similar entities as reactants and products	50	934	333
Controls of other controls	0	0	0

Prebuilt models¶

Here is the the characteristics of the generated models:

Metrics - Databases	PID	Kegg	ACSN
Cadbiom entities	9788	2604	10313
Genes	788	2	1035
	`csv`/`json`	`csv`/`json`	`csv`/`json`
Boundaries	3925	1420	3693
	`csv`/`json`	`csv`/`json`	`csv`/`json`

Transitions	11036	5220	11394
Events	7501	1570	8819
Models	`bcx file`	`bcx file`	`bcx file`

Note: Use “Save as” after a right click on the download links if your browser tries to open files in a new tab…

Query the Cadbiom model¶

What is in the model?¶

As mentioned above, a user can either build his own model from any triplestore hosting BioPAX data, or use one of the pre-designed models provided earlier on this page.

He can then browse these models to extract data about all biomolecules, genes or boundaries (elements at the periphery of the model). Among the essential data, the user will find lists of mappings between various databases.

Indeed, Cadbiom uses internal identifiers to guarantee the unicity of biomolecules in its models. A mapping of the Cadbiom identifiers with the standard database identifiers (such as Uniprot, HUGO, etc.) is then necessary to allow the user to forge his own requests of interest (for more information on the search for molecules, see the next chapter: Search for molecules). This often tedious mapping step is facilitated by the module described below: Get a mapping between Cadbiom identifiers and those from external databases.

Get model information¶

To get information about the biological entities in the model, the subcommand model_info can be used (see the documentation of the model_info command) :

$ cadbiom_cmd model_info model_without_scc.bcx --all_entities --json --csv

Arguments:

--all_entities or --boundaries or --genes: Retrieve data for specific places/entities of the model.
--json: Create a JSON formated file containing data about previously filtered places/entities of the model, and a full summary about the model itself (boundaries, transitions, events, entities locations, entities types).
--csv: Create a CSV file containing data about previously filtered places/entities of the model.

Example of JSON file for PID (model_summary_genes_pid.json):

{
    'modelFile': 'string',
    'modelName': 'string',
    'events': int,
    'entities': int,
    'boundaries': int,
    'transitions': int,
    'entitiesLocations': {
        'cellular_compartment_a': int,
        'cellular_compartment_b': int,
        ...
    },
    'entitiesTypes': {
        'biological_type_a': int,
        'biological_type_b': int;
        ...
    },
    'entitiesData': {
        [{
            'cadbiomName': 'string',
            'immediateSuccessors': ['string', ...],
            'uri': 'string',
            'entityType': 'string',
            'entityRef': 'string',
            'location': 'string',
            'names': ['string', ...],
            "xrefs": {
                'external_database_a': ['string', ...],
                'external_database_b': ['string', ...],
                ...
            }
        }],
        ...
    }
}

Such a file could facilitate a work of visualization or a possible mapping of identifiers because it centralizes in a standardized manner most of the information about BioPAX entities.

It can also be used to easily identify the entities of the model that we would like to remove in a future translation (Example: energy metabolism molecules such as ATP, ADP, GTP, GDP). Indeed, these molecules are ubiquitous and unnecessarily complexify the conditions of realization of the reactions in the model and thus its analysis by the solver.

Simplified example of CSV file (genes_pid.csv):

cadbiomName	immediateSuccessors	names	uri	entityType	location	uniprot knowledgebase	chebi
VCAM1_integral_to_membrane		VCAM1	Protein_xxx	Protein	integral to membrane	P19320
ATP	ADP	ATP	SmallMolecule_xxx	SmallMolecule			CHEBI:15422\|CHEBI:22249

The most important field is probably `immediateSuccessors`, as it lists the immediate successors encountered for each entity in the model. This column contains the Cadbiom identifiers of entities that should be requested by a user who would like to explore their regulation with the framework. It is these identifiers that should constitute the queries (boolean logical formula) used in the Search for molecules section.

Get information about the graph based on the model¶

To build a graph based on the model and get information about it, the subcommand model_graph (see the documentation of the model_graph command) can be used:

$ cadbiom_cmd model_graph model_without_scc.bcx --graph --centralities --json

Arguments:

--centralities: Get centralities for each node of the graph (degree, in_degree, out_degree, closeness, betweenness).
--graph: Translate the model into a GraphML formated file which can be opened in Cytoscape.
--json: Create a JSON formated file containing a summary of the graph based on the model.

Example of JSON file (graph_summary_pid.json):

{
    'modelFile': 'string',
    'modelName': 'string',
    'events': int,
    'entities': int,
    'transitions': int,
    'graph_nodes': int,
    'graph_edges': int,
    'centralities': {
        'degree': {
            'entity_1': float,
            'entity_2': float
        },
        'in_degree': {
            'entity_1': float,
            'entity_2': float
        },
        'out_degree': {
            'entity_1': float,
            'entity_2': float
        },
        'betweenness': {
            'entity_1': float,
            'entity_2': float
        },
        'closeness': {
            'entity_1': float,
            'entity_2': float
        },
    }
}

The corresponding GraphML file can be downloaded here pid.graphml.

Examples of use and style dedicated to opening the GraphML file in Cytoscape are available on the repository: examples

Also a dedicated a module dedicated to viewing the models in Gephi is also available as explained later.

Get a mapping between Cadbiom identifiers and those from external databases¶

This function exports a CSV formated file presenting the list of known Cadbiom identifiers for each given external identifier.

$ cadbiom_cmd identifiers_mapping --external_identifiers P02452 COL1A1 P12830 CDH1 model_pid_without_scc.bcx

Arguments:

--external_identifiers: Multiple external identifiers to be mapped.

Example of CSV file (mapping.csv):

External identifiers	Cadbiom identifiers
P12830	E_cadherin_early_endosome\|E_cadherin_cytoplasm\|E_cadherin_gene\|…
CDH1	E_cadherin_early_endosome\|E_cadherin_cytoplasm\|E_cadherin_gene\|…
P02452	COL1A1_gene\|COL1A1
COL1A1	COL1A1_gene\|COL1A1

Search for molecules¶

With a given model, Cadbiom allows to explain how to obtain an entity or a set of entities from boundaries (elements at the periphery of the model). These sets are called Minimal Activation Conditions (MAC).

Ultimately, the software answers to the question: “Is it possible to find an initialization such that the given state/property happens?”

Searching a complex query with a combination of entities requires to express it in a boolean formula with the names of the entities as variables. The logical operators available are or, and, not.

See the section Get model information in order to find out which entities to query in the model.

The subcommand solutions_search is designed to compute MACs.

Example:

We are looking for entities involved in the production of extracellular matrix molecules. Some combinations of these entities are gathered in the following file: logical_formulas.txt.

This file is then loaded in the solver with the following command:

$ cadbiom_cmd solutions_search model_without_scc.bcx --input_file logical_formulas.txt --continue

Arguments:

Most of the time, the number of steps to reach a solution is significant and therefore, the necessary computation time ensues. Fortunately, it can be limited with --steps. Moreover, a stopped calculation can be resumed later thanks to --continue.

--input_file: Multiple jobs can be launched in parallel if the user provides a file with one boolean formula per line. In this case, each processor core will be dedicated to the calculation of one boolean formula (within the limit of the number of available cores).

--continue: Resume previous computations; if there is a mac file from a previous work, last frontier places/boundaries will be reloaded.

—

The program produces two categories of files; a quick description is provided here but for further information please see chapter Cadbiom File Format Specification):

*mac.txt files: Each line contains a MAC solution. Here is an example taken from the file corresponding to the query COL1A1 and COL3A1 and COL5A1 and MMP2 and not PERP (Mac file).

COL1A1_gene COL3A1_gene COL5A1_gene dNp63a__tetramer__nucleus_v2 ET1_extracellular_region ETA FOXM1B_nucleus GTF3A MMP2_gene
COL1A1_gene COL3A1_gene COL5A1_gene dNp63a__tetramer__nucleus_v2 ET1_extracellular_region ETA Fra1_1O_2O_nucleus GTF3A JUN_2O_2O_nucleus MMP2_gene

*mac_complete.txt files: Each MAC solution is followed by the successions of events fired at each step to obtain them. Here is an example taken from the file corresponding to the query COL1A1 and COL3A1 and COL5A1 and MMP2 and not PERP (Mac complete file).

COL1A1_gene COL3A1_gene COL5A1_gene dNp63a__tetramer__nucleus_v2 ET1_extracellular_region ETA FOXM1B_nucleus GTF3A MMP2_gene
% _h_5848 _h_1070 _h_4563 _h_4049
% _h_5603 _h_1051
COL1A1_gene COL3A1_gene COL5A1_gene dNp63a__tetramer__nucleus_v2 ET1_extracellular_region ETA Fra1_1O_2O_nucleus GTF3A JUN_2O_2O_nucleus MMP2_gene
% _h_5848 _h_1070 _h_4283
% _h_4174 _h_4049 _h_1051

As we can see, these files are not particularly meaningful, which is why the next chapter Processing of the generated files will discuss how to use the high-level features available to handle them.

Processing of the generated files¶

Visualize the trajectories of each solution¶

In the same way that we can generate a graph of the model, we can generate a graph explaining the path taken for each solution found by the solver. We are therefore reconstructing the path between boundaries of the model (the components of the solutions) and entities of interest in order to explain their production.

This command requires the model and solution files of type *.mac_complete.txt. We will take the example of the file seen in the previous chapter.

Example:

$ cadbiom_cmd solutions_2_graph model_without_scc.bcx \
  "./result/model_without_scc_COL1A1 and COL3A1 and COL5A1 and MMP2 and not PERP_mac_complete.txt"

—

The program produces GraphML files in the folder ./graphs/. These files can be opened in Cytoscape.

Cytoscape screenshot of the graph of the first solution for the query COL1A1 and COL3A1 and COL5A1 and MMP2 and not PERP.

Cytoscape screenshot of the graph of the second solution for the query COL1A1 and COL3A1 and COL5A1 and MMP2 and not PERP.

We notice that the paths taken are relatively similar and independent for each of the 4 entities searched (4 cliques are displayed). However, the solver has found an alternative path to the first solution in order to produce MMP2.

The more solutions listed, the more complex they become and require different boundaries. We also note that the reactions most described in the literature such as the translation of the MMP2 gene are the most likely to have a significant number of referenced modulators. It is around these reactions that the complexity of the model is revealed.

Cytoscape screenshot of the graph of the 343rd solution for the query COL1A1 and COL3A1 and COL5A1 and MMP2 and not PERP.

Global visualization of all paths found for a query¶

solutions_2_common_graph: Create a GraphML formated file containing a unique representation of all trajectories corresponding to all solutions in each complete MAC file (*mac_complete files). This is a function to visualize paths taken by the solver from the boundaries to the entities of interest.

Example:

$ cadbiom_cmd queries_2_common_graph model_without_scc.bcx \
  "./result/model_without_scc_COL1A1 and COL3A1 and COL5A1 and MMP2 and not PERP_mac_complete.txt"

Generated graph (graph):

Cytoscape screenshot of common weighted and directed graph of solutions for the query COL1A1 and COL3A1 and COL5A1 and MMP2 and not PERP.

Programmatic processing and decompilation of solution files¶

queries_2_json: Create a JSON formated file containing all data from complete MAC files (*mac_complete files). The file will contain frontier places/boundaries and decompiled steps with their respective events for each solution. This is a function to quickly search all transition attributes involved in a solution.

Example:

$ cadbiom_cmd solutions_2_json model_without_scc.bcx \
  "./result/model_without_scc_COL1A1 and COL3A1 and COL5A1 and MMP2 and not PERP_mac_complete.txt"

Example of JSON file (decompiled_mac_complete):

[{
    "solution": "boundary_1 boundary_2",
    "steps": [
        [{
            "event": "_h_1",
            "transitions": [{
                "ext": "place_x",
                "ori": "boundary_1"
            }]
        }],
    ]
},
...
]

Search of interactions between molecules in trajectories¶

Search of interactions between molecules present in the trajectories and the boundaries with distinction of genes and various stimuli (non genes).

json_2_interaction_graph: Make an interaction weighted graph based on the searched molecule of interest. Read decompiled solutions files (.json files produced by the directive ‘solutions_2_json’) and make a graph of the relationships between one or more molecules of interest, the genes and other frontier places/boundaries found among all the solutions.

Example:

$ cadbiom_cmd json_2_interaction_graph model_without_scc.bcx \
  "./decompiled_solutions/model_without_scc_COL1A1 and COL3A1 and COL5A1 and MMP2 and not PERP_mac_complete_decomp.json" \
  PKC_1a
INFO: Opening files...
INFO: Files processed: 1
INFO: Building graph...
INFO: Graph generated in 0.0767540931702

Generated graph (interaction graph between PKC_1a and boundaries):

Cytoscape screenshot of the interaction graph of the entity PKC_1a for the query COL1A1 and COL3A1 and COL5A1 and MMP2 and not PERP.

Occurrences matrix¶

solutions_2_occcurrences_matrix: Create a matrix of occurrences counting entities in the solutions found in *mac.txt files in the given path.

Example:

$ cadbiom_cmd solutions_2_occcurrences_matrix output/pid_last_nov_model_without_scc.bcx \
  ./result/ \
INFO: Files processed: 44

occurrences matrix transposed occurrences matrix

Clustermaps¶

We can visualize co-occurrences of the boundaries within the solutions obtained by creating a ClusterMap (hierarchically-clustered heatmap) for boundaries found in the solutions.

Example:

$ cadbiom_cmd queries_2_clustermap \
  "./result/model_without_scc_COL1A1 and COL3A1 and COL5A1 and MMP2 and not PERP_mac_complete.txt"


pid_last_nov_model_without_scc_COL1A1 and COL3A1 and COL5A1 and MMP2 and not PERP_mac

Generated clustermap (clustermap.svg)

Hierarchically-clustered heatmap of the occurrences of the boundaries found in the solutions for the query COL1A1 and COL3A1 and COL5A1 and MMP2 and not PERP.

Avanced users - Creation of models¶

Creation of a Cadbiom model from a BioPAX endpoint¶

biopax2cadbiom is a standalone module also integrated in the Cadbiom GUI that converts BioPAX ontologies to Cabiom models. You will find a full help about the installation and usage at biopax2cadbiom. Do not miss the chapter How to make queries on an endpoint like Pathway Commons?

Let’s take the example of a PID database (Pathway Interaction Database) conversion with the following command:

$ biopax2cadbiom model \
--graph_uris http://pathwaycommons.org \
--provenance_uri http://pathwaycommons.org/pc2/pid \
--triplestore http://rdf.pathwaycommons.org/sparql/

Quick explanations of arguments:

The parameter --graph_uris provides the URI of the graphs queried (and optionally of the BioPAX ontology if it is hosted separately).
It is thus necessary to filter the RDF triples according to their origin with the the optional parameter --provenance_uri. By setting it, the program will filter entities, reactions, pathways thanks to their dataSource BioPAX attribute.
The URL of the endpoint is specified with --triplestore.

Result:

The generated model will be placed in the ./output folder; we will focus on the model with specific alterations intended to optimize the operation of Cadbiom and its solver: ./output/model_without_scc.bcx.

(pid_model_without_scc.bcx)

Note

Small graphs can be queried and converted from the GUI of Cadbiom.

Screenshot of the import tool for BioPAX graphs in the GUI of Cadbiom.