{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Work with models\n", "\n", "Here is an example\\* of working with the Cadbiom API to process the entities and transitions stored in a Cadbiom model.\n", "\n", "\\*: see the notes at the end of this page.\n", "\n", "## Model handling\n", "### Load the model\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "\n", "# Make print() a function under Python 2\n", "from __future__ import print_function\n", "#import mpld3\n", "##mpld3.enable_notebook(local=True) # currently not happy\n", "#mpld3.enable_notebook()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Import the Cadbiom parser\n", "from cadbiom.models.guard_transitions.translators.chart_xml import MakeModelFromXmlFile\n", "\n", "def load_model(model_path):\n", "    \"\"\"Load a model from a model file\"\"\"\n", "    parser = MakeModelFromXmlFile(model_path)\n", "    return parser.model\n", "\n", "model = load_model(\"_static/demo_files/models_mars_2020/model_pid_without_scc.bcx\")\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Extract data\n", "\n", "The model obtained is of the `ChartModel` type described in the [documentation](http://cadbiom.genouest.org/doc/cadbiom/library_doc.html#cadbiom.models.guard_transitions.chart_model.ChartModel).\n", "\n", "Transitions are available via the attribute `transition_list`, and places or nodes are available via the attribute `node_dict`.\n", "\n", "Here we are interested in the boundaries of the model..."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def get_boundaries(model):\n", "    \"\"\"Build the set of boundaries of the model\n", "\n", "    Boundary = a place without incoming transition\n", "    Boundaries of the model = all places - places with incoming transitions\n", "\n", "    :return: Set of boundaries\n", "    :rtype: set\n", "    \"\"\"\n", "    all_biomolecules = set(model.node_dict.keys())\n", "    # `ext` is the target place of a transition: these places have incoming transitions\n", "    targeted_biomolecules = set(transition.ext.name for transition in model.transition_list)\n", "\n", "    boundaries = all_biomolecules - targeted_biomolecules\n", "\n", "    return boundaries\n", "\n", "boundaries = get_boundaries(model)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Filter data\n", "\n", "At this point we are interested in the locations of the boundary elements, focusing on those of the Complex type.\n", "Each place in the model has metadata serialized as JSON (http://www.json.org), accessible via the *note* attribute of these objects. This attribute is described in [the Cadbiom file format specification](http://cadbiom.genouest.org/doc/cadbiom/file_format_specification.html#metadata-content-elements)."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import json\n", "from collections import Counter\n", "\n", "locations_count = Counter()\n", "for boundary in boundaries:\n", "    try:\n", "        json_data = json.loads(model.node_dict[boundary].note)\n", "    except ValueError:\n", "        # Absence of metadata\n", "        if \"__start__\" in boundary:\n", "            # Skip virtual nodes\n", "            continue\n", "        json_data = dict()\n", "\n", "    # Keep only complexes\n", "    if json_data.get(\"entityType\") != \"Complex\":\n", "        continue\n", "\n", "    # Count the location; fall back to \"unknown\" when it is missing\n", "    locations_count[json_data.get(\"location\", \"unknown\")] += 1\n", "\n", "print(\"Retrieved locations:\", locations_count)\n", "\n", "# Compute the percentage of each location\n", "nb_locations = sum(locations_count.values())\n", "locations_percents = {k: float(v * 100) / nb_locations for k, v in locations_count.items()}\n", "\n", "print(\"Locations percentages:\", locations_percents)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Display: Pie charts visualization\n", "\n", "The data must be sorted in descending order to display properly.\n", "Let's write a function to avoid code duplication."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "\n", "def make_pandas_pie_chart(locations_percents, nb_values=None, legend=True, title=\"\"):\n", "    \"\"\"Display a pie chart with the help of pandas\n", "\n", "    :key nb_values: Number of boundaries considered (for the title)\n", "    :key legend: Enable the legend (default: True)\n", "    :key title: Figure title\n", "    :type nb_values: int\n", "    :type legend: bool\n", "    :type title: str\n", "    \"\"\"\n", "    # Title formatting\n", "    title = title % nb_values if nb_values else title\n", "\n", "    # Sort values in descending order\n", "    # (list() keeps this working when values()/keys() return views)\n", "    series = pd.Series(\n", "        data=list(locations_percents.values()),\n", "        index=list(locations_percents.keys()), name=\"\"\n", "    ).sort_values(ascending=False)\n", "\n", "    # Draw pie chart\n", "    size = 0.25\n", "    plot = series.plot.pie(\n", "        legend=legend,\n", "        autopct='%1.0f%%', # Display percentages without decimals\n", "        pctdistance=0.85,\n", "        colormap=\"Spectral\", # Pastel1, Set2, Spectral\n", "        radius=1.2-size, # Fix size for the central circle\n", "        wedgeprops=dict(width=size, edgecolor='w'), # Draw central circle\n", "        figsize=(9, 9),\n", "        title=title,\n", "    )\n", "\n", "\n", "make_pandas_pie_chart(\n", "    locations_percents,\n", "    title=\"Localizations of Complex boundaries in the PID database\"\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Data cleaning\n", "\n", "The data is somewhat noisy; let's group the less frequent locations together and merge similar locations.\n", "\n", "Note that computing percentages is not mandatory for the pie chart, but it allows\n", "us to remove the smallest locations easily."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "nb_locations = sum(locations_count.values())\n", "\n", "# Compute the percentage of each location and group items\n", "locations_percents = Counter()\n", "for location, count in locations_count.items():\n", "    percentage = float(count * 100) / nb_locations\n", "\n", "    if percentage > 1:\n", "        # Merge similar groups\n", "        if \"unknown\" in location:\n", "            locations_percents[\"unknown\"] += percentage\n", "        elif \"cyto\" in location:\n", "            locations_percents[\"cytosol\"] += percentage\n", "        else:\n", "            locations_percents[location] = percentage\n", "    else:\n", "        # Merge less frequent groups\n", "        locations_percents[\"other\"] += percentage\n", "\n", "print(\"Locations percentages:\", locations_percents)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Reuse the function previously defined\n", "make_pandas_pie_chart(\n", "    locations_percents,\n", "    nb_values=nb_locations,\n", "    legend=False,\n", "    title=\"Localizations of Complex boundaries in the PID database (%s)\"\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Pie chart alternative with Matplotlib\n", "\n", "Among many other things, pandas acts as a wrapper around Matplotlib when it comes to data visualization.\n", "\n", "Here is an alternative written entirely with Matplotlib, so you can easily spot the verbose boilerplate that pandas hides with a certain elegance."
] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "from operator import itemgetter\n", "from numpy import arange\n", "from matplotlib import cm\n", "\n", "# Data must be sorted before use\n", "sorted_locations = sorted(locations_percents.items(), key=itemgetter(1), reverse=True)\n", "unzip = lambda l: list(zip(*l))\n", "locations_names, values = unzip(sorted_locations)\n", "\n", "# Colors must be precomputed from the colormap\n", "# The goal is to generate 1 color taken at regular intervals for each element\n", "# Build an array having the size n of the data to be displayed,\n", "# and containing the values from 0 to n-1;\n", "# Divide each value by n\n", "colors = cm.Spectral(arange(len(locations_percents)) / float(len(locations_percents)))\n", "\n", "# Draw pie chart\n", "fig, ax = plt.subplots()\n", "size = 0.25\n", "wedges, texts, autotexts = plt.pie(\n", "    values, labels=locations_names,\n", "    pctdistance=0.85,\n", "    autopct='%1.0f%%',\n", "    colors=colors,\n", "    radius=1.2-size, # Fix size for the central circle\n", "    wedgeprops=dict(width=size, edgecolor='w'), # Draw central circle\n", ")\n", "\n", "# Draw the legend\n", "plt.legend(\n", "    wedges, locations_names,\n", "    title=\"Locations\",\n", "    loc=\"upper right\",\n", "    bbox_to_anchor=(0.5, 0.5, 0.5, 0.5)\n", ")\n", "\n", "# Set title\n", "fig.suptitle(\"Localizations of Complex boundaries in the PID database\")\n", "\n", "# Adjust size\n", "fig.set_size_inches(7, 7)\n", "\n", "# Equal aspect ratio ensures that pie is drawn as a circle\n", "plt.axis('equal')\n", "plt.tight_layout()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Another database\n", "\n", "### A little tidying up\n", "\n", "We now have a function to load a Cadbiom model, `load_model()`, and a function to display data, `make_pandas_pie_chart()`.\n", "Let's write a function for filtering boundaries."
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def filter_complex_boundaries(model):\n", "    \"\"\"Filter Complex boundaries from the given model and return their locations.\n", "\n", "    - Keep Complexes\n", "    - Merge unknown and cytosol locations\n", "    - Merge smallest locations (<= 1%)\n", "\n", "    :return: Tuple: Dict of locations with location names as keys and\n", "        percentages as values; + Number of boundaries considered.\n", "    :rtype: tuple(dict, int)\n", "    \"\"\"\n", "    # Get boundaries set\n", "    boundaries = get_boundaries(model)\n", "\n", "    locations_count = Counter()\n", "    for boundary in boundaries:\n", "        try:\n", "            json_data = json.loads(model.node_dict[boundary].note)\n", "        except ValueError:\n", "            # Absence of metadata\n", "            if \"__start__\" in boundary:\n", "                # Skip virtual nodes\n", "                continue\n", "            json_data = dict()\n", "\n", "        # Keep only Complexes\n", "        if json_data.get(\"entityType\") != \"Complex\":\n", "            continue\n", "\n", "        # Count the location; fall back to \"unknown\" when it is missing\n", "        locations_count[json_data.get(\"location\", \"unknown\")] += 1\n", "\n", "    print(\"Retrieved locations:\", locations_count)\n", "\n", "\n", "    nb_locations = sum(locations_count.values())\n", "\n", "    # Compute the percentage of each location and group items\n", "    locations_percents = Counter()\n", "    for location, count in locations_count.items():\n", "        percentage = float(count * 100) / nb_locations\n", "\n", "        if percentage > 1:\n", "            # Merge similar groups\n", "            if \"unknown\" in location:\n", "                locations_percents[\"unknown\"] += percentage\n", "            elif \"cyto\" in location:\n", "                locations_percents[\"cytosol\"] += percentage\n", "            else:\n", "                locations_percents[location] = percentage\n", "        else:\n", "            # Merge less frequent groups\n", "            locations_percents[\"other\"] += percentage\n", "\n", "    print(\"Locations percentages:\", locations_percents)\n", "\n", "    return locations_percents, nb_locations\n" ] }, {
"cell_type": "markdown", "metadata": {}, "source": [ "### ACSN database\n", "\n", "Let's try our new workflow on another model..." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "make_pandas_pie_chart(\n", " *filter_complex_boundaries(\n", " load_model(\"_static/demo_files/models_mars_2020/model_acsn_without_scc.bcx\")\n", " ),\n", " legend=False,\n", " title=\"Localizations of Complex boundaries in the ACSN database (%s)\"\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Notes \n", "\n", "**About the use of Jupyter showed in this page:**\n", "\n", "Jupyter is a fancy tool but it allows to execute Python code block by block,\n", "in a global context (i.e., with variables that persist and will be mutated in that context,\n", "execution after execution). This is a very bad working practice that is\n", "however encouraged by this kind of tool and by IDEs unfortunately offered\n", "to beginners (Spyder for example).\n", "\n", "These methods are directly inherited from the practices of the community\n", "using the R language and the RStudio \"IDE\". To avoid side effects such as\n", "persistence of variables, one MUST reset the console/notebook between runs\n", "by reloading the kernel as often as possible.\n", "Whilst this may seem redundant or heavy, it's an extremely effective method\n", "of reducing unwanted side effects and bugs in your code." ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.9" } }, "nbformat": 4, "nbformat_minor": 2 }