Author Archives: Donald Doherty

ChemNetDB: Fifty Years of Rat Brain Connection and Neurotransmitter Data

Nineteen large-scale brain regions (listed) partition a 125-node cerebral connectome in panel a. The connectomes in panels b through f each match the color of a neurotransmitter listed at lower right. Figure 1 from “A multi-scale cerebral neurochemical connectome of the rat brain” published July 3, 2017 in PLOS Biology.

Since the 1960s a lot of scientists have worked very hard, and a lot of rats have given their all, to trace brain connections and their neurotransmitters. Fundamental to understanding how the brain works is knowing how it is wired and how signals are transmitted along those wires. The authors of a new paper, “A multi-scale cerebral neurochemical connectome of the rat brain” (published July 3, 2017 in PLOS Biology), identified 1,560 original research articles from the past 50 years with high-quality connectivity data from 36,464 rats, and transformed those data into a multi-scale atlas of rat brain connections and their neurotransmitters.

The authors point to three main advantages ChemNetDB has over other existing databases. First, they claim that ChemNetDB is the most comprehensive database currently in existence of rat connectivity data from the last 50 years of research. Second, they used a transparent, consistent, and validated method to integrate neurochemical information with the connectivity data. And third, they used data from animals of a consistent age along with transparent and consistent terminology.

I was very excited to come across their paper. However, on visiting the site chemnetdb.org I was immediately struck by how limited data access is. Perhaps there is a data endpoint or an Application Programming Interface (API) that I have missed. To my knowledge, the best you can currently do on the site is to search for a brain area or structure and get a list of the areas connecting with it, along with each connection’s neurotransmitters. Or the reverse: you may search on a neurotransmitter. References associated with the connections are displayed.

The authors may have future plans but no hints have been provided. These data would be valuable as part of the rich set of life sciences linked-data available across the Internet. Their rat connectivity data could be associated with a rat brain anatomy ontology and a huge and growing number of other relevant ontologies, and opened up through SPARQL endpoints. Then there would be a world of possibilities!

Genes, Pathways, and Autistic Spectrum Disorder

Interactions between the top 50 pathways from 659 Autistic Spectrum Disorder (ASD) genes are displayed. Pathways are grouped as Disease Pathways (left column; purple) and Functional Pathways (right column; green). The color of each node inside the groupings represents the p-value of that pathway (p-value bar at lower left). The size of each node represents the number of ASD genes in that pathway. From Figure 2 in “Pathway Network Analysis for Autism Reveal Multisystem Involvement, Major Overlaps with Other Diseases and Convergence upon MAPK and Calcium Signalling” published April 7, 2016 in PLOS One.

Autistic Spectrum Disorder (ASD) is defined behaviorally and has no known biological marker. Does ASD encompass a collection of disorders? Are they necessarily disorders or simply differences that may become disorders under some biological or environmental circumstances? We simply do not know for certain.

Research has shown that hundreds of genes, multitudes of biological pathways, and many systems seem to contribute to ASD. If all of these data are brought together and visualized, will they provide a clue to the biological basis of ASD? The recent publication “Pathway Network Analysis for Autism Reveal Multisystem Involvement, Major Overlaps with Other Diseases and Convergence upon MAPK and Calcium Signalling” (April 7, 2016 in PLOS One) attempts this approach with genes and pathways known to be active in ASD patients.

When this study began, 667 genes were thought to contribute to ASD. Of these, 659 genes were in the annotation set the researchers were using to do analysis. They examined the interactions among the pathways associated with these genes and selected the top 50 pathways based on the statistical significance of pathway overlaps. Their results are displayed in the figure above.

Notice that the figure is divided into two columns. At left is the Disease Pathways column (purple). At right is the Functional Pathways column (green). Each node in a pathway grouping has a size and color. The size represents the number of ASD genes in that pathway (larger diameter means more genes). The color of each node represents the p-value of that pathway (p-value bar at lower left; red indicates a very high significance of overlap).

The data showed that the MAPK signaling pathway interacts with half of the pathways in the network and is the most interactive pathway in the ASD data. The authors also showed that the calcium signaling pathway is the second most interactive pathway and is associated with the most ASD genes. These two pathway types overlapped with 8 ASD genes (green intersection) known to be important in the process of calcium-PKC-Ras-Raf-MAPK/ERK. From Figure 3 in “Pathway Network Analysis for Autism Reveal Multisystem Involvement, Major Overlaps with Other Diseases and Convergence upon MAPK and Calcium Signalling.”

The resulting map shows three “hot” spots that are both large in diameter (a lot of ASD genes) and red (very high significance of overlap) under the Neural and Cell Signaling functional pathways and the Cancer disease pathways. Specifically, the following three nodes stand out:

1. Neuro-active ligand-receptor interaction

2. Calcium signaling pathway

3. Collection cancer

Neuro-active ligand-receptor interaction is key to communication between each and every nerve cell, both in our brains and throughout our bodies. The calcium signaling pathway is fundamental to each and every cell in our bodies. The genes and pathways were identified in ASD patients because their products and interactions were significantly different than in non-ASD patients. Do those diagnosed with ASD have significantly different metabolisms than the rest of us? Do their brain cells communicate differently?

Research based on pulling together existing data is implicitly biased by the experiments that were performed to produce those data. Perhaps researchers were most interested in cell signaling and cancer genes and pathways, and that is the reason these kinds of pathways are more common in data repositories. The result could be their prominent association with ASD genes.

Using R to Query Biomedical Data Distributed Across the Semantic Web

Figure 1. The Gene Expression Atlas RDF Schema. Image courtesy of EMBL-EBI.

A key idea behind the Semantic Web of linked-data is to enable seamless use of data from diverse sources across the Internet. Biological pathways data come from a huge number of laboratories from around the world and are captured and saved in a lot of different ways. We can use SPARQL endpoints to bring data together from any number of data repositories.

In this post we’ll use the R package SPARQL to pull biological pathway data from WikiPathways while at the same time pulling the genes expressed in these pathways from the Gene Expression Atlas.

You may find these related posts helpful: “WikiPathways: Open Biological Pathways Data on the Semantic Web” from June 27, 2017 and “Update on R and Semantic Web Technologies” from June 30, 2017.

The Gene Expression Atlas provides data on what genes or proteins are expressed in a particular species under specific conditions. It also provides data on differential expression, the increase or decrease of gene expression or protein production under specific conditions.

Let’s look for human biological pathways that show different gene activity under Alzheimer’s disease than under normal circumstances. Differences are indicated by significant increases or decreases in gene expression.

First, look up the identifier for Alzheimer’s disease using the EMBL-EBI Ontology Lookup Service.

Figure 2. EMBL-EBI Ontology Lookup Service

Type Alzheimer’s disease into the Search EFO search box in the upper right area of the EMBL-EBI Ontology Lookup Service page (see Figure 2 above).

EFO stands for Experimental Factor Ontology.

A large list of factors associated with Alzheimer’s disease appears, but the one we’re interested in, listed as Alzheimer’s disease, should be on the first page and provide the identifier EFO:0000249.

The following discussion assumes that you’ve installed and loaded the R package SPARQL. If not, please see the June 30, 2017 post “Update on R and Semantic Web Technologies.”

Assign the WikiPathways SPARQL endpoint URL to the endpoint variable in your R environment.

endpoint <- 'http://sparql.wikipathways.org'

Next, assign a SPARQL query to the query variable, as in the following code snippet.

query <- 'PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX wp: <http://vocabularies.wikipathways.org/wp#>
PREFIX identifiers: <http://identifiers.org/ensembl/>
PREFIX atlas: <http://rdf.ebi.ac.uk/resource/atlas/>
PREFIX atlasterms: <http://rdf.ebi.ac.uk/terms/atlas/>
PREFIX efo: <http://www.ebi.ac.uk/efo/>

SELECT DISTINCT ?wpURL ?pwTitle ?expressionValue ?pvalue WHERE {
  SERVICE <https://www.ebi.ac.uk/rdf/services/atlas/sparql> {
    ?factor rdf:type efo:EFO_0000249 .
    ?value atlasterms:hasFactorValue ?factor .
    ?value atlasterms:isMeasurementOf ?probe .
    ?value atlasterms:pValue ?pvalue .
    ?value rdfs:label ?expressionValue .
    ?probe atlasterms:dbXref ?dbXref .
  }
  ?pwElement dcterms:isPartOf ?pathway .
  ?pathway dc:title ?pwTitle .
  ?pathway dc:identifier ?wpURL .
  ?pwElement wp:bdbEnsembl ?dbXref .
}
ORDER BY ASC(?pvalue)'

The full statement above is sent to the WikiPathways SPARQL endpoint (the URL assigned to the endpoint variable). However, the triple patterns in the embedded SERVICE statement are forwarded to the Gene Expression Atlas SPARQL endpoint (the URL following SERVICE).

The first triple inside the SERVICE statement sets the ?factor variable to the Alzheimer’s disease identifier you found above. Notice that the colon was swapped out for an underscore so that EFO:0000249 became EFO_0000249.
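The colon-to-underscore swap is easy to script if you look up several terms. The following is a minimal sketch in base R; the helper names efo_local_name and factor_pattern are made up for this illustration and are not part of the SPARQL package.

```r
# Hypothetical helper: convert an identifier as displayed by the lookup
# service, such as "EFO:0000249", into the local name that follows the
# efo: prefix in the query, such as "EFO_0000249".
efo_local_name <- function(curie) {
  # fixed = TRUE treats ":" as a literal character, not a regular expression
  sub(":", "_", curie, fixed = TRUE)
}

# Hypothetical helper: build the first triple pattern of the SERVICE
# statement for a given identifier.
factor_pattern <- function(curie) {
  sprintf("?factor rdf:type efo:%s .", efo_local_name(curie))
}

efo_local_name("EFO:0000249")
factor_pattern("EFO:0000249")
```

The second call returns the same text as the first triple in the query above, so the pattern could be pasted into a query string for a different disease term.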

The second through fifth triples all have the same subject, which is assigned to the ?value variable. The first three of these use predicates from the atlasterms namespace. Look at the lower right quadrant of the Gene Expression Atlas RDF Schema (see Figure 1 above). Each of the three Gene Expression Atlas specific predicates (hasFactorValue, isMeasurementOf, and pValue) has atlas:DifferentialExpressionRatio as its subject.

The final triple also uses a predicate from the atlasterms namespace. dbXref is a predicate of atlas:ProbeDesignElement that is itself the object pointed to by isMeasurementOf from atlas:DifferentialExpressionRatio. The dbXref predicate provides a bridge across various databases.
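One way to picture the bridge is as a join: because ?dbXref appears both inside the SERVICE statement and in the WikiPathways patterns, only genes known to both sources survive. The sketch below mimics that with two toy data frames and base R's merge function. Every value in it is invented for illustration and does not come from either endpoint.

```r
# Toy stand-in for what the Gene Expression Atlas side contributes.
# The gene identifiers and p-values are made up.
atlas_side <- data.frame(
  dbXref = c("http://identifiers.org/ensembl/ENSG_A",
             "http://identifiers.org/ensembl/ENSG_B"),
  pvalue = c(0.001, 0.02),
  stringsAsFactors = FALSE
)

# Toy stand-in for what the WikiPathways side contributes.
# The pathway titles are made up.
pathway_side <- data.frame(
  dbXref  = c("http://identifiers.org/ensembl/ENSG_A",
              "http://identifiers.org/ensembl/ENSG_C"),
  pwTitle = c("Toy pathway 1", "Toy pathway 2"),
  stringsAsFactors = FALSE
)

# The shared variable behaves like an inner join key: only rows whose
# dbXref value occurs on both sides are kept.
joined <- merge(atlas_side, pathway_side, by = "dbXref")
joined
```

Only the gene present in both toy tables remains after the merge, which is the role the shared ?dbXref variable plays in the federated query.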

Finally, call the SPARQL function to run the query and assign the results to the data variable.

data <- SPARQL(endpoint, query)

Run the R summary function on the data assigned to the data variable to see how data were returned by SPARQL.

summary(data)

SPARQL returned an R data frame and assigned it to data$results.

Take a peek at the first six rows of the results by running the R head function on the data frame assigned to data$results.

head(data$results)

Figure 3. Using R and the R package SPARQL to federate data from WikiPathways and the Gene Expression Atlas

Refer to Figure 3 above to see all interactions with the R console for this exercise. As you go through the analysis, you’ll find that all but 2 of the 84 pathways show decreased gene expression in Alzheimer’s disease. Two pathways show increased COL27A1 gene expression.
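A tally like this can be done with a few lines of base R. The sketch below runs on a toy data frame rather than on real query results, and it assumes that the text in the expressionValue column contains the word UP or DOWN. That label format is an assumption, so check it against head(data$results) before adapting the code.

```r
# Toy stand-in for data$results. The pathway titles and labels are made up,
# and the "UP"/"DOWN" wording of the labels is an assumption.
toy_results <- data.frame(
  pwTitle = c("Toy pathway 1", "Toy pathway 1", "Toy pathway 2", "Toy pathway 3"),
  expressionValue = c("GENE1 DOWN", "GENE2 DOWN", "GENE3 UP", "GENE4 DOWN"),
  stringsAsFactors = FALSE
)

# Classify each row by the direction word found in its label.
is_up <- grepl("\\bUP\\b", toy_results$expressionValue, perl = TRUE)
toy_results$direction <- ifelse(is_up, "increased", "decreased")

# Keep one row per pathway and direction, then count pathways per direction.
pathway_directions <- unique(toy_results[, c("pwTitle", "direction")])
direction_counts <- table(pathway_directions$direction)
direction_counts
```

Removing duplicates before counting matters because a pathway with several affected genes appears on several rows of the results.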

Mining the world’s linked-data is relatively easy using R and the R package SPARQL. You must be proficient in SPARQL, be able to navigate ontologies, and know where the SPARQL endpoints are. You did all of this to pull pathway and differential gene expression data associated with Alzheimer’s disease and used them as a single integrated dataset!