Donald Doherty's blog

Category: Open Data

FlyBase Revisited: Ensuring Data Access

Figure 1. The FlyBase website provides powerful search capabilities across fruit fly linked-data. Their current plight, due to reduction in funding, highlights the issue of how to ensure data access.

Yesterday while reading the new research paper on the fully imaged female fruit fly brain (see Largest Brain Comprehensively Imaged at Nanoscale Resolution), I thought back to blog posts I wrote about seven years ago on fruit fly online data repositories, ontologies, and atlases (see “Related blog posts” below).

The FlyBase Consortium has been a pioneering organization behind FlyBase and the Virtual Fly Brain, a linked-data repository and a visual atlas, respectively. Together they provide a powerful toolset to use with the complete set of fruit fly brain images, taken at electron microscope resolution, that was recently released to the public domain. In fact, the new data appear to be getting incorporated into release 2 of the Virtual Fly Brain (now in beta testing).

Possibilities were bubbling in my mind as I loaded FlyBase. Bubbles froze in mid-mind as my retina relayed the image above (Figure 1) to my cortex. Tools and data repositories we love and depend on may, at any time, become difficult to access or disappear entirely. Any number of factors may be behind restrictions or shutdowns, including loss of funding, the retirement of a repository advocate, or a government deciding that the data should be proprietary.

The people at FlyBase clearly had no alternative but to turn to fee-based access to keep their service online. Our global scientific community should seek a solution that eliminates the need to face this choice and ensures global access to foundational tools and data. Science is a global effort. Scientific data should be included, along with scientific knowledge, in the commons for all humanity.

Related blog posts:

Largest Brain Comprehensively Imaged at Nanoscale Resolution (July 28, 2018).

A Virtual Fly Brain (April 16, 2011).

Viewing the Fly Brain Connectome with Brainbow (February 11, 2011).

How the Brain Works, Flies, and the FlyBase Online Data Repository (November 4, 2010).

July 29, 2018
ChemNetDB: Fifty Years of Rat Brain Connection and Neurotransmitter Data

Nineteen large-scale brain regions (listed) partition a 125-node cerebral connectome in panel a. The connectomes in panels b through f each match the color of a neurotransmitter listed at lower right. Figure 1 from “A multi-scale cerebral neurochemical connectome of the rat brain” published July 3, 2017 in PLOS Biology.

Since the 1960s a lot of scientists have worked very hard, and a lot of rats have given their all, to trace brain connections and their neurotransmitters. Fundamental to understanding how the brain works is knowing how it is wired and how signals are transmitted along those wires. The authors of a new paper, “A multi-scale cerebral neurochemical connectome of the rat brain” (published July 3, 2017 in PLOS Biology), identified 1,560 original research articles from the past 50 years with high-quality connectivity data from 36,464 rats, and transformed those data into a multi-scale atlas of rat brain connections and their neurotransmitters.

The authors point to three main features ChemNetDB has over other existing databases. First, they claim that ChemNetDB is the most comprehensive rat connectivity database of the last 50 years of research data currently in existence. Second, they used a transparent, consistent, and validated method to integrate neurochemical information with the connectivity data. And third, they used data from animals of a consistent age along with transparent and consistent terminology.

I was very excited on coming across and reading their paper. However, on visiting the site chemnetdb.org I was immediately struck by the very limited data access. Perhaps there is a data endpoint or an Application Programming Interface (API) that I have not found. To my knowledge, the best you can currently do on the site is search for a brain area or structure and get a list of the areas connecting with it, along with each connection’s neurotransmitters. Or the reverse: you may search on a neurotransmitter. References associated with the connections are displayed.

The authors may have future plans, but no hints have been provided. These data would be valuable as part of the rich set of life sciences linked-data available across the Internet. Their rat connectivity data could be associated with a rat brain anatomy ontology, and with a huge and growing number of other relevant ontologies, and opened up through SPARQL endpoints. Then there would be a world of possibilities!

July 14, 2017
Using R to Query Biomedical Data Distributed Across the Semantic Web

Figure 1. The Gene Expression Atlas RDF Schema. Image courtesy of EMBL-EBI.

A key idea behind the Semantic Web of linked-data is to enable seamless use of data from diverse sources across the Internet. Biological pathways data come from a huge number of laboratories from around the world and are captured and saved in a lot of different ways. We can use SPARQL endpoints to bring data together from any number of data repositories.

In this post we’ll use the R package SPARQL to pull biological pathway data from WikiPathways while at the same time pulling the genes expressed in these pathways from the Gene Expression Atlas.

You may find these related posts helpful: “WikiPathways: Open Biological Pathways Data on the Semantic Web” from June 27, 2017 and “Update on R and Semantic Web Technologies” from June 30, 2017.

The Gene Expression Atlas provides data on what genes or proteins are expressed in a particular species under specific conditions. It also provides data on differential expression, the increase or decrease of gene expression or protein production under specific conditions.

Let’s look for human biological pathways that show different gene activity under Alzheimer’s disease than under normal circumstances. Differences are indicated by significant increases or decreases in gene expression.

First, look up the identifier for Alzheimer’s disease using the EMBL-EBI Ontology Lookup Service.

Figure 2. EMBL-EBI Ontology Lookup Service

Type Alzheimer’s disease into the Search EFO search box in the upper right area of the EMBL-EBI Ontology Lookup Service page (see Figure 2 above).

EFO stands for Experimental Factor Ontology.

A large list of factors associated with Alzheimer’s disease appears, but the one we’re interested in, listed as Alzheimer’s disease, should be on the first page and provide the identifier EFO:0000249.

The following discussion assumes that you’ve installed and loaded the R package SPARQL. If not, please see the June 30, 2017 post “Update on R and Semantic Web Technologies.”

Assign the WikiPathways SPARQL endpoint URL to the endpoint variable in your R environment.
endpoint <- 'http://sparql.wikipathways.org'
Next, assign a SPARQL query to the query variable, as in the following code snippet.
query <- 'PREFIX identifiers: <http://identifiers.org/ensembl/>
PREFIX atlas: <http://rdf.ebi.ac.uk/resource/atlas/>
PREFIX atlasterms: <http://rdf.ebi.ac.uk/terms/atlas/>
PREFIX efo: <http://www.ebi.ac.uk/efo/>

SELECT DISTINCT ?wpURL ?pwTitle ?expressionValue ?pvalue
where {
  SERVICE <https://www.ebi.ac.uk/rdf/services/atlas/sparql> {
    ?factor rdf:type efo:EFO_0000249 .
    ?value atlasterms:hasFactorValue ?factor .
    ?value atlasterms:isMeasurementOf ?probe .
    ?value atlasterms:pValue ?pvalue .
    ?value rdfs:label ?expressionValue .
    ?probe atlasterms:dbXref ?dbXref .
  }
  ?pwElement dcterms:isPartOf ?pathway .
  ?pathway dc:title ?pwTitle .
  ?pathway dc:identifier ?wpURL .
  ?pwElement wp:bdbEnsembl ?dbXref .
}
ORDER BY ASC(?pvalue)'
The full statement above is sent to the WikiPathways SPARQL endpoint (the URL assigned to the endpoint variable). However, the triple patterns in the embedded SERVICE statement are forwarded to the Gene Expression Atlas SPARQL endpoint (the URL following SERVICE).
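
The query above leans on the endpoint to supply several prefixes (rdf, rdfs, dc, dcterms, and wp) that it never declares. If you would rather not depend on that, here is a minimal sketch of a helper that assembles the same query with every prefix spelled out, and lets you swap in a different EFO term. The function name build_pathway_query and its arguments are hypothetical, and the namespace IRIs for rdf, rdfs, dc, dcterms, and wp are my assumptions (they do not appear in the original query), so check them against each endpoint's documentation before relying on them.

```r
# Sketch only: build_pathway_query() is a hypothetical helper, not part of the
# SPARQL package. It returns the query text; it does not contact any endpoint.
build_pathway_query <- function(
    efo_term = "EFO_0000249",
    atlas_endpoint = "https://www.ebi.ac.uk/rdf/services/atlas/sparql") {
  # Assumption: these are the namespace IRIs the endpoints expect.
  prefixes <- c(
    rdf        = "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    rdfs       = "http://www.w3.org/2000/01/rdf-schema#",
    dc         = "http://purl.org/dc/elements/1.1/",
    dcterms    = "http://purl.org/dc/terms/",
    wp         = "http://vocabularies.wikipathways.org/wp#",
    atlasterms = "http://rdf.ebi.ac.uk/terms/atlas/",
    efo        = "http://www.ebi.ac.uk/efo/"
  )
  # One "PREFIX name: <iri>" line per entry.
  prefix_block <- paste0("PREFIX ", names(prefixes), ": <", prefixes, ">",
                         collapse = "\n")
  # The two %s placeholders are the Atlas endpoint URL and the EFO term.
  body <- sprintf(
"SELECT DISTINCT ?wpURL ?pwTitle ?expressionValue ?pvalue
WHERE {
  SERVICE <%s> {
    ?factor rdf:type efo:%s .
    ?value atlasterms:hasFactorValue ?factor .
    ?value atlasterms:isMeasurementOf ?probe .
    ?value atlasterms:pValue ?pvalue .
    ?value rdfs:label ?expressionValue .
    ?probe atlasterms:dbXref ?dbXref .
  }
  ?pwElement dcterms:isPartOf ?pathway .
  ?pathway dc:title ?pwTitle .
  ?pathway dc:identifier ?wpURL .
  ?pwElement wp:bdbEnsembl ?dbXref .
}
ORDER BY ASC(?pvalue)", atlas_endpoint, efo_term)
  paste(prefix_block, body, sep = "\n")
}

query_text <- build_pathway_query()
cat(query_text)
```

The returned string can be assigned to the query variable in place of the hand-written one. Keeping the disease term as an argument means the same pattern can be pointed at another EFO class without editing the query text.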

The first triple inside the SERVICE statement matches the ?factor variable to anything whose type is the Alzheimer’s disease identifier you found above. Notice that the colon was swapped out for an underscore so that EFO:0000249 became EFO_0000249.
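
The colon-to-underscore swap is easy to get wrong by hand, so here is a small sketch that does it for you. The helper name efo_id_to_iri is hypothetical; the base IRI is the one bound to the efo prefix in the query.

```r
# Hypothetical helper: convert an identifier as shown by the Ontology Lookup
# Service (e.g. "EFO:0000249") into the local name and full IRI used in SPARQL.
efo_id_to_iri <- function(curie, base = "http://www.ebi.ac.uk/efo/") {
  # Guard against input that is not of the form PREFIX:digits.
  stopifnot(grepl("^[A-Za-z]+:[0-9]+$", curie))
  local_name <- sub(":", "_", curie, fixed = TRUE)
  list(local_name = local_name, iri = paste0(base, local_name))
}

alz <- efo_id_to_iri("EFO:0000249")
alz$local_name  # "EFO_0000249"
alz$iri         # "http://www.ebi.ac.uk/efo/EFO_0000249"
```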

The second through fifth triples all share the same subject, which is assigned to the ?value variable. The first three of these use predicates from the atlasterms namespace. Look at the lower right quadrant of the Gene Expression Atlas RDF Schema (see Figure 1 above). Each of the three Gene Expression Atlas specific predicates (hasFactorValue, isMeasurementOf, and pValue) has atlas:DifferentialExpressionRatio as its subject.

The final triple also uses a predicate from the atlasterms namespace. dbXref is a predicate of atlas:ProbeDesignElement that is itself the object pointed to by isMeasurementOf from atlas:DifferentialExpressionRatio. The dbXref predicate provides a bridge across various databases.

Finally, enter the SPARQL function to carry out our query and assign our results to the data variable.
data <- SPARQL(endpoint, query)

Run the R summary function on the data assigned to the data variable to see how data were returned by SPARQL.
summary(data)

SPARQL returned an R data frame and assigned it to data$results.

Take a peek at the first six rows of the results by running the R head function on the data frame assigned to data$results.
head(data$results)

Figure 3. Using R and the R package SPARQL to federate data from WikiPathways and the Gene Expression Atlas

Refer to Figure 3 above to see all interactions with the R console for this exercise. As you go through the results, you’ll find that all but 2 of the 84 pathways show decreased gene expression in Alzheimer’s disease. Two pathways show increased COL27A1 gene expression.
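
If you want to do that tally in R rather than by eye, something like the following sketch may help. It runs on a toy data frame with the same column names as data$results; the rows are invented placeholders, not output from either endpoint. It also assumes that each expressionValue label contains the word UP or DOWN, which you should confirm against your own results before trusting the counts.

```r
# Toy stand-in for data$results. These rows are invented placeholders for
# illustration only; swap in data$results to work with the real output.
results <- data.frame(
  pwTitle         = c("Pathway A", "Pathway A", "Pathway B", "Pathway C"),
  expressionValue = c("GENE1 DOWN", "GENE2 DOWN", "GENE3 UP", "GENE4 DOWN"),
  pvalue          = c(0.001, 0.004, 0.010, 0.020),
  stringsAsFactors = FALSE
)

# Assumption: each label contains the whole word UP or DOWN.
results$direction <- NA_character_
results$direction[grepl("\\bUP\\b", results$expressionValue, perl = TRUE)] <- "UP"
results$direction[grepl("\\bDOWN\\b", results$expressionValue, perl = TRUE)] <- "DOWN"

# Number of result rows per pathway and direction.
direction_by_pathway <- table(results$pwTitle, results$direction)

# Number of distinct pathways with at least one row in each direction.
pathways_per_direction <- tapply(
  results$pwTitle, results$direction,
  function(titles) length(unique(titles))
)

print(direction_by_pathway)
print(pathways_per_direction)
```

Rows whose label matches neither word keep a direction of NA and drop out of both tallies, so a quick look at sum(is.na(results$direction)) tells you whether the assumption about the labels holds.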

Mining the world’s linked-data is relatively easy using R and the R package SPARQL. You must be proficient in SPARQL, be able to navigate ontologies, and know where the SPARQL endpoints are. You did all of this to pull pathway and differential gene expression data associated with Alzheimer’s disease and used them as a single integrated dataset!

July 3, 2017