Using R to Query Biomedical Data Distributed Across the Semantic Web

Figure 1. The Gene Expression Atlas RDF Schema. Image courtesy of EMBL-EBI.

A key idea behind the Semantic Web and linked data is to enable seamless use of data from diverse sources across the Internet. Biological pathway data come from a large number of laboratories around the world and are captured and stored in many different ways. SPARQL endpoints let us bring data together from any number of data repositories.

In this post we’ll use the R package SPARQL to pull biological pathway data from WikiPathways while at the same time pulling the genes expressed in these pathways from the Gene Expression Atlas.

You may find these related posts helpful: “WikiPathways: Open Biological Pathways Data on the Semantic Web” from June 27, 2017 and “Update on R and Semantic Web Technologies” from June 30, 2017.

The Gene Expression Atlas provides data on what genes or proteins are expressed in a particular species under specific conditions. It also provides data on differential expression, the increase or decrease of gene expression or protein production under specific conditions.

Let’s look for human biological pathways that show different gene activity under Alzheimer’s disease than under normal circumstances. Differences are indicated by significant increases or decreases in gene expression.

First, look up the identifier for Alzheimer’s disease using the EMBL-EBI Ontology Lookup Service.

Figure 2. EMBL-EBI Ontology Lookup Service

Type Alzheimer’s disease into the Search EFO search box in the upper right area of the EMBL-EBI Ontology Lookup Service page (see Figure 2 above).

EFO stands for Experimental Factor Ontology.

A long list of factors associated with Alzheimer’s disease appears, but the one we’re interested in, listed as Alzheimer’s disease, should be on the first page with the identifier EFO:0000249.

The following discussion assumes that you’ve installed and loaded the R package SPARQL. If not, please see the June 30, 2017 post “Update on R and Semantic Web Technologies.”

Assign the WikiPathways SPARQL endpoint URL to the endpoint variable in your R environment.

endpoint <- ''

Next, assign a SPARQL query to the query variable, as in the following code snippet.

query <- 'PREFIX identifiers: <>
PREFIX atlas: <>
PREFIX atlasterms: <>
PREFIX efo: <>

SELECT DISTINCT ?wpURL ?pwTitle ?expressionValue ?pvalue WHERE {
  SERVICE <> {
    ?factor rdf:type efo:EFO_0000249 .
    ?value atlasterms:hasFactorValue ?factor .
    ?value atlasterms:isMeasurementOf ?probe .
    ?value atlasterms:pValue ?pvalue .
    ?value rdfs:label ?expressionValue .
    ?probe atlasterms:dbXref ?dbXref .
  }
  ?pwElement dcterms:isPartOf ?pathway .
  ?pathway dc:title ?pwTitle .
  ?pathway dc:identifier ?wpURL .
  ?pwElement wp:bdbEnsembl ?dbXref .
}
ORDER BY ASC(?pvalue)'

The full statement above is sent to the WikiPathways SPARQL endpoint (the URL assigned to the endpoint variable). However, the triple patterns inside the embedded SERVICE statement are forwarded to the Gene Expression Atlas SPARQL endpoint (the URL following SERVICE).

The first triple inside the SERVICE statement sets the ?factor variable to the Alzheimer’s disease identifier you found above. Notice that the colon was swapped out for an underscore so that EFO:0000249 became EFO_0000249.
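
As a small illustration of that substitution, the following R sketch converts the identifier shown by the Ontology Lookup Service into the form used in the query. The helper name efo_to_qname is made up for this example and is not part of the SPARQL package.

```r
# Hypothetical helper (not part of any package): convert an ontology
# identifier such as "EFO:0000249" into the prefixed name used in the
# query, "efo:EFO_0000249".
efo_to_qname <- function(id) {
  # fixed = TRUE treats ":" as a literal character, not a regular expression
  local_name <- sub(":", "_", id, fixed = TRUE)
  paste0("efo:", local_name)
}

alzheimers <- efo_to_qname("EFO:0000249")
print(alzheimers)
```

The result can be pasted into the first triple in place of the hand-typed efo:EFO_0000249 if you want to try other factors.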

The second through fifth triples all share the same subject, which is bound to the ?value variable. The first three of them use predicates from the atlasterms namespace. Look at the lower right quadrant of the Gene Expression Atlas RDF Schema (see Figure 1 above). Each of the three Gene Expression Atlas-specific predicates (hasFactorValue, isMeasurementOf, and pValue) has atlas:DifferentialExpressionRatio as its subject.

The final triple inside the SERVICE statement also uses a predicate from the atlasterms namespace. dbXref is a predicate of atlas:ProbeDesignElement, which is itself the object that isMeasurementOf points to from atlas:DifferentialExpressionRatio. The dbXref predicate provides a bridge across databases: in this query its value, ?dbXref, is the same variable that wp:bdbEnsembl binds on the WikiPathways side.
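
To picture what that bridge does, here is a rough local analogy in R, not what the endpoints literally execute. Two small data frames stand in for the rows each side would contribute, and merge() joins them on the shared dbXref column. All rows and identifiers below are invented for illustration.

```r
# Invented stand-ins: rows as they might come from the Gene Expression Atlas
# (left) and from WikiPathways (right). None of these values are real.
atlas_rows <- data.frame(
  dbXref = c("id/GENE_A", "id/GENE_B", "id/GENE_C"),
  pvalue = c(0.001, 0.02, 0.03),
  stringsAsFactors = FALSE
)
pathway_rows <- data.frame(
  dbXref  = c("id/GENE_A", "id/GENE_C", "id/GENE_D"),
  pwTitle = c("Pathway X", "Pathway Y", "Pathway Z"),
  stringsAsFactors = FALSE
)

# Keep only the rows whose dbXref appears on both sides, then sort by
# p-value, mirroring ORDER BY ASC(?pvalue) in the query.
joined <- merge(atlas_rows, pathway_rows, by = "dbXref")
joined <- joined[order(joined$pvalue), ]
print(joined)
```

Only identifiers present in both sources survive the join, which is why the shared ?dbXref variable is what ties the two repositories together.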

Finally, call the SPARQL function to carry out the query and assign the results to the data variable.

data <- SPARQL(endpoint, query)

Run the R summary function on the data assigned to the data variable to see how data were returned by SPARQL.

summary(data)

SPARQL returned a list; the query results are in the R data frame data$results.

Take a peek at the first six rows of the results by running the R head function on the data frame assigned to data$results.

head(data$results)

Figure 3. Using R and the R package SPARQL to federate data from WikiPathways and the Gene Expression Atlas

Refer to Figure 3 above to see all interactions with the R console for this exercise. As you go through the results, you’ll find that all but 2 of the 84 pathways show decreased gene expression in Alzheimer’s disease. Two pathways show increased COL27A1 gene expression.
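
A tally like the one above can be sketched in R along the following lines. This is only a sketch under two loud assumptions: the data frame below is an invented stand-in for data$results (its column names follow the SELECT clause, but the rows are made up), and each expressionValue label is assumed to contain the word UP or DOWN. Check the labels your own query returns before relying on it.

```r
# Invented stand-in for data$results; rows are NOT real query output.
results <- data.frame(
  wpURL           = c("WP_A", "WP_A", "WP_B", "WP_C"),
  pwTitle         = c("Pathway A", "Pathway A", "Pathway B", "Pathway C"),
  expressionValue = c("GENE1 DOWN", "GENE2 DOWN", "GENE3 DOWN", "GENE4 UP"),
  pvalue          = c(0.001, 0.002, 0.004, 0.02),
  stringsAsFactors = FALSE
)

# Assumption: the label contains UP or DOWN as a separate word.
is_up <- grepl("\\bUP\\b", results$expressionValue, perl = TRUE)
results$direction <- ifelse(is_up, "UP", "DOWN")

# Count distinct pathways per direction (a pathway may appear in many rows).
pathway_counts <- tapply(results$pwTitle, results$direction,
                         function(x) length(unique(x)))
print(pathway_counts)
```

With real results you would replace the stand-in with data$results and inspect a few labels first to confirm the UP/DOWN assumption holds.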

Mining the world’s linked-data is relatively easy using R and the R package SPARQL. You must be proficient in SPARQL, be able to navigate ontologies, and know where the SPARQL endpoints are. You did all of this to pull pathway and differential gene expression data associated with Alzheimer’s disease and used them as a single integrated dataset!

Update on R and Semantic Web Technologies

Progress has been made with linked-data and other Semantic Web technologies over the past few years, so it’s a great time to revisit how we may work with linked-data using R. A few years ago (March 13, 2014 post “R and RDF: Where Statistics and the Semantic Web Meet”) the discussion centered on the R package rrdf, which is no longer actively supported. Today SPARQL is the package to use. SPARQL works in very much the same way as rrdf.

Install and load the SPARQL package, if you haven’t done so already:

install.packages("SPARQL")
library(SPARQL)

Assign a SPARQL endpoint URL to a variable. The SPARQL endpoint is the semantic search entry point for a particular data repository. In this case let’s use the WikiPathways endpoint that provides data on biological pathways (see my June 27, 2017 post “WikiPathways: Open Biological Pathways Data on the Semantic Web”).

endpoint <- ''

Next assign a SPARQL query to a query variable. Be sure to surround the query itself with quotes.

query <- 'PREFIX wp: <>
SELECT DISTINCT (str(?title) AS ?pathway)
WHERE {
   ?pw dc:title ?title ;
      wp:organism ?organism ;
      wp:organismName "Homo sapiens"^^xsd:string .
}
ORDER BY ?pathway'
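
The choice of single quotes around the query matters because the query itself contains a double-quoted literal, "Homo sapiens". The short sketch below shows the two equivalent ways to write such a string in R. The variable names are only for illustration.

```r
# Single quotes outside: the double-quoted SPARQL literal needs no escaping.
pattern_single <- 'wp:organismName "Homo sapiens"^^xsd:string .'

# Double quotes outside: each inner double quote must be escaped with a backslash.
pattern_double <- "wp:organismName \"Homo sapiens\"^^xsd:string ."

# Both spellings produce exactly the same character string.
same_string <- identical(pattern_single, pattern_double)
print(same_string)
```

Single quotes on the outside keep the query readable, which is why the snippet above uses them.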

Use the SPARQL() function to carry out the query.

data <- SPARQL(endpoint, query)

The results, the names for all the repository's pathways for humans, are now in the R environment. (Enter 'data' to see the dataset.)

The R SPARQL package is a great tool for those who want to pull data from across the Semantic Web and use R to analyze and visualize the results.

WikiPathways: Open Biological Pathways Data on the Semantic Web

The above PathwayWidget provides interactive visualizations of pathways pulled from WikiPathways in real-time. Here we see human pathways active in Alzheimer’s Disease. At right are lists of mitochondrial RNA differentially expressed in Alzheimer’s Disease.

Genes, proteins, and small molecules interact in our bodies through biological pathways to carry out the moment-to-moment work that supports life. WikiPathways is an open, collaborative platform for depositing data gathered through research on biological pathways in all lifeforms including humans.

WikiPathways currently contains over 2400 pathways from over 25 different species. Created by research groups that need a platform to support high-throughput data analysis and visualization, WikiPathways provides a treasure trove of linked-data for data scientists and programmers.

A SPARQL query endpoint is available at the WikiPathways site. For example, you may get an alphabetical list of human pathways from the repository by using the following query.

PREFIX wp: <>
SELECT DISTINCT (str(?title) AS ?pathway)
WHERE {
   ?pw dc:title ?title ;
      wp:organism ?organism ;
      wp:organismName "Homo sapiens"^^xsd:string .
}
ORDER BY ?pathway

WikiPathways is a site that offers many resources including linked-data and Semantic Web tools and has adopted the Creative Commons CC0 waiver. It’s well worth exploring.