Update on R and Semantic Web Technologies

Linked data and other Semantic Web technologies have progressed over the past few years, so it’s a good time to revisit how to work with linked data from R. My March 13, 2014 post, “R and RDF: Where Statistics and the Semantic Web Meet”, discussed the R package rrdf, which is no longer actively supported. Today the SPARQL package is the one to use, and it works in much the same way as rrdf.

Install and load the SPARQL package, if you haven’t done so already:

install.packages("SPARQL")
library(SPARQL)

Assign the SPARQL endpoint to a variable. The endpoint is the URL through which a particular data repository accepts SPARQL queries. In this case let’s use the WikiPathways endpoint, which provides data on biological pathways (see my June 27, 2017 post “WikiPathways: Open Biological Pathways Data on the Semantic Web”).

endpoint <- 'http://sparql.wikipathways.org'

Next, assign a SPARQL query to a query variable. Surround the query with single quotes, because the query itself contains double quotes.

query <- 'PREFIX wp: <http://vocabularies.wikipathways.org/wp#>
SELECT DISTINCT (str(?title) AS ?pathway)
WHERE {
   ?pw dc:title ?title ;
      wp:organism ?organism ;
      wp:organismName "Homo sapiens"^^xsd:string .
}
ORDER BY ?pathway'
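If you want the same list for another species, the query string can be built from a function argument rather than typed out each time. The following is a minimal sketch, not part of the SPARQL package: the helper name build_pathway_query is hypothetical, "Mus musculus" is only an illustrative organism name, and the projection uses the parenthesized form that SPARQL 1.1 specifies.

```r
# Hypothetical helper (not part of the SPARQL package): builds the pathway
# query from this post for a given organism name.
build_pathway_query <- function(organism = "Homo sapiens") {
  # Reject double quotes and backslashes so the name cannot break out of
  # the SPARQL string literal.
  stopifnot(
    is.character(organism), length(organism) == 1,
    !grepl('"', organism, fixed = TRUE),
    !grepl("\\", organism, fixed = TRUE)
  )
  paste(
    "PREFIX wp: <http://vocabularies.wikipathways.org/wp#>",
    "SELECT DISTINCT (str(?title) AS ?pathway)",
    "WHERE {",
    "   ?pw dc:title ?title ;",
    "      wp:organism ?organism ;",
    sprintf('      wp:organismName "%s"^^xsd:string .', organism),
    "}",
    "ORDER BY ?pathway",
    sep = "\n"
  )
}

# Illustrative use: print the query text for another organism.
query_mouse <- build_pathway_query("Mus musculus")
cat(query_mouse, "\n")
```

The resulting string can be passed to SPARQL() exactly like the hand-written query.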

Use the SPARQL() function to carry out the query.

data <- SPARQL(endpoint, query)

The results, the names of all human pathways in the repository, are now in the R environment. SPARQL() returns a list, and the data frame holding the query results is its 'results' element. (Enter 'data$results' to see the dataset.)
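Since SPARQL() returns a list whose results element is a data frame, a small helper can pull the pathway names out as a plain character vector. This is only a sketch: the helper name pathway_names is hypothetical, and the mock object below stands in for a live query result, with made-up placeholder titles, so that the code runs without network access.

```r
# Hypothetical helper: extracts one column from the list returned by SPARQL().
pathway_names <- function(res, column = "pathway") {
  df <- res$results
  stopifnot(is.data.frame(df), column %in% names(df))
  as.character(df[[column]])
}

# Mock object shaped like a SPARQL() return value: a list with a `results`
# data frame. The titles are placeholders, not real WikiPathways data.
mock <- list(
  results = data.frame(
    pathway = c("Example pathway A", "Example pathway B"),
    stringsAsFactors = FALSE
  )
)

names_vec <- pathway_names(mock)
length(names_vec)  # number of pathways returned
head(names_vec)    # first few titles
```

With a live result, the call would be pathway_names(data).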

The R SPARQL package is a great tool for those who want to pull data from across the Semantic Web and use R to analyze and visualize the results.

WikiPathways: Open Biological Pathways Data on the Semantic Web


The above PathwayWidget provides interactive visualizations of pathways pulled from WikiPathways in real-time. Here we see human pathways active in Alzheimer’s Disease. At right are lists of mitochondrial RNA differentially expressed in Alzheimer’s Disease.

Genes, proteins, and small molecules interact in our bodies through biological pathways to carry out the moment-to-moment work that supports life. WikiPathways is an open, collaborative platform for depositing data gathered through research on biological pathways in all lifeforms including humans.

WikiPathways currently contains over 2400 pathways from over 25 different species. Created by research groups that need a platform to support high-throughput data analysis and visualization, WikiPathways provides a treasure trove of linked-data for data scientists and programmers.

A SPARQL query endpoint is available at the WikiPathways site. For example, you may get an alphabetical list of human pathways from the repository by using the following query.

PREFIX wp: <http://vocabularies.wikipathways.org/wp#>
SELECT DISTINCT (str(?title) AS ?pathway)
WHERE {
   ?pw dc:title ?title ;
      wp:organism ?organism ;
      wp:organismName "Homo sapiens"^^xsd:string .
}
ORDER BY ?pathway
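Under the standard SPARQL protocol, a query like this can be sent to an endpoint as the URL-encoded query parameter of an HTTP GET request. The R sketch below only builds such a request URL and does not contact the server. The helper name sparql_get_url is hypothetical, and the endpoint address is the one quoted in the post above; whether the service answers at exactly that path is an assumption.

```r
# Hypothetical helper: builds the GET request URL for a SPARQL query.
sparql_get_url <- function(endpoint, query) {
  # reserved = TRUE percent-encodes characters such as ?, #, < and spaces,
  # which would otherwise be misread as URL syntax.
  paste0(endpoint, "?query=", utils::URLencode(query, reserved = TRUE))
}

query <- paste(
  "PREFIX wp: <http://vocabularies.wikipathways.org/wp#>",
  "SELECT DISTINCT (str(?title) AS ?pathway)",
  "WHERE {",
  "   ?pw dc:title ?title ;",
  "      wp:organism ?organism ;",
  '      wp:organismName "Homo sapiens"^^xsd:string .',
  "}",
  "ORDER BY ?pathway",
  sep = "\n"
)

# Endpoint address as quoted in the post above (assumed, not verified here).
request_url <- sparql_get_url("http://sparql.wikipathways.org", query)
cat(request_url, "\n")
```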

WikiPathways offers many resources, including linked data and Semantic Web tools, and it has adopted the Creative Commons CC0 waiver. It’s well worth exploring.