Genes, Pathways, and Autistic Spectrum Disorder

Interactions between the top 50 pathways from 659 Autistic Spectrum Disorder (ASD) genes are displayed. Pathways are grouped as Disease Pathways (left column; purple) and Functional Pathways (right column; green). The color of each node inside the groupings represents the p-value of that pathway (p-value bar at lower left). The size of each node represents the number of ASD genes in that pathway. From Figure 2 in “Pathway Network Analysis for Autism Reveal Multisystem Involvement, Major Overlaps with Other Diseases and Convergence upon MAPK and Calcium Signalling” published April 7, 2016 in PLOS One.

Autistic Spectrum Disorder (ASD) is defined behaviorally and has no known biological marker. Does ASD encompass a collection of disorders? Are they necessarily disorders or simply differences that may become disorders under some biological or environmental circumstances? We simply do not know for certain.

Research has shown that hundreds of genes, multitudes of biological pathways, and many systems seem to contribute to ASD. If all of these data are brought together and visualized, will they provide a clue to the biological basis of ASD? The recent publication “Pathway Network Analysis for Autism Reveal Multisystem Involvement, Major Overlaps with Other Diseases and Convergence upon MAPK and Calcium Signalling” (April 7, 2016 in PLOS One) attempts this approach with genes and pathways known to be active in ASD patients.

When this study began, 667 genes were thought to contribute to ASD. Of these, 659 genes were in the annotation set the researchers used for their analysis. They examined the interactions among the pathways associated with these genes and selected the top 50 pathways based on the statistical significance of pathway overlaps. Their results are displayed in the figure above.
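To get a feel for what a pathway p-value means here, one common way to score the overlap between a gene list and a pathway is a hypergeometric test. The sketch below is only an illustration of that general idea and is not necessarily the method the authors used. The 659 ASD genes come from the study; the background size, pathway size, and overlap count are hypothetical numbers chosen for the example.

```r
# Minimal sketch of a hypergeometric overlap test (an assumed, generic approach;
# the paper's own statistics may differ).
universe_size <- 20000  # hypothetical number of annotated human genes
asd_genes     <- 659    # ASD genes in the annotation set (from the study)
pathway_size  <- 180    # hypothetical number of genes in one pathway
overlap       <- 30     # hypothetical number of ASD genes found in that pathway

# Overlap we would expect if ASD genes were spread across pathways at random.
expected_overlap <- pathway_size * asd_genes / universe_size

# P(X >= overlap): chance of seeing at least this many ASD genes in the pathway.
p_value <- phyper(overlap - 1, asd_genes, universe_size - asd_genes,
                  pathway_size, lower.tail = FALSE)
```

A small p_value means the pathway holds many more ASD genes than chance alone would suggest, which corresponds to the red end of the color bar in the figure.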

Notice that the figure is divided into two columns. At left is the Disease Pathways column (purple). At right is the Functional Pathways column (green). Each node in a pathway grouping has a size and color. The size represents the number of ASD genes in that pathway (larger diameter means more genes). The color of each node represents the p-value of that pathway (p-value bar at lower left; red indicates a very high significance of overlap).

The resulting map shows three “hot” spots that are both large in diameter (many ASD genes) and red (very high significance of overlap) under the Neural and Cell Signaling functional pathways and the Cancer disease pathways. Specifically, the following three nodes stand out:

1. Neuro-active ligand-receptor interaction

2. Calcium signaling pathway

3. Collection cancer

Data showed that the MAPK signaling pathway interacts with half of the pathways in the network and is the most interactive pathway in the ASD data. They showed that the calcium signaling pathway is the second most interactive pathway and is associated with the most ASD genes. These two pathway types overlapped with 8 ASD genes (green intersection) known to be important in the process of calcium-PKC-Ras-Raf-MAPK/ERK. From Figure 3 in “Pathway Network Analysis for Autism Reveal Multisystem Involvement, Major Overlaps with Other Diseases and Convergence upon MAPK and Calcium Signalling.”

Neuro-active ligand-receptor interaction is key to communication between each and every nerve cell, both in our brains and throughout our bodies. The calcium signaling pathway is fundamental to each and every cell in our bodies. The genes and pathways were identified in ASD patients because their products and interactions were significantly different than in non-ASD patients. Do those diagnosed with ASD have significantly different metabolisms than the rest of us? Do their brain cells communicate differently?

Research based on pulling together existing data is implicitly biased by the experiments that were performed to produce those data. Perhaps researchers were most interested in cell signaling and cancer genes and pathways, and that is the reason these kinds of pathways are more common in data repositories. The result could be their prominent association with ASD genes.

Using R to Query Biomedical Data Distributed Across the Semantic Web

Figure 1. The Gene Expression Atlas RDF Schema. Image courtesy of EMBL-EBI.

A key idea behind the Semantic Web of linked-data is to enable seamless use of data from diverse sources across the Internet. Biological pathways data come from a huge number of laboratories from around the world and are captured and saved in a lot of different ways. We can use SPARQL endpoints to bring data together from any number of data repositories.

In this post we’ll use the R package SPARQL to pull biological pathway data from WikiPathways while at the same time pulling the genes expressed in these pathways from the Gene Expression Atlas.

You may find these related posts helpful: “WikiPathways: Open Biological Pathways Data on the Semantic Web” from June 27, 2017 and “Update on R and Semantic Web Technologies” from June 30, 2017.

The Gene Expression Atlas provides data on what genes or proteins are expressed in a particular species under specific conditions. It also provides data on differential expression, the increase or decrease of gene expression or protein production under specific conditions.

Let’s look for human biological pathways that show different gene activity under Alzheimer’s disease than under normal circumstances. Differences are indicated by significant increases or decreases in gene expression.

First, look up the identifier for Alzheimer’s disease using the EMBL-EBI Ontology Lookup Service.

Figure 2. EMBL-EBI Ontology Lookup Service

Type Alzheimer’s disease into the Search EFO search box in the upper right area of the EMBL-EBI Ontology Lookup Service page (see Figure 2 above).

EFO stands for Experimental Factor Ontology.

A large list of factors associated with Alzheimer’s disease appears, but the one we’re interested in, listed as Alzheimer’s disease, should be on the first page and provide the identifier EFO:0000249.

The following discussion assumes that you’ve installed and loaded the R package SPARQL. If not, please see the June 30, 2017 post “Update on R and Semantic Web Technologies.”

Assign the WikiPathways SPARQL endpoint URL to the endpoint variable in your R environment.

endpoint <- 'http://sparql.wikipathways.org'

Next, assign a SPARQL query to the query variable, as in the following code snippet.

query <- 'PREFIX identifiers:<http://identifiers.org/ensembl/>
PREFIX atlas: <http://rdf.ebi.ac.uk/resource/atlas/>
PREFIX atlasterms: <http://rdf.ebi.ac.uk/terms/atlas/>
PREFIX efo: <http://www.ebi.ac.uk/efo/>

SELECT DISTINCT ?wpURL ?pwTitle ?expressionValue ?pvalue where {
SERVICE <https://www.ebi.ac.uk/rdf/services/atlas/sparql> {
?factor rdf:type efo:EFO_0000249 .
?value atlasterms:hasFactorValue ?factor .
?value atlasterms:isMeasurementOf ?probe .
?value atlasterms:pValue ?pvalue .
?value rdfs:label ?expressionValue .
?probe atlasterms:dbXref ?dbXref .
}
?pwElement dcterms:isPartOf ?pathway .
?pathway dc:title ?pwTitle .
?pathway dc:identifier ?wpURL .
?pwElement wp:bdbEnsembl ?dbXref .
}
ORDER BY ASC(?pvalue)'

The full statement above is sent to the WikiPathways SPARQL endpoint (the URL assigned to the endpoint variable). However, the search terms in the embedded SERVICE statement are forwarded to the Gene Expression Atlas SPARQL endpoint (the URL following SERVICE).

The first triple inside the SERVICE statement sets the ?factor variable to the Alzheimer’s disease identifier you found above. Notice that the colon was swapped out for an underscore so that EFO:0000249 became EFO_0000249.
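If you look up several ontology terms, you may prefer to do that colon-to-underscore swap in R rather than by hand. Here is a minimal sketch using only base R. The variable names are made up for this example, and the base IRI is the one declared for the efo prefix in the query above.

```r
# Turn an identifier as shown by the Ontology Lookup Service into the form
# used in the query (variable names here are illustrative only).
efo_curie <- "EFO:0000249"

# Replace the first colon with an underscore; fixed = TRUE avoids regex rules.
efo_local <- sub(":", "_", efo_curie, fixed = TRUE)

# Full IRI that efo:EFO_0000249 expands to under the efo prefix in the query.
efo_iri <- paste0("http://www.ebi.ac.uk/efo/", efo_local)
```

The value in efo_local is what follows efo: in the first triple of the SERVICE block.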

The second through fifth triples all share the same subject, which is bound to the ?value variable. The first three of these use predicates from the atlasterms namespace. Look at the lower right quadrant of the Gene Expression Atlas RDF Schema (see Figure 1 above). Each of the three Gene Expression Atlas specific predicates (hasFactorValue, isMeasurementOf, and pValue) has atlas:DifferentialExpressionRatio as its subject.

The final triple also uses a predicate from the atlasterms namespace. dbXref is a predicate of atlas:ProbeDesignElement that is itself the object pointed to by isMeasurementOf from atlas:DifferentialExpressionRatio. The dbXref predicate provides a bridge across various databases.

Finally, call the SPARQL function to carry out our query and assign the results to the data variable.

data <- SPARQL(endpoint, query)

Run the R summary function on the data assigned to the data variable to see how data were returned by SPARQL.

summary(data)

SPARQL returned an R data frame and assigned it to data$results.
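Federated queries depend on two remote endpoints being up at the same time, so a request can fail for reasons that have nothing to do with your query. The sketch below wraps the call in tryCatch so a failure gives you a message and NULL instead of stopping your script. The safe_query helper and the two fake runner functions are hypothetical names invented for this example; they stand in for the real SPARQL function so the sketch runs without network access.

```r
# Run a query through `runner` (for example the SPARQL function) and return the
# results data frame, or NULL with a message if the request fails.
safe_query <- function(endpoint, query, runner) {
  tryCatch(
    runner(endpoint, query)$results,
    error = function(e) {
      message("Query failed: ", conditionMessage(e))
      NULL
    }
  )
}

# Hypothetical stand-ins for a working and an unreachable endpoint.
fake_ok   <- function(endpoint, query) list(results = data.frame(x = 1:2))
fake_down <- function(endpoint, query) stop("endpoint unreachable")

ok     <- safe_query("http://example.org/sparql", "SELECT ...", fake_ok)
failed <- safe_query("http://example.org/sparql", "SELECT ...", fake_down)
```

With the SPARQL package loaded, the same helper could be called as safe_query(endpoint, query, SPARQL).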

Take a peek at the first six rows of the results by running the R head function on the data frame assigned to data$results.

head(data$results)

Figure 3. Using R and the R package SPARQL to federate data from WikiPathways and the Gene Expression Atlas

Refer to Figure 3 above to see all interactions with the R console for this exercise. As you go through the analysis, you’ll find that all but 2 of the 84 pathways show decreased gene expression in Alzheimer’s disease. Two pathways show increased COL27A1 gene expression.
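A first pass at that kind of analysis might look like the sketch below. It uses a small made-up data frame with the same four columns as the SELECT clause (wpURL, pwTitle, expressionValue, pvalue) in place of data$results, so every row is a placeholder and not real output from the query.

```r
# Hypothetical stand-in for data$results; the real rows come from the federated query.
results <- data.frame(
  wpURL           = c("WP_A", "WP_A", "WP_B", "WP_C"),
  pwTitle         = c("Pathway A", "Pathway A", "Pathway B", "Pathway C"),
  expressionValue = c("label 1", "label 2", "label 3", "label 4"),
  pvalue          = c(0.004, 0.03, 0.0002, 0.01),
  stringsAsFactors = FALSE
)

# Number of distinct pathways in the result set.
n_pathways <- length(unique(results$wpURL))

# Smallest p-value observed for each pathway, most significant first.
best_p <- aggregate(pvalue ~ pwTitle, data = results, FUN = min)
best_p <- best_p[order(best_p$pvalue), ]
```

On the real results, replace the made-up data frame with data$results and inspect the expressionValue labels to see which genes go up or down.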

Mining the world’s linked-data is relatively easy using R and the R package SPARQL. You must be proficient in SPARQL, be able to navigate ontologies, and know where the SPARQL endpoints are. You did all of this to pull pathway and differential gene expression data associated with Alzheimer’s disease and used them as a single integrated dataset!

Update on R and Semantic Web Technologies

Progress has been made with linked-data and other Semantic Web technologies over the past few years, so it’s a great time to revisit how we may work with linked-data using R. A few years ago (March 13, 2014 post “R and RDF: Where Statistics and the Semantic Web Meet”) the discussion centered on the R package rrdf, which is no longer actively supported. Today SPARQL is the package to use, and it works in much the same way as rrdf.

Install and load the SPARQL package, if you haven’t done so already:

install.packages("SPARQL")
library(SPARQL)

Assign a SPARQL endpoint URL to a variable. The SPARQL endpoint is the semantic search entry point for a particular data repository. In this case let’s use the WikiPathways endpoint that provides data on biological pathways (see my June 27, 2017 post “WikiPathways: Open Biological Pathways Data on the Semantic Web”).

endpoint <- 'http://sparql.wikipathways.org'

Next assign a SPARQL query to a query variable. Be sure to surround the query itself with quotes.

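query <- 'PREFIX wp: <http://vocabularies.wikipathways.org/wp#>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT DISTINCT (str(?title) AS ?pathway)
WHERE {
?pw dc:title ?title ;
wp:organism ?organism ;
wp:organismName "Homo sapiens"^^xsd:string .
}
ORDER BY ?pathway'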

Use the SPARQL() function to carry out the query.

data <- SPARQL(endpoint, query)

The results, the names for all the repository's pathways for humans, are now in the R environment. (Enter 'data' to see the dataset.)
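If you want the same list for another species, you can build the query string from a small function instead of editing it by hand. This is a minimal sketch: build_pathway_query is a hypothetical helper name, the prefix IRIs are written out explicitly on the assumption that they match the WikiPathways and Dublin Core vocabularies, and "Mus musculus" is only an example argument.

```r
# Build the pathway-title query for a given organism name
# (hypothetical helper; not part of the SPARQL package).
build_pathway_query <- function(organism) {
  sprintf('PREFIX wp: <http://vocabularies.wikipathways.org/wp#>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT DISTINCT (str(?title) AS ?pathway)
WHERE {
  ?pw dc:title ?title ;
      wp:organism ?organism ;
      wp:organismName "%s"^^xsd:string .
}
ORDER BY ?pathway', organism)
}

# Example: the same query, but for mouse pathways.
mouse_query <- build_pathway_query("Mus musculus")
```

You would then pass mouse_query to SPARQL(endpoint, mouse_query) exactly as with the human query, and look at the results data frame in the returned list.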

The R SPARQL package is a great tool for those who want to pull data from across the Semantic Web and use R to analyze and visualize the results.