Writing ingests with the source API

Overview

Writing an ingest does not require it, but we provide a Source parent class that each ingest can extend to reuse common functionality.

To create a new ingest using this method, first extend the Source class.

If the source contains flat files, include a files dictionary with this structure:

files = {
    'somekey': {
        'file': 'filename.tsv',
        'url': 'http://example.org/filename.tsv'
    },
    ...
}

For example:

from dipper.sources.Source import Source


class TPO(Source):
    """
    The ToxicoPhenomicOmicsDB contains data on ...
    """

    files = {
        'genes': {
            'file': 'genes.tsv',
            'url': 'http://example.org/genes.tsv'
        }
    }

Initializing the class

Each source class takes two parameters: graph_type (string) and are_bnodes_skolemized (boolean). These parameters are used to initialize a graph object in the Source constructor.

Note: In the future this may be adjusted so that a graph object is passed into each source.

For example:

def __init__(self, graph_type, are_bnodes_skolemized):
    super().__init__(graph_type, are_bnodes_skolemized, 'TPO')

Writing the fetcher

The fetch method retrieves data from the remote locations when the remote copy is newer than the local copy.

Extend the parent fetch function. If the remote file has already been downloaded, the fetch method checks the remote headers to see whether it has been updated. For sources not served over HTTP, this method may need to be overridden, as in Bgee.

For example:

def fetch(self, is_dl_forced=False):
    """
    Fetches files from TPO

    :param is_dl_forced (bool): Force download
    :return None
    """
    self.get_files(is_dl_forced)
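
The header check described above could look roughly like the following. This is a minimal sketch, not the parent class's actual implementation: is_remote_newer is a hypothetical helper, and it assumes the server reports a standard HTTP Last-Modified header whose value is passed in as a string.

```python
import os
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def is_remote_newer(last_modified_header, local_path):
    """
    Hypothetical helper: return True if the local copy is missing or
    older than the time given in a Last-Modified header value.
    """
    if not os.path.exists(local_path):
        return True  # nothing downloaded yet, so fetch
    if last_modified_header is None:
        return True  # server gave no date; fetching is the safe choice
    remote_time = parsedate_to_datetime(last_modified_header)
    if remote_time.tzinfo is None:
        # Treat a date without a usable offset as UTC (an assumption)
        remote_time = remote_time.replace(tzinfo=timezone.utc)
    local_time = datetime.fromtimestamp(
        os.path.getmtime(local_path), tz=timezone.utc)
    return remote_time > local_time
```

Comparing timezone-aware datetimes avoids mistakes when the local machine is not set to UTC. A source that is not served over HTTP has no such header, which is why it may need its own fetch logic.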

Writing the parser

A parser is typically written by looping through the files that were obtained by the fetch method. The goal is to process each file minimally, adding classes and individuals as necessary, and adding triples to the source's graph.

For example:

def parse(self, limit=None):
    """
    Parses genes from TPO

    :param limit (int, optional) limit the number of rows processed
    :return None
    """
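    if limit is not None:
        # logger is assumed to be defined at module level,
        # e.g. logger = logging.getLogger(__name__)
        logger.info("Only parsing first %d rows", limit)

    # Build the path to the raw file downloaded by fetch()
    raw_file = '/'.join((self.rawdir, self.files['genes']['file']))
    # The with block closes the file even if parsing raises an exception
    with open(raw_file, 'r') as fh:
        self._add_gene_toxicology(fh, limit)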
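
The example above calls _add_gene_toxicology without showing it. One possible shape for such a row-processing method is sketched below as a standalone function. Everything specific here is an assumption made for illustration: the two-column tab-separated layout, the identifiers in the sample data, the EX:associated_with predicate, and the use of a plain set of tuples in place of the source's graph.

```python
import csv
import io


def add_gene_toxicology(fh, limit=None, triples=None):
    """
    Hypothetical stand-in for self._add_gene_toxicology: read
    tab-separated rows from an open file handle and record one
    triple per row.
    """
    if triples is None:
        triples = set()
    reader = csv.reader(fh, delimiter='\t')
    for row_number, row in enumerate(reader, start=1):
        if limit is not None and row_number > limit:
            break  # stop once the requested number of rows is read
        if len(row) < 2:
            continue  # skip blank or malformed lines
        gene_id, toxin_id = row[0], row[1]
        # A real ingest would add this to the source's graph instead
        triples.add((gene_id, 'EX:associated_with', toxin_id))
    return triples


# Usage with an in-memory file; only the first two rows are processed
sample = io.StringIO("GENE:1\tTOX:1\nGENE:2\tTOX:2\nGENE:3\tTOX:3\n")
first_two = add_gene_toxicology(sample, limit=2)
```

Passing the open file handle in, rather than a path, keeps the row logic easy to test with in-memory data.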

Considerations when writing a parser

There are certain conventions that we follow when parsing data:

1. Genes are a special case of genomic feature and are added as (OWL) classes. All other genomic features are added as individuals of an OWL class.

2. If a source references an external identifier, assume that it has been processed in another source script, and add only the identifier (not the label) within this source's file. This helps prevent label collisions caused by slightly different versions of the source data when integrating downstream.

3. You can instantiate a class or individual as many times as you want; they will get merged in the graph and will only show up once in the resulting output.
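
Conventions 2 and 3 can be illustrated with a toy model. This is a sketch under stated assumptions, not the project's graph API: TinyGraph and add_class are hypothetical names, the identifiers are made up, and a Python set stands in for the graph, relying only on the fact that an RDF graph is a set of triples.

```python
RDF_TYPE = 'rdf:type'
OWL_CLASS = 'owl:Class'
RDFS_LABEL = 'rdfs:label'


class TinyGraph:
    """Hypothetical stand-in for the source's graph: a set of triples."""

    def __init__(self):
        self.triples = set()

    def add_class(self, class_id, label=None):
        # Adding a triple that is already present changes nothing
        self.triples.add((class_id, RDF_TYPE, OWL_CLASS))
        if label is not None:
            self.triples.add((class_id, RDFS_LABEL, label))


graph = TinyGraph()
# Convention 3: declaring the same class twice does not duplicate it
graph.add_class('EX:gene1', label='gene one')
graph.add_class('EX:gene1', label='gene one')
# Convention 2: an external identifier is added without a label
graph.add_class('EXT:gene2')
```

After these calls the set holds three triples: the type and label for EX:gene1, and the type alone for EXT:gene2.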