dipper.sources.Source module

class dipper.sources.Source.Source(graph_type, are_bnodes_skized=False, name=None)

Bases: object

Abstract class for any data sources that we’ll import and process. Each of the subclasses will fetch() the data, scrub() it as necessary, then parse() it into a graph. The graph will then be written out to a single self.name().<dest_fmt> file.

checkIfRemoteIsNewer(remote, local, headers)

Given a remote file location and the corresponding local file, this will check the datetime stamp on the files to see if the remote one is newer. This is a convenience method so that we don’t have to re-fetch files that we already have saved locally.

Parameters:
  • remote – URL of file to fetch from remote server
  • local – pathname to save file to locally

Returns:True if the remote file is newer and should be downloaded
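The timestamp comparison can be sketched as a standalone function (a hypothetical re-implementation, not dipper's actual code; the HEAD-request approach and the `remote_is_newer` name are assumptions):

```python
import os
import urllib.request
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def remote_is_newer(remote_url, local_path):
    # If the local file is missing, we must download regardless.
    if not os.path.exists(local_path):
        return True
    # Ask the server for headers only (HEAD), avoiding a full download.
    req = urllib.request.Request(remote_url, method='HEAD')
    with urllib.request.urlopen(req) as resp:
        last_mod = resp.headers.get('Last-Modified')
    if last_mod is None:
        # No timestamp available; assume we need to re-fetch.
        return True
    remote_dt = parsedate_to_datetime(last_mod)
    local_dt = datetime.fromtimestamp(os.path.getmtime(local_path),
                                      tz=timezone.utc)
    return remote_dt > local_dt
```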

compare_local_remote_bytes(remotefile, localfile, remote_headers=None)

Test whether the fetched file is the same size as the remote file, using the content-length field in the HTTP header.

Returns:True or False
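The size check reduces to comparing the local byte count against the Content-Length header. A minimal sketch (the `sizes_match` helper and its missing-header behavior are assumptions, not dipper's implementation):

```python
import os

def sizes_match(localfile, remote_headers):
    # Compare the local file's size in bytes against the HTTP
    # Content-Length header reported by the remote server.
    remote_len = remote_headers.get('Content-Length')
    if remote_len is None:
        # Assumption: treat a missing header as a mismatch.
        return False
    return os.path.getsize(localfile) == int(remote_len)
```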

declareAsOntology(graph)

The file we output needs to be declared as an ontology, including its version information.

TEC: I am not convinced dipper reformatting external data as RDF triples makes an OWL ontology (nor that it should be considered a goal).

Proper ontologies are built by ontologists. Dipper reformats data and annotates/decorates it with a minimal set of carefully arranged terms drawn from multiple proper ontologies, which allows the whole (dipper’s RDF triples and parent ontologies) to function as a single ontology we can reason over when combined in a store such as SciGraph.

Including more than the minimal ontological terms in dipper’s RDF output constitutes a liability as it allows greater divergence between dipper artifacts and the proper ontologies.

Further information will be added to the dataset object.

Parameters:version
Returns:

fetch(is_dl_forced=False)

Abstract method to fetch all data from an external resource. This should be overridden by subclasses.

Returns:None

fetch_from_url(remotefile, localfile=None, is_dl_forced=False, headers=None)

Given a remote URL and a local filename, attempt to determine if the remote file is newer; if it is, fetch the remote file and save it to the specified localfile, reporting the basic file information once it is downloaded.

Parameters:
  • remotefile – URL of remote file to fetch
  • localfile – pathname of file to save locally

Returns:None

file_len(fname)
files = {}
getTestSuite()

An abstract method that should be overridden with tests appropriate for the specific source.

Returns:

static get_eco_map(url)

To convert the three-column file to a hashmap, we join the primary and secondary keys. For example:

IEA  GO_REF:0000002  ECO:0000256
IEA  GO_REF:0000003  ECO:0000501
IEA  Default         ECO:0000501

becomes

IEA-GO_REF:0000002: ECO:0000256
IEA-GO_REF:0000003: ECO:0000501
IEA: ECO:0000501

Returns:dict
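The key-joining rule can be sketched as follows (a hypothetical re-implementation working on already-fetched lines; the `build_eco_map` name and tab-separated columns are assumptions):

```python
def build_eco_map(lines):
    # Join primary and secondary keys with '-'; a secondary key of
    # 'Default' means the evidence code alone maps to the ECO term.
    eco_map = {}
    for line in lines:
        code, go_ref, eco = line.strip().split('\t')  # assumption: tab-separated
        if go_ref == 'Default':
            eco_map[code] = eco
        else:
            eco_map[code + '-' + go_ref] = eco
    return eco_map
```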
get_file_md5(directory, file, blocksize=1048576)
get_files(is_dl_forced, files=None)

Given a set of files for this source, it will go fetch them and set a default version by date. If you need to set the version number by another method, then it can be set again.

Parameters:
  • is_dl_forced – boolean
  • files – dict, override the instance files dict

Returns:None

get_local_file_size(localfile)
Parameters:localfile
Returns:size of file
get_remote_content_len(remote, headers=None)
Parameters:remote
Returns:size of remote file
static hash_id(long_string)

Return the sha1 hash of a string, truncated to a 64-bit-sized word, with ‘b’ prepended so the identifier does not lead with a digit.

Parameters:long_string – str, string to be hashed
Returns:str, hash of id
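A minimal sketch of this hashing scheme (the exact truncation to 16 hex characters is an assumption about how "64-bit-sized word" is realized, not a copy of dipper's code):

```python
import hashlib

def hash_id(long_string):
    # sha1-digest the input, keep a 64-bit-sized word (16 hex characters,
    # an assumed truncation), and prepend 'b' so the identifier never
    # starts with a digit.
    digest = hashlib.sha1(long_string.encode('utf-8')).hexdigest()
    return 'b' + digest[:16]
```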

static make_id(long_string, prefix='MONARCH')

A method to create DETERMINISTIC identifiers based on a string’s digest; currently implemented with sha1.

Parameters:long_string
Returns:
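The deterministic property means the same input always yields the same identifier. A sketch under assumptions (the ‘:’ separator, the ‘b’ prefix on the digest, and the 16-character truncation are illustrative, not confirmed by the source):

```python
import hashlib

def make_id(long_string, prefix='MONARCH'):
    # Deterministic CURIE-style identifier: identical inputs always
    # produce identical outputs, so repeated ingests stay stable.
    digest = hashlib.sha1(long_string.encode('utf-8')).hexdigest()
    return prefix + ':b' + digest[:16]
```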

namespaces = {}
static open_and_parse_yaml(file)
Parameters:file – String, path to file containing label-id mappings in the first two columns of each row
Returns:dict where keys are labels and values are ids
parse(limit)

Abstract method to parse all data from an external resource that was fetched in fetch(). This should be overridden by subclasses.

Returns:None

static parse_mapping_file(file)
Parameters:file – String, path to file containing label-id mappings in the first two columns of each row
Returns:dict where keys are labels and values are ids
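Building the label-to-id dict from the first two columns can be sketched as follows (a hypothetical standalone version; tab-delimited columns are an assumption):

```python
def parse_mapping_file(path):
    # Build a {label: id} dict from the first two columns of each row.
    mapping = {}
    with open(path) as fh:
        for line in fh:
            cols = line.rstrip('\n').split('\t')  # assumption: tab-delimited
            if len(cols) >= 2:
                mapping[cols[0]] = cols[1]
    return mapping
```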
process_xml_table(elem, table_name, processing_function, limit)

This is a convenience function to process the elements of an XML document when the XML is used as an alternative way of distributing sql-like tables. In this case, the “elem” is akin to an sql table, with its name given by `table_name`. Each `row` is then processed with the supplied `processing_function`.

Parameters:
  • elem – The element data
  • table_name – The name of the table to process
  • processing_function – The row processing function
  • limit

Appears to be making calls to the ElementTree library, although it is not explicitly imported here.

Returns:
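A minimal sketch of this row-processing pattern with `xml.etree.ElementTree` (a hypothetical re-implementation; the `<row>` child-tag name and the `elem.clear()` call are assumptions):

```python
import xml.etree.ElementTree as ET

def process_xml_table(elem, table_name, processing_function, limit=None):
    # Treat each child <row> of the table element as an sql-like record
    # and hand it to the supplied row-processing function.
    if elem.tag != table_name:
        return
    for i, row in enumerate(elem.findall('row')):
        if limit is not None and i >= limit:
            break
        processing_function(row)
    elem.clear()  # release the subtree; useful when streaming with iterparse
```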
static remove_backslash_r(filename, encoding)

A helpful utility to remove Carriage Return from any file. This will read a file into memory, and overwrite the contents of the original file.

TODO: This function may be a liability

Parameters:filename
Returns:
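The read-into-memory-then-overwrite behavior can be sketched as follows (a hypothetical standalone version, which also illustrates why the TODO flags it as a liability for large files):

```python
import re

def remove_backslash_r(filename, encoding):
    # Read the whole file into memory (the liability for large files),
    # strip every carriage return, and overwrite the original in place.
    # newline='' disables newline translation so '\r' reaches us intact.
    with open(filename, 'r', encoding=encoding, newline='') as fh:
        contents = fh.read()
    with open(filename, 'w', encoding=encoding, newline='') as fh:
        fh.write(re.sub(r'\r', '', contents))
```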
settestmode(mode)

Set testMode to (mode).
  • True: run the Source in testMode
  • False: run it in full mode

Parameters:mode
Returns:None

settestonly(testonly)

Set that this source should only be processed in testMode.

Parameters:testonly
Returns:None

whoami()
write(fmt='turtle', stream=None)

This convenience method will write out all of the graphs associated with the source. Right now these are hardcoded to be a single “graph”, a “src_dataset.ttl”, and a “src_test.ttl”. Unless you supply stream=’stdout’, it will write these to files by default.

In addition, if the version number isn’t yet set in the dataset, it will be set to the date on file.

Returns:None