dipper.sources.Source module

class dipper.sources.Source.Source(graph_type, are_bnodes_skized=False, name=None)

Bases: object

Abstract class for any data sources that we’ll import and process. Each of the subclasses will fetch() the data, scrub() it as necessary, then parse() it into a graph. The graph will then be written out to a single self.name().<dest_fmt> file.

checkIfRemoteIsNewer(remote, local, headers)

Given a remote file location and the corresponding local file, this will check the datetime stamp on the files to see if the remote one is newer. This is a convenience method so that we don’t have to re-fetch files that we already have saved locally.

Parameters:
  • remote – URL of file to fetch from remote server
  • local – pathname to save file to locally

Returns:True if the remote file is newer and should be downloaded
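The timestamp comparison can be sketched as a standalone function (a hypothetical re-implementation, not dipper's actual code; the HEAD-request approach and the `remote_is_newer` name are assumptions):

```python
import os
import urllib.request
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def remote_is_newer(remote_url, local_path):
    # If the local file is missing, we must download regardless.
    if not os.path.exists(local_path):
        return True
    # Ask the server for headers only (HEAD), avoiding a full download.
    req = urllib.request.Request(remote_url, method='HEAD')
    with urllib.request.urlopen(req) as resp:
        last_mod = resp.headers.get('Last-Modified')
    if last_mod is None:
        # No timestamp available; assume we need to re-fetch.
        return True
    remote_dt = parsedate_to_datetime(last_mod)
    local_dt = datetime.fromtimestamp(os.path.getmtime(local_path),
                                      tz=timezone.utc)
    return remote_dt > local_dt
```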

compare_local_remote_bytes(remotefile, localfile, remote_headers=None)

Test whether the fetched file is the same size as the remote file, using the content-length field in the HTTP header.

Returns:True or False
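The size check reduces to comparing the local byte count against the Content-Length header. A minimal sketch (the `sizes_match` helper and its missing-header behavior are assumptions, not dipper's implementation):

```python
import os

def sizes_match(localfile, remote_headers):
    # Compare the local file's size in bytes against the HTTP
    # Content-Length header reported by the remote server.
    remote_len = remote_headers.get('Content-Length')
    if remote_len is None:
        # Assumption: treat a missing header as a mismatch.
        return False
    return os.path.getsize(localfile) == int(remote_len)
```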

declareAsOntology(graph)

The file we output needs to be declared as an ontology, including its version information.

TEC: I am not convinced dipper reformatting external data as RDF triples makes an OWL ontology (nor that it should be considered a goal).

Proper ontologies are built by ontologists. Dipper reformats data and annotates/decorates it with a minimal set of carefully arranged terms drawn from multiple proper ontologies, which allows the whole (dipper’s RDF triples and parent ontologies) to function as a single ontology we can reason over when combined in a store such as SciGraph.

Including more than the minimal ontological terms in dipper’s RDF output constitutes a liability as it allows greater divergence between dipper artifacts and the proper ontologies.

Further information will be added to the dataset object.

Parameters:version
Returns:

fetch(is_dl_forced=False)

Abstract method to fetch all data from an external resource. This should be overridden by subclasses.

Returns:None

fetch_from_url(remotefile, localfile=None, is_dl_forced=False, headers=None)

Given a remote URL and a local filename, attempt to determine if the remote file is newer; if it is, fetch the remote file and save it to the specified localfile, reporting the basic file information once it is downloaded.

Parameters:
  • remotefile – URL of remote file to fetch
  • localfile – pathname of file to save locally

Returns:None

file_len(fname)
files = {}
getTestSuite()

An abstract method that should be overridden with tests appropriate for the specific source.

Returns:

static get_eco_map(url)

To convert the three-column file to a hashmap, we join the primary and secondary keys. For example:

IEA  GO_REF:0000002  ECO:0000256
IEA  GO_REF:0000003  ECO:0000501
IEA  Default         ECO:0000501

becomes

IEA-GO_REF:0000002: ECO:0000256
IEA-GO_REF:0000003: ECO:0000501
IEA: ECO:0000501

Returns:dict
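The key-joining rule can be sketched as follows (a hypothetical re-implementation working on already-fetched lines; the `build_eco_map` name and tab-separated columns are assumptions):

```python
def build_eco_map(lines):
    # Join primary and secondary keys with '-'; a secondary key of
    # 'Default' means the evidence code alone maps to the ECO term.
    eco_map = {}
    for line in lines:
        code, go_ref, eco = line.strip().split('\t')  # assumption: tab-separated
        if go_ref == 'Default':
            eco_map[code] = eco
        else:
            eco_map[code + '-' + go_ref] = eco
    return eco_map
```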
get_file_md5(directory, file, blocksize=1048576)
get_files(is_dl_forced, files=None)

Given a set of files for this source, it will go fetch them and set a default version by date. If you need to set the version number by another method, then it can be set again.

Parameters:
  • is_dl_forced – boolean
  • files – dict, override the instance files dict

Returns:None

get_local_file_size(localfile)
Parameters:localfile
Returns:size of file
get_remote_content_len(remote, headers=None)
Parameters:remote
Returns:size of remote file
static hash_id(long_string)

Return the sha1 hash of a string, truncated to a 64-bit-sized word, with ‘b’ prepended so the identifier does not lead with a digit.

Parameters:long_string – str, string to be hashed
Returns:str, hash of id
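A minimal sketch of this hashing scheme (the exact truncation to 16 hex characters is an assumption about how "64-bit-sized word" is realized, not a copy of dipper's code):

```python
import hashlib

def hash_id(long_string):
    # sha1-digest the input, keep a 64-bit-sized word (16 hex characters,
    # an assumed truncation), and prepend 'b' so the identifier never
    # starts with a digit.
    digest = hashlib.sha1(long_string.encode('utf-8')).hexdigest()
    return 'b' + digest[:16]
```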

static make_id(long_string, prefix='MONARCH')

A method to create DETERMINISTIC identifiers based on a string’s digest; currently implemented with sha1.

Parameters:long_string
Returns:
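The deterministic property means the same input always yields the same identifier. A sketch under assumptions (the ‘:’ separator, the ‘b’ prefix on the digest, and the 16-character truncation are illustrative, not confirmed by the source):

```python
import hashlib

def make_id(long_string, prefix='MONARCH'):
    # Deterministic CURIE-style identifier: identical inputs always
    # produce identical outputs, so repeated ingests stay stable.
    digest = hashlib.sha1(long_string.encode('utf-8')).hexdigest()
    return prefix + ':b' + digest[:16]
```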

namespaces = {}
static open_and_parse_yaml(file)
Parameters:file – String, path to file containing label-id mappings in the first two columns of each row
Returns:dict where keys are labels and values are ids
parse(limit)

Abstract method to parse all data from an external resource that was fetched in fetch(). This should be overridden by subclasses.

Returns:None

static parse_mapping_file(file)
Parameters:file – String, path to file containing label-id mappings in the first two columns of each row
Returns:dict where keys are labels and values are ids
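Building the label-to-id dict from the first two columns can be sketched as follows (a hypothetical standalone version; tab-delimited columns are an assumption):

```python
def parse_mapping_file(path):
    # Build a {label: id} dict from the first two columns of each row.
    mapping = {}
    with open(path) as fh:
        for line in fh:
            cols = line.rstrip('\n').split('\t')  # assumption: tab-delimited
            if len(cols) >= 2:
                mapping[cols[0]] = cols[1]
    return mapping
```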
process_xml_table(elem, table_name, processing_function, limit)

This is a convenience function to process the elements of an XML document when the XML is used as an alternative way of distributing sql-like tables. In this case, the “elem” is akin to an sql table, with its name given by `table_name`. Each `row` is then processed with the supplied `processing_function`.

Parameters:
  • elem – The element data
  • table_name – The name of the table to process
  • processing_function – The row processing function
  • limit

Appears to be making calls to the ElementTree library, although it is not explicitly imported here.

Returns:
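A minimal sketch of this row-processing pattern with `xml.etree.ElementTree` (a hypothetical re-implementation; the `<row>` child-tag name and the `elem.clear()` call are assumptions):

```python
import xml.etree.ElementTree as ET

def process_xml_table(elem, table_name, processing_function, limit=None):
    # Treat each child <row> of the table element as an sql-like record
    # and hand it to the supplied row-processing function.
    if elem.tag != table_name:
        return
    for i, row in enumerate(elem.findall('row')):
        if limit is not None and i >= limit:
            break
        processing_function(row)
    elem.clear()  # release the subtree; useful when streaming with iterparse
```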
static remove_backslash_r(filename, encoding)

A helpful utility to remove Carriage Return from any file. This will read a file into memory, and overwrite the contents of the original file.

TODO: This function may be a liability

Parameters:filename
Returns:
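The read-into-memory-then-overwrite behavior can be sketched as follows (a hypothetical standalone version, which also illustrates why the TODO flags it as a liability for large files):

```python
import re

def remove_backslash_r(filename, encoding):
    # Read the whole file into memory (the liability for large files),
    # strip every carriage return, and overwrite the original in place.
    # newline='' disables newline translation so '\r' reaches us intact.
    with open(filename, 'r', encoding=encoding, newline='') as fh:
        contents = fh.read()
    with open(filename, 'w', encoding=encoding, newline='') as fh:
        fh.write(re.sub(r'\r', '', contents))
```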
settestmode(mode)

Set testMode to (mode).
  • True: run the Source in testMode
  • False: run it in full mode

Parameters:mode
Returns:None

settestonly(testonly)

Set that this source should only be processed in testMode.

Parameters:testonly
Returns:None

whoami()
write(fmt='turtle', stream=None)

This convenience method will write out all of the graphs associated with the source. Right now these are hardcoded to be a single “graph”, a “src_dataset.ttl”, and a “src_test.ttl”. Unless you supply stream=’stdout’, it will write these to files by default.

In addition, if the version number isn’t yet set in the dataset, it will be set to the date on file.

Returns:None