dipper.sources.Source module

class dipper.sources.Source.Source(graph_type='rdf_graph', are_bnodes_skized=False, data_release_version=None, name=None, ingest_title=None, ingest_url=None, ingest_logo=None, ingest_description=None, license_url=None, data_rights=None, file_handle=None)

Bases: object

Abstract class for any data sources that we’ll import and process. Each of the subclasses will fetch() the data, scrub() it as necessary, then parse() it into a graph. The graph will then be written out to a single self.name().<dest_fmt> file.

Also provides a means to marshal metadata in a consistent fashion.

Houses the global translation table (from ontology label to ontology term) so it may as well be used everywhere.
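The fetch-then-parse life cycle described above can be sketched as follows. This is a minimal illustration only: `SketchSource` and `ToySource` are hypothetical stand-ins, not the real dipper classes, and the in-memory rows replace any real download or graph.

```python
class SketchSource:
    """Hypothetical stand-in for the abstract Source class (not dipper's code)."""

    def __init__(self, name):
        self.name = name
        self.rows = []
        self.triples = []  # stands in for the graph

    def fetch(self, is_dl_forced=False):
        raise NotImplementedError("subclasses override fetch()")

    def parse(self, limit=None):
        raise NotImplementedError("subclasses override parse()")


class ToySource(SketchSource):
    """Hypothetical ingest whose 'remote data' is two in-memory rows."""

    def fetch(self, is_dl_forced=False):
        self.rows = [("ex:1", "label", "first"), ("ex:2", "label", "second")]

    def parse(self, limit=None):
        for count, row in enumerate(self.rows):
            if limit is not None and count >= limit:
                break  # stop once the requested number of rows is parsed
            self.triples.append(row)


source = ToySource("toy")
source.fetch()
source.parse(limit=1)
```

After these calls `source.triples` holds only the first row, because `limit=1` stops parsing early.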

ARGV = {}
DIPPERCACHE = 'https://archive.monarchinitiative.org/DipperCache'
static check_fileheader(expected, received, src_key=None)

Compare the file headers received against the file headers expected. If the expected headers are a subset (proper or not) of the received headers, report success (warn if it is a proper subset).

:param expected: list
:param received: list
:return: truthiness
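The subset rule can be sketched with Python sets. This is an assumption-laden illustration, not the library's implementation: `headers_match` is a hypothetical name, column order is ignored, and the warning is just printed.

```python
def headers_match(expected, received):
    """Return True when every expected column appears in the received header."""
    expected_set, received_set = set(expected), set(received)
    if not expected_set <= received_set:
        return False  # at least one expected column is missing
    if expected_set < received_set:
        # Proper subset: the file carries columns we did not expect.
        extra = sorted(received_set - expected_set)
        print("warning: extra columns in received header:", extra)
    return True


exact = headers_match(["id", "label"], ["id", "label"])
with_extra = headers_match(["id", "label"], ["id", "label", "comment"])
missing = headers_match(["id", "label"], ["id"])
```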

check_if_remote_is_newer(remote, local, headers)

Given a remote file location and the corresponding local file, check the datetime stamps on the files to see whether the remote one is newer. This is a convenience method so that we don’t have to re-fetch files that are already saved locally.

:param remote: URL of file to fetch from remote server
:param local: pathname to save file to locally
:return: True if the remote file is newer and should be downloaded
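One way to make such a comparison is to parse an HTTP `Last-Modified` value and compare it with the local file's modification time. The sketch below assumes that header is where the remote timestamp comes from and that a missing local file counts as "remote is newer"; `remote_is_newer` is a hypothetical helper that takes the header value directly, so no network access is needed.

```python
import os
import tempfile
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def remote_is_newer(last_modified, local_path):
    """Compare an HTTP Last-Modified value with a local file's mtime."""
    if not os.path.exists(local_path):
        return True  # assumption: nothing saved locally yet, so fetch
    remote_time = parsedate_to_datetime(last_modified)
    if remote_time.tzinfo is None:
        # Treat a zone-less timestamp as UTC so the comparison is valid.
        remote_time = remote_time.replace(tzinfo=timezone.utc)
    local_time = datetime.fromtimestamp(os.path.getmtime(local_path), tz=timezone.utc)
    return remote_time > local_time


with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "data.tsv")
    with open(path, "w") as handle:
        handle.write("example\n")
    # Pretend the local copy was saved on 2020-01-01 (UTC).
    saved = datetime(2020, 1, 1, tzinfo=timezone.utc).timestamp()
    os.utime(path, (saved, saved))
    older_remote = remote_is_newer("Wed, 21 Oct 2015 07:28:00 GMT", path)
    newer_remote = remote_is_newer("Fri, 01 Jan 2021 00:00:00 GMT", path)
    missing_local = remote_is_newer("Fri, 01 Jan 2021 00:00:00 GMT", path + ".absent")
```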

command_args()

To make arbitrary variables from dipper-etl.py’s calling environment available when working in source ingests in a hopefully universal way.

Does not appear to be populated until after an ingest’s __init__() finishes.

compare_local_remote_bytes(remotefile, localfile, remote_headers=None)

Test whether the fetched file is the same size as the remote file, using the Content-Length field in the HTTP header.

:return: True or False
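A size check of this kind might look like the sketch below. It assumes the headers arrive as a plain dict and that a missing Content-Length is treated as a mismatch; both choices, and the name `sizes_match`, are illustrative rather than taken from the library.

```python
import os
import tempfile


def sizes_match(remote_headers, local_path):
    """Compare the Content-Length header with the local file size in bytes."""
    content_length = remote_headers.get("Content-Length")
    if content_length is None:
        return False  # assumption: an unknown remote size counts as a mismatch
    return int(content_length) == os.path.getsize(local_path)


with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "data.bin")
    with open(path, "wb") as handle:
        handle.write(b"12345")  # five bytes on disk
    same_size = sizes_match({"Content-Length": "5"}, path)
    different_size = sizes_match({"Content-Length": "9"}, path)
    unknown_size = sizes_match({}, path)
```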

fetch(is_dl_forced=False)

Abstract method to fetch all data from an external resource. This should be overridden by subclasses.

:return: None

fetch_from_url(remoteurl, localfile=None, is_dl_forced=False, headers=None)

Given a remote URL and a local filename, attempt to determine whether the remote file is newer; if it is, fetch the remote file and save it to the specified localfile, reporting the basic file information once it is downloaded.

:param remoteurl: URL of remote file to fetch
:param localfile: pathname of file to save locally

Returns:bool
static file_len(fname)
files = {}
getTestSuite()

An abstract method that should be overridden with tests appropriate for the specific source.

:return:

static get_file_md5(directory, filename, blocksize=1048576)
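The `blocksize` argument suggests the file is hashed in chunks rather than read whole. A minimal sketch of that pattern follows; it is an assumption about the approach, and `file_md5` is a hypothetical name.

```python
import hashlib
import os
import tempfile


def file_md5(directory, filename, blocksize=2 ** 20):
    """Return the hex MD5 digest of a file, reading blocksize bytes at a time."""
    md5 = hashlib.md5()
    with open(os.path.join(directory, filename), "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for block in iter(lambda: handle.read(blocksize), b""):
            md5.update(block)
    return md5.hexdigest()


with tempfile.TemporaryDirectory() as tmpdir:
    with open(os.path.join(tmpdir, "sample.txt"), "wb") as handle:
        handle.write(b"hello")
    # A tiny blocksize forces several reads, exercising the chunked loop.
    digest = file_md5(tmpdir, "sample.txt", blocksize=2)
```

Reading in blocks keeps memory use bounded regardless of file size.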
get_files(is_dl_forced, files=None, delay=0)

Given a set of files for this source, fetch them and set a default version by date. If you need to set the version number by another method, it can be set again.

:param is_dl_forced: boolean
:param files: dict, overrides the instance files dict
:return: None

static get_local_file_size(localfile)
Parameters:localfile
Returns:size of file
get_remote_content_len(remote, headers=None)
Parameters:remote
Returns:size of remote file
static hash_id(wordage)

Return a truncated SHA-1 hash of the string: a 20-character word with a leading ‘b’, which is prepended to avoid leading with a digit.

By the birthday paradox, expect a 50% chance of collision after 69 billion invocations; however, these are only hoped to be unique within a single file.

Consider reducing to 17 chars to fit in a 64-bit word: the 16 hex chars left after discounting the leading constant give a 50% chance of collision at about 4.3 billion unique input strings (currently _many_ orders of magnitude below that).

Parameters:wordage – str, string to be hashed
Returns:str hash of id
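The scheme can be sketched as below. Which 19 hex digits of the SHA-1 digest are kept is an assumption here (the first 19), and `sketch_hash_id` and `birthday_bound` are hypothetical names. The bound uses the square-root rule of thumb for birthday collisions, which reproduces the 4.3 billion figure quoted for a 64-bit hash.

```python
import hashlib
import math


def sketch_hash_id(wordage):
    """Return 'b' plus 19 hex digits of the SHA-1 digest (20 characters total)."""
    digest = hashlib.sha1(wordage.encode("utf-8")).hexdigest()
    return "b" + digest[:19]  # assumption: keep the leading 19 hex digits


def birthday_bound(hex_chars):
    """Order-of-magnitude collision estimate: square root of the hash space."""
    return math.sqrt(16 ** hex_chars)


example_id = sketch_hash_id("some long string")
repeat_id = sketch_hash_id("some long string")
bound_64_bit = birthday_bound(16)  # 16 hex chars = 64 bits
```

The leading letter guarantees the identifier never starts with a digit, and hashing the same input twice yields the same identifier.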
load_local_translationtable(name)

Load the “ingest specific” translation from whatever the source called something to the ontology label we need to map it to. To facilitate seeing more ontology labels in dipper ingests, a reverse mapping from ontology labels to external strings is also generated and made available as the dict localtcid.

'---
# %s.yaml
"": "" # example'
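The reverse mapping mentioned in this entry amounts to inverting a dict. In this sketch the table contents are hypothetical, and a plain dict literal stands in for the loaded YAML file.

```python
# Hypothetical local translation table: external string -> ontology label.
localtt = {
    "het": "heterozygous",
    "hom": "homozygous",
}

# Reverse mapping: ontology label -> external string.
localtcid = {label: external for external, label in localtt.items()}
```

If several external strings map to the same label, a plain inversion like this keeps only the last one.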

static make_id(long_string, prefix='MONARCH')

A method to create DETERMINISTIC identifiers based on a string’s digest. Currently implemented with SHA-1.

:param long_string:
:return:

namespaces = {}
static open_and_parse_yaml(yamlfile)
Parameters:yamlfile – String, path to the YAML file containing label-id mappings
Returns:dict where keys are labels and values are ids
parse(limit)

Abstract method to parse all data from an external resource that was fetched in fetch(). This should be overridden by subclasses.

:return: None

static parse_mapping_file(file)
Parameters:file – String, path to file containing label-id mappings in the first two columns of each row
Returns:dict where keys are labels and values are ids
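Reading label-id pairs from the first two columns can be sketched with the csv module. The tab delimiter is an assumption, the sample identifiers are made up, and `parse_two_column_mapping` is a hypothetical helper that takes an open handle instead of a path so the example needs no file on disk.

```python
import csv
import io


def parse_two_column_mapping(handle, delimiter="\t"):
    """Build {label: id} from the first two columns of each row."""
    mapping = {}
    for row in csv.reader(handle, delimiter=delimiter):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        label, identifier = row[0], row[1]
        mapping[label] = identifier
    return mapping


sample = io.StringIO("label one\tEX:001\tignored column\nlabel two\tEX:002\n\n")
mapping = parse_two_column_mapping(sample)
```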
process_xml_table(elem, table_name, processing_function, limit)

This is a convenience function to process the elements of an XML dump of a MySQL relational database. The “elem” is akin to a MySQL table, with its name given by `table_name`. Each `row` is processed with the supplied `processing_function`.

:param elem: The element data
:param table_name: The name of the table to process
:param processing_function: The row processing function
:param limit:

Appears to make calls to the ElementTree library, although it is not explicitly imported here.

Returns:
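A row-by-row pass over such a dump could look like the sketch below. It assumes the `table_data` / `row` / `field` layout that `mysqldump --xml` produces and applies `limit` as a simple row cap; `process_table` and the sample data are hypothetical, and the real method may handle `limit` differently.

```python
import xml.etree.ElementTree as ET


def process_table(elem, table_name, processing_function, limit=None):
    """Call processing_function on each row of the named table; return the count."""
    table = elem.find("table_data[@name='%s']" % table_name)
    if table is None:
        return 0  # the dump has no such table
    count = 0
    for row_elem in table.findall("row"):
        if limit is not None and count >= limit:
            break
        # Turn <field name="x">value</field> children into a dict.
        row = {field.get("name"): field.text for field in row_elem.findall("field")}
        processing_function(row)
        count += 1
    return count


SAMPLE = """
<database name="example">
  <table_data name="genes">
    <row><field name="id">1</field><field name="symbol">abc</field></row>
    <row><field name="id">2</field><field name="symbol">def</field></row>
  </table_data>
</database>
"""

seen = []
processed = process_table(ET.fromstring(SAMPLE.strip()), "genes", seen.append, limit=1)
absent = process_table(ET.fromstring(SAMPLE.strip()), "no_such_table", seen.append)
```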
static remove_backslash_r(filename, encoding)

A helpful utility to remove carriage returns from any file. This will read the file into memory and overwrite the contents of the original file.

TODO: This function may be a liability

Parameters:filename
Returns:
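The read-then-overwrite behaviour can be sketched as follows; `strip_carriage_returns` is a hypothetical name and this is not the library's code. Because the whole file is held in memory and the original is overwritten in place, very large files or an interrupted write could be a problem.

```python
import os
import tempfile


def strip_carriage_returns(filename, encoding):
    """Remove every carriage return from a file, rewriting it in place."""
    # newline="" disables newline translation so "\r" characters are seen as-is.
    with open(filename, "r", encoding=encoding, newline="") as handle:
        contents = handle.read()
    with open(filename, "w", encoding=encoding, newline="") as handle:
        handle.write(contents.replace("\r", ""))


with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "sample.txt")
    with open(path, "wb") as handle:
        handle.write(b"a\r\nb\r\n")
    strip_carriage_returns(path, "utf-8")
    with open(path, "rb") as handle:
        cleaned = handle.read()
```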
resolve(word, mandatory=True, default=None)

Composite mapping: given f(x) and g(x) (here localtt and globaltt respectively), return g(f(x)) | g(x) | f(x) | x in order of preference. Returns x | default on fall-through if finding a mapping is not mandatory (by default, finding is mandatory).

If need be, this may be specialized further from any mapping to a global mapping only.

Parameters:
  • word – the string to find as a key in translation tables
  • mandatory – boolean to cause failure when no key exists
  • default – string to return if nothing is found (& not mandatory)
Returns:value from global translation table, or value from local translation table, or the query key if finding a value is not mandatory (in this order)
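The preference order can be sketched with two plain dicts. Several details here are assumptions rather than documented behaviour: the tables are passed in as arguments, a mandatory miss raises KeyError, and `default` wins over the query key when it is supplied. The term identifiers are made up.

```python
def sketch_resolve(word, localtt, globaltt, mandatory=True, default=None):
    """Try g(f(x)), then g(x), then f(x); fall through to default or x."""
    label = localtt.get(word)  # f(x)
    if label is not None and label in globaltt:
        return globaltt[label]  # g(f(x))
    if word in globaltt:
        return globaltt[word]  # g(x)
    if label is not None:
        return label  # f(x)
    if mandatory:
        raise KeyError("no mapping found for: " + word)  # assumed failure mode
    return word if default is None else default  # x | default


localtt = {"het": "heterozygous", "odd": "unmapped label"}
globaltt = {"heterozygous": "EX:0000001", "homozygous": "EX:0000002"}

via_both = sketch_resolve("het", localtt, globaltt)
via_global = sketch_resolve("homozygous", localtt, globaltt)
via_local = sketch_resolve("odd", localtt, globaltt)
passthrough = sketch_resolve("unknown", localtt, globaltt, mandatory=False)
fallback = sketch_resolve("unknown", localtt, globaltt, mandatory=False, default="n/a")
```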
settestmode(mode)

Set testMode to (mode). True: run the Source in testMode; False: run it in full mode.

:param mode:
:return: None

settestonly(testonly)

Set that this source should only be processed in testMode.

:param testonly:
:return: None

whoami()

pointless convenience

write(fmt='turtle', stream=None, write_metadata_in_main_graph=True)
This convenience method will write out all of the graphs associated with the source.

Right now these are hardcoded to be a single main “graph”, a “src_dataset.ttl”, and a “src_test.ttl”. If you do not supply stream='stdout', it will by default write these to files.

In addition, if the version number isn’t yet set in the dataset, it will be set to the date on file.

:return: None