dipper.sources.Source module¶
-
class
dipper.sources.Source.Source(graph_type='rdf_graph', are_bnodes_skized=False, data_release_version=None, name=None, ingest_title=None, ingest_url=None, ingest_logo=None, ingest_description=None, license_url=None, data_rights=None, file_handle=None)¶ Bases: object
Abstract class for any data source that we import and process. Each subclass will fetch() the data, scrub() it as necessary, then parse() it into a graph. The graph is then written out to a single self.name().<dest_fmt> file.
Also provides a means to marshal metadata in a consistent fashion.
Houses the global translation table (from ontology label to ontology term) so it may as well be used everywhere.
-
ARGV= {}¶
-
DIPPERCACHE= 'https://archive.monarchinitiative.org/DipperCache'¶
-
static
check_fileheader(expected, received, src_key=None)¶ Compare the file headers received against the file headers expected. If the expected headers are a subset (proper or not) of the received headers, report success (warn if it is a proper subset).
:param expected: list :param received: list
:return: truthiness
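The subset rule described for check_fileheader can be illustrated with a minimal sketch. The function name `check_fileheader_sketch` is hypothetical and the comparison via Python sets is an assumption; the actual method may also consider column order or use `src_key` for its log messages.

```python
import logging


def check_fileheader_sketch(expected, received):
    """Return True when every expected column name appears in received.

    Hypothetical illustration only, not the dipper implementation.
    """
    expected_set = set(expected)
    received_set = set(received)
    if not expected_set <= received_set:
        # At least one expected column is missing: report failure.
        logging.error("missing columns: %s", sorted(expected_set - received_set))
        return False
    if expected_set < received_set:
        # Proper subset: extra columns arrived, which merits a warning only.
        logging.warning("extra columns: %s", sorted(received_set - expected_set))
    return True
```

Treating extra columns as a warning rather than an error lets an ingest keep running when an upstream file gains a column, while still failing fast when a column it depends on disappears.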
-
check_if_remote_is_newer(remote, local, headers)¶ Given a remote file location and the corresponding local file, this will check the datetime stamps on the files to see if the remote one is newer. This is a convenience method so that we don’t have to re-fetch files that we already have saved locally. :param remote: URL of file to fetch from remote server :param local: pathname to save file to locally :return: True if the remote file is newer and should be downloaded
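One way such a timestamp comparison could work is sketched below. It assumes the remote timestamp arrives as an HTTP Last-Modified header value and that a missing local file always counts as stale; both function names are hypothetical, and the actual method performs the HTTP request itself.

```python
import os
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def header_is_newer(last_modified, local_mtime):
    """Compare an HTTP Last-Modified value with a POSIX mtime (seconds)."""
    remote_dt = parsedate_to_datetime(last_modified)
    if remote_dt.tzinfo is None:
        # Headers ending in "-0000" parse as naive datetimes; treat them as UTC.
        remote_dt = remote_dt.replace(tzinfo=timezone.utc)
    local_dt = datetime.fromtimestamp(local_mtime, tz=timezone.utc)
    return remote_dt > local_dt


def remote_is_newer_sketch(last_modified, localfile):
    """Return True when the file should be (re)downloaded."""
    if not os.path.exists(localfile):
        return True  # nothing cached yet, so fetch
    return header_is_newer(last_modified, os.path.getmtime(localfile))
```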
-
command_args()¶ Makes arbitrary variables from dipper-etl.py’s calling environment available when working in source ingests, in a hopefully universal way.
Does not appear to be populated until after an ingest’s __init__() finishes.
-
compare_local_remote_bytes(remotefile, localfile, remote_headers=None)¶ Test whether the fetched file is the same size as the remote file, using the Content-Length field in the HTTP header. :return: True or False
-
fetch(is_dl_forced=False)¶ Abstract method to fetch all data from an external resource. This should be overridden by subclasses. :return: None
-
fetch_from_url(remoteurl, localfile=None, is_dl_forced=False, headers=None)¶ Given a remote url and a local filename, attempt to determine if the remote file is newer; if it is, fetch the remote file and save it to the specified localfile, reporting the basic file information once it is downloaded :param remoteurl: URL of remote file to fetch :param localfile: pathname of file to save locally
Returns: bool
-
static
file_len(fname)¶
-
files= {}¶
-
getTestSuite()¶ An abstract method that should be overridden with tests appropriate for the specific source. :return:
-
static
get_file_md5(directory, filename, blocksize=1048576)¶
-
get_files(is_dl_forced, files=None, delay=0)¶ Given a set of files for this source, it will go fetch them, and set a default version by date. If you need to set the version number by another method, then it can be set again. :param is_dl_forced: boolean :param files: dict, overrides the instance files dict :return: None
-
static
get_local_file_size(localfile)¶ Parameters: localfile – Returns: size of file
-
get_remote_content_len(remote, headers=None)¶ Parameters: remote – Returns: size of remote file
-
static
hash_id(wordage)¶ Return a truncated sha1 hash of the string: truncate to a 20-character word with a leading ‘b’, which is prepended to avoid leading with a digit.
By the birthday paradox, expect a 50% chance of collision after 69 billion invocations; however, these are only hoped to be unique within a single file.
Consider reducing to 17 hex chars to fit in a 64-bit word; 16, discounting a leading constant, gives a 50% chance of collision at about 4.3 billion unique input strings (currently _many_ orders of magnitude below that).
Parameters: wordage – str string to be hashed Returns: str hash of id
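The description above can be read as the following minimal sketch. The name `hash_id_sketch`, the UTF-8 encoding, and the choice of which 19 hex digits of the digest to keep are assumptions; the real method may slice the digest differently.

```python
import hashlib


def hash_id_sketch(wordage):
    """Return 'b' plus 19 hex digits of the sha1 digest (20 characters total).

    Hypothetical illustration; the exact slice used by dipper is an assumption.
    """
    digest = hashlib.sha1(wordage.encode("utf-8")).hexdigest()
    # The constant leading 'b' guarantees the result never starts with a digit.
    return "b" + digest[:19]
```

Because the result depends only on the input string, the same input always yields the same identifier.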
-
load_local_translationtable(name)¶ Load “ingest specific” translation from whatever they called something to the ontology label we need to map it to. To facilitate seeing more ontology labels in dipper ingests a reverse mapping from ontology labels to external strings is also generated and available as a dict localtcid
'---
# %s.yaml
"": "" # example'
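The reverse mapping mentioned above amounts to inverting a dictionary. A minimal sketch follows, assuming the local table is a plain dict from external strings to ontology labels; the function name is hypothetical, and how the real method treats labels shared by several external strings is not stated in this documentation.

```python
def invert_translation_sketch(localtt):
    """Build a label -> external-string dict from an external-string -> label dict."""
    localtcid = {}
    for external, label in localtt.items():
        # Assumption: if two external strings share a label, the last one wins.
        localtcid[label] = external
    return localtcid
```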
-
static
make_id(long_string, prefix='MONARCH')¶ A method to create DETERMINISTIC identifiers based on a string’s digest. Currently implemented with sha1. :param long_string: :return:
-
namespaces= {}¶
-
static
open_and_parse_yaml(yamlfile)¶ Parameters: yamlfile – String, path to a YAML file containing label-id mappings Returns: dict where keys are labels and values are ids
-
parse(limit)¶ Abstract method to parse all data from an external resource that was fetched in fetch(). This should be overridden by subclasses. :return: None
-
static
parse_mapping_file(file)¶ Parameters: file – String, path to file containing label-id mappings in the first two columns of each row Returns: dict where keys are labels and values are ids
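A minimal sketch of reading label-id pairs from the first two columns of each row is shown below. The tab delimiter, the skipping of short rows, and both function names are assumptions; the documentation above does not state the file's delimiter.

```python
import csv


def parse_mapping_rows(lines, delimiter="\t"):
    """Map column 1 (label) to column 2 (id) for every row with two or more columns."""
    id_map = {}
    for row in csv.reader(lines, delimiter=delimiter):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        id_map[row[0]] = row[1]
    return id_map


def parse_mapping_file_sketch(path, delimiter="\t"):
    """Open the file at path and parse it with parse_mapping_rows."""
    with open(path, newline="", encoding="utf-8") as handle:
        return parse_mapping_rows(handle, delimiter)
```

Any columns after the second are ignored, which matches the "first two columns" wording.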
-
process_xml_table(elem, table_name, processing_function, limit)¶ This is a convenience function to process the elements of an XML dump of a MySQL relational database. The “elem” is akin to a MySQL table, with its name of `table_name`. It will process each `row` given the `processing_function` supplied. :param elem: The element data :param table_name: The name of the table to process :param processing_function: The row processing function :param limit:
Appears to be making calls to the ElementTree library, although it is not explicitly imported here.
Returns:
-
static
remove_backslash_r(filename, encoding)¶ A helpful utility to remove Carriage Return from any file. This will read a file into memory, and overwrite the contents of the original file.
TODO: This function may be a liability
Parameters: filename – Returns:
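The read-into-memory-and-overwrite behaviour described above might look like the sketch below. The function names are hypothetical and this is not the dipper implementation; it also shows why the approach can be a liability, since the whole file is held in memory and the original is overwritten in place.

```python
def strip_carriage_returns(text):
    """Remove every carriage return character from text."""
    return text.replace("\r", "")


def remove_backslash_r_sketch(filename, encoding):
    """Rewrite filename in place without carriage returns."""
    # newline="" disables newline translation so "\r" survives the read intact.
    with open(filename, "r", encoding=encoding, newline="") as handle:
        contents = handle.read()
    # The original file is overwritten; an interruption here could lose data.
    with open(filename, "w", encoding=encoding, newline="") as handle:
        handle.write(strip_carriage_returns(contents))
```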
-
resolve(word, mandatory=True, default=None)¶ Composite mapping: given f(x) and g(x), here localtt and globaltt respectively, return g(f(x))|g(x)||f(x)|x in order of preference. Returns x|default on fall-through if finding a mapping is not mandatory (by default, finding one is mandatory).
This may be specialized further from any mapping to a global mapping only, if need be.
Parameters: - word – the string to find as a key in translation tables
- mandatory – boolean to cause failure when no key exists
- default – string to return if nothing is found (and not mandatory)
Returns: value from global translation table, or value from local translation table, or the query key if finding a value is not mandatory (in this order)
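The order of preference described for resolve can be sketched as a standalone function. It assumes both tables are plain dicts passed in explicitly and that a failed mandatory lookup raises KeyError; the name `resolve_sketch` and the example identifiers in the usage note are made up, and the precedence in the actual method may differ from this reading of the docstring.

```python
def resolve_sketch(word, localtt, globaltt, mandatory=True, default=None):
    """Try g(f(x)), then g(x), then f(x), then fall through to default or x."""
    label = localtt.get(word)
    if label is not None and label in globaltt:
        return globaltt[label]  # g(f(x)): local label found in the global table
    if word in globaltt:
        return globaltt[word]  # g(x): the word is itself a global label
    if label is not None:
        return label  # f(x): local label with no global term
    if mandatory:
        # Assumption: a missing mandatory mapping is signalled with KeyError.
        raise KeyError("Mapping required for: " + word)
    return default if default is not None else word
```

For instance, with `localtt = {"het": "heterozygous"}` and `globaltt = {"heterozygous": "EX:0000001"}`, the call `resolve_sketch("het", localtt, globaltt)` follows the g(f(x)) path, while an unknown word with `mandatory=False` comes back unchanged unless a default is given.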
-
settestmode(mode)¶ Set testMode to (mode). - True: run the Source in testMode; - False: run it in full mode :param mode: :return: None
-
settestonly(testonly)¶ Set that this source should only be processed in testMode. :param testonly: :return: None
-
whoami()¶ Pointless convenience.
-
write(fmt='turtle', stream=None, write_metadata_in_main_graph=True)¶ This convenience method will write out all of the graphs associated with the source.
Right now these are hardcoded to be a single main “graph” and a “src_dataset.ttl” and a “src_test.ttl”. If you do not supply stream='stdout', it will write these to files by default.
In addition, if the version number isn’t yet set in the dataset, it will be set to the date on file. :return: None
-