dipper.sources.Source module¶
-
class
dipper.sources.Source.
Source
(graph_type='rdf_graph', are_bnodes_skized=False, data_release_version=None, name=None, ingest_title=None, ingest_url=None, ingest_logo=None, ingest_description=None, license_url=None, data_rights=None, file_handle=None)¶ Bases:
object
Abstract class for any data sources that we’ll import and process. Each of the subclasses will fetch() the data, scrub() it as necessary, then parse() it into a graph. The graph will then be written out to a single self.name().<dest_fmt> file.
Also provides a means to marshal metadata in a consistent fashion
Houses the global translation table (from ontology label to ontology term) so it may as well be used everywhere.
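The fetch(), parse(), write life cycle described above can be sketched with a toy stand-in. This does not import dipper: `ToySource`, `ToyIngest`, `run`, and the list of triples standing in for a graph are all hypothetical names used only to illustrate the subclassing pattern.

```python
class ToySource:
    """Hypothetical stand-in for an abstract source; not dipper's class."""

    def __init__(self, name):
        self.name = name
        self.graph = []  # a list of triples stands in for an RDF graph

    def fetch(self, is_dl_forced=False):
        raise NotImplementedError("subclasses fetch their own data")

    def parse(self, limit=None):
        raise NotImplementedError("subclasses parse their own data")

    def write(self):
        # Serialize each triple on its own line.
        return "\n".join(" ".join(triple) + " ." for triple in self.graph)


class ToyIngest(ToySource):
    def fetch(self, is_dl_forced=False):
        # A real ingest would download files; here the data is inlined.
        self.raw = ["a\tb", "c\td"]

    def parse(self, limit=None):
        for count, line in enumerate(self.raw):
            if limit is not None and count >= limit:
                break
            subj, obj = line.split("\t")
            self.graph.append((subj, "related_to", obj))


def run(limit=None):
    ingest = ToyIngest("toy")
    ingest.fetch()
    ingest.parse(limit)
    return ingest.write()
```

Calling `run()` drives the three steps in order; passing a `limit` stops parsing early, in the spirit of the `limit` argument to parse().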
-
ARGV
= {}¶
-
DIPPERCACHE
= 'https://archive.monarchinitiative.org/DipperCache'¶
-
static
check_fileheader
(expected, received, src_key=None)¶ Compare the file headers received against the file headers expected. If the expected headers are a subset (proper or not) of the received headers, report success (warn if it is a proper subset).
:param expected: list :param received: list
:return: truthiness
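A minimal sketch of the subset check described above, assuming that column order is ignored and that problems are reported through the logging module. `headers_match` is a hypothetical name, not dipper's implementation.

```python
import logging

LOG = logging.getLogger(__name__)


def headers_match(expected, received):
    """Return True when every expected column appears in received."""
    expected_set, received_set = set(expected), set(received)
    missing = expected_set - received_set
    if missing:
        LOG.error("Missing expected columns: %s", sorted(missing))
        return False
    extra = received_set - expected_set
    if extra:
        # Expected is a proper subset of received: succeed, but warn.
        LOG.warning("Unexpected extra columns: %s", sorted(extra))
    return True
```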
-
check_if_remote_is_newer
(remote, local, headers)¶ Given a remote file location and the corresponding local file, check the datetime stamps on the files to see whether the remote one is newer. This is a convenience method so that we don’t have to re-fetch files that we already have saved locally. :param remote: URL of file to fetch from remote server :param local: pathname to save file to locally :return: True if the remote file is newer and should be downloaded
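One way such a comparison could work, as a sketch: parse the HTTP Last-Modified value and compare it with the local file's modification time. `remote_is_newer` is a hypothetical helper that takes the header value directly, so no network access is needed; treating a missing local file as "newer" is an assumption.

```python
import os
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def remote_is_newer(last_modified, local_path):
    """Compare an HTTP Last-Modified value with a local file's mtime."""
    if not os.path.exists(local_path):
        return True  # assumption: nothing saved locally means "fetch it"
    remote_dt = parsedate_to_datetime(last_modified)
    if remote_dt.tzinfo is None:
        # A "-0000" zone parses as naive; treat it as UTC.
        remote_dt = remote_dt.replace(tzinfo=timezone.utc)
    local_dt = datetime.fromtimestamp(
        os.path.getmtime(local_path), tz=timezone.utc)
    return remote_dt > local_dt
```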
-
command_args
()¶ To make arbitrary variables from dipper-etl.py’s calling environment available when working in source ingests in a hopefully universal way
Does not appear to be populated until after an ingest’s __init__() finishes.
-
compare_local_remote_bytes
(remotefile, localfile, remote_headers=None)¶ Test whether the fetched file is the same size as the remote file, using the information in the Content-Length field of the HTTP header. :return: True or False
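A sketch of the size comparison, assuming the response headers are already available as a plain dict. `same_size` is a hypothetical name, and treating a missing Content-Length as a mismatch is an assumption.

```python
import os


def same_size(local_path, remote_headers):
    """Compare a local file's size with a Content-Length header value."""
    content_length = remote_headers.get("Content-Length")
    if content_length is None:
        return False  # assumption: unknown remote size counts as a mismatch
    return os.path.getsize(local_path) == int(content_length)
```

A plain dict lookup is case-sensitive, whereas HTTP header names are not; a real client's header object would normally handle that.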
-
fetch
(is_dl_forced=False)¶ Abstract method to fetch all data from an external resource. This should be overridden by subclasses. :return: None
-
fetch_from_url
(remoteurl, localfile=None, is_dl_forced=False, headers=None)¶ Given a remote URL and a local filename, attempt to determine whether the remote file is newer. If it is, fetch the remote file and save it to the specified localfile, reporting the basic file information once it is downloaded. :param remoteurl: URL of remote file to fetch :param localfile: pathname of file to save locally
Returns: bool
-
static
file_len
(fname)¶
-
files
= {}¶
-
getTestSuite
()¶ An abstract method that should be overridden with tests appropriate for the specific source. :return:
-
static
get_file_md5
(directory, filename, blocksize=1048576)¶
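The signature suggests a block-wise digest. A sketch under that assumption, reading `blocksize` bytes at a time so that large files never have to fit in memory; `file_md5` is a hypothetical name, and the hex-string return value is a guess.

```python
import hashlib
import os


def file_md5(directory, filename, blocksize=2 ** 20):
    """Return the hex MD5 digest of a file, read in fixed-size blocks."""
    md5 = hashlib.md5()
    with open(os.path.join(directory, filename), "rb") as fh:
        # iter() with a sentinel stops at the empty bytes object (EOF).
        for block in iter(lambda: fh.read(blocksize), b""):
            md5.update(block)
    return md5.hexdigest()
```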
-
get_files
(is_dl_forced, files=None, delay=0)¶ Given a set of files for this source, it will go fetch them, and set a default version by date. If you need to set the version number by another method, then it can be set again. :param is_dl_forced: boolean :param files: dict, overrides the instance files dict :return: None
-
static
get_local_file_size
(localfile)¶ Parameters: localfile – Returns: size of file
-
get_remote_content_len
(remote, headers=None)¶ Parameters: remote – Returns: size of remote file
-
static
hash_id
(wordage)¶ Return a truncated sha1 hash of the string, prefixed with ‘b’ to avoid leading with a digit. The result is a 20 character word, including the leading ‘b’.
- by the birthday paradox;
- expect 50% chance of collision after 69 billion invocations; however, these are only hoped to be unique within a single file
Consider reducing to 17 chars to fit in a 64 bit word; the 16 hex chars that remain after discounting the leading constant give a 50% chance of collision at about 4.3 billion unique input strings (currently _many_ orders of magnitude below that)
Parameters: wordage – str string to be hashed Returns: str hash of id
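A sketch of the scheme as described: take a sha1 hex digest, keep 19 hex characters, and prepend ‘b’. Which 19 characters are kept is an assumption here (the first 19), and `hash_id_sketch` is a hypothetical name rather than the library's code.

```python
import hashlib


def hash_id_sketch(wordage):
    """Return 'b' plus 19 hex chars of the sha1 digest (20 chars total)."""
    digest = hashlib.sha1(wordage.encode("utf-8")).hexdigest()
    # Assumption: keep the leading 19 hex characters of the digest.
    return "b" + digest[:19]
```

The leading letter guarantees that the identifier never starts with a digit, whatever the digest happens to be.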
-
load_local_translationtable
(name)¶ Load the “ingest specific” translation table, which maps whatever the source called something to the ontology label we need to map it to. To facilitate seeing more ontology labels in dipper ingests, a reverse mapping from ontology labels to external strings is also generated and is available as the dict localtcid.
'---
# %s.yaml "": "" # example'
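The reverse mapping mentioned above amounts to inverting a dict. A sketch, assuming the local table has already been loaded into a plain dict; `invert_table` is a hypothetical name.

```python
def invert_table(localtt):
    """Build a label -> external string dict from an external -> label dict."""
    # If several external strings share one label, the last one seen wins.
    return {label: external for external, label in localtt.items()}
```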
-
static
make_id
(long_string, prefix='MONARCH')¶ A method to create DETERMINISTIC identifiers based on a string’s digest. Currently implemented with sha1. :param long_string: :return:
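A sketch of a deterministic, prefixed identifier. Joining the prefix and digest with a colon, and truncating the sha1 digest to 19 hex characters behind a leading ‘b’, are both assumptions; `make_id_sketch` is a hypothetical name.

```python
import hashlib


def make_id_sketch(long_string, prefix="MONARCH"):
    """Return a prefixed identifier derived from the string's sha1 digest."""
    digest = hashlib.sha1(long_string.encode("utf-8")).hexdigest()
    # Assumption: CURIE-like "PREFIX:local_id" layout with a truncated digest.
    return prefix + ":" + "b" + digest[:19]
```

Because the identifier depends only on the input string, re-running an ingest yields the same identifiers for the same records.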
-
namespaces
= {}¶
-
static
open_and_parse_yaml
(yamlfile)¶ Parameters: yamlfile – String, path to file containing label-id mappings in the first two columns of each row Returns: dict where keys are labels and values are ids
-
parse
(limit)¶ Abstract method to parse all data from an external resource that was fetched in fetch(). This should be overridden by subclasses. :return: None
-
static
parse_mapping_file
(file)¶ Parameters: file – String, path to file containing label-id mappings in the first two columns of each row Returns: dict where keys are labels and values are ids
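A sketch of reading label-id pairs from the first two columns of each row. The tab delimiter is an assumption (the text does not name one), and `parse_mapping_sketch` is a hypothetical name.

```python
import csv


def parse_mapping_sketch(path, delimiter="\t"):
    """Map column 1 (label) to column 2 (id) for every row in the file."""
    id_map = {}
    with open(path, newline="", encoding="utf-8") as fh:
        for row in csv.reader(fh, delimiter=delimiter):
            if len(row) < 2:
                continue  # skip blank or single-column rows
            id_map[row[0]] = row[1]
    return id_map
```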
-
process_xml_table
(elem, table_name, processing_function, limit)¶ This is a convenience function to process the elements of an XML dump of a MySQL relational database. The “elem” is akin to a MySQL table, with its name of `table_name`. It will process each `row` given the `processing_function` supplied. :param elem: The element data :param table_name: The name of the table to process :param processing_function: The row processing function :param limit:
Appears to be making calls to the ElementTree library, although it is not explicitly imported here.
Returns:
-
static
remove_backslash_r
(filename, encoding)¶ A helpful utility to remove Carriage Return from any file. This will read a file into memory, and overwrite the contents of the original file.
TODO: This function may be a liability
Parameters: filename – Returns:
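A sketch of the read-then-overwrite approach described above. `strip_carriage_returns` is a hypothetical name. As the text warns, the whole file is held in memory, and the original is overwritten in place.

```python
def strip_carriage_returns(filename, encoding):
    """Remove every carriage return from a file, rewriting it in place."""
    # newline="" disables newline translation so "\r" is seen as written.
    with open(filename, "r", encoding=encoding, newline="") as fh:
        contents = fh.read()
    with open(filename, "w", encoding=encoding, newline="") as fh:
        fh.write(contents.replace("\r", ""))
```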
-
resolve
(word, mandatory=True, default=None)¶ Composite mapping: given f(x) and g(x), here localtt and globaltt respectively, return g(f(x)) | g(x) | f(x) | x in order of preference. Returns x | default on fall through if finding a mapping is not mandatory (by default, finding is mandatory).
This may be specialized further from any mapping to a global mapping only; if need be.
Parameters: - word – the string to find as a key in translation tables
- mandatory – boolean to cause failure when no key exists
- default – string to return if nothing is found (and not mandatory)
- :return
- value from global translation table, or value from local translation table, or the query key if finding a value is not mandatory (in this order)
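The order of preference above can be sketched with two plain dicts. `resolve_sketch` is a hypothetical free function that takes the tables as arguments, and raising KeyError when a mandatory lookup fails is an assumption about what "failure" means.

```python
def resolve_sketch(word, localtt, globaltt, mandatory=True, default=None):
    """Resolve word through local then global tables, in preference order."""
    if word in localtt and localtt[word] in globaltt:
        return globaltt[localtt[word]]  # g(f(x))
    if word in globaltt:
        return globaltt[word]           # g(x)
    if word in localtt:
        return localtt[word]            # f(x)
    if mandatory:
        raise KeyError("no translation found for: " + word)
    # Fall through: the caller's default if given, else the word itself.
    return word if default is None else default
```

For example, with `localtt = {"het": "heterozygous"}` and `globaltt = {"heterozygous": "EX:0000001"}` (a placeholder identifier), `resolve_sketch("het", localtt, globaltt)` returns `"EX:0000001"`.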
-
settestmode
(mode)¶ Set testMode to (mode). - True: run the Source in testMode; - False: run it in full mode :param mode: :return: None
-
settestonly
(testonly)¶ Set that this source should only be processed in testMode :param testonly: :return: None
-
whoami
()¶ pointless convenience
-
write
(fmt='turtle', stream=None, write_metadata_in_main_graph=True)¶ This convenience method will write out all of the graphs associated with the source.
Right now these are hardcoded to be a single main “graph” and a “src_dataset.ttl” and a “src_test.ttl”. If you do not supply stream=’stdout’, it will default to writing these to files.
In addition, if the version number isn’t yet set in the dataset, it will be set to the date on file. :return: None
-