dipper.sources.Source module
- class dipper.sources.Source.Source(graph_type, are_bnodes_skized=False, name=None)
  Bases: object
  Abstract class for any data sources that we'll import and process. Each of the subclasses will fetch() the data, scrub() it as necessary, then parse() it into a graph. The graph will then be written out to a single self.name().<dest_fmt> file.
- checkIfRemoteIsNewer(remote, local, headers)
  Given a remote file location and the corresponding local file, check the datetime stamps to see whether the remote file is newer. This is a convenience method so that we don't have to re-fetch files that we already have saved locally.
  :param remote: URL of the file to fetch from the remote server
  :param local: pathname to save the file to locally
  :return: True if the remote file is newer and should be downloaded
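A minimal sketch of the timestamp comparison described above. It assumes the remote modification time arrives as an HTTP Last-Modified header string; the real method performs the HTTP request itself, and `remote_is_newer` is an illustrative name, not dipper's API:

```python
import os
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def remote_is_newer(local_path, last_modified_header):
    """Return True when the remote timestamp is newer than the local
    file's mtime (or when no local copy exists yet)."""
    if not os.path.exists(local_path):
        return True  # nothing saved locally, so fetch it
    remote_time = parsedate_to_datetime(last_modified_header)
    local_time = datetime.fromtimestamp(
        os.path.getmtime(local_path), tz=timezone.utc)
    return remote_time > local_time
```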
- compare_local_remote_bytes(remotefile, localfile, remote_headers=None)
  Test whether the fetched file is the same size as the remote file, using the content-length field in the HTTP header.
  :return: True or False
- declareAsOntology(graph)
  The file we output needs to be declared as an ontology, including its version information.
  TEC: I am not convinced that dipper reformatting external data as RDF triples makes an OWL ontology (nor that it should be considered a goal). Proper ontologies are built by ontologists. Dipper reformats data and annotates/decorates it with a minimal set of carefully arranged terms drawn from multiple proper ontologies, which allows the whole (dipper's RDF triples plus the parent ontologies) to function as a single ontology we can reason over when combined in a store such as SciGraph. Including more than the minimal ontological terms in dipper's RDF output is a liability, as it allows greater divergence between dipper artifacts and the proper ontologies.
  Further information will be augmented in the dataset object.
  :param version:
  :return:
- fetch(is_dl_forced=False)
  Abstract method to fetch all data from an external resource. This should be overridden by subclasses.
  :return: None
- fetch_from_url(remotefile, localfile=None, is_dl_forced=False, headers=None)
  Given a remote URL and a local filename, attempt to determine if the remote file is newer; if it is, fetch the remote file and save it to the specified localfile, reporting the basic file information once it is downloaded.
  :param remotefile: URL of remote file to fetch
  :param localfile: pathname of file to save locally
  :return: None
- file_len(fname)
- files = {}
- getTestSuite()
  An abstract method that should be overridden with tests appropriate for the specific source.
  :return:
- static get_eco_map(url)
  To convert the three-column file to a hashmap, we join the primary and secondary keys. For example:

      IEA    GO_REF:0000002    ECO:0000256
      IEA    GO_REF:0000003    ECO:0000501
      IEA    Default           ECO:0000501

  becomes:

      IEA-GO_REF:0000002: ECO:0000256
      IEA-GO_REF:0000003: ECO:0000501
      IEA: ECO:0000501

  :return: dict
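The key-joining above can be sketched as follows. This is a hedged reconstruction of the mapping logic only; the real method fetches the file from a URL, and `build_eco_map` plus the assumption of whitespace-delimited rows are illustrative:

```python
def build_eco_map(lines):
    """Join the primary and secondary keys with '-'; a secondary key of
    'Default' means the primary key maps to the ECO id directly."""
    eco_map = {}
    for line in lines:
        if not line.strip() or line.startswith('#'):
            continue  # skip blanks and comments
        primary, secondary, eco = line.split()
        if secondary == 'Default':
            eco_map[primary] = eco
        else:
            eco_map[primary + '-' + secondary] = eco
    return eco_map
```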
- get_file_md5(directory, file, blocksize=1048576)
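The blocksize parameter suggests block-wise hashing; a minimal sketch of that pattern, with `file_md5` as an illustrative name taking a single path rather than dipper's (directory, file) pair:

```python
import hashlib

def file_md5(path, blocksize=1048576):
    """Accumulate an md5 digest over fixed-size blocks so large files
    are never read into memory at once."""
    md5 = hashlib.md5()
    with open(path, 'rb') as handle:
        for block in iter(lambda: handle.read(blocksize), b''):
            md5.update(block)
    return md5.hexdigest()
```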
- get_files(is_dl_forced, files=None)
  Given a set of files for this source, fetch them and set a default version by date. If you need to set the version number by another method, it can be set again afterwards.
  :param is_dl_forced: boolean
  :param files: dict - override the instance files dict
  :return: None
- get_local_file_size(localfile)
  :param localfile:
  :return: size of file
- get_remote_content_len(remote, headers=None)
  :param remote:
  :return: size of remote file
- static hash_id(long_string)
  Return a sha1 hash of the string, truncated to a 64-bit-sized word, with 'b' prepended to avoid leading with a digit.
  :param long_string: str, string to be hashed
  :return: str, hash of id
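A sketch of the scheme the docstring describes, assuming the 64-bit truncation means keeping 16 hex characters of the sha1 digest (the exact slice dipper takes is an assumption):

```python
import hashlib

def hash_id(long_string):
    """sha1 the input, keep a 64-bit-sized slice (16 hex characters),
    and prepend 'b' so the result never starts with a digit."""
    digest = hashlib.sha1(long_string.encode('utf-8')).hexdigest()
    return 'b' + digest[:16]
```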
- static make_id(long_string, prefix='MONARCH')
  A method to create DETERMINISTIC identifiers based on a string's digest. Currently implemented with sha1.
  :param long_string:
  :return:
- namespaces = {}
- static open_and_parse_yaml(file)
  :param file: String, path to a YAML file
  :return: dict of the parsed YAML contents
- parse(limit)
  Abstract method to parse all data from an external resource that was fetched in fetch(). This should be overridden by subclasses.
  :return: None
- static parse_mapping_file(file)
  :param file: String, path to a file containing label-id mappings in the first two columns of each row
  :return: dict where keys are labels and values are ids
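The first-two-columns mapping can be sketched as below. The tab delimiter and comment-skipping are assumptions, and `parse_mapping_lines` (operating on lines rather than a path, for clarity) is an illustrative name:

```python
def parse_mapping_lines(lines):
    """Build {label: id} from the first two columns of each row,
    assuming tab-delimited input; extra columns are ignored."""
    mapping = {}
    for line in lines:
        if line.startswith('#'):
            continue  # assumed comment convention
        cols = line.rstrip('\n').split('\t')
        if len(cols) >= 2:
            mapping[cols[0]] = cols[1]
    return mapping
```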
- process_xml_table(elem, table_name, processing_function, limit)
  This is a convenience function to process the elements of an XML document when the XML is used as an alternative way of distributing SQL-like tables. In this case, the "elem" is akin to an SQL table, with its name of `table_name`. It will then process each `row` using the `processing_function` supplied.
  :param elem: The element data
  :param table_name: The name of the table to process
  :param processing_function: The row processing function
  :param limit:
  Appears to be making calls to the ElementTree library, although it is not explicitly imported here.
  :return:
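The table-like traversal can be sketched with ElementTree as below. The `<table>`/`<row>`/`<field>` tag names are illustrative, not dipper's actual schema, and this is a reconstruction of the pattern rather than the library's implementation:

```python
import xml.etree.ElementTree as ET

def process_xml_table(elem, table_name, processing_function, limit=None):
    """Treat `elem` as an SQL-like table: hand each row's field texts
    to `processing_function`, stopping after `limit` rows if given."""
    if elem.get('name') != table_name:
        return  # not the table we were asked to process
    for count, row in enumerate(elem.findall('row')):
        if limit is not None and count >= limit:
            break
        processing_function([field.text for field in row])

doc = ET.fromstring(
    '<table name="genes">'
    '<row><field>GeneA</field><field>NCBIGene:1</field></row>'
    '<row><field>GeneB</field><field>NCBIGene:2</field></row>'
    '</table>')
rows = []
process_xml_table(doc, 'genes', rows.append, limit=10)
```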
- static remove_backslash_r(filename, encoding)
  A helpful utility to remove carriage returns from any file. This will read a file into memory and overwrite the contents of the original file.
  TODO: This function may be a liability.
  :param filename:
  :return:
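A minimal sketch of the read-strip-overwrite cycle, which also makes the memory liability visible: the whole file is held in memory at once. `remove_carriage_returns` is an illustrative name:

```python
def remove_carriage_returns(filename, encoding='utf-8'):
    """Read the whole file into memory, drop every carriage return,
    and overwrite the original file in place."""
    # newline='' disables Python's newline translation, so the CRs
    # are actually seen and actually removed.
    with open(filename, 'r', encoding=encoding, newline='') as handle:
        contents = handle.read()
    with open(filename, 'w', encoding=encoding, newline='') as handle:
        handle.write(contents.replace('\r', ''))
```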
- settestmode(mode)
  Set testMode to (mode).
  - True: run the Source in testMode
  - False: run it in full mode
  :param mode:
  :return: None
- settestonly(testonly)
  Set that this source should only be processed in testMode.
  :param testonly:
  :return: None
- whoami()
- write(fmt='turtle', stream=None)
  This convenience method will write out all of the graphs associated with the source. Right now these are hardcoded to be a single "graph", a "src_dataset.ttl", and a "src_test.ttl". If you do not supply stream='stdout', it will default to writing these to files. In addition, if the version number isn't yet set in the dataset, it will be set to the date on file.
  :return: None