dipper.models.Dataset module

Produces metadata about ingested data

class dipper.models.Dataset.Dataset(identifier, data_release_version, ingest_name, ingest_title, ingest_url, ingest_logo=None, ingest_description=None, license_url=None, data_rights=None, graph_type='rdf_graph', file_handle=None, distribution_type='ttl', dataset_curie_prefix='MonarchArchive')

Bases: object

This class produces metadata about a dataset that is compliant with the HCLS dataset specification: https://www.w3.org/TR/2015/NOTE-hcls-dataset-20150514/#s4_4

Summary level: The summary level provides a description of a dataset that is independent of a specific version or format. (e.g. the Monarch ingest of CTD) CURIE for this is something like MonarchData:[SOURCE IDENTIFIER]

Version level: The version level captures version-specific characteristics of a dataset. (e.g. the 01-02-2018 ingest of CTD) CURIE for this is something like MonarchData:[SOURCE IDENTIFIER_INGESTTIMESTAMP]

Distribution level: The distribution level captures metadata about a specific form and version of a dataset (e.g. turtle file for 01-02-2018 ingest of CTD). There is a [distribution level resource] for each different downloadable file we emit, i.e. one for the TTL file, one for the ntriples file, etc. CURIE for this is like MonarchData:[SOURCE IDENTIFIER_INGESTTIMESTAMP].ttl or MonarchData:[SOURCE IDENTIFIER_INGESTTIMESTAMP].nt or MonarchData:[SOURCE IDENTIFIER_INGESTTIMESTAMP].[whatever file format]
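The three CURIE patterns above can be sketched as a small helper. This is only an illustration of the naming scheme: the function name, the `ctd` identifier, and the `YYYYMMDD` timestamp format are assumptions, not Dipper's actual code.

```python
from datetime import datetime, timezone


def build_dataset_curies(prefix, identifier, ingest_timestamp, formats=("ttl", "nt")):
    """Return the summary, version, and per-format distribution CURIEs."""
    # Summary level: independent of version and format.
    summary = f"{prefix}:{identifier}"
    # Version level: the summary identifier plus the ingest timestamp.
    version = f"{prefix}:{identifier}_{ingest_timestamp}"
    # Distribution level: one CURIE per downloadable file format.
    distributions = {fmt: f"{version}.{fmt}" for fmt in formats}
    return summary, version, distributions


# Hypothetical values, for illustration only.
stamp = datetime(2018, 1, 2, tzinfo=timezone.utc).strftime("%Y%m%d")
summary, version, distributions = build_dataset_curies("MonarchData", "ctd", stamp)
print(summary)
print(version)
print(distributions["ttl"])
```

Keeping the version CURIE as a prefix of each distribution CURIE makes the relationship between the levels visible in the identifiers themselves.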

We write out at least the following triples:

SUMMARY LEVEL TRIPLES:
[summary level resource] - rdf:type -> dctypes:Dataset
[summary level resource] - dc:title -> title (literal)
[summary level resource] - dc:description -> description (literal) (use docstring from Source class)
[summary level resource] - dc:source -> [source web page, e.g. omim.org]
[summary level resource] - schema:logo -> [source logo IRI]
[summary level resource] - dc:publisher -> monarchinitiative.org

n.b. about summary level resource triples:
- The HCLS spec says we “should” link to our logo and web page, but I’m not, because it would confuse the issue of whether we are pointing to our logo/page or the logo/page of the data source for this ingest. The same applies below to [version level resource] and [distribution level resource]: I’m not linking to our page/logo there either.
- The spec says we “should” include summary level triples describing update frequency and SPARQL endpoint, but I’m omitting these for now because they are not clearly defined at the moment.

VERSION LEVEL TRIPLES:
[version level resource] - rdf:type -> dctypes:Dataset
[version level resource] - dc:title -> version title (literal)
[version level resource] - dc:description -> version description (literal)
[version level resource] - dc:created -> ingest timestamp [ISO 8601 compliant]
[version level resource] - pav:version -> ingest timestamp (same one as above)
[version level resource] - dc:creator -> monarchinitiative.org
[version level resource] - dc:publisher -> monarchinitiative.org
[version level resource] - dc:isVersionOf -> [summary level resource]
[version level resource] - dc:source -> [source file 1 IRI]
[version level resource] - dc:source -> [source file 2 IRI]
…

[source file 1 IRI] - pav:retrievedOn -> [download date timestamp]
[source file 1 IRI] - pav:version -> [source version (if set, optional)]
[source file 2 IRI] - pav:retrievedOn -> [download date timestamp]
[source file 2 IRI] - pav:version -> [source version (if set, optional)]
…

[version level resource] - pav:createdWith -> [Dipper github URI]
[version level resource] - void:dataset -> [distribution level resource]

[version level resource] - cito:citesAsAuthority -> [citation id 1]
[version level resource] - cito:citesAsAuthority -> [citation id 2]
[version level resource] - cito:citesAsAuthority -> [citation id 3]

n.b. about version level resource triples:
- The spec says we “should” include a Date of issue/dc:issued triple, but I’m not, because it is redundant with this triple above:
[version level resource] - dc:created -> time stamp
and would introduce ambiguity and confusion if the two disagree. The same applies below to [distribution level resource] - dc:created -> time stamp.
- Also omitting:
  - triples linking to our logo and page, see above.
  - License/dc:license triple, because we will make this triple via the [distribution level resource] below.
  - Language/dc:language triple, because it seems superfluous. The same applies below to [distribution level resource]: no language triple.
- The [version level resource] - pav:version triple is also a bit redundant with the pav:version triple below, but the spec requires both of these triples.
- I’m omitting the [version level resource] -> pav:previousVersion triple because Dipper doesn’t know this info for certain at run time. The same applies below to [distribution level resource] - pav:previousVersion.

DISTRIBUTION LEVEL TRIPLES:
[distribution level resource] - rdf:type -> dctypes:Dataset
[distribution level resource] - rdf:type -> dcat:Distribution
[distribution level resource] - dc:title -> distribution title (literal)
[distribution level resource] - dc:description -> distribution description (literal)
[distribution level resource] - dc:created -> ingest timestamp [ISO 8601 compliant]
[distribution level resource] - pav:version -> ingest timestamp (same as above)
[distribution level resource] - dc:creator -> monarchinitiative.org
[distribution level resource] - dc:publisher -> monarchinitiative.org
[distribution level resource] - dc:license -> [license info, if available; otherwise indicate unknown]
[distribution level resource] - dc:rights -> [data rights IRI]
[distribution level resource] - pav:createdWith -> [Dipper github URI]
[distribution level resource] - dc:format -> [IRI of ttl|nt|whatever spec]
[distribution level resource] - dcat:downloadURL -> [ttl|nt URI]
[distribution level resource] - void:triples -> [triples count (literal)]
[distribution level resource] - void:entities -> [entities count (literal)]
[distribution level resource] - void:distinctSubjects -> [subject count (literal)]
[distribution level resource] - void:distinctObjects -> [object count (literal)]
[distribution level resource] - void:properties -> [properties count (literal)]
…
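To make the void: statistics concrete, here is a minimal sketch that derives four of the counts from a graph modelled as plain (subject, predicate, object) tuples. The function name and example triples are hypothetical, and this is not necessarily how Dipper computes the values. void:entities is left out because its definition depends on how entities are identified in the dataset.

```python
def void_statistics(triples):
    """Compute simple VoID counts over an iterable of (s, p, o) tuples."""
    unique = set(triples)  # an RDF graph is a set, so duplicates collapse
    return {
        "void:triples": len(unique),
        "void:distinctSubjects": len({s for s, _, _ in unique}),
        "void:properties": len({p for _, p, _ in unique}),
        "void:distinctObjects": len({o for _, _, o in unique}),
    }


# Hypothetical example data.
example_triples = [
    ("ex:a", "rdf:type", "ex:Gene"),
    ("ex:b", "rdf:type", "ex:Gene"),
    ("ex:a", "rdfs:label", "gene a"),
    ("ex:a", "rdfs:label", "gene a"),  # duplicate, counted once
]
stats = void_statistics(example_triples)
print(stats)
```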

n.b. about distribution level resource triples:
- Omitting Vocabularies used/void:vocabulary and Standards used/dc:conformsTo triples, because they are described in the ttl file.
- Also omitting Example identifier/idot:exampleIdentifier and Example resource/void:exampleResource, because we don’t really have one canonical example of either; they’re all very different.
- [distribution level resource] - dc:created should have the exact same time stamp as this triple above:
[version level resource] - dc:created -> time stamp
- This [distribution level resource] - pav:version triple should have the same object as the [version level resource] - pav:version triple above.
- Data source provenance/dc:source triples are above in the [version level resource].
- Omitting Byte size/dcat:byteSize, RDF File URL/void:dataDump, and Linkset/void:subset triples because they probably aren’t necessary for MI right now.
- These triples “should” be emitted, but we will do this in a later iteration:
  # of classes void:classPartition IRI
  # of literals void:classPartition IRI
  # of RDF graphs void:classPartition IRI

Note: Do not use blank nodes in the dataset graph. This dataset graph is added to the main Dipper graph in Source.write() like so

$ mainGraph = mainGraph + datasetGraph

which could, in theory, lead to blank node ID collisions between the two graphs.
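The collision risk can be illustrated with a toy model in which each graph is a plain set of (subject, predicate, object) tuples and blank nodes are written as `_:` labels. This is a simplification for illustration, not rdflib's actual merge behaviour; the labels and triples are made up.

```python
def merge(graph_a, graph_b):
    """Union of two graphs modelled as sets of (s, p, o) tuples."""
    return graph_a | graph_b


# Each graph uses the label _:b0 for its own, unrelated anonymous node.
main_graph = {("_:b0", "rdfs:label", "a gene")}
dataset_graph = {("_:b0", "dc:title", "CTD ingest")}

merged = merge(main_graph, dataset_graph)
subjects = {s for s, _, _ in merged}

# After the union, both triples hang off what looks like a single node.
print(len(merged), len(subjects))
```

Identifying every dataset resource with an IRI or CURIE avoids the problem, because those identifiers stay unambiguous across graphs.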

Note also that this implementation currently does not support producing metadata for StreamedGraph graphs (see dipper/graph/StreamedGraph.py). StreamedGraph is currently not being used for any ingests, so this isn’t a problem. There was talk of using StreamedGraph for a rewrite/refactor of the ClinVar ingest, which would probably require adding StreamedGraph support here.

get_graph()

This method returns the dataset graph.

Returns:	dataset graph

get_license()

This method returns the license info.

Returns:	license info

static hash_id(word)

Given a string, make a hash. Duplicated from Source.py.

Parameters:	word – str string to be hashed
Returns:	hash of id

static make_id(long_string, prefix='MONARCH')

A method to create DETERMINISTIC identifiers based on a string’s digest. Currently implemented with SHA-1. Duplicated from Source.py to avoid circular imports.

Parameters:

long_string – string to use to generate identifier
prefix – prefix to prepend to identifier (default: 'MONARCH')

Returns:

a Monarch identifier
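A minimal sketch of digest-based deterministic identifiers follows. The `_sketch` function names, the 20-character truncation, and the `prefix:hash` layout are assumptions made for illustration; only the use of SHA-1 and the default prefix come from the description above.

```python
import hashlib


def hash_id_sketch(word, length=20):
    """Hash a string with SHA-1 and keep the first `length` hex characters."""
    # Assumption: the truncation length is illustrative, not Dipper's choice.
    return hashlib.sha1(word.encode("utf-8")).hexdigest()[:length]


def make_id_sketch(long_string, prefix="MONARCH"):
    """Build a deterministic identifier of the form prefix:hash."""
    return f"{prefix}:{hash_id_sketch(long_string)}"


first = make_id_sketch("some long string")
second = make_id_sketch("some long string")
other = make_id_sketch("a different string")
print(first == second, first != other)
```

Because the identifier depends only on the input string, re-running an ingest yields the same identifiers, which keeps graph output stable across runs.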

set_citation(citation_id)

This method adds the [citation_id] argument to the set of citations, and also adds a triple indicating that the version level resource cito:citesAsAuthority [citation_id].

Parameters:	citation_id
Returns:	None

set_ingest_source(url, predicate=None, is_object_literal=False)

This method writes a triple to the dataset graph indicating that the ingest used a file or resource at [url].

Triple emitted is version_level_curie dc:source [url]

This triple is likely to be redundant if Source.get_files() is used to retrieve the remote files/resources, since this triple should also be emitted as files/resources are being retrieved. This method is provided as a convenience method for sources that do their own downloading of files.

Parameters:

url – a remote resource used as a source during ingest
predicate – the predicate to use for the triple [“dc:source”] from spec (https://www.w3.org/TR/2015/NOTE-hcls-dataset-20150514/) “Use dc:source when the source dataset was used in whole or in part. Use pav:retrievedFrom when the source dataset was used in whole and was not modified from its original distribution. Use prov:wasDerivedFrom when the source dataset was in whole or in part and was modified from its original distribution.”

Returns:

None
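The spec guidance quoted for the predicate parameter amounts to a small decision rule, sketched below. The helper name and its two boolean arguments are hypothetical and not part of the Dataset API; the returned strings are the CURIEs named in the quote.

```python
def choose_source_predicate(used_in_whole, modified):
    """Pick a provenance predicate following the quoted HCLS guidance."""
    if modified:
        # Used in whole or in part, and modified from the original distribution.
        return "prov:wasDerivedFrom"
    if used_in_whole:
        # Used in whole and not modified.
        return "pav:retrievedFrom"
    # Used in part and not modified: fall back to the general predicate.
    return "dc:source"


print(choose_source_predicate(used_in_whole=True, modified=False))
print(choose_source_predicate(used_in_whole=False, modified=True))
print(choose_source_predicate(used_in_whole=False, modified=False))
```

Since dc:source covers any use "in whole or in part", it is always a safe choice; the other two predicates only add precision.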

set_ingest_source_file_version_date(file_iri, date, datatype=rdflib.term.URIRef('http://www.w3.org/2001/XMLSchema#date'))

This method sets the version that the source (OMIM, CTD, whatever) uses to refer to this version of the remote file/resource that was used in the ingest

It writes this triple:

file_iri - ‘pav:version’ -> date or timestamp

Version is added as a literal of datatype XSD date

Note: if file_iri was retrieved using get_files(), then the following triple was created and you might not need this method:

file_iri - ‘pav:retrievedOn’ -> download date

Parameters:

file_iri – a remote file or resource used in ingest
date – a date in YYYYMMDD format that the source (OMIM, CTD) uses to refer to this version of the file/resource used during the ingest. You can add a timestamp as a version by using a different datatype (below)
datatype – an XSD literal datatype, default is XSD.date

Returns:

None
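The lexical form of an XSD date is YYYY-MM-DD, so a caller holding a compact YYYYMMDD string may want to normalize it before passing it on. The sketch below shows one way to do that and how the resulting typed literal would look in Turtle. Both helper names are hypothetical, and whether Dipper performs any such normalization itself is not stated here.

```python
from datetime import datetime

XSD_DATE = "http://www.w3.org/2001/XMLSchema#date"


def to_xsd_date(yyyymmdd):
    """Convert a compact YYYYMMDD string to the XSD date form YYYY-MM-DD."""
    return datetime.strptime(yyyymmdd, "%Y%m%d").date().isoformat()


def typed_literal(lexical, datatype_iri=XSD_DATE):
    """Render a typed literal in Turtle syntax."""
    return f'"{lexical}"^^<{datatype_iri}>'


literal = typed_literal(to_xsd_date("20180102"))
print(literal)
```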

set_ingest_source_file_version_num(file_iri, version)

This method sets the version of a remote file or resource that is used in the ingest. It writes this triple:

file_iri - ‘pav:version’ -> version

Version is an untyped literal

Note: if your version is a date or timestamp, use set_ingest_source_file_version_date() instead

Parameters:

file_iri – a remote file or resource used in ingest
version – a number or string (e.g. v1.2.3) that the source (OMIM, CTD) uses to refer to this version of the file/resource used during the ingest

Returns:

None

set_ingest_source_file_version_retrieved_on(file_iri, date, datatype=rdflib.term.URIRef('http://www.w3.org/2001/XMLSchema#date'))

This method sets the date on which a remote file/resource (from OMIM, CTD, etc) was retrieved.

It writes this triple:

file_iri - ‘pav:retrievedOn’ -> date or timestamp

The date is added as a literal of datatype XSD date by default

Note: if file_iri was retrieved using get_files(), then the following triple was created and you might not need this method:

file_iri - ‘pav:retrievedOn’ -> download date

Parameters:

file_iri – a remote file or resource used in ingest
date – the date, in YYYYMMDD format, on which the file/resource was retrieved. You can add a timestamp instead by using a different datatype (below)
datatype – an XSD literal datatype, default is XSD.date

Returns:

None