dipper.sources.ClinVar module

Converts ClinVar XML into RDF triples to be ingested by SciGraph. These triples conform to the core of the SEPIO Evidence & Provenance model.

We also use the ClinVar curated gene-to-disease mappings to discern the functional consequence of a variant on a gene in cases where this is ambiguous. For example, some variants are located in two genes that overlap on different strands, and may have a functional consequence on only one of them. This is suboptimal, and we should look for a source that provides this directly.

Creating a test set:
  • get a full dataset (default: ClinVarFullRelease_00-latest.xml.gz)
  • get the mapping file (default: gene_condition_source_id)
  • get a list of RCV (default: CV_test_RCV.txt)
  • put the input files in the raw directory
  • write the test set back to the raw directory

./scripts/ClinVarXML_Subset.sh | gzip > raw/clinvar/ClinVarTestSet.xml.gz

Parsing a test set (Skolemizing blank nodes, i.e. for Protege):

dipper/sources/ClinVar.py -f ClinVarTestSet.xml.gz -o ClinVarTestSet_`datestamp`.nt

For the time being, we are still required to redundantly include the OWL properties in with the data files:

python3 ./scripts/add-properties2turtle.py --input ./out/ClinVarTestSet_`datestamp`.nt --output ./out/ClinVarTestSet_`datestamp`.nt --format nt

dipper.sources.ClinVar.allele_to_triples(allele, triples) → None

Process allele info such as dbSNP ids and synonyms.

Parameters:
  • allele – Allele
  • triples – List, Buffer to store the triples
Returns:

None

dipper.sources.ClinVar.digest_id(wordage)

Return a deterministic digest of the input. The leading ‘b’ is an experiment forcing the first character to be non-numeric but still valid hex. This is in no way required for RDF, but may help when using the identifier in other contexts which do not allow identifiers to begin with a digit.

Parameters:
  • wordage – the string to hash
Returns:

20 hex char digest
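The behaviour described for digest_id can be illustrated with a minimal sketch. The choice of SHA-1 and of which 19 digest characters are kept is an assumption made for illustration; the docstring only promises a deterministic, 20-character hex digest whose first character is ‘b’. The name digest_id_sketch is hypothetical.

```python
import hashlib


def digest_id_sketch(wordage: str) -> str:
    """Deterministic 20-char hex identifier that never starts with a digit.

    Assumption: SHA-1 is used as the hash; the docstring does not say which.
    """
    digest = hashlib.sha1(wordage.encode("utf-8")).hexdigest()
    # Replace the first hex character with 'b' and keep 19 more,
    # so the result is 20 characters of valid hex with a letter up front.
    return "b" + digest[1:20]
```

Because the result depends only on the input string, the same input yields the same identifier across runs, which is what makes it usable as a stable (Skolemized) node name.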

dipper.sources.ClinVar.expand_curie(this_curie)
dipper.sources.ClinVar.is_literal(thing)

Infer whether the argument is a literal or a CURIE.

Returns:

boolean

dipper.sources.ClinVar.make_spo(sub, prd, obj, subject_category=None, object_category=None)

Decorates the three given strings as a line of N-Triples. Also writes a biolink:category triple for the subject and for the object.
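As a rough illustration of what decorating three strings as an N-Triples line can involve, here is a minimal sketch. It is not the module's implementation: CURIE_MAP, expand_sketch and make_spo_sketch are hypothetical names, the ClinVarVariant base IRI is a placeholder, and the biolink:category triples are left out for brevity. The object is treated as a literal whenever it is not a CURIE with a known prefix.

```python
import re

# Hypothetical prefix map, for illustration only.
CURIE_MAP = {
    "GENO": "http://purl.obolibrary.org/obo/GENO_",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "ClinVarVariant": "http://example.org/clinvar/variation/",  # placeholder IRI
}
CURIE_RE = re.compile(r"^([A-Za-z_][A-Za-z0-9_.-]*):(\S+)$")


def expand_sketch(term):
    """Return '<iri>' for a CURIE with a known prefix, otherwise None."""
    match = CURIE_RE.match(term)
    if match and match.group(1) in CURIE_MAP:
        return "<" + CURIE_MAP[match.group(1)] + match.group(2) + ">"
    return None


def escape_literal(text):
    """Escape backslash, double quote, LF and CR as N-Triples requires."""
    return (text.replace("\\", "\\\\").replace('"', '\\"')
                .replace("\n", "\\n").replace("\r", "\\r"))


def make_spo_sketch(sub, prd, obj):
    """Format one subject-predicate-object statement as an N-Triples line."""
    subj_t = expand_sketch(sub)
    pred_t = expand_sketch(prd)
    if subj_t is None or pred_t is None:
        raise ValueError("subject and predicate must be CURIEs with known prefixes")
    obj_t = expand_sketch(obj)
    if obj_t is None:  # not a known CURIE, so emit a plain string literal
        obj_t = '"' + escape_literal(obj) + '"'
    return subj_t + " " + pred_t + " " + obj_t + " .\n"
```

Failing loudly on an unknown subject or predicate prefix is a design choice of this sketch; silently emitting an unexpanded CURIE would produce a line that is not valid N-Triples.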

dipper.sources.ClinVar.parse()

Main function for parsing a clinvar XML release and outputting triples

dipper.sources.ClinVar.process_measure_set(measure_set, rcv_acc) → dipper.models.ClinVarRecord.Variant

Given a MeasureSet, create a Variant object.

Parameters:
  • measure_set – XML object
  • rcv_acc – str, RCV accession
Returns:

Variant object

dipper.sources.ClinVar.record_to_triples(rcv: dipper.models.ClinVarRecord.ClinVarRecord, triples: List[T], g2p_map: Dict[KT, VT]) → None

Given a ClinVarRecord, adds triples to the triples list

Parameters:
  • rcv – ClinVarRecord
  • triples – List, Buffer to store the triples
  • g2p_map – Gene to phenotype dict
Returns:

None

dipper.sources.ClinVar.resolve(label)

Composite mapping. Given f(x) and g(x) (here GLOBALTT and LOCALTT respectively), return, in order of preference: g(f(x)) | f(x) | g(x) | x

TODO: consider returning x on fall through.

Note: the descendant resolve(label) function in Source.py should be used instead, and this function removed.

Returns:

label’s mapping
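The preference order g(f(x)) | f(x) | g(x) | x can be sketched with two plain dictionaries standing in for the translation tables. This is an illustration only: resolve_sketch and its f_map/g_map parameters are hypothetical, and which table plays f and which plays g in the module itself is taken from the docstring rather than verified.

```python
def resolve_sketch(label, f_map, g_map):
    """Apply the composite mapping g(f(x)) | f(x) | g(x) | x."""
    if label in f_map:
        fx = f_map[label]
        # Prefer g(f(x)); fall back to f(x) when g has no entry for it.
        return g_map.get(fx, fx)
    if label in g_map:
        return g_map[label]  # g(x)
    return label  # fall through: return x unchanged


# Toy tables (hypothetical contents).
f_table = {"a": "b", "m": "n"}
g_table = {"b": "c", "z": "y"}

composed = resolve_sketch("a", f_table, g_table)      # uses g(f(x))
only_f = resolve_sketch("m", f_table, g_table)        # uses f(x)
only_g = resolve_sketch("z", f_table, g_table)        # uses g(x)
unmapped = resolve_sketch("q", f_table, g_table)      # falls through to x
```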

Creates links between SCV based on their pathogenicity/significance calls:

  • GENO:0000840 - GENO:0000840 --> is_equivalent_to SEPIO:0000098
  • GENO:0000841 - GENO:0000841 --> is_equivalent_to SEPIO:0000098
  • GENO:0000843 - GENO:0000843 --> is_equivalent_to SEPIO:0000098
  • GENO:0000844 - GENO:0000844 --> is_equivalent_to SEPIO:0000098
  • GENO:0000840 - GENO:0000844 --> contradicts SEPIO:0000101
  • GENO:0000841 - GENO:0000844 --> contradicts SEPIO:0000101
  • GENO:0000841 - GENO:0000843 --> contradicts SEPIO:0000101
  • GENO:0000840 - GENO:0000841 --> is_consistent_with SEPIO:0000099
  • GENO:0000843 - GENO:0000844 --> is_consistent_with SEPIO:0000099
  • GENO:0000840 - GENO:0000843 --> strongly_contradicts SEPIO:0000100
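The pairings listed above lend themselves to a lookup table. The sketch below is one way to encode them, not the module's own code: SCV_LINK and link_predicate are hypothetical names, and treating each pair as unordered (so that A-B and B-A give the same relation) is an assumption based on each pair being listed only once.

```python
EQUIVALENT = "SEPIO:0000098"            # is_equivalent_to
CONSISTENT = "SEPIO:0000099"            # is_consistent_with
STRONGLY_CONTRADICTS = "SEPIO:0000100"  # strongly_contradicts
CONTRADICTS = "SEPIO:0000101"           # contradicts

# Keys are unordered pairs; a pair of identical calls collapses to one element.
SCV_LINK = {
    frozenset({"GENO:0000840"}): EQUIVALENT,
    frozenset({"GENO:0000841"}): EQUIVALENT,
    frozenset({"GENO:0000843"}): EQUIVALENT,
    frozenset({"GENO:0000844"}): EQUIVALENT,
    frozenset({"GENO:0000840", "GENO:0000844"}): CONTRADICTS,
    frozenset({"GENO:0000841", "GENO:0000844"}): CONTRADICTS,
    frozenset({"GENO:0000841", "GENO:0000843"}): CONTRADICTS,
    frozenset({"GENO:0000840", "GENO:0000841"}): CONSISTENT,
    frozenset({"GENO:0000843", "GENO:0000844"}): CONSISTENT,
    frozenset({"GENO:0000840", "GENO:0000843"}): STRONGLY_CONTRADICTS,
}


def link_predicate(sig_a, sig_b):
    """Return the SEPIO relation for two significance calls, or None if unlisted."""
    return SCV_LINK.get(frozenset({sig_a, sig_b}))
```

Returning None for an unlisted pair leaves it to the caller to decide whether to skip the link or report it.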

dipper.sources.ClinVar.write_review_status_scores()

Make triples that attach a “star” score to each of ClinVar’s review statuses. (Stars are basically a 0-4 rating of the review status.)

Per https://www.ncbi.nlm.nih.gov/clinvar/docs/details/, Table 1: The review status and assignment of stars (with changes made mid-2015).

Number of gold stars, with description and review statuses:

NO STARS:
  • <ReviewStatus> “no assertion criteria provided”
  • <ReviewStatus> “no assertion provided”
No submitter provided an interpretation with assertion criteria (no assertion criteria provided), or no interpretation was provided (no assertion provided).

ONE STAR:
  • <ReviewStatus> “criteria provided, single submitter”
  • <ReviewStatus> “criteria provided, conflicting interpretations”
One submitter provided an interpretation with assertion criteria (criteria provided, single submitter), or multiple submitters provided assertion criteria but there are conflicting interpretations, in which case the independent values are enumerated for clinical significance (criteria provided, conflicting interpretations).

TWO STARS:
  • <ReviewStatus> “criteria provided, multiple submitters, no conflicts”
Two or more submitters providing assertion criteria provided the same interpretation (criteria provided, multiple submitters, no conflicts).

THREE STARS:
  • <ReviewStatus> “reviewed by expert panel”
Reviewed by expert panel.

FOUR STARS:
  • <ReviewStatus> “practice guideline”
Practice guideline.

A group wishing to be recognized as an expert panel must first apply to ClinGen by completing the form that can be downloaded from our ftp site.

Returns:

list of triples that attach a “star” score to each of ClinVar’s review statuses
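The star assignment described above amounts to a small mapping from review status to score. The following sketch encodes it and turns it into simple (subject, predicate, object) tuples. The names REVIEW_STATUS_STARS and star_triples_sketch, and the predicate "ex:has_star_score", are hypothetical placeholders; the module's actual identifiers and predicate are not shown in this documentation.

```python
# Review status -> number of stars, as listed in the docstring above.
REVIEW_STATUS_STARS = {
    "no assertion criteria provided": 0,
    "no assertion provided": 0,
    "criteria provided, single submitter": 1,
    "criteria provided, conflicting interpretations": 1,
    "criteria provided, multiple submitters, no conflicts": 2,
    "reviewed by expert panel": 3,
    "practice guideline": 4,
}


def star_triples_sketch(predicate="ex:has_star_score"):
    """Return one (status, predicate, stars) tuple per review status.

    The predicate is a hypothetical placeholder, not the module's real one.
    """
    return [(status, predicate, str(stars))
            for status, stars in REVIEW_STATUS_STARS.items()]
```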

dipper.sources.ClinVar.write_spo(sub, prd, obj, triples, subject_category=None, object_category=None)

Write triples to a buffer, in case we decide to drop them.
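The point of buffering is that a record's triples can be discarded as a unit if the record turns out to be unusable. A minimal sketch of that pattern follows. It assumes the three strings are already decorated N-Triples terms; write_spo_sketch and flush_sketch are hypothetical names, and the module may handle the keep-or-drop decision differently.

```python
import io


def write_spo_sketch(sub, prd, obj, triples):
    """Append one statement to the in-memory buffer instead of writing it out."""
    triples.append(sub + " " + prd + " " + obj + " .\n")


def flush_sketch(triples, out, keep):
    """Write the buffered lines to out only if the record is kept, then empty the buffer."""
    if keep:
        out.writelines(triples)
    triples.clear()


# Usage: one record is kept, the next is dropped.
output = io.StringIO()
buffer = []
write_spo_sketch("<http://example.org/s1>", "<http://example.org/p>", '"kept"', buffer)
flush_sketch(buffer, output, keep=True)
write_spo_sketch("<http://example.org/s2>", "<http://example.org/p>", '"dropped"', buffer)
flush_sketch(buffer, output, keep=False)
```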