dipper.sources.NCBIGene module

class dipper.sources.NCBIGene.NCBIGene(graph_type, are_bnodes_skolemized, tax_ids=None, gene_ids=None)

Bases: dipper.sources.Source.Source

This is the processing module for the National Center for Biotechnology Information (NCBI) Gene resource. It includes parsers for gene_info (gene names, symbols, ids, equivalent ids), gene_history (alternate ids), and gene2pubmed (publication references about a gene).

This creates Genes as classes when they are properly typed as such. Entries typed as ‘unknown significance’ are added simply as instances of a sequence feature. The parser adds equivalentClasses for a subset of external identifiers, including ENSEMBL, HGMD, MGI, and ZFIN, plus gene product links for HPRD. Genes are additionally located to their chromosomal band (until we process actual genomic coordinates in a separate file).

We process the genes from the filtered taxa, starting with those configured by default (human, mouse, fish). This can be overridden in the calling script to include additional taxa, if desired. The gene ids in the conf.json will be used to subset the data when testing.
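The taxon filter described above can be sketched as follows. This is a minimal illustration, not the module's actual code: the function name `filter_by_taxon` is hypothetical, the taxon ids assume that "human, mouse, fish" means NCBI Taxonomy 9606, 10090 and 7955 (zebrafish), and the sample rows assume that the first three gene_info columns are tax_id, GeneID and Symbol. The placeholder rows are made up.

```python
import io

# Assumed NCBI Taxonomy ids for the default taxa (human, mouse, zebrafish).
DEFAULT_TAX_IDS = {9606, 10090, 7955}


def filter_by_taxon(lines, tax_ids=None):
    """Yield (tax_id, gene_id, symbol) for rows in a wanted taxon.

    Hypothetical helper: falls back to DEFAULT_TAX_IDS when tax_ids is None,
    mirroring the tax_ids=None default in the constructor signature.
    """
    wanted = {int(t) for t in (tax_ids or DEFAULT_TAX_IDS)}
    for line in lines:
        # Skip the header comment line and any blank lines.
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if int(cols[0]) in wanted:
            yield int(cols[0]), cols[1], cols[2]


# Tiny in-memory stand-in for gene_info; the last two rows are placeholders.
SAMPLE = (
    "#tax_id\tGeneID\tSymbol\n"
    "9606\t1\tA1BG\n"
    "7227\t999001\tplaceholder_fly\n"
    "10090\t999002\tplaceholder_mouse\n"
)

kept_default = list(filter_by_taxon(io.StringIO(SAMPLE)))
kept_fly = list(filter_by_taxon(io.StringIO(SAMPLE), tax_ids=[7227]))
```

Passing an explicit `tax_ids` list replaces the defaults rather than extending them in this sketch; a calling script that wants additional taxa would pass the full list.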

All entries in the gene_history file are added as deprecated classes, and linked to the current gene id, with “replaced_by” relationships.
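As a rough illustration of the gene_history handling, the sketch below turns rows into deprecation and replacement statements. It assumes the column order tax_id, GeneID, Discontinued_GeneID, Discontinued_Symbol, Discontinue_Date, and that a GeneID of "-" means no replacement exists. The function name, the `NCBIGene:` prefix, and the plain-string predicates are placeholders for illustration, not the identifiers the parser necessarily emits. The sample ids are invented.

```python
def history_to_triples(lines):
    """Return (subject, predicate, object) tuples for discontinued genes.

    Hypothetical helper: every discontinued id is marked deprecated, and is
    linked to the current id only when one is given.
    """
    triples = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        _tax_id, gene_id, discontinued_id = line.rstrip("\n").split("\t")[:3]
        old = "NCBIGene:" + discontinued_id
        triples.append((old, "owl:deprecated", "true"))
        if gene_id != "-":  # assumed marker for "no current gene id"
            triples.append((old, "replaced_by", "NCBIGene:" + gene_id))
    return triples


# Invented rows: one discontinued id with a replacement, one without.
history_rows = [
    "#tax_id\tGeneID\tDiscontinued_GeneID\tDiscontinued_Symbol\tDiscontinue_Date\n",
    "9606\t100\t200\tOLD1\t20100101\n",
    "9606\t-\t300\tOLD2\t20110101\n",
]
history_triples = history_to_triples(history_rows)
```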

Since we do not know much about the specific nature of each link in gene2pubmed, we simply create a “mentions” relationship.
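A minimal sketch of the gene2pubmed step, assuming the file's columns are tax_id, GeneID and PubMed_ID. The function name, the `PMID:` and `NCBIGene:` prefixes, and the direction of the statement (publication mentions gene) are assumptions made for illustration. The sample ids are invented.

```python
def pubmed_mentions(lines):
    """Return (publication, 'mentions', gene) tuples, one per data row."""
    mentions = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        _tax_id, gene_id, pubmed_id = line.rstrip("\n").split("\t")[:3]
        # No finer-grained semantics are known, so every row gets the
        # same generic predicate.
        mentions.append(("PMID:" + pubmed_id, "mentions", "NCBIGene:" + gene_id))
    return mentions


pubmed_rows = [
    "#tax_id\tGeneID\tPubMed_ID\n",
    "9606\t100\t111111\n",
    "9606\t100\t222222\n",
]
mention_triples = pubmed_mentions(pubmed_rows)
```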

SCIGRAPH_BASE = 'https://scigraph-ontology-dev.monarchinitiative.org/scigraph/graph/'
add_orthologs_by_gene_group(graph, gene_ids)

This will get orthologies between human and other vertebrate genomes based on the gene_group annotation pipeline from NCBI. More information can be found here: http://www.ncbi.nlm.nih.gov/news/03-13-2014-gene-provides-orthologs-regions/ The method for associations is described in [PMCID:3882889](http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3882889/) (the same article as [PMID:24063302](http://www.ncbi.nlm.nih.gov/pubmed/24063302/)). Because these associations are only between human and other vertebrate genomes, they will miss very distant orthologies and should not be considered complete.

We do not run this within the NCBI parser itself; rather, it is a convenience function for other parsers to call.
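The selection of orthology rows for a set of gene ids might look roughly like the sketch below. It assumes gene_group columns in the order tax_id, GeneID, relationship, Other_tax_id, Other_GeneID, and that orthology rows carry the relationship value "Ortholog". The function name `ortholog_pairs` is hypothetical, and matching on either side of the pair is a design choice of this sketch rather than documented behaviour. The sample ids are invented.

```python
def ortholog_pairs(lines, gene_ids):
    """Return (gene_id, other_gene_id) pairs involving any requested gene."""
    wanted = {str(g) for g in gene_ids}
    pairs = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        _tax_id, gene_id, relationship, _other_tax_id, other_gene_id = cols[:5]
        if relationship != "Ortholog":
            continue  # ignore other relationship types in the file
        if gene_id in wanted or other_gene_id in wanted:
            pairs.append((gene_id, other_gene_id))
    return pairs


group_rows = [
    "#tax_id\tGeneID\trelationship\tOther_tax_id\tOther_GeneID\n",
    "9606\t100\tOrtholog\t10090\t500\n",
    "9606\t101\tOrtholog\t10090\t501\n",
    "9606\t100\tRelated pseudogene\t9606\t900\n",
]
found_pairs = ortholog_pairs(group_rows, gene_ids=[100])
```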

Parameters:
  • graph
  • gene_ids – Gene ids to fetch the orthology
Returns:

fetch(is_dl_forced=False)

Abstract method to fetch all data from an external resource. This should be overridden by subclasses.

Returns: None

files = {'gene2pubmed': {'url': 'http://ftp.ncbi.nih.gov/gene/DATA/gene2pubmed.gz', 'file': 'gene2pubmed.gz'}, 'gene_group': {'url': 'http://ftp.ncbi.nih.gov/gene/DATA/gene_group.gz', 'file': 'gene_group.gz'}, 'gene_history': {'url': 'http://ftp.ncbi.nih.gov/gene/DATA/gene_history.gz', 'file': 'gene_history.gz'}, 'gene_info': {'url': 'http://ftp.ncbi.nih.gov/gene/DATA/gene_info.gz', 'file': 'gene_info.gz'}}
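To show how a mapping shaped like `files` can drive downloads, the sketch below pairs each URL with a local target path. The helper name `download_plan` and the `rawdir` argument are hypothetical, and no network access is performed; how the module itself stores or retrieves the files is not stated here.

```python
import os

# Same shape as the class attribute above.
files = {
    'gene2pubmed': {'url': 'http://ftp.ncbi.nih.gov/gene/DATA/gene2pubmed.gz',
                    'file': 'gene2pubmed.gz'},
    'gene_group': {'url': 'http://ftp.ncbi.nih.gov/gene/DATA/gene_group.gz',
                   'file': 'gene_group.gz'},
    'gene_history': {'url': 'http://ftp.ncbi.nih.gov/gene/DATA/gene_history.gz',
                     'file': 'gene_history.gz'},
    'gene_info': {'url': 'http://ftp.ncbi.nih.gov/gene/DATA/gene_info.gz',
                  'file': 'gene_info.gz'},
}


def download_plan(files, rawdir):
    """Return (key, url, local_path) tuples, sorted by key for stable output."""
    return [
        (key, entry['url'], os.path.join(rawdir, entry['file']))
        for key, entry in sorted(files.items())
    ]


plan = download_plan(files, rawdir='raw')
```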
getTestSuite()

An abstract method that should be overridden with tests appropriate for the specific source.

static map_type_of_gene(sotype)
parse(limit=None)

Abstract method to parse all data from an external resource that was fetched in fetch(). This should be overridden by subclasses.

Returns: None

resources = {'clique_leader': '../../resources/clique_leader.yaml'}