dipper.sources.FlyBase module

class dipper.sources.FlyBase.FlyBase(graph_type, are_bnodes_skolemized)

Bases: dipper.sources.PostgreSQLSource.PostgreSQLSource

This is the [Drosophila Genetics](http://www.flybase.org/) resource, from which we process genotype and phenotype data about fruit flies. Genotypes are modeled using the GENO genotype model.

Here, we connect to their public database and download a subset of tables/views to get specifically at the genotype-phenotype data, then iterate over the tables. Because the tables are processed one at a time, we effectively perform the joins ourselves when adding nodes to the graph. We connect using [Direct Chado Access](http://gmod.org/wiki/Public_Chado_Databases#Direct_Chado_Access).
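The table-by-table processing described above amounts to a hand-written join. The following is a minimal sketch of that idea, not dipper's actual code: the two tab-separated dumps, their contents, and the helpers `load_genotypes` and `join_feature_genotype` are assumptions made for illustration.

```python
import csv
import io

# Hypothetical stand-ins for two locally dumped tables (tab-separated).
GENOTYPE_TSV = (
    "genotype_id\tuniquename\n"
    "1\tgeno_A\n"
    "2\tgeno_B\n"
)
FEATURE_GENOTYPE_TSV = (
    "feature_id\tgenotype_id\n"
    "10\t1\n"
    "11\t1\n"
    "12\t2\n"
    "13\t99\n"  # refers to a genotype that was not dumped
)


def load_genotypes(handle):
    """First pass: index the genotype table by its internal key."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["genotype_id"]: row["uniquename"] for row in reader}


def join_feature_genotype(handle, genotypes):
    """Second pass: resolve each genotype_id against the index built earlier."""
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        name = genotypes.get(row["genotype_id"])
        if name is None:
            continue  # no matching genotype row; nothing to link
        yield row["feature_id"], name


genotypes = load_genotypes(io.StringIO(GENOTYPE_TSV))
pairs = list(join_feature_genotype(io.StringIO(FEATURE_GENOTYPE_TSV), genotypes))
```

In this sketch the index must be built before the second table is read, which is the same constraint that makes processing order matter when tables are handled one at a time.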

When processing the full data set, performance is best when raw triples are dumped using the flag `--format nt`.
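For context, N-Triples (the `nt` format) is a line-oriented RDF serialization in which every triple is written as one self-contained line. The sketch below is not dipper's writer; the function name `to_ntriple` and the example IRIs are hypothetical, and it covers only IRIs and plain string literals. It illustrates that a triple in this format can be emitted without reference to any other triple.

```python
def to_ntriple(subject, predicate, obj, is_literal=False):
    """Format one triple as a single N-Triples line (IRIs and plain literals only)."""
    if is_literal:
        # Escape the characters that may not appear raw inside a quoted literal.
        escaped = (
            obj.replace("\\", "\\\\")
            .replace('"', '\\"')
            .replace("\n", "\\n")
            .replace("\r", "\\r")
        )
        obj_term = f'"{escaped}"'
    else:
        obj_term = f"<{obj}>"
    return f"<{subject}> <{predicate}> {obj_term} ."


line = to_ntriple(
    "http://example.org/s",
    "http://example.org/p",
    "http://example.org/o",
)
```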

fetch(is_dl_forced=False)

files = {'disease_models': {'url': 'ftp://ftp.flybase.net/releases/current/precomputed_files/human_disease/allele_human_disease_model_data_fb_*.tsv.gz', 'file': 'allele_human_disease_model_data.tsv.gz'}}

getTestSuite()

An abstract method that should be overridden with tests appropriate for the specific source.

parse(limit=None)

We process each of the PostgreSQL tables in turn. The order of processing is important here, as we build up a hashmap of internal vs. external identifiers (unique keys by type to FB id). These include allele, marker (gene), publication, strain, genotype, annotation (association), and descriptive notes.

Parameters:
limit: Only parse this many lines of each table.
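The role of the identifier hashmap, and why processing order matters, can be sketched as follows. This is an illustration under stated assumptions rather than the class's implementation: the `idhash` layout, the `register` and `resolve` helpers, and the example key and identifier are all hypothetical.

```python
# Hypothetical lookup structure: one dict per identifier type, each mapping
# an internal database key to an external identifier.
idhash = {"allele": {}, "gene": {}, "genotype": {}}


def register(kind, internal_key, external_id):
    """Record the external id for an internal key of the given type."""
    idhash[kind][internal_key] = external_id


def resolve(kind, internal_key):
    """Return the external id, or None if no table has registered the key yet."""
    return idhash[kind].get(internal_key)


# A lookup made before the defining table has been processed finds nothing.
before = resolve("gene", 101)

# Once the defining table has been processed, later tables can resolve the key.
register("gene", 101, "EXAMPLE:gene101")
after = resolve("gene", 101)
```

A table that refers to genes by internal key can therefore only be translated after the table that defines those keys has been read.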

querys = {'feature': "\n SELECT feature_id, dbxref_id, organism_id, name, uniquename,\n null as residues, seqlen, md5checksum, type_id, is_analysis,\n timeaccessioned, timelastmodified\n FROM feature WHERE is_analysis = false and is_obsolete = 'f'\n ", 'feature_dbxref_WIP': ' -- 17M rows in ~2 minutes\n SELECT\n feature.name feature_name, feature.uniquename feature_id,\n organism.abbreviation abbrev, organism.genus, organism.species,\n cvterm.name frature_type, db.name db, dbxref.accession\n FROM feature_dbxref\n JOIN dbxref ON feature_dbxref.dbxref_id = dbxref.dbxref_id\n JOIN db ON dbxref.db_id = db.db_id\n JOIN feature ON feature_dbxref.feature_id = feature.feature_id\n JOIN organism ON feature.organism_id = organism.organism_id\n JOIN cvterm ON feature.type_id = cvterm.cvterm_id\n WHERE feature_dbxref.is_current = true\n AND feature.is_analysis = false\n AND feature.is_obsolete = false\n AND cvterm.is_obsolete = 0\n ;\n '}
resources = [{'outfile': 'feature_relationship', 'query': '../../resources/sql/fb/feature_relationship.sql'}, {'outfile': 'stockprop', 'query': '../../resources/sql/fb/stockprop.sql'}]
tables = ['genotype', 'feature_genotype', 'pub', 'feature_pub', 'pub_dbxref', 'feature_dbxref', 'cvterm', 'stock_genotype', 'stock', 'organism', 'organism_dbxref', 'environment', 'phenotype', 'phenstatement', 'dbxref', 'phenotype_cvterm', 'phendesc', 'environment_cvterm']
test_keys = {'allele': [29677937, 23174110, 23230960, 23123654, 23124718, 23146222, 29677936, 23174703, 11384915, 11397966, 53333044, 23189969, 3206803, 29677937, 29677934, 23256689, 23213050, 23230614, 23274987, 53323093, 40362726, 11380755, 11380754, 23121027, 44425218, 28298666], 'annot': [437783, 437784, 437785, 437786, 437789, 437796, 459885, 436779, 436780, 479826], 'feature': [11411407, 53361578, 53323094, 40377849, 40362727, 11379415, 61115970, 11380753, 44425219, 44426878, 44425220], 'gene': [23220066, 10344219, 58107328, 3132660, 23193483, 3118401, 3128715, 3128888, 23232298, 23294450, 3128626, 23255338, 8350351, 41994592, 3128715, 3128432, 3128840, 3128650, 3128654, 3128602, 3165464, 23235262, 3165510, 3153563, 23225695, 54564652, 3111381, 3111324], 'genotype': [267393, 267400, 130147, 168516, 111147, 200899, 46696, 328131, 328132, 328134, 328136, 381024, 267411, 327436, 197293, 373125, 361163, 403038], 'notes': [], 'organism': [1, 226, 456], 'pub': [359867, 327373, 153054, 153620, 370777, 154315, 345909, 365672, 366057, 11380753], 'strain': [8117, 3649, 64034, 213, 30131]}