This page provides downloads of the data, and the tools used to construct the mapping, together with instructions on how to use them.
Follow these instructions to create a MySQL version TreeBASE
We need to add a table to TreeBASE that joins studys and taxa. One way is simply to make the join, dump it to disk, and reimport it.
SELECT Study.StudyID, TreeTaxa.TaxonID INTO OUTFILE '/tmp/StudyTaxa.txt' FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n' FROM Analysis INNER JOIN Study ON Analysis.StudyID = Study.StudyID INNER JOIN Tree ON Analysis.AnalysisID = Tree.AnalysisID INNER JOIN TreeTaxa ON Tree.TreeID = TreeTaxa.TreeID;
Follow these instructions to create a copy of the NCBI tree in MySQL.
Once the NCBI classification has been added to the MySQL database, we need to create a table linking the mapping to the TreeBASE studies. This table is used to speed up hierarchical queries.
CREATE TABLE `ncbi_classification` ( `StudyID` varchar(32) default NULL, `TaxonID` varchar(32) default NULL, `left_id` int(11) default NULL, `right_id` int(11) default NULL, `Ref` text, KEY `StudyID` (`StudyID`), KEY `left_id` (`left_id`), KEY `right_id` (`right_id`) ) ENGINE=MyISAM DEFAULT CHARSET=latin1;
We then populate this table.
SELECT StudyTaxa.StudyID, StudyTaxa.TaxonID, ncbi_tree.left_id, ncbi_tree.right_id INTO OUTFILE '/tmp/ncbi_classification.txt' FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n' FROM rdmp_mapping INNER JOIN ncbi_tree ON (rdmp_mapping.source_id = ncbi_tree.tax_id) INNER JOIN StudyTaxa ON (rdmp_mapping.TaxonID = StudyTaxa.TaxonID) INNER JOIN Study ON Study.StudyID = StudyTaxa.StudyID WHERE (rdmp_mapping.source = 'ncbi'); LOAD DATA LOCAL INFILE '/users/rpage/ncbi_classification.txt' INTO TABLE ncbi_classification FIELDS TERMINATED BY '\t' (StudyID, TaxonID, left_id, right_id);
Now, we can find studies containing a taxon in the NCBI classification. For example, to find all studies with birds ("Aves", NCBI tax_id 8782), we get the values of left_id and right_id from the NCBI tree (37636 and 52543, respectively in my version of the NCBI tree), and perform this query:
SELECT DISTINCT ncbi_classification.StudyID, COUNT(ncbi_classification.TaxonID) AS hits, Study.Ref FROM ncbi_classification INNER JOIN Study ON Study.StudyID = ncbi_classification.StudyID WHERE (ncbi_classification.left_id >= 37636) AND (ncbi_classification.right_id <= 52543) GROUP BY ncbi_classification.StudyID ORDER BY hits DESC;';
hits is the number of birds in each study.
The mapping was constructed using a series of Perl scripts, database queries to local copies of NCBI and ITIS databases, and manual cleaning. The mapping is available as a MySQL dump here. This SQL dump contains some extra goodies, such as a provisional list of PubMed ids, DOIs, and Handles for studies in TreeBASE.
The mapping also includes the "name clusters", which are stored in the table rdmp_tc. Instructions on how to generate the name clusters are here.