Downloads

This page provides downloads of the data, and the tools used to construct the mapping, together with instructions on how to use them.

TreeBASE in MySQL

Follow these instructions to create a MySQL version TreeBASE

StudyTaxa table

We need to add a table to TreeBASE that joins studys and taxa. One way is simply to make the join, dump it to disk, and reimport it.

SELECT Study.StudyID,  TreeTaxa.TaxonID
 INTO OUTFILE '/tmp/StudyTaxa.txt'
 FIELDS TERMINATED BY '\t'
 LINES TERMINATED BY '\n'
FROM Analysis
INNER JOIN Study ON Analysis.StudyID = Study.StudyID
INNER JOIN Tree ON Analysis.AnalysisID = Tree.AnalysisID
INNER JOIN TreeTaxa ON Tree.TreeID = TreeTaxa.TreeID;

NCBI Taxonomy Tree

Step 1

Follow these instructions to create a copy of the NCBI tree in MySQL.

Step 2

Once the NCBI classification has been added to the MySQL database, we need to create a table linking the mapping to the TreeBASE studies. This table is used to speed up hierarchical queries.

CREATE TABLE `ncbi_classification` (
  `StudyID` varchar(32) default NULL,
  `TaxonID` varchar(32) default NULL,
  `left_id` int(11) default NULL,
  `right_id` int(11) default NULL,
  `Ref` text,
  KEY `StudyID` (`StudyID`),
  KEY `left_id` (`left_id`),
  KEY `right_id` (`right_id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

We then populate this table.

  SELECT StudyTaxa.StudyID, StudyTaxa.TaxonID, ncbi_tree.left_id, ncbi_tree.right_id 
	INTO OUTFILE '/tmp/ncbi_classification.txt'
	FIELDS TERMINATED BY '\t'
	LINES TERMINATED BY '\n'
	FROM rdmp_mapping
	INNER JOIN ncbi_tree ON (rdmp_mapping.source_id = ncbi_tree.tax_id)
	INNER JOIN StudyTaxa ON (rdmp_mapping.TaxonID = StudyTaxa.TaxonID)
	INNER JOIN Study ON Study.StudyID = StudyTaxa.StudyID
WHERE (rdmp_mapping.source = 'ncbi');

LOAD DATA LOCAL INFILE '/users/rpage/ncbi_classification.txt'
INTO TABLE ncbi_classification FIELDS TERMINATED BY '\t'
(StudyID, TaxonID, left_id, right_id);

Now, we can find studies containing a taxon in the NCBI classification. For example, to find all studies with birds ("Aves", NCBI tax_id 8782), we get the values of left_id and right_id from the NCBI tree (37636 and 52543, respectively in my version of the NCBI tree), and perform this query:


SELECT DISTINCT ncbi_classification.StudyID,  COUNT(ncbi_classification.TaxonID) AS hits, Study.Ref 
FROM ncbi_classification
	INNER JOIN Study ON Study.StudyID = ncbi_classification.StudyID
WHERE (ncbi_classification.left_id >= 37636) 
	AND (ncbi_classification.right_id <= 52543) 
GROUP BY ncbi_classification.StudyID
ORDER BY hits DESC;';

hits is the number of birds in each study.

Mapping

The mapping was constructed using a series of Perl scripts, database queries to local copies of NCBI and ITIS databases, and manual cleaning. The mapping is available as a MySQL dump here. This SQL dump contains some extra goodies, such as a provisional list of PubMed ids, DOIs, and Handles for studies in TreeBASE.

Name clusters

The mapping also includes the "name clusters", which are stored in the table rdmp_tc. Instructions on how to generate the name clusters are here.