bac120_taxonomy_.tsv
  GTDB taxonomy for all bacterial genomes assigned to a GTDB species cluster.

bac120_.tree
  Bacterial reference tree inferred from the concatenation of 120 proteins and
  spanning the representative genomes for each bacterial species cluster. This
  tree is used to curate the GTDB taxonomy. The provided tree is in Newick
  format, decorated with the GTDB taxonomy, and contains non-parametric
  bootstrap support values.

bac120_.sp_labels.tree
  Identical to bac120_.tree, except that species labels are appended to each
  genome.

bac120_metadata_.tsv
  Metadata for all bacterial genomes, including GTDB and NCBI taxonomies,
  completeness and contamination estimates, assembly statistics, and genomic
  properties.

metadata_field_desc.tsv
  Description of each field in the above metadata file and the originating
  source of the metadata.

bac120_individual_genes_.tar.gz
  Untrimmed and unaligned marker genes used in the concatenated alignment from
  which the bacterial reference tree is inferred. A genome lacking a marker
  gene indicates that either multiple hits or no hits were found for that gene.
  Genes are provided as both nucleotide and amino acid sequences. Trimmed and
  aligned hits can be found in bac120_msa_individual_genes_.tar.gz.

bac120_msa_individual_genes_.tar.gz
  Multiple sequence alignments for each of the 120 bacterial proteins.

bac120_msa_marker_info_.tsv
  Information about each of the 120 bacterial proteins used to infer the
  bacterial reference tree. The order of proteins in this file indicates the
  order in which they are concatenated in the MSA.

bac120_msa_mask_.txt
  Mask indicating which columns were trimmed from the concatenated alignment
  of the 120 bacterial proteins.

bac120_msa_.faa
  FASTA file of the trimmed multiple sequence alignment used to infer the
  bacterial reference tree.

bac120_ssu_.fna
  FASTA file of 16S rRNA gene sequences identified within the set of bacterial
  representative genomes.
  The longest identified 16S rRNA sequence is selected for each representative
  genome. The assigned taxonomy reflects the GTDB classification of the genome.
  Sequences are identified using nhmmer with the 16S rRNA model (RF00177) from
  the RFAM database. Only sequences with a length >=200 bp and an E-value
  <= 1e-06 are reported. In a small number of cases, the 16S rRNA sequences are
  incongruent with this taxonomic assignment as a result of contaminating
  16S rRNA sequences.

gtdb__rep_genomes.tar.gz
  FASTA files for each GTDB representative genome.

gtdb__rep_genomes.protein.faa.tar.gz
  Amino acid FASTA files for all protein-coding gene sequences predicted with
  Prodigal.

gtdb__rep_genomes.protein.fna.tar.gz
  Nucleotide FASTA files for all protein-coding gene sequences predicted with
  Prodigal.

gtdb_uba_mags.tar.gz
  Genomic files for 3,087 UBA genomes used to infer the GTDB taxonomy. The
  majority of these are now available from NCBI under BioProject PRJNA417962.

gtdb_uba_mags_arc.tar.gz
  FASTA files for 234 archaeal UBA genomes not yet available through INSDC.

gtdbtk_data.tar.gz
  Reference data required by the companion tool GTDB-Tk
  (https://github.com/Ecogenomics/GTDBTk) for classifying genomes according to
  the GTDB. This includes the genomic FASTA files for the GTDB reference
  genomes.

hq_mimag_genomes_.tsv
  List of genomes meeting the MIMAG high-quality genome criteria
  (Bowers et al., Nat Biotechnol, 2017):
   - estimated completeness >90%
   - estimated contamination <5%
   - presence of the 5S, 16S, and 23S rRNA genes
   - at least 18 tRNAs

ncbi_vs_gtdb__bacteria.xlsx
  Correspondence between NCBI and GTDB taxa, ordered by degree of polyphyly.

sp_clusters_.tsv
  Metadata file indicating the representative genome of each GTDB species
  cluster, the set of genomes assigned to the species cluster, and the average
  nucleotide identity radius used to circumscribe the species cluster.

ssu_.fna
  FASTA file containing 16S rRNA sequences identified across the set of GTDB
  genomes passing QC.
  The assigned taxonomy reflects the GTDB classification of the genome.
  Sequences are identified using nhmmer with the 16S rRNA model (RF00177) from
  the RFAM database. All sequences with a length >=200 bp and an E-value
  <= 1e-06 are reported.

synonyms_.tsv
  List of species considered synonyms in the GTDB taxonomy.

gtdb_.dic
  List of all taxa, with and without rank prefixes, in GTDB. This can be used
  as a dictionary in word processing programs to indicate the correct spelling
  of GTDB taxa. See:
  https://www.officetooltips.com/word_2016/tips/how_to_share_the_custom_dictionary_in_word.html

Analogous files are provided for Archaea.
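
The numeric thresholds stated in this file (the MIMAG high-quality criteria listed for hq_mimag_genomes_.tsv and the length and E-value cut-offs for the 16S rRNA files) can be sketched as simple predicates. This is only an illustration under stated assumptions: the class and field names below are hypothetical and are not the column names of the metadata file (see metadata_field_desc.tsv for those), and this is not the code used to build the release.

```python
from dataclasses import dataclass


@dataclass
class GenomeStats:
    # Hypothetical field names chosen for illustration only; the real
    # metadata columns are documented in metadata_field_desc.tsv.
    completeness: float   # estimated completeness, in percent
    contamination: float  # estimated contamination, in percent
    has_5s: bool          # a 5S rRNA gene was found
    has_16s: bool         # a 16S rRNA gene was found
    has_23s: bool         # a 23S rRNA gene was found
    trna_count: int       # number of tRNAs found


def is_mimag_high_quality(g: GenomeStats) -> bool:
    """Apply the four criteria exactly as listed for hq_mimag_genomes_.tsv."""
    return (
        g.completeness > 90.0
        and g.contamination < 5.0
        and g.has_5s
        and g.has_16s
        and g.has_23s
        and g.trna_count >= 18
    )


def passes_ssu_filter(length_bp: int, evalue: float) -> bool:
    """Reporting cut-offs for 16S rRNA hits: length >=200 bp, E-value <= 1e-06."""
    return length_bp >= 200 and evalue <= 1e-06
```

Both comparisons follow the wording above literally: completeness and contamination use strict inequalities, while the tRNA count, sequence length, and E-value thresholds are inclusive.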