Gene identification
-------------------

- Gene calling was performed with Prodigal v2.6.3 (Hyatt et al., 2010), and marker 
  genes were identified and aligned using HMMER v3.1b1 (Eddy, 2011). Marker genes and 
  corresponding HMMs are from the Pfam v33.1 (Finn et al., 2014) and TIGRFAMs 
  v15.0 (Haft et al., 2003) databases.
  

Tree inference
--------------

- The bacterial reference tree is inferred with FastTree v2.1.10 under the WAG model
  from the concatenated alignment of 120 ubiquitous bacterial genes (Parks et al., 2018).
- The archaeal reference tree is inferred with IQ-Tree v2.1.2 under the PMSF model from the
  concatenated alignment of 53 archaeal genes, with FastTree v2.1.10 used to infer the
  initial guide tree. The 53 genes are a subset of the "top-ranked marker proteins" from a
  recent evaluation based on minimizing horizontal gene transfer and optimizing the
  recovery of monophyletic lineages (Rinke & Spang et al., 2020).


Identifying 16S rRNA sequences
------------------------------

- Sequences are identified using nhmmer v3.1b2 (Wheeler and Eddy, 2013) with the 
  bacterial (RF00177) and archaeal (RF01959) 16S rRNA models 
  from the RFAM database (Kalvari et al., 2018).
  
  
Average nucleotide identity
---------------------------

Average nucleotide identity (ANI) and alignment fraction (AF) values were calculated
with skani v0.2.1 (Shaw and Yu, 2023).


Genome QC criteria
------------------

Genomes are obtained from NCBI and must meet the following criteria to be included in 
the GTDB reference trees and database:

   i) CheckM completeness estimate >50%
  ii) CheckM contamination estimate <10%
 iii) quality score, defined as completeness - 5*contamination, >50
  iv) contain >40% of the bac120 or ar53 marker genes
   v) contain <1000 contigs
  vi) have an N50 >5kb
 vii) contain <100,000 ambiguous bases

Filtered genomes are manually inspected and exceptions are made for genomes of high 
nomenclatural or taxonomic significance, e.g. the isolate genome Ktedonobacter racemifer 
representing the class Ktedonobacteria in the phylum Chloroflexota has a contamination 
estimate of 11%. Genomes with CheckM contamination between 10% and 20% which pass criteria 
i and iv to vii are also retained if >80% of all duplicate marker genes are 100% identical, 
as this suggests a large legitimate genome duplication event, e.g. GCF_004799645.1, a 
complete isolate genome from the type strain of Natronorubrum bangense.
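
The criteria above can be sketched in Python as follows. This is a minimal 
illustration, not the GTDB implementation: the GenomeStats class, its field names, 
and both function names are hypothetical, treating "between 10% and 20%" as an 
inclusive range is an assumption, and the manual exceptions for genomes of high 
nomenclatural or taxonomic significance are not modelled.

```python
from dataclasses import dataclass


@dataclass
class GenomeStats:
    """Hypothetical container for the per-genome values used by the QC criteria."""
    completeness: float        # CheckM completeness estimate (%)
    contamination: float       # CheckM contamination estimate (%)
    marker_perc: float         # % of bac120 or ar53 marker genes present
    contig_count: int
    n50: int                   # in bp
    ambiguous_bases: int
    identical_dup_marker_perc: float = 0.0  # % of duplicate markers that are 100% identical


def passes_criteria_i_and_iv_to_vii(g: GenomeStats) -> bool:
    """Criteria i and iv to vii, which do not depend on contamination."""
    return (
        g.completeness > 50
        and g.marker_perc > 40
        and g.contig_count < 1000
        and g.n50 > 5000
        and g.ambiguous_bases < 100_000
    )


def passes_qc(g: GenomeStats) -> bool:
    """Apply criteria i to vii, then the duplicated-marker exception."""
    if not passes_criteria_i_and_iv_to_vii(g):
        return False
    quality = g.completeness - 5 * g.contamination  # criterion iii
    if g.contamination < 10 and quality > 50:
        return True
    # Exception: contamination of 10-20% (bounds assumed inclusive) is tolerated
    # when >80% of duplicate marker genes are 100% identical.
    return 10 <= g.contamination <= 20 and g.identical_dup_marker_perc > 80
```

For example, a genome with 60% completeness and 3% contamination passes criteria 
i and ii but fails criterion iii, since 60 - 5*3 = 45.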


Updating GTDB species representatives
-------------------------------------

Each GTDB species is defined by a single representative genome, and species assignments 
are established by considering the ANI and AF to these representative genomes (Parks et al., 
Nature Biotechnology, 2019). Species representatives are re-evaluated each GTDB release 
with an emphasis placed on retaining representatives so they can serve as effective 
nomenclatural type material. However, the goal of stable representatives must be balanced 
with the desire to use high-quality genomes as representatives, the incorporation of 
changing taxonomic opinion, and identified errors in genome classification or assembly. 

GTDB representatives are updated according to two primary principles: i) representatives 
should be assembled from the type strain of a species whenever possible, and ii) representatives 
should only be replaced by assemblies of suitably higher overall quality. These two principles 
are quantitatively defined by the balanced ANI score (BAS) given by:

  0.5 * (ANI score) + 0.5 * (quality score)
  
where the ANI score is 100 - 20*(100 - ANI to current representative) and the quality 
score is defined by the criteria given in Table 1. An existing representative is only 
replaced by a new genome if the BAS of the new genome is at least 10 above the BAS of the 
current representative. Intuitively, the BAS achieves the goal of stable representatives by 
requiring a new representative to be of increasingly higher quality (as defined by the 
quality score) the more dissimilar it is from the current representative (as defined by 
the ANI score).
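
The BAS and the replacement rule can be illustrated with the short Python sketch 
below. The function names are hypothetical, the quality score is taken as a 
precomputed input, and scoring the current representative with an ANI of 100 to 
itself is an assumption rather than something stated above.

```python
def ani_score(ani_to_current_rep: float) -> float:
    """ANI score: 100 - 20*(100 - ANI to current representative)."""
    return 100 - 20 * (100 - ani_to_current_rep)


def balanced_ani_score(ani_to_current_rep: float, quality_score: float) -> float:
    """BAS: 0.5 * (ANI score) + 0.5 * (quality score)."""
    return 0.5 * ani_score(ani_to_current_rep) + 0.5 * quality_score


def should_replace(current_quality: float,
                   candidate_quality: float,
                   candidate_ani: float,
                   min_gain: float = 10.0) -> bool:
    """Replace only if the candidate's BAS is >= min_gain above the current BAS."""
    # Assumption: the current representative has an ANI of 100 to itself.
    current_bas = balanced_ani_score(100.0, current_quality)
    candidate_bas = balanced_ani_score(candidate_ani, candidate_quality)
    return candidate_bas - current_bas >= min_gain
```

Under these assumptions, a candidate with 99% ANI to the current representative 
has an ANI score of 80 and therefore needs a quality score 40 points higher than 
the current representative to reach the required BAS gain of 10, whereas a 
candidate with 100% ANI needs only 20 points.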

Representatives are also updated when a genome assembly is removed from NCBI or when 
the underlying assembly is updated at NCBI.


TABLE 1. Criteria used to establish assembly quality score

CRITERIA: SCORE
Type strain of species: 100,000
Effective type strain of species according to NCBI: 10,000
NCBI representative of species: 1,000
Complete genome: 100
CheckM quality estimate: completeness - 5*contamination
MAG or SAG: -100
Contig count: -5 * (no. contigs/100)
Undetermined bases: -5 * (no. undetermined bases/10,000)
Full length 16S rRNA gene: 10
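
As an illustration, the criteria in Table 1 might be combined as in the Python 
sketch below. Summing the individual terms into a single score is an assumption, 
as is the use of real-valued (not integer) division for the contig and 
undetermined-base penalties. The function and parameter names are hypothetical.

```python
def assembly_quality_score(*,
                           is_type_strain: bool = False,
                           is_ncbi_effective_type_strain: bool = False,
                           is_ncbi_representative: bool = False,
                           is_complete: bool = False,
                           completeness: float = 0.0,
                           contamination: float = 0.0,
                           is_mag_or_sag: bool = False,
                           contig_count: int = 0,
                           undetermined_bases: int = 0,
                           has_full_length_16s: bool = False) -> float:
    """Sum the Table 1 terms (additive combination is an assumption)."""
    score = 0.0
    if is_type_strain:                      # first row of Table 1 (type material)
        score += 100_000
    if is_ncbi_effective_type_strain:
        score += 10_000
    if is_ncbi_representative:
        score += 1_000
    if is_complete:
        score += 100
    score += completeness - 5 * contamination   # CheckM quality estimate
    if is_mag_or_sag:
        score -= 100
    score -= 5 * (contig_count / 100)           # contig count penalty
    score -= 5 * (undetermined_bases / 10_000)  # undetermined base penalty
    if has_full_length_16s:
        score += 10
    return score
```

The large gaps between the first three weights mean that, under this reading, 
type material always outranks assembly quality: no combination of the remaining 
terms can offset a difference of 100,000.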


Updating name of GTDB species clusters
--------------------------------------

The names assigned to GTDB species clusters are re-evaluated each GTDB release with an 
emphasis placed on nomenclature stability. However, names are changed in some cases to 
reflect changes in taxonomic opinions and/or to correct identified errors in GTDB or NCBI 
assignments. Species clusters containing one or more genomes assembled from the type strain 
of a species are named after the species with nomenclatural priority (Parker et al., 2019),
with the generic and specific names changed as necessary to reflect any genus level 
reclassifications in the GTDB. Species names identified as synonyms are provided as a separate 
file in the GTDB repository and updated each release.

Species clusters without a type strain genome are assigned names via a majority voting approach 
based on NCBI species assignments regarded as correct under the GTDB framework. A genome is 
considered to have an erroneous NCBI species assignment if a genome assembled from the type 
strain of this species exists and resides in a different GTDB species cluster. A cluster is 
assigned a name by majority voting if >50% of genomes in the cluster with a GTDB-validated NCBI 
name are from a single species and >50% of all genomes with this species classification are in 
the cluster. Otherwise, the species cluster is assigned an alphanumeric or Latin suffixed 
placeholder name. In order to maximize the stability of GTDB names, placeholder names are not 
updated to new placeholder names (e.g., Bacillus sp002153395 to B. subtilis_A or vice versa) even 
if an updated placeholder name might better reflect the current classification of genomes within 
a cluster.
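
The majority voting rule can be sketched in Python as follows. This is an 
illustrative reading of the two >50% conditions only: the function name and its 
inputs are hypothetical, the inputs are assumed to contain only NCBI species 
names already validated under the GTDB framework, and a return value of None 
stands for "assign a placeholder name".

```python
from collections import Counter
from typing import Mapping, Optional, Sequence


def majority_vote_name(cluster_species: Sequence[str],
                       species_totals: Mapping[str, int]) -> Optional[str]:
    """Return the species name supported by majority voting, else None.

    cluster_species: GTDB-validated NCBI species name of each named genome
                     in the cluster (one entry per genome).
    species_totals:  total number of genomes with each validated species
                     classification across all clusters.
    """
    if not cluster_species:
        return None
    species, in_cluster = Counter(cluster_species).most_common(1)[0]
    # Condition 1: >50% of validated genomes in the cluster are this species.
    majority_in_cluster = in_cluster / len(cluster_species) > 0.5
    # Condition 2: >50% of all genomes of this species are in the cluster.
    majority_of_species = in_cluster / species_totals[species] > 0.5
    return species if majority_in_cluster and majority_of_species else None
```

Both conditions are needed: the first prevents a minority name from labelling a 
cluster, and the second prevents a name from being assigned to a cluster that 
holds only a small share of the genomes carrying that name.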

Species clusters containing an assembly from the type strain of a subspecies or a subspecies 
satisfying the majority voting criteria will have the subspecies name promoted to the specific 
name of the cluster in cases where a placeholder name would otherwise be required.


Additional information
----------------------

Please consult the following GTDB publications for additional information:

Parks, D.H., et al. (2018). A standardized bacterial taxonomy based on genome 
  phylogeny substantially revises the tree of life. Nature Biotechnology, 
  36: 996-1004.

Parks, D.H., et al. (2020). A complete domain-to-species taxonomy for Bacteria 
  and Archaea. Nature Biotechnology, https://doi.org/10.1038/s41587-020-0501-8.

Chaumeil P-A, et al. (2019). GTDB-Tk: a toolkit to classify genomes with the 
  Genome Taxonomy Database. Bioinformatics, btz848:
  https://doi.org/10.1093/bioinformatics/btz848.


REFERENCES
----------

Eddy SR. 2011. Accelerated Profile HMM Searches. PLoS Comput Biol 7: e1002195.

Finn RD, et al. 2014. Pfam: The protein families database. Nucleic Acids Res 
  42: D222-230.
  
Haft DH, Selengut JD, White O. 2003. The TIGRFAMs database of protein families. 
  Nucleic Acids Res 31: 371-373.

Hyatt D, Chen GL, Locascio PF, Land ML, Larimer FW, Hauser LJ. 2010. 
  Prodigal: Prokaryotic gene recognition and translation initiation site 
  identification. BMC Bioinformatics 11: 119.
  
Jain C, et al. 2018. High throughput ANI analysis of 90K prokaryotic genomes 
  reveals clear species boundaries. Nature Communications 9: 5114.

Kalvari I, et al. 2018. Rfam 13.0: shifting to a genome-centric resource for 
  non-coding RNA families. Nucleic Acids Res 46: D335-D342.
  
Parker et al. 2019. International Code of Nomenclature of Prokaryotes. IJSEM 60: 
  doi: 10.1099/ijsem.0.000778.

Parks DH, et al. 2017. Recovery of nearly 8,000 metagenome-assembled genomes 
  substantially expands the tree of life. Nat Microbiol 2: 1533-42.

Wheeler TJ, Eddy SR. 2013. nhmmer: DNA homology search with profile HMMs. 
  Bioinformatics 29: 2487-2489.