Description
TOGA (Tool to infer Orthologs from Genome Alignments) is a homology-based
method that integrates gene annotation, inferring orthologs and classifying
genes as intact or lost.
This track contains 45,327 items covering 519,840,058 bases,
which is 43.47% of the total sequence of
1,195,847,496 nucleotides.
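The coverage percentage follows directly from the two base counts quoted above. A minimal sketch of the arithmetic (the variable names are illustrative only):

```python
# Base counts as quoted in the track description.
covered_bases = 519_840_058
total_bases = 1_195_847_496

# Fraction of the assembly covered by track items, as a percentage.
coverage_percent = 100 * covered_bases / total_bases
print(f"{coverage_percent:.2f}%")  # prints 43.47%
```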
Methods
As input, TOGA uses a gene annotation of a reference species
(human/hg38 for mammals, chicken/galGal6 for birds) and
a whole genome alignment between the reference and query genome.
TOGA implements a novel paradigm that relies on alignments of intronic
and intergenic regions and uses machine learning to accurately distinguish
orthologs from paralogs or processed pseudogenes.
To annotate genes, CESAR 2.0 is used to determine the positions and
boundaries of coding exons of a reference transcript in the orthologous
genomic locus in the query species.
Display Conventions and Configuration
Each annotated transcript is shown in a color-coded classification as
one of the following:
- "intact": the middle 80% of the CDS (coding sequence) is present and
exhibits no gene-inactivating mutation. These transcripts likely encode
functional proteins.
- "partially intact": ≥50% of the CDS is present in the query and the
middle 80% of the CDS exhibits no inactivating mutation. These transcripts
may also encode functional proteins, but the evidence is weaker as parts
of the CDS are missing, often due to assembly gaps.
- "missing": <50% of the CDS is present in the query and the middle 80%
of the CDS exhibits no inactivating mutation.
- "uncertain loss": there is 1 inactivating mutation in the middle 80% of
the CDS, but evidence is not strong enough to classify the transcript as
lost. These transcripts may or may not encode a functional protein.
- "lost": typically several inactivating mutations are present, thus there
is strong evidence that the transcript is unlikely to encode a functional
protein.
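The classes above can be read as a simple decision rule. The sketch below is a simplified illustration only, not TOGA's actual classifier, which weighs more evidence than these three inputs. The function name and parameters are hypothetical, and treating two or more inactivating mutations as "lost" is an assumption standing in for "typically several".

```python
def classify_transcript(cds_fraction_present, middle_80_present, inactivating_mutations):
    """Simplified reading of the display classes (not TOGA's real logic).

    cds_fraction_present: fraction of the CDS found in the query (0.0 to 1.0)
    middle_80_present: True if the middle 80% of the CDS is present
    inactivating_mutations: count of inactivating mutations in the middle 80%
    """
    if inactivating_mutations == 0:
        if middle_80_present:
            return "intact"
        if cds_fraction_present >= 0.5:
            return "partially intact"
        return "missing"
    if inactivating_mutations == 1:
        return "uncertain loss"
    # Assumption: "typically several" is approximated as two or more.
    return "lost"
```

For example, `classify_transcript(0.6, False, 0)` yields "partially intact" under these assumptions, while any transcript with a single inactivating mutation falls into "uncertain loss".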
Clicking on a transcript provides additional information about the orthology
classification, inactivating mutations, the protein sequence and protein/exon
alignments.
Data Access
The data for this track is stored in a bigBed file and can be
extracted with the command-line tool bigBedToBed, available from
the utilities download directory
hgdownload.soe.ucsc.edu/admin/exe/linux_x86_64.
To extract from the bigBed file:
bigBedToBed "https://hgdownload.soe.ucsc.edu/hubs/GCF/015/220/075/GCF_015220075.1/bbi/HLTOGAannotVsGalGal6v1.bb" togaData.bed
with the result in the togaData.bed file.
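Once togaData.bed has been extracted, it can be summarized with a few lines of Python. The following is a minimal sketch that assumes only the three standard leading BED columns (chrom, chromStart, chromEnd) and makes no assumption about the remaining TOGA-specific columns. The helper name `summarize_bed` is hypothetical.

```python
from collections import Counter


def summarize_bed(lines):
    """Count items per sequence and sum item lengths from BED lines."""
    per_chrom = Counter()
    total_span = 0
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        per_chrom[chrom] += 1
        total_span += end - start  # BED intervals are 0-based, half-open
    return per_chrom, total_span


# Example usage with the extracted file:
# with open("togaData.bed") as handle:
#     per_chrom, total_span = summarize_bed(handle)
#     print(sum(per_chrom.values()), "items")
```

Note that the summed span counts overlapping items more than once, so it will generally exceed the number of distinct bases covered by the track.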
Credits
This data was prepared by the Michael Hiller Lab.
References
The TOGA software is available from
github.com/hillerlab/TOGA
Kirilenko BM, Munegowda C, Osipova E, Jebb D, Sharma V, Blumer M, Morales AE,
Ahmed AW, Kontopoulos DG, Hilgers L, Lindblad-Toh K, Karlsson EK,
Zoonomia Consortium, Hiller M.
Integrating gene annotation with orthology inference at
scale. Science, 380(6643), eabn3107, 2023