TogoAnnotator Help

Introduction

In the era of high-throughput sequencers, newly obtained genome sequence data have been accumulated rapidly. This advancement of sequence equipment significantly contributes to the progress of the life science research. However, curating protein definition is still a highly labor-intensive work and a bottleneck of genome annotation. To make this curation work more efficient, TogoAnnotator accepts product names which you need to check and shows you recommended protein definitions (i.e., recommended product names of a gene) along with their related information. This related information includes candidates for recommended names, whether the input name matches ones of existing public databases such as UniProt gene symbols or not, and so on.

TogoAnnotator uses a dictionary that holds pairs of a product name to be refined and its refined (i.e., recommended) name. Here, we call the former "before name" and the latter "after name". In the TogoAnnotator process, an input product name is first checked to see if it exactly (i.e., literally) matches any after names. If it matches, it is already a recommended name and just fine. If it doesn't, then the input product name is checked to see if it exactly matches any before names. If it matches, its corresponding after name is a recommended name. In addition, TogoAnnotator provides an approximate match function, where you can get a recommended name that is approximately similar to the input product name or whose corresponding before name is approximately similar to the input name.

How to use

TogoAnnotator provides three ways of submitting product name(s). The one is to submit a single product name by the HTTP GET method. The second is to submit a plain text file that contains a list of product names, each of which is written on one line. The third is to submit a DDBJ submission file where "product" property values are used as product names to be refined. The fourth is to submit a GenBank format file, where product qualifier values are used as product names. The fifth is to sbmit a BLAST report file, where hit annotations are used as product names. The sixth is to submit a GFF3 format file, where product tag values of the attribute column are used as product names. The seventh is to submit a FASTA format file, where the description part of an entry is used as a product name. The latter six ways assume the HTTP POST method. The JSON format is used to return results for all the three ways, and their structures are basically the same except for the first way because there is only one result. In other words, an array of results for each product name is returned for the other two ways, and each result has the same structure as the first way. The technical details of each method are explained at the top page, and therefore here we just explain how to interpret the result data. As the technical details show, each result has the following five fields or attributes.

First, the "query" field contains the literally input string.

Second, the "result" field contains the recommended names for an input. If the dictionary contains multiple recommended names, they are shown with the delimiter "@".

Third, the "match" field contains the matching type of the input to a dictionary entry. There are four values for this attribute, and "ex" means "Exact match", that is, the input exactly matches an entry of the dictionary. Next, "cs" means "cosine", that is, the input matches an entry approximately. TogoAnnotator hires a character based bi-gram cosine distance to determine how similar the input string and the entry are, and the threshold is 0.6. In addition, "bcs" means "cs" to "before name". While "cs" denotes approximate match to "after name", "bcs" denotes approximate match to "before name". Even the approximate match fails, the "match" field has the value of "no_hit".

The "info" field contains related information. This field has two types of data, and the one is whether an input product name matches an entry of existing public database entries. They are enzyme (IUBMB), locus name (UniProt), gene symbol (UniProt), and family name (Pfam). The other data include explanation of the result to help you interpret it. For example, this field contains "cs_avoidance", which means the input name contains a stop-word not to be used for the approximate match. The stop-word list contains "subunit", "domain protein", "family protein", "-like protein", and "fragment". Note that "@" is used as the delimiter if multiple data are included in this field.

Finally, the "result_array" field contains a list of candidates for recommendations. This field is empty unless the approximate match is used to return a result. The symbol "@" is used as the delimiter.

Contact point

TogoAnnotator is collaboratedly developed by Database Center for Life Science (DBCLS) and Center for Information Biology, National Institute of Genetics (NIG). Any questions or suggestions are welcome. Please contact to Yasunori Yamamoto (yy@dbcls.rois.ac.jp).

Creative Commons License TogoAnnotator by Database Center for Life Science (DBCLS) is licensed under a Creative Commons 表示 2.1 日本 License. This software includes the work that is distributed in the Apache License 2.0.