Bioinformatics
(It is not clear who first coined this term or when, but it first
appeared in the literature in 1994; see
Boguski, 1994 and
Murray-Rust, 1994.)
GenBank and NCBI
Algorithms, Databases and Homology Searching
- Molecular Sequence Databases and
Their Uses (1992)
- dbEST--Database For "Expressed
Sequence Tags" (1993) |PubMed|PDF|
The brevity of this correspondence stands in inverse proportion to
the breadth and depth of its impact on molecular cell biology and
genetics. Few remember today, but back in the late 1980s and early
1990s most biologists were either skeptical about or outright
opposed to the Human Genome Project for any or all of three reasons:
the sequence would be an uninterpretable white elephant, genomics
would be mindless, boring science and the project would divert
funding resources away from conventional “R01 labs” doing the
mindful, interesting research (McIntosh
and West, 1995). In this milieu,
my colleagues and I at
NCBI started the database of
Expressed Sequence Tags (dbEST) in 1993 and, by 1996, in the
plenary address at the annual Cold
Spring Harbor Genome Meeting,
Shirley Tilghman described its
impact as part of a sea-change in biomedical research: “If a
conservative is a liberal who has just been mugged…a genome
enthusiast is a genome critic who just got a hit in the EST
database.” EST data created a number of unique
informatics challenges, both quantitative and qualitative. At
the time of its origin, dbEST contained only 22,537 partial cDNA
sequences. Within two years, it had grown to 60% of sequence
entries in GenBank (Boguski,
1995) – more data than GenBank had
accumulated during its first 13 years of operation (1982-1995).
Today (GenBank
release 163), dbEST contains 48
million sequences, a more than 2,000-fold increase in size since our
paper appeared in 1993. Although size matters, ESTs
also differed qualitatively from traditional GenBank entries
(representing functionally-cloned genes) in that the data were by
nature incomplete, inaccurate and subject to several types of
artifacts. Addressing these problems in a way that made the
data useful to biologists required the integration of numerous
cutting-edge sequence analysis algorithms and methods (Altschul,
Boguski et al., 1995) and resulted
in a number of bioinformatics “firsts” including i) periodic,
automated re-annotation of sequence records; ii) sequence filtering
and masking to deal with artifacts and natural, but confounding
sequence features; iii) similarity searching using conceptual,
six-frame translations of EST sequences and iv) creation of
non-redundant search databases. These advances enabled, for the
general biomedical community: a) gene discovery (the database “hits”
to which Tilghman referred); b) gene expression profiling via
transcript counting, years before the invention of microarray
methods and c) comparative “genomics” of transcribed sequences by
analysis of EST collections from different organisms. Accelerated
cloning was also enabled by dbEST because it was the first sequence
database to systematically provide links to physical DNA clone
resources, such as ATCC.
This was only the beginning, as dbEST
enabled important applications in genome research itself. The
clustering of highly redundant EST collections into “UniGenes” (Boguski
and Schuler, 1995) directly led to
the first large-scale gene map of the Human Genome (Schuler,
Boguski et al., 1996). These
so-called UniGenes were also critical to the development of
functional genomics and were used to design the first comprehensive
human cDNA microarray (Iyer
et al., 1998). Conversely, the
inherent redundancy of EST sequences enabled rapid and efficient
computational methods for the discovery of single nucleotide
polymorphisms or SNPs (Marth
et al., 1999). dbEST also led to
the development of heuristic methods to greatly improve the accuracy
of computational gene predictions (e.g.
Xu et al., 1997) and consequently
annotation of the human genome.
- Detecting Subtle Sequence
Signals: A Gibbs Sampling Strategy for Multiple Alignment (1993)
|PubMed|PDF|
- Gene Discovery in dbEST (1994)
|PubMed|PDF|
- Issues in Searching Molecular
Sequence Databases (1994) |PubMed|PDF|
- Constructing Aligned Sequence
Blocks (1994) |PubMed|PDF|
- A Note About Computing All Local
Alignments (1994) |PubMed|PDF|
- Hunting for Genes in Computer
Data Bases (1995) |PubMed|PDF|
- Sequence Similarity Searching
Using the BLAST Family of Programs (1995)
Review Articles and Tutorials
- On Computer-Assisted Analysis of
Biological Sequences: Proline Punctuation, Consensus Sequences, and
Apolipoprotein Repeats (1986)
|PubMed|PDF|
- Rat Apolipoprotein A-IV:
Application of Computational Methods for Studying the Structure,
Function, and Evolution of a Protein (1986)
|PubMed|PDF|
- Homology and Similarity (1991)
- Computational Sequence Analysis
Revisited: New Databases, Software Tools, and the Research
Opportunities They Engender (1992)
|PubMed|PDF|
- Bioinformatics (1994)
|PubMed|PDF|
- How to Make Discoveries in
Molecular Sequence Databases (1995)
- Internet Basics for Biologists
(1995)
- Sequence Similarity Searching
Using the BLAST Family of Programs (1995)
- Computational Analysis of DNA and
Protein Sequences (1997) |PubMed|PDF|
- Bioinformatics - a New Era (1998)
|PDF|
- The Bioinformatics Bookshelf:
Teach Yourself Bioinformatics (1999)
|PDF|
- Biomedical Informatics for
Proteomics (2003) |PubMed|PDF|
- Genome Informatics: Current
Status and Future Prospects (2003)
|PubMed|PDF|

The first Bioinformatics textbook, 1991.
Oxana Pickeral and I reviewed six
bioinformatics
books for Cell in 1999. Table II in this review contains an interesting list
of topics and activities that were within the purview of bioinformatics
at that time.