
Bioinformatics Overview


- Aim
- Similarity Searches
- Secondary Searches
- Alignment Techniques
- Functional Assignment
- Phylogenetic Analysis
- Software Based Tools



Bioinformatics is the use of computer-based tools to assist in the analysis of biological data. The development of DNA sequencing techniques, and subsequent advances within this field, have allowed a vast amount of data to be obtained in a short space of time (most notably from the Human Genome Project). This has coincided with technological developments within the computer industry. Advances in database storage techniques, software development and, above all, the formation of the World Wide Web (WWW) have all contributed to the creation of bioinformatics.

Computer-based tools (in silico analysis) can be used in a variety of ways to analyse biological data. The different techniques are briefly discussed below.

Similarity Searches


Similarity searches are the most common bioinformatics tools and are available on many bioinformatics websites, e.g. BLAST (Altschul et al., 1990). These tools allow sequences with no assigned function to be compared against sequences whose function is known. The websites typically provide a user interface into which the user enters a sequence. The search algorithm then compares this sequence with the sequences held in a database and assigns each comparison a similarity value. This allows a potential function to be assigned to the sequence in considerably less time than laboratory methods would require. The results of this type of search can then direct future research within the laboratory.

Similarity searches work by aligning the user's DNA sequence (unknown function) against sequences held in the database (known function). A typical example would be as follows:

User's sequence      - GCNTA
Database sequence  - GCATA

N = gap

A comparison between two sequences must take certain points into account. If a query sequence matches with a higher degree of similarity when a 'gap' is inserted, the algorithm will perform the insertion. Scoring penalties are employed to limit both the number of gaps and the length of each gap. Different algorithms work in varying ways; however, the essential point is that the candidate alignments are scored and a final statistical value is given, which indicates how similar the two sequences are. The hypothesis is that if the sequences are similar, there is a good chance that their functions will be too.
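The gap insertion and scoring penalties described above can be sketched with a small global alignment routine. This is a minimal illustration of the Needleman-Wunsch dynamic programming approach, not the heuristic that BLAST itself uses. The function name and the scoring values (match +1, mismatch -1, gap -2) are assumptions chosen for the example, and the gap is written as '-' rather than N.

```python
def global_align(query, target, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment with a linear gap penalty.

    Scoring values are illustrative assumptions, not taken from any
    particular search tool. Returns (score, aligned_query, aligned_target).
    """
    rows, cols = len(query) + 1, len(target) + 1

    def pair_score(i, j):
        # Score for aligning query[i-1] with target[j-1].
        return match if query[i - 1] == target[j - 1] else mismatch

    # score[i][j] = best score for aligning query[:i] with target[:j].
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + pair_score(i, j),  # align two bases
                score[i - 1][j] + gap,                   # gap in target
                score[i][j - 1] + gap,                   # gap in query
            )

    # Trace back from the bottom-right corner to recover one best alignment.
    aligned_q, aligned_t = [], []
    i, j = len(query), len(target)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair_score(i, j):
            aligned_q.append(query[i - 1])
            aligned_t.append(target[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned_q.append(query[i - 1])
            aligned_t.append("-")
            i -= 1
        else:
            aligned_q.append("-")
            aligned_t.append(target[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(aligned_q)), "".join(reversed(aligned_t))


best_score, aligned_query, aligned_target = global_align("GCTA", "GCATA")
print(aligned_query)   # GC-TA
print(aligned_target)  # GCATA
print(best_score)      # 2  (four matches, one gap)
```

With these values a single gap costs less than the mismatches it avoids, so the routine inserts it. Raising the gap penalty makes gaps rarer, which is the trade-off the scoring penalties control.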

Similarity searches can be carried out for amino acid sequences as well as DNA. In practice, both DNA and protein databases are searched and, due to the degeneracy of the genetic code, different results may be obtained. As more organisms are sequenced and the databases become more reliable, the results will become more accurate. The main role of genome databases was originally to distribute information from published or finished work; thus, the databases are open to the public. However, as the genome projects have unfolded, the databases have taken on the role of assisting ongoing experimental work and are now an integral part of daily research.
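The effect of the degeneracy of the genetic code on DNA versus protein searches can be shown with a short sketch. The two DNA sequences below are made up for illustration, and the codon table is deliberately partial (only the standard-code codons the example needs). The function names are assumptions.

```python
# Partial standard genetic code: only the codons used in this example.
CODON_TABLE = {
    "ATG": "M",
    "GCT": "A", "GCC": "A",
    "CTG": "L", "TTA": "L",
    "AAA": "K", "AAG": "K",
}


def translate(dna):
    """Translate a DNA string in frame 1, ignoring any trailing partial codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def percent_identity(a, b):
    """Percentage of identical positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


# Two hypothetical genes that use different codons for the same peptide.
seq_a = "ATGGCTCTGAAA"
seq_b = "ATGGCCTTAAAG"

dna_identity = percent_identity(seq_a, seq_b)
protein_identity = percent_identity(translate(seq_a), translate(seq_b))
print(translate(seq_a), translate(seq_b))  # MALK MALK
print(round(dna_identity, 1), protein_identity)  # 66.7 100.0
```

The DNA sequences differ at a third of their positions, yet the encoded peptides are identical, which is why a protein search can detect a relationship that a DNA search scores less strongly.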

Different types of database exist that can be used for the analysis and functional characterisation of sequences. For DNA, the main contributors are three collaborating sites: GenBank (US), EMBL (Europe) and DDBJ (Japan). Organism-specific databases are also common, e.g. the Campylobacter jejuni database found on the CampyDB website.

Secondary Searches


Secondary searches allow conserved motifs to be identified. The databases used for this type of search contain the 'fruits' of previous investigations and experiments (Kanehisa, 2001). Protein sequences are used for secondary search analysis. The different secondary search techniques can be categorised into three main areas:

Single motif methods - e.g. PROSITE. Based on algorithms that look for the single most conserved motif observed in a multiple alignment (Bairoch et al., 1997). Searching with a query sequence can identify which family of proteins the sequence belongs to and therefore assists the characterisation process.

Multiple motif methods - e.g. PRINTS. Based on the concept that most protein families contain more than one conserved motif; the algorithm therefore attempts to account for all of them.

Full domain methods - e.g. Pfam. Based on the theory that variable regions between conserved motifs also contain sequence information.
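As a rough illustration of the single motif approach, a PROSITE-style pattern can be turned into a regular expression and scanned along a protein sequence. The sketch below handles only a subset of the pattern syntax (single residues, [..] alternatives, {..} exclusions, x wildcards and repeat counts) and ignores the '<' and '>' anchors. The function names and the test sequence are assumptions; the pattern N-{P}-[ST]-{P} is the PROSITE N-glycosylation site pattern.

```python
import re


def prosite_to_regex(pattern):
    """Convert a simple PROSITE-style pattern into a Python regex string.

    Supported subset: A, [ABC], {ABC}, x, and repeats such as x(2,4) or A(3).
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # Split an element into its core and an optional (n) or (n,m) repeat.
        m = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = m.group(1), m.group(2)
        if core == "x":
            core = "."                         # any residue
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"     # any residue except these
        # [ABC] and single letters are already valid regex fragments.
        parts.append(core + ("{" + repeat + "}" if repeat else ""))
    return "".join(parts)


def find_motifs(sequence, pattern):
    """Return (0-based start, matched text) for every hit, overlaps included."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start(), m.group(1)) for m in regex.finditer(sequence)]


print(prosite_to_regex("N-{P}-[ST]-{P}"))            # N[^P][ST][^P]
print(find_motifs("MKNGSAANPTQ", "N-{P}-[ST]-{P}"))  # [(2, 'NGSA')]
```

The second asparagine in the test sequence is followed by a proline, so it is correctly rejected. Multiple motif and full domain methods use fingerprints and profile models respectively, which a single regular expression cannot capture.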

Secondary database searches supplement primary DNA/protein database similarity searching. This type of search allows further characterisation of a sequence and opens up future avenues of investigation. As there are many types of secondary database, each using a different model to deduce its results, relying on a single search would be naive. Multiple databases should therefore be searched so that the result is credible. Composite protein secondary databases can be used to speed up the process and are an integral part of the bioinformatics process.

Alignment Techniques


Once similarity and secondary searches have been carried out, multiple alignment tools are used. Popular programs include ClustalW and PileUp. Aligning several sequences together allows conserved sequences, also termed 'motifs', to be identified. Such conserved sequences point to possible structural and functional similarities and to possible evolutionary links between the sequences. This procedure may at first seem pointless if the initial primary and secondary searches have produced insignificant results. However, if a number of sequences are aligned simultaneously, important gene families (conserved domains) can be identified. Different methods of multiple alignment exist; the most common is known as the progressive method.
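Once a multiple alignment has been produced, spotting conserved regions is a matter of scanning its columns. The sketch below does not perform the alignment itself (that is the job of a tool such as ClustalW); it assumes the sequences are already aligned to equal length with '-' for gaps. The function names, the minimum block length and the three sequences are made up for illustration.

```python
def conserved_columns(aligned):
    """Return 0-based indices of columns where every sequence has the same
    residue and none has a gap."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must all have the same length")
    return [
        i for i, column in enumerate(zip(*aligned))
        if len(set(column)) == 1 and column[0] != "-"
    ]


def conserved_blocks(aligned, min_len=3):
    """Group consecutive conserved columns into blocks of at least min_len.

    Returns a list of (start index, conserved residues) tuples.
    """
    blocks, run = [], []
    # A trailing None acts as a sentinel so the final run is flushed.
    for i in conserved_columns(aligned) + [None]:
        if run and (i is None or i != run[-1] + 1):
            if len(run) >= min_len:
                blocks.append((run[0], aligned[0][run[0]:run[-1] + 1]))
            run = []
        if i is not None:
            run.append(i)
    return blocks


alignment = [
    "MKCPHCGA",
    "MRCPHCAA",
    "-KCPHCGS",
]
print(conserved_columns(alignment))  # [2, 3, 4, 5]
print(conserved_blocks(alignment))   # [(2, 'CPHC')]
```

Requiring perfect identity in every column is a strict choice; real tools usually score columns by degree of conservation so that conservative substitutions still count.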

Multiple alignment tools are an essential part of sequence analysis.

Functional Assignment


Genome sequencing has led to a vast amount of information being deposited in databases. One of the most important uses of these sequences is to try to assign function to them. Functional assignment is a large topic in bioinformatics, and many software tools can be applied to it. There are two main techniques that can be employed to aid the process of functional assignment. The first is to use sequence-based methods (as described above), which typically involve similarity searches. Alternatively, structure-based methods can be used. This technique involves using bioinformatics software that predicts 3D protein structure from sequence. Examples of this kind of software include SwissModel and Fugue. These tools allow users to input a sequence and then return a predicted structure. Users can view the returned structures in a molecular viewer such as RasMol. The hypothetical structures can then be compared to known structures held in a structure database such as PDBsum, with the aim of identifying structural homologues.

Thus, with more and more sequences being made available, there will be a movement towards the proteomics field.

Phylogenetic Analysis


Identification of similar sequences can also lead to phylogenetic analysis. Examples of common packages include PHYLIP and PAUP, which use a series of programs to calculate the best possible tree diagram. Sequences that share homology are likely to have diverged from a common ancestor through evolution. However, for a true representation of the phylogeny, every single result would have to be placed into the program, using multiple search parameters. Many phylogenetic algorithms are available; however, their reliability and practicality are in all cases dependent on the structure and size of the data (Hershkovitz et al., 1998). Thus, computational phylogenetic analysis can be a very useful supplementary tool for the characterisation of a sequence if used correctly.

Software Based Tools


An alternative approach is to use computer software designed for specific applications. For example, GeneSpring is used to analyse microarray results and makes it possible to identify up- and down-regulated genes within microarray experiments. ImaGene is additional software used in microarray studies; it allows the spots on the microarray to be quantified. Whether commercial or freeware, there are hundreds of software packages for specific applications.
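To give a feel for what identifying up- and down-regulated genes involves, the sketch below classifies genes by the log2 ratio of quantified spot intensities against a two-fold threshold. This is a generic illustration under stated assumptions, not the method used by GeneSpring or ImaGene; the function name, gene names, intensities and threshold are all hypothetical.

```python
import math


def classify_genes(intensities, fold=2.0):
    """Label each gene as 'up', 'down' or 'unchanged'.

    intensities maps a gene name to (test, control) spot intensities,
    both assumed to be positive and already background-corrected.
    Returns a dict of gene -> (log2 ratio rounded to 2 d.p., label).
    """
    threshold = math.log2(fold)
    result = {}
    for gene, (test, control) in intensities.items():
        log_ratio = math.log2(test / control)
        if log_ratio >= threshold:
            label = "up"
        elif log_ratio <= -threshold:
            label = "down"
        else:
            label = "unchanged"
        result[gene] = (round(log_ratio, 2), label)
    return result


# Hypothetical quantified spots: (test intensity, control intensity).
spots = {
    "geneA": (800.0, 200.0),
    "geneB": (150.0, 600.0),
    "geneC": (310.0, 300.0),
}
for gene, (log_ratio, label) in classify_genes(spots).items():
    print(gene, log_ratio, label)
```

A fixed fold-change cut-off ignores replicate variation, so dedicated packages normally combine it with normalisation and statistical testing.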

The software section of the Bioinformatics Facility pages discusses specific software available at LSHTM, particularly for comparative genomics studies.