
Comparative Genomics Software Overview


- Overview
- Image Analysis
- Normalization
- Quality Control of Data
- Filtering
- Clustering
- Summary



Microarray experiments can be broken down into two main categories:

- RNA expression studies
- Comparative Genomics

Comparative studies allow genomic comparisons to be made between different strains. RNA expression studies allow comparisons of RNA expression levels between different strains. Combining clustering analysis of the data with the evolutionary and environmental history of the strains can open up new experimental avenues.

The type of experiment planned determines the starting material, e.g. DNA or RNA, which must be quantified before hybridization. The hybridization is then performed and the slide scanned. Different experiment types need different samples: for a time course analysis, you would isolate samples at each experimental time point, whereas for comparative genomics you would simply hybridize a test strain against a control strain and identify potentially 'absent or highly divergent' genes.

Beyond the different samples that each experimental aim requires, what distinguishes the microarray experiment types are the analysis techniques involved. Many of the analysis procedures applied to comparative genomics studies also apply to time course and expression studies. However, bear in mind that this overview is specifically intended for comparative genomics.

Microarray analysis begins once the slides have been scanned. Laboratory methods for microarrays have been around for a decade or so and have been refined considerably over the years, allowing users to obtain results with a high level of consistency. Analysis techniques are not so clear-cut. There are hundreds of different analysis software packages and multiple techniques for carrying out the analysis. It is often the case that the final analysis steps take longer than the experiment itself!

A typical analysis path is shown in Figure 1.

Comparative Genomics Pathway

Figure 1. Diagram illustrating path for comparative genomics analysis.

Image Analysis


The first step of any analysis procedure is to quantify the images using image analysis software. Image analysis assigns a signal intensity value to each spot (a single gene) and takes the background intensity around the spot into account. It is this final, background-adjusted value that is used in the later ratio calculations.

The school uses two different software products to quantify the spots: ImaGene (BioDiscovery) and BlueFuse (BlueGnome), shown in Figures 2 and 3. Each one is described in more detail in the software section. Image analysis software takes as input the .tif image files generated by the scanning software. Its algorithms calculate a signal intensity value for each spot, taking into account the shape of the spot and the light intensity of each pixel. The software generates a .txt file that can be used in further analysis.

ImaGene Interface

Figure 2. ImaGene (BioDiscovery) interface.

BlueFuse Interface

Figure 3. BlueFuse (BlueGnome) interface.
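The background adjustment and ratio calculation described above can be illustrated with a minimal Python sketch. It is not the algorithm used by ImaGene or BlueFuse. The function name, the argument names and the `floor` value are assumptions made for illustration only.

```python
import math

def spot_log_ratio(test_signal, test_background,
                   control_signal, control_background, floor=1.0):
    """Return log2(test / control) for one spot after background subtraction.

    `floor` is a hypothetical lower limit that keeps the corrected
    intensities positive so that the logarithm is always defined.
    """
    test = max(test_signal - test_background, floor)
    control = max(control_signal - control_background, floor)
    return math.log2(test / control)

# Hypothetical spot: the test channel is half as bright as the control channel.
ratio = spot_log_ratio(1100.0, 100.0, 2100.0, 100.0)
```

A strongly negative value means the test strain gave much less signal than the control strain for that spot, which is the pattern of interest when looking for 'absent or highly divergent' genes.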



Normalization


Once the arrays have been scanned and quantification has been carried out, the next step is to normalize the data. A good definition of normalization is given by GeneSpring (Agilent Technologies), "Experiment normalizations are used to standardize your microarray data to enable differentiation between real (biological) variations in gene expression levels and variations due to the measurement process. Normalizing also scales your data so that you can compare relative gene expression levels."

Selecting a normalization procedure can be difficult. The underlying mathematics is often complex, so the choice can be an arduous task for biologists. A commonly used method is 'Lowess' (also known as 'Loess') normalization. There are many ways of normalizing data, and methods can be used in combination. We will not attempt to select the optimum normalization for you, as this depends on many factors such as the type of experiment being carried out and the type of data set being used. Often several methods must be tried to see which procedure gives the greatest standardization, that is, which normalization procedure brings the quantified microarray data closest to a linear distribution. With some normalization procedures it will be evident that the data still have unusual kinks or curves. Normalization aims to remove these and produce as linear a distribution as possible.

Normalization can be carried out within GeneSpring and with various freeware packages such as ArrayNorm (Bioinformatics Institute Graz). Both are shown in Figures 4 and 5.

GeneSpring normalization options

Figure 4. GeneSpring (Agilent Technologies/Silicon Genetics) Normalization options.

ArrayNorm normalization interface

Figure 5. ArrayNorm Normalization from Bioinformatics Graz.
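To make the idea of removing 'kinks or curves' concrete, here is a minimal Python sketch of a Lowess-style normalization. It is a simplified illustration, not the implementation found in GeneSpring or ArrayNorm: it fits a locally weighted straight line to the log ratio (M) as a function of the average log intensity (A) and subtracts that trend. It has no robustness iterations, and the function names and the `frac` default are assumptions.

```python
import math

def lowess_fit(x, y, frac=0.5):
    """Locally weighted linear fit of y on x, evaluated at every x value."""
    n = len(x)
    k = min(n, max(2, math.ceil(frac * n)))  # points in each local window
    fitted = []
    for xi in x:
        dist = [abs(xj - xi) for xj in x]
        h = sorted(dist)[k - 1] or 1e-12  # window half-width
        # Tricube weights: close points count most, points beyond h count zero.
        w = [(1 - min(d / h, 1.0) ** 3) ** 3 for d in dist]
        sw = sum(w)
        mx = sum(wi * xj for wi, xj in zip(w, x)) / sw
        my = sum(wi * yj for wi, yj in zip(w, y)) / sw
        sxx = sum(wi * (xj - mx) ** 2 for wi, xj in zip(w, x))
        sxy = sum(wi * (xj - mx) * (yj - my) for wi, xj, yj in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (xi - mx))
    return fitted

def lowess_normalize(test, control, frac=0.5):
    """Return log2 ratios with the intensity-dependent trend removed."""
    m = [math.log2(t / c) for t, c in zip(test, control)]        # log ratio
    a = [0.5 * math.log2(t * c) for t, c in zip(test, control)]  # mean log intensity
    trend = lowess_fit(a, m, frac)
    return [mi - ti for mi, ti in zip(m, trend)]
```

After this step the log ratios are centred on zero across the intensity range, so a fixed cut-off means the same thing for dim and bright spots. In practice an established implementation should be preferred over this sketch.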

Quality Control of Data


Various quality control methods are used to sift out genes with poor intensity values. In particular, genes that were flagged by the quantification software can be excluded from any further analysis. GeneSpring allows genes flagged in ImaGene to be identified. Compatibility problems can arise between different versions of the software. For example, versions of GeneSpring before 6.2 could not identify flags raised by ImaGene. Likewise, problems can arise from data generated using different versions of the same software. This has been known to happen when attempting to build large datasets from data analysed over a wide range of software versions.

Additional quality control can be achieved by filtering on intensity. This ensures that only genes above a certain threshold are considered in further calculations. Many of these quality control methods can be carried out within GeneSpring and also with stand-alone applications such as the Bioconductor packages, which are written in the R language. Both are shown in Figures 6 and 7.

Bioconductor Packages

Figure 6. Bioconductor packages used for quality control filtering.

GeneSpring Filtering

Figure 7. GeneSpring (Agilent Technologies/Silicon Genetics) filtering options.
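The two filters described above, on flags and on intensity, can be sketched in a few lines of Python. The field names, the convention that a flag of 0 means 'not flagged', and the threshold value are all assumptions for illustration. They do not reflect the actual ImaGene output format or the GeneSpring settings. The threshold is applied to the control channel here, on the assumption that a low test signal may reflect a genuinely absent gene rather than a bad spot.

```python
def passes_quality_control(spot, min_control_intensity=200.0):
    """Keep a spot only if it is unflagged and its control signal is strong enough."""
    return spot["flag"] == 0 and spot["control"] >= min_control_intensity

# Hypothetical spots with made-up gene names and intensities.
spots = [
    {"gene": "geneA", "flag": 0, "test": 1500.0, "control": 1600.0},
    {"gene": "geneB", "flag": 1, "test": 900.0, "control": 1000.0},  # flagged
    {"gene": "geneC", "flag": 0, "test": 40.0, "control": 90.0},     # too dim
    {"gene": "geneD", "flag": 0, "test": 30.0, "control": 1200.0},   # kept
]

kept = [spot["gene"] for spot in spots if passes_quality_control(spot)]
```

Note that geneD is kept even though its test signal is low: that is exactly the kind of spot a comparative genomics study wants to examine later.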



Filtering


In comparative genomics, the central aim is to sort genes into those that are present (core) and those that are 'absent or highly divergent' (variable). This is, in essence, the whole point of this type of experiment. However, what qualifies a gene as 'absent or highly divergent' is under constant scrutiny, and various cut-off selection procedures exist. The most basic is filtering on fold change, which sets arbitrary values either side of the normalized data, e.g. 2-fold up and 2-fold down. Other techniques base the cut-offs on standard deviation. These tools are found in the GeneSpring analysis platform and in many stand-alone applications.
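The fold-change and standard-deviation cut-offs can be illustrated with a short Python sketch that works on normalized log2 ratios (test over control). The function names and the labels 'low', 'high' and 'unchanged' are assumptions made for this example. Genes on the 'low' side are the candidates for 'absent or highly divergent'.

```python
import math
import statistics

def fold_change_cutoffs(fold=2.0):
    """Fixed cut-offs either side of zero, e.g. 2-fold gives (-1.0, 1.0) in log2."""
    limit = math.log2(fold)
    return -limit, limit

def sd_cutoffs(log_ratios, n_sd=2.0):
    """Cut-offs placed n_sd sample standard deviations either side of the mean."""
    mean = statistics.mean(log_ratios)
    sd = statistics.stdev(log_ratios)
    return mean - n_sd * sd, mean + n_sd * sd

def classify(log_ratio, lower, upper):
    """Label one gene relative to a pair of cut-offs."""
    if log_ratio <= lower:
        return "low"       # candidate 'absent or highly divergent' gene
    if log_ratio >= upper:
        return "high"
    return "unchanged"     # treated as present

# Hypothetical normalized log2 ratios for three genes.
lower, upper = fold_change_cutoffs(2.0)
calls = {gene: classify(ratio, lower, upper)
         for gene, ratio in {"geneA": 0.1, "geneB": -2.3, "geneC": 1.4}.items()}
```

Both approaches use a single pair of cut-offs for every gene on the array, which is why they are regarded as the basic options.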

One alternative for generating cut-offs that has recently gained popularity is the program GACK (Kim et al., 2002). GACK calculates dynamic cut-off values and can improve the identification of 'absent or highly divergent' genes compared with many other techniques. GACK is shown in Figure 8.

GACK Interface

Figure 8. GACK interface.



Clustering


Comparative genomics analysis tends to conclude with some kind of clustering work. The aim of clustering is to group strains according to their genomic content. A typical example would be to cluster multiple strains twice: once using the whole-strain comparison and once using only the 'absent or highly divergent' genes. Clustering, backed up with background information about the strains and with evolutionary data, can lead to important hypotheses being deduced or confirmed.

Typical clustering analysis can be carried out within GeneSpring, shown in Figure 9. Examples include generating condition trees and K-means clustering, with multiple similarity measures available. As with normalization, the mathematics behind this technique can become involved. Multiple techniques should be tried to check that the same general trend is observed. For comparative genomics work, Spearman's rank correlation coefficient has often been used as the similarity measure.

GeneSpring Clustering

Figure 9. GeneSpring cluster output.
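As an illustration of clustering strains with Spearman's rank correlation, the Python sketch below computes the coefficient and builds a simple tree by average-linkage clustering on the distance 1 minus the correlation. This is not GeneSpring's condition tree algorithm. The linkage method, the function names and the strain profiles are assumptions chosen for the example.

```python
import math
from itertools import combinations

def ranks(values):
    """Rank values from 1 upwards, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return result

def spearman(a, b):
    """Spearman's rank correlation: the Pearson correlation of the ranks."""
    ra, rb = ranks(a), ranks(b)
    ma, mb = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra)
    vb = sum((y - mb) ** 2 for y in rb)
    return cov / math.sqrt(va * vb) if va > 0 and vb > 0 else 0.0

def strain_tree(profiles):
    """Average-linkage clustering of strains; returns nested tuples of names."""
    dist = {frozenset(pair): 1.0 - spearman(profiles[pair[0]], profiles[pair[1]])
            for pair in combinations(profiles, 2)}
    clusters = [(name, [name]) for name in profiles]  # (subtree, member names)
    while len(clusters) > 1:
        best = None
        for (i, (_, mi)), (j, (_, mj)) in combinations(enumerate(clusters), 2):
            d = sum(dist[frozenset((p, q))] for p in mi for q in mj)
            d /= len(mi) * len(mj)
            if best is None or d < best[0]:
                best = (d, i, j)
        _, i, j = best
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]

# Hypothetical log ratio profiles for four strains over five genes.
profiles = {
    "strainA": [1, 2, 3, 4, 5], "strainB": [1, 2, 3, 5, 4],
    "strainC": [5, 4, 3, 2, 1], "strainD": [5, 4, 3, 1, 2],
}
tree = strain_tree(profiles)
```

With these made-up profiles, strainA groups with strainB and strainC groups with strainD. Repeating the clustering with a different similarity measure or linkage method is one way to check that the same general trend is observed.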

Many additional clustering tools exist. A popular approach at the moment is to base the clustering on Bayesian statistics. MrBayes is a program that builds trees using Bayesian inference and arguably produces a more detailed tree. Generally, clustering techniques work better when more background information is available about the strains analysed. As the amount of information increases, however, Bayesian methods can deal with it far better and can generate far more accurate results than other clustering techniques.



Summary


Comparative genomics is an important subsection of microarray analysis. The path outlined here is by no means set in stone and will no doubt change as further advances are made in the field. However, this overview does give a current route for analysing comparative genomics data. Once the analysis is complete, it will hopefully lead to some interesting and useful observations about the strains studied.