For biologists, DNA microarrays present at once unprecedented opportunities and monumental challenges. In the opportunities column, microarrays produce genome-wide gene expression snapshots, facilitating a migration from gene-by-gene hypothesis-driven research to a relatively unbiased "discovery mode." The challenges broadly include data quality, analysis, and interpretation--that is, reaching an accurate and useful biological conclusion from the correlations identified within the data.
Progress on these three fronts will yield substantial dividends both in medicine and in understanding the complex networks of signal pathways that cannot be elucidated through studies of small gene groups in isolation, according to Almut Schulze, a signalling specialist at the Imperial Cancer Research Fund in London. "It would be possible [after solving the data analysis problems] to measure the expression profile of a cell in a certain situation and to be able to deduce information about the activation state of signalling pathways," Schulze says.
Until recently the problem of data noise seemed the most intractable. The challenge is to filter out erroneous data through regression and normalization, but this task can be impeded by uncertainty over whether some apparently outlying data values in an expression set really are invalid. In some cases it will be the outliers rather than the main body of the data that are of greatest interest. "We can ask why are we getting dirty data and what can we learn even with the dirty data," says Georges Grinstein, professor and director of the Center for Biomolecular and Medical Informatics at the University of Massachusetts, Lowell.
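To make the idea concrete, here is a minimal sketch of one way to normalize a set of expression values and flag, rather than delete, the apparent outliers, so that they can still be inspected later. It is an illustration only, not a method attributed to anyone quoted here. The function name `flag_outliers`, the median centering, the median-absolute-deviation score, and the cutoff of 3.5 are all assumptions chosen for the example.

```python
import statistics


def flag_outliers(log_ratios, threshold=3.5):
    """Center log ratios on their median and flag extreme values.

    Hypothetical helper for illustration. Values are flagged, not removed,
    because an outlier may be real biology rather than noise.
    """
    # Median centering is a simple, assumed form of normalization.
    med = statistics.median(log_ratios)
    centered = [x - med for x in log_ratios]

    # Median absolute deviation: a spread estimate that outliers barely move.
    mad = statistics.median([abs(c) for c in centered])
    if mad == 0:
        # No measurable spread, so nothing can be called extreme.
        return centered, [False] * len(centered)

    # Modified z-score; 0.6745 rescales the MAD to match a normal std. dev.
    scores = [0.6745 * c / mad for c in centered]
    flags = [abs(s) > threshold for s in scores]
    return centered, flags
```

Calling `flag_outliers([0.1, -0.2, 0.0, 0.3, -0.1, 4.0])` marks only the last value, leaving the decision about whether it is an artifact or a genuine signal to the researcher.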
By definition such outlying data points will be small in number, and virtually impossible to detect by generic statistical analysis methods that fail to consider the underlying physical and chemical behavior of the whole microarray process, according to Gustavo Stolovitzky, manager of functional genomics and systems biology at IBM.
The only hope of resolving such noise lies in separating the errors into their contributing components and determining how much each one contributes. Then, by various means such as repeating experiments and subtracting systematic errors, the genuine data representing expression levels can be derived.
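A simple sketch of that reasoning follows, under a deliberately crude assumption: each replicate array's log-scale reading for a gene is the true level plus a constant offset specific to that array plus random noise. The offset is estimated from the array's median and subtracted, and the random part is reduced by averaging across replicates. The function name `combine_replicates` and the additive model are assumptions made for illustration, not a description of how any group mentioned in this article works.

```python
import statistics


def combine_replicates(replicates):
    """Combine replicate arrays into per-gene means and standard deviations.

    `replicates` is a list of arrays; each array is a list of log-scale
    intensities in the same gene order. Needs at least two replicates.
    Hypothetical helper: assumes one additive offset per array.
    """
    # Systematic error: each array's median relative to the overall level.
    array_medians = [statistics.median(arr) for arr in replicates]
    grand_level = statistics.mean(array_medians)
    corrected = [
        [value - (med - grand_level) for value in arr]
        for arr, med in zip(replicates, array_medians)
    ]

    # Random error: averaging across replicates shrinks it; the standard
    # deviation reports how much scatter remains for each gene.
    n_genes = len(replicates[0])
    per_gene = [[arr[g] for arr in corrected] for g in range(n_genes)]
    means = [statistics.mean(values) for values in per_gene]
    sds = [statistics.stdev(values) for values in per_gene]
    return means, sds
```

A gene whose standard deviation stays large after the offsets are removed is one where the replicates genuinely disagree, which is the kind of case that calls for a closer look at the underlying process.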
NOISE CONTROL
Several steps in the microarray process contribute to the total data noise. First, because microarrays require a significant amount of RNA, typically 20 to 200 µg per test, (1)...