In recent years, concerns about bioterrorism and emerging diseases have changed public health surveillance requirements and stimulated large-scale development of new surveillance systems and outbreak detection algorithms. Not all outbreak detection algorithms perform equally well, however, and no single method meets all analytic requirements in all surveillance settings. Public health practice therefore needs evidence to inform the choice of an outbreak detection algorithm for a given surveillance application. Despite this critical need, there is little direct evidence on how the accuracy and timeliness of different surveillance methods compare for detecting different types of outbreaks. Efforts to evaluate surveillance algorithms have so far been sporadic. Although the number of published evaluation studies has increased in recent years, authors typically focus on evaluating individual methods within a specific surveillance context using a single data source. As a consequence, the results of these studies are hard to compare and to generalize from, and public health practitioners often must guess or rely on intuition when selecting outbreak detection algorithms for use in surveillance systems.
The lack of evidence to inform the selection of methods for surveillance systems is a direct consequence of the difficulty of evaluating outbreak detection. Evaluation studies require time, expertise, computational resources, and sufficient data. In addition, in the absence of a consistent model of how different methods are related, it is difficult to compare results across evaluation studies. Our present work addresses the lack of evidence in three ways:
1. To provide a framework for gathering evidence about the relative performance of outbreak detection algorithms, we have created an explicit representation of these algorithms in a manner that clarifies how methods differ;
2. To lower the barriers to conducting evaluation studies, we are developing a computational testbed that draws on real-world data sources and allows users to configure, run, and evaluate alternative methods on a large scale; and
3. To begin building an evidence base on the relative performance of different methods, we have used our testbed to conduct a series of evaluation studies.