Benchmarking challenge

In my view, the discussions regarding the transparency of epidemiological modelling ought to lead to a more constructive assessment of the reliability of such models. While I have not worked on models in this field, I do have experience with assessing data-analysis methods where similar concerns played a role. For that reason, I would like to contribute an idea in this direction.
In the present context I think it is useful, first of all, to consider two kinds of effort separately. Not necessarily in order of importance, the first aim is to determine what the reproduction number currently is. The second aim is to assess what the reproduction number would become under various scenarios. I think this is an important distinction.
To achieve the first aim, I would assume that ideally the methods used should be purely data-analytic: no assumptions should be necessary concerning the shape, integral, or other characteristics of the secondary infection rate as a function of time. Methods could differ, however, in how measurement errors (misidentification of infections, delays or bias in reporting, etc.) are accounted for.
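To make the first kind of effort concrete, here is a minimal sketch of one purely data-driven estimate, assuming nothing more than a series of daily reported case counts. The `rolling_growth_rate` function and the made-up numbers are purely illustrative; note that converting a growth rate into a reproduction number would already require assumptions of the kind I would prefer to avoid at this stage.

```python
import numpy as np

def rolling_growth_rate(cases, window=7):
    """Estimate the daily exponential growth rate from reported case counts.

    Purely data-driven: a log-linear fit over a sliding window, with no
    assumptions about the secondary-infection profile. Turning this into a
    reproduction number would require such assumptions, so that step is
    deliberately left out here.
    """
    cases = np.asarray(cases, dtype=float)
    log_cases = np.log(np.maximum(cases, 0.5))   # guard against zero counts
    rates = np.full(len(cases), np.nan)
    t = np.arange(window)
    for i in range(window, len(cases) + 1):
        slope, _ = np.polyfit(t, log_cases[i - window:i], 1)
        rates[i - 1] = slope                      # per-day growth rate
    return rates

# Illustrative data only: roughly 5% daily growth plus reporting noise.
rng = np.random.default_rng(0)
true_cases = 100 * np.exp(0.05 * np.arange(60))
reported = rng.poisson(true_cases)
print(rolling_growth_rate(reported)[-5:])
```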
To achieve the second aim, which is to extrapolate current trends even when circumstances in society change, some form of model is necessarily needed: one that proposes the mechanism(s) influencing infection rates and parameterizes those mechanisms. I would expect more variation between models that could be used for predictions/extrapolations than between methods aimed at data analysis.
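As a sketch of what I mean by a mechanism-based model, here is a toy SIR-type simulation in which a single scenario parameter (a hypothetical contact reduction from day 30 onward) changes the transmission rate. None of the parameter values are calibrated to any real epidemic; they exist only to show how a scenario enters the mechanism.

```python
import numpy as np

def sir_scenario(beta=0.3, gamma=0.1, contact_reduction=0.0, days=180,
                 n=1_000_000, i0=100):
    """Minimal discrete-time SIR run under a hypothetical intervention.

    `contact_reduction` scales down the transmission rate beta from day 30
    onward; all parameter values here are illustrative, not calibrated.
    """
    s, i, r = n - i0, i0, 0
    incidence = []
    for day in range(days):
        b = beta * (1 - contact_reduction) if day >= 30 else beta
        new_inf = b * s * i / n
        new_rec = gamma * i
        s, i, r = s - new_inf, i + new_inf - new_rec, r + new_rec
        incidence.append(new_inf)
    return np.array(incidence)

baseline = sir_scenario(contact_reduction=0.0)
distancing = sir_scenario(contact_reduction=0.5)
print(f"peak daily infections: baseline {baseline.max():.0f}, "
      f"with 50% contact reduction {distancing.max():.0f}")
```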
 
My proposal would be to run two parallel tracks, one for each aim, since the type of methods focussed on the first aim and the type of models focussed on the second might well be quite different. For instance, the data-analysis methods might not have any self-evident generalisation for predictive purposes.
In each track there should be at least two, but preferably several, alternative modelling efforts, each fully specified on a public forum (all equations, all assumptions, etc.), or at least shared among all groups involved in this trialling/benchmarking exercise.
When I was involved in a similar exercise in a different field (mid-90s), this was done mostly via email and FTP repositories. There are now rather more sophisticated ways to organise such collaborative efforts: an example that a colleague pointed out can be found here, but there are probably others too.
 
This type of assessment effort requires that one group, separate from the groups that design and run the models, set up one or several synthetic datasets covering scenarios of increasing difficulty. The crucial aspect is that at first only the synthetic data be made available to the modelling groups: none of the assumptions, biases, errors, or anything else that went into generating them. It is important that the modelling groups run their models blind.
From these data, the modellers would be required to try to determine the properties and parameters used to generate them, without knowing what the synthetic-data group put in by way of real effects, biases, and other forms of stochastic error pollution.
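As a rough illustration of what the synthetic-data group might do, the sketch below generates a "reported cases" series from a hidden ground truth and then pollutes it with under-reporting, a reporting delay, a day-of-week effect and counting noise. The particular effects, their sizes, and the `make_synthetic_dataset` function are all invented for the purpose of the example; only the released series would be handed to the modelling groups.

```python
import numpy as np

rng = np.random.default_rng(42)

def make_synthetic_dataset(days=120):
    """Generate a synthetic 'reported cases' series plus hidden ground truth.

    The true epidemic has a change in growth rate halfway through; the
    released series is then polluted with under-reporting, a constant
    reporting delay, a day-of-week effect and Poisson noise. Only
    `released` goes to the modelling groups; `truth` stays with the
    data group until the deadline has passed.
    """
    growth = np.where(np.arange(days) < 60, 0.06, -0.02)   # hidden change point
    true_infections = 50 * np.exp(np.cumsum(growth))

    ascertainment = 0.4                     # only 40% of infections reported
    delay = 5                               # constant reporting delay in days
    weekday_effect = np.tile([1.0, 1.0, 1.0, 1.0, 1.0, 0.6, 0.5], days)[:days]

    expected_reports = np.roll(true_infections * ascertainment, delay)
    expected_reports[:delay] = 0
    released = rng.poisson(expected_reports * weekday_effect)

    truth = {"growth_rates": growth, "ascertainment": ascertainment,
             "reporting_delay": delay}
    return released, truth

released, truth = make_synthetic_dataset()
```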
A certain amount of time is set aside for the modelling groups to run their models and describe the results, which are then published on the benchmark site. Only once that deadline has passed and the effort is closed would the inputs be published as well, by the group that generated the synthetic datasets. Then an analysis is made of the differences between the modelling results and the actual parameters used to generate the synthetic data. This comparative analysis would probably best be carried out by the group that generated the synthetic data, which is another reason for that group not to do any modelling itself.
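The comparison step could be as simple as the sketch below, in which the data group scores each group's blind estimates against the hidden ground truth. The root-mean-square error used here is only a placeholder for whatever comparison measures the group would actually settle on, and the two "submissions" are fabricated for illustration.

```python
import numpy as np

def score_submissions(true_growth, submissions):
    """Compare each group's blind estimates with the hidden ground truth.

    `submissions` maps a group name to its estimated daily growth-rate
    series; the score here is just the root-mean-square error, standing
    in for whatever comparison the data group would actually choose.
    """
    scores = {}
    for group, estimate in submissions.items():
        err = np.asarray(estimate) - np.asarray(true_growth)
        scores[group] = float(np.sqrt(np.nanmean(err ** 2)))
    return scores

# Hypothetical submissions from two modelling groups.
days = 120
true_growth = np.where(np.arange(days) < 60, 0.06, -0.02)
submissions = {
    "group_A": true_growth + np.random.default_rng(1).normal(0, 0.01, days),
    "group_B": true_growth + np.random.default_rng(2).normal(0, 0.03, days),
}
print(score_submissions(true_growth, submissions))
```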
 
Such an effort would not only reveal much more clearly what the internal uncertainty of the various models is, i.e. how sensitive the models are to data errors; it would also reveal the external uncertainty due to the different modelling frameworks. Even without knowing fully what has been put into the data, this is valuable in its own right and is a more constructive way of arriving at objective criteria by which to judge the quality of models. Ideally one would discover what level of model complexity is actually necessary, which is one step towards determining what type of model is sufficient as a basis for policy. Even if it turns out in the end that the combination of internal and external uncertainty is so large that the models give no clear guidance, that would be an important lesson to learn.
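To show what I mean by internal versus external uncertainty, a simple decomposition could look like the sketch below: the within-model spread across perturbed replicates of the same synthetic dataset versus the spread between the models' mean estimates. The numbers are invented purely to show the calculation.

```python
import numpy as np

# Hypothetical table of estimates: rows = models, columns = perturbed
# replicates of the same synthetic dataset. Internal uncertainty is the
# average within-model spread across replicates; external uncertainty is
# the spread of the models' mean estimates around each other.
estimates = np.array([
    [1.05, 1.10, 1.02, 1.08],   # model 1
    [1.20, 1.18, 1.25, 1.22],   # model 2
    [0.95, 1.00, 0.92, 0.98],   # model 3
])

internal = estimates.var(axis=1, ddof=1).mean()
external = estimates.mean(axis=1).var(ddof=1)
print(f"internal variance {internal:.4f}, external variance {external:.4f}")
```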
 
Ideally, this would be an international effort, involving both groups that already have an international reputation in this field, such as JHU or Imperial College, and newer groups with novel ideas or approaches.