“Clouds are not spheres, mountains are not cones, coastlines are not circles, and bark is not smooth, nor does lightning travel in a straight line.” (Benoit Mandelbrot, November 20, 1924–October 14, 2010)

Photo by Glenn Carstens-Peters on Unsplash

The map is not the territory

There is a lot of uncertainty in international cooperation projects. If the tasks aid agencies try to accomplish were easy, cheap or readily replicated, they would have gotten done a long time ago. But they’re not, and evaluation should reliably show how much projects work or do not (accountability) and how and why they do or do not (learning). Agencies looking for evaluators often say they require rigorous design and analysis.

It’ll be unpopular to say this, but I think funders’ focus on evaluation “rigor” is misguided. Or, at least, how it is interpreted and implemented is misguided. Scientific rigor works well when we (evaluators, or the agency, or the implementer) can control conditions. That is vanishingly rare in international cooperation projects.

Rigor is still construed largely in quantitative terms. USAID has recently brought on board a Chief Economist, whose bona fides in randomized controlled trials (RCTs) are indisputable: he was on the team that launched Innovations for Poverty Action. That organization has been at the forefront of the push for more RCTs and meta-analyses in development research.

But RCTs still make up a tiny fraction of development evaluations. This is not because commissioners of evaluations lack knowledge or because RCTs are too much work. It’s because RCTs have questionable applicability in nearly all realms of international cooperation programs.

Impact evaluation in the real world

Author’s photo, Monrovia, Liberia, 2011

I’ve seen how impact evaluations play out in practice for the agency, and it’s disheartening. In my experience, even when the designs work well (and they often do not), the results are unimpressive for the country officers who patiently await them. Sometimes they don’t know how to interpret the results; often, the amount of quantifiable “impact” is less than they were hoping, or even negative. This isn’t necessarily because the project didn’t have any impact – it’s because in the interim there were elections, peace negotiations started or stalled, the areas had flooding and mudslides… and the impact evaluation design can’t carve that out of the results to say what is attributable to the project, and what is attributable to external factors.

That’s true even with a control group. Often the control or comparison groups are affected in different ways by national or regional events and trends. They may also be selected by what look like rigorous methods. One way is by creating an index of key factors and building that index for each site, so that units with similar index scores are paired into treatment and control. But it’s like looking at a map and thinking it’s an accurate rendition of the territory it shows: cooperation project sites are complex and dynamic. Choosing seven or ten or twenty-five variables and combining them, however thoughtfully done, looks good in the methodology section but doesn’t always match sites well.
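
As a rough illustration only – with invented villages, factors, and an ad hoc composite index, not any study’s actual method – the index-and-pair approach often boils down to something like this:

```python
# A hypothetical sketch of index-based site matching. Village names, factors, and the
# composite index are invented for illustration; real designs typically use propensity
# scores or more principled distance metrics.
import pandas as pd

sites = pd.DataFrame({
    "village": ["A", "B", "C", "D", "E", "F"],
    "population": [1200, 1100, 3400, 3600, 800, 950],
    "distance_to_market_km": [5, 6, 22, 25, 40, 38],
    "conflict_incidents": [2, 3, 10, 9, 1, 2],
})

# Build a composite index: standardize each factor, then average the z-scores.
factors = ["population", "distance_to_market_km", "conflict_incidents"]
z_scores = (sites[factors] - sites[factors].mean()) / sites[factors].std()
sites["index"] = z_scores.mean(axis=1)

# Pair adjacent sites after sorting by index score; within each pair, one village
# would receive the program and the other would serve as its comparison.
ranked = sites.sort_values("index").reset_index(drop=True)
pairs = [(ranked.loc[i, "village"], ranked.loc[i + 1, "village"])
         for i in range(0, len(ranked) - 1, 2)]
print(pairs)  # three candidate pairs with similar index scores
```

The arithmetic is tidy; the question is whether three (or ten, or twenty-five) quantified factors capture what actually makes two villages comparable.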

“The textbook theory of expected utility does have one considerable virtue, which is that it’s easy to explain in textbooks.” Peter Coy, New York Times, 24 October 2022

An example of matching

Photo by Fallon Michael on Unsplash

Here’s an example from my experience. In a country burdened with internal armed conflict among multiple actors, a team of university professors designed just such a study to measure outcomes of a program to help residents in conflict zones. After USAID had paid for a considerable amount of time and data to create the index, the professors shared the list of matched villages. In each pair, one village received the program benefits, and one didn’t but had important similarities that made them a “match.”

Except that the implementing partner staff looked at the matches and said that these villages didn’t match – that, on key variables, they were very different. The differences included the kinds and amounts of income the villagers survived on, how much the armed conflict affected them, the villages’ access to markets for their produce, the role of religion in community life, and other non-trivial variables. Non-trivial, and not easily quantifiable. Comparisons between poorly matched treatment and control villages could skew the results.

But the professors said the model worked, and that the quasi-experimental design and sample size would ensure the results would be intelligible and useful. The argument was that any big societal/economic/political changes would affect both treated and non-treated sites, so any difference in outcomes could be attributed to the project and the project alone. The agency trusted this assessment and ran the baseline study, then signed off on the professors’ report.
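
In other words, the design leaned on standard difference-in-differences logic. Here is a bare-bones sketch with made-up outcome numbers (nothing from the actual study), just to show how much weight the “parallel trends” assumption carries:

```python
# Difference-in-differences with invented numbers (not from the study described above).
# The whole design rests on the assumption that shocks - elections, floods, shifts in
# the armed conflict - move treated and comparison villages by the same amount.
treated_baseline, treated_endline = 42.0, 55.0        # hypothetical outcome means
comparison_baseline, comparison_endline = 40.0, 48.0  # hypothetical outcome means

treated_change = treated_endline - treated_baseline           # 13.0
comparison_change = comparison_endline - comparison_baseline  # 8.0

# The "impact" attributed to the project is whatever change the treated villages
# show beyond the change in their comparison villages.
did_estimate = treated_change - comparison_change
print(did_estimate)  # 5.0, meaningful only if the parallel-trends assumption holds
```

If the shocks do not hit both groups equally – different armed actors, different floods, different markets – that single subtraction quietly absorbs the difference and calls it “impact.”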

Then the funders did something even weirder, but unfortunately also quite common: they hired someone else to do the midline and endline data collection. The new team came in to find this unresolved question mark in the design, and a short timeline to collect and report on midline data – including (the agency dearly hoped) impressive impact figures!

Unfortunately, no impact emerged in the data at midline, despite comprehensive theories of change, dedicated local professionals, and years of work. Whether we reviewed the data with a microscope or a telescope, there was just nothing there – no evidence that the treatment sites had any better results at all than the control sites.

I suspect the culprit was differences in how armed conflict had played out in treated and non-treated villages, with different armed actors using different tactics, and/or differences in how those villages got connected to markets or not – whether by the project, of their own accord, by government action, or by market forces. Matching didn’t work. It masked heterogeneity instead of resolving it – there were too many hidden but key differences between treated and non-treated sites, and the comparison collapsed under them.

Then again, maybe the matching was perfect and the intervention simply wasn’t any good. We’ll never know, because the study wasn’t designed to find out the how and why of successes and failures.

But Keri, how can you be against RCTs?

That’s just not true! I think they’re fascinating to work on and tremendously useful. Done well, they can make the case for cost-effectiveness and replication. Insecticide-treated bed nets are a famous example, as are cash transfer programs. Another (former) senior USAID staffer has gushed that giving cash could work better than conventional development programs, and worked doggedly inside the Agency to expand the opportunities for USAID to test that theory. On its face, cash benchmarking sounds great – and avoids high project overhead costs.

But the majority of USAID’s work is not in providing bed nets, cash transfers, nutrition programs, and the like. USAID works more and more on systems, as discussed in a previous post [LINK HERE], and those systems require multi-faceted approaches to change. Other funders are also working on increasing green energy financing, helping governments renovate their infrastructure, modernizing IT systems, fighting corruption, improving educational opportunities in rural areas, and making judicial systems fairer for all. These are not environments, topics, outcomes, or levels of complexity that RCTs or cash benchmarking can hope to measure.

I’m not against RCTs. I just think it’s a fair amount of hubris to think you can design scientific rigor, comparability, or clear cost-benefit relationships for the great bulk of cooperation projects, just because the funder wants you to.

Chess has an 8 × 8 board, two players, and six types of pieces. Even so, it has so many permutations that even the best players are hard-pressed to choose wisely from among their options. Why do we expect to reduce international cooperation projects – with unlevel playing fields, draconian resource levels, hierarchical social relations, and wildly varying stakeholders and stakes – to a “difference-in-differences” calculation?

What else is there, apart from rigor?

“…to live with the untrammeled openendedness of such fertile not-knowing is no easy task in a world where certitudes are hoarded as the bargaining chips for status and achievement.” Maria Popova in The Marginalian, paraphrasing Wislawa Szymborska

“…a world bedeviled, as Rebecca Solnit memorably put it, by ‘a desire to make certain what is uncertain, to know what is unknowable, to turn the flight across the sky into the roast upon the plate’.” Ibid

There are no formulas that will tell us clearly, firmly, finally and flatly what works and what doesn’t. We have to wade into the uncertainty and rest comfortably with not knowing, and with failure. These are not characteristics that international aid agencies, or most regular people, particularly enjoy.

Aid agencies can be their own worst enemies, however. After painstakingly reviewing multiple evaluation proposals, references, and writing samples, and carefully selecting an evaluation team, commissioners of evaluations often turn around and mistrust the very evaluators they’ve contracted. The aid agency staffers want more rigor in the design and plenty of detailed annexes on everything that will be done and analyzed during the course of an evaluation. They put a dozen or more evaluation questions into lingo-heavy, multiple-clause language to ensure they get what they want.

This may well be warranted at times – I’ve seen a bit of shoddy evaluation work here and there in the past. But honestly, not that often.

For the most part, international program evaluators who make it through these selection processes have good experience in the thicket of ambiguity. We are accustomed to climbing steep learning curves fast, on tightly limited budgets, to understand the intentions and results of multi-faceted programming that addresses intractable problems in unpredictable conditions. We are eager to answer your evaluation questions thoroughly, which is why we respectfully recommend fewer, better, and more realistic questions – and then end up turning ourselves into sleep-deprived pretzels to answer all you wanted and more.

What they, and we, don’t have is a strictly scientific, unassailable, precision process to capture, explain, and rate complex human endeavors. Shame on us as evaluators if that’s what we promise in our proposals. But instead of ruing the human imperfection of an evaluation team, notice the way we look for the moment when our qualitative data stops showing new things, and how we make meaning of what we find among those diverse voices. Instead of showing you all the superficial ways a project might be replicable, notice how evaluators apply critical thinking capacities to different situations differently, highlighting what’s unique to the place and people where the project is active. Instead of looking for ways to showcase when implementers exceed targets for numbers of people trained, celebrate those projects that are willing to fail and learn from it, undaunted.