

Illustrated headshots of Amanda Craig Deckard and Kathleen Sullivan.

Generative AI presents a unique challenge and opportunity to reexamine governance practices for the responsible development, deployment, and use of AI. To advance thinking in this space, Microsoft has tapped into the experience and knowledge of experts across domains—from genome editing to cybersecurity—to investigate the role of testing and evaluation as a governance tool. AI Testing and Evaluation: Learnings from Science and Industry, hosted by Microsoft Research’s Kathleen Sullivan, explores what the technology industry and policymakers can learn from these fields and how that might help shape the course of AI development.

In the series finale, Amanda Craig Deckard, senior director of public policy in Microsoft’s Office of Responsible AI, rejoins Sullivan to discuss what Microsoft has learned about testing as a governance tool and what’s next for the company’s work in the AI governance space. The pair explores high-level takeaways (i.e., testing is important and challenging!); the roles of rigor, standardization, and interpretability in making testing a reliable governance tool; and the potential for public-private partnerships to help advance not only model-level evaluation but deployment-level evaluation, too.


Learn more:

Learning from other domains to advance AI evaluation and testing 
Microsoft Research Blog | June 2025 

Responsible AI: Ethical policies and practices | Microsoft AI 

Transcript

[MUSIC] 

KATHLEEN SULLIVAN: Welcome to AI Testing and Evaluation: Learnings from Science and Industry. I’m your host, Kathleen Sullivan. 

As generative AI continues to advance, Microsoft has gathered a range of experts—from genome editing to cybersecurity—to share how their fields approach evaluation and risk assessment. Our goal is to learn from their successes and their stumbles to move the science and practice of AI testing forward. In this series, we’ll explore how these insights might help guide the future of AI development, deployment, and responsible use. 

[MUSIC ENDS] 

For our final episode of the series, I’m thrilled to once again be joined by Amanda Craig Deckard, senior director of public policy in Microsoft’s Office of Responsible AI. 

Amanda, welcome back to the podcast!


AMANDA CRAIG DECKARD: Thank you so much.

SULLIVAN: In our intro episode, you really helped set the stage for this series. And it’s been great, because since then, we’ve had the pleasure of speaking with governance experts about genome editing, pharma, medical devices, cybersecurity, and we’ve also gotten to spend some time with our own Microsoft responsible AI leaders and hear reflections from them.

And here’s what stuck with me, and I’d love to hear from you on this, as well: testing builds trust; context is shaping risk; and every field is really thinking about striking its own balance between pre-deployment testing and post-deployment monitoring.

So drawing on what you’ve learned from the workshop and the case studies, what headline insights do you think matter the most for AI governance?

CRAIG DECKARD: It’s been really interesting to learn from all of these different domains, and there are, you know, lots of really interesting takeaways. 

I think a starting point for me is actually pretty similar to where you landed, which is just that testing is really important for trust, and it’s also really hard [LAUGHS] to figure out exactly, you know, how to get it right, how to make sure that you’re addressing risks, that you’re not constraining innovation, that you are recognizing that a lot of the industry that’s impacted is really different. You have small organizations, you have large organizations, and you want to enable that opportunity that is enabled by the technology across the board. 

And so it’s just difficult to, kind of, get all of these dynamics right, especially when, you know, I think we heard from other domains, testing is not some, sort of, like, oh, simple thing, right. There’s not this linear path from, like, A to B where you just test the one thing and you’re done. 

SULLIVAN: Right.

CRAIG DECKARD: It’s complex, right. Testing is multistage. There’s a lot of testing by different actors. There are a lot of different purposes for which you might test. As I think it was Dan Carpenter who talked about, it’s not just about testing for safety. It’s also about testing for efficacy and building confidence in the right dosage for pharmaceuticals, for example. And that’s across the board for all of these domains, right. That you’re really thinking about the performance of the technology. You’re thinking about safety. You’re trying to also calibrate for efficiency.

And so those tradeoffs, every expert shared that navigating those is really challenging. And also that there were real impacts to early choices in the, sort of, governance of risk in these different domains and the development of the testing, sort of, expectations, and that in some cases, this had been difficult to reverse, which also just layers on that complexity and that difficulty in a different way. So that’s the super high-level takeaway. But maybe if I could just quickly distill, like, three takeaways that I think really are applicable to AI in a bit more of a granular way.

You know, one is about, how is the testing exactly used? For what purpose? And the second is what emphasis there is on this pre- versus post-deployment testing and monitoring. And then the third is how rigid versus adaptive the, sort of, testing regimes or frameworks are in these different domains. 

So on the first—how is testing used?—so is testing something that impacts market entry, for example? Or is it something that might be used more for informing how risk is evolving in the domain and how broader risk management strategies might need to be applied? We have examples, like the pharmaceutical and medical device industries whose experts you spoke with, where testing really is, you know, a pre-deployment requirement for market entry. So that’s one question.

The second is this emphasis on pre- versus post-deployment testing and monitoring, and we really did see across domains that in many cases, there is a desire for both pre- and post-deployment, sort of, testing and monitoring, but also that, sort of, naturally in these different domains, a degree of emphasis on one or the other had evolved and that had a real impact on governance and tradeoffs. 

And the third is just how rigid versus adaptive these testing and evaluation regimes or frameworks are in these different domains. We saw, you know, in some domains, the testing requirements were more rigid as you might expect in more of the pharmaceutical or medical devices industries, for example. And in other domains, there was this more, sort of, adaptive approach to how testing might get used. So, for example, in the case of our other general-purpose technologies, you know, you spoke with Alta Charo on genome editing, and in our case studies, we also explored this in the context of nanotechnology. In those general-purpose technology domains, there is more emphasis on downstream or application-context testing that is more, sort of, adaptive to the use scenario of the technology and, you know, having that work in conjunction with testing more at the, kind of, level of the technology itself.

SULLIVAN: I want to double-click on a number of the things we just talked about. But actually, before we go too much deeper, a question on if there’s anything that really surprised you or challenged maybe some of your own assumptions in this space from some of the discussions that we had over the series. 

CRAIG DECKARD: Yeah. You know, I know I’ve already just mentioned this pre- versus post-deployment testing and monitoring issue, but it was something that was very interesting to me and in some ways surprised me or made me just realize something that I hadn’t fully connected before, about how these, sort of, regimes might evolve in different contexts and why. And in part, I couldn’t help but bring the context I have from cybersecurity policy into this, kind of, processing of what we learned and reflection because there was a real contrast for me between the pharmaceutical industry and the cybersecurity domain when I think about the emphasis on pre- versus post-deployment monitoring.

And on the one hand, we have in the pharmaceutical domain a real emphasis that has developed around pre-market testing. And there is also an expectation in some circumstances in the pharmaceutical domain for post-deployment testing, as well. But as we learned from our experts in that domain, there has naturally been a real, kind of, emphasis on the pre-market portion of that testing. And in reality, even where post-market monitoring is required and post-market testing is required, it does not always actually happen. And the experts really explained that, you know, part of it is just the incentive structure around the emphasis around, you know, the testing as a pre-market, sort of, entry requirement. And also just the resources that exist among regulators, right. There’s limited resources, right. And so there are just choices and tradeoffs that they need to make in their own, sort of, enforcement work.

And then on the other hand, you know, in cybersecurity, I never thought about the, kind of, emphasis on things like coordinated vulnerability disclosure and bug bounties that have really developed in the cybersecurity domain. But it’s a really important part of how we secure technology and enhance cybersecurity over time, where we have these norms that have developed where, you know, security researchers are doing really important research. They’re finding vulnerabilities in products. And we have norms developed where they report those to the companies that are in a position to address those vulnerabilities. And in some cases, those companies actually pay, through bug bounties, the researchers. And perhaps in some ways, the role of coordinated vulnerability disclosure and bug bounties has evolved the way that it has because there hasn’t been as much emphasis on the pre-market testing across the board at least in the context of software.

And so you look at those two industries and it was interesting to me to study them to some extent in contrast with each other as this way that the incentives and the resources that need to be applied to testing, sort of, evolve to address where there’s, kind of, more or less emphasis.

SULLIVAN: It’s a great point. I mean, I think what we’re hearing—and what you’re saying—is just exactly this choice … like, is there a binary choice between focusing on pre-deployment testing or post-deployment monitoring? And, you know, I think our assumption is that we need to do both. But I’d love to hear from you on that. 

CRAIG DECKARD: Absolutely. I think we need to do both. I’m very persuaded by this inclination always that there’s value in trying to really do it all in a risk management context. 

And also, we know one of the principles of risk management is you have to prioritize because there are finite resources. And I think that’s where we get to this challenge of really thinking deeply, especially as we’re in the early days of AI governance. We need to be very thoughtful about, you know, tradeoffs that we may not want to be making but we are because, again, these are finite choices, and we, kind of, can’t help but put our finger on the dial in different directions with our choices. It’s going to be very difficult to have, sort of, equal emphasis on both. We need to invest in both, but we need to be very deliberate about the roles of each and how they complement each other and who does which and how we use what we learn from pre- versus post-deployment testing and monitoring.

SULLIVAN: Maybe just spending a little bit more time here … you know, a lot of attention goes into testing models upstream, but risk often shows up once they’re wired into real products and workflows. How much does deployment context change the risk picture from your perspective? 

CRAIG DECKARD: Yeah, I … such an important question. I really agree that there has been a lot of emphasis to date on, sort of, testing models upstream, the AI model evaluation. And it’s also really important that we bring more attention into evaluation at the system or application level. And I actually see that in governance conversations, this is actually increasingly raised, this need to have system-level evaluation. We see this across regulation. We also see it in the context of just organizations trying to put in governance requirements for how their organization is going to operate in deploying this technology. 

And there’s a gap today in terms of best practices around system-level testing, perhaps even more than model-level evaluation. And it’s really important because in a lot of cases, the deployment context really does impact the risk picture, especially with AI, which is a general-purpose technology, and we really saw this in our study of other domains that represented general-purpose technology. 

So in the case study that you can find online on nanotechnology, you know, there’s a real distinction between the risk evaluation and the governance of nanotechnology in different deployment contexts. So the chapter that our expert on nanotechnology wrote really goes into incredibly interesting detail around, you know, deployment of nanotechnology in the context of, like, chemical applications versus consumer electronics versus pharmaceuticals versus construction and how the way that nanoparticles are basically delivered in all those different deployment contexts, as well as, like, what the risk of the actual use scenario is, just varies so much. And so there’s a real need to do that kind of risk evaluation and testing in the deployment context. This difference in terms of risks connects to what we learned in these other domains, where, you know, there are these different approaches to trying to really think about and gain efficiencies and address risks at a horizontal level versus, you know, taking a real sector-by-sector approach. And to some extent, it seems like it’s more time intensive to do that sectoral, deployment-specific work. And at the same time, perhaps there are efficiencies to be gained by actually doing the work in the context in which, you know, you have a better understanding of the risk that can result from really deploying this technology.

And ultimately, [LAUGHS] really what we also need to think about here is probably, in the end, just like pre- and post-deployment testing, you need both. Not probably; certainly!

So effectively we need to think about evaluation at the model level and the system level as being really important. And it’s really important to get system evaluation right so that we can actually build trust in this technology in deployment contexts, so we enable adoption in low- and high-risk deployments in a way that means we’ve done risk evaluation in each of those contexts, in a way that really makes sense in terms of the resources that we need to apply, and ultimately we are able to unlock more applications of this technology in a risk-informed way.

SULLIVAN: That’s great. I mean, I couldn’t agree more. I think these contexts, the approaches are so important for trust and adoption, and I’d love to hear from you, what do we need to advance AI evaluation and testing in our ecosystem? What are some of the big gaps that you’re seeing, and what role can different stakeholders play in filling them? And maybe an add-on, actually: is there some sort of network effect that could 10x our testing capacity? 

CRAIG DECKARD: Absolutely. So there’s a lot of work that needs to be done, and there’s a lot of work in process to really level up our whole evaluation and testing ecosystem. We learned, across domains, that there’s really a need to advance our thinking and our practice in three areas: rigor of testing; standardization of methodologies and processes; and interpretability of test results. 

So what we mean by rigor is that we are ensuring that what we are ultimately evaluating in terms of risks is defined in a scientifically valid way and we are able to measure against that risk in a scientifically valid way. 

By standardization, what we mean is that there’s really an accepted and well-understood and, again, a scientifically valid methodology for doing that testing and for actually producing artifacts out of that testing that are meeting those standards. And that sets us up for the final portion on interpretability, which is, like, really the process by which you can trust that the testing has been done in this rigorous and standardized way and that then you have artifacts that result from the testing process that can really be used in the risk management context because they can be interpreted, right. 

We understand how to, like, apply weight to them for our risk-management decisions. We actually are able to interpret them in a way that perhaps they inform other downstream risk mitigations that address the risks that we see through the testing results and that we actually understand what limitations apply to the test results and why they may or may not be valid in certain, sort of, deployment contexts, for example, and especially in the context of other risk mitigations that we need to apply. So there’s a need to advance all three of those things—rigor, standardization, and interpretability—to level up the whole testing and evaluation ecosystem. 

And when we think about what actors should be involved in that work … really everybody, which is both complex to orchestrate but also really important. And so, you know, you need to have the entire value chain involved in really advancing this work. You need the model developers, but you also need the system developers and deployers that are really engaged in advancing the science of evaluation and advancing how we are using these testing artifacts in the risk management process. 

When we think about what could actually 10x our testing capacity—that’s the dream, right? We all want to accelerate our progress in this space. You know, I think we need work across all three of those areas of rigor, standardization, and interpretability, but I think one that will really help accelerate our progress across the board is that standardization work, because ultimately, you’re going to need to have these tests be done and applied across so many different contexts, and ultimately, while we want the whole value chain engaged in the development of the thinking and the science and the standards in this space, we also need to realize that not every organization is necessarily going to have the capacity to, kind of, contribute to developing the ways that we create and use these tests. And there are going to be many organizations that are going to benefit from there being standardization of the methodologies and the artifacts that they can pick up and use.

One thing that I know we’ve heard throughout this podcast series from our experts in other domains, including Timo [Minssen] in the medical devices context and Ciaran [Martin] in the cybersecurity context, is that there’s been a recognition, as those domains have evolved, that there’s a need to calibrate our, sort of, expectations for different actors in the ecosystem and really understand that small businesses, for example, just cannot apply the same degree of resources that others may be able to, to do testing and evaluation and risk management. And so the benefit of having standardized approaches is that those organizations are able to, kind of, integrate into the broader supply chain ecosystem and apply their own, kind of, risk management practices in their own context in a way that is more efficient. 

And finally, the last stakeholder that I think is really important to think about in terms of partnership across the ecosystem to really advance the whole testing and evaluation work that needs to happen is government partners, right, and thinking beyond the value chain, the AI supply chain, and really thinking about public-private partnership. That’s going to be incredibly important to advancing this ecosystem.

You know, I think there’s been real progress already in the AI evaluation and testing ecosystem in the public-private partnership context. We have been really supportive of the work of the International Network of AI Safety and Security Institutes[1] and the Center for AI Standards and Innovation that all allow for that kind of public-private partnership on actually testing and advancing the science and best practices around standards. 

And there are other innovative, kind of, partnerships, as well, in the ecosystem. You know, Singapore has recently released their Global AI Assurance Pilot findings. And that effort really paired application deployers and testers so that consequential impacts at deployment could really be tested. And that’s a really fruitful, sort of, effort that complements the work of these institutes and centers that are more focused on evaluation at the model level, for example.

And in general, you know, I think that there’s just really a lot of benefits for us thinking expansively about what we can accomplish through deep, meaningful public-private partnership in this space. I’m really excited to see where we can go from here with building on, you know, partnerships across AI supply chains and with governments and public-private partnerships. 

SULLIVAN: I couldn’t agree more. I mean, this notion of more engagement across the ecosystem and value chain is super important for us and informs how we think about the space completely. 

If you could invite any other industry to the next workshop, maybe quantum safety, space tech, even gaming, who’s on your wish list? And maybe what are some of the things you’d want to go deeper on? 

CRAIG DECKARD: This is something that we really welcome feedback on if anyone listening has ideas about other domains that would be interesting to study. I will say, I think I shared at the outset of this podcast series, the domains that we added in this round of our efforts in studying other domains actually all came from feedback that we received from, you know, folks we’d engaged with in our first study of other domains and multilateral, sort of, governance institutions. And so we’re really keen to think about what other domains could be interesting to study. And we are also keen to go deeper, building on what we learned in this round of effort going forward. 

One of the areas that I am particularly really interested in is going deeper on, what, sort of, transparency and information sharing about risk evaluation and testing will be really useful to share in different contexts? So across the AI supply chain, what is the information that’s going to be really meaningful to share between developers and deployers of models and systems and those that are ultimately using this technology in particular deployment contexts? And, you know, I think that we could have much to learn from other general-purpose technologies like genome editing and nanotechnology and cybersecurity, where we could learn a bit more about the kinds of information that they have shared across the development and deployment life cycle and how that has strengthened risk management in general as well as provided a really strong feedback loop around testing and evaluation. What kind of testing is most useful to do at what point in the life cycle, and what artifacts are most useful to share as a result of that testing and evaluation work?

I’ll say, as Microsoft, we have been really investing in how we are sharing information with our various stakeholders. We also have been engaged with others in industry in reporting what we’ve done in the context of the Hiroshima AI Process, or HAIP, Reporting Framework. This is an effort that is really just in its first round of really exploring how this kind of reporting can be really additive to risk management understanding. And again, I think there’s real opportunity here to look at this kind of reporting and understand, you know, what’s valuable for stakeholders and where is there opportunity to go further in really informing value chains and policymakers and the public about AI risk and opportunity and what can we learn again from other domains that have done this kind of work over decades to really refine that kind of information sharing. 

SULLIVAN: It’s really great to hear about all the advances that we’re making on these reports. I’m guessing a lot of the metrics in there are technical, but sociotechnical impacts—jobs, maybe misinformation, well-being—are harder to score. What new measurement ideas are you excited about, and do you have any thoughts on, like, who needs to pilot those?

CRAIG DECKARD: Yeah, it’s an incredibly interesting question that I think also just speaks to, you know, the breadth of, sort of, testing and evaluation that’s needed at different points along that AI life cycle and really not getting lost in one particular kind of testing or another pre- or post-deployment and thinking expansively about the risks that we’re trying to address through this testing. 

You know, for example, the UK’s AI Security Institute has just recently launched a new program, a new team, that’s focused on societal resilience research. I think it’s going to be a really important area, from a sociotechnical impact perspective, to bring some focus to as this technology is more widely deployed. Are we understanding the impacts over time as different people and different cultures adopt and use this technology for different purposes? 

And I think that’s an area where there really is opportunity for greater public-private partnership in this research. Because we all share this long-term interest in ensuring that this technology is really serving people and we have to understand the impacts so that we understand, you know, what adjustments we can actually pursue sooner upstream to address those impacts and make sure that this technology is really going to work for all of us and in a way that is consistent with the societal values that we want. 

SULLIVAN: So, Amanda, looking ahead, I would love to hear just what’s going to be on your radar? What’s top of mind for you in the coming weeks?

CRAIG DECKARD: Well, we are certainly continuing to process all the learnings that we’ve had from studying these domains. It’s really been a rich set of insights that we want to make sure we, kind of, fully take advantage of. And, you know, I think these hard questions and, you know, real opportunities to be thoughtful in these early days of AI governance are not, sort of, going away or being easily resolved soon. And so I think we continue to see value in really learning from others, thinking about what’s distinct in the AI context, but also what we can apply in terms of what other domains have learned.

SULLIVAN: Well, Amanda, it has been such a special experience for me to help illuminate the work of the Office of Responsible AI and our team in Microsoft Research, and [MUSIC] it’s just really special to see all of the work that we’re doing to help set the standard for responsible development and deployment of AI. So thank you for joining us today, and thanks for your reflections and discussion.

And to our listeners, thank you so much for joining us for the series. We really hope you enjoyed it! To check out all of our episodes, visit aka.ms/AITestingandEvaluation, and if you want to learn more about how Microsoft approaches AI governance, you can visit microsoft.com/RAI.

See you next time! 

[MUSIC FADES] 


[1] Since the launch of the International Network of AI Safety Institutes, the UK renamed its institute the AI Security Institute.

