Podcast: Here's a Free Tool To Test Your AI
In this episode, we explore how Robuscope is reimagining the future of AI testing, offering groundbreaking solutions for improving the accuracy and reliability of artificial intelligence systems.
Episode Notes
(0:50) - AI Testing Reimagined with Robuscope
This episode is part of a series developed in collaboration with SAE International to explore the leading edge of mobility with the support of experts from industry and academia. Learn more about the importance of robust AI & Robuscope from the Fraunhofer Institute by checking out this link.
Become a founding reader of our newsletter: http://read.thenextbyte.com/
Transcript
If you've ever seen an AI model, a self-driving car, your phone, something like that, make a mistake, that's okay, because AI isn't always ready for the real world. And in today's episode, we're talking about a testing tool that you can use today. It's free, it's online, and it lets users and developers submit their AI models and receive detailed reports to understand those models' strengths and weaknesses.
I'm Daniel, and I'm Farbod. And this is the NextByte Podcast. Every week, we explore interesting and impactful tech and engineering content from Wevolver.com and deliver it to you in bite sized episodes that are easy to understand, regardless of your background.
Daniel: What's up, folks? Today we're talking about AI, but in a little bit of a different way than we've been talking about it in the past. Usually we're talking about, here's AI that's designed for a specific purpose. Here's the problem. Here's how AI helps solve it. And here's the significance. But in today's episode, we're talking about AI with a different twist, in that we realize AI is becoming increasingly integrated into all parts of our lives, and a lot of it in critical applications: autonomous vehicles, medical diagnostics, et cetera. The more involved AI becomes in our lives, the more crucial it is that these models are actually really reliable. So, in today's episode, we're talking all about an online tool that's actually free and available to help everyone rigorously test the robustness of their own AI model. So, for our listeners who are AI engineers out there, or who know them, this is an interesting tool for you to share, to listen to, to evaluate, to provide some feedback on. Because in this case, inadequate testing of AI models in complex real-world scenarios causes bad outcomes. So, our goal here is to reduce AI failures in high-stakes environments and make AI as reliable and safe as we can.
Farbod: And just to add some context there, right? Imagine your doctors become more and more reliant on AI models for analyzing your CAT scans or your MRI scans. Now imagine those models are actually not very robust, meaning they give an answer about whether you have cancer or not while they're maybe only 60 or 70% certain, instead of saying, hey, I'm not that sure. And then your doctor just blindly accepts that. Now, I'm not saying this is what happens, I'm just saying envision that future. Well, to get ahead of that, you need to have good testing, good validation of these AI models before they get launched out to the general public, right? And that's what these folks are trying to do.
Daniel: No, I'm with you, man. And it's not just the scenario you mentioned, where somebody's trusting an AI model that doesn't have 100% certainty. What if the AI model is 99% certain, but it's wrong? That's a potential outcome as well, right? AI models get trained to really, really high fidelity, high reliability, high accuracy in a sterile scenario with the training data. But then in real-world applications, there's a twist to the data or some boundary case that makes the AI model not as reliable as it was before. So, there's a lot of big companies out there that are pushing AI through without testing. We see those things go viral on social media. I think about Facebook, was it, that used AI in their hiring process and it ended up biasing their hiring.
Farbod: Amazon.
Daniel: Amazon. Yeah. Boom, used AI in their hiring process and it ended up biasing their hiring, because it was training off of existing resumes in their employee group. And it ended up furthering the lack of diversity that already existed in their workforce. So that's just one example. I see stuff like this all the time, every single day. I don't wanna take stabs at too many companies, because it's a fundamental challenge. It's a fundamental challenge with AI development. And I appreciate the, is it Fraunhofer? Fraunhofer?
Farbod: Fraunhofer, yeah.
Daniel: Fraunhofer Institute that created this tool that we're going to talk about today, called Robuscope. It's a free tool that tests AI in all sorts of tough situations, right, the boundary conditions. It's a comprehensive testing framework, almost a report card or a health check on your AI model: you input data points, it tests the model, gives it a comprehensive evaluation in various scenarios to identify potential weaknesses, and then gives you a performance report to say, hey, this is where your AI model may be unreliable. I think of it like a webpage loading test, that boilerplate test developers use to understand whether their page is loading fast enough or not. But in this case, you use it with AI models to make sure that they're robust.
Farbod: Yeah, definitely. And one thing that came to mind when I saw this, I don't do this myself, but the discourse I've seen around AI model testing is very interesting. Essentially, it's a product that you can only do black-box testing on. So that leaves a lot of test engineers in a weird spot of, okay, how much testing, how in-depth of a testing process can we actually do here? And like you were saying earlier, we've seen a lot of companies just kind of forego the testing process and ship products, because the competition is so hot, or maybe it's just not worth it to them. Basically, the manual testing process for these AI models can be quite monotonous, quite long, and quite expensive, which all stands against these tough shipping deadlines that our tech companies have become accustomed to. So now you have something that is free, easy to use, and quite fast, that at the very minimum gives you, like you're saying, a pulse check that everything is as good as it can be. And you don't even get just a yes or no, you get a pretty detailed report card of how it's doing on these different robustness and uncertainty metrics.
Daniel: I think it's super sick. Like one of the cool things about it is, I feel like we're like a little bit unorganized today, but it's just because we're excited about it. Two things that I think are really interesting, one of them you're mentioning, it's not just a yes or no. You can go to the website and check out an example report, even without uploading a set of data yourself. It tells you the accuracy of your model, the F score. I'm not sure what F score is. Do you know?
Farbod: No. No idea. I saw it and I was like; I'm looking for the explanation. I can't find it.
Daniel: Expected calibration error, minimum calibration error, prediction rejection ratio, prediction stability for correct predictions, and prediction stability for incorrect predictions. So, it kind of tells you how likely your AI model is to get confused, and the average confidence you can have for different classes of data. And then it helps you interpret each of these things, even the ones I didn't know. It's got a bunch of descriptors underneath to help explain all the different factors by which it's grading your AI model. I think that's super cool. The second thing that I think is really interesting, and I think it's even more interesting than the level of detail it gives you: you don't actually need to upload any proprietary data about your AI model.
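To make one of those metrics concrete: expected calibration error measures the gap between how confident a model says it is and how often it's actually right. Here's a minimal Python sketch of a common binned formulation. This is a generic illustration only, not Robuscope's actual implementation, whose exact formulas the episode doesn't describe.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between stated confidence and actual accuracy,
    computed over equal-width confidence bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            # |accuracy in bin minus average confidence in bin|, weighted by bin size
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap
    return ece

# A well-calibrated model that says "90% sure" should be right about 90% of the time.
confs = [0.9, 0.8, 0.95, 0.6, 0.7]
hits = [1, 1, 1, 0, 1]   # 1 = the prediction was correct
print(expected_calibration_error(confs, hits))
```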
Farbod: Right!
Daniel: All you have to do is take the inputs and the outputs and then feed those into Robuscope. And that's how it helps make sure that your model is robust. So, it doesn't need the actual secret sauce, which is funny, because that's what this podcast is all about. We've got a secret sauce poster on the wall behind me. But you don't actually need to share your secret sauce. In this case, you can just share the inputs and the outputs and keep the secret recipe like Plankton, or no, like Mr. Krabs, trying to keep it locked up in the safe so Plankton doesn't come take it.
Farbod: Yeah, yeah. Hey, but before we move on, I did use ChatGPT to tell me what the F score is. It's a metric for evaluating machine learning models that combines precision and recall into a single score. So, how precise the predictions are, and how well the model recalls, meaning how many of the actual positive cases it catches.
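For reference, the F score (commonly the F1 score) is the harmonic mean of those two quantities. A minimal sketch, assuming the standard definitions of precision and recall:

```python
def f1_score(true_pos, false_pos, false_neg):
    """Harmonic mean of precision and recall."""
    precision = true_pos / (true_pos + false_pos)   # of the positives we predicted, how many were right
    recall = true_pos / (true_pos + false_neg)      # of the actual positives, how many did we find
    return 2 * precision * recall / (precision + recall)

# e.g. 80 correct positive calls, 20 false alarms, 10 missed cases
print(f1_score(80, 20, 10))   # ~0.84
```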
Daniel: Love it.
Farbod: So, it's a combination of those two things. But I totally agree. Not only is it, like you said, kind of straightforward in terms of how you test it, you know, you give the ground truth and the predictions that your model came up with, but the format in which you provide it is quite amazing as well. It's just a JSON file. For the folks that might not be familiar, JSON is a commonly used file format. Think of text files or Word files on your computer; JSON is just another kind. And inside of it, you usually have hierarchies and structures that are dictionaries. So, you have a key, like your ground truth value. Let's say the right answer is A, or no, no, not A, let's say 1. And then you have the predictions from the model, which could be, you know, 0.85, 0.67, et cetera, et cetera. And that's what you're uploading. So, it's a very widely used format, very easy to generate, and super accessible. You can upload, I think, files up to four gigabytes, which is quite large. And you can test not just a single grouping of data; you can upload different groupings of data by just customizing your JSON file a little bit.
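To illustrate the kind of file being described: the exact schema Robuscope expects isn't spelled out in the episode, so the key names below are hypothetical placeholders. The general shape is ground-truth labels paired with the model's predicted probabilities, with multiple groupings of data sitting side by side:

```python
import json

# Hypothetical field names for illustration; the episode only says you upload
# ground-truth labels alongside the model's predictions, not the exact schema.
payload = {
    "dataset_a": {
        "ground_truth": [1, 0, 1],      # the right answers
        "predictions": [                # per-class probabilities from the model
            [0.15, 0.85],               # predicts class 1 at 85%: correct
            [0.67, 0.33],               # predicts class 0 at 67%: correct
            [0.60, 0.40],               # predicts class 0: wrong, and not very sure
        ],
    },
    # extra groupings of data can sit alongside as additional top-level keys
}

with open("robuscope_upload.json", "w") as f:
    json.dump(payload, f, indent=2)
```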
Daniel: Well, let's talk about a potential example here, the "so what," to kind of contextualize what would make this so valuable to a developer. We kind of alluded to this earlier, but an example that they use is folks using AI in the field of medical diagnostics, right? Using medical image data, such as a CT or an MRI scan, to try and diagnose a health issue with a patient. An incorrect diagnosis one way, a false positive, let's say, where they tell you you have breast cancer when you do not, that's a bad outcome. An even worse outcome is a false negative, where it tells you you're healthy when you're actually not. So, you always want the AI model results to be both reliable and transparent in terms of the uncertainty associated with a certain outcome. Robuscope does straightforward testing for this. They've actually focused a lot on the medical realm, as well as logistics and autonomous driving. Essentially, this tool can help developers understand the degree of uncertainty of predictions for different classes of data. So, in this case, Robuscope would help check how robust and safe the predictions of an AI system for breast cancer diagnoses are, and then could share with the developer what classes of data, what types of outcomes need to be escalated to the medical staff so that data can be reviewed manually. So, it's not saying, let's sideline your AI model. You take a bunch of inputs and a bunch of outputs, and this can tell you where the model is robust and where it's not, and then can let medical staff know which types of results they have to double-check. Maybe it's certain types of images with a different contrast threshold or a different image size; I'm not sure what those inputs are for a medical diagnostics situation. But the idea is you can understand where the model is really, really strong, accurate, and reliable, with low uncertainty, and also understand the weaknesses, to try and help us take a nuanced approach to the rollout and application of AI in very critical realms like self-driving, like medical diagnostics, like logistics, et cetera.
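One way a developer might act on that kind of per-class uncertainty report is a simple confidence gate: any prediction the model isn't sufficiently sure about gets escalated to a human. A minimal sketch; the threshold value and the idea of per-class tuning are illustrative assumptions, not a documented part of Robuscope:

```python
import numpy as np

def flag_for_review(class_probs, threshold=0.90):
    """Return True for predictions whose top-class confidence is below the
    threshold, i.e. cases that should be escalated to a human reviewer.
    The 0.90 threshold is an illustrative choice; in practice it could be
    tuned per class using the weaknesses a robustness report surfaces."""
    class_probs = np.asarray(class_probs, dtype=float)
    return class_probs.max(axis=1) < threshold

scans = np.array([
    [0.02, 0.98],   # confident call: let it through automatically
    [0.45, 0.55],   # borderline call: send to medical staff
])
print(flag_for_review(scans))   # [False  True]
```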
Farbod: Totally agree. And as I was reading this, it just really sounds like a win all around. It can become a standardized tool that everyone uses moving forward, and it grows and gets better over time, which is great. Now, we promised our listeners that we would give a balanced view on every topic we talk about, so I was struggling to find a potential downside. After some reflection, I think being an online-only tool might be one, like requiring an internet connection. In this day and age, everybody has an internet connection, so I don't know how big a deal that is. But the fact that you can't just download the tool, go offline, and then test with sensitive data in cases where you might need to, I think that could be a potential growth area, not even a downside, that's how I'll phrase it.
Daniel: Well, yeah, that's what I was gonna say too. It would be awesome if, and maybe they already have this available, I don't know, but I can think of a constraint like in my day job where they say, it doesn't matter if they claim it's confidential, do not upload any data to an online tool. Just as a rule of thumb, right, for IT security. So, I'm with you there. If there were some type of air-gapped version of this, it could give people a lot of confidence. I think it might help companies, universities, governments, et cetera, that are at the cutting edge of technology feel more confident about using it and help enhance the adoption curve of this tool. Otherwise, I don't know, what's that score in Silicon Valley that they all try to beat, the best score?
Farbod: The Weissman score?
Daniel: The Weissman score. I could see this being like the new Weissman score for AI. Like, what's the benchmark for AI reliability? What's the number you share with investors, share with the public, share with your friends, the level of robustness that you get in AI development? I could even see professors using this in schools to help developers learn how to use AI tools. Like someone here listening, or someone here recording this podcast, that's starting classes next week, using this as a tool in their school, right? That's Farbod, by the way, not me. I'm not going back to school. You know, like in your master's program starting next week, using something like this to help evaluate the robustness of a machine learning tool that you're developing for a class assignment. I could see this becoming the way by which you self-evaluate your work.
Farbod: Well, that's very fitting because my first class is machine learning for stock trading. So, this will come in quite handy.
Daniel: Well, yeah, exactly. Instead of putting your money where your mouth is first, you can use something like this to check your model first.
Farbod: Yeah, see if I will go broke or not before I actually do.
Daniel: Or to understand the strong areas, the specific stock categories that your model is really robust at so that you can double down on trading in those categories and not trade so much in the others where you would have lost a lot of money.
Farbod: I don't know. Knowing myself, my algorithm's just gonna say buy GameStop.
Daniel: Probably.
Farbod: Yeah, this is not financial advice, by the way.
Daniel: Not at all. You don't wanna listen to us for financial advice.
Farbod: Or any advice, really. You wanna do a little quick wrap up?
Daniel: Yes, sir. Have you ever seen a self-driving car or a smart device or an AI model make a weird mistake? That's okay, it's because AI isn't always as smart and reliable as we think. But we always want AI to be super reliable, especially as its importance grows in our lives. It's helping drive our cars, helping doctors make important decisions. The problem is that traditional development methods, the tests that people use, don't always catch all the ways that AI can mess up in real life. So, the Fraunhofer Institute created Robuscope. It's this free tool that tests AI in all sorts of tough situations and basically creates a report card to make sure that your AI model is ready for anything. Unlike a lot of big companies that push out AI models without doing thorough testing, I'm looking at you, Tesla, Robuscope helps catch these mistakes before they happen, helps you understand where your models are strong and where they're weak, so you implement them in the right way, making AI safer for everyone.
Farbod: Wow, a good summary and a jab at Tesla? You're killing it.
Daniel: You know me, man. The mic comes up and the gloves come off.
Farbod: Oh yeah, that's Daniel for you. Alright everyone, thank you so much for listening. And as always, we'll catch you the next one.
Daniel: Peace.
As always, you can find these and other interesting & impactful engineering articles on Wevolver.com.
To learn more about this show, please visit our shows page. By following the page, you will get automatic updates by email when a new show is published. Be sure to give us a follow and review on Apple Podcasts, Spotify, or any of your favorite podcast platforms!
--
The Next Byte: We're two engineers on a mission to simplify complex science & technology, making it easy to understand. In each episode of our show, we dive into world-changing tech (such as AI, robotics, 3D printing, IoT, & much more), all while keeping it entertaining & engaging along the way.