Have you seen machine learning solutions fall flat in practice?
Well, I have. Several times. I get occasional panic calls from teams about their 98% accurate models generating questionable predictions once released to actual users.
Did they build a bad model? Maybe.
But the real issue is that the majority of these teams skipped a step.
And that step is testing. Not just any type of testing, but post-development testing (PDT).
What is Post-Development Testing (PDT)?
Post-development testing, in the context of machine learning, is an experimentation period in which you take a model from development and test it on real data, often with real users. This happens before the model is fully deployed.
Why is PDT important?
PDT is important for two primary reasons. First, it ensures that the model works as expected in practice (model success). Second, it verifies that the model serves the needs of the business (business success).
Let’s first talk about model success. When models go from development to production (i.e., practice), there’s often a natural degradation in performance. However, if the performance discrepancy is too wide, then it’s no longer a natural degradation. It’s a problem that needs to be addressed.
Let’s say you’re dealing with a fraud detection model. In development, its accuracy is 97%. But when you put the model to the test on real data, it makes all sorts of mistakes and performs at only about 50% accuracy. From a business standpoint, this performance is unacceptable.
There can be many reasons for a wide discrepancy between development performance (DevPerform) and production performance (ProdPerform). Perhaps the real data has many missing values, which the model does not expect. Or the real data comes from a different distribution than the training data. It could also be that the model has memorized exact patterns from the data it was shown, so when it sees an unfamiliar pattern from the real world, it doesn’t recognize it and makes an incorrect prediction. In technical terms, this problem is known as overfitting.
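To make the DevPerform versus ProdPerform comparison concrete, here is a minimal sketch in plain Python. The function names, the `max_gap` threshold of 0.05, and the use of `None` to represent a missing value are all assumptions for illustration. What counts as "too wide" depends on your problem.

```python
from typing import Optional, Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the true labels."""
    if len(y_true) == 0 or len(y_true) != len(y_pred):
        raise ValueError("labels and predictions must be non-empty and equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def missing_rate(values: Sequence[Optional[float]]) -> float:
    """Fraction of entries that are None (a stand-in for missing values)."""
    if len(values) == 0:
        raise ValueError("values must be non-empty")
    return sum(v is None for v in values) / len(values)


def performance_gap_too_wide(dev_perform: float, prod_perform: float,
                             max_gap: float = 0.05) -> bool:
    """True when the drop from DevPerform to ProdPerform exceeds max_gap.

    max_gap is a hypothetical tolerance; pick one that fits your use case.
    """
    return (dev_perform - prod_perform) > max_gap


# Hypothetical labels and predictions collected on real data during PDT.
prod_labels = [1, 0, 0, 1, 0, 1, 0, 0]
prod_preds = [1, 1, 0, 0, 0, 0, 1, 0]

dev_perform = 0.97  # accuracy reported in development
prod_perform = accuracy(prod_labels, prod_preds)

if performance_gap_too_wide(dev_perform, prod_perform):
    print(f"Investigate: DevPerform={dev_perform:.2f}, ProdPerform={prod_perform:.2f}")
```

Running `missing_rate` on each input field of the real data is a quick first check on whether missing values could be behind the gap.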
Such problems can be discovered during PDT. Instead of assuming that the model is going to work out-of-the-box, PDT allows you to validate this. And if it’s not working as expected, you still have a chance to fix it before it’s released more widely.
Now for business success. A machine learning model exists to serve a business purpose, whether that’s to automate a workflow, improve productivity, or reduce human error. Making progress towards this purpose is what business success is about.
The experimentation period during PDT allows you to verify that your business objectives are also being met. Even if it’s not a hundred percent there yet, PDT gives you a chance to see if things are moving in the right direction from a business objective perspective.
Let’s say the fraud detection model discussed earlier was meant to make customer support agents more productive. But it isn’t, even after all model issues have been fixed.
Perhaps the user interface is problematic. Or network latency is preventing agents from consuming the results in a timely fashion. Findings like these can affect your decision on whether to pour money and resources into operationalizing the model.
Getting Started With Post-Development Testing
If you’ve never considered PDT in your AI initiatives, here are four tips for getting started:
#1: Test early
Don’t wait for a perfect model. As long as the DevPerform is reasonable, you can put models to the test and start real-world evaluation. If you wait too long, you risk discovering issues late that can set you back weeks or months. For example, say your development team has made the wrong assumptions about the input data. If this happens, no matter how perfect the model is, it will still require significant rework.
#2: Use the right ProdPerform metrics
Although your development team may have development metrics established, what you’d use in a production setting can be completely different.
For example, when it comes to a product recommendation problem, you may use precision and recall to assess DevPerform. But those same metrics may not be applicable in a production setting due to the dynamic nature of the recommendations.
So, what can you do?
One option is to track click-throughs in production to see if users are engaging with the recommendations.
A click can be an implicit way of assessing relevance. Alternatively, if your budget allows for it, you can also conduct a user study. With this, you would recruit users to rate the recommendations on a rating scale. Say you assign a scale of 1-5, where the higher the rating, the more relevant the recommendation. This turns a subjective human assessment into something quantifiable.
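Both ideas reduce to simple arithmetic. The sketch below shows one way to compute a click-through rate and an average relevance rating in Python. The function names and the sample numbers are made up for illustration, and the 1-5 bounds follow the rating scale described above.

```python
from statistics import mean
from typing import Sequence


def click_through_rate(clicks: int, impressions: int) -> float:
    """Clicks divided by the number of recommendations shown."""
    if impressions <= 0:
        raise ValueError("impressions must be positive")
    if clicks < 0 or clicks > impressions:
        raise ValueError("clicks must be between 0 and impressions")
    return clicks / impressions


def mean_relevance(ratings: Sequence[int], low: int = 1, high: int = 5) -> float:
    """Average user-study rating, rejecting values outside the scale."""
    if len(ratings) == 0:
        raise ValueError("ratings must be non-empty")
    if any(r < low or r > high for r in ratings):
        raise ValueError(f"ratings must be between {low} and {high}")
    return float(mean(ratings))


# Hypothetical numbers from one week of PDT.
ctr = click_through_rate(clicks=30, impressions=1000)
avg_rating = mean_relevance([5, 4, 3, 4, 4])
print(f"CTR: {ctr:.1%}, mean relevance: {avg_rating:.1f} out of 5")
```

Neither number means much in isolation. They become useful once you compare them against a baseline or against earlier measurements.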
You can get super creative with how you evaluate your ProdPerform. But always remember, at least part of it should be something you can track over time so you can spot deterioration.
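As a rough illustration of tracking a metric over time, the sketch below compares the average of the most recent measurements with the average of the earliest ones and flags a large relative drop. The window size, the 10% tolerance, and the weekly click-through values are all assumptions, and a real setup may call for a more careful statistical test.

```python
from statistics import mean
from typing import Sequence


def has_deteriorated(history: Sequence[float], window: int = 3,
                     max_relative_drop: float = 0.10) -> bool:
    """Flag deterioration in a metric where higher is better.

    Compares the mean of the last `window` values with the mean of the
    first `window` values and returns True when the relative drop
    exceeds `max_relative_drop`.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    if len(history) < 2 * window:
        return False  # not enough data for two separate windows
    baseline = mean(history[:window])
    recent = mean(history[-window:])
    if baseline <= 0:
        return False  # relative drop is undefined without a positive baseline
    return (baseline - recent) / baseline > max_relative_drop


# Hypothetical weekly click-through rates, oldest first.
weekly_ctr = [0.050, 0.052, 0.049, 0.047, 0.041, 0.038]
if has_deteriorated(weekly_ctr):
    print("ProdPerform is slipping; time to investigate.")
```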
#3: Use multi-faceted evaluation
PDT is not just to assess model performance. It’s also a time to evaluate the relevant business metrics.
Ensure that the right business metrics are established and being tracked alongside metrics used for ProdPerform. If you don’t see model issues, are your business metrics moving in the right direction? If not, this is the time to figure out why and resolve underlying issues. Or decide if you should go back to the drawing board.
#4: Iterate and re-evaluate
As problems are discovered during PDT, fix their underlying causes and re-evaluate until you reach a target level of accuracy, business metric improvement, or user satisfaction. Once you’ve reached a point of diminishing returns, or when you feel that the results are good enough for practical use, then it’s time for full deployment.
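One way to make the stopping decision explicit is to write it down as a rule. The sketch below is only an illustration: the function name, the `target` score, and the `min_gain` threshold for "diminishing returns" are hypothetical, and the score could be accuracy, a business metric, or a user satisfaction rating.

```python
from typing import Sequence


def pdt_stop_reason(scores: Sequence[float], target: float,
                    min_gain: float = 0.01) -> str:
    """Decide whether to keep iterating, given one score per PDT round.

    scores: evaluation results, oldest first (higher is better).
    target: the level considered good enough for practical use.
    min_gain: improvement below this counts as diminishing returns.
    """
    if len(scores) == 0:
        return "keep_iterating"
    if scores[-1] >= target:
        return "target_reached"
    if len(scores) >= 2 and scores[-1] - scores[-2] < min_gain:
        return "diminishing_returns"
    return "keep_iterating"


# Hypothetical ProdPerform after each fix-and-re-evaluate round.
rounds = [0.70, 0.80, 0.91]
print(pdt_stop_reason(rounds, target=0.90))
```

If the rule reports diminishing returns while the score is still far from good enough, that is a signal to step back and rethink the approach before committing to full deployment.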
Have you tried PDT?
Looking back at your AI initiatives, have you paid close attention to post-development testing? Did it help? If not, is there something you or your team would’ve done differently prior to deployment? Let me know below.