In the recent past there has been a spate of accidents involving AI and machine learning models in practice and deployment, so much so that an active database now chronicles such incidents (https://incidentdatabase.ai/). At a time when AI is driving radical business transformation for enterprises, it is vital that AI deployments in real transformational scenarios are seamless. That, in turn, requires quality, trustworthy, and responsible AI.
A critical requirement for quality, trustworthy, and responsible AI is a focused effort to test AI, ML, and DL models thoroughly. In a previous article (https://medium.com/@srinivaspadmanabhuni/why-current-testing-processes-in-ai-ml-are-not-enough-f9a53b603ec6) we showed that existing techniques and processes are not sufficient to ensure this.
In this article we set out the complete set of tests an AI model needs in order to ensure quality, trustworthy, and responsible AI.
We enumerate and define each of these tests for AI, ML, and DL models below.
1. Randomized Testing with Train-Test Split:
In the article https://medium.com/@srinivaspadmanabhuni/why-current-testing-processes-in-ai-ml-are-not-enough-f9a53b603ec6 we illustrated that the current foundation of testing in the ML life cycle rests on the principle of splitting the data into training and test sets and computing metrics on the test set. Metrics vary from accuracy in classification to mean squared error (MSE) in regression. The basic idea is to test how the model performs on unseen data.
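To make the train-test idea concrete, here is a minimal sketch in plain Python. The helper names (`train_test_split`, `fit_threshold`), the toy data, and the 80/20 split are our own assumptions for illustration, not part of any particular library.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle rows reproducibly and return (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(round(len(shuffled) * test_fraction)))
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels (classification)."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_squared_error(y_true, y_pred):
    """Average squared difference (regression)."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def fit_threshold(train_rows):
    """Hypothetical toy model: midpoint between the two class means."""
    zeros = [x for x, y in train_rows if y == 0]
    ones = [x for x, y in train_rows if y == 1]
    return (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2

# Toy data set (assumed): label is 1 when the feature is at least 50.
data = [(x, int(x >= 50)) for x in range(100)]
train, test = train_test_split(data, test_fraction=0.2, seed=42)

threshold = fit_threshold(train)          # fitted on training rows only
test_accuracy = accuracy(
    [y for _, y in test],
    [int(x >= threshold) for x, _ in test],
)
print(f"accuracy on unseen data: {test_accuracy:.2f}")
```

The key point is that the model never sees the test rows while it is being fitted, so the reported metric reflects performance on unseen data.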
2. Cross Validation techniques
This is a set of effective model evaluation techniques currently in vogue as part of the ML process. Here again the basic idea is to test how the model performs on unseen data, now by repeating the metric evaluation across several different splits of the data. This can be done with any of the three techniques below:
2.1 K-Fold Cross Validation: Here the data is split into k parts. In each iteration one of the k parts becomes the test set and the remaining k-1 parts become the training set, and the metrics are averaged across iterations.
2.2 LOOCV (Leave-One-Out Cross Validation): An extreme form of K-Fold cross validation where a single data item is held out as the test set and the remaining n-1 items form the training set. The metrics are averaged over all n iterations, where n is the size of the data.
2.3 Bootstrap: Here the idea is to create a new data set of the same size as the existing one by sampling with replacement, and to evaluate the metrics over several such resampled data sets.
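The three splitting schemes above can be sketched as index generators in plain Python. All function names here are our own. Evaluating each bootstrap round on the rows that were never drawn (the "out-of-bag" rows) is one common choice that we assume for illustration. The mean-predictor "model" is a hypothetical stand-in for a real learner.

```python
import random

def k_fold_indices(n, k):
    """Yield (train_idx, test_idx); every index lands in exactly one test fold.
    Assumes the rows have already been shuffled if order matters."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n))
        yield train_idx, test_idx
        start += size

def loocv_indices(n):
    """Leave-one-out is K-Fold with k equal to the number of rows."""
    return k_fold_indices(n, n)

def bootstrap_indices(n, rounds, seed=0):
    """Yield (train_idx, out_of_bag_idx): n draws with replacement per round."""
    rng = random.Random(seed)
    for _ in range(rounds):
        train_idx = [rng.randrange(n) for _ in range(n)]
        drawn = set(train_idx)
        yield train_idx, [i for i in range(n) if i not in drawn]

def cross_validate(xs, ys, splits, fit, score):
    """Average the score over all splits that have a non-empty test set."""
    scores = []
    for train_idx, test_idx in splits:
        if not test_idx:
            continue
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        scores.append(score(model, [xs[i] for i in test_idx],
                            [ys[i] for i in test_idx]))
    return sum(scores) / len(scores)

# Hypothetical toy learner: always predict the mean of the training targets.
fit_mean = lambda xs, ys: sum(ys) / len(ys)
mse = lambda m, xs, ys: sum((y - m) ** 2 for y in ys) / len(ys)

xs = list(range(10))
ys = [2.0 * x for x in xs]
print("5-fold MSE:", cross_validate(xs, ys, k_fold_indices(10, 5), fit_mean, mse))
print("LOOCV MSE:", cross_validate(xs, ys, loocv_indices(10), fit_mean, mse))
print("bootstrap MSE:", cross_validate(xs, ys, bootstrap_indices(10, 20), fit_mean, mse))
```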
The test techniques above are quite prevalent in today's AI, ML, and DL deployments. However, as highlighted in https://medium.com/@srinivaspadmanabhuni/why-current-testing-processes-in-ai-ml-are-not-enough-f9a53b603ec6 they may not be enough to deal with corner cases, performance issues, security issues, privacy issues, transparency issues, and fairness/bias issues. Hence we need to expand the scope of testing to cover these broader aspects in order to ensure quality, trustworthy, and responsible AI.
To set a benchmark for such a repertoire of tests, we refer to the quality dimensions of AI, beyond the standard ones defined in ISO 25010, presented in the talk by Rik Marselis at https://www.slideshare.net/RikMarselis/testing-intelligent-machines-approaches-and-techniques-qatest-bilbao-2018
In addition to the standard ISO 25010 quality attributes, three further quality attributes are proposed there for testing AI/ML systems. These are as below:
a. Intelligent Behaviour: This tests the intelligence of the system. The traits that can be tested include the ability to learn, improvisation, transparency of choices, collaboration, and naturalness of interaction.
b. Morality: This tests the moral dimensions of the AI system. It can include broad tests for ethics (including bias), privacy, and human friendliness.
c. Personality: This is closely related to testing the humanness of the AI system. It includes tests for dimensions such as mood, empathy, humour, and charisma.
In view of this discussion, it is vital that we evolve a testing strategy with a comprehensive set of tests for AI/ML systems that covers both these additional dimensions of quality and the standard dimensions from ISO 25010.
Let us look at some of the important tests we need to incorporate from the perspective of these additional quality attributes.
3. Tests for Explainability:
To test the “transparency of choices” element under Intelligent Behaviour above, we need a comprehensive approach to testing models for explainability. As we discussed in https://medium.com/@srinivaspadmanabhuni/why-some-ml-models-required-to-have-explainability-fc190906a9c8 such tests are specifically required when AI/ML models are not interpretable, as with neural networks. For interpretable models, it is fairly easy to obtain the rationale behind an inference. Complex models like neural networks, however, have to be tested for explainability, where we test for the rationale behind any decision. This whole area is broadly referred to as XAI (Explainable AI), as framed by DARPA at https://www.darpa.mil/program/explainable-artificial-intelligence
These explainability tests can be of two types:
3.1 Model Agnostic Tests: These tests do not take into account any specific details of the ML model and work independently of it, much like black box testing. Examples include LIME.
3.2 Model Specific Tests: These explainability tests take into account the specifics of the model under consideration. For example, for a CNN-like model, you can use a technique like Grad-CAM to look transparently at the rationale for a decision.
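As a small illustration of the model agnostic idea, the sketch below uses permutation importance: shuffle one input feature at a time and measure how much the accuracy drops, using only the model's `predict` function. Note that this is a simpler technique than LIME, which fits a local surrogate model around a single prediction. The function name, toy model, and data are our own assumptions.

```python
import random

def permutation_importance(predict, rows, labels, n_features, seed=0):
    """Accuracy drop when each feature column is shuffled in turn.
    Treats the model as a black box: only predict(row) is called."""
    rng = random.Random(seed)

    def accuracy(data):
        return sum(predict(r) == y for r, y in zip(data, labels)) / len(labels)

    baseline = accuracy(rows)
    importances = []
    for j in range(n_features):
        column = [r[j] for r in rows]
        rng.shuffle(column)
        # Rebuild each row with feature j replaced by a shuffled value.
        permuted = [r[:j] + (v,) + r[j + 1:] for r, v in zip(rows, column)]
        importances.append(baseline - accuracy(permuted))
    return importances

# Hypothetical black-box model: decides on feature 0 only, ignores feature 1.
black_box = lambda row: int(row[0] > 0.5)

rng = random.Random(7)
rows = [(rng.random(), rng.random()) for _ in range(200)]
labels = [black_box(r) for r in rows]

importances = permutation_importance(black_box, rows, labels, n_features=2)
print("importance per feature:", importances)
```

A feature the model truly ignores gets an importance of zero, while a feature that drives the decision shows a clear accuracy drop. A test can then flag models whose decisions rest on features that should not matter.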
4. Security Testing for AI/ML models: Among the quality attributes in ISO 25010, security, with its broad needs of confidentiality, integrity, and availability, is a vital attribute to test. For AI/ML, specific security needs arise from a new category of threats, namely adversarial attacks, which attack models with poisoned or manipulated data in order to fool them. It is important that security testing for adversarial attacks be a primary component of any AI/ML test. We should also test for potential attacks on the current training data. Such tests can simulate both kinds of attacks below:
4.1 White Box attacks: Here the attacker knows the parameters of the model and uses that knowledge to mount the attack.
4.2 Black Box attacks: Here the attacks are carried out without any knowledge of the model's internals.
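A white box attack can be sketched on a tiny logistic regression model. The example below follows the idea of the fast gradient sign method (FGSM): use the known weights to compute the gradient of the loss with respect to the input, then nudge each feature by a small epsilon in the direction that increases the loss. The weights, input, and epsilon are assumed values chosen only for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, x):
    """Probability of class 1 under a logistic regression model."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

def fgsm_attack(weights, bias, x, y_true, epsilon):
    """White box: uses the model weights to build the perturbation.
    For log loss, d(loss)/dx_i = (p - y_true) * w_i."""
    p = predict_proba(weights, bias, x)
    grad = [(p - y_true) * w for w in weights]
    sign = lambda g: (g > 0) - (g < 0)
    return [xi + epsilon * sign(g) for xi, g in zip(x, grad)]

# Assumed model parameters and a correctly classified input of class 1.
weights, bias = [2.0, -1.0], 0.0
x, y_true = [0.3, 0.2], 1

x_adv = fgsm_attack(weights, bias, x, y_true, epsilon=0.3)
p_clean = predict_proba(weights, bias, x)
p_adv = predict_proba(weights, bias, x_adv)
print(f"clean p(class 1) = {p_clean:.3f}, adversarial p(class 1) = {p_adv:.3f}")
```

A security test can assert that predictions stay stable for perturbations up to a chosen epsilon. A black box variant would have to estimate the same direction by querying `predict_proba` alone, without reading `weights`.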
5. Test for Coverage: While existing processes of randomized testing, including cross validation, make a reasonable effort to represent the broad variety of potential unseen data, they still fall short of covering all kinds of input scenarios. It is therefore important to take a systematic approach to ensuring coverage of scenarios (a concept very common in conventional testing, for example MC/DC coverage). This can be achieved in two ways:
5.1 Pseudo Oracle Based Metamorphic Testing: In conventional testing, we have the luxury of a fixed expected output against which we can check each input test case. In AI, however, we usually do not have such an oracle, because the outputs come from models that capture patterns in data rather than from a fixed specification, and they can be non-deterministic. To overcome the lack of an oracle, we take recourse to a pseudo oracle approach via metamorphic testing. In metamorphic testing, we assert metamorphic relations: relations that must hold between the outputs for the original inputs and the outputs for transformed versions of those inputs. For example, suppose we have a model with a certain accuracy. We can assert a metamorphic relation that a certain set of input transformations, say image rotation, does not change the accuracy significantly. Using such metamorphic relations, we can identify the portions of the input space not covered adequately by the current model.
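The rotation relation described above can be sketched without any labels at all: we only compare the model's prediction on an input with its prediction on the rotated input. The two toy "models" and the 2x2 "images" below are hypothetical and exist only to show one model that satisfies the relation and one that violates it.

```python
def rotate90(image):
    """Rotate a square image (list of rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def violation_rate(model, inputs, transform):
    """Fraction of inputs whose prediction changes under the transform.
    No ground-truth labels are needed: the relation is the pseudo oracle."""
    changed = sum(model(x) != model(transform(x)) for x in inputs)
    return changed / len(inputs)

def mean_brightness_model(image):
    """Toy model that should be rotation invariant."""
    pixels = [p for row in image for p in row]
    return int(sum(pixels) / len(pixels) > 0.5)

def top_left_model(image):
    """Toy model that wrongly depends on pixel position."""
    return int(image[0][0] > 0.5)

images = [
    [[0.9, 0.1], [0.1, 0.1]],
    [[0.9, 0.8], [0.7, 0.9]],
    [[0.1, 0.2], [0.9, 0.1]],
    [[0.2, 0.9], [0.8, 0.9]],
]

print("mean brightness model:", violation_rate(mean_brightness_model, images, rotate90))
print("top-left pixel model:", violation_rate(top_left_model, images, rotate90))
```

In a real test suite the pass criterion would be a tolerance on the violation rate or on the change in accuracy, chosen per transformation.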
5.2 White box coverage based testing: Here we look at the model architecture and evaluate coverage of the actual model structure. The idea behind such coverage techniques is to generate input data via transformations that trigger specific parts of the model which are not adequately covered, and which therefore lead to poor performance on those transformations. Several such coverage notions have emerged in the literature on testing AI models. Prominent coverage metrics include:
5.2.1 Neuron Coverage based testing: Here the basic notion is neuron coverage, which measures the fraction of neurons that are activated (based on a threshold) by a test set of inputs. The first system built on this notion was DeepXplore https://arxiv.org/abs/1705.06640
5.2.2 Layer level coverage: Instead of a universal neuron level coverage criterion, this approach looks at the top neurons at each layer and their combinations. It was shown to be more effective in the system DeepGauge https://tingsu.github.io/files/ase18-deepgauge.pdf
5.2.3 Sign Value combination coverage: Extending the notion of MC/DC coverage from conventional testing to neural networks, this uses the level of influence of neurons on one another. It can be of four types: Sign-Sign, Sign-Value, Value-Sign, and Value-Value. To illustrate, Sign-Sign coverage tests whether a change in the output sign of a neuron in layer n independently affects the output sign of a specific neuron in layer n + 1. Likewise, Value-Value coverage determines whether a change in the output value of a neuron in layer n independently affects the output value of a specific neuron in layer n + 1.
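To illustrate the neuron coverage notion from 5.2.1, here is a minimal sketch on a hand-written two-layer ReLU network. The network weights and inputs are assumed toy values, and the code thresholds raw activations directly, which is a simplification of what a full tool would do.

```python
def relu(values):
    return [max(0.0, v) for v in values]

def forward(layers, x):
    """Return the activations of every neuron, layer by layer."""
    activations = []
    for weights, biases in layers:
        x = relu([sum(w * xi for w, xi in zip(row, x)) + b
                  for row, b in zip(weights, biases)])
        activations.append(x)
    return activations

def neuron_coverage(layers, inputs, threshold=0.0):
    """Fraction of neurons whose activation exceeds the threshold
    for at least one input in the test set."""
    total = sum(len(biases) for _, biases in layers)
    covered = set()
    for x in inputs:
        for layer_idx, layer_act in enumerate(forward(layers, x)):
            for neuron_idx, a in enumerate(layer_act):
                if a > threshold:
                    covered.add((layer_idx, neuron_idx))
    return len(covered) / total

# Assumed toy network: 2 inputs -> 3 hidden neurons -> 2 hidden neurons.
layers = [
    ([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]], [0.0, 0.0, 0.0]),
    ([[1.0, 1.0, 0.0], [0.0, 0.0, 1.0]], [0.0, 0.0]),
]

print(neuron_coverage(layers, [[1.0, 1.0]]))                # 3 of 5 neurons
print(neuron_coverage(layers, [[1.0, 1.0], [-1.0, -1.0]]))  # all 5 neurons
```

The second call shows the testing use: adding an input that exercises the previously silent neurons raises coverage, and such inputs are the ones worth generating and checking.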
6. Test for Bias / Fairness: This is one of the crucial tests in view of the recent large scale withdrawal of AI models owing to their discriminatory behaviour. We should test ML models for discriminatory behaviour based on specific attributes such as gender or race. A fairness test is therefore very important for all AI/ML models.
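One simple way to make such a fairness test concrete is to compare the rate of positive predictions across groups defined by a protected attribute. The sketch below computes two common summary numbers, the demographic parity difference and the disparate impact ratio. The function names, group labels, and predictions are hypothetical, and these two metrics are only one of several possible fairness definitions.

```python
def selection_rates(predictions, groups):
    """Share of positive (1) predictions within each group."""
    totals, positives = {}, {}
    for p, g in zip(predictions, groups):
        totals[g] = totals.get(g, 0) + 1
        positives[g] = positives.get(g, 0) + (1 if p == 1 else 0)
    return {g: positives[g] / totals[g] for g in totals}

def demographic_parity_difference(predictions, groups):
    """Gap between the highest and lowest group selection rate (0 is ideal)."""
    rates = selection_rates(predictions, groups).values()
    return max(rates) - min(rates)

def disparate_impact_ratio(predictions, groups):
    """Lowest selection rate divided by the highest (1 is ideal)."""
    rates = selection_rates(predictions, groups).values()
    highest = max(rates)
    return 1.0 if highest == 0 else min(rates) / highest

# Assumed model outputs for eight applicants in two groups.
predictions = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

print(selection_rates(predictions, groups))
print(demographic_parity_difference(predictions, groups))
print(disparate_impact_ratio(predictions, groups))
```

A test would fail the model when these numbers cross a threshold agreed for the application, for example a ratio below 0.8, which we use here only as an assumed example value.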
7. Test for Privacy: People's heightened awareness of how IT systems use their personal data (commonly referred to as PII, personally identifiable information) is making it mandatory to test whether systems leak any PII. For AI/ML systems it is vital that we test models for privacy breaches at two levels: first, at the model level, by testing for privacy attacks that make it possible to infer data from the model; and second, at the data level, by checking whether the inferred data has PII embedded in it.
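The second level, checking inferred data for embedded PII, can be sketched as a simple pattern scan. We assume here that a privacy attack has already produced a list of text strings recovered from the model. The two regular expressions are deliberately rough illustrations and would miss many real forms of PII.

```python
import re

# Illustrative patterns only; real PII detection needs far broader rules.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def find_pii(texts):
    """Return (text_index, kind, matched_string) for every pattern hit."""
    findings = []
    for index, text in enumerate(texts):
        for kind, pattern in PII_PATTERNS.items():
            for match in pattern.finditer(text):
                findings.append((index, kind, match.group()))
    return findings

# Hypothetical strings recovered from a model by an inference attack.
inferred_outputs = [
    "The weather is mild today.",
    "Contact jane.doe@example.com or +1 555 010 9999.",
]

for finding in find_pii(inferred_outputs):
    print(finding)
```

A privacy test passes only when the scan over the recovered data returns no findings.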
8. Test for Performance: A crucial test for any IT system is whether it can cater to real end user needs. The system should be scaled up and tested to see whether it can handle different patterns of input load, including spikes such as those an e-commerce site can expect on Boxing Day. ML model deployments should therefore be thoroughly tested for their metrics under different load patterns.
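As a rough sketch of testing under different load patterns, the code below fires a given number of concurrent requests per tick at a prediction function and reports latency percentiles. The `stub_predict` function, the worker count, and the two patterns are assumptions; a real test would target the deployed model endpoint with a dedicated load testing tool.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor

def percentile(sorted_values, q):
    """Nearest-rank percentile of an already sorted list."""
    rank = max(1, math.ceil(q / 100 * len(sorted_values)))
    return sorted_values[rank - 1]

def timed_call(predict, payload, submitted_at):
    """Latency from submission to completion, including queueing time."""
    predict(payload)
    return time.perf_counter() - submitted_at

def run_load_pattern(predict, pattern, payload, workers=4):
    """pattern lists how many requests arrive together in each tick."""
    latencies = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for n_requests in pattern:
            futures = [
                pool.submit(timed_call, predict, payload, time.perf_counter())
                for _ in range(n_requests)
            ]
            latencies.extend(f.result() for f in futures)
    latencies.sort()
    return {
        "requests": len(latencies),
        "p50": percentile(latencies, 50),
        "p95": percentile(latencies, 95),
        "max": latencies[-1],
    }

def stub_predict(payload):
    """Hypothetical stand-in for a model call taking about one millisecond."""
    time.sleep(0.001)
    return 0

steady = run_load_pattern(stub_predict, [5, 5, 5, 5], payload=[0.1, 0.2])
spike = run_load_pattern(stub_predict, [5, 5, 50, 5], payload=[0.1, 0.2])
print("steady:", steady)
print("spike: ", spike)
```

Because only a fixed number of workers serve the requests, the spike tick queues up and its tail latency grows, which is the behaviour a performance test should bound.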
9. Test for Concept Drift: In the lab, ML model development usually happens on historical data, whereas a deployed model is subjected to live, real world data. In many scenarios the pattern captured in the training data is no longer followed by the data seen after deployment, a phenomenon known as data drift. For example, in a fashion AI application the fashion trends may change every quarter. This necessitates constant testing of deployed AI/ML models to check for data drift, and hence for the model drift that causes the deployed model to perform badly on newer data.
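One way to check a single numeric feature for data drift is to compare its distribution in the training data with its distribution in recent production data. The sketch below computes the two-sample Kolmogorov-Smirnov statistic, the largest gap between the two empirical cumulative distributions. The alert threshold of 0.3 and the sample values are assumptions for illustration.

```python
import bisect

def ks_statistic(reference, current):
    """Largest absolute gap between the two empirical CDFs (0 to 1)."""
    ref, cur = sorted(reference), sorted(current)
    gap = 0.0
    # The maximum gap of two step functions occurs at an observed value.
    for value in set(ref) | set(cur):
        cdf_ref = bisect.bisect_right(ref, value) / len(ref)
        cdf_cur = bisect.bisect_right(cur, value) / len(cur)
        gap = max(gap, abs(cdf_ref - cdf_cur))
    return gap

def drift_detected(reference, current, threshold=0.3):
    """Flag drift when the gap exceeds an assumed alert threshold."""
    return ks_statistic(reference, current) > threshold

# Assumed feature values: training data versus two production windows.
training = [1, 2, 3, 4]
last_week = [1, 2, 3, 4]
this_week = [3, 4, 5, 6]

print(ks_statistic(training, last_week), drift_detected(training, last_week))
print(ks_statistic(training, this_week), drift_detected(training, this_week))
```

Running such a check on a schedule for each important feature, and for the model's outputs, gives an early signal that the model may need retraining.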
10. Test for Agency: This covers a broad set of additional AI induced characteristics, as per the EuroSTAR 2018 tutorial "Testing Intelligent Machines" by Rik Marselis (slideshare.net), which call for testing closeness to human behaviour. These include dimensions of personality such as mood, and natural interaction under intelligent behaviour. We feel a semi-automated, A/B testing like process should be in place to compare human and model performance, or to compare two different models, in order to evaluate AI/ML models on quality dimensions like natural interaction and personality.
Overall, we feel that this set of 10 tests on AI, ML, and DL models gives good assurance that our AI models perform well in the field, and thereby helps reduce AI accidents.
At testAIng.com, the product AIEnsured (https://testaing.com/product/) is aimed at delivering this comprehensive set of tests for AI/ML/DL models.
We hope to see accident-free and incident-free AI, ML, and DL deployments through extensive testing.
- Dr Srinivas Padmanabhuni
srinivas AT testAIng.com