Can we trust AI with the most important medical decisions?
This post concerns building reliable artificial intelligence (AI) systems that we can trust. We will look at reliable AI with respect to a specific application: the Child Growth Monitor (CGM), an AI-driven app to detect malnutrition among children.
Building a reliable AI system cannot be achieved by a single method alone. Only a combination of multiple approaches makes a system reliable. We will discuss some approaches that look directly at the data, as well as approaches that analyze model predictions. We will also look at monitoring the deployed system and at organizational methods.
The author works as an AI Engineer in the CGM project. Together with his team, he builds AI models while understanding practical use, bias, interpretability, and related topics.
Project context
CGM is a project of Welthungerhilfe, a German NGO working in the fields of development cooperation and emergency aid. The project’s goal is: Zero hunger by 2030! This is very ambitious and can only be achieved through innovative, disruptive solutions.
Hunger and malnutrition are not only a lack of food; they are a more complex health issue. Child malnutrition is a global problem. Parents and frontline workers often don’t know that a child is malnourished and take measures too late. The magnitude of the nutrition crisis, both in emergencies and in chronic hunger situations, is frequently unclear. Current standardized measurements done by aid organizations and frontline healthcare workers are time-consuming and expensive. Children move, and accurate measurement, especially of body height, is often difficult. As a result, data on the nutritional status of children are often unreliable or non-existent. This hampers a determined response by emergency workers and policymakers.
Anthropometric measurement by smartphone app can be a game-changer for data collection and processing in the case of malnourished children under the age of 5 years. The solution is based on a mobile app using augmented reality in combination with artificial intelligence. By determining weight and height through a 3D scan of children, the app provides a quick and touchless way to measure children and detect early warning signs of malnutrition.
Welthungerhilfe receives support from leading IT partners for the research and development of the measurement prediction. The CGM project is open-source: code and the full documentation can be accessed via GitHub.
AI-based predictions
In order to diagnose malnutrition, public health organizations and NGOs are mostly interested in measuring body height, weight, and mid-upper arm circumference (MUAC). According to the World Health Organization, the combination of these three measurements with the age and sex of a child enables a diagnosis of malnutrition.
Throughout the CGM project, the data of tens of thousands of children have been collected. The collection procedure is as follows: Enumerators take both a scan with a smartphone and a manual measurement of body height, body weight, and MUAC. The gathered data allow us to train machine learning algorithms.
Once trained, the machine learning model will be able to predict the body height, body weight, and MUAC of children. Hence, once the algorithm is deployed, manual measurements become largely unnecessary.
To evaluate our system, we compare the manual measurements to the model predictions. The difference between measurement and prediction is summarized by the mean absolute error (MAE). A small MAE on the test set indicates that our model predicts accurately for the scans of children in that set.
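The MAE is the average of the absolute differences between each manual measurement and the corresponding prediction: MAE = (1/n) * sum(|measured_i - predicted_i|). The following is a minimal sketch of that computation. The function name and the height values are made up for illustration and are not taken from the CGM code base or its data.

```python
from statistics import fmean


def mean_absolute_error(measured, predicted):
    """Average absolute difference between manual measurements and predictions."""
    if len(measured) != len(predicted):
        raise ValueError("measured and predicted must have the same length")
    if not measured:
        raise ValueError("need at least one measurement")
    return fmean(abs(m - p) for m, p in zip(measured, predicted))


# Hypothetical body heights in cm (illustrative values, not project data)
manual_heights_cm = [96.5, 88.0, 102.3]
predicted_heights_cm = [97.0, 87.0, 101.3]

height_mae_cm = mean_absolute_error(manual_heights_cm, predicted_heights_cm)
print(f"MAE: {height_mae_cm:.2f} cm")
```

Because the MAE is expressed in the same unit as the measurement (here centimeters), it is easy to compare against the accuracy that a manual measurement achieves.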
What is reliability?
Reliability is a crucial concept for the CGM project. The term “reliability” is used in different fields: in statistics, reliability is the overall consistency of a measure. In computer networking, a reliable protocol notifies the sender whether the delivery of data to the intended recipients was successful. In the context of CGM, we define reliability as the quality of being trustworthy and of performing consistently well.
Given the data of an individual child, the CGM application must determine whether it can make a confident prediction. If a confident prediction cannot be made, a reliable system will not blindly predict. Instead, it can let the user know about this uncertainty. Only by detecting its own uncertainty can an AI system avoid harming a child. With respect to diagnosing whether a child is malnourished, a fully reliable system would not make any false negative predictions, that is, it would never classify a malnourished child as not malnourished.
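To make the notion of a false negative concrete, here is a small sketch that counts them for a system that is allowed to abstain. Representing an abstention as `None` and the function name are assumptions made for this example; the labels are invented.

```python
def count_false_negatives(is_malnourished, predicted):
    """Count malnourished children that were predicted as not malnourished.

    Entries of `predicted` are True, False, or None (None = system abstained).
    Returns a tuple (false_negatives, abstentions).
    """
    false_negatives = 0
    abstentions = 0
    for truth, pred in zip(is_malnourished, predicted):
        if pred is None:
            # The system admitted its uncertainty instead of guessing
            abstentions += 1
        elif truth and not pred:
            # A malnourished child was declared not malnourished
            false_negatives += 1
    return false_negatives, abstentions


# Hypothetical ground truth and predictions for four children
truth = [True, True, False, True]
preds = [True, None, False, False]

fn, abstained = count_false_negatives(truth, preds)
print(f"false negatives: {fn}, abstentions: {abstained}")
```

An abstention is not counted as a false negative here, because the child would then be measured manually rather than being wrongly declared healthy.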
How can we ensure reliability?
In order to ensure that our AI-system is reliable, we use a variety of approaches.
Reject problematic scans
Before even considering ML models, let us only consider the data. In our application, we expect the scanned images to contain a young child in a certain position, e.g., standing and facing the camera.
If the scan’s images are already erroneous, even a good model won’t be able to make any decent predictions — but wait a minute: What could possibly be wrong with the data? Actually a lot of things can go wrong: Imagine an app user didn’t know how to use the app properly, hence scanned an elderly man — but of course the app is intended only for children. Here is another thing that could go wrong: There could be no person in the image at all, because the phone was angled at the floor, or perhaps only the feet of the child were scanned. There could also be multiple children in one image. Due to bad lighting conditions the image could appear completely dark.
Ok, it turns out a scan can go dreadfully wrong in lots of different ways. How do we address this? We follow two approaches, one short-term and one mid-term.
First approach (short-term): We are building a scan-inspection tool where scans are shown to human annotators, who will check if the scans were taken correctly. For example: The annotator would see if the image doesn’t contain a child, because the smartphone is actually facing the floor.
Second approach (mid-term): We build automated ways to inspect the scan. Algorithms like PoseNet [2] can mark images that do not contain a single child. Face detection algorithms can check whether the child is facing the camera. Moreover, we can classify a child’s scan into standing vs. lying to check if the child is in the expected pose, which was selected in the app.
Unlike a human annotator, an algorithm does not get tired and can reject problematic scans 24/7. A human annotator, on the other hand, can detect special cases that the algorithm doesn’t know how to deal with.
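Some of the automated checks described above can be sketched in a few lines. The example below assumes a grayscale image given as rows of pixel values from 0 to 255, and a person count that comes from some upstream pose or person detector, which is not implemented here. The brightness threshold of 40 is an arbitrary illustrative value, not a setting used by CGM.

```python
def mean_brightness(gray_image):
    """Mean pixel value of a grayscale image given as rows of ints in 0..255."""
    pixels = [p for row in gray_image for p in row]
    if not pixels:
        raise ValueError("image is empty")
    return sum(pixels) / len(pixels)


def scan_rejection_reasons(gray_image, num_persons_detected, min_brightness=40.0):
    """Return a list of reasons to reject a scan; an empty list means 'accept'.

    `num_persons_detected` is assumed to be provided by an upstream detector.
    """
    reasons = []
    if mean_brightness(gray_image) < min_brightness:
        reasons.append("too dark")
    if num_persons_detected == 0:
        reasons.append("no person detected")
    elif num_persons_detected > 1:
        reasons.append("multiple persons detected")
    return reasons


# Tiny 2x2 toy images for illustration
dark_scan = [[5, 10], [0, 5]]
good_scan = [[120, 130], [110, 140]]

print(scan_rejection_reasons(dark_scan, num_persons_detected=0))
print(scan_rejection_reasons(good_scan, num_persons_detected=1))
```

Returning the reasons instead of a plain yes or no makes it possible to tell the app user what went wrong, so that the scan can be repeated correctly.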
Model uncertainty
In the section above, we looked at the input data; now let’s also consider the AI-model, which predicts height, weight, and MUAC. This model will be more certain for some scans and less certain for others. Usually a machine learning model is only making a prediction, e.g., “the height is 96.7cm”. This could mean that the model is quite sure that the child is 96.7cm. But hang on — this could also mean that the model is completely unsure about this prediction: It merely guessed some height it has seen in other scans before. The truth is: with such a system, we have no idea. Such a system bears a huge risk: Although the model can be uncertain, the unaware user trusts the prediction.
To tackle this problem, we look at solutions where a model not only makes a prediction but also conveys its uncertainty. For example, ensembles of deep neural networks are good at modeling uncertainty [3]. As a result, when the AI-model is certain, we can trust the model’s prediction. Conversely, when we can’t trust the model’s prediction, the application can either not show the prediction at all or at least mark it as uncertain.
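One simple way to obtain an uncertainty estimate from an ensemble is to look at how much its members disagree. The sketch below uses the standard deviation of the member predictions as that signal; this is one common choice and not necessarily the exact method of [3] or of the CGM models. The threshold of 1 cm and all prediction values are assumptions for illustration.

```python
from statistics import fmean, pstdev


def ensemble_prediction(member_predictions_cm, max_std_cm=1.0):
    """Combine the predictions of several models into one prediction.

    Returns (mean, standard deviation, is_confident). The prediction is
    treated as confident when the members disagree by at most `max_std_cm`.
    """
    if len(member_predictions_cm) < 2:
        raise ValueError("an ensemble needs at least two members")
    mean_cm = fmean(member_predictions_cm)
    std_cm = pstdev(member_predictions_cm)
    return mean_cm, std_cm, std_cm <= max_std_cm


# Hypothetical height predictions of three ensemble members for two scans
agreeing_members = [96.5, 96.7, 96.9]
disagreeing_members = [90.0, 96.0, 102.0]

for members in (agreeing_members, disagreeing_members):
    mean_cm, std_cm, confident = ensemble_prediction(members)
    label = "show prediction" if confident else "mark as uncertain"
    print(f"{mean_cm:.1f} cm, std {std_cm:.2f} cm: {label}")
```

Both scans lead to a similar mean height, yet only the first one would be shown to the user without a warning.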
Distribution of errors
If the model was not exposed during training to examples of a specific demographic subgroup, the model might perform poorly for this subgroup’s children. To address this issue, we look at the distribution of prediction errors across different subgroups. After training, we evaluate the model on data from each subgroup that was not seen during training. Only if it performs well do we use the model in a production setting.
Example: Let’s assume our AI-model fails on people with dark skin color. Until we build a better model, we can highlight predictions for this subgroup as unreliable. To mitigate the problem, we can source data from people with dark skin color and include it in the training of the AI-model. In general, the CGM team continuously improves AI-models that predict incorrectly on a certain demographic subgroup.
We found that our current models perform worse on young children. This might be due to several causes: Young children often can’t stand yet and are therefore scanned lying down. Because they fidget during the scanning process, they are intrinsically hard to measure. Since this challenge also occurs during manual measurement, the training data for young children carry more label noise.
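The per-subgroup comparison described in this section can be sketched as follows: group the absolute errors by subgroup and compute one MAE per group. The subgroup names and all values are invented for illustration and are not CGM results.

```python
from collections import defaultdict


def mae_by_subgroup(records):
    """Compute the mean absolute error separately for each subgroup.

    `records` is an iterable of (subgroup, measured, predicted) tuples.
    """
    abs_errors = defaultdict(list)
    for subgroup, measured, predicted in records:
        abs_errors[subgroup].append(abs(measured - predicted))
    return {group: sum(errs) / len(errs) for group, errs in abs_errors.items()}


# Hypothetical height measurements and predictions in cm
records = [
    ("under 2 years", 70.0, 72.0),
    ("under 2 years", 65.0, 64.0),
    ("2 to 5 years", 96.0, 96.5),
    ("2 to 5 years", 100.0, 99.5),
]

for group, mae in mae_by_subgroup(records).items():
    print(f"{group}: MAE {mae:.2f} cm")
```

A single overall MAE would hide the gap between the two groups, which is why the errors are reported per subgroup.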
Monitoring
Once the algorithm is deployed and runs in a production system, we are quite certain that it works well. It has been shown, however, that models degrade over time [1]. It may also turn out that an AI-model has learned something other than what we expect. In the infamous urban legend about tanks, a neural network trained to detect tanks learns to detect the time of day instead.
To illustrate with respect to the CGM project, let’s consider what could happen: During summer, we collect images and use them to train a model. At the end of summer, we put the model into production and it works really well. Hooray! In autumn, the model starts to get slightly worse. In winter, the model’s predictions are totally off. Damn! So in this case, the AI-model might have learned something about children in summer, yet when we use the AI-model in winter it may behave in unforeseeable ways. We don’t yet know what the cause of this misbehavior might be, but we can monitor the model and get alerted when it misbehaves.
Unforeseen challenges can be detected by monitoring our system in production. By collecting a few manual measurements once in a while, we can check whether the predictions still agree with them. If the predictive performance of the algorithm drops, CGM staff will be alerted.
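Such a monitor can be sketched as a rolling window over the most recent manual spot checks. The class name, window size, and alert threshold below are assumptions chosen for illustration; a real deployment would tune them and connect the alert to an actual notification channel.

```python
from collections import deque


class ErrorMonitor:
    """Track the MAE over the most recent spot checks and signal degradation."""

    def __init__(self, window_size=50, alert_threshold=1.5):
        # Only the last `window_size` absolute errors are kept
        self.errors = deque(maxlen=window_size)
        self.alert_threshold = alert_threshold

    def current_mae(self):
        return sum(self.errors) / len(self.errors)

    def add_spot_check(self, measured, predicted):
        """Record one manual spot check; return True if staff should be alerted."""
        self.errors.append(abs(measured - predicted))
        # Wait until the window is full to avoid alerts based on very few samples
        window_full = len(self.errors) == self.errors.maxlen
        return window_full and self.current_mae() > self.alert_threshold


# Hypothetical spot checks (manual height, predicted height) in cm
monitor = ErrorMonitor(window_size=3, alert_threshold=1.0)
spot_checks = [(96.0, 96.5), (90.0, 90.5), (100.0, 100.5), (95.0, 99.0)]
alerts = [monitor.add_spot_check(m, p) for m, p in spot_checks]
print(alerts)
```

Using a window rather than the all-time average lets the monitor react to a recent drop in quality, such as the summer-to-winter scenario above.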
Organizational Measures
To avoid flaws, the organization should have well-defined processes. In the medical domain, especially when the patient’s health is at risk, the quality of processes must be ensured. This includes communication, documentation, and the quality of the underlying software (software tests, code review, third-party libraries). Responsibilities for each step in the process of writing and running software can be formally defined and documented.
Besides the software team, well-defined processes can also help the field teams, which take scans and manual measurements. Ensuring a good data collection process will lead to high-quality data.
When we measure a child periodically, for example every three months, we have another way of ensuring precise predictions. For instance, we should become suspicious if the data show that a child goes from underweight to overweight in only three months.
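A plausibility check on repeated measurements could look like the sketch below. It assumes that each visit has already been assigned one of three ordered status categories and flags any jump of more than one category between consecutive visits. The category names, their ordering, and the rule itself are simplifications for illustration, not an official classification procedure.

```python
# Assumed ordering of status categories (illustrative simplification)
STATUS_ORDER = {"underweight": 0, "normal": 1, "overweight": 2}


def suspicious_transitions(visits, max_step=1):
    """Find implausible status jumps between consecutive visits of one child.

    `visits` is a chronologically ordered list of (visit_label, status).
    Returns a list of (previous_visit_label, visit_label) pairs to review.
    """
    flagged = []
    for (prev_label, prev_status), (label, status) in zip(visits, visits[1:]):
        step = abs(STATUS_ORDER[status] - STATUS_ORDER[prev_status])
        if step > max_step:
            flagged.append((prev_label, label))
    return flagged


# Hypothetical visits of one child, three months apart
visits = [
    ("2021-01", "underweight"),
    ("2021-04", "overweight"),
    ("2021-07", "overweight"),
]

print(suspicious_transitions(visits))
```

A flagged pair does not prove that a prediction is wrong; it only marks the measurements for review by a human.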
Conclusion
A fully reliable system can only be obtained by combining multiple approaches. This blog post presents various approaches for building a reliable system: Some approaches look at the data directly to find abnormal scans, which would result in unreliable predictions. Other approaches analyze the model’s prediction with respect to a single scan or a subgroup. Additionally, monitoring and organizational methods need to be added. That said, the list of approaches presented here is not exhaustive.
If you are interested in our AI-for-good, open-source project, feel free to reach out! We welcome contributions from innovators, nutritionists, data scientists, product experts, and many more.
References:
#datalift: Deploy data analytics and machine learning (English Edition) by AI Guild — Dânia Meira and Chris Armbruster (editors) published by Amazon Media EU S.à r.l. https://www.amazon.de/dp/B08SBR89L9/
[1] https://towardsdatascience.com/why-machine-learning-models-degrade-in-production-d0f2108e9214
[2] https://arxiv.org/abs/1505.07427
[3] https://arxiv.org/abs/2006.13570
Acknowledgements
Thanks for your help: Tristan Behrens, Markus Matiaschek, Chris Armbruster, Jasmin Ziegler, Markus Pohl.