Can we trust AI with the most important medical decisions?

The CGM project is concerned with anthropometric measurements of children. The image shows a measurement of mid-upper arm circumference using a MUAC measuring tape.

Project context

CGM is a project of Welthungerhilfe, a German NGO working in the fields of development cooperation and emergency aid. The project’s goal is: Zero hunger by 2030! This goal is very ambitious and can therefore only be achieved through innovative, disruptive solutions.

AI-based predictions

To diagnose malnutrition, public health organizations and NGOs are mostly interested in measuring body height, weight, and mid-upper arm circumference (MUAC). According to the World Health Organization, combining these three measurements with a child’s age and sex enables a diagnosis of malnutrition.

What is reliability?

Reliability is a crucial concept for the CGM project. The term “reliability” is used in different fields: in statistics, reliability is the overall consistency of a measure. In computer networking, reliability means the sender is notified whether data was successfully delivered to the intended recipients. In the context of CGM, we define reliability as the quality of being trustworthy and of performing consistently well.

How can we ensure reliability?

To ensure that our AI system is reliable, we use a variety of approaches.

Figure 1: Variety of approaches to ensure reliability

Reject problematic scans

Before even considering ML models, let us look at the data itself. In our application, we expect each scan to contain a young child in a certain position, e.g., standing and facing the camera. Scans that do not meet these expectations should be rejected before any prediction is made.
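A minimal sketch of such a pre-filter, assuming a hypothetical pipeline that attaches a pose-detector confidence and a child-detection flag to each scan (the names `ScanMetadata` and `accept_scan`, and the 0.8 threshold, are illustrative, not the project's actual code):

```python
from dataclasses import dataclass

@dataclass
class ScanMetadata:
    """Simplified metadata a scan pipeline might attach to each image."""
    pose_confidence: float  # confidence that the child is standing, facing the camera
    child_detected: bool    # whether a person detector found a child at all

def accept_scan(meta: ScanMetadata, min_pose_confidence: float = 0.8) -> bool:
    """Reject problematic scans before they ever reach the measurement model."""
    if not meta.child_detected:
        return False
    return meta.pose_confidence >= min_pose_confidence

# A confident standing pose passes; a scan with no detected child is rejected.
print(accept_scan(ScanMetadata(pose_confidence=0.95, child_detected=True)))   # True
print(accept_scan(ScanMetadata(pose_confidence=0.95, child_detected=False)))  # False
```

Rejected scans never produce a measurement at all, which is preferable to producing a wrong one.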

Model uncertainty

In the section above, we looked at the input data; now let’s also consider the AI model, which predicts height, weight, and MUAC. This model will be more certain for some scans and less certain for others. Usually a machine learning model only makes a point prediction, e.g., “the height is 96.7 cm”. This could mean that the model is quite sure that the child is 96.7 cm tall. But hang on — this could also mean that the model is completely unsure about this prediction: it merely guessed some height it has seen in other scans before. The truth is: with such a system, we have no idea. Such a system bears a huge risk: although the model can be uncertain, an unaware user will still trust the prediction.
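One common way to expose this uncertainty is to run several predictions per scan (e.g., an ensemble of models or multiple Monte Carlo dropout passes) and report their spread alongside the mean. The sketch below assumes such a list of per-scan predictions already exists; the 2 cm threshold is a made-up illustration, not the project's actual cut-off:

```python
import statistics

def predict_with_uncertainty(predictions: list[float]) -> tuple[float, float]:
    """Combine several predictions for one scan (e.g., from an ensemble or
    MC-dropout passes) into a point estimate plus an uncertainty estimate."""
    return statistics.mean(predictions), statistics.stdev(predictions)

UNCERTAINTY_THRESHOLD_CM = 2.0  # hypothetical: above this, flag the scan for review

# A tight ensemble: the model is fairly certain about this child's height.
mean_ok, std_ok = predict_with_uncertainty([96.5, 96.7, 96.8, 96.6])
# A spread-out ensemble for another scan: the prediction should not be trusted.
mean_bad, std_bad = predict_with_uncertainty([90.1, 96.7, 102.3, 94.0])

print(std_ok < UNCERTAINTY_THRESHOLD_CM)   # True  -> accept the prediction
print(std_bad < UNCERTAINTY_THRESHOLD_CM)  # False -> flag for manual measurement
```

Instead of a bare “96.7 cm”, the system can then say “96.7 cm ± 0.1 cm” — and refuse to answer when the spread is too large.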

Distribution of errors

If the model was not exposed during training to examples from a specific demographic subgroup, it might perform poorly for children from that subgroup. To address this issue, we look at the distribution of prediction errors across different subgroups. After training, the model is evaluated on subgroup data unseen during training. Only if it performs well across subgroups do we use the model in a production setting.
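Such a subgroup evaluation can be as simple as computing the mean absolute error per group on held-out data. A minimal sketch, assuming evaluation records of the form (subgroup label, predicted height, true height) — the age-bucket labels below are invented for illustration:

```python
from collections import defaultdict

def per_subgroup_mae(records) -> dict:
    """Mean absolute error per subgroup.

    records: iterable of (subgroup, predicted_cm, true_cm) tuples
    from a held-out evaluation set.
    """
    errors = defaultdict(list)
    for subgroup, predicted, true in records:
        errors[subgroup].append(abs(predicted - true))
    return {group: sum(errs) / len(errs) for group, errs in errors.items()}

records = [
    ("0-2y", 81.0, 80.0), ("0-2y", 79.5, 80.5),   # small errors
    ("2-5y", 96.7, 99.9), ("2-5y", 101.0, 97.5),  # noticeably larger errors
]
print(per_subgroup_mae(records))
```

A large gap between subgroups (here, the older bucket errs roughly three times as much) is a signal to collect more training data for the weaker subgroup before deploying.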

Monitoring

Once the algorithm is deployed and running in a production system, we might assume that it works well. It has been shown, however, that models degrade over time [1], i.e., an AI model may turn out to have learned something other than what we expect. In the infamous urban legend about tanks, a neural network trained to detect tanks learned to detect the time of day instead.
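Monitoring can catch such degradation by comparing the distribution of live predictions against a training-time baseline. A deliberately simple drift check — the 3 cm tolerance and the sample values are illustrative assumptions, and a real system would use a proper statistical test over many more predictions:

```python
import statistics

def drifted(baseline: list[float], recent: list[float],
            tolerance_cm: float = 3.0) -> bool:
    """Alert when the mean of recent height predictions moves away from the
    training-time baseline by more than `tolerance_cm`."""
    return abs(statistics.mean(recent) - statistics.mean(baseline)) > tolerance_cm

baseline = [95.0, 97.0, 96.0, 98.0]   # predictions at deployment time
recent_ok = [96.0, 97.5, 95.5]        # similar distribution: no alert
recent_bad = [104.0, 106.0, 105.0]    # shifted distribution: raise an alert

print(drifted(baseline, recent_ok))   # False
print(drifted(baseline, recent_bad))  # True
```

When the check fires, the team can investigate whether the input data has changed — new device, new lighting, new season — before the degraded model causes harm.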

Changing seasons might cause our model to misbehave

Organizational Measures

To avoid flaws, the organization should have well-defined processes. In the medical domain, especially when a patient’s health is at risk, the quality of these processes must be ensured. This includes communication, documentation, and the quality of the underlying software (tests, code reviews, third-party libraries). Responsibilities for each step in the process of writing and running software can be formally defined and documented.

Conclusion

A fully reliable system can only be obtained by combining multiple approaches. This blog post presents several: some approaches look at the data directly to find abnormal scans, which would otherwise lead to unreliable predictions. Other approaches analyze the model’s predictions, either for a single scan or for a demographic subgroup. On top of that, monitoring and organizational measures are needed. This being said, the list of approaches presented here is not exhaustive.

References:

[1] Dânia Meira and Chris Armbruster (eds.): #datalift: Deploy data analytics and machine learning. AI Guild, published by Amazon Media EU S.à r.l. https://www.amazon.de/dp/B08SBR89L9/

Acknowledgements

Thanks for your help: Tristan Behrens, Markus Matiaschek, Chris Armbruster, Jasmin Ziegler, Markus Pohl.

Markus Hinsche