Marsha Coleman-Adebayo

Studies Find Bias in AI Models that Recommend Treatments and Diagnose Diseases


Research into AI and machine learning approaches to health care suggests that they hold promise in phenotype classification, mortality and length-of-stay prediction, and intervention recommendation. But the models are generally treated as black boxes: they do not explain or justify the rationale behind their suggestions. This lack of interpretability, together with bias in their training data sets, threatens to undermine the effectiveness of these technologies in critical care.

Two studies published this week underline the difficulties that still have to be overcome in applying AI in point-of-care settings. In the first, researchers from the University of Southern California evaluated the fairness of models trained on Medical Information Mart for Intensive Care IV (MIMIC-IV), the largest publicly available medical records data set. The other, co-authored by researchers at Queen Mary University, explores the technical obstacles to training unbiased health care models. Both conclude that ostensibly “fair” models for diagnosing disease and recommending treatments are susceptible to unintended and undesirable racial and gender biases.

According to the University of Southern California researchers, MIMIC-IV includes de-identified data from 383,220 patients admitted to an intensive care unit (ICU) or the emergency department at Beth Israel Deaconess Medical Center in Boston, Massachusetts, between 2008 and 2019. The co-authors concentrated on a subset of 43,005 ICU stays, filtering out patients younger than 15, patients who had visited the ICU more than once, and patients who stayed less than 24 hours.

In one of several experiments to determine how much bias might exist in the MIMIC-IV subset, the researchers trained a model to recommend one of five categories of mechanical ventilation. They found that the model’s suggestions differed across ethnic groups. Black and Hispanic cohorts were, on average, less likely to receive ventilation treatments and also received shorter treatment durations.
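A comparison of this kind can be sketched in a few lines. The example below is only an illustration, not the researchers’ method: the function name `group_rates`, the record layout, and the toy data are all assumptions, and none of the numbers come from MIMIC-IV. It computes, per group, the share of patients for whom ventilation was recommended and the mean recommended duration.

```python
from collections import defaultdict

def group_rates(records):
    """Return {group: (share recommended ventilation, mean recommended hours)}.

    Each record is a (group, recommended, hours) tuple. This layout is a
    hypothetical one chosen for the sketch, not the MIMIC-IV schema.
    """
    totals = defaultdict(lambda: [0, 0, 0.0])  # patients, recommended, hours sum
    for group, recommended, hours in records:
        t = totals[group]
        t[0] += 1
        if recommended:
            t[1] += 1
            t[2] += hours
    return {
        g: (n_rec / n, (hrs / n_rec) if n_rec else 0.0)
        for g, (n, n_rec, hrs) in totals.items()
    }

# Hypothetical toy data, not drawn from any real data set.
toy = [
    ("A", True, 48.0), ("A", True, 72.0), ("A", False, 0.0), ("A", True, 60.0),
    ("B", True, 24.0), ("B", False, 0.0), ("B", False, 0.0), ("B", True, 36.0),
]
rates = group_rates(toy)
gap = rates["A"][0] - rates["B"][0]  # difference in recommendation rate
```

A gap in such raw rates does not by itself show that a model is unfair, since the groups may differ in clinical need, but it is the kind of disparity that prompts a closer look.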

The scientists also said that insurance status appeared to play a role in the ventilator treatment model’s decision-making. Privately insured patients tended to receive longer and more ventilation treatments than Medicaid and Medicare patients, presumably because patients with generous insurance could afford better treatment.



The investigators caution that “multiple confounders” in MIMIC-IV could have led to the bias in the ventilator predictions. Even so, they point to this as a reason to look more closely at health care models and the data sets used to train them.

The study published by the Queen Mary University researchers focused on the fairness of medical image classification. Using CheXpert, a benchmark data set for chest X-ray analysis that contains 224,316 annotated radiographs, the co-authors trained a model to predict one of five pathologies from a single image. They then looked for imbalances in the model’s predictions for male versus female patients.

Before training the model, the researchers implemented three types of “regularizers” intended to reduce bias. This had the opposite of the intended effect: the model was less fair when trained with the regularizers than when trained without them. The scientists note that one regularizer, an “equal loss” regularizer, did improve parity between male and female patients. However, this parity came at the cost of greater disparity in predictions among age groups.
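The article does not give the formula for the “equal loss” regularizer, so the following is only a rough sketch of the general idea under stated assumptions: the training objective is the average loss plus a penalty on the gap between the average losses of two groups. The function names, the weight `lam`, the choice of binary cross-entropy, and the toy inputs are all hypothetical and should not be read as the paper’s formulation.

```python
import math

def bce(p, y):
    """Binary cross-entropy for one predicted probability p and label y."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def equal_loss_objective(probs, labels, groups, lam=1.0):
    """Return (total, base, penalty).

    base is the mean loss over all samples; penalty is the absolute gap
    between the two groups' mean losses; total = base + lam * penalty.
    This is an assumed form of an "equal loss" penalty for illustration.
    """
    losses = [bce(p, y) for p, y in zip(probs, labels)]
    by_group = {}
    for loss, g in zip(losses, groups):
        by_group.setdefault(g, []).append(loss)
    if len(by_group) != 2:
        raise ValueError("this sketch assumes exactly two groups")
    mean_a, mean_b = (sum(v) / len(v) for v in by_group.values())
    base = sum(losses) / len(losses)
    penalty = abs(mean_a - mean_b)
    return base + lam * penalty, base, penalty

# Hypothetical toy predictions: the model is less confident for group "M".
probs = [0.9, 0.8, 0.6, 0.4]
labels = [1, 1, 1, 1]
groups = ["F", "F", "M", "M"]
total, base, penalty = equal_loss_objective(probs, labels, groups, lam=1.0)
```

Because the penalty only balances the groups it is told about, nothing in an objective of this shape constrains disparities along other attributes, which is consistent with the age-group side effect the researchers report.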

“Models can easily overfit the training data and thus give a false sense of fairness during training which does not generalize to the test set,” the scientists wrote. “Our results outline some of the limitations of current train time interventions for fairness in deep learning.”
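One way to see the effect the scientists describe is to compute the same fairness metric on the training split and on a held-out split. The sketch below uses the gap in true positive rate between two groups as the metric; that choice, the function names, and the toy data are assumptions made for illustration, not details taken from the study.

```python
def tpr(preds, labels):
    """True positive rate: share of actual positives that were predicted positive."""
    hits = [p for p, y in zip(preds, labels) if y == 1]
    return sum(hits) / len(hits) if hits else float("nan")

def tpr_gap(preds, labels, groups, a, b):
    """Absolute difference in true positive rate between groups a and b."""
    def subset(g):
        rows = [(p, y) for p, y, grp in zip(preds, labels, groups) if grp == g]
        return [p for p, _ in rows], [y for _, y in rows]
    return abs(tpr(*subset(a)) - tpr(*subset(b)))

# Hypothetical toy predictions (1 = pathology predicted), not real results.
groups = ["F", "F", "M", "M"]
labels = [1, 1, 1, 1]
train_preds = [1, 1, 1, 1]  # the model fits the training set perfectly
test_preds = [1, 1, 1, 0]   # and misses a positive for group "M" on new data

train_gap = tpr_gap(train_preds, labels, groups, "F", "M")
test_gap = tpr_gap(test_preds, labels, groups, "F", "M")
```

In this toy case the model looks perfectly fair on the training split and shows a gap on the held-out split, which is why fairness is usually reported on data the model has not seen.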

Both studies build on past research showing pervasive bias in predictive health care models. Because of a reluctance to release code, data sets, and techniques, much of the data used to train algorithms for diagnosing and treating diseases could perpetuate inequalities.

A U.K. group recently found that almost all eye disease data sets come from patients in Europe, North America, and China, which means that eye disease-diagnosing algorithms are less certain to work well for underrepresented racial groups. In another study, researchers at Stanford University claimed that most of the U.S. data for studies involving medical uses of AI comes from New York, California, and Massachusetts. An analysis of a UnitedHealth Group algorithm found that it could underestimate by half the number of Black patients in need of greater care.

Researchers from the University of Toronto, MIT, and the Vector Institute have shown that widely used chest X-ray data sets encode racial, gender, and socioeconomic bias. A growing body of work suggests that skin cancer-detecting algorithms tend to be less accurate when used on patients with dark skin, partly because AI models are trained mostly on images of light-skinned patients.

Bias is not a simple problem to solve, but the co-authors of one recent study advise health professionals to apply “rigorous” fairness analyses before deployment as one solution. They also suggest that clear disclaimers about the data collection process and the potential resulting bias could improve assessments for clinical use.

