Conference Reports for NATAP
  The International Liver Congress™
EASL - European Association for the
Study of the Liver
Aug 27-29
Digital ILC 2020
Suboptimal reliability of liver biopsy evaluation has implications for randomized clinical trials
June 27 - Journal of Hepatology
Beth A. Davison, PhD, Stephen A. Harrison, MD, Gad Cotter, MD, Naim Alkhouri, MD, Arun Sanyal, MD, Christopher Edwards, BS, Jerry R. Colca, PhD, Julie Iwashita, Gary G. Koch, PhD, Howard C. Dittrich, MD
• Liver biopsy is a critical inclusion criterion and outcome measure in NASH studies.
• Reader variability was tested in 678 biopsies in a NASH study of MSDC-0602K.
• Kappas were poor for the diagnosis of NASH, its resolution and fibrosis improvement.
• Almost half of the patients would have been excluded from entry by one of the readers.
• Poor reliability allows improper entry and misclassification, and diminishes the apparent treatment effect.
Background & aims

Liver biopsies are a critical component of pivotal studies in nonalcoholic steatohepatitis (NASH) constituting main inclusion criteria, risk stratification factors and endpoints. We evaluated the reliability of NASH Clinical Research Network scoring of liver biopsies in a NASH clinical trial.
Methods

Digitized slides from 678 biopsies for 339 patients with paired biopsies randomized into the EMMINENCE study examining a novel insulin sensitizer (MSDC-0602K) in NASH were read independently by three hepatopathologists blinded to treatment code and scored using the NASH CRN Histological Scoring System. Various endpoints were computed from these scores.
Results

Inter-reader linearly weighted kappas were 0.609, 0.484, 0.328, and 0.517 for steatosis, fibrosis, lobular inflammation, and ballooning, respectively. Inter-reader unweighted kappas were 0.400 for the diagnosis of NASH, 0.396 for NASH resolution without worsening fibrosis, and 0.366 for fibrosis improvement without worsening NASH. In the current study, 46.3% of the patients included in the study based on one hepatopathologist's qualifying reading were deemed by at least one of the three hepatopathologists as not meeting the study's histologic inclusion criteria. The MSDC-0602K treatment effect was lowest for those histologic features with lower inter-reader reliability. Simulations show that the lack of reliability of endpoints and inclusion criteria can drastically reduce study power - from >90% in a well-powered study to as low as 40%.
Conclusions

Reliability of hepatopathologists' liver biopsy evaluation using currently accepted criteria is suboptimal. This lack of reliability may affect NASH pivotal studies by introducing patients who do not meet NASH study entry criteria, misclassifying fibrosis subgroups, and attenuating apparent treatment effects.
Nonalcoholic fatty liver disease (NAFLD) is a global pandemic with an estimated prevalence of 24% and a rapidly increasing incidence. The prevalence of nonalcoholic steatohepatitis (NASH) is estimated to be about 20% in patients with NAFLD. NASH can progress to hepatocellular carcinoma or cirrhosis, and is the second most common reason for liver transplantation in the United States. No drugs are currently approved for the treatment of NASH, and there is a perceived urgent need to develop effective preventive and therapeutic strategies for NASH.
The U.S. National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK) established the NASH Clinical Research Network (CRN) to study "the pathogenesis of NASH, its natural history, prognostic features, and treatment" with a primary focus on clinical research. Histologically, the elements of disease activity - reflecting the biological processes driving the liver injury - include the severity of steatosis, inflammation and hepatocellular ballooning, while fibrosis reflects the stage of disease, that is, the consequence of the activity, indicating how far the disease has progressed toward cirrhosis.
NASH is diagnosed on liver biopsy as hepatic steatosis, lobular inflammation and ballooning, with or without fibrosis. The NAFLD Activity Score (NAS), a semiquantitative liver biopsy scoring system, was developed by the NASH CRN and published in 2005. The design of clinical trials in NASH, including endpoints to be used based on this scoring system, was developed by the American Association for the Study of Liver Diseases (AASLD) and further developed by the multi-stakeholder Liver Forum. This system has been used by the NASH CRN to perform several clinical trials.
Both the U.S. Food and Drug Administration (FDA) and the European Medicines Agency (EMA) have issued draft guidance for the development of drugs for NASH, largely reflecting the Liver Forum recommendations. Despite acknowledgment that liver biopsy and histology are deficient as a result of sampling error and intra- and inter-observer variability(19) - and at the same time associated with significant patient burden, invasiveness, and the associated risks of morbidity(20) and potentially even mortality - both the EMA(17) and the FDA(18) require biopsy for both entry and efficacy criteria for phase 2b and phase 3 clinical trials. Moreover, fibrosis stage is used for risk stratification. Both EMA and FDA draft guidance require that patients with fibrosis stage 2 or 3 be included in pivotal studies. Finally, given the latency period to develop cirrhosis and other clinical hepatic outcomes, outcomes-based trials for NASH are logistically challenging. Because of the perceived unmet medical need, liver histology-related outcomes have been accepted as likely surrogates for longer-term hepatic outcomes including cirrhosis (also based on biopsy) and its sequelae. Thus, both agencies accept histological measures as the sole basis for accelerated/conditional approval and the partial basis for final approval. Agencies have accepted the NASH CRN scoring system as "the best validated and most widely accepted" NASH grading system, and guidance for the appropriate study population and endpoints is given in terms of this system. For accelerated/conditional approval, the agencies require that 1-2 years of treatment with the intervention increases the proportion of patients with NASH resolution without worsening fibrosis and/or a ≥1-stage fibrosis improvement without worsening of NASH. For definitive final approval, a combined endpoint comprising all-cause mortality, liver-related transplantation, decompensated liver events and progression to cirrhosis by biopsy is required.
However, the main driver of this combined endpoint is progression to cirrhosis by liver biopsy with some studies reporting more than 80% of outcomes reliant on biopsy-determined progression to cirrhosis.
The EMMINENCE study was a phase 2 dose-ranging study of MSDC-0602K, a second-generation insulin sensitizer, designed to assess effects on liver histology. While the study found significant effects of MSDC-0602K on insulin sensitivity and liver injury markers, it failed to demonstrate significant effects on the study's primary and secondary histological endpoints. Biopsies were scored according to NASH CRN criteria by a single expert hepatopathologist, who read screening biopsies twice - first for study eligibility and a second time randomly mixed with 12-month biopsies. As pre-specified, the second read of screening biopsies was used as baseline for analyses. In addition to unexpectedly high placebo response rates, upon re-reading the same screening biopsy nearly a quarter of patients qualified to enter the study were found ineligible, and for approximately 1 in 8 patients a NASH diagnosis was not confirmed. Thus, to evaluate the reliability of NASH CRN scoring and the effect of hepatopathologists' interpretations on the endpoints in this NASH clinical trial, we conducted a second study in which biopsies from EMMINENCE were re-read by two additional expert hepatopathologists.
Histological assessment of liver biopsies serves as the reference standard for development of both non-invasive tools and drugs for the treatment of NASH. In this study, substantial discordance between repeat readings and across readers was confirmed in the context of the EMMINENCE trial which has important implications for the future of NASH trial design and analysis.
We found the reliability of NASH CRN scores for fibrosis and lobular inflammation to be particularly low, which magnifies the problem for both patient selection and endpoint assessment. First, identification of the appropriate study population for a NASH trial based on biopsy scores is uncertain. EMMINENCE histologic entry criteria relied on NAS (with an average inter-reader weighted kappa of 0.495), and scores for steatosis (kappa 0.609), ballooning (0.517), inflammation (0.328), and fibrosis (0.484). Our three expert hepatopathologists agreed unanimously that the histologic eligibility criteria were met for only about half of the patients who qualified for the EMMINENCE clinical trial. Additionally, biopsy interpretation can be subject to bias, as seen in the downward shift in baseline scores between the read done in isolation for study qualification and the later read mixed with follow-up biopsies as a basis for analysis. This temporal bias may have been introduced by pressure to qualify patients for enrollment, although appeals from investigators to qualify a disqualified biopsy occurred infrequently, or by knowledge that all biopsies being read early in the study were pre-treatment in patients for whom a biopsy was indicated. When the original reader was asked to re-read the same baseline biopsies mixed with follow-up ones, the re-reads showed histologic improvement to the degree that 16% of patients were determined to have "met" the primary endpoint of the EMMINENCE study just by the original histopathological reader re-reading the same baseline/qualifying biopsies a second time. Given that the whole treatment effect expected in a current NASH study is on the order of a 10-20% absolute increase in responders in active versus placebo, a 16% response rate observed simply by re-reading the same baseline biopsies is likely to "drown" the study's signal. Thus, liver biopsy scores as currently performed are not a reliable basis on which to determine eligibility for NASH clinical trials.
Second, fibrosis is proposed in the guidelines as a way to stratify patients with NASH (i.e., include in pivotal NASH studies only patients with fibrosis stage 2 and 3). However, the reliability of histopathological classification of fibrosis is especially poor, making the risk stratification unreliable.
Importantly, liver biopsy results were not reliable treatment response measures. Inter-reader unweighted kappas for the two intermediary approvable endpoints in NASH studies were especially poor: 0.396 for NASH resolution without worsening of fibrosis and 0.366 for fibrosis improvement without worsening of NASH. Surprisingly, the inter-reader reliability for steatosis was higher than that for fibrosis, a finding that is in contrast to the generally held belief that biopsies are most accurate in detecting fibrosis and changes in fibrosis.
In the EMMINENCE study, MSDC-0602K administration was associated with beneficial effects on multiple biomarkers of liver injury and fibrosis but only small effects on liver biopsy measures. This finding suggests that the lack of reliability of liver histopathology findings may have led to smaller apparent treatment effects, although some non-invasive markers are not directly assessing liver histology and hence the response rates are not directly comparable.
Treatment effects were smaller on histologic components with lower reliability, suggesting that the reduced efficacy observed on some components by some readers was driven, at least partially, by the unreliability of the readings. Reduced precision in a subjective assessment has been shown to be associated with smaller observed treatment effects, so this finding is not surprising.
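The attenuation described above can be sketched numerically. The following is a minimal illustration, not the authors' simulation: it assumes that reader error can be summarized as a sensitivity and specificity for classifying a patient as a "responder", that the error is the same in both arms, and it approximates power with an unpooled two-proportion z-test. The function names, the response rates, the sample size, and the sensitivity and specificity values are all hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def observed_rate(true_rate, sensitivity, specificity):
    """Response rate seen after misclassification by an imperfect reader.

    True responders are detected with probability `sensitivity`;
    true non-responders are wrongly called responders with
    probability 1 - `specificity`.
    """
    return sensitivity * true_rate + (1.0 - specificity) * (1.0 - true_rate)


def approx_power(p_active, p_placebo, n_per_arm, alpha=0.05):
    """Normal-approximation power of a two-sided two-proportion z-test."""
    diff = abs(p_active - p_placebo)
    se = sqrt(p_active * (1 - p_active) / n_per_arm
              + p_placebo * (1 - p_placebo) / n_per_arm)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(diff / se - z_crit)


# Hypothetical inputs chosen only for illustration.
true_active, true_placebo = 0.35, 0.15   # assumed true response rates
sens, spec = 0.75, 0.85                  # assumed reader accuracy
n = 100                                  # assumed patients per arm

obs_active = observed_rate(true_active, sens, spec)
obs_placebo = observed_rate(true_placebo, sens, spec)

# The true difference shrinks by the factor (sensitivity + specificity - 1).
attenuation = sens + spec - 1.0
true_diff = true_active - true_placebo
obs_diff = obs_active - obs_placebo

power_true = approx_power(true_active, true_placebo, n)
power_observed = approx_power(obs_active, obs_placebo, n)

print(f"true difference     : {true_diff:.3f}")
print(f"observed difference : {obs_diff:.3f} (factor {attenuation:.2f})")
print(f"power without error : {power_true:.2f}")
print(f"power with error    : {power_observed:.2f}")
```

Under these made-up inputs the observed difference is 0.6 times the true difference, and the approximate power falls from above 0.9 to below 0.5. The direction of the effect, not the specific numbers, is the point of the sketch.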
The intermediate endpoints suggested in the guidance documents - NASH resolution without worsening fibrosis and/or fibrosis improvement without worsening NASH - have been accepted as 'reasonably likely surrogates' by the U.S. FDA(30) but lack prospective validation in larger outcome studies showing that histologic progression in terms of these endpoints is associated with progression to clinical hepatic outcomes, or that effective reduction of these endpoints is reflected in reduced risk of clinical hepatic outcomes. Only fibrosis stage has been associated with risk of adverse liver outcomes. We have found that histologic responders have larger reductions at 12 months in AST and ALT - two well-established liver injury markers - than non-responders. An analysis of the PIVENS trial also found that ALT reductions at 96 weeks were greater in patients with histological improvement or resolution of NASH, with relative odds of improvement of 1.28 and 1.37 per 10 U/L decrease, respectively. While earlier studies including PIVENS(13) and FLINT(12) were able to demonstrate histological responses to interventions, the endpoints employed differed slightly from those proposed in the guidance, and these studies showed that different histologic features can be responsive to a given therapy.
NASH CRN researchers reported a "reasonable inter-rater reproducibility" of their scoring system with inter-rater weighted kappa coefficients of 0.79, 0.84, 0.45, and 0.56 for steatosis, fibrosis, lobular inflammation, and ballooning, respectively, for 32 adult hepatology cases read by 9 hepatopathologists; intra-rater weighted kappa coefficients were somewhat higher at 0.83, 0.85, 0.60, and 0.66, respectively. Consensus training for all readers was conducted immediately prior to assessment of intra-rater variability, which could account for a better agreement level. The NASH CRN reassessed these estimates in 2019 in 446 patients with NAFLD with paired biopsies 1 year apart, with similar inter-observer kappas for steatosis (0.77), lobular inflammation (0.46), and ballooning (0.57), but with lower reliability for fibrosis (kappa 0.75, 95% CI 0.67-0.82). Training of two pathologists failed to improve inter-observer agreement in 65 liver biopsies for suspected NAFLD, with post-training kappas of 0.74, 0.56, 0.20 and 0.18, respectively. In another series of 100 adult Iranian patients clinically diagnosed with NAFLD, biopsies were read twice by two of four pathologists using NASH CRN scoring criteria.
Inter-observer ICCs were 0.654, 0.504, 0.288, and 0.012 for steatosis, fibrosis, lobular inflammation, and ballooning, respectively; intra-observer ICCs were 0.754, 0.744, 0.420, and 0.563, respectively. In EMMINENCE, in a population of 339 adult patients with presumed NASH, our findings demonstrate lower reliability than those reported by the NASH CRN - particularly for fibrosis - with inter-reader linearly weighted kappas of 0.609, 0.484, 0.328, and 0.517 for steatosis, fibrosis, lobular inflammation, and ballooning, respectively, for 678 liver biopsies read by 3 expert hepatopathologists. Intra-reader weighted kappas for one hepatopathologist's readings of a random sample of 400 biopsies in the current study were 0.863, 0.854, 0.662 and 0.840, for steatosis, fibrosis, lobular inflammation, and ballooning, respectively. Another study, in 65 liver biopsy slides from United Kingdom patients with liver disease evaluated by three pathologists, found lower inter-observer reliability for fibrosis staging than the NASH CRN with an average weighted kappa of 0.54 for NASH CRN fibrosis stage. Although interpretation of kappa coefficients varies, values of 0.01-0.20 have been suggested as slight, 0.21-0.40 fair (as observed in our study for inflammation), 0.41-0.60 moderate (the rest of the variables assessed), 0.61-0.80 substantial (none of the variables assessed), and 0.81-1.00 nearly perfect agreement. Other researchers have suggested alternate interpretive scales, and it has been suggested that kappa <0.60 represents simply inadequate reliability.
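For readers unfamiliar with the statistic, a linearly weighted kappa for two readers and the interpretive scale quoted above can be sketched as follows. This is a minimal illustration under stated assumptions, not the analysis code used in the study: the function names are hypothetical, the example scores are invented, and only the two-reader (Cohen) form is shown, whereas the study reports agreement across three readers.

```python
from collections import Counter


def linear_weighted_kappa(reader_a, reader_b, categories):
    """Cohen's kappa with linear weights for two readers' ordinal scores.

    `categories` lists the ordinal grades in order (e.g. fibrosis 0-4).
    Disagreement between grades i and j is weighted |i - j| / (k - 1).
    """
    if len(reader_a) != len(reader_b) or not reader_a:
        raise ValueError("need two equal-length, non-empty score lists")
    k = len(categories)
    if k < 2:
        raise ValueError("need at least two categories")
    index = {grade: i for i, grade in enumerate(categories)}
    n = len(reader_a)

    # Observed weighted disagreement, averaged over biopsies.
    observed = sum(abs(index[a] - index[b])
                   for a, b in zip(reader_a, reader_b)) / (n * (k - 1))

    # Disagreement expected by chance from each reader's marginal counts.
    count_a = Counter(index[s] for s in reader_a)
    count_b = Counter(index[s] for s in reader_b)
    expected = sum(count_a[i] * count_b[j] * abs(i - j)
                   for i in range(k) for j in range(k)) / (n * n * (k - 1))
    if expected == 0:
        raise ValueError("kappa undefined: both readers used one grade only")
    return 1.0 - observed / expected


def landis_koch_label(kappa):
    """Map a kappa value onto the interpretive bands quoted in the text.

    Band lower bounds are used as cut points, so 0.609 is 'moderate'.
    """
    bands = [(0.81, "nearly perfect"), (0.61, "substantial"),
             (0.41, "moderate"), (0.21, "fair"), (0.01, "slight")]
    for lower, label in bands:
        if kappa >= lower:
            return label
    return "below the quoted scale"


# Invented fibrosis stages (0-4) from two hypothetical readers.
stages = [0, 1, 2, 3, 4]
reader_1 = [1, 2, 2, 3, 1, 0, 3, 2]
reader_2 = [1, 3, 2, 2, 1, 1, 3, 1]
print(round(linear_weighted_kappa(reader_1, reader_2, stages), 3))

# The inter-reader kappas reported in this study, labeled on the scale.
reported = {"steatosis": 0.609, "fibrosis": 0.484,
            "lobular inflammation": 0.328, "ballooning": 0.517}
for feature, value in reported.items():
    print(f"{feature}: {value} -> {landis_koch_label(value)}")
```

Linear weights give partial credit when two readers differ by a single grade, which is why weighted kappas are reported for the ordinal component scores and unweighted kappas for the binary endpoints.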
The kappa statistic adjusts the observed agreement for that expected by chance and can be illustrated as follows. If a measure is almost always read as grade 2 or 3 on a 1-4 scale (i.e., the readers almost never use grades 1 and 4), then chance alone would have the readers agree in close to 50% of the cases. Hence 70% agreement (as represented by a kappa of 0.4) is only 20 percentage points better than chance, making it likely that most "responders" are not really responders and many "non-responders" are truly responders. Nevertheless, the lack of equal proportions across response categories is known to reduce the value of kappa, and for all four binary endpoints the proportion of non-responses is much larger than the proportion of responses. Simulations conducted suggest that even if a new intervention has a substantial beneficial effect, the lack of reliability of the outcomes, and the potential introduction of a non-responsive patient subgroup due to incorrect classification, can substantially reduce the power of a clinical trial, resulting in "false negative" studies, i.e., most well-powered studies will be "negative" (P>0.05) despite the drug being substantially effective. For studies to be mostly positive, the treatment effect will need to be much higher, e.g., response rates of 50% in active vs 10% in placebo or beyond that, a treatment effect that is not realistic.
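The arithmetic behind this illustration can be reproduced in a few lines. The sketch below uses the standard definition kappa = (observed agreement - chance agreement) / (1 - chance agreement); the function names and the 90/10 and 0.88 figures in the second example are hypothetical and serve only to show how unequal category proportions lower kappa for the same raw agreement.

```python
def chance_agreement(marginals_a, marginals_b):
    """Agreement expected by chance from two readers' grade frequencies.

    Each argument maps a grade to the proportion of biopsies that the
    reader assigned to it; proportions should sum to 1 per reader.
    """
    grades = set(marginals_a) | set(marginals_b)
    return sum(marginals_a.get(g, 0.0) * marginals_b.get(g, 0.0)
               for g in grades)


def kappa_from_agreement(observed, expected):
    """Chance-corrected agreement: (p_o - p_e) / (1 - p_e)."""
    if expected >= 1.0:
        raise ValueError("kappa undefined when chance agreement is 1")
    return (observed - expected) / (1.0 - expected)


# The example in the text: both readers use only grades 2 and 3,
# about half the time each, so chance agreement is close to 50%.
balanced = {2: 0.5, 3: 0.5}
p_chance = chance_agreement(balanced, balanced)
kappa_text_example = kappa_from_agreement(0.70, p_chance)
print(f"chance agreement {p_chance:.2f}, kappa {kappa_text_example:.2f}")

# Hypothetical binary endpoint with 90% non-responders for both readers:
# the same raw agreement yields a much lower kappa than a 50/50 split.
skewed = {"non-responder": 0.9, "responder": 0.1}
raw_agreement = 0.88
kappa_skewed = kappa_from_agreement(
    raw_agreement, chance_agreement(skewed, skewed))
kappa_balanced = kappa_from_agreement(
    raw_agreement, chance_agreement(balanced, balanced))
print(f"skewed marginals  : kappa {kappa_skewed:.2f}")
print(f"balanced marginals: kappa {kappa_balanced:.2f}")
```

With the 50% chance agreement described in the text, 70% raw agreement corresponds to a kappa of 0.4, matching the figure quoted above.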
The question remains as to whether the reliability of histology results can be improved. Greater precision in the identification and scoring of individual features of NASH may be needed. Different reading schemes have been employed across clinical trials including using the qualifying read as the baseline (introducing a temporal trend), reading biopsy pairs for patients at trial's end, and having two or more pathologists read a given slide and reviewing discordances together to form a consensus. Although the NASH CRN has employed consensus scoring by a pathology committee, the reliability of scores based only on reads by individual pathologists has been reported. Quality control measures to reduce temporal drifts in scoring criteria could certainly be incorporated in trial design. Unfortunately, the current study and the existing literature do not provide guidance as to the optimal approach, highlighting the need for greater precision in histological assessment, if its use is to be continued. Recent advances in machine learning approaches, or direct quantitative measures, may provide greater precision.
Estimates of intra-reader agreement were assessed in a limited manner and are biased: re-reads of screening biopsies by Pathologist A were systematically lower than qualification reads, and Pathologist B re-read only a random sample of patients' biopsies presented as pairs, which may have introduced some bias towards higher intra-reader kappas. Pathologist C's intra-reader reliability was not assessed.
Digitized slide images rather than glass slides were read, which could have reduced the estimated reliability. While studies suggest that the primary diagnosis derived from whole slide imaging (WSI) agrees quite well with that derived from microscopy in surgical pathology across multiple organ systems, and in needle liver biopsies specifically, WSI has not been explicitly validated using the semi-quantitative scoring system employed here, and a comparison between readings of digitized slide images and glass slides was not performed in the current study.
The analysis is further limited by the specific expert histopathologists who read the biopsies. Although these were expert hepatopathologists commonly interpreting phase 2 and 3 NASH studies, it is possible that other hepatopathologists would have had better inter- and intra-reader agreements.
Given the imprecision of biopsy for both entry and endpoints that was identified in this study, it would be important to consider alternative criteria for inclusion into NASH studies as well as for assessing efficacy. In addition, as we aim to optimally develop better non-invasive biomarkers, a more reproducible measure of disease severity should be considered. Reliance on current semi-quantitative histology could lead to effective biomarkers and interventions being discarded as inaccurate or ineffective when in fact the opposite may be true.