 
 
Liver biopsy: The best, not the gold standard (EDITORIAL) vs surrogate markers
 
 
  Journal of Hepatology
Volume 50, Issue 1, Pages 1-3 (January 2009)
 
Pierre Bedossa1, Fabrice Carrat2
 
1 Department of Pathology, Beaujon Medical Center, Assistance Publique-Hopitaux de Paris, INSERM, U 773, Paris-Diderot University, Clichy, France
2 Epidemiology of Infectious Diseases, UMR-S 707, UPMC & INSERM, Public Health Unit, Assistance Publique-Hopitaux de Paris, Hopital Saint-Antoine, Paris, France
 
"Novel strategies are needed to move the field forward".
"To date, liver biopsy remains the gold/best standard for accurate staging and grading in chronic hepatitis C and the major question that remains concerns the moment at which such an accurate evaluation is needed in chronic hepatitis C"
 
while in the main article below, the authors (Shruti H. Mehta, Bryan Lau, Nezam H. Afdhal, David L. Thomas) say:
 
"A perfect surrogate marker of liver fibrosis could already exist but not be recognized....Our results strongly suggest that major improvements in surrogate markers are unlikely when evaluated against liver biopsy. Thus, novel strategies are needed to move the field forward. In particular, long-term prospective studies of markers against clinical gold standards, such as development of end-stage liver disease are needed to assess the best measures of intermediate disease stages. Likewise, the validity of all outcome measures must be carefully considered when assessing the validity of surrogate markers in biomedical research or clinical practice."
 
Fibrosis, the hallmark of chronic liver diseases, is one of the major deleterious processes associated with chronic hepatitis C. Staging of fibrosis relies on an evaluation of several histological features, including the extent of extracellular matrix deposition, the localization of the deposits within the liver lobule and changes in lobular architecture. These features are then integrated into a semiquantitative scoring system. Histological staging of fibrosis has gained acceptance as a major element in the evaluation of liver damage in hepatitis C. Indeed, staging mirrors the natural evolution of chronic hepatitis, predicts evolution toward cirrhosis and end-stage liver complications, and contributes to predicting a sustained response to antiviral treatment. This is crucial as cirrhosis, the end-point of fibrosis, is the main cause of morbidity and mortality in chronic liver diseases [1], [2], [3], [4].
 
Because fibrosis implies morphological damage, liver biopsy has come to be the natural gold standard for staging the disease. However, the high prevalence of chronic hepatitis C in addition to the cost and constraints generated by this procedure has triggered an intensive search for alternative methods for staging the disease. How to evaluate the performance of these surrogates and how the inherent limits of the biopsy influence the evaluation of accuracy of surrogates are discussed in this issue of the Journal by Mehta and colleagues [5]. This is a relevant question since liver biopsy carries potential limitations including sampling errors and interobserver variations [6], [7]. Although several means exist for minimizing these risks such as procurement of biopsies of sufficient length [8] and interpretation of biopsies by experienced liver pathologists [9], staging of fibrosis with biopsy will always carry a risk, albeit low, of misclassification thus making the term "best" standard more appropriate than "gold" standard for liver biopsy.
 
The performance of any surrogate is classically evaluated by calculating the area under the receiver operating characteristic curve (AUROC) using liver biopsy as the reference. In this setting, the AUROC represents the probability that a surrogate will correctly rank two randomly chosen patients, one with a liver biopsy considered "normal" and the other "diseased". Because liver biopsy is not the gold standard but is the best available standard, a perfect surrogate will never reach the maximal value (i.e. 1). Taking into account a range of accuracies of the biopsy and a range of prevalences of significant disease (which influence the AUROC), Mehta et al. demonstrate that, even in the most favorable scenario, an AUROC>0.90 cannot be achieved when assessing so-called "significant fibrosis", even for a perfect marker [5]. This is important for several reasons. First, studies have already shown that these maximal AUROC values have been reached for surrogates, especially when assessing cirrhosis versus non-cirrhosis, suggesting that these surrogates may be at least as good as liver biopsy in the diagnosis of cirrhosis [10]. Second, Mehta et al. suggest that a definitive method for assessing the performance of surrogate markers would employ a clinical end-point rather than biopsy as the gold standard. These conclusions should be discussed in further detail before accepting them definitively.
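The ranking interpretation of the AUROC described above can be made concrete with a small sketch. The function name `auroc` and the marker values are hypothetical, chosen only for illustration; ties are counted as half a correct ranking, which is a common convention rather than something stated in the text.

```python
from itertools import product


def auroc(marker_values, biopsy_diseased):
    """Fraction of (diseased, normal) patient pairs that the marker ranks
    correctly, using the biopsy label as the reference. Ties count as 0.5."""
    pos = [m for m, d in zip(marker_values, biopsy_diseased) if d]
    neg = [m for m, d in zip(marker_values, biopsy_diseased) if not d]
    if not pos or not neg:
        raise ValueError("need at least one patient in each biopsy class")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))


# Hypothetical marker values and biopsy classifications for six patients.
marker = [0.2, 0.4, 0.35, 0.8, 0.7, 0.3]
diseased = [False, False, True, True, True, False]
print(auroc(marker, diseased))
```

Because the labels come from the biopsy, any patient the biopsy misclassifies is counted on the wrong side of the comparison, which is the mechanism the editorial goes on to discuss.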
 
The main alternatives to liver biopsy that have been developed in the past 10 years are based on two very different concepts: serum markers and liver stiffness [11]. They differ substantially both in their rationale and in their conception.
 
Stiffness, as assessed by ultrasound (Fibroscan) and more recently by MRI, evaluates the velocity of propagation of a shock wave within the liver tissue. This method examines a physical parameter of liver tissue which is related to its elasticity. Thus, liver biopsy is used to choose the best discriminative thresholds to predict histological stage. The main drawback is that additional space-occupying lesions often encountered in hepatitis C, such as steatosis, edema and inflammation, develop within an organ wrapped in a distensible but non-elastic envelope (Glisson's capsule), contribute to modifying liver texture and may act as confounding factors where stiffness is concerned. Nevertheless, there exist strong arguments supporting the hypothesis that elasticity parallels staging at precirrhotic or cirrhotic stages. A recent meta-analysis showed that the AUROC reaches the "holy grail" of 0.90 for diagnosis of cirrhosis with Fibroscan [8]. However, it is noteworthy that changing the definition of "diseased" liver from F4 to F3F4 or F2F3F4 is associated with a progressive decrease in the AUROC, suggesting that this approach is valid for diagnosis of cirrhosis but less adequate when assessing the transition from one stage to the next, a crucial goal for treatment decisions or patient follow-up. In this setting, the proposal of a clinical reference (liver-related death, end-stage liver complications) for comparing the performances of Fibroscan and biopsy for diagnosis of cirrhosis is meaningful and seems feasible. In the meantime, assessing the prognostic value of the wide range of stiffnesses observed within cirrhotic livers should be useful, since this would overcome one major limitation of the biopsy (i.e. one histological stage for all types of cirrhosis).
 
Validation of surrogates compared to a reference other than biopsy is completely different when addressing serum markers. Serum markers are combinations of several blood parameters that are optimized to mirror the stage of liver fibrosis. Despite the large number of proposed combinations, they are all designed in the same way: they are meant to optimize the choice of blood parameters and to tune the algorithm so that it matches histological stages as assessed using liver biopsy. This is a fundamental difference compared to Fibroscan. While Fibroscan assesses one genuine characteristic of liver tissue, a serum marker algorithm is built to mimic biopsy irrespective of the biopsy's accuracy. In that case, the findings presented by Mehta et al. will hold only if biopsy and serum marker misclassifications are not correlated at any given stage of fibrosis - a challenging hypothesis. Otherwise, since biopsy was used for choosing the optimal combination of serum markers, a perfect serum marker could theoretically reach an AUROC of 1.0, and a lower AUROC value is related to the serum marker's own limitations rather than to the limitations of biopsy for assessing fibrosis.
 
One major limitation of any of these surrogates lies in their conception and/or their validation using a dichotomized approach (significant versus non-significant fibrosis). In addition to the question of what is considered to be "significant" fibrosis, a definition which varies according to the study and the aims pursued, staging fibrosis cannot be summed up by such a binary approach. Histological staging systems comprise 5 (METAVIR) or even 7 (Ishak score) different stages [7], [12]. This level of complexity has been shown to be relevant not only for individual assessment and follow-up of disease evolution, but also for defining the rate of fibrosis progression and the right moment for using antiviral therapy or starting prevention of complications from cirrhosis. The dichotomized approach used for surrogates is imposed by the use of the AUROC, which tests a binary hypothesis. Using this approach there is a significant loss of information and a dependency on the proportion of each stage of fibrosis in the study sample. Other accuracy measures designed for an ordinal gold standard have recently been published and should overcome these limitations [13]. In most works, however, these limitations have been bypassed by considering the different histological stages as linear variables and extrapolating intermediate values for each of the stages. This is an erroneous supposition, since scores are categories, not continuous variables. When considering the extent of fibrosis, a variable that can be easily quantified by image analysis, studies have shown the absence of linearity between extent of fibrosis and histological stage [8], [14]. Such an approximation explains why, when considering only adjacent stages (F1 vs. F2 or F2 vs. F3...), AUROC values are unacceptably low, prompting us to consider the surrogate as an inadequate tool for individual follow-up [15].
 
There is an urgent need to pursue the development of a surrogate for staging fibrosis. Because of the conditional relationship with biopsy, the serum marker might represent a dead-end. Hopefully, physical imaging will eventually be refined to an acceptable level of accuracy, especially for evaluation of early stages of fibrosis. Indeed, promising results have recently been shown using elastography with MRI.
 
Although much effort has been made in evaluation of fibrosis as a major decision criterion for hepatologists, it is only one among the many elementary histopathologic features present at the same time on a liver biopsy performed for hepatitis C. Fibrosis is not an autonomous feature, but rather a tissue lesion resulting from other pathologic mechanisms such as inflammatory, degenerative or dystrophic processes, and in turn leading to further pathology such as hepatocellular carcinoma and portal hypertension. In order to provide relevant information, fibrosis should be viewed in light of its full histopathologic context. Simultaneous evaluation of necroinflammation makes it possible to assess whether fibrosis is the result of a past event that has stabilized or even regressed, or is an ongoing process that may continue to worsen. Frequently, biopsy also detects associated lesions such as steatosis or steatohepatitis which provide information useful for management and prognosis of patients with chronic hepatitis C [16]. Finally, it is noteworthy that, in diseases with a high prevalence, like hepatitis C, liver biopsy may also reveal that abnormal liver function tests are related to an unexpected liver disease in addition to hepatitis C. Clearly, all this information may influence patient management. Therefore, equating chronic liver disease with the extent of fibrosis alone is an oversimplification that could be useful for physicians but could also prove misleading.

 
After more than 10 years of active investigation, alternatives to liver biopsy for staging chronic liver diseases have revealed both their strengths and weaknesses. As emphasized by Mehta et al., "Novel strategies are needed to move the field forward". This implies not only long-term prospective studies using clinical end-points to validate surrogate markers, which might be difficult to perform, especially when validating markers for the diagnosis of early stages of fibrosis, but also the development of innovative tools. Whether these tools will reach a satisfactory level of accuracy prior to the discovery of highly efficient and innocuous antiviral treatments remains an open question.
 
To date, liver biopsy remains the gold/best standard for accurate staging and grading in chronic hepatitis C and the major question that remains concerns the moment at which such an accurate evaluation is needed in chronic hepatitis C [17].
 

Exceeding the limits of liver histology markers

 
Jnl of Hepatology (January 2009)
Shruti H. Mehta1, Bryan Lau1,2, Nezam H. Afdhal3, David L. Thomas1,2
 
1 Department of Epidemiology, Johns Hopkins Bloomberg School of Public Health, 1830 E Monument St, Room 455-ID, Baltimore, MD 21287, USA
2 Department of Medicine, Johns Hopkins School of Medicine, Baltimore, MD, USA
3 Liver Center, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, MA, USA
 
Background/Aims
 
Alternatives to liver biopsy for staging liver disease caused by hepatitis C virus (HCV) have not appeared accurate enough for widespread clinical use. We characterized the magnitude of the impact of error in the "gold standard" on the observed diagnostic accuracy of surrogate markers.
 
Methods
 
We calculated the area under the receiver operating characteristic curve (AUROC) for a surrogate marker against the gold standard (biopsy) for a range of possible performances of each test (biopsy and marker) against truth and a gradient of clinically significant disease prevalence.
 
Results
 
In the 'best' scenario where liver biopsy accuracy is highest (sensitivity and specificity of biopsy are 90%) and the prevalence of significant disease 40%, the calculated AUROC would be 0.90 for a perfect marker (99% actual accuracy) which is within the range of what has already been observed. With lower biopsy sensitivity and specificity, AUROC determinations >0.90 could not be achieved even for a marker that perfectly measured disease.
 
Conclusions
 
We demonstrate that error in the liver biopsy result itself makes it impossible to distinguish a perfect surrogate from ones that are now judged by some as clinically unacceptable. An alternative gold standard is needed to assess the accuracy of tests used to stage HCV-related liver disease.
 
Background
 
Liver biopsy is widely considered the gold standard for assessment of treatment urgency in persons with hepatitis C virus (HCV)-related liver disease [1], [2], [3]. Because of biopsy expense and medical risk, there is a widespread effort to develop a safer, less expensive surrogate [4], [5]. Candidate surrogates have included blood tests, algorithms based on the results of multiple serum markers [6], [7], [8], [9], [10], [11], [12], liver elastography [13], and others. However, in scores of studies of different surrogates, the diagnostic accuracy of candidate tests (compared to biopsy) has failed to exceed an area under the receiver operating characteristic curve (AUROC) of 0.88 [6], [7], [8], [9], [10], [11], [12], [14]. A recent review of studies of the most widely validated surrogate markers, FibroTest and Fibroscan, reinforced that surrogate markers have not been widely adopted in clinical practice primarily because of these perceived limitations in diagnostic accuracy [15].
 
It is widely appreciated that there is error in the liver biopsy measurement itself. Marked reductions in the sensitivity for detection of significant fibrosis have been demonstrated with biopsies less than 3cm in length [16], [17], fragmentation [18] and steatosis [19] which, together with regional differences in fibrosis (e.g., left vs. right lobe) and lack of agreement among those examining slides, comprise error in this gold standard [20]. Even among biopsies up to 4cm in length, substantial error has been observed when biopsy specimens have been compared to the full liver [16]. Thus, an alternative interpretation of the limited diagnostic accuracy of surrogate markers is that it is due to error of the biopsy measurement itself [6], [19], [21], [22].
 
When errors in a diagnostic test and the gold standard are independent, the observed sensitivity and specificity of the diagnostic test will be underestimated [23], [24], [25]. However, the degree to which measurement error in the biopsy may impact the observed diagnostic accuracy of fibrosis marker panels has not been estimated. This is a major limitation since, depending on the magnitude of effect, it is possible that a valid surrogate might already exist and could not be differentiated from an inadequate test as long as the liver biopsy result is the comparator. In other words, biopsy error could make it impossible to distinguish a perfect surrogate from a clinically inadequate one. To estimate the magnitude of the bias, we characterized the optimum performance of surrogate markers based on a range of conservative estimates of biopsy error.
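The underestimation described above can be illustrated for a binary marker with a short sketch. It assumes that marker and biopsy errors are independent given the true disease state; the function and parameter names are hypothetical and this is not presented as the authors' own calculation.

```python
def observed_accuracy(se_marker, sp_marker, se_biopsy, sp_biopsy, prevalence):
    """Sensitivity and specificity of a binary marker as they would appear
    when scored against an imperfect biopsy, assuming the two tests err
    independently given the true disease state."""
    p = prevalence
    # Probability that the biopsy calls a patient diseased / not diseased.
    biopsy_pos = p * se_biopsy + (1 - p) * (1 - sp_biopsy)
    biopsy_neg = 1 - biopsy_pos
    # Both tests positive: truly diseased patients plus shared false positives.
    both_pos = p * se_marker * se_biopsy + (1 - p) * (1 - sp_marker) * (1 - sp_biopsy)
    # Both tests negative: truly healthy patients plus shared false negatives.
    both_neg = (1 - p) * sp_marker * sp_biopsy + p * (1 - se_marker) * (1 - se_biopsy)
    return both_pos / biopsy_pos, both_neg / biopsy_neg


# A marker that is never wrong, judged against a 90%/90% biopsy at 40% prevalence.
obs_se, obs_sp = observed_accuracy(1.0, 1.0, 0.9, 0.9, 0.4)
print(round(obs_se, 3), round(obs_sp, 3))
```

Under these assumptions even a flawless marker appears to miss cases and raise false alarms, because every biopsy error is charged to the marker.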
 
Results
 
The results of this investigation confirm the hypothesis that biopsy error causes the true validity of surrogate tests to be underestimated by an amount that would lead a clinician to misperceive the test as inaccurate. Even with conservative estimates of biopsy error such as sensitivity and specificity of biopsy of 80%, true liver disease prevalence of 40%, and marker vs. true disease AUROC of 0.80, the calculated AUROC of the marker vs. biopsy would be 0.70 (Fig. 2). For the same assumptions of disease prevalence and biopsy sensitivity and specificity, a perfect test (AUROC of marker vs. true disease of 0.99) would have an expected validity (AUROC of marker vs. biopsy) of 0.76. If the biopsy sensitivity and specificity were 90% and disease prevalence remained 40%, a perfect marker would have an expected AUROC of 0.90. Interestingly, observed AUROC values of the marker vs. biopsy for many published studies fall within the range of 0.76-0.88 [6], [7], [8], [9], [10], [11], [12], [14].
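One simple way to see why an imperfect biopsy pulls the observed AUROC down is a mixture argument. The sketch below is illustrative only: it assumes a continuous marker without ties and marker values that are independent of the biopsy result given the true disease state. It is not necessarily the calculation the authors performed, so its output need not match the published figures exactly. All names are hypothetical.

```python
def predictive_values(se_biopsy, sp_biopsy, prevalence):
    """Positive and negative predictive values of the biopsy."""
    p = prevalence
    ppv = p * se_biopsy / (p * se_biopsy + (1 - p) * (1 - sp_biopsy))
    npv = (1 - p) * sp_biopsy / ((1 - p) * sp_biopsy + p * (1 - se_biopsy))
    return ppv, npv


def observed_auroc(true_auroc, se_biopsy, sp_biopsy, prevalence):
    """AUROC of the marker against biopsy, given its AUROC against truth.

    The biopsy-positive group is a mix of truly diseased (share PPV) and
    truly healthy patients; the biopsy-negative group is the reverse with
    share NPV. Pairs drawn from the same true class rank at chance (0.5),
    and pairs with swapped labels rank at 1 - true_auroc. Summing the four
    cases simplifies to the expression below.
    """
    ppv, npv = predictive_values(se_biopsy, sp_biopsy, prevalence)
    return 0.5 + (true_auroc - 0.5) * (ppv + npv - 1)


# Near-perfect marker (0.99 against truth), 40% prevalence, two biopsy qualities.
for accuracy in (0.8, 0.9):
    print(accuracy, round(observed_auroc(0.99, accuracy, accuracy, 0.4), 3))
```

The shrinkage factor `ppv + npv - 1` equals 1 only for an error-free biopsy, so under these assumptions the observed AUROC always sits closer to 0.5 than the true one.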
 
These data also imply that a marker panel with an observed AUROC as compared with the liver biopsy at the lower bound of 0.76 may truly have an AUROC (vs. true disease) between 0.93 and 0.99 under a sensitivity and specificity of biopsy of 80% and prevalence between 0.3 and 0.5. When the sensitivity and specificity of biopsy are 90%, the marker vs. true disease AUROC would be 0.83, thus still exceeding the observed AUROC of 0.76 (when prevalence is 0.5).
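The reverse question, what true AUROC an observed value might correspond to, can be sketched by inverting a simple mixture model. The assumptions are a continuous marker without ties and marker values independent of the biopsy result given true disease state; the function names are hypothetical, and this is offered as an illustration rather than as the authors' method.

```python
def implied_true_auroc(observed, se_biopsy, sp_biopsy, prevalence):
    """True (marker vs. disease) AUROC implied by an observed marker vs.
    biopsy AUROC, under the conditional-independence mixture model.
    A result above 1.0 means the observed value is not attainable
    under the stated biopsy accuracy and prevalence."""
    p = prevalence
    ppv = p * se_biopsy / (p * se_biopsy + (1 - p) * (1 - sp_biopsy))
    npv = (1 - p) * sp_biopsy / ((1 - p) * sp_biopsy + p * (1 - se_biopsy))
    # Observed = 0.5 + (true - 0.5) * (ppv + npv - 1), solved for "true".
    return 0.5 + (observed - 0.5) / (ppv + npv - 1)


# Observed AUROC of 0.76 with an 80%/80% biopsy across a range of prevalences.
for prev in (0.3, 0.4, 0.5):
    print(prev, round(implied_true_auroc(0.76, 0.8, 0.8, prev), 3))

# Same observed AUROC with a 90%/90% biopsy at 50% prevalence.
print(round(implied_true_auroc(0.76, 0.9, 0.9, 0.5), 3))
```

Under these assumptions the implied values land close to the 0.93 to 0.99 range and the 0.83 figure quoted above, although that agreement does not confirm that the same formula was used in the paper.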
 
Discussion
 
The results of this investigation demonstrate that even a perfect non-invasive marker could not be distinguished from less valid assays with most tenable assumptions of biopsy sensitivity and specificity. In addition, our findings explain why existing published marker validity estimates cluster in an AUROC range of 0.76-0.88 [6], [7], [8], [9], [10], [11], [12], [14]. Moreover, the maximal expected real world performance of the surrogate marker occurred when the disease prevalence exceeded 40% and the sensitivity and specificity of the biopsy exceeded 90%, which is not feasible in most settings.
 
These calculations have implications for the interpretation of the performance of surrogate markers as well as their application in clinical practice. A perfect surrogate marker of liver fibrosis could already exist but not be recognized. Alternatively, correlated error (identifying the same false-positive and negative results using the biopsy and marker) could be misinterpreted as an improvement in observed validity of the marker. Since markers are developed by using biopsy data, the latter consideration is especially germane and probably already occurs.
 
Accumulating evidence regarding the limitations of biopsy has led some to suggest that non-invasive markers should replace biopsy as the initial method for disease staging [30], [31], [32], [33]. However, guidelines and practice patterns differ between countries and even within a given country. Others have considered alternate strategies where both non-invasive markers and biopsy are used in combination since complementary information can be obtained [33]. Further research is needed to evaluate the long-term effectiveness of these strategies before a global recommendation can be made.
 
In this study, we considered measurement of significant liver fibrosis in our calculations. Other thresholds exist, such as detection of cirrhosis or 'no' vs. 'some' fibrosis. We chose significant fibrosis to correspond with treatment guidelines and many published studies [1], [26]. Most studies suggest that the measurement and observer error for detection of cirrhosis is lower [16], [28]. This may explain why markers often appear to be more valid representations of this stage [6]. Further, our calculations did not consider the full range of fibrosis stage. As described previously, the underlying spectrum of disease represented by a dichotomization into significant liver fibrosis vs. not can be quite broad [18], [34]. It is likely that surrogate markers would perform better against a liver biopsy when the extremes are overrepresented (e.g., high representation of F0 and F4). Though we did not address this issue specifically, our calculations can be extended to comparisons of adjacent (e.g., F1 vs. F2) or nonadjacent stages of fibrosis (e.g., F1 vs. F4) to address this concern.
 
The calculations presented within this paper further rely on the assumption of conditional independence of the surrogate marker and biopsy results. We recognize that there have been several recent demonstrations of non-parametric approaches to estimate ROC curves [35], [36] as well as a latent class model approach [37]. However, our goal was to illustrate why previous studies that did not use specialized methods to correct for an imperfect gold standard report limited AUROC estimates. Furthermore, the discrepant resolution method requires an imperfect standard test plus an additional method to resolve discrepancies, and the composite reference standards method requires several imperfect reference tests that may be combined into a reference against which the surrogate markers may be compared [35], [36], [38]. These methods may be useful in future studies that consider samples where biopsy measurements, elastography data and serum marker data are available.
 
Finally, we have not addressed the issue of discordance between biopsy results and surrogate markers. Even studies that observe high AUROC values have a large number of patients with discrepant biopsy and surrogate marker results. Interestingly, these studies often suggest that when there are differences between the two methods, biopsy has underestimated disease [28]. This is not surprising given that liver biopsy is more likely to miss fibrosis when it is actually present as opposed to the reader overestimating the presence of fibrosis. Further, some non-invasive marker (e.g., APRI) levels tend to be higher when the Fibroscan estimates a higher disease burden but the biopsy suggests a low disease stage [33].
 
Our results emphasize the importance of minimizing biopsy error in studies developing surrogate markers. Since measurement error increases markedly when biopsy size is less than 3.0 cm, one application is that only samples of at least 3.0 cm be used to characterize marker validity [16], [17]. Likewise, future studies should make every effort to minimize reader error. Lacking another gold standard, we cannot assess with confidence whether it is even possible to increase biopsy validity sufficiently to substantively differentiate a new marker from those we already have. However, these calculations make it clear that attempts to validate markers in 'real world' settings will always be constrained since biopsy sensitivity and specificity are much lower.
 
Although some clinicians already use liver biopsy surrogate markers in their practices, others are waiting for more valid tests. Our results strongly suggest that major improvements in surrogate markers are unlikely when evaluated against liver biopsy. Thus, novel strategies are needed to move the field forward. In particular, long-term prospective studies of markers against clinical gold standards, such as development of end-stage liver disease are needed to assess the best measures of intermediate disease stages. Likewise, the validity of all outcome measures must be carefully considered when assessing the validity of surrogate markers in biomedical research or clinical practice.
 
 
 
 