Introduction
The Food and Drug Administration (FDA) has long been in a crossfire between those who complain of overly strict standards for approving new drugs and those who charge the opposite.1 Who is correct? What should FDA's standards be? What are they now? Do they differ for different classes of drugs? Should they, and, if so, how much? When the FDA places its stamp of approval -- "safe and effective" -- on a drug, could this mean, e.g., only a 5% chance that it is not effective, a 25% chance that it is not effective, or that it is more likely than not that it is ineffective? From a normative perspective, should the FDA ever approve a drug that reviewers believe is probably not effective?
The short answer to most of these issues is: We don't really know. Note that the question here is not whether the FDA engages in risk-benefit assessments. It certainly does, tolerating greater side effects or risks of adverse reactions when clinical benefits are greater. Instead, the issue is whether the standard of proof used to judge whether a drug is "effective" varies with benefits and risks. "Standard of proof" refers most obviously to the level of statistical significance required but includes all other factors relevant to assessing the evidentiary burden a drug sponsor has to bear -- e.g., the number of studies, the comparability of control groups or the use of surrogate endpoints of less certain clinical validity.
For many years, the FDA resolutely adhered to the view that the same standard of proof for assessing effectiveness should apply to all drugs whether "breakthrough" drugs or "me-toos." It argued that the statute's requirement to demonstrate "efficacy" should be construed in narrowly scientific terms rather than in a policy context where the appropriate standard of proof depends upon the relative harms from delaying "good" drugs and approving "bad" ones. If only scientific criteria are relevant, then all drugs should be subject to the same standard of proof. Treating the issue as a purely scientific one helped to protect the agency from challenges to its authority. Its expertise, after all, lies in assessing evidence of effectiveness. In contrast, quantifying the uncertainties and highlighting the value judgments exposes both the limits of science and the policy discretion that the FDA must exercise.
Only in the last few years, as a result of pressures from constituencies concerned with AIDS, has the FDA taken steps to acknowledge that a different standard of proof for assessing evidence of effectiveness should sometimes be used. This acknowledgment that the choice of the standard of proof for assessing effectiveness involves policy judgments makes this an opportune time to ask what those standards currently are and what they should be.
A major reason why we have such limited understanding of the policies that the FDA actually follows in reviewing drugs is that the agency makes little attempt to explain and justify its interpretation of the evidence and its use of its discretion. (In contrast, most regulatory decisions of the Occupational Safety & Health Administration or the Environmental Protection Agency are accompanied by tens or hundreds of pages of Federal Register preamble.) Particularly striking is the absence of explicit attempts to quantify key variables and the failure to place them in a framework where the trade-offs and uncertainties can be highlighted.
These omissions also make it more difficult to decide whether the FDA is making the right choices. Normative analysis is hindered if we cannot systematically compare the gains and losses from approving a drug (or, in the case of prospective analysis, the expected gains and losses). Of course, even with such comparisons, available clinical data may often be too ambiguous to justify any clear normative conclusions.
Decision analysis could provide the framework that is needed to consider the trade-offs and uncertainties. It can be a useful tool for assessing a sequence of decisions made under uncertainty. Its structuring of a problem requires explicit attention to describing the possible uncertain events, estimating the probability of the different outcomes, and valuing each possible outcome.
Because decision analysis makes assumptions of an analysis explicit, it could shed light on some of the positive issues of drug review, e.g.: Are drugs for serious diseases without good alternative treatments approved with less evidence of effectiveness than other drugs?
At least in the near term, the FDA is not likely -- for a mix of political and analytical reasons -- to make explicit use of decision analysis. Therefore, this paper proposes a strategy that might be called "shadow advisory committees." Their task would be to test and develop the methodology. To the extent that the methodology proves feasible, shadow committee findings could both provide a standard against which to assess FDA decisions and help illuminate FDA's decision-making process.
Before turning to that proposal, we first review the positive and normative issues involved with the standard of proof. Although this is not the only decision variable for the FDA, it is probably the most important one and deserves special attention.
The FDA's Standard of Proof
As noted above, we use "standard of proof" to refer to the evidentiary burden that a drug sponsor must bear to support claims of effectiveness and safety. The Federal Food, Drug, and Cosmetic Act provides very little guidance.2 For example, Sec. 505(d) states:
... "substantial evidence" means evidence consisting of adequate and well-controlled investigations, including clinical investigations, by experts qualified by scientific training and experience to evaluate the effectiveness of the drug involved, on the basis of which it could fairly and responsibly be concluded by such experts that the drug will have the effect it purports or is represented to have under the conditions of use prescribed, recommended, or suggested in the labeling or proposed labeling thereof.
The FDA interpreted the plural in "investigations" normally to require at least two studies, but not until 1970 did the FDA regulations begin to define "adequate and well-controlled" investigation.3 These stated that experimental and control groups should be comparable, bias should be controlled and quantitative evaluations should be conducted. While useful, these criteria remained fairly general; moreover, the agency retained the right to waive them when it chose.
A drug is considered "effective" if substantial evidence shows that it does what its proposed labeling says that it does. A drug is considered "safe" if its risks are acceptable in light of its benefits. It is up to the drug's sponsor to carry out pre-clinical tests and clinical trials to persuade the FDA that these criteria have been met.
Over time, additional guidelines have been developed on significant issues like the endpoints that clinical trials must address. For example, when must a cancer drug show not merely activity against tumors, but survival gains, and when must the survival gains be equal to or greater than those provided by alternative treatments?4 The FDA produces a "Summary Basis of Approval" document for each drug approved, and although it reviews (in about 80 single-spaced pages) the findings from studies, it does not present any arguments about risk-benefit trade-offs or probabilistic statements about the overall weight of the evidence. Advisory Committees that are involved in most new drug decisions do sometimes discuss broader issues, but they rarely attempt to make quantitative evidentiary judgments or explicit normative judgments about weights to be accorded different health outcomes.
In some respects, the FDA has treated drugs differently as a function of clinical importance and risks. As noted, the FDA accepts higher patient risks when the drug also can confer greater clinical benefits. Also, the FDA gives priority in the review queue to drugs that represent more important clinical advances and is more willing to grant pre-approval access to drugs for serious and life-threatening diseases. In practice, the FDA has approved drugs within a few months and on the basis of a single study in cases like L-dopa for Parkinson's disease in 1970 and AZT for AIDS in 1987.5 Yet, these were exceptions, and the FDA plausibly claimed that the single studies were so persuasive that approval represented no lowering of its standard of proof.
However, in the fall of 1991, the FDA approved the drug DDI following the recommendation of its Anti-Viral Advisory Committee; and the transcript of that meeting shows clearly that a different, lower standard of proof was being employed. As one participant stated:6
It is, indeed, hard to say that efficacy has been shown by adequate and well-controlled trials. I do not really think that the data do meet the standard. But we ought to approve. We ought not to fudge on it. We intend to operate on a lower standard on this occasion than the standard that is ordinarily applied.
In April, 1992, the FDA proposed to approve drugs on the basis of trials7
establishing that the drug has an effect on a surrogate endpoint that is reasonably likely... to predict clinical benefit. Approval could be granted where there is some uncertainty as to the relation of that endpoint to clinical benefit...
as long as the sponsor agreed to conduct post-approval studies and to accept expedited procedures to remove drugs when the further studies failed to confirm effectiveness. This accelerated review was to be limited to drugs that "provide meaningful therapeutic benefit over existing treatment for patients with serious or life-threatening diseases." Although reliance on surrogate measures is nothing new -- e.g., approving drugs to prevent strokes on the basis of evidence that they lower blood pressure -- this regulation is the first formal acceptance of surrogates of "uncertain" validity.
Thus the FDA has taken some steps to adopt a lower standard of proof for important new drugs for serious diseases. But that fact still leaves us with many questions: What exactly is this new, lower standard of proof? What was the old one? More generally, does the notion of a uniform standard of proof -- or indeed of a new, two-tier system -- really describe FDA's practices? Does each FDA division behave differently? Do results vary as FDA staff change or as different members become influential on its advisory committees?
Normative Perspectives on the Standard of Proof
The FDA frequently defines its criterion for approval in terms of a finding that the clinical benefits that a drug will confer outweigh the risks that it poses. Using this criterion, the errors that the FDA can make are 1) allowing the marketing of a drug even though the benefits do not justify the risks and 2) not allowing the marketing of a drug even though the benefits do justify the risks. In this framework, FDA's proper objective may be posed as choosing the standard of proof that will minimize the sum of the costs to society of regulatory errors. It has long been recognized that this framework may be useful for thinking about the choice of the standard of proof.8
Also, decision analysis seems well-suited for helping to think through individual drug review decisions. In addition to comparing the expected gains and losses from approval decisions, decision analysis is tailored to assess the highly relevant option of postponing a decision until additional information has been obtained. In clinical practice, decision analysis has made some headway in the last decade, especially in informing judgments about the desirability of diagnostic procedures.9 Although physicians (like most people) dislike being prodded to make explicit probabilistic estimates and overt value judgments, a process that requires those steps to be taken, and taken openly, has potential to lead to better informed, more thoughtful choices.
An analysis of error costs can be used to evaluate different regulatory policies employing different standards of proof. Suppose country A has a strict standard of proof, allowing six "good drug errors" (good drugs rejected or delayed) and only three "bad drug errors" (bad drugs approved), while country B's more relaxed review standard leaves it more prone to "bad drug errors" (seven) than to "good drug errors" (three). If the goal is to minimize the sum of error costs, then we need to know the costs of the errors to be able to conclude which policy is better. If the average costs of both error types are the same, then country A's total costs will be lower because it has fewer total errors (nine versus ten). If the average cost of a "good drug error" is much greater than that of a "bad drug error," then country B's less stringent standard of proof will lead to the lower total costs.
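The comparison can be made concrete with a short sketch. The error counts are those of the two-country example; the per-error costs are hypothetical, in arbitrary units, and are chosen only to show how the ranking of the two regimes can flip as the relative cost of a "good drug error" rises.

```python
def total_error_cost(good_drug_errors, bad_drug_errors,
                     cost_per_good_error, cost_per_bad_error):
    """Total cost of a regime's regulatory errors.

    good_drug_errors: good drugs rejected or delayed
    bad_drug_errors:  bad drugs approved
    """
    return (good_drug_errors * cost_per_good_error
            + bad_drug_errors * cost_per_bad_error)


# Error counts taken from the two-country example in the text.
REGIMES = {
    "A": {"good_drug_errors": 6, "bad_drug_errors": 3},  # strict standard
    "B": {"good_drug_errors": 3, "bad_drug_errors": 7},  # relaxed standard
}


def lower_cost_regime(cost_per_good_error, cost_per_bad_error):
    """Return the name of the regime with the lower total error cost."""
    totals = {
        name: total_error_cost(cost_per_good_error=cost_per_good_error,
                               cost_per_bad_error=cost_per_bad_error,
                               **counts)
        for name, counts in REGIMES.items()
    }
    return min(totals, key=totals.get)


# Hypothetical cost ratios, not estimates drawn from any data.
print(lower_cost_regime(1, 1))  # equal costs: totals are 9 (A) vs. 10 (B)
print(lower_cost_regime(3, 1))  # good-drug errors 3x as costly: 21 (A) vs. 16 (B)
```

With these counts, country B's total is lower whenever a good-drug error costs more than four-thirds as much as a bad-drug error, since 6g + 3b exceeds 3g + 7b exactly when g exceeds (4/3)b.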
The closest that anyone has come to actually applying this framework was Wardell in the 1970's -- comparing the U.S. and Britain. In a series of articles10 he demonstrated that 1) Britain had approved many more drugs than the U.S. had and that mutually approved drugs had generally been approved earlier there and 2) many of the drugs in which Britain had the lead were viewed by physicians there as important therapies. He also argued that Britain did not experience a very large burden of extra toxicity as a result of its earlier approvals, although the limited available data precluded firm, quantitative conclusions.
Wardell's argument eventually won at least partial acceptance among top FDA staff and was well publicized at the first Congressional hearing to criticize the FDA for being too strict rather than too lax.11 Still, although several studies12 have updated the counts of drug approvals in Britain and the U.S., no one has attempted a new, comprehensive assessment of the clinical significance of gains and risks.
One reason may be that such comprehensive studies are difficult to carry out. Another is that, at least with respect to Britain, the lag has disappeared. There is still a U.S. lag when we consider all drugs approved since 1962, but, for the most recent period for which we have full data (1983-87), the U.S. introduced considerably more drugs and introduced mutually available drugs sooner than Britain.13 To the extent that standards of proof contributed to this shift, most observers attribute it to toughened British standards rather than to looser U.S. standards.14 Whatever the reasons, the apparent disappearance of a lag reduces incentives to explore the clinical and normative significance of the differences. It also eliminates the alternative drug regulation regime needed to compare different standards of proof. Its disappearance leaves us in more of a quandary about how to assess U.S. performance.
We can, however, ask whether the FDA has acted in accord with an important implication of the error cost minimization principle: The standard of proof should vary with the size of the error costs.15 When a drug's potential benefits are large relative to its risks, the expected cost of withholding the drug is large. Therefore, the standard of proof should be lower. The FDA should not have one standard of proof, but many. As discussed above, it has moved to adopt a lower standard for AIDS drugs and perhaps others, but we still do not have much insight into what the different standards ought to be.
How to Proceed?
What research strategies can we adopt for learning more about the normative and positive aspects of drug review, including the standard(s) of proof that the FDA uses and should use? Further studies of international differences in the availability of drugs may continue to yield suggestive findings, but, since many factors affect availability, linking those differences to regulatory decisions is often difficult. In any event, international differences have been shrinking and will probably continue to do so as Europe, the U.S. and Japan attempt to make their regulatory policies for drugs more consistent.
Expert reviews of FDA Summary Basis of Approval documents and of transcripts of advisory committee meetings may also prove useful for understanding FDA decision making. However, we believe that a proposal for using the framework of decision analysis to review drug approval decisions could not only provide information to address both the positive and normative issues, but could also help to advance the use of techniques that will lead to more rational regulatory policies.
Specifically, the proposal is to convene panels that will review the data on drugs that the FDA has recently reviewed. The main tasks of these panels would be to: (1) elicit explicit judgments from the experts about the probability distribution of the benefit and risk profiles; (2) compare the risks and benefits, preferably using a common metric (e.g., quality-adjusted life years), to provide quantitative assessments of the net clinical benefits associated with the use of the drug in different subgroups; and (3) show what the different assumptions imply in terms of net clinical benefits.
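As a rough illustration of tasks (2) and (3), the sketch below computes expected net clinical benefit per treated patient, in quality-adjusted life years (QALYs), for one subgroup under alternative assumptions about the probability that the drug is effective. Every name and number is hypothetical; a real panel would elicit them from experts and trial data, and would likely need a richer model of harms.

```python
def expected_net_qalys(p_effective, qaly_gain_if_effective,
                       p_adverse_event, qaly_loss_per_adverse_event):
    """Expected net QALYs per treated patient.

    Simplifying assumptions: the benefit accrues only if the drug is
    effective, and the adverse-event risk is borne either way.
    """
    expected_benefit = p_effective * qaly_gain_if_effective
    expected_harm = p_adverse_event * qaly_loss_per_adverse_event
    return expected_benefit - expected_harm


# Hypothetical panel judgments for one subgroup: three views of the
# probability that the drug is effective.
scenarios = {"pessimistic": 0.4, "central": 0.6, "optimistic": 0.8}

results = {
    name: expected_net_qalys(p_effective=p,
                             qaly_gain_if_effective=0.5,
                             p_adverse_event=0.1,
                             qaly_loss_per_adverse_event=2.0)
    for name, p in scenarios.items()
}

for name, net in results.items():
    print(f"{name}: {net:+.2f} expected net QALYs per patient")
```

Laying the scenarios side by side shows how far the sign and size of the net benefit depend on the disputed probability rather than on the valuation of outcomes.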
It should be apparent why it will be difficult for the FDA to embrace the use of decision analysis publicly. In addition to the undeveloped state of the technique's application to drug review, its use is fraught with political risks. Making explicit judgments about how to weigh the risks of certain adverse events against the benefits of treatment is controversial. Acknowledging explicitly that drugs are being approved despite a non-trivial probability that they are not effective poses a threat, as we argued above, to FDA autonomy.
The panels proposed here would not be asked to recommend what the FDA should do or should have done. Unlike FDA Advisory Committees, they would be asked to make explicit, to the extent permitted by the available data, the building blocks for those judgments. For example, if the objective of FDA policy is to maximize net clinical benefits, then given the expected gains and losses from approving or not approving a drug, what probability of effectiveness should be required before a particular drug is approved?
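One simple way to frame that question, offered here as an illustrative assumption rather than as the FDA's or any panel's method, is a two-outcome model. If approval yields a net gain G when the drug is effective and a net loss L when it is not, and non-approval is the zero baseline, then approval has positive expected value only when the probability of effectiveness p satisfies pG - (1 - p)L > 0, that is, when p exceeds L / (G + L). The function name and the example values below are hypothetical.

```python
def approval_threshold(gain_if_effective, loss_if_ineffective):
    """Break-even probability of effectiveness in a two-outcome model.

    Approve only if p * gain - (1 - p) * loss > 0, i.e. if
    p > loss / (gain + loss). Both inputs must share one metric
    (for example, QALYs per patient).
    """
    if gain_if_effective <= 0 or loss_if_ineffective < 0:
        raise ValueError("gain must be positive and loss non-negative")
    return loss_if_ineffective / (gain_if_effective + loss_if_ineffective)


# Hypothetical magnitudes in arbitrary but common units.
cases = {
    "gain equals loss": (1.0, 1.0),
    "large gain, small loss": (9.0, 1.0),
    "small gain, large loss": (1.0, 19.0),
}

for label, (gain, loss) in cases.items():
    print(f"{label}: approve if p > {approval_threshold(gain, loss):.2f}")
```

Under these assumptions the required probability falls as the gain from an effective drug grows relative to the loss from an ineffective one. When the gain is nine times the loss, the break-even probability is one in ten, so approval could have positive expected value even for a drug that is probably not effective.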
With the many complexities posed by drug approval decisions, will it be possible to carry out this exercise and learn anything useful? The probable answer is: Not always, but sometimes. Utility will depend largely on the quality of available data, which will affect panelists' ability to reach agreement about the facts. It will also depend upon their ability to agree about how to value clinical gains and losses. Yet, even when panelists cannot agree, the process may be useful by highlighting areas of uncertainty and focusing attention on how the FDA proceeded to resolve them.
Risk-benefit decisions are facilitated when risks and benefits can be placed in a common metric. The use of quality-of-life measures in assessments of medical technologies has increased swiftly and become more sophisticated.16 A survey of pharmaceutical firms conducted in the late 1980's reported that over 60% used quality-of-life instruments in clinical trials.17 Many quality-of-life measures are specific to particular diseases (e.g., the Arthritis Ladder Scale) or to aspects of illness (e.g., the Self Esteem Scale). However, a number of generic instruments have been developed to compare outcomes across populations and interventions.18 These will be extremely useful, but questions will remain about whether they properly weight different dimensions of clinical impacts, and use of single numbers to sum up complicated problems runs the risk of obscuring rather than clarifying value judgments that must be made.19
When generic measures are not available, panelists will still have to judge the weight of different impacts. The basis for judgments should be clearly explained. With respect to both facts and values, the panels need a process to facilitate consensus without imposing a false one.
Panels could not do their work until sufficient information has been made public. Generally, the earliest time would follow an FDA advisory committee meeting to review a New Drug Application. Then, evidence on efficacy and safety is generally available for public scrutiny; panels would also have the benefit of the transcript of the advisory committees' deliberations. Panels should address new drugs for which the data base is relatively good. For drugs in which quality-of-life concerns are very important, data should be available on those concerns, preferably using generic measures to ease comparisons of benefits and risks. The panels could review drugs that the advisory committees (and the FDA) disapprove as well as those they approve.
It will also be sensible to review drugs for which the advisory committee split its vote concerning "safety and effectiveness." (Based on a review of recent advisory committee meeting results, non-unanimous votes occur in a sizable minority of cases.) A review of these drugs is more likely to identify the reasons why people disagree and the boundaries that currently separate adequate and inadequate evidence and acceptable and unacceptable costs of drug use.
The panels would include the types of clinical experts who serve on FDA advisory committees, but they would be supplemented by experts in decision analysis and in the measurement of health status. Because each panel would focus on only a single drug, its membership could be small, with perhaps only three or four clinical experts.
All of the analytic tasks described above are difficult, but they are precisely the ones implied by FDA's objective of showing that beneficial effects outweigh adverse effects. It typically makes this analysis in an informal, qualitative way. A more formal, quantitative approach could serve several functions: (1) shed light on the desirability of particular drug-approval decisions; (2) raise more general questions about the trade-offs that should be made for different classes of drugs; and (3) suggest insights about the actual decision rules that the FDA and its advisory committees are using.
No one or even several such panels can provide unambiguous evidence about what FDA's current practices are or about what they should be. There are often too many unknowns in each case to be certain about what the FDA or the advisory committee members must have been assuming; or about what standard of proof would be the proper one to use in light of the expected costs and benefits of approval and whether that standard was, in fact, met. Yet this approach, even on a small scale, seems likely to suggest tentative answers that could be the basis for more thorough research. And, if the use of a decision analytic framework seems useful, over time the panels should accumulate stronger evidence about the review process.
A Final Case
In conclusion, we can consider a case where decision analysis might contribute some insights. One issue that the FDA and National Cancer Institute (NCI) Working Group could not agree on was the proper endpoints for drugs for previously untreated advanced ovarian cancer, where standard therapy produces substantial toxicity and modest improvements in survival.20 The use of carboplatin for this disease had been the subject of hot debate in 1989 between Robert Temple from the FDA and Bruce Chabner from NCI before the National Committee to Review Current Procedures for Approval of New Drugs for AIDS and Cancer (the "Lasagna committee").21
Temple, Director of FDA's Office of Drug Evaluation, acknowledged that carboplatin had fewer side effects than the standard treatment, cisplatinum, which some patients could not tolerate. The available evidence showed that carboplatin did have anti-tumor activity, but there were no studies comparing it to cisplatinum or suggesting that it was as effective as cisplatinum. Temple stated that the FDA would have no trouble approving carboplatin as second-line therapy, i.e., for patients who failed to respond to or who could not tolerate the standard therapy. But:22
The reservation, though, that I would have is that if one simply approved carboplatin for unvarnished, unmodified front line therapy without knowing how it compares, I think there is a potential serious loss until you have reasonable evidence that survival is comparable.
He stated that the Advisory Committee believed that "comparable complete response rates alone are not enough in ovarian cancer because of the known dramatic survival gain."
In response, Bruce Chabner, Director of the Division of Cancer Treatment at the NCI, gave a different twist to the same facts:23
I feel personally that for most solid tumors we do not currently have effective therapies. We have some therapies that produce responses. We have therapies that improve survival to a small extent. But just to take the example of ovarian cancer, when Bob [Temple] says it is important to compare carboplatin against standard therapy, I don't agree. I don't think that we have good therapy for ovarian cancer now. The cure rate is only 20% at most and this is in somewhat selected patients. I can't consider a disease where 80% of the patients are dying of that cancer to have a standard therapy.
Reading this argument, one is struck by the different framing of the choice. Tversky and Kahneman24 have shown that when choices are framed in terms of losses, people are more likely to be willing to take risks: If 80% will die with standard treatment, then we must be willing to gamble on something new; but, if we can already save 20%, then we should be cautious about jeopardizing those gains. But the perspectives of Temple and Chabner also reflect the different missions of their organizations. The Division of Cancer Treatment's charge is to develop better treatments for dying patients; the FDA's, to ensure that new drugs are safe and effective.
Given the impact that framing and organizational interests may have on perceptions of drug approval issues, the case for systematically examining these choices seems even stronger.
* Dr. Mendeloff is Professor and Director, Public Management and Policy Program, University of Pittsburgh Graduate School of Public and International Affairs. He received his A.B. from Harvard College and his M.P.P. and Ph.D. from the University of California (Berkeley).
1 See, e.g., Subcomm. on Science, Research, and Technology of the House Comm. on Science and Technology, Oversight: The Food and Drug Administration's Process for Approving New Drugs, 96th Cong. 1st sess. (1979).
2 Pub. L. No. 75-717, 52 Stat. 1040 (1938), as amended.
3 Part 130.12 of the New Drug Regulations. 35 F.R. 7250 ff. (1970).
4 Joyce A. O'Shaughnessy et al., Commentary Concerning Demonstrations of Safety and Efficacy of Investigational Anti-Cancer Agents in Clinical Trials, 9 J. Clin. Oncology 2225 (1991).
5 Kenneth I. Kaitin, Case Studies of Expedited Review: AZT and L-Dopa, 19 Law, Med. & Health Care 242 (1991).
6 Paul Meier in transcript of FDA Anti-Viral Advisory Committee Meeting, July 18, 1991, at 186.
7 Food and Drug Administration, Proposed Rule: New Drug, Antibiotic, and Biological Drug Product Regulations; Accelerated Approval, 57 F.R. 13234 (1992).
8 Henry Grabowski & John Vernon, The Regulation of Pharmaceuticals: Balancing the Benefits and Risks (1983); Talbot Page, A Framework for Unreasonable Risk in The Toxic Substances Control Act, Managing Risks from Carcinogens (William Nicholson, ed. 1981).
9 Harold C. Sox, Jr., Michael C. Higgins & Keith I. Marton, Medical Decision Making (1988); Milton Weinstein & Harvey Fineberg, Clinical Decision Analysis (1980).
10 William Wardell, Introduction of New Therapeutic Drugs in the United States and Great Britain: An International Comparison, 14 Clin. Pharmacol. & Ther. 773 (1973); William Wardell, Therapeutic Implications of the Drug Lag, 15 Clin. Pharmacol. & Ther. 73 (1974); William Wardell, British Usage and American Awareness of Some New Therapeutic Drugs, 14 Clin. Pharmacol. & Ther. 1022 (1973); William Wardell & Louis Lasagna, Regulation and Drug Development (1975); and William Wardell, The Drug Lag Revisited: Comparison by Therapeutic Areas of Patterns of Drugs Marketed in the United States and Great Britain from 1972 to 1976, 24 Clin. Pharmacol. & Ther. 499 (1978).
11 Supra note 1.
12 Kenneth I. Kaitin et al., The Drug Lag: An Update of New Drug Introductions in the United States and in the United Kingdom, 1977 through 1987, 46 Clin. Pharmacol. & Ther. 121 (1989); Hans Berlin & Bengt Jonsson, International Dissemination of New Drugs: A Comparative Study of Six Countries, 7 Managerial Decision Economics 235 (1986); Paul L. Coppinger, Carl C. Peck & Robert J. Temple, Understanding Comparisons of Drug Introductions between the United States and the United Kingdom, 46 Clin. Pharmacol. & Ther. 139 (1989).
13 Coppinger, Peck & Temple, supra.
14 Alister Dunning, Regulation, New Drug Development, and the Question of Delay, 41 F. D. Cos. L.J. 139 (1986).
15 Joanna Siegel & Marc Roberts, Reforming FDA Policy: Lessons from the AIDS Experience, Regulation, Fall 1991, at 71.
16 Jennifer Falotico-Taylor, Mark McClellan & Frederick Mosteller, The Use of Quality-of-Life Measures in Technology Assessment, in Quality of Life and Technology Assessment (F. Mosteller & J. Falotico-Taylor, eds. 1989).
17 Brian Luce, Joan Wechsler & Carol Underwood, The Use of Quality-of-Life Measures in the Private Sector, in op cit.
18 Robert Kaplan et al., The Quality of Well-Being Scale, 27 Med. Care S.27 (1989); David Feeny & George Torrance, Incorporating Utility-based Quality-of-Life Assessment Measures in Clinical Trials, 27 Med. Care S.190 (1989); and Donald Patrick & Richard Deyo, Generic and Disease-Specific Measures in Assessing Health Status and Quality of Life, 27 Med. Care S.217 (1989).
19 Frederick Mosteller, John Ware, Jr. & Sol Levine, Final Comments on the Conference on Advances in Health Assessment, 27 Med. Care S.282 (1989).
20 Amos Tversky & Daniel Kahneman, The Framing of Decisions and the Psychology of Choice, 211 Science 453 (1981).
21 National Committee to Review Current Procedures for Approval of New Drugs for AIDS and Cancer, Unedited transcript of meeting Feb.1, 1989.
22 Id. at 45.
23 Id. at 44.
24 Tversky & Kahneman, supra note 20.