The impact of rater training on clinical outcomes assessment data: a literature review


  • Michael E. Sadler, eResearch Technology, 500 Rutherford Ave., Boston, MA, United States
  • Rinah T. Yamamoto, eResearch Technology, 500 Rutherford Ave., Boston, MA, United States
  • Laura Khurana, eResearch Technology, 500 Rutherford Ave., Boston, MA, United States
  • Susan M. Dallabrida, eResearch Technology, 500 Rutherford Ave., Boston, MA, United States



Keywords: Rater training, Clinical trials, Reliability, Accuracy


Rater training is a well-recognized approach to minimizing inaccuracy and variability in the clinical outcomes assessments common in clinical trials. However, there is a dearth of empirical research on which types of rater training and rater qualifications contribute to improved accuracy, inter-rater reliability and intra-rater reliability. Herein, we discuss the need for rater training in clinical trials and review publications that report data on rater characteristics, training modalities and outcomes in terms of the accuracy and reliability of clinical outcomes data.


Author Biography

Michael E. Sadler, eResearch Technology, 500 Rutherford Ave., Boston, MA, United States

Scientific Advisor



Review Articles