Standards for Selecting Quality Tools
Quality Tools Require:
a) Standardization, meaning that a test is given in exactly the same way to a large sample of children in order to produce (in the case of assessment-level and diagnostic tools) a normative set of age-equivalent scores and quotients. The standardization sample must be current, i.e., collected within the last 10 years, because national demographics change frequently. The sample must also be representative, meaning that it reflects national, contemporary demographics (e.g., parents’ levels of education, geographic areas of residence, income, languages spoken at home, etc.). This means that screening tests should not be "normed" on atypical samples (such as children referred to special services). Census Bureau data are used to illustrate, usually via matching percentages, that the characteristics of a test's sample reflect the country as a whole.
b) Standardization in the most prevalent languages (e.g., in the US, in both English and Spanish at the very least). This does not mean that separate norms should be generated for each language: all children must be held to the same performance standards so that examiners can tell who is behind, i.e., likely to be unready for kindergarten, where curricular goals and determination of school success are virtually universal. Still, it is critical that the translation(s) be carefully done and thoroughly vetted so that dual-language learners and bilingual children are given an equal opportunity for success (or failure).
c) Proof of reliability. There are several kinds of reliability that should be included in every test manual:
- Inter-rater, meaning that two different examiners can derive the same results when testing the same child in a short period of time. This illustrates that the directions are clear and that the test norms can be used confidently;
- Test-retest, meaning that children’s performance is highly similar if tested again in a short period of time. This means that test stimuli and test directions are clear enough to both examiner and child;
- Internal consistency, meaning that similar kinds of items “hang together” (e.g., that motor items don’t cluster with language items, which would suggest that directions for motor tasks demand too many language skills to be a meaningful measure of motor skills).
d) Proof of validity, of which there are various kinds. Only the most critical are described below:
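Reliability of the test-retest kind is typically summarized as a correlation between two administrations of the same test. The following is a minimal sketch, assuming Pearson's r as the statistic (the text above does not name one). The function name and the scores are hypothetical and made up purely for illustration.

```python
import math


def pearson_r(first, second):
    """Pearson correlation between two equal-length lists of scores."""
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_a = sum(first) / len(first)
    mean_b = sum(second) / len(second)
    # Sum of cross-products and sums of squares around each mean.
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(first, second))
    var_a = sum((a - mean_a) ** 2 for a in first)
    var_b = sum((b - mean_b) ** 2 for b in second)
    if var_a == 0 or var_b == 0:
        raise ValueError("scores must vary for a correlation to be defined")
    return cov / math.sqrt(var_a * var_b)


# Hypothetical raw scores for the same six children, tested twice
# within a short period of time (illustrative numbers only).
time_1 = [12, 15, 9, 20, 17, 11]
time_2 = [13, 14, 10, 19, 18, 11]

retest_r = pearson_r(time_1, time_2)
print(f"test-retest r = {retest_r:.2f}")
```

A value close to 1.0 indicates that children's performance was highly similar across the two sessions. The same calculation could be applied to scores from two different examiners as one rough check of inter-rater reliability.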
- Concurrent validity, i.e., high correlations with diagnostic measures along with indicators of how age-equivalent scores or quotients compare with diagnostic measures;
- Discriminant validity. Ideally, but not always, manuals include proof of discriminant validity, meaning that there are distinct patterns of test performance for children with different disabilities (e.g., that children with cerebral palsy perform differently than children with language impairment) and that performance on each domain measured correlates highly with performance on similar domains on diagnostic measures.
- Predictive validity. Rare but valuable is proof of predictive validity, meaning that test results predict future outcomes and thus that current test results have meaningful long-term implications. Such longitudinal studies are expensive, time-consuming, and arduous to conduct, which is why they are uncommon. Nevertheless, some do exist, if not in the test manual, then in the research literature (most particularly in the ERIC and PsycINFO databases). These databases are also a great source for finding a range of validity studies conducted by various authors.
e) Proof of accuracy (also known as criterion-related validity).
This is a critical requirement for screening tests because they must establish cutoff scores that determine whether a child is probably behind versus probably OK, so that swift decisions can be made about whether referrals are needed. To establish cutoffs, screening test scores are compared to concurrent diagnostic testing, informally known as the "gold standard". Ideally the diagnostic battery is a comprehensive one that includes a range of measures determining the presence of common disabilities that require intervention. The common disabilities are, in order of prevalence, language impairment, learning disabilities, intellectual disabilities, and autism spectrum disorder. ADHD is not always considered a disability by early intervention or special education programs, but rather more of a barrier to success, in the same way that stairs rather than wheelchair ramps are a barrier to those with physical disabilities.
Nevertheless, it is critical to remember that screening tests only indicate a probability of problems or absence of problems. A diagnosis should never be given on the basis of a screen (see the Explaining Results section of this module for guidance on how to interpret screening test results).
Indicators of accuracy are most often defined as:
1. Sensitivity is the percentage of children with problems correctly detected. We want to see at least 70% to 80% of children with difficulties identified by poor performance on a screen. (Actually we'd like to see 100%, BUT given the inconsistency of children's performance, the fact that developmental problems are still developing, and that deficits are subtle, especially in younger children, 70% to 80% is acceptable IF we know we will rescreen in the near future and thus catch problems that might have been missed. This is why repeated screening is needed.) In health care, when someone scores below cutoffs, this is called "a positive screen" (just to add to the confusion)!
2. Specificity is the percentage of children without problems correctly detected. We want to see closer to 80% of children without problems identified as typically developing. (Because there are far more children without problems, mistakes with specificity mean unnecessary over-referrals.) In health care, scoring above cutoffs is called "a negative screen."
Sensitivity and specificity sound so much alike that it is easy to confuse them. Remembering that about 8 out of every 10 children are coming along OK, here is a (sort of silly but) hopefully memorable analogy. Say you go bowling. Before you start your game, all 10 pins are in place. Without much skill (meaning even if using a poor-quality test), you may well knock down 8 of them. So think of that first throw as detecting the typical kids, i.e., specificity. Typically developing children will, most of the time, score above cutoffs, meaning that you are quite likely to correctly detect them on just about any screen.
But now, you have a more difficult shot to make. These last two pins represent the challenge of sensitivity: detecting children with delays. Identifying children with delays is difficult because their problems are subtle, especially when they are very young. So making this shot requires skill, i.e., a well-crafted screening measure that is specific as well as sensitive.
But... there is a third issue with screening test results. Developmental and behavioral status are on a continuum, and so a two-point decision such as pass/fail, above/below cutoffs, or positive/negative scores doesn't do justice to the fact that there are shades of gray in between.
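To make the two definitions concrete, here is a minimal sketch of how sensitivity and specificity could be tallied once screening results have been compared with the diagnostic "gold standard". The function name and the ten-child data set are hypothetical, chosen to echo the bowling analogy (8 typically developing children, 2 with delays).

```python
def sensitivity_specificity(screen_failed, has_problem):
    """Return (sensitivity, specificity) from paired lists of booleans.

    screen_failed[i] is True when child i scored below the cutoff
    (a "positive screen"); has_problem[i] is True when diagnostic
    testing confirmed a problem for that child.
    """
    pairs = list(zip(screen_failed, has_problem))
    true_pos = sum(1 for failed, problem in pairs if failed and problem)
    false_neg = sum(1 for failed, problem in pairs if not failed and problem)
    true_neg = sum(1 for failed, problem in pairs if not failed and not problem)
    false_pos = sum(1 for failed, problem in pairs if failed and not problem)
    if true_pos + false_neg == 0 or true_neg + false_pos == 0:
        raise ValueError("need at least one child with and one without problems")
    sensitivity = true_pos / (true_pos + false_neg)
    specificity = true_neg / (true_neg + false_pos)
    return sensitivity, specificity


# Hypothetical group of 10 children: the first 2 have confirmed problems.
has_problem = [True, True] + [False] * 8
# Hypothetical screen: catches one delayed child, misses the other,
# and over-refers one typically developing child.
screen_failed = [True, False, True] + [False] * 7

sens, spec = sensitivity_specificity(screen_failed, has_problem)
print(f"sensitivity = {sens:.1%}, specificity = {spec:.1%}")
```

In this made-up example the screen detects 1 of the 2 children with problems (sensitivity of 50%) and 7 of the 8 children without problems (specificity of 87.5%), so it would meet the specificity standard described above but fall short on sensitivity.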
3. In the "gray zone" are children who fail screens but, when referred for further testing, don't qualify (at least not yet). They are known as over-referrals (or in medical parlance, false-positives). Why does this happen? Most children who are over-referred are somewhat delayed (especially in the domains of development that best predict school success: intelligence, language, and pre-academic/academic skills). These children also tend to have lots of psychosocial risk factors. This means that over-referred children still need help to prevent their delays from burgeoning into substantive problems. They don't qualify for early intervention or public school special education, and so we need to think broadly about other ways to help, for example by referring them to Head Start, quality day care, or preschool, with very careful monitoring to make sure these at-risk children don't fall further behind, and by helping their families with parent training, developmental promotion, mental health, or social services.
Some researchers argue that there should be standards for false-positive/over-referral rates, such as fewer than 30% over-referred, i.e., 70% positive predictive value (meaning that 70% of children who fail a screen will be found to qualify for special education or early intervention). I disagree with setting such standards because it would keep us from finding the "gray zone" children and offering them other types of needed services.
So, false-positive/over-referral results provide a helpful indicator that we need a different action plan. For more information please see our research pages and particularly the peer-reviewed paper called "Are Over-Referrals On Screening Tests Really a Problem."
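The relationship between positive predictive value and the over-referral rate mentioned above can be sketched as follows. The function name is hypothetical, and the counts are illustrative only, chosen to match the 70%/30% figures some researchers propose.

```python
def positive_predictive_value(true_positives, false_positives):
    """Share of children who failed a screen and then qualified for services."""
    failed_total = true_positives + false_positives
    if failed_total == 0:
        raise ValueError("no children failed the screen")
    return true_positives / failed_total


# Hypothetical: 100 children failed a screen; on diagnostic testing,
# 70 qualified for services and 30 did not (the over-referrals).
ppv = positive_predictive_value(true_positives=70, false_positives=30)
over_referral_rate = 1 - ppv

print(f"positive predictive value = {ppv:.0%}")
print(f"over-referral rate = {over_referral_rate:.0%}")
```

Under these assumed counts, the 30 over-referred children are the "gray zone" group: they did not qualify, but per the discussion above they may still need monitoring and other services.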
Note #1: For assessment-level and diagnostic measures, accuracy indicators in the form of sensitivity and specificity are rarely reported and not actually needed (because such tools are usually administered to children who have already failed screens and for whom determination of eligibility and progress monitoring are the more central tasks at hand). Nevertheless, it is important to know differences in results between assessment-level and diagnostic tools (e.g., Are the age-equivalent scores or quotients deflated or inflated? How well do they match with other tests?). So, occasionally and also optimally, assessment-level tools tie results to screening-type cutoff scores to illustrate when a developmental age equivalent is sufficiently behind chronological age to warrant a referral. This helps professionals make wise decisions about families' needs.
Note #2: Each (US) State has unique criteria for early intervention and special education eligibility. Age-equivalent scores (used to render a percentage of delay) and quotients are a typical part of that definition. Examiners who help determine eligibility (e.g., early intervention intake, public school psychologists) will need to refer to State standards to guide decisions about referrals.
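As an illustration of how an age-equivalent score can be rendered as a percentage of delay, here is a minimal sketch. It assumes the commonly used formula (chronological age minus age equivalent, divided by chronological age); individual States may define or apply the calculation differently, so the function and the example ages are hypothetical.

```python
def percent_delay(chronological_age_months, age_equivalent_months):
    """Percentage by which the age equivalent lags chronological age.

    Assumed formula: (chronological - age equivalent) / chronological * 100.
    A negative result means the child scored above age level.
    """
    if chronological_age_months <= 0:
        raise ValueError("chronological age must be positive")
    gap = chronological_age_months - age_equivalent_months
    return gap / chronological_age_months * 100


# Hypothetical child: 36 months old with a 27-month age equivalent.
delay = percent_delay(chronological_age_months=36, age_equivalent_months=27)
print(f"percentage of delay = {delay:.0f}%")
```

For this made-up child the result is a 25% delay. Whether that figure meets an eligibility threshold depends entirely on the relevant State's criteria.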