Introduction

Absolute pitch (AP) is the uncommon ability to identify and label isolated musical pitches in the absence of a reference tone. It contrasts to relative pitch, the ability to use relationships between pitches in a musical context. While relative pitch is a necessary skill for musicians and can be developed through practice (Miyazaki et al., 2018), AP is thought to be present in only a small percentage of musicians, although estimates vary widely from < 1% to 65% across studies; Deutsch et al., 2006; Leite et al., 2016; Miyazaki et al., 2012; Miyazaki et al., 2018) and in most studies cannot be reliably trained (Bittrich et al., 2015; Brady, 1970; Cuddy, 1968, 1970; Gregersen et al., 1999; Leite et al., 2016; Profita & Bidder, 1988; Sakakibara, 2014; although see Van Hedger et al., 2019 for evidence of the skill acquisition theory of AP). Although not necessary for musicianship, individuals with AP describe it as integral to their perception of the auditory world, with one AP musician musing that “people who did not have absolute pitch must be tone deaf to a certain extent” (Boggs, 1907, p. 204). AP can be beneficial (e.g., singing in tune unaccompanied) but can also be a hindrance (e.g., difficulty listening to music using non-standard tuning; West Marvin et al., 2020).

AP is of interest due to its rarity, its discreteness as a behavioural trait, and the mechanisms by which such an unusual ability is acquired and maintained. A substantial body of literature has explored AP regarding environmental and heritable predisposing factors (Baharloo et al., 1998; Brown et al., 2002; Deutsch et al., 2009; Levitin & Zatorre, 2003; Miyazaki et al., 2012; Vanzella & Schellenberg, 2010; Vitouch, 2003; Wilson et al., 2012), its relationship to other musical skills (Dohn et al., 2014; Dooley & Deutsch, 2010, 2011; Jiang et al., 2010; Miyazaki, 2004b; West Marvin et al., 2020; Ziv & Radin, 2014), cognitive correlates (Benassi-Werke et al., 2012; Brancucci, di Nuzzo, et al., 2009a; Burnham et al., 2015; Deutsch & Dooley, 2013; Greber & Jäncke, 2020; Hou et al., 2021; Hou et al., 2014; Hutka & Alain, 2015; Wenhart & Altenmüller, 2019), and neuroanatomical markers (Bermudez et al., 2009; Brauchli et al., 2019; Burkhard et al., 2019, 2020; Dohn et al., 2015; Elmer et al., 2015; Greber et al., 2018; Jäncke et al., 2012; Kim & Knösche, 2016; Leipold et al., 2019a, 2019b, 2019c; Maeshima et al., 2018; McKetton et al., 2019; Schulze et al., 2013; Wengenroth et al., 2014; Wilson et al., 2009).

Among the earliest AP research was the suggestion that AP is at least partly heritable due to its appearance in early childhood without deliberate practice, and its tendency to accrue in musical families (Bachem, 1940, 1948; Boggs, 1907; Seashore, 1939, 1940). Bachem (1940), for example, observed that 39% of AP possessors in a sample of 103 had relatives with AP. Subsequent findings have also supported a heritable component, including proposed models of inheritance and chromosomal loci of interest (Baharloo et al., 1998, 2000; Bairnsfather et al., 2022a, 2022b; Gregersen et al., 1999, 2013; Profita & Bidder, 1988; Theusch & Gitschier, 2011; Theusch et al., 2009). While this is certainly suggestive of genetic variants for AP, further progress in this area has been hindered by a lack of consensus regarding the AP phenotype.

AP is of particular relevance in the study of individual differences. Exploration of trait heritability has, over time, moved from classical twin modelling (Polderman et al., 2015) to genome-wide association studies (Abdellaoui & Verweij, 2021), with both approaches showing that behavioural traits are broadly heritable. AP is a useful model to explore this heritability, as it is a rather discrete behavioural trait which has been documented to run in families. As such, it is important to accurately phenotype AP, both due to the intrinsic fascination of the ability itself, and for its potential applicability to broader behavioural genetics research.

Phenotyping refers to efforts to classify observable characteristics in behavioural traits and syndromes, and is a necessary foundation on which to build an understanding of a trait’s biological mechanisms and genetic influences. The AP phenotype, conceptually described as pitch identification without a reference, is typically measured by participant performance on a behavioural task, most commonly pitch-naming (Takeuchi & Hulse, 1993). In such a task, participants are presented with a series of auditory pitches and are required to identify their musical labels (e.g., G, B flat). Individuals with AP should be able to complete this task effortlessly and with a high level of accuracy, in the absence of external aids. Those without any AP ability can only guess their responses and are therefore expected to perform around chance level (1/12 or 8.3% as there are 12 chroma, or pitch classes, in the Western musical scale). While this appears to be a clear phenotypic distinction, multiple factors make the delineation of an AP phenotype more complex.

First, some individuals are able to identify pitches above chance, but below typical AP-levels of accuracy. These individuals are variously referred to as possessing quasi-AP (QAP, (Bachem, 1937), partial AP or white-key note AP (Miyazaki, 2004a), raising the idea of multiple phenotypes. QAP possessors are thought to be able to identify some, but not all chroma, and may be able to use relative pitch strategies to identify unknown chroma from an internal reference of their preferred chroma (Bairnsfather, Osborne et al., 2022; Wilson et al., 2009). QAP has been considered in relatively few investigations of AP. Thus, it has not been established whether it can be reliably distinguished from AP using a pitch-naming accuracy threshold (as in Aruffo et al., 2014; Bairnsfather, Osborne, et al., 2022; Chavarria-Soley, 2016; Leipold, Oderbolz, et al., 2019; Wilson et al., 2012; Wilson et al., 2009), or whether AP and QAP should be considered along a pitch-naming continuum (for a recent discussion, see Van Hedger et al., 2020).

Second, even when intermediate pitch-naming performance is not explicitly included, accuracy thresholds for AP possession vary across studies (e.g., 90%, Aruffo et al., 2014; 68%, Athos et al., 2007). While AP possessors are expected to be highly accurate, the precise level of performance required has not been agreed upon. Thresholds can also be rendered less conservative by including credit for small errors. As AP possessors age, they may report a shift in the accuracy of their internal pitch templates, prompting them to make occasional pitch-naming errors (Athos et al., 2007). Some researchers choose to compensate for this by assigning partial or full credit to errors within a semitone (a distance of one chroma) of the correct response for all participants or those within specific age ranges (Athos et al., 2007).

Aside from scoring and threshold concerns, the characterisation of an AP phenotype is further hindered by the lack of a gold-standard pitch-naming task. One of the most salient task characteristics is the timbre of the presented stimuli. Although some individuals with AP can identify a predominant pitch in environmental sounds, such as a spoken voice or car engine (Heaton et al., 2008), studies generally use either pure tones (e.g., Burkhard et al., 2019) or synthesised or recorded instrumental tones (e.g., Deutsch et al., 2006). Pure tones are chosen as their lack of additional harmonics or timbral features (Baharloo et al., 1998) ensures that no additional cues are used to help participants identify their pitch. Instrumental timbres are rich in contextual detail, and are often easier to identify (Wilson et al., 2012). These are chosen for ecological validity, as they are more representative of musical sounds heard on a daily basis and thus may be better able to capture the extent of an individual’s ability. Other task characteristics related to both the stimuli and to task administration also vary and contribute to ongoing difficulties characterising the AP phenotype.

Given the high degree of heterogeneity among pitch-naming tasks and thresholds, it is unsurprising that a consensus regarding the AP phenotype has not yet been reached. To further advance AP research, particularly the search for genetic variants and biological mechanisms, a phenotype (or phenotypes) must first be clearly defined and accepted. Using a consistent definition, task parameters and thresholds across studies ensures that findings are comparable and improves replicability in the field. An important first step in this endeavour is to catalogue the current tasks used to profile AP and examine their effects on phenotype identification. In this systematic review, we therefore aimed to 1) investigate the methods and replicability of pitch-naming tasks for the assessment and classification of individuals with AP; and 2) examine the ways in which variability in methods impacts our understanding of the AP phenotype.

Methods

Eligibility criteria

We included empirical, peer-reviewed original research in which AP was a primary outcome measure, as determined by its inclusion in the title or abstract, excluding case studies and case series. We excluded theses, abstracts, and conference proceedings, and studies published in languages other than English.

Studies were restricted to those with neurotypical adult participants with normal hearing, excluding populations such as those with synaesthesia or autism spectrum disorder.

We were interested in how studies assessed and classified AP, so we focused on studies that used: a) a pitch-naming task; and/or b) self-report to determine AP. Studies using self-report alone were included to assess the extent to which our understanding of the AP phenotype is drawn from research that does not include an objective measure of pitch-naming. Studies that developed novel AP measures were included if they also screened their participants with a pitch-naming task, with only the pitch-naming task included in this review. Pitch-naming tasks were limited to those that used conventional Western tuning, excluding those that incorporated stimuli mistuned from the 12 standard chroma, or those that had fundamental frequencies removed. Studies that attempted to either teach AP to novices or to pharmacologically alter pitch perception were excluded, as were studies that screened for AP to exclude AP possessors from their sample. Studies that investigated latent/implicit pitch memory were not considered to be studies of AP as commonly defined.

Search strategy

We searched the following databases using the search terms “absolute pitch” OR “perfect pitch” on 31 October 2019, restricting results to those published since 1992 to span a 30-year period including: Scopus, PsycInfo, ProQuest Music Periodicals Database, Music Index and JStor (search performed 2 November 2019). This search was repeated on 31 January 2022 and 23 May 2024 to capture any studies published since the original search.

Data collection

Study selection

JB screened the search results for duplicates and removed irrelevant papers based on title. This author then screened by title and abstract to determine articles to be retrieved for full-text search and evaluated these full-text results based on the inclusion criteria. The determination of whether AP was a primary outcome measure was agreed upon by discussion with all authors.

Data extraction

JB extracted data from the selected studies using a template agreed upon by all authors, as shown in.

Table 1 Data were extracted from information available in the published paper and were augmented by raw data or supplementary materials where these were available on the relevant journal website. Where studies reported that their methods were available in a previously published paper, these details were extracted and included in the data for the citing study.

Table 1 Details extracted from selected studies

Search results

Searches run on 31 October and 2 November 2019 yielded 3704 records, 2861 of which remained after the removal of duplicates, as shown in Fig. 1A. These records were then screened for relevance, and 266 were retained for full-text retrieval. From these, 128 were removed for not meeting the inclusion criteria. During data extraction, a further eight studies were excluded for the same reason (see Table 2 for details). Additional searches were performed on 31 January 2022 and 23 May 2024 to capture records published since the initial search (see Supplementary Fig. 1). A further 27 articles were identified, resulting in a final total of 157 articles included in the present review. From these, 160 unique studies were identified as three papers included multiple studies with different participants. All included studies are highlighted in the References section, and raw data is included as a supplementary file.

Fig. 1
figure 1

Search strategy for 2019 search. Note. Further details for removal of studies not meeting inclusion criteria can be found in Table 2

Table 2 Reasons for study exclusion

Data analysis

Data were presented graphically to show the variety in approaches to pitch-naming tasks across the literature. Where necessary, means and 95% confidence intervals were calculated from raw data or other reported summary statistics.

To map the relationships between tasks used by different research groups, we developed pitch-naming publication ‘trees’ that identify ‘source tasks’ used in early studies and show the flow of publications that stem from each source task as cited in the methods of each study.

To investigate how differences in task parameters impacted the AP phenotype, the performance means from the studies’ AP groups were compared across each parameter, using correlations or two-tailed t tests as appropriate. Only those studies using the same scoring practices were included in these tests to ensure consistency of comparisons.

All analyses were performed in RStudio (version 2023.06.2 + 561), using packages tidyverse, version 2.0.0 (Wickham et al., 2019), psych, version 2.3.6 (Revelle, 2024), scales, version 1.2.1, (Wickham et al., 2022), forestplot, version 3.1.3, (Gordon & Lumley, 2024), and lattice, version 0.21–8 (Sarkar, 2008).

Results

Definition of AP

The definition of AP was extracted from each study as a direct quote (see Supplementary Table 1). Most studies agreed that AP refers to the ability to identify notes without a reference tone, with some also including the ability to produce notes without reference. In total, 150 of the 151 studies (99%) specifying a definition agreed on this, with a single study providing a definition referencing neither identification nor production, but instead highlighting long-term pitch memory (Wayman et al., 1992). The remaining six studies did not provide a definition for AP.

Participants

Details of study participants are shown in Table 3. Across 160 studies, there was a total of 23,221 participants. This figure, however, does not account for potential participant overlap among studies, so the number of unique participants is likely to be somewhat smaller but difficult to estimate as not all studies reported this. Of the total participants, 6493 were classified as having AP. This figure is an estimate, as some studies reported the percentage of the total sample to have AP rather than raw N. Participants classed as neither AP nor non-AP were classified as intermediate pitch-namers (n = 1133), non-musicians (n = 964), or were in studies that did not group participants into AP categories (n = 3545). These do not sum to the total of 23,221 as some participants were initially assessed but not included in a study’s final group classification. Both AP and non-AP groups were generally quite small in individual studies (median AP group n = 16). Most studies including an AP group used participants with musical experience as a comparison group (137/145 studies, 94%), with 21% (31/145 studies) also including a group without musical experience.

Table 3 Studies and participants

Pitch-naming tasks

Most studies (150/160, 94%) used a pitch-naming task to classify participants as belonging to an AP or non-AP group based on their performance. It should be noted that tasks did not provide feedback to participants throughout pitch-naming procedures. Five studies (5/160, 3%) used a pitch-naming task but considered AP to be a continuous ability, so participants were not divided into AP/non-AP groups. Only three studies (3/160, 2%) relied on self-report alone for AP group assignment, and two studies (2/160, 1%) did not describe how they determined group membership.

Three studies (Keenan et al., 2001; Ngan et al., 2023; Schulze et al., 2013) used two separate pitch-naming tasks to assign AP group membership. As these were part of the group determination, rather than novel tasks designed to follow initial AP classification, both tasks are considered here. Both tasks used sine tones of equal duration, but differed in the pitch range of stimuli and number of trials.

Pitch-naming publication trees

In total, 157 tasks were used to measure pitch-naming performance. Of these, 95 (61%) were either direct replications or adaptations of previously published tasks. Over a third of the pitch-naming tasks described in the literature were therefore either novel or did not explicitly cite a basis for their pitch-naming methods. Pitch-naming publication trees showing the relationships between tasks are shown in Fig. 2 and Supplementary Fig. 2.

Fig. 2 
figure 2figure 2

Publication trees showing relationships among tasks and their replication. Note. Studies are linked by arrows, with the arrowhead pointing towards the study that cites the previous study’s task. Dotted lines are for readability and are used the same way as solid lines. (A) Tasks used in multiple subsequent studies. Tasks that are direct replications of their parent task are in plain text, while those that are adaptations are in grey. (B) Tasks derived from reviews. No replication/adaptation distinction is made here as the source papers do not include specific tasks

The first of these set of publication trees (Fig. 2A) indicates that while six influential papers have formed the basis of many pitch-naming tasks, there is limited replication across research groups. However, these tasks are used in subsequent publications by the same research groups. In particular, Fig. 2(A) shows that there are no links between the six publication trees, rather only links within each tree. A further issue is the degree of modification to the source task in subsequent studies, which limits the extent to which a source task can be said to be ‘replicated’. Modifications may be minor, such as Athos et al. (2007) shifting the Baharloo et al. (1998) paradigm to be delivered online or adjustments that change the number and range of trials as well as stimulus timbre (e.g., Weisman et al., 2012). Some modifications are subsequently employed across multiple papers (e.g., a single adaptation is used across all adapted tasks derived from the shared Oechslin, Imfeld et al., 2010/Oechslin, Meyer et al., 2010a, 2010b paradigm). Figure 2(B) shows tasks derived from reviews rather than individual studies. These reviews synthesise an understanding of methodological choices and inform how subsequent researchers choose to construct their own distinct tasks. Further tasks are included in Supplementary Fig. 1, each of which has been used or modified in a limited number of subsequent studies.

Characterising the AP phenotype

Scoring

The most common method of scoring pitch-naming tasks was to count the number of correct responses. Other scoring methods, usually used in conjunction with the total accuracy score, included mean absolute deviation from the target tone (e.g., Bermudez & Zatorre, 2009; Dohn et al., 2014), internal consistency of responses (rather than objective correctness, used in Burns & Campbell, 1994), and tallying octave errors (e.g., Bahr et al., 2005).

In considering total accuracy scores, some were raw scores consisting of the sum of correct responses, while others also assigned partial or full credit for semitone errors. Out of 151/157 tasks reporting total scores, 41/151 (27%) assigned semitone credit when determining AP group membership, with the remaining 110/151 (73%) using raw scores. Some studies used raw and semitone-credit scoring systems (e.g., Li, 2021) but used raw scores alone to classify AP. Credit for semitone errors included 0.25 points (n = 1), 0.5 points (n = 9), 0.75 points (n = 9), one full point (n = 13), or varying credit depending on participant age (n = 5).

Thirty-two tasks (32/151, 21%) required participants to identify the octave alongside the chroma label, but octave errors were usually considered a separate metric rather than contributing to the raw accuracy score.

Thresholds

Studies that specified accuracy thresholds to determine AP group membership are shown in Fig. 3. The strictest threshold for classifying AP was 100% pitch-naming accuracy (Matsuda et al., 2013), while the least stringent raw score threshold was 20% (Maeshima et al., 2018). Where semitone error credit was applied, thresholds were less conservative than those that only considered raw scores. For raw scores, the mean AP threshold was 77% (SD = 20, median = 85%), while it was 71% (SD = 16, median 68%) for studies including semitone error credit.

Fig. 3 
figure 3

Accuracy thresholds used across studies. Note. Scores classified into AP, non-AP, and intermediate groups are shown by shaded bars. Green bars refer to AP performance, red bars to non-AP performance, and teal/blue/orange refer to intermediate groups. Lighter versions of the colours (e.g., Gruhn et al., 2018) indicate studies for which semitone error credit was applied. Hou et al., (2014, 2021, 2023) include a cross-hatching over the AP group to denote that only white-key notes were used in this task. Diagonal fill indicates that no non-AP groups completed the pitch-naming task in these studies. Asterisks next to study names indicate that additional metrics were used beyond these thresholds to determine group membership. aThis paper is represented twice as it contains two separate studies. bN includes a non-musician group that did not complete the pitch-naming task

While most studies included both AP and non-AP performance groups, only 32/95 (34%) that included an AP classification threshold also specified a threshold for non-AP performance. Eighteen studies (19%) of the 95 reporting AP thresholds also considered intermediate performance levels (e.g., QAP). This was usually classed as a single intermediate group, although was sometimes further broken down into 10% performance bands (Miyazaki et al., 2012, 2018).

Mean performance

While thresholds represent the potential limits of classification of performance on a pitch-naming task, some studies also reported actual participant performance. Figures 4, 5, 6 and 7 contain forest plots of mean AP and non-AP participant performance with 95% confidence intervals (CI). Separate plots are shown for studies that used raw accuracy scores (Figs. 4, 6) and those that included semitone error credit (Figs. 5, 7) given these metrics are not directly comparable.

Fig. 4
figure 4

Mean performance of AP participants in studies using raw accuracy scores. Note. Error bars are 95% confidence intervals around the mean (omitted when relevant data were unavailable). The mean is shown by the grey vertical line

Fig. 5 
figure 5

Mean performance of AP participants in studies assigning credit to semitone errors. Note. Error bars are 95% confidence intervals around the mean (omitted when relevant data were unavailable). The mean is shown by the grey vertical line

Fig. 6 
figure 6

Mean performance of non-AP participants in studies using raw accuracy scores. Note. Error bars are 95% confidence intervals around the mean (omitted when relevant data were unavailable). The blue vertical line indicates chance performance (8.3%), and the mean is shown by the grey vertical line

Fig. 7 
figure 7

Mean performance of non-AP participants in studies assigning credit to semitone errors. Note. Error bars are 95% confidence intervals around the mean (omitted when relevant data were unavailable). As chance performance varies according to the amount of credit assigned to semitone errors, a chance line is not included. The mean is shown by the grey vertical line

The mean AP performance across 58 studies was 85.9% (95% CI 83.1–88.8%), while performance in the 25 studies assigning semitone error credit was 89.1% (95% CI 86.3–91.8%). The mean non-AP performance across 50 studies was 17.0% (95% CI 14.0–20.0%) based on raw accuracy scores (where chance performance is 8.3%) and 24.5% (95% CI 19.7–29.3%) for the 19 studies assigning semitone error credit.

The influence of task parameters on the expression of the AP phenotype

Pitch range

The majority of tasks specified the stimulus pitch range (139/157 tasks, 89%). As shown in Fig. 8(A), almost all tasks reporting a range included the central octave (C4–B4), with the exception of six studies: one that used just one trial in its pitch-naming task (Van Hedger et al., 2016), another that used ten specific chroma between G#1 and G6 (Di Giuseppe Germano et al., 2021), one which used two tasks – the first of which used the range C5 – B5 (Ngan et al., 2023), and three that used only the white notes from the central octave (Hou et al., 2014, 2021, 2023). The range varied from a single octave to over eight octaves, exceeding the range of a piano, with most studies using a range that spanned three octaves (see Fig. 8A). A Pearson correlation analysis of the pitch range and mean performance of the AP group in each study using raw accuracy scores (n = 50) showed that performance did not differ according to the range used (r(48) = – 0.14, p = 0.320, 95% CI [– 0.41, 0.14], see Fig. 8(B) for all studies regardless of scoring method).

Fig. 8
figure 8

Pitch range of pitch-naming task stimuli. Note. (A) The pitch range as reported for 139/157 tasks. Each blue line represents a single task, with endpoints representing the upper and lower limits of each task’s specified range. Middle C (C4) is indicated with a red vertical line, while the range of a piano is shown by green vertical lines. (B) The correlation between mean pitch-naming performance and task stimulus range for all tasks regardless of scoring method, n = 77

Timbre

The timbre used for pitch-naming stimuli was a highly reported task parameter, with 151/157 tasks (96%) reporting this information. As shown in Fig. 9(A) the two most common timbres were sine and piano tones, used as stimuli in 125/151 (83%) tasks. Other timbres used were synthesized complex tones and voice. The remaining 19 studies (13%) used multiple timbres within their tasks. Of these, 16 included piano tones, and 11 included sine tones. Other timbres included: triangle tones (3) harpsichord (1), guitar (2), violin (6), organ (2), unspecified woodwind (2), unspecified brass (2), voice (5), unspecified string (1), cello (1), flute (2), clarinet (1), bassoon (1), trumpet (1), trombone (1), French horn (1), tuba (1), “random” (1), synthesised complex (1), viola (3), synthesised voice (2), smooth tones (1), and participant’s own instrument (1). Among studies reporting mean performance scores for raw accuracy, those using piano tones only (n = 43) reported higher scores (M = 93.0, SD = 6.1) than those using sine tones only (n = 39, M = 81.4, SD = 9.4; t(43.97) = 5.06, p < 0.001, 95% CI [6.97, 16.21]). Figure 9(B) shows the mean performance of AP groups across all timbres, without accounting for different scoring methods.

Fig. 9 
figure 9

Timbres used in the pitch-naming tasks. Note. (A) Proportion of tasks using different timbres. (B) Mean pitch-naming performance of AP groups for these different timbres. Error bars are 95% confidence intervals around the mean

Number of trials

Stimulus presentation and response characteristics of the pitch-naming tasks are shown in Table 4, including details of the number of task trials. Across tasks for which the number of trials was reported (152/157, 96.8%), the most common number was 108 trials, as a result of multiple published studies using the 108-trial paradigm developed by Oechslin, Imfeld et al. (2010; Oechslin, Meyer et al., 2010a, 2010b; see Fig. 2(A)). This paradigm presents each chroma nine times. Importantly, the majority of tasks included sufficient trials for each chroma to be presented more than once (≥ 24 trials), with a median of 60 trials.

Table 4 Stimulus presentation and response characteristics of the pitch-naming tasks

The relationship between the number of trials and pitch-naming performance is shown in Fig. 10, incorporating all studies reporting the mean for their AP group regardless of scoring method. The figure shows a strong negative correlation with performance accuracy sharply decreasing with an increasing number of trials (r(82) = – 0.47, p < 0.001, 95% CI [– 0.62, – 0.28]). This remained significant even when excluding the outlier in the bottom right of the figure (r(81) = – 0.26, p = 0.02, 95% CI [– 0.45, – 0.05]; Bahr et al., 2005). When considering those studies reporting mean raw accuracy scores alongside the number of trials (n = 57), the negative correlation between the number of trials and task performance remained large (r(55) = – 0.64, p < 0.001, 95% CI [– 0.77, – 0.46]).

Fig. 10
figure 10

Mean pitch-naming performance of AP groups for tasks with varying numbers of trials

Stimulus duration

Summary statistics for stimulus duration indicate that most studies used either 500- or 1000-ms tones (see Table 4), although this characteristic was less consistently reported than other parameters (119/157 tasks, 75.8%). A Pearson correlation analysis indicated that pitch-naming task accuracy for studies using raw scores (n = 46) was not associated with stimulus duration (r(44) = 0.00, p = 0.974, 95% CI [−0.29, 0.29]; see Fig. 11 for all tasks regardless of scoring method (n = 61).

Fig. 11
figure 11

Mean pitch-naming performance of AP groups for tasks with varying stimulus duration

Response window

While the majority of tasks (100/157, 63.7%) included information on response windows (see Table 4), it was somewhat difficult to quantify the typical period allowed for participant responses. This was because it was often unclear whether the response window was inclusive of the duration of the presented stimulus, with inconsistent reporting between studies purportedly using the same task. Based on studies using raw scores in which this information was clear (n = 40), a Pearson correlation analysis showed that pitch-naming accuracy did not differ according to the length of the permitted response window (r(38) = – 0.26, p = 0.100, 95% CI [– 0.53, 0.05]). Figure 12 shows all tasks regardless of scoring method (n = 59), excluding the three tasks that reported responses as self-paced.

Fig. 12
figure 12

Mean pitch-naming performance of AP groups for tasks with varying response window

Response method

Details of how participants were asked to name pitch stimuli were provided for 119/157 (76%) tasks. The most common method was for the participant to write the chroma name down (52 studies). Other methods included: (i) indicating the correct chroma on a piano keyboard (either a physical [muted] or visual representation; n = 19); (ii) selecting an onscreen chroma label (n = 23); (iii) pressing a labelled computer key or response button (n = 13); (iv) responding verbally (n = 9); or (v) writing the correct note on a musical staff (n = 3). One study presented its pitch-naming task twice to participants, one using an onscreen label to record responses and the other using an onscreen piano keyboard, with no performance difference found between these response methods (Brancucci et al., 2009a, 2009b). Pitch-naming accuracy performance for the various response methods is shown in Fig. 13. It indicates that variability is lowest when the response is a labelled button, although this is somewhat misleading as two of the nine tasks using this method and reporting mean AP group performance used the same sample (Hsieh & Saberi, 2008a, 2009), while another two were separate tasks completed by the same participants in the same study (Ngan et al., 2023). Use of a piano key response produced the least accurate and most variable responding, though these studies also used low thresholds for AP group membership (40–79%), which may explain the relatively low performance here. Overall, AP group performance was variable across all response methods, and as the number of tasks per response method is limited (from n = 2 to n = 26 per method that report mean AP group performance), it is difficult to draw firm conclusions regarding the relationship between response type and pitch-naming accuracy.

Fig. 13 
figure 13

Mean pitch-naming performance of AP groups across response methods. Note. All tasks reporting the mean for their AP group are included in this figure regardless of scoring method (n = 66). Error bars are 95% confidence intervals around the mean

Inter-trial distracter stimuli

Most studies did not include additional auditory stimuli between response trials (see Table 4). Of the 32/157 (20%) that did, 18 used brown noise, seven used white noise, three used distorted tones, and four used a rapid sequence of notes or glissando. As shown in Fig. 14, pitch-naming accuracy was lower in studies using distracter stimuli, with a comparison between studies reporting raw accuracy scores (n = 59) indicating a significant difference (Mnosound = 88.79, SD = 9.88, Msound = 82.06, SD = 11.00; t(45.92) = 2.40, p = 0.021, 95% CI [1.08, 12.37]).

Fig. 14
figure 14

Mean pitch-naming performance of AP groups according to the presence of a distracter sound. Note. All studies reporting the mean for their AP group are included in this figure regardless of scoring method. Error bars are 95% confidence intervals around the mean

Discussion

In investigating the methods of pitch-naming tasks, we found that researchers near-universally agree on the conceptual definition of AP. The ability of AP possessors to identify and label isolated musical tones is the foundation of our understanding of this phenotype. The broad definitional uniformity across studies of AP is, in a sense, a validity check of the selection criteria for this review. One would expect a degree of consensus across studies for which AP is a primary outcome measure, and which largely rely on pitch-naming tasks as the method of assessment. There is a lack of coherence, however, in how this core feature is best measured by pitch-naming tasks. With close to 40% of studies using unique pitch-naming paradigms, there is a sense that AP studies are constantly ‘reinventing the wheel’.

The high degree of heterogeneity in both pitch-naming methods and the accuracy of AP group performance reflects a relative lack of maturity in the field of AP research. Linden and Hönekopp (2021) argue that high heterogeneity indicates a mismatch between data and concept, and that reduction of this disparity is necessary for fields of research to progress. Although we did not employ formal heterogeneity measures for effect size (e.g., I2, Borenstein et al., 2021), the data in this review nevertheless point to a heterogeneous understanding of AP in how we translate a broad conceptual understanding to a specific, measurable phenotypic index. To move AP research to a more mature field of study, we must explore the sources of this heterogeneity and address them from both a methodological and theoretical perspective. Our review aims to primarily target the methodological aspect of this challenge, though our recommendations below are theoretically informed.

A major contribution of this review is to demonstrate how variability in methodological choices for specific task parameters impacts the expression of the AP phenotype. Perhaps the most striking example of this is the choice of accuracy thresholds for AP classification. Figure 4 presents a clear picture of the heterogeneity in the levels of pitch-naming performance considered to characterise AP. We acknowledge that this is a somewhat simplified view, as some studies use multiple metrics to classify AP rather than thresholds alone (e.g., response time in Van Hedger et al., 2018; mean absolute deviation alongside accuracy scores in Chavarria-Soley, 2016). Even taking this into account, however, it is notable that the AP phenotype is often characterised by pitch-naming performance that overlaps with other partial phenotypes or even non-AP performance, especially when scoring differences such as semitone errors are considered. These scoring differences make it difficult to directly compare studies across various scoring metrics. If researchers choose to assign credit to semitone errors, reporting would be improved by including both raw and error-corrected scores. This would ensure that studies can be more easily compared, rather than dividing them into raw and error-corrected categories, as we have needed to do here.

Exploring participant performance yields information beyond the potential limits defined by thresholds. Analysis of mean performance shows that pitch-naming ability is a dimensional trait, with scores lying along a spectrum from chance to ceiling, regardless of participant classification into AP and non-AP groups. To score above chance on a pitch-naming task, participants must have some degree of pitch-naming ability. However, the forest plots and threshold plot show that above-chance participants are sometimes included in the ‘non-AP group’, which reduces the discriminatory power of studies to find differences between AP and non-AP participants, and thus accurately characterise the AP phenotype. Furthermore, the use of thresholds (particularly a priori thresholds) assumes that AP can meaningfully be divided into discrete categories, perpetuating dichotomous AP/non-AP classification in a somewhat circular manner between measurement and conceptualisation and reinforcing existing definitions of AP phenotypes. This review acts as an essential step in the development of a taxonomy of AP phenotypes, stepping away from the problem of circularity.

A further complication is that non-AP scores have not been consistently reported, so the performance of the comparison group cannot always be gauged from the published work. Ideally, thresholds should be set at or around chance performance (8.3%) to ensure that all degrees of pitch-naming ability are being captured, including intermediate forms, such as QAP. Consideration of participants across the full range of ability would then allow characterisation of different pitch-naming phenotypes that differ in their pitch-naming accuracy, as well as the cognitive strategies used, and the extent of specificity to contextual cues such as timbre. Data-driven techniques such as taxometric analysis (Ruscio et al., 2006) should be used to assess the extent to which these phenotypes are discrete by employing multiple AP tasks, rather than relying on a threshold from a single pitch naming task that is necessarily arbitrary. This will move the field towards a more robust method of phenotyping AP.

Each task parameter investigated in this review varied considerably across studies, including in how consistently it was reported. No single parameter was described across all 157 tasks. The number of trials was reported most reliably, followed by stimulus timbre, pitch range, participant response method, stimulus duration, and response window. Omitting key details from published methods reduces our ability to assess the replicability of findings and contributes to the continued development of novel pitch-naming tasks, as evident from the pitch-naming publication trees.

Due to the significant variability among pitch-naming task methods, it is difficult to assess the effects of specific task parameters on expression of the AP phenotype. However, where possible, we examined whether systematic variation in a specific task parameter was associated with varying expression of the AP phenotype. From this, we have derived some initial recommendations for future studies to promote greater homogeneity in measuring the AP phenotype by endorsing key characteristics that should be captured and reported by a gold-standard task.

Recommendations for pitch range and task trials

The pitch range of the stimuli and the number of trials over which they are presented are a matter of content validity – that is, whether pitch-naming tasks adequately canvas the range of behaviours AP possessors are expected to exhibit. The conceptual definition of AP does not place an upper or lower limit on the number or pitch range of chroma that AP possessors are expected to be able to identify. It is assumed that AP possessors can effortlessly identify all 12 chroma (although see Miyazaki, 1988 for a discussion regarding preferred chroma even among highly accurate AP participants). As such, a bare minimum of 12 trials, each representing a different chroma, would be a basic starting point. However, a single trial per chroma is unlikely to be sufficient to fully capture participant ability. In particular, a limited number of trials may mask the variability of intermediate-level performance, and thus we caution against relying on too few trials per participant. The reduction in pitch-naming performance as trial numbers increase is somewhat more challenging to interpret. This effect is largely driven by the high degree of variability across studies using 108 trials. This number of trials is shared across multiple paradigms, including those derived from Bermudez and Zatorre (2009) and Oechslin, Imfeld et al. (2010)/Oechslin and Meyer et al., (2010a, 2010b) as shown in Fig. 2(A). This effect therefore is likely to reflect the popularity of tasks using this number of trials (allowing each chroma to be presented nine times) across different accuracy thresholds (from 40%, Kamiyama et al., 2010; to 90%, Coll et al., 2019), rather than an implication that actual participant performance decreases as trial numbers increase. This could be further investigated by checking performance in earlier versus later trials in lengthier paradigms.

The pitch range that trials should cover is similarly unclear, with no significant impact of pitch range on mean task performance shown across studies. Most studies reporting range included, at minimum, the central octave on the piano (C4–B4). Previous research has indicated that pitch-naming accuracy tends to decline at the extremes of the pitch range (Miyazaki, 1989; Takeuchi & Hulse, 1993; West Marvin et al., 2020), but there is no clear expectation of the range in which good performance should occur in AP possessors. Rakowski and Rogowski (2011) suggest a five-octave range based on an investigation using sine tones, in which participant accuracy declined beyond this range. They did, however, note heterogeneity in performance among their small sample, so this is not a universal feature of AP possessors. It is therefore yet to be established whether a single-octave range is too narrow to appropriately gauge AP, or if an eight-octave range unnecessarily contributes to an excessive number of trials. Stimulus range may be pertinent to distinguishing between phenotypes, as per the suggestion of ‘universal’ versus ‘limited’ (range) AP (Bachem, 1937). As such, tasks aiming to make this distinction should include stimuli across a wide pitch range, and include range-related accuracy analysis rather than just raw task-wide performance. At this point, however, the contributions of contextual factors such as range to AP phenotypes need to be further elucidated, so a separate pitch-naming task that measures the limits of range may be appropriate alongside a gold-standard pitch-naming task that is comparable across studies. Based on the most common range and trial numbers among studies in this review, we recommend a pitch range that captures three octaves and uses at least five trials per chroma. This balances: i) the need to canvas a sufficient range that most AP possessors can be expected to identify; ii) multiple trials per chroma; and iii) a sufficiently short task duration to enable additional tasks to be administered as needed to distinguish specific phenotypes.

Recommendations for stimulus timbre

Stimulus timbre, as shown in this review, is largely divided between piano tones and sine tones, with scores on tasks using a piano timbre exceeding those using sine tones. This highlights a divide in the understanding of how AP is conceptualised – prioritising either stimulus ‘purity’ or ecological validity. The ecological validity argument emphasises the importance of context for the AP phenotype, not only in terms of timbral cues but also the context in which the long-term pitch memory was originally encoded. There is strong evidence supporting a critical or sensitive period for AP acquisition, including early practice on the piano (Bairnsfather et al., 2022a, 2022b; Deutsch et al., 2006; Levitin & Zatorre, 2003; Russo et al., 2003; Vanzella & Schellenberg, 2010; Wilson et al., 2012). This indicates that early environmental factors are influential in shaping the expression of the AP phenotype, suggesting AP is a contextually learned behavioural skill rather than a purely psychophysical phenomenon. Neural encoding of the pitch template may be contextually specific, as is seen in increased cortical representations among musicians for piano, but not sine tones (Pantev et al., 1998). The idea of ‘universal’ and ‘limited’ AP phenotypes is again relevant here (Bachem, 1937). Some AP phenotypes may be more contextually bound than others, with individuals able to identify chroma across a limited or broad range of timbres. Tests of AP should therefore aim to capture this variability, with piano tones or other personally tailored timbres more able to do this than context-devoid sine tones. Supporting this, studies including both piano and sine tones generally show a drop in sine tone performance accuracy (Athos et al., 2007; Hsieh & Saberi, 2008b; Lee et al., 2011; Miyazaki, 1989), reflecting that use of sine tones alone risks failing to fully capture the AP phenotype. We recommend that a gold-standard AP task should use the piano timbre as a contextually relevant, ‘neutral’ stimulus. Additional timbres, particularly sine tones, can be utilised in subsequent, specific tasks, depending on the phenotype targeted in individual studies. This would allow the potential limits of AP phenotypes to be tested.

Recommendations for stimulus duration, participant response methods, and distracters

Other task characteristics, such as stimulus duration and response window, also vary between studies, and are less frequently reported than timbre, range, and the number of trials. In this review, these parameters of pitch-naming tasks do not significantly contribute to differences in phenotypic expression, although noting the response window was not always reported clearly. Most commonly, it is not specified whether the permitted response window is inclusive of the time to deliver the stimulus (e.g., Dohn et al., 2012 vs. Dohn et al., 2015). Inclusion of a schematic clearly showing task presentation and timing, as is common in cognitive psychology paradigms, would reduce this ambiguity. As duration and response window do not appear to greatly influence the AP phenotype, we recommend using the most commonly reported methods to maximise comparability among studies – 1000-ms stimulus length, with 4000-ms response window excluding the stimulus duration.

The review also shows that response methods vary widely among tasks, with each associated with different levels of pitch-naming accuracy. The reason for these discrepancies is likely multifactorial and associated with other task parameters alongside the response method used, such as the allowed response window (e.g., writing the response on a musical staff requires i) knowledge of music notation, and ii) more time than pressing a response key). Our recommendation is to avoid response methods that disadvantage some participants, such as piano keys that may be less familiar to non-pianists, or staff notation that requires participants to be able to read music. Response methods such as key/button press or clicking an onscreen button may be particularly useful, as they facilitate the precise capture of response time.

As distracter stimuli between trials are associated with lower participant accuracy, this suggests that they are fulfilling their purpose of preventing relative pitch strategies being used across trials. It would be appropriate, therefore, to recommend their use in a gold-standard pitch-naming task.

Limitations of this review

While this review aimed to canvas a large part of the AP literature, it is by no means exhaustive. Further heterogeneity is apparent in studies beyond the scope of the current review, such as those in which AP was not the primary focus of investigation (e.g., Acevedo et al., 2014; Matsunaga & Abe, 2005; Pfordresher & Kobrina, 2017). Such studies are more likely to rely on self-report of AP possession rather than objectively measuring pitch-naming performance. The validity of self-report as a measure of AP ability is a useful question for further research, though first requires consensus regarding the phenotype that self-reported AP possessors claim to have. Attempts have also been made to measure AP beyond pitch-naming tasks, such as pitch production (Heald et al., 2014), a go/no-go discrimination task (Weisman et al., 2012), Stroop-like tasks (Leipold et al., 2019a, 2019b, 2019c; Schulze et al., 2013), and pitch-naming tasks that test the limits of AP by omitting frequencies or mistuning stimuli (Gruhn et al., 2018; Hsieh & Saberi, 2009; Rogowski & Rakowski, 2010). These tasks may be particularly useful in validating preliminary phenotypes characterised using data-driven analysis of pitch-naming performance, and potentially expanding the number of recognised phenotypes or the features of a given phenotype. Such tasks can also be used to explore AP predisposition among individuals without musical training.

Conclusion

As research into the genetic underpinnings of behavioural traits increases, the necessity for well-described phenotypes is of renewed interest. Indeed, among the aims of the recently founded Musicality Genomics Consortium (https://www.mcg.uva.nl/musicgens/) is the development of “scalable and robust phenotypes” and the harmonisation of “existing measures of musicality phenotypes” (https://www.mcg.uva.nl/musicgens/mission.html). This review is therefore timely and shows how far we still have to go in developing phenotypes for AP.

Overall, this review has shown that while there is strong consensus regarding the conceptual definition of AP in terms of its core features, this does not extend to the methods used to measure pitch-naming ability. The concept is extremely broad and captures many aspects of behaviour, lending itself to varied interpretations when attempting to define AP phenotypes and thus, design tasks to capture them. This lack of precision has led researchers to develop disparate metrics and adopt arbitrary thresholds for AP possession, and there remains no gold-standard pitch-naming task with clearly defined parameters and scoring methods. This has resulted in a highly variable body of literature, with a multitude of pitch-naming tasks differing across all parameters. This, combined with differences in scoring and thresholds to qualify a participant as possessing AP, has resulted in substantial heterogeneity in what is considered to be the AP phenotype. Without a well-described and accepted phenotype, behavioural findings may not be comparable or replicable.

The recommendations we have provided are an important initial step in addressing this. In place of a single task that can capture every phenotypic difference, we advocate for a task that is used across the literature and facilitates replication across studies. Specific phenotypic distinctions can be teased out with subsequent tasks that explore facets such as timbral and range differences. A gold-standard AP task should include multiple (we suggest at least five) trials per chroma to appropriately capture performance variability, spanning three octaves to maximise comparability with existing measures. Stimuli should be piano tones, again to maximise replicability, and to ensure that the timbre is contextually relevant across participants. Additional timbres can be considered in further tasks depending on the phenotypes relevant to the research question. Stimulus length should be 1000 ms, with a 4000-ms response window excluding the stimulus duration. While a variety of response methods is likely to be appropriate depending on the research setting (e.g., lab-based versus online task delivery), eliminating the need for participants to be familiar with piano keyboards or music notation will allow the task to be used across a wider range of participants. We also recommend that distracter stimuli are used between trials to ensure that participant performance is not impacted by previously presented material.

Precise phenotyping is vital for genetic research to ensure that shared genetic variants can be confidently linked to AP rather than to broader or related traits. Moreover, the variability in pitch-naming performance suggests that there may be multiple phenotypes relating to the spectrum of pitch-naming ability. Given the degree of heterogeneity in the current AP literature, an important next step is to characterise intermediate pitch-naming ability. This will help to clarify its relationship to AP and establish assist in determining accuracy thresholds for AP classification. Combined with the findings of previous work exploring different types of AP, more precise phenotypes could then be characterised, forming an empirical basis on which to continue the search for genetic variants for AP.