Abstract
Recent studies have used big neuroimaging datasets to answer an important question: how many subjects are required for reproducible brain-wide association studies? These data-driven approaches could be considered a framework for testing the reproducibility of several neuroimaging models and measures. Here we test part of this framework, namely estimates of statistical errors of univariate brain-behaviour associations obtained from resampling large datasets with replacement. We demonstrate that reported estimates of statistical errors are largely a consequence of bias introduced by random effects when sampling with replacement close to the full sample size. We show that future meta-analyses can largely avoid these biases by resampling only up to 10% of the full sample size. We discuss the implication that replicating mass-univariate association studies may require tens-of-thousands of participants, and urge researchers to consider alternative methodological approaches.
Introduction
The question of scientific reliability of brain-wide association studies (BWAS) was brought to the attention of many1,2 by Marek, Tervo-Clemmens et al.3, reigniting discussions4,5,6,7 about the ongoing reproducibility crisis in neuroscience and psychology8,9,10,11,12. Reproducibility is often used as an umbrella term covering replication, namely obtaining similar results by applying the same methods to new data13,14. Independent researchers are failing to replicate BWAS15,16, suggesting that such findings are untrustworthy. Our trust in the scientific field therefore relies on how well we can estimate the replicability of its findings.
For any given study design, recruiting more subjects is a reliable way of increasing the likelihood of replication by reducing sampling variability and increasing statistical power8. For BWAS, which aim to characterise associations between brain measures and behaviours, collecting data is expensive. So, how many subjects are required? How do we know? Thousands are required3,17, according to data-driven approaches which quantify the issue of replicability for BWAS using large neuroimaging datasets from the Human Connectome Project18 (HCP with n = 1,200), the Adolescent Brain Cognitive Development study19 (ABCD with n = 11,874), and the UK Biobank20 (UKB with n = 35,735). Among the numerous analyses in their study, Marek, Tervo-Clemmens et al.3 estimated statistical errors of univariate BWAS as a function of sample size. Such mass univariate BWAS often involve tens of thousands of correlations between a brain measure and a behavioural measure, most of which fail to replicate even with thousands of participants due to small underlying effect sizes. These replication failures can be explained by statistical errors of a study design, such as false positive rates11 and low statistical power8,21,22,23,24,25. To estimate statistical errors in univariate BWAS, Marek, Tervo-Clemmens et al.3 treated a large sample as a population and then drew replication samples by resampling with replacement (henceforth resampling) from that population (Fig. 1). This avoids the expense of collecting new data by instead resampling from a large sample, while using effect sizes from the large sample as replication targets. Statistical power, for example, was then estimated as the proportion of significant effects in the full sample which were significant again in a resample, averaged over 1,000 iterations for each resample size and significance threshold.
Fig. 1. Overview of data-driven estimates of replicability. (a) Schematic of the large samples and resamples involved in Marek, Tervo-Clemmens et al.'s3 methods. (b) Toy example of a single effect, where the population effect size (ρ) is unknown and estimated by large samples. Resamples and data-driven replications are used to estimate the statistical errors underlying replication failures. Note that here replication failures are counted as a significant effect in a replication target (blue) being non-significant in a replication sample (orange), but better measures should be used in future work26. Confidence intervals illustrate uncertainty in effect size due to sampling variability and measurement reliability, showing larger intervals for smaller sample sizes (orange vs. blue). (c) Schematic of the current paper.
However, this data-driven method of estimating statistical errors might not generalise to the real-world scenario of repeated sampling from a population. Importantly, traditional replication involves comparing two large samples from a population, while data-driven replication involves comparing a large sample to its own resamples (Fig. 1a). Data-driven simulations implicitly treat resampling from a large dataset as equivalent to sampling from a population, but these are not the same. As a result, the difference between error estimates obtained from resamples and error estimates obtained from samples of a population is unknown; we refer to this difference as bias in statistical error estimates. Such bias may also affect other recent studies that relied on data-driven resampling methods to estimate statistical power23,27,28 and replicability29 in task-based fMRI, not just in structural or functional association studies. We therefore simulated data as ground truth to quantify bias in estimates of statistical errors.
Results
Resampling methods strongly bias statistical error estimates when there are no true effects
First, we simulated a large null sample with n = 1,000 subjects, each with 1,225 brain connectivity measures (random Pearson correlations) and a single behavioural measure (normally distributed across participants). We correlated each brain connectivity measure with the behaviour across all subjects to obtain 1,225 brain-behaviour correlations. Since brain connectivity estimates and behavioural factors were simulated independently from each other, any resulting brain-behaviour correlations were entirely random. In other words, the population effect size was null (ρ = 0). Data dimensions were chosen to keep the simulations computationally feasible to reproduce; however, we invite readers to adjust these and re-run the analyses using the openly available code (analyses recoded in R with supporting packages30,31,32 for open-source accessibility: https://github.com/charlesdgburns/rwr/). We then resampled our null-sample for 100 iterations across logarithmically spaced sample size bins (n = 25 to 1,000) and estimated statistical errors, following the methods described in Marek, Tervo-Clemmens et al.3. Surprisingly, we observed the same trends in statistical errors and replicability as those reported by Marek, Tervo-Clemmens et al.3, but with random data (see Fig. 2), with statistical power estimates being the most strongly biased.
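To make the procedure concrete, the following is a minimal R sketch of one resampling iteration at the full sample size. It is not the released analysis code: variable names are ours, and for brevity the connectivity measures are drawn directly from a normal distribution rather than computed from random time series as in the Methods.

```r
# Minimal sketch of the null simulation: 1,000 subjects x 1,225 random "edges"
# plus one behavioural score, then power estimated by resampling subjects.
set.seed(1)
n_sub <- 1000; n_edge <- 1225; alpha <- 0.05

edges     <- matrix(rnorm(n_sub * n_edge), nrow = n_sub)   # random connectivity measures
behaviour <- rnorm(n_sub)                                  # independent behavioural factor

# Brain-behaviour correlations and two-tailed P values in the full sample
r_full <- cor(edges, behaviour)[, 1]
p_full <- 2 * pt(-abs(r_full * sqrt((n_sub - 2) / (1 - r_full^2))), df = n_sub - 2)

# One resampling iteration at the full sample size:
# power = proportion of full-sample "hits" that are significant again in the resample
idx   <- sample(n_sub, n_sub, replace = TRUE)
r_res <- cor(edges[idx, ], behaviour[idx])[, 1]
p_res <- 2 * pt(-abs(r_res * sqrt((n_sub - 2) / (1 - r_res^2))), df = n_sub - 2)
mean(p_res[p_full < alpha] < alpha)   # well above the 5% expected from a fresh sample, despite rho = 0
```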
Fig. 2. Estimated statistical errors and reproducibility of random noise (ρ = 0). We reproduce statistical error estimates after resampling from a simulated null-sample where observed significant effects are random (ρ = 0). These results are notably comparable to Fig. 3 in Marek et al.3. (a) False negative rates computed relative to the full sample size decrease as the resample size increases, across a range of significance thresholds which were passed after sampling 1,225 random correlations. (b) Inflation rates are estimated as a proportion of replicated (same sign, and significant in both the large sample and the replication sample) correlations. For a given magnitude of inflation, inflation rates computed relative to effect sizes at the full sample size fall to 0% as the resample size approaches the full sample size. (c) Sign errors, computed relative to the signs of effects at the full sample size, decrease from around the chance level of 50% to about 30% as resample sizes increase. (d) Statistical power computed relative to the full sample size shows a strong upwards trend as resample sizes increase, reaching around 60% for all significance thresholds which were crossed when simulating 1,225 random effects (α = 0.05 to 0.0001). (e) Probability of replication, computed by resampling equally sized 'in-sample' and 'out-of-sample' subsamples from the full dataset, stays low for small significance thresholds but reaches above 10% for α = 0.05. (f) False positive rates, computed by counting correlations which are significant in the resample but not in the full sample, also reach about 10% for α = 0.05 at the full sample size. Note that the darkest lines are drawn after random effects pass thresholds as low as α = 10−7 after resampling from the full sample, which is lower than the Bonferroni correction 0.05/1225 = 4 × 10−5.
Figure 2 suggests that the reported trends in estimated statistical errors do not depend on the absolute sample size, but on the resample size relative to the full sample size. To establish a ground truth against which to assess these estimates, we consider what statistical error estimates we should expect if replication samples were drawn from a population (of infinite size) rather than resampled from a large sample. By repeatedly generating new null-samples, rather than resampling from a single null-sample, we verified that these statistical error estimates are indeed biased under the null as the resample size approaches the full sample size (Fig. 3). We also demonstrate that this bias stems from the act of resampling from a large sample, rather than from how the estimates are computed when comparing different large samples. For example, uncorrected (α = 0.05) statistical power at the full sample size (n = 1,000) was estimated to be 63% when resampling (Fig. 2d), rather than the expected 5% obtained when generating new null-samples (Fig. 3d). Of particular concern, power is the most inflated estimate while also being the most relevant to failed replications8,21,22, which could result in misleading meta-science.
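Continuing the previous sketch (and reusing its objects), the ground-truth comparison simply replaces the resample with a freshly generated null sample, which brings the power estimate back down to roughly α:

```r
# Ground-truth counterpart: draw a *new* null sample instead of resampling the
# existing one, mimicking replication samples drawn from an infinite population.
edges_new     <- matrix(rnorm(n_sub * n_edge), nrow = n_sub)
behaviour_new <- rnorm(n_sub)
r_new <- cor(edges_new, behaviour_new)[, 1]
p_new <- 2 * pt(-abs(r_new * sqrt((n_sub - 2) / (1 - r_new^2))), df = n_sub - 2)
mean(p_new[p_full < alpha] < alpha)   # close to alpha (0.05): the expected power under the null
```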
Fig. 3. Expected estimates of statistical error and reproducibility of random noise (ρ = 0). We obtain estimates of statistical errors under the null by iteratively generating null-samples at increasing sample sizes (n = 25, …, 1,000) instead of resampling from a single null-sample, averaging estimates for each sample size over 100 simulations. This corresponds to sampling from an infinite-size population. (a) Estimated true false negative rates under the null are constant across sample sizes and equivalent to 1 − α for a given significance threshold (α = 0.05, 0.01, 0.001 plotted). (b) Since inflation rates are estimated as a proportion of replicated (same sign, and significant in both large sample and replication sample) correlations, we can expect these to be high for small sample sizes, where the critical r for significance is higher; the likelihood of being inflated therefore decreases as sample sizes increase. Since we are averaging across correlations, few of these will be very inflated while many will be less inflated, so that on average this cancels out to 50% across inflation thresholds. (c) We expect 50% sign errors regardless of sample size, as the sign of a given correlation in a replication null-sample will be random. (d–f) Estimates of statistical power, probability of replication, and false positives are based on proportions of significant correlations in replication null-samples, so in each case the probability of a correlation being significant in a newly generated null-sample is exactly determined by the significance threshold (α = 0.05, …, 10−7).
Compounding sampling variability underlies biased statistical errors under the null
To explain why bias arises under the null, we investigated the underlying brain-behaviour correlations used in the calculation of statistical errors. Here we focused on resampling at the full sample size (n = 1,000), where these biases are most dramatic. As indicated by the false positive rate (Fig. 2f), the null distribution of brain-behaviour correlations is not preserved after resampling at the full sample size (Fig. 4). Instead, resampling subjects and computing correlations again results in a distribution wider than expected (compare Fig. 4a and c). This is because resampling involves two sources of sampling variability, first at the level of the large sample and again for the resampled replication sample (Fig. 1a). For instance, if a correlation in the large sample is randomly observed to be r = 0.11, then resampling participants and computing the same correlation again results in a correlation which varies around r = 0.11 (Fig. 4e).
We can formalise this mathematically as nested distributions33, or a convolution34 of two probability distributions, here approximating Pearson null distributions with normal distributions for analytical simplicity. It then follows that, given a large sample X ~ N(µ, σ₁²), for each observation in our original sample, xᵢ ∈ X, resampling participants and recomputing correlations corresponds to sampling from several distributions Xᵢ ~ N(xᵢ, σ₂²), resulting in a final set of correlations distributed according to X* ~ N(µ, σ₁² + σ₂²). Note that σ₁² depends on the size of the large sample, while σ₂² is determined by the resample size.
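A quick numerical check of this additive-variance argument, under the same normal approximation (the value 1/(n − 1) for the null sampling variance of a Pearson correlation is the standard approximation, not a quantity taken from the released code):

```r
# sigma1^2 is the sampling variance of a null Pearson r in the large sample
# (~1/(n-1) ~ 0.001 for n = 1,000); resampling adds a second term of similar size.
set.seed(2)
n_sub <- 1000
sigma1_sq <- 1 / (n_sub - 1)
x      <- rnorm(1e5, mean = 0, sd = sqrt(sigma1_sq))   # "large-sample" correlations
x_star <- rnorm(1e5, mean = x, sd = sqrt(sigma1_sq))   # one resampled value per correlation
c(var(x), var(x_star))   # ~0.001 vs ~0.002: variances add, widening the null distribution
```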
Fig. 4. Null distributions under resampling with replacement (ρ = 0). (a) Distribution of simulated random brain-behaviour correlations (1,225 total) from a large sample with n = 1,000 subjects for each random Pearson correlation, compared to a Gaussian curve with mean µ = 0 and variance σ₁² = 0.001 drawn in black. (b) We verify that the two-tailed P values of our 1,225 random brain-behaviour correlations are uniformly distributed. (c) Distribution of all 1,225 brain-behaviour correlations computed after resampling subjects at the full sample size (n = 1,000). This distribution is clearly wider than our large-sample null distribution (see (a)). The solid black line shows the distribution expected from compounding sampling variability (see (e)), namely a Gaussian distribution with mean µ = 0 and variance σ₁² + σ₂² = 0.002. (d) Distribution of two-tailed P values of all 1,225 brain-behaviour correlations computed after resampling at the full sample size (n = 1,000). The P value distribution is inflated around 0 due to the wider tails in our distribution of correlations after resampling. (e) To help explain the widened distribution, we track the largest correlation observed in our original null-sample (r = 0.11), plotting the distribution of corresponding brain-behaviour correlations across the 100 iterations of resampling at the full sample size (n = 1,000). The solid black line represents a Gaussian with mean µ = 0.11 and variance σ₂² = 0.001. The combination of variability across iterations (e) and variability in the large sample (a) results in the widened distribution (c) by additive variance.
The influence on statistical error estimates such as statistical power is two-fold. First, random correlations at the tail in a large sample are more likely to be at the tail of correlations in a resample (Fig. 4a and e). This inflates power when estimated as the proportion of significant effects in the large sample which are significant again in the resample (1 – false negative rates). Second, increased sampling variability alone leads to a wider-than-expected distribution of correlations with more extreme tails. These more extreme tails lead to an inflation of P values close to 0 in our resample (compare Fig. 4b and d) when calculated using a standard correlation function (e.g., ‘corr’ in MATLAB). We note that simply correcting for this widened null distribution will over-correct for bias in statistical error estimates when true effects are present (see Supplementary Information).
Bias in ground truth simulations depends on statistical power of the full sample size
While we have shown clear biases when there are no true effects (ρ = 0), this does not imply that there will be biases when true effects are present (ρ ≠ 0). We therefore investigated bias in statistical error estimates when true effects are present, focusing on statistical power estimates. We note that Marek, Tervo-Clemmens et al.3 have already shown that the largest univariate effect is highly replicable even for moderate sample sizes, so there are at least some true BWAS effects in the real world. However, since the population effect size (ρ) remains unknown (Fig. 1b), we simulated a range of scenarios such that roughly 1% (an arbitrary but conservative proportion) of effect sizes were true effects. To have a clear ground-truth separation between null and true effects when estimating error rate bias in each scenario, we sampled 54,778 effects from a null (ρ = 0) and 500 effects from a true effect (ρ ≠ 0), for a total of 55,278 effects (333 choose 2, as in Marek, Tervo-Clemmens et al.3) in each large sample (Fig. 5a). We chose true effect sizes in each case so that they were approximately evenly spaced from 1% to 99% statistical power at the full sample size (n = 1,000; Fig. 5b), as determined by an inverse power analysis (see Methods). Because the bias under the null is affected by the false rejection of null hypotheses, here we adopted a fixed significance threshold after Bonferroni correction, which controls the probability of at least one false positive among all comparisons (the family-wise error rate). We then estimated statistical power by resampling as in Marek, Tervo-Clemmens et al.3 (Fig. 5c) and compared these estimates to analytical power levels (Fig. 5b) to quantify bias in statistical power estimates (Fig. 5d).
We show that the bias in statistical power estimates near the full sample size depends on the effect size of true effects, here corresponding to the true statistical power of effects given a fixed significance threshold and a fixed large sample size (Fig. 5). Very large samples are therefore required to accurately estimate statistical power for very small effects. Power estimates are inflated if the large sample is underpowered, whereas a highly powered large sample may give conservative power estimates. This is driven by a subset of null and true effects in the large sample that are near the significance threshold; due to sampling variability, in a resample these small effects can easily cross the threshold and introduce type I and type II errors. Note that regardless of power at the full sample size, bias in statistical power is largely avoided when subsampling up to around 10% of the full sample size (see also Supplementary Information for subject-level simulations and larger effect sizes).
Fig. 5. Bias in estimated statistical power depends on the effect size of true effects. (a) Simulated large samples are represented for each of the underlying power scenarios. For each scenario, grey violin plots show the distribution of 54,778 random effects and red violin plots represent the distribution of 500 effects sampled from an analytically derived true effect size (see Methods). The dashed line represents the critical Pearson r for a Bonferroni-corrected significance level (α = 0.05/55278). (b) For each scenario, we plot analytically derived power curves across sample size, representing the expected statistical power when repeatedly sampling from the population. Lines are coloured according to the effect sizes of underlying true effects, each effect size corresponding to a given level of expected statistical power at the full sample size (n = 1000, α = 0.05/55278). (c) We estimated power across sample size by simulating the resampling methods in Marek, Tervo-Clemmens et al.3 using a Bonferroni-corrected significance threshold. Lines are coloured as in panel b. (d) To demonstrate bias across different sample sizes, we subtracted analytical power curves (panel b) from the estimated power (panel c), with lines coloured as in panel b. Note that underpowered large samples inflate statistical power estimates near the full sample size.
Discussion
Accurately estimating the reproducibility of scientific methods is critical for guiding researchers' methodological decisions. Our results demonstrate that estimating statistical errors by resampling with replacement from random data results in large biases when the resample size approaches the full sample size. This is explained by compounding sampling variability of test statistics when resampling and its knock-on effects on estimated statistical errors. We further simulate data with true effects to show that statistical power is inflated when the true power of the large sample is low and slightly deflated when true power is high. This could lead to circular reasoning in cases where we must assume we have high statistical power before we can trust the estimate that we have high statistical power. Lastly, we show that this bias is largely avoided when subsampling only up to 10% of the full sample size after Bonferroni correction. This 10% rule of thumb is consistent with the use of resampling techniques in a recent evaluation of statistical power and false discovery rates for genome-wide association studies with hundreds-of-thousands of participants35, as well as with recommendations for 10-fold cross-validation to reduce upwards bias in prediction errors in machine learning36.
What are the implications for the results presented by Marek, Tervo-Clemmens et al.3? Their estimates have been optimistic by an order of magnitude, implying that replicating mass univariate BWAS requires not thousands, but tens-of-thousands of participants. Revisiting their data, we can compare estimates at the full sample size, where we expect the most bias, to estimates at 10% of the full sample size, where we expect no bias. For the strictly denoised Adolescent Brain Cognitive Development (ABCD) sample (n = 3,928), they report around 68% power after Bonferroni correction when resampling at the full sample size (Marek, Tervo-Clemmens et al.3, Fig. 3d). When subsampling from the UK Biobank with a full sample size of n = 32,572, Marek, Tervo-Clemmens et al.3 report around 1% power for n = 4,000 and α = 10−7. We therefore argue that the 68% power reported for the full ABCD sample (n = 3,928, α = 10−7) more likely reflects methodological bias than increased signal after strict denoising of brain data. While the largest BWAS effects may be highly replicable with 4,000 participants, the average univariate BWAS effect is most likely not. Furthermore, our true effect simulations (Fig. 5) also indicate that the UK Biobank estimates at the full sample size (n = 32,572) could be more accurate, with an underlying power likely between 70% and 90% after Bonferroni correction. However, we note that our simulations and data-driven replication methods only account for sampling variability and do not account for measurement reliability37,59,60. Statistical power would be lower for less reliable measures, such as 5-minute resting-state functional connectivity compared to structural brain measures. Ultimately, traditional replication of mass univariate BWAS would require tens-of-thousands of individuals.
Recommendations
We stress that our results only have direct implications for mass univariate association studies using the methods in Marek, Tervo-Clemmens et al.3. These methods are only a small subset of the many options available for studying associations between fMRI brain measures and behavioural measures, warranting further investigation into the replicability of studies using other methods. Here we therefore consider how some methodological choices could explain the lack of power in mass univariate BWAS and influence the replicability of neuroimaging studies25.
First, the study design can have a large influence on replicability by increasing statistical power. For example, inter-individual correlation studies offer "as little as 5%-10% of the power" of within-subject t-test studies with the same number of participants22, giving a power advantage to group-level designs38. Another way to increase power is to relax the stringent significance thresholds required to correct for the many multiple comparisons involved in fMRI39. Studies can therefore focus on fewer pre-selected local brain regions23,40, or on fewer measures which aggregate data across brain regions using networks7,41,42,43 or multivariate pattern analyses3,44,45. Studies that limit their analyses through pre-registration9,46 also tend to be better powered47, suggesting that this practice encourages more careful study design.
Second, data processing in fMRI can have a large effect on the reliability of brain measures and hence on replicability37. For resting-state fMRI, two key confounds are head motion and global signal48,49, which should be carefully controlled for, noting that the choice of de-confounding methods can strongly influence the resulting network measures50,51. A recent comparison of resting-state fMRI network analysis pipelines48 further showed that parcellation choice matters. Brain parcellation reduces the number of brain measures by grouping voxels that share activity patterns in time into parcels. Pervaiz et al.48 recommend low-dimensional independent component analysis (ICA with D = 50 regions) derived in a data-driven manner from the group of subjects within a dataset, in contrast to Marek, Tervo-Clemmens et al.'s3 choice of a pre-determined group-averaged brain parcellation52 (with D = 333 regions). We note that both of these parcellations fail to account for individual-level variation in resting-state functional connectivity53,54. Future analyses could therefore benefit from recent methods55 which do account for variability of networks both between and within participants53,56. A key point here is that different processing choices make different underlying assumptions about brain data57,58, which can affect reliability. For example, different methods may be more or less sensitive to how much data is collected per participant compared to total data across participants59,60. Having considered data processing pipelines which increase reliability, we note that researchers should also consider the reliability of behavioural measures61 when aiming for future studies with greater replicability.
Third, researchers should consider choosing prediction over explanation62, reporting results which are directly aimed at generalising to unseen data rather than relying on statistical inference via null-hypothesis significance testing (NHST) within a sample. Issues with NHST are well documented26,63,64, and, combined with small sample sizes, it leads to underpowered studies that report distorted effect sizes8,11,22,65. Another issue with NHST is that P values may be derived from inappropriate null models (as we saw in Fig. 4); choosing an appropriate null for brain-wide statistics66 is therefore yet another factor worth considering. These issues are largely avoided in a predictive framework, in which the prediction accuracy of a model on a held-out dataset provides a direct estimate of how well the model generalises. A key approach here is cross-validation, a machine learning strategy to prevent a model from overfitting to a single dataset, sketched briefly below. Recent studies have shown that when multivariate BWAS are cross-validated they report effect sizes that are replicable with only hundreds of participants67,68. Predictive models can therefore improve replicability by reporting effect sizes which are closer to true underlying effect sizes; however, they should aim to do so while overcoming the challenge of interpretability64.
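As a toy illustration of this predictive framework (hypothetical variable names; an ordinary least-squares model is used purely for simplicity), a 10-fold cross-validated analysis reports the out-of-sample correlation between predicted and observed behaviour rather than in-sample P values:

```r
# Illustrative sketch: predict behaviour from connectivity with 10-fold cross-validation.
set.seed(3)
n_sub <- 1000; n_edge <- 50
edges     <- matrix(rnorm(n_sub * n_edge), nrow = n_sub)   # random connectivity features
behaviour <- rnorm(n_sub)                                  # independent behavioural score
folds <- sample(rep(1:10, length.out = n_sub))             # random fold assignment
pred  <- numeric(n_sub)
for (k in 1:10) {
  test <- folds == k
  fit  <- lm(behaviour[!test] ~ edges[!test, ])            # fit on training folds only
  pred[test] <- cbind(1, edges[test, ]) %*% coef(fit)      # predict held-out subjects
}
cor(pred, behaviour)   # cross-validated effect size; near 0 here because the data are random
```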
Concluding remarks
Our results show that previous data-driven estimates of statistical errors and replicability may have been optimistic. The implications are striking for univariate BWAS, but after considering the impacts of other methodological choices, it is also clear that investigations of the replicability of wider BWAS methods are required. We urge such meta-analyses to validate their meta-analytic methods, for example against null data, so that they can reliably evaluate the replicability of the scientific methods used in research.
Methods
Simulating null data at subject level
We simulated random associations between a phenotype and simulated functional connectivity measures. We generated a null-sample with n = 1,000 subjects, each with 1,225 edges (random Pearson correlations between 50 random time series) and a single behavioural factor (normally distributed across participants). We correlated each edge with the behaviour across all subjects to obtain 1,225 brain-behaviour correlations. By generating edge connectivity estimates and behavioural factors independently from each other, we ensured that any resulting brain-behaviour correlations were entirely random (ρ = 0), hence obtaining a sample where the null hypothesis is true (i.e., a null-sample).
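A standalone R sketch of this subject-level generation step, assuming an arbitrary time-series length of 200 volumes (our choice; the length is not specified above):

```r
# For each simulated subject, the 1,225 "edges" are the Pearson correlations
# between 50 independent random time series; the behavioural score is drawn separately.
set.seed(4)
n_sub <- 1000; n_node <- 50; n_tp <- 200
simulate_edges <- function() {
  ts <- matrix(rnorm(n_tp * n_node), nrow = n_tp)   # 50 random time series
  r  <- cor(ts)                                     # 50 x 50 connectivity matrix
  r[upper.tri(r)]                                   # 1,225 unique edges
}
edges     <- t(replicate(n_sub, simulate_edges()))  # subjects x edges (1,000 x 1,225)
behaviour <- rnorm(n_sub)                           # independent behavioural factor
dim(edges)   # 1000 1225
```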
Estimating statistical errors
We closely followed the methods of Marek, Tervo-Clemmens et al.3, first running analyses in MATLAB using their code 'abcd_edgewise_correlation_iterative_reliability_single_factor.m' and 'abcd_statisticalerrors.m' (https://gitlab.com/DosenbachGreene/bwas). These analyses were then independently recoded in R with supporting packages30,31,32 for open-source accessibility (https://github.com/charlesdgburns/rwr/). Notably, statistical error estimations involve two-tailed P values derived from parametric null distributions for a given resample size.
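For reference, the parametric two-tailed P value for a Pearson correlation r at a given (re)sample size n can be computed as follows. This mirrors the default t-test used by R's cor.test; the helper function name is ours.

```r
# Two-tailed parametric P value for a Pearson correlation r with sample size n.
p_pearson <- function(r, n) {
  t_stat <- r * sqrt((n - 2) / (1 - r^2))   # t statistic with n - 2 degrees of freedom
  2 * pt(-abs(t_stat), df = n - 2)
}
p_pearson(0.11, 1000)   # ~5e-4: the largest null correlation in Fig. 4e passes alpha = 0.001
```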
Simulating ground truth data with known statistical power
At this stage we take a computationally more efficient approach and simulate summary statistics rather than subject-level data, which allows us to simulate many more true effect scenarios so we can compare estimates with true statistical errors. This approach also lets us increase the number of effects, so we now simulate samples with 55,278 (333 choose 2) effects, the number of resting-state functional connectivity measures which feature in Marek, Tervo-Clemmens et al.3, Fig. 2. The size of true effects was determined by an inverse power analysis with a fixed sample size (n = 1,000) and a Bonferroni-corrected significance threshold (α = 0.05/55278), using a Fisher z-transformation to calculate the critical Pearson r for a given power level (power = 1%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 99%). We derive the minimal effect size required to reach a given level of power, r_critical, using the formula stated below, where Z_α and Z_β are the standard normal deviates for the (two-tailed) significance threshold (α) and the corresponding power level (1 − β), respectively:

r_critical = tanh((Z_α + Z_β) / √(n − 3))
The Fisher z-transformation70, F(r) = atanh(r) = z, was also used to sample Pearson correlations, using the approximation that the z-statistic is asymptotically normally distributed with mean F(r) and standard deviation 1/√(n − 3). The z-statistic was then transformed back into a Pearson correlation to simulate brain-behaviour correlations. For each power level, we simulated a large sample by first drawing 55,278 random effects (ρ = 0) and then replacing 500 of those with effects drawn from an infinite-sized population with true effect size ρ corresponding to the critical r for the given power level computed earlier. The choice of 500 effects being sampled from a true distribution was somewhat arbitrary, leading to a proportion of true effects (~1%) moderate enough to seem probable while sufficiently large to reduce noise in estimates of bias (see the Methods section below). Note that while real-world effect sizes of a single BWAS may vary, we instead simulate several effects with the same underlying effect size. This should not be an issue, since the statistical error summary statistics are averaged across effects, so in this simulation we can think of the underlying effects as having an average effect size corresponding to a given power level. Similar estimates would be obtained if the underlying true effect sizes were varied but with an average effect size corresponding to each r_critical.
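The following R sketch illustrates these two steps under our reading of the procedure; function names are ours, not from the released code.

```r
# Sketch of the inverse power analysis and effect-size sampling via the Fisher z-transformation.
alpha  <- 0.05 / 55278                       # Bonferroni-corrected threshold
n_full <- 1000

# Minimal detectable effect size for a target power level at the full sample size
r_critical <- function(power, n, alpha) {
  z_alpha <- qnorm(1 - alpha / 2)            # deviate for the two-tailed significance threshold
  z_beta  <- qnorm(power)                    # deviate for the target power level
  tanh((z_alpha + z_beta) / sqrt(n - 3))
}
rho_80 <- r_critical(0.80, n_full, alpha)    # true effect size giving 80% power (~0.18)

# Sample observed correlations for a large sample of size n_full:
# z is approximately normal with mean atanh(rho) and SD 1/sqrt(n - 3)
sample_r <- function(n_effects, rho, n) {
  tanh(rnorm(n_effects, mean = atanh(rho), sd = 1 / sqrt(n - 3)))
}
r_true <- sample_r(500, rho_80, n_full)      # 500 true effects
r_null <- sample_r(54778, 0, n_full)         # 54,778 null effects
```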
Simulating statistical power estimations from resampling with replacement
Given only summary statistics rather than individual subjects, we cannot resample participants and recompute P values; instead, we also simulate obtaining estimates by resampling with replacement, as in the subject-level analyses3. First, we generate ground truth data with known statistical power, which we treat as a large sample. Then, we follow the implicit assumption that the observed effects in the large sample are the population effects: for a given resample size n, a resampled effect size was drawn from a normal distribution N(atanh(r*), 1/√(n − 3)), where r* is a given Pearson correlation from the large sample, and then transformed back into a Pearson correlation (inverse Fisher z). We then derived P values from an uncorrected null distribution of Pearson correlations with degrees of freedom computed relative to the resample size. We resampled across the same range of sample sizes as in previous analyses (n = 25, …, 1,000). We continued to estimate statistical power across 1,000 iterations of resampled brain-behaviour correlations as in Marek, Tervo-Clemmens et al.3, specifically as the proportion of significant effects in a large sample which were significant again in a resample (1 − false negative rate, with α = 0.05/55278). These were then compared to analytical power curves69 computed using Fisher z-transformations for varying sample sizes and effect sizes corresponding to the critical r for power levels at the full sample size (n = 1,000) computed earlier.
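A sketch of this summary-statistic resampling step, reusing p_pearson and the effects generated in the previous sketches (function and variable names are ours; the iteration count is reduced for a quick check):

```r
# Treat each large-sample correlation r_star as the "population" value, draw a
# resampled correlation at size n_res, then re-threshold with the parametric P value.
estimate_power <- function(r_large, n_res, n_full, alpha, n_iter = 100) {
  p_full <- p_pearson(r_large, n_full)               # defined in the earlier sketch
  hits   <- p_full < alpha                           # significant effects in the large sample
  mean(replicate(n_iter, {
    r_res <- tanh(rnorm(length(r_large), mean = atanh(r_large), sd = 1 / sqrt(n_res - 3)))
    mean(p_pearson(r_res, n_res)[hits] < alpha)      # proportion significant again
  }))
}
r_large <- c(r_true, r_null)                         # from the previous sketch
estimate_power(r_large, n_res = 1000, n_full = 1000, alpha = alpha)
```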
Data availability
No human data was collected for this study. R and MATLAB code used for data simulation, statistical analyses, and plotting is available on GitHub: https://github.com/charlesdgburns/rwr/.
References
Callaway, E. Can brain scans reveal behaviour? Bombshell study says not yet. Nature 603, 777–778 (2022).
Richtel, M. Brain-imaging studies hampered by small data sets, study finds. The New York Times (2022).
Marek, S. et al. Reproducible brain-wide association studies require thousands of individuals. Nature 603, 654–660 (2022).
Gratton, C., Nelson, S. M. & Gordon, E. M. Brain-behavior correlations: two paths toward reliability. Neuron 110, 1446–1449 (2022).
Rosenberg, M. D. & Finn, E. S. How to establish robust brain–behavior relationships without thousands of individuals. Nat. Neurosci. 25, 835–837 (2022).
Botvinik-Nezer, R. & Wager, T. D. Reproducibility in Neuroimaging Analysis: challenges and solutions. Biol. Psychiatry Cogn. Neurosci. Neuroimaging 8, 780–788 (2023).
Helwegen, K., Libedinsky, I. & van den Heuvel, M. P. Statistical power in network neuroscience. Trends Cogn. Sci. 27, 282–301 (2023).
Button, K. S. et al. Power failure: why small sample size undermines the reliability of neuroscience. Nat. Rev. Neurosci. 14, 365–376 (2013).
Munafò, M. R. et al. A manifesto for reproducible science. Nat. Hum. Behav. 1, 1–9 (2017).
Open Science Collaboration. Estimating the reproducibility of psychological science. Science 349, aac4716 (2015).
Ioannidis, J. P. A. Why most published Research findings are false. PLOS Med. 2, e124 (2005).
Kriegeskorte, N., Simmons, W. K., Bellgowan, P. S. F. & Baker, C. I. Circular analysis in systems neuroscience: the dangers of double dipping. Nat. Neurosci. 12, 535–540 (2009).
Plesser, H. E. Reproducibility vs. replicability: a brief history of a confused terminology. Front. Neuroinformatics 11 (2018).
Barba, L. A. Terminologies for Reproducible Research. Preprint at http://arxiv.org/abs/1802.03311 (2018).
Kharabian Masouleh, S., Eickhoff, S. B., Hoffstaedter, F. & Genon, S. Alzheimer’s Disease Neuroimaging Initiative. Empirical examination of the replicability of associations between brain structure and psychological variables. eLife 8, e43464 (2019).
Boekel, W. et al. A purely confirmatory replication study of structural brain-behavior correlations. Cortex J. Devoted Study Nerv. Syst. Behav. 66, 115–133 (2015).
Liu, S., Abdellaoui, A., Verweij, K. J. H. & van Wingen, G. A. Replicable brain–phenotype associations require large-scale neuroimaging data. Nat. Hum. Behav. 7, 1344–1356 (2023).
Van Essen, D. C. et al. The WU-Minn Human Connectome Project: an overview. NeuroImage 80, 62–79 (2013).
Casey, B. J. et al. The adolescent brain Cognitive Development (ABCD) study: imaging acquisition across 21 sites. Dev. Cogn. Neurosci. 32, 43–54 (2018).
Sudlow, C. et al. UK biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Med. 12, e1001779 (2015).
Ingre, M. Why small low-powered studies are worse than large high-powered studies and how to protect against trivial findings in research: comment on Friston (2012). NeuroImage 81, 496–498 (2013).
Yarkoni, T. Big correlations in Little studies: inflated fMRI correlations reflect low statistical power—commentary on Vul et al. (2009). Perspect. Psychol. Sci. 4, 294–298 (2009).
Cremers, H. R., Wager, T. D. & Yarkoni, T. The relation between statistical power and inference in fMRI. PLOS ONE 12, e0184923 (2017).
Szucs, D. & Ioannidis, J. P. A. Sample size evolution in neuroimaging research: an evaluation of highly-cited studies (1990–2012) and of latest practices (2017–2018) in high-impact journals. NeuroImage 221, 117164 (2020).
Poldrack, R. A. et al. Scanning the horizon: towards transparent and reproducible neuroimaging research. Nat. Rev. Neurosci. 18, 115–126 (2017).
Amrhein, V., Greenland, S. & McShane, B. Scientists rise up against statistical significance. Nature 567, 305–307 (2019).
Geuter, S., Qi, G., Welsh, R. C., Wager, T. D. & Lindquist, M. A. Effect size and power in fMRI group analysis. Preprint at bioRxiv https://doi.org/10.1101/295048 (2018).
Noble, S., Scheinost, D. & Constable, R. T. Cluster failure or power failure? Evaluating sensitivity in cluster-level inference. NeuroImage 209, 116468 (2020).
Bossier, H. et al. The empirical replicability of task-based fMRI as a function of sample size. NeuroImage 212, 116601 (2020).
Ripley, B. et al. MASS: Support Functions and Datasets for Venables and Ripley’s MASS. (2023).
Wickham, H. et al. Welcome to the Tidyverse. J. Open. Source Softw. 4, 1686 (2019).
Kassambara, A. ggpubr: 'ggplot2' Based Publication Ready Plots. (2022).
El Otmani, S. & Maul, A. Probability distributions arising from nested gaussians. Comptes Rendus Math. 347, 201–204 (2009).
Convolution of Gaussians is Gaussian. https://jeremy9959.net/Math-5800-Spring-2020/notebooks/convolution_of_gaussians.html
Chen, Z., Boehnke, M., Wen, X. & Mukherjee, B. Revisiting the genome-wide significance threshold for common variant GWAS. G3 GenesGenomesGenetics 11, jkaa056 (2021).
Witten, I. H., Frank, E., Hall, M. A. & Pal, C. J. Data Mining: Practical Machine Learning Tools and Techniques 4th edn (Elsevier, 2017).
Zuo, X. N., Xu, T. & Milham, M. P. Harnessing reliability for neuroscience research. Nat. Hum. Behav. 3, 768–771 (2019).
Fröhner, J. H., Teckentrup, V., Smolka, M. N. & Kroemer, N. B. Addressing the reliability fallacy in fMRI: similar group effects may arise from unreliable individual effects. NeuroImage 195, 174–189 (2019).
Bennett, C., Miller, M. & Wolford, G. Neural correlates of interspecies perspective taking in the post-mortem Atlantic Salmon: an argument for multiple comparisons correction. NeuroImage 47, S125 (2009).
Poldrack, R. A. Region of interest analysis for fMRI. Soc. Cogn. Affect. Neurosci. 2, 67–70 (2007).
Bullmore, E. & Sporns, O. Complex brain networks: graph theoretical analysis of structural and functional systems. Nat. Rev. Neurosci. 10, 186–198 (2009).
Betzel, R. F. & Bassett, D. S. Multi-scale brain networks. NeuroImage 160, 73–83 (2017).
Muldoon, S. F., Bridgeford, E. W. & Bassett, D. S. Small-world propensity and weighted brain networks. Sci. Rep. 6, 22057 (2016).
Haxby, J. V. Multivariate pattern analysis of fMRI: the early beginnings. Neuroimage 62, 852–855 (2012).
Haxby, J. V. et al. Distributed and overlapping representations of faces and objects in ventral temporal cortex. Science 293, 2425–2430 (2001).
Simmons, J. P., Nelson, L. D. & Simonsohn, U. Pre-registration: why and how. J. Consum. Psychol. 31, 151–162 (2021).
van den Akker, O. R. et al. Preregistration in practice: a comparison of preregistered and non-preregistered studies in psychology. Behav. Res. Methods. https://doi.org/10.3758/s13428-023-02277-0 (2023).
Pervaiz, U., Vidaurre, D., Woolrich, M. W. & Smith, S. M. Optimising network modelling methods for fMRI. NeuroImage 211, 116604 (2020).
Power, J. D., Plitt, M., Laumann, T. O. & Martin, A. Sources and implications of whole-brain fMRI signals in humans. NeuroImage 146, 609–625 (2017).
Mahadevan, A. S., Tooley, U. A., Bertolero, M. A., Mackey, A. P. & Bassett, D. S. Evaluating the sensitivity of functional connectivity measures to motion artifact in resting-state fMRI data. NeuroImage 241, 118408 (2021).
Saviola, F. et al. Head motion correction shapes functional network estimates: evidence from healthy and Parkinson’s disease cohorts. bioRxiv 2022–12 (2022).
Gordon, E. M. et al. Generation and evaluation of a cortical area parcellation from resting-state correlations. Cereb. Cortex 26, 288–303 (2016).
Kong, R. et al. Individual-specific areal-level parcellations improve functional connectivity prediction of Behavior. Cereb. Cortex 31, 4477–4500 (2021).
Gordon, E. M. et al. Precision Functional Mapping of Individual Human brains. Neuron 95, 791–807e7 (2017).
Bijsterbosch, J. D., Valk, S. L., Wang, D. & Glasser, M. F. Recent developments in representations of the connectome. NeuroImage 243, 118533 (2021).
Farahibozorg, S. R. et al. Hierarchical modelling of functional brain networks in population and individuals from big fMRI data. NeuroImage 243, 118513 (2021).
Bijsterbosch, J. et al. Challenges and future directions for representations of functional brain organization. Nat. Neurosci. 23, 1484–1495 (2020).
Bijsterbosch, J. D. et al. The relationship between spatial configuration and functional connectivity of brain regions. eLife 7, e32992 (2018).
Shah, L. M., Cramer, J. A., Ferguson, M. A., Birn, R. M. & Anderson, J. S. Reliability and reproducibility of individual differences in functional connectivity acquired during task and resting state. Brain Behav. 6, e00456 (2016).
Noble, S., Scheinost, D. & Constable, R. T. A decade of test-retest reliability of functional connectivity: a systematic review and meta-analysis. NeuroImage 203, 116157 (2019).
Kadlec, J. et al. A measure of reliability convergence to select and optimize cognitive tasks for individual differences research. Commun. Psychol. 2, 1–18 (2024).
Yarkoni, T. & Westfall, J. Choosing Prediction over explanation in psychology: lessons from Machine Learning. Perspect. Psychol. Sci. 12, 1100–1122 (2017).
McShane, B. B., Gal, D., Gelman, A., Robert, C. & Tackett, J. L. Abandon statistical significance. Am. Stat. 73, 235–245 (2019).
Benjamin, D. J. et al. Redefine statistical significance. Nat. Hum. Behav. 2, 6–10 (2018).
Ioannidis, J. P. A. Why most discovered true associations are inflated. Epidemiology 19, 640 (2008).
Markello, R. D. & Misic, B. Comparing spatial null models for brain maps. NeuroImage 236, 118052 (2021).
Spisak, T., Bingel, U. & Wager, T. D. Multivariate BWAS can be replicable with moderate sample sizes. Nature 615, E4–E7 (2023).
Chen, J. et al. Relationship between prediction accuracy and feature importance reliability: an empirical and theoretical study. NeuroImage 274, 120115 (2023).
Designing Clinical Research (Wolters Kluwer/Lippincott Williams & Wilkins, Philadelphia, 2013).
Fisher, R. A. Frequency distribution of the values of the correlation coefficient in samples from an indefinitely large Population. Biometrika 10, 507–521 (1915).
Acknowledgements
We thank editors and reviewers for their contributions to this manuscript. A.F. was supported by a grant from the Biotechnology and Biology research council (BBSRC, grant number: BB/S006605/1) and the Fundação Bial, Fundação Bial Grants Programme 2020/21, A- 29315, number 203/2020, grant edition: G-15516.
Author information
Contributions
C.D.G.B.: Conceptualisation, design, implementation, analysis, interpretation, writing - original draft. A.F.: Interpretation of results, writing - review & editing. G.A.R.: Conceptualisation, design, interpretation, writing - review & editing, supervision.
Ethics declarations
Competing interests
The authors declare no competing interests.
Ethics
Analyses described here were performed using randomly simulated data and were therefore not subject to ethical review.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Electronic supplementary material
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Burns, C.D.G., Fracasso, A. & Rousselet, G.A. Bias in data-driven replicability analysis of univariate brain-wide association studies. Sci Rep 15, 6105 (2025). https://doi.org/10.1038/s41598-025-89257-w