An efficient, not-only-linear correlation coefficient based on clustering
- Novel Statistic: The paper introduces the Clustermatch Correlation Coefficient (CCC), a computationally efficient measure that detects both linear and non-linear relationships, unlike traditional coefficients such as Pearson’s \(r\).
- Mechanism: CCC partitions the observations using each variable separately and scores how well the resulting groupings agree, which lets it capture complex, non-monotonic associations.
- Biological Utility: In an application to human gene expression data, CCC successfully identified biologically meaningful non-linear patterns, including those driven by sex differences, that were missed by standard linear methods.
PubMed: 39243756 DOI: 10.1016/j.cels.2024.08.005 Overview generated by: Gemini 2.5 Flash, 28/11/2025
Key Findings: The Clustermatch Correlation Coefficient (CCC)
This paper introduces the Clustermatch Correlation Coefficient (CCC), an efficient and easy-to-use statistical measure designed to identify both linear and non-linear relationships between variables. The core motivation is that standard coefficients such as Pearson’s \(r\) primarily capture linear associations and can therefore miss complex patterns that are common in high-dimensional biological data.
Study Design and Methods: Introducing CCC
The Nature of CCC
The CCC is a not-only-linear correlation coefficient built on a clustering-based statistic. Rather than fitting a line, it assumes that if two variables are related, then grouping the observations by one variable alone should resemble grouping them by the other variable alone. The strength of the relationship is quantified by how closely those two groupings agree.
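In symbols, the statistic can be sketched as follows, assuming the formulation in which each variable is partitioned separately and agreement is scored with the adjusted Rand index (ARI). The notation \(\pi_k(\cdot)\) and \(k_{\max}\) is introduced here for illustration, and clipping at zero is an assumption of this sketch:

\[
\mathrm{CCC}(x, y) = \max\Bigl(0,\ \max_{2 \le k_x,\,k_y \le k_{\max}} \mathrm{ARI}\bigl(\pi_{k_x}(x),\ \pi_{k_y}(y)\bigr)\Bigr)
\]

Here \(\pi_k(x)\) denotes the partition of the \(n\) observations into \(k\) clusters using \(x\) alone. For two partitions with contingency counts \(n_{ij}\), row sums \(a_i\), and column sums \(b_j\), the ARI is

\[
\mathrm{ARI} = \frac{\sum_{ij}\binom{n_{ij}}{2} - \Bigl[\sum_i \binom{a_i}{2}\sum_j \binom{b_j}{2}\Bigr] \Big/ \binom{n}{2}}{\tfrac{1}{2}\Bigl[\sum_i \binom{a_i}{2} + \sum_j \binom{b_j}{2}\Bigr] - \Bigl[\sum_i \binom{a_i}{2}\sum_j \binom{b_j}{2}\Bigr] \Big/ \binom{n}{2}}
\]

The ARI equals 1 when the two partitions are identical and is close to 0 when they agree no more than expected by chance.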
Methodology
- Per-Variable Clustering: For each variable separately, the observations are partitioned into \(k\) clusters over a range of \(k\) values (quantile-based bins for numerical variables).
- Partition Comparison: Every partition obtained from the first variable is compared with every partition obtained from the second using the adjusted Rand index (ARI), which measures how well two groupings of the same observations agree.
- Final Score: The coefficient is the maximum ARI across all partition pairs, so it is high whenever some grouping induced by one variable matches a grouping induced by the other.
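The procedure can be sketched in a few lines of plain Python. This is a simplified illustration, not the authors' implementation. It assumes numerical inputs, quantile-based partitions for \(k = 2, \dots, k_{\max}\), agreement scored with the adjusted Rand index, and a score clipped at zero. The function names (`quantile_partition`, `adjusted_rand_index`, `ccc_sketch`) and the default `k_max=10` are choices made for this example.

```python
from bisect import bisect_right
from collections import Counter
from math import comb


def quantile_partition(values, k):
    """Label each value with one of up to k quantile-based bins.

    Tied values always receive the same label.
    """
    n = len(values)
    ordered = sorted(values)
    # Cut points at the j/k quantiles (j = 1..k-1) of the sorted values.
    cuts = [ordered[(j * n) // k] for j in range(1, k)]
    return [bisect_right(cuts, v) for v in values]


def adjusted_rand_index(a, b):
    """Adjusted Rand index between two labelings of the same objects."""
    n = len(a)
    sum_ij = sum(comb(c, 2) for c in Counter(zip(a, b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(a).values())
    sum_b = sum(comb(c, 2) for c in Counter(b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        # Degenerate case (e.g. both labelings have a single cluster);
        # returning 0.0 here is a choice made for this sketch.
        return 0.0
    return (sum_ij - expected) / (max_index - expected)


def ccc_sketch(x, y, k_max=10):
    """Maximum ARI over all pairs of per-variable partitions, clipped at 0."""
    parts_x = [quantile_partition(x, k) for k in range(2, k_max + 1)]
    parts_y = [quantile_partition(y, k) for k in range(2, k_max + 1)]
    best = 0.0
    for px in parts_x:
        for py in parts_y:
            best = max(best, adjusted_rand_index(px, py))
    return best


# Toy data: 201 evenly spaced points in [-1, 1].
xs = [i / 100 for i in range(-100, 101)]
line = [2 * v + 1 for v in xs]      # linear relationship
parabola = [v * v for v in xs]      # non-monotonic relationship

print("line:    ", ccc_sketch(xs, line))      # partitions match exactly
print("parabola:", ccc_sketch(xs, parabola))  # clearly above zero
```

On the linear example the partitions of the two variables coincide, so the score reaches its maximum. On the parabola, a three-way split of `xs` partially matches a two-way split of `parabola`, which is enough to yield a clearly positive score even though the relationship is not monotonic.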
The authors show that CCC captures a wide array of patterns, including linear, non-linear (e.g., parabolic or U-shaped), and non-monotonic dependencies.
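To make the parabolic case concrete, the following self-contained check shows why rank- and line-based coefficients struggle with it. On a symmetric U-shaped relationship, both Pearson’s \(r\) and Spearman’s \(\rho\) come out at approximately zero even though one variable fully determines the other. The data are synthetic and the helper names (`pearson`, `average_ranks`, `spearman`) are written for this example only.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with order[i].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


xs = [i / 100 for i in range(-100, 101)]   # symmetric around zero
parabola = [v * v for v in xs]             # y is a deterministic function of x
line = [2 * v + 1 for v in xs]

print("Pearson  (parabola):", pearson(xs, parabola))   # approximately 0
print("Spearman (parabola):", spearman(xs, parabola))  # approximately 0
print("Pearson  (line):    ", pearson(xs, line))       # approximately 1
```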
Results and Empirical Application
Simulation and Comparison
Through simulated data, the authors demonstrate that the CCC detects relationships that standard measures such as Pearson’s \(r\) and Spearman’s \(\rho\) miss. Compared with other non-linear coefficients such as the Maximal Information Coefficient (MIC), the CCC is reported to be faster to compute, which makes it scalable to large datasets.
Application to Genome-Scale Data
The CCC was applied to human gene expression data from the Genotype-Tissue Expression (GTEx) project, a common source of complex, non-linear biological associations, to identify correlated gene pairs.
- Detection of Biological Patterns: Alongside robust linear relationships, CCC identified non-linear patterns that were not captured by linear-only coefficients.
- Sex Differences: The identified non-linear associations were often explained by sex differences, where the relationship between two genes varied significantly across male and female subgroups. For instance, the expression of certain gene pairs might be highly correlated within sexes but exhibit a non-linear overall pattern when sexes are combined, which CCC could detect while linear methods could not.
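A toy illustration of this effect, using synthetic values rather than the study's data: two hypothetical genes are perfectly correlated within each subgroup, but with opposite sign, so the pooled Pearson coefficient is approximately zero. The group labels and the variable names `gene_a` and `gene_b` are assumptions made for this sketch.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Synthetic expression levels of gene_a on a grid from 0 to 1.
gene_a = [i / 50 for i in range(51)]

# Hypothetical subgroups: gene_b rises with gene_a in one group
# and falls with it in the other.
group_f = [(a, a) for a in gene_a]
group_m = [(a, 1 - a) for a in gene_a]
pooled = group_f + group_m


def r(pairs):
    """Pearson correlation for a list of (gene_a, gene_b) pairs."""
    a_vals, b_vals = zip(*pairs)
    return pearson(a_vals, b_vals)


print("within group F:", r(group_f))  # approximately +1
print("within group M:", r(group_m))  # approximately -1
print("pooled:        ", r(pooled))   # approximately 0
```

The pooled scatter forms an X shape: a strong, structured dependence that a single linear coefficient summarizes as no association.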
Conclusions and Recommendations
The Clustermatch Correlation Coefficient (CCC) is a useful tool for pattern identification in complex datasets. It provides a computationally efficient, simple, and not-only-linear alternative to standard correlation measures. Its demonstrated ability to reveal biologically meaningful non-linear patterns, such as those linked to sex differences in gene expression, highlights its utility for high-throughput analyses in genomics and other fields where relationships are rarely simple or purely linear.