Surprising Geometrical Properties of High-Dimension Low-Sample Size Data with Devastating Consequences for Data Analysis

Ana Maria Pires; João António Branco

doi:10.60923/issn.1973-2201/20506

Authors

Ana Maria Pires, Department of Mathematics and CEMAT, IST, Universidade de Lisboa, Lisboa, Portugal

João António Branco, Department of Mathematics and CEMAT, IST, Universidade de Lisboa, Lisboa, Portugal

DOI:

https://doi.org/10.60923/issn.1973-2201/20506

Keywords:

Curse of dimensionality, High-dimension low-sample size data, Mahalanobis distance, Multivariate outliers, Nearest-neighbors, Projection-pursuit

Abstract

The advent of modern technology, which permits the simultaneous measurement of thousands of variables, has given rise to floods of data characterized by many large or even huge datasets. This new paradigm presents extraordinary challenges to data analysis, and the question arises: how can conventional data analysis methods, devised for moderate or small datasets, cope with the complexities of modern data? The case of high-dimension low-sample size data is particularly revealing of some of the drawbacks. We look at the case where the number of variables measured on each object is at least the number of observed objects and conclude that (under the further assumptions that the data are observations from continuous random variables and that linear combinations of the variables are meaningful operations) this configuration leads to geometrical and mathematical oddities and is an insurmountable barrier to the direct application of traditional methodologies. If scientists base their conclusions on high-dimension low-sample size data while ignoring the fundamental mathematical results arrived at in this paper and blindly using software to analyze the data, the results of their analyses may not be trustworthy, and the findings of their experiments may never be validated. That is why new methods, together with the wise use of traditional approaches, are essential to progress safely through the present reality.
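One well-known oddity of this kind can be checked numerically. The sketch below is an illustration only and is not taken from the paper: the simulated Gaussian data, the sizes `n = 10` and `p = 50`, the helper name `squared_mahalanobis`, and the use of the Moore-Penrose pseudo-inverse in place of the (non-existent) inverse of the singular sample covariance matrix are all assumptions made here. It computes the squared Mahalanobis distance of each observation to the sample mean after one observation has been turned into a gross outlier.

```python
import numpy as np

def squared_mahalanobis(X, rel_tol=1e-10):
    """Squared Mahalanobis distances of the rows of X to the sample mean.

    The sample covariance S is singular when p >= n, so its pseudo-inverse
    is used (an assumption of this sketch). With the SVD Xc = U D V^T of the
    centred data, Xc S^+ Xc^T = (n - 1) U_r U_r^T, where U_r holds the left
    singular vectors with non-negligible singular values.
    """
    n = X.shape[0]
    Xc = X - X.mean(axis=0)                      # centre each variable
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    keep = s > rel_tol * s.max()                 # drop numerically zero directions
    rank = int(keep.sum())                       # rank of the sample covariance
    d2 = (n - 1) * np.sum(U[:, keep] ** 2, axis=1)
    return d2, rank

rng = np.random.default_rng(0)
n, p = 10, 50                                    # hypothetical sizes with p >= n
X = rng.standard_normal((n, p))
X[0] += 100.0                                    # make the first object a gross outlier

d2, rank = squared_mahalanobis(X)
print("rank of sample covariance:", rank)
print("squared distances:", np.round(d2, 6))
```

For points in general position the centred data matrix has rank n - 1, so every squared distance equals (n - 1)^2 / n regardless of where the points lie. The gross outlier is therefore indistinguishable from the other observations by this criterion, which shows how a conventional outlier diagnostic can fail when it is applied directly to such data.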
