The best-known approach to testing for independence between two random variables $X$ and $Y$ (and to measuring the strength and direction of their relationship) is to use an empirical version of the classical Pearson correlation coefficient: $$ \rho(X,Y) = \frac{\mathrm{Cov}(X, Y)}{\sigma_X\sigma_Y} $$ or one of its extensions, such as Spearman's rank correlation coefficient. These approaches are easy to use and implemented in most statistical software.
Nevertheless, they are limited. For example, they can only handle two univariate random variables, and they can only detect linear or monotonic relationships.
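As a quick illustration of this limitation, here is a minimal sketch (using numpy and scipy, not part of the project's own tooling): when $Y$ is a deterministic but non-monotonic function of $X$, both coefficients come out essentially zero even though the variables are perfectly dependent.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = x**2  # y is a deterministic function of x, hence fully dependent on it

r_pearson, _ = pearsonr(x, y)
r_spearman, _ = spearmanr(x, y)
# Both are close to 0: the relationship is neither linear nor monotonic,
# so neither coefficient detects it.
print(r_pearson, r_spearman)
```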
These standard measures are not suited to the new problems arising with the advent of big data sets. Indeed, Big Data objects can be as varied as high-dimensional vectors, images, texts, speech data and live streams.
Our Dependence Measures project aims to develop powerful, efficient, practical and easy-to-use statistical tools for better understanding and analysing relationships between such diverse objects.
It will deepen the analysis and enhance the modelling of interdependencies in such data sets, in particular by quantifying their strength and evaluating possible causalities. The project is led by an international and interdisciplinary team of skilled researchers from the UNSW School of Mathematics and Statistics, the UNSW School of Risk and Actuarial Studies, and ENSAI. Thanks to established links with industry and with neuroscientists, the project will bring forth high-impact applications, particularly in neuroimaging genetics and insurance capital modelling.
The project will develop powerful statistical methods driven by the analysis of real Big Data sets in both risk insurance and neuroimaging genetics. This will empower data users in Australia and worldwide, for instance by better understanding brain dynamics, identifying the forerunners of natural disasters, or managing risks faced by the financial system.
Challenges arising in the Big Data 3V's environment (volume, velocity, variety) will be addressed mainly by developing a unifying methodology that relies on the complex-valued characteristic function and takes advantage of results from complex analysis. This will create innovative methodologies drawing on complex analysis, thereby enriching the palette of tools available to anyone analysing associations and dependencies in big data sets.
We will improve upon the classical tools in several directions:
- Consider random vectors and not only random variables.
- Go even further to consider unstructured random objects (images, texts, etc.).
- Look at dependencies between more than two objects.
- Scale the above approach to big data sets.
The cornerstone of our methodology, building on previous work by members of this project, will be the study of empirical generalizations of a weighted distance measure between the joint characteristic function of the random variables $X$ and $Y$ and the product of their marginal characteristic functions: $$ \int\big|\phi_{X,Y}(s,t)-\phi_X(s)\phi_Y(t)\big|\,w(s,t)\,ds\,dt. $$
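For one particular choice of weight (in the univariate case, $w(s,t)$ proportional to $|s|^{-2}|t|^{-2}$, applied to the squared modulus of the difference), an integral of this type reduces to the well-known distance covariance of Székely, Rizzo and Bakirov (2007), which admits a simple pairwise-distance form. The following sketch (Python/numpy, an illustration of that special case rather than the project's methodology) computes the empirical statistic:

```python
import numpy as np

def distance_covariance(x, y):
    """Empirical (squared) distance covariance between two samples.

    Both inputs are samples of the same length n; each may be univariate
    or multivariate (rows = observations).
    """
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    y = np.asarray(y, dtype=float).reshape(len(y), -1)
    # Pairwise Euclidean distance matrices
    a = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    b = np.linalg.norm(y[:, None, :] - y[None, :, :], axis=-1)
    # Double centering: subtract row and column means, add back the grand mean
    A = a - a.mean(axis=0) - a.mean(axis=1)[:, None] + a.mean()
    B = b - b.mean(axis=0) - b.mean(axis=1)[:, None] + b.mean()
    # The statistic is the average entrywise product of the centred matrices
    return (A * B).mean()
```

Applied to the quadratic example above, this statistic is clearly positive, picking up the dependence that the classical coefficients miss.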
In the course of this project:
We will develop and promote methods from statistical theory in the complex domain for devising new tests and measures of dependence between any number of random elements of various types, and for demonstrating their power in the big data environment.
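As a concrete illustration of how such a measure can be turned into a formal test, one generic device is a permutation test: randomly re-pairing the observations destroys any dependence, so the observed statistic can be calibrated against its permutation distribution. A minimal sketch, reusing the `distance_covariance` function above (again an illustration, not the project's method):

```python
import numpy as np

def dcov_permutation_test(x, y, n_perm=999, seed=None):
    """Permutation test of independence based on distance covariance.

    Returns a p-value: small values indicate evidence against independence.
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    observed = distance_covariance(x, y)
    exceed = 0
    for _ in range(n_perm):
        perm = rng.permutation(len(y))
        if distance_covariance(x, y[perm]) >= observed:
            exceed += 1
    # Permutation p-value with the usual add-one correction
    return (exceed + 1) / (n_perm + 1)
```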
We will extend the foregoing methodology to the case of assessing dependence and serial dependence between (potentially multivariate/high-dimensional) time series, with special attention to the difficult cases of non-stationary data and streaming data.
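In the simplest stationary, univariate case, one way to screen for serial dependence, linear or not, is to apply the same statistic to a series and its lagged copies. A minimal sketch with a hypothetical helper name, reusing `distance_covariance` above; the non-stationary and streaming settings targeted by the project require substantially more care:

```python
import numpy as np

def auto_distance_covariance(x, max_lag=10):
    """Distance covariance between x_t and x_{t-k} for lags k = 1..max_lag.

    Assumes a stationary univariate series; larger values at a lag suggest
    serial dependence (of any form) at that lag.
    """
    x = np.asarray(x, dtype=float)
    return {k: distance_covariance(x[k:], x[:-k]) for k in range(1, max_lag + 1)}
```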
We will connect the above approaches to copulas, ubiquitous tools for dependence modelling in many fields, thereby introducing the copula approach into the big data framework.
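For readers less familiar with copulas: the usual starting point is to transform the data to pseudo-observations (the empirical copula sample), i.e. ranks rescaled to $(0,1)$, which strips out the marginal distributions while preserving the dependence structure. A minimal sketch, assuming univariate margins and using scipy's `rankdata`:

```python
import numpy as np
from scipy.stats import rankdata

def empirical_copula_sample(x, y):
    """Pseudo-observations of a bivariate sample.

    Ranks are rescaled to (0, 1); the joint behaviour of (u, v) reflects only
    the dependence between x and y, not their marginal distributions, which is
    the starting point of most copula-based analyses.
    """
    n = len(x)
    u = rankdata(x) / (n + 1)
    v = rankdata(y) / (n + 1)
    return u, v
```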