#### BIPLOTS FOR INTERDISCIPLINARY HIGH DIMENSIONAL DATA VISUALISATION

*Sugnet Lubbe, Johané Nienkemper-Swanepoel, Niël le Roux, Carel van der Merwe, Raeesa Ganey*

Biplots have proved to be valuable visualisation tools in exploratory data analysis. To date, the use of biplots in interdisciplinary applications has been limited because current implementation tools are restricted to expert users. A user-friendly package will enable and empower practitioners and researchers of varying skill levels to apply biplots more widely in many disciplines, especially in the current era of big data. The primary aim of this project is to collate all the visualisation-related code developed by many individuals over a long period of time into a new, comprehensive, user-friendly R package. Specifically, new developments in R provide facilities that call for the renewal, refurbishment and modernisation of the current programs.

The MuViSU Management committee received the EMS Faculty Elite Research Grant of R200 000 in 2020 for this project. Furthermore, Dr Nienkemper-Swanepoel and Dr Ganey received an early career collaboration research grant of R150 000 from the National Graduate Academy for Mathematical and Statistical Sciences in April 2022.

#### BIPLOTS FOR INDUSTRY

*Niël le Roux, Roelof Coetzer (NWU), Ruan Rossouw (SASOL)*

A multivariate reactor performance index (RPI) is developed for complex multivariate process monitoring. The newly proposed RPI integrates subject-matter knowledge with a data-driven approach for real-time performance monitoring. A new approach to process deviation monitoring on many variables is presented based on the confidence value (α) at a specified -value. This methodology is proposed as a general data-driven performance index as it is objective, and very little prior knowledge of the system is required. A performance index visualised on an appropriate and interactive graph is invaluable in the monitoring of multiple similar production processes, as it makes it easy to visually identify production processes not performing as expected.

#### EEG SIMULATIONS

*Niël le Roux, Pieter Schoonees (Erasmus University)*

This research focuses on methods for assessing the similarity of brain responses within and across subjects (individuals). Typically, the data come from fMRI or EEG studies, and concern spatiotemporal measures of brain activity while the subject is exposed to some stimulus. A particular focus of these studies is on naturalistic stimuli, which typically means video content such as television and films. This is an important departure from traditional neuroimaging studies where subjects perform simple tasks multiple times in a highly controlled setting. fMRI offers high spatial resolution through dividing the brain into many voxels, but this comes at the cost of lower temporal resolution as it takes roughly two seconds to complete a single scan of the brain. In contrast, EEG trades spatial resolution for high temporal resolution. In EEG, a limited number of electrodes (e.g., 64) are placed on the scalp to measure activity, but by sacrificing spatial resolution in this way measurements can be made many times a second (typically 256 or 512 times). Our focus is on the statistical analysis of EEG data. To this end we developed an extensive R-based EEG simulation statistical model for generating EEG data in a wide variety of controlled conditions. This allows us to evaluate statistical procedures currently in use in the field of analysing EEG data.

#### BIPLOTS FOR TWO CLASS LINEAR DISCRIMINANT ANALYSIS

*Niël le Roux, Sugnet Lubbe*

Biplots are typically represented in two dimensions. In the two class case, the dimension of the canonical space where the classes are optimally separated reduces to one. The paper suggesting how to find an optimal representation for the second dimension is in the final stages of preparation before submission.
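
The reduction to one dimension can be checked numerically. In canonical variate analysis the number of useful dimensions is bounded by the rank of the between-class scatter matrix, which is at most the number of classes minus one. The sketch below uses hypothetical toy data (the names `class_a` and `class_b` are illustrative only) to show that this matrix has rank one for two classes; it does not attempt the optimal second dimension discussed above.

```python
# Hypothetical toy data: two classes measured on two variables.
class_a = [(1.0, 2.0), (2.0, 3.0), (3.0, 3.5)]
class_b = [(6.0, 5.0), (7.0, 7.5), (8.0, 6.0), (7.0, 6.5)]


def mean_vector(points):
    """Column means of a list of (x, y) observations."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def between_class_scatter(groups):
    """2 x 2 between-class matrix B = sum_j n_j (m_j - m)(m_j - m)^T."""
    overall = mean_vector([p for g in groups for p in g])
    b = [[0.0, 0.0], [0.0, 0.0]]
    for g in groups:
        m_j = mean_vector(g)
        d = (m_j[0] - overall[0], m_j[1] - overall[1])
        for r in range(2):
            for c in range(2):
                b[r][c] += len(g) * d[r] * d[c]
    return b


B = between_class_scatter([class_a, class_b])
det_B = B[0][0] * B[1][1] - B[0][1] * B[1][0]
trace_B = B[0][0] + B[1][1]
print("det(B)   =", det_B)    # numerically zero, so B is singular
print("trace(B) =", trace_B)  # positive, so exactly one non-zero eigenvalue
```

Because only one eigenvalue is non-zero, a single canonical variate carries all the class separation, and the second axis of a two-dimensional biplot has to be chosen by some other criterion.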

#### GENERALISED SINGULAR VALUE DECOMPOSITION FOR THREE-WAY DATA

*Sugnet Lubbe, Raeesa Ganey (WITS)*

Whereas the singular value decomposition decomposes a single matrix into three components, the generalised singular value decomposition decomposes two matrices simultaneously, with a single matrix of right singular vectors shared by both. This could be useful to visually represent more than one data matrix simultaneously. This project is currently on hold while Dr Ganey is on maternity leave.

#### FAULT DIAGNOSIS IN MULTIVARIATE STATISTICAL PROCESS MONITORING

*Sugnet Lubbe, Roelof Coetzer (NWU)*

While Prof Coetzer was at SASOL, he and Prof Lubbe co-supervised Dr André Mostert's PhD at UCT. Dr Mostert sadly passed away from COVID-19 in June 2021. Prof Coetzer and Prof Lubbe plan to publish at least two papers from his thesis. Prof Coetzer is organising a session “Methodologies for process monitoring and fault detection in complex industrial processes” at the International Conference of Computational Methods in Science and Engineering (ICCMSE). Since the conference is in hybrid format, Prof Lubbe will present a paper on this work remotely.

#### SUBSET MULTIPLE CORRESPONDENCE ANALYSIS IN THE VISUALISATION OF CATEGORICAL DATA WITH MISSING OBSERVATIONS

*Johané Nienkemper-Swanepoel, Niël le Roux, Sugnet Lubbe*

A paper based on the research from Dr Nienkemper-Swanepoel's thesis has been submitted for publication. More developments and generalisations are currently being investigated.

#### USING MCA BIPLOT VISUALISATIONS TO IDENTIFY MISSING DATA MECHANISMS

*Johané Nienkemper-Swanepoel, Niël le Roux, Sugnet Lubbe*

A paper based on the research from Dr Nienkemper-Swanepoel's thesis is in preparation while new developments and generalisations are currently being investigated.

#### EMBEDDED WORD MCA BIPLOTS FOR SENTIMENT VISUALISATION: APPLICATION TO COVID-19 RELATED TWEETS

*Zoë-Mae Adams, Johané Nienkemper-Swanepoel, Niël le Roux, Sugnet Lubbe*

This work is based on the Master's research of Ms Adams. An abstract has been submitted for an invited session at the conference of the International Federation of Classification Societies, to be held in Portugal in July 2022. Ms Adams is busy finalising the thesis and will then prepare a paper for publication.

#### EXPLODING BIPLOTS R PACKAGE

*Carel van der Merwe, Delia Sandilands, Ruan Buys, Sugnet Lubbe*

Ms Sandilands completed her Master's project on automating the process of orthogonal parallel translation, which moves biplot axes to the edges of the plot, similar to the axes of a scatterplot. A paper with authors Sandilands, van der Merwe and Lubbe has been submitted, but needs to be revised. In the meantime, Mr Buys is working on the coding of the R package.

#### The correspondence analysis of ordered categorical variables

*Eric Beh and Rosaria Lombardo*

For the past 25 years or so we have published extensively on the role of orthogonal polynomials in a range of issues concerned with correspondence analysis. These issues include the construction and interpretation of low-dimensional visual depictions of the association, as well as the partition of popular measures of association, and correlation and association models. This is because orthogonal polynomials provide an excellent, simple and flexible means of incorporating the structure of ordinal categorical variables – all they require is an *a priori* chosen set of initial scores to reflect the ordinal structure of a variable and a three-term recurrence formula to generate the polynomials. Orthogonal polynomials also enable one to determine “generalised correlations”, which include as special cases the traditional linear-by-linear correlation coefficient (that everyone should be familiar with) and sources of non-linear association that may exist between the ordinal variables. Alternative approaches involve scaling categories such that the resulting scores (obtained from reciprocal averaging or by other means) are “forced” to be ordered. Unfortunately, such approaches only consider ordered scores across a single dimension, and the resulting visual representation of the association may not properly reflect the nature of the association.

This ongoing project examines the impact of orthogonal polynomials on the structure of the association between two or more categorical variables. Methods of three-way and higher-way decomposition using orthogonal polynomials are very much linked to the Tucker3 decomposition and, more generally, to the suite of decomposition methods that are now part of higher-order singular value decomposition (HOSVD). This project also examines the impact on the interpretation of visual summaries of the association obtained by performing correspondence analysis, where the traditional correspondence plot or biplot may be constructed.
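
To make the three-term recurrence concrete, here is a minimal sketch of one common way of generating orthonormal polynomials from a set of a priori scores and marginal proportions. The function name `orthogonal_polynomials`, the example scores and the weights are all hypothetical; the recurrence constants are computed numerically rather than from closed-form expressions, so this is an illustration under those assumptions and not necessarily the exact formulation used in the publications.

```python
import math


def orthogonal_polynomials(scores, weights, max_degree):
    """Orthonormal polynomials on `scores` with respect to `weights`.

    `weights` are marginal proportions summing to one, and `max_degree`
    must be smaller than the number of distinct scores. polys[u][j] is
    the degree-u polynomial evaluated at category j.
    """
    def inner(a, b):
        return sum(w * x * y for w, x, y in zip(weights, a, b))

    previous = [0.0] * len(scores)  # a_{-1}(j) = 0
    current = [1.0] * len(scores)   # a_0(j) = 1
    polys = [current]
    for _ in range(max_degree):
        s_current = [s * a for s, a in zip(scores, current)]
        b_u = inner(s_current, current)   # sum_j w_j s_j a_{u-1}(j)^2
        c_u = inner(s_current, previous)  # sum_j w_j s_j a_{u-1}(j) a_{u-2}(j)
        raw = [(s - b_u) * a1 - c_u * a2
               for s, a1, a2 in zip(scores, current, previous)]
        norm = math.sqrt(inner(raw, raw))  # rescale to unit weighted length
        nxt = [r / norm for r in raw]
        polys.append(nxt)
        previous, current = current, nxt
    return polys


# Hypothetical natural scores and marginal proportions for four ordered categories.
scores = [1.0, 2.0, 3.0, 4.0]
weights = [0.1, 0.2, 0.3, 0.4]
polys = orthogonal_polynomials(scores, weights, max_degree=2)
for degree, values in enumerate(polys):
    print(degree, [round(v, 4) for v in values])
```

Changing `scores` (for example to midranks or unequally spaced values) changes every polynomial after the constant one, which is why the choice of a priori scores matters for the resulting analysis.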

#### The role of symmetric and asymmetric association in correspondence analysis

*Eric Beh and Rosaria Lombardo*

Typically, for the analysis of a two-way contingency table, the association between the variables is assumed to be structured such that they are both predictor variables. This is because such a structure allows for Pearson's chi-squared statistic to be used to assess the statistical significance of the association. However, there are times when (for practical reasons) it is more reasonable to treat one variable as being a predictor variable and the second variable as the response variable. Such an asymmetric association structure can be formally assessed using the Goodman-Kruskal tau index and visually assessed using non-symmetrical correspondence analysis (NSCA). Many of the features of NSCA remain the same as the traditional “symmetric” approach that uses the Pearson chi-squared statistic at its foundations, with the interpretation of a correspondence plot, or biplot, being slightly different – due solely to the asymmetric structure of the variables. This ongoing project examines the features of NSCA, in particular for nominal and ordinal variables, as well as variations of correspondence analysis that expand the technique for the analysis of the association between multiple categorical variables.
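
As a small illustration of the asymmetric index mentioned above, the following sketch computes the Goodman-Kruskal tau for a two-way table. It assumes, purely for the example, that the rows hold the response variable and the columns the predictor; the function name and the tables are hypothetical.

```python
def goodman_kruskal_tau(table):
    """Goodman-Kruskal tau for predicting the row category from the column.

    Assumes rows are the response and columns the predictor. The numerator
    sums the weighted squared differences between each column profile
    p_ij / p_.j and the row marginal p_i. .
    """
    n = float(sum(sum(row) for row in table))
    p = [[cell / n for cell in row] for row in table]
    row_tot = [sum(row) for row in p]
    col_tot = [sum(col) for col in zip(*p)]
    explained = sum(
        col_tot[j] * (p[i][j] / col_tot[j] - row_tot[i]) ** 2
        for i in range(len(row_tot))
        for j in range(len(col_tot))
        if col_tot[j] > 0
    )
    total = 1.0 - sum(r ** 2 for r in row_tot)
    return explained / total


# Hypothetical tables: independence, perfect prediction, and a mixed case.
independent = [[10, 20], [30, 60]]
perfect = [[25, 0], [0, 75]]
mixed = [[30, 10, 5], [10, 20, 25]]
transposed = [list(col) for col in zip(*mixed)]

print(goodman_kruskal_tau(independent))
print(goodman_kruskal_tau(perfect))
print(goodman_kruskal_tau(mixed), goodman_kruskal_tau(transposed))
```

Transposing the table swaps the roles of predictor and response, and the index generally changes, which is the asymmetry that NSCA is designed to display.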

#### Strategies for dealing with overdispersion in contingency tables when performing correspondence analysis

*Eric Beh and Rosaria Lombardo*

When a correspondence analysis is applied to a two-way contingency table, it is performed by first decomposing a matrix of standardised residuals using singular value decomposition. The advantage of doing this is that the sum-of-squares of these residuals, and of the squared singular values, is equivalent to Pearson's classic chi-squared statistic. Such residuals, which are treated as being asymptotically normally distributed, arise by assuming that the cell frequencies of the contingency table are Poisson random variables; doing so means that their expectation and variance are equivalent. However, there is clear evidence in the statistics literature that suggests that the variance of these residuals exceeds their expectation. Thus, we observe overdispersion in the table. This project therefore investigates various strategies that can be undertaken to deal with overdispersion, including assuming that the cell counts are from a generalised Poisson, Conway-Maxwell Poisson or negative binomial distribution. Variance-stabilising strategies can also be included, such as considering the *adjusted standardised residual* and the *Freeman-Tukey residual*. As part of this project, adopting such strategies means that one needs to examine their impact on how to quantify the overall association between the variables, and on the interpretation of the low-dimensional visual display that can be generated. Extensions to examining this issue for multiple categorical variables are also under consideration.
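
The residuals named above can be compared side by side on a small table. The sketch below is illustrative only: the table is hypothetical, the residuals are computed on counts (so their sum of squares is Pearson's statistic itself, and dividing by the sample size gives the total inertia), and the Freeman-Tukey residual is taken in its classic form, which may differ from the variant adopted in the project.

```python
import math


def residual_tables(table):
    """Pearson, adjusted standardised and Freeman-Tukey residuals for a table of counts."""
    n = float(sum(sum(row) for row in table))
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    pearson, adjusted, freeman_tukey = [], [], []
    for i, row in enumerate(table):
        p_row, a_row, f_row = [], [], []
        for j, n_ij in enumerate(row):
            e_ij = row_tot[i] * col_tot[j] / n  # expected count under independence
            r_ij = (n_ij - e_ij) / math.sqrt(e_ij)
            p_row.append(r_ij)
            # Adjusted residual: Pearson residual rescaled by the marginal proportions.
            a_row.append(r_ij / math.sqrt((1 - row_tot[i] / n) * (1 - col_tot[j] / n)))
            # Classic Freeman-Tukey form (an assumption of this sketch).
            f_row.append(math.sqrt(n_ij) + math.sqrt(n_ij + 1) - math.sqrt(4 * e_ij + 1))
        pearson.append(p_row)
        adjusted.append(a_row)
        freeman_tukey.append(f_row)
    return pearson, adjusted, freeman_tukey


# Hypothetical 2 x 3 contingency table.
counts = [[20, 15, 5], [10, 25, 25]]
pearson, adjusted, freeman_tukey = residual_tables(counts)
chi_squared = sum(r ** 2 for row in pearson for r in row)
print("Pearson chi-squared:", round(chi_squared, 4))
print("Adjusted residuals:", [[round(r, 3) for r in row] for row in adjusted])
print("Freeman-Tukey residuals:", [[round(r, 3) for r in row] for row in freeman_tukey])
```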

#### On the construction of biplots for the visualisation of ordered categorical variables

*Eric Beh and Rosaria Lombardo*

For more than 20 years, variants of correspondence analysis have been developed that accommodate the structure of ordinal categorical variables using orthogonal polynomials. When the visual display from this analysis is the biplot, projections linking the origin to the standard coordinate of each category are a common feature. When a column variable, say, consists of ordered categories, the biplot can be constructed so that their standard coordinates are determined using orthogonal polynomials, which require a set of a priori scores that reflect the ordered structure of the categories. When the first two polynomials are used to construct the biplot, they produce a configuration of standard coordinates that appears to be parabolic in shape. This project explores the exact nature of this parabolic relationship and examines the various features of this configuration of points. In particular, simple formulae can be derived to determine the focus, vertex, intercepts and directrix of this relationship. Since the use of orthogonal polynomials requires choosing a priori scores to reflect the ordinal nature of the categories of a variable, this project also explores the impact of different scores on these features. Ongoing research in this area means that this project includes examining the relationship between the first-order and higher-order polynomials and the impact such a relationship has on the interpretation of the biplot.

#### The impact of power transformations to reciprocal averaging, canonical correlation analysis and correspondence analysis

*Eric Beh and Rosaria Lombardo*

The role of transformations has gained wide attention in the correspondence analysis literature. In particular, such transformations have focused on the profiles of a two-way contingency table, largely due to the impact of the work undertaken by Michael Greenacre over a decade ago. While his work examined the impact of a power transformation of the elements of a contingency table and of a profile, the results from this approach can also be obtained by considering the same power transformations from a reciprocal averaging and canonical correlation perspective. A few questions arise though. For example, what possible range of transformations exists that ensures that the correspondence analysis is depicting an association between categorical variables that remains statistically significant? Also, what happens if transformations other than a power transformation – such as a log transformation or a trigonometric transformation – are considered? This project expands the role of power transformations in correspondence analysis and its related areas, including the impact of such transformations on the interpretation of the resulting low-dimensional visualisations that can be obtained from them.

#### Correspondence analysis and the Cressie-Read family of divergence statistics

*Eric Beh and Rosaria Lombardo*

The foundations of correspondence analysis rest with Pearson's famous chi-squared statistic, which provides the numerical groundwork for visualising how categorical variables are associated. It has recently been shown that the Freeman-Tukey statistic can also play an important role, confirming the advantages of the Hellinger distance that have long been advocated in the literature. Pearson's and the Freeman-Tukey statistics are two of five commonly used special cases of the Cressie-Read family of divergence statistics. Therefore, correspondence analysis can be expanded so that this family lies at the heart of how the association is quantified and visualised. The advantage of using the Cressie-Read family of divergence statistics when performing correspondence analysis is that it includes as special cases two variants that have gained some attention in the literature – the Hellinger distance decomposition (HDD) method and log-ratio analysis (LRA). Expanding correspondence analysis in this way also enables some general features to be obtained – such as coordinate systems, models of association/correlation, and distance measures – and allows flexibility when defining the “best” and “worst” possible visualisation of the association. This project therefore examines the role of the Cressie-Read family of divergence statistics in the correspondence analysis of a two-way contingency table. Possible extensions to this project include expanding it to the analysis of a multi-way contingency table, examining the impact on the visual display (such as the traditional correspondence plot, or the biplot) and exploring whether asymmetric associations can be incorporated into this framework.
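
For readers unfamiliar with the family, the following minimal sketch evaluates the Cressie-Read power-divergence statistic for a two-way table under independence and recovers two of its special cases. The function name `cressie_read`, the parameter name `lam` and the table are hypothetical, and the sketch assumes that every cell count is positive; it only quantifies the association and does not perform the correspondence analysis itself.

```python
import math


def cressie_read(table, lam):
    """Cressie-Read divergence between observed and independence-expected counts.

    Assumes all counts are positive. lam = 1 gives Pearson's statistic,
    lam = -0.5 the Freeman-Tukey statistic, lam -> 0 the likelihood-ratio
    statistic, lam -> -1 the modified likelihood-ratio statistic and
    lam = -2 Neyman's modified chi-squared statistic.
    """
    n = float(sum(sum(row) for row in table))
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    stat = 0.0
    for i, row in enumerate(table):
        for j, n_ij in enumerate(row):
            e_ij = row_tot[i] * col_tot[j] / n
            if abs(lam) < 1e-12:          # limiting case lam -> 0
                stat += 2.0 * n_ij * math.log(n_ij / e_ij)
            elif abs(lam + 1.0) < 1e-12:  # limiting case lam -> -1
                stat += 2.0 * e_ij * math.log(e_ij / n_ij)
            else:
                stat += 2.0 / (lam * (lam + 1.0)) * n_ij * ((n_ij / e_ij) ** lam - 1.0)
    return stat


# Hypothetical 2 x 3 contingency table with strictly positive counts.
counts = [[20, 15, 5], [10, 25, 25]]
for lam in (1.0, 0.0, -0.5, -1.0, -2.0):
    print(lam, round(cressie_read(counts, lam), 4))
```

Printing the five values side by side shows how much the measured departure from independence depends on the chosen member of the family, which is what motivates treating the parameter as part of the analysis.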