### Science Brief

# Big data methods and psychological science

By Kevin Grimm, Ross Jacobucci, and John J. McArdle, PhD

Kevin J. Grimm, PhD, is a professor in the quantitative research methods area of the department of psychology at Arizona State University. His research interests include longitudinal data analysis, mixture modeling and data mining. He is an author of "Growth Modeling: Structural Equation and Multilevel Modeling Approaches" and has taught at APA’s Advanced Training Institutes (ATIs) since 2003.

Ross Jacobucci, MA, is a PhD candidate in quantitative psychology at the University of Southern California. His main research interest is in integrating concepts from data mining with latent variable models, with specific application in both cognitive aging and clinical psychology.

John J. McArdle, PhD, is a professor of psychology at the University of Southern California. His research interests include longitudinal data analysis, structural equation modeling, and data mining. He authored several books, including "Longitudinal Data Analysis using Structural Equation Models" and created APA’s ATIs on Structural Equation Modeling in Longitudinal Research and Big Data: Exploratory Data Mining in Behavioral Research.

## Big Data: Exploratory Data Mining in Behavioral Research

This APA Advanced Training Institute provides an overview of recent methodological advances in exploratory data mining for the analysis of psychological and behavioral data.

**Arizona State University, Tempe, Arizona, June 5-9, 2017**

Big data methods, often referred to as machine learning, statistical learning and data mining, are a collection of statistical techniques capable of finding complex signals in large amounts of data. Capitalizing on the availability of data from diverse sources like cell phone applications, biosensors and social media, researchers seek to derive structure and meaning from these massive amounts of data, to uncover patterns and to make predictions. Given that many of these data are behavioral, psychologists should have a major role in their analysis.

Data mining methods have garnered much attention of late; however, their use in psychology remains limited, and several factors may explain why. Psychologists may be hesitant because of the exploratory nature of these methods. As theory-driven researchers, psychologists use statistics to test specific hypotheses. However, this confirmatory approach does not give researchers a systematic way to explore or learn from the data collected. The lack of guidance regarding data exploration has led to poor research practices and a lack of safeguards to prevent chance findings. Data mining methods, on the other hand, allow for efficient searching and model development from data and, at the same time, have safeguards to prevent overfitting, that is, tailoring a model to fit the empirical data at hand.

Another reason psychologists may not be using data mining methods in their research is that many of these methods are advertised as applicable to “big data,” and many psychologists do not consider the data they gather and analyze “big” enough to use these methods effectively. While it is true these methods are often used in datasets with a large sample size and a large number of variables, they can also be productively used in smaller scale studies, as we discuss below. That is, even with smaller datasets, psychological scientists can and should use these methods to learn from their data (see also Tukey, 1962) and to inform further hypothesis generation.

## Data mining methods

Data mining methods can be roughly organized into two major classes: supervised learning methods and unsupervised learning methods. In supervised learning, there is an outcome of interest and the goal is to develop a prediction model based on a set of variables. Most supervised learning methods are focused on variable selection, nonlinearity and interactive effects and thus offer many advantages over standard regression models. Regression models with a large number of variables can be unstable, particularly if there is a high degree of correlation among the predictor variables. Additionally, when the number of variables is large, it can be next to impossible to manually search for which interactions may be present. The goal of supervised learning methods is to identify the important variables, nonlinear forms of the variables and/or their interactive effects. These approaches often yield a model that is simpler and more interpretable because the important effects can be isolated. Furthermore, the resulting model is more likely to replicate in a new sample.
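To make the idea of automatic variable selection concrete, here is a minimal sketch of one such supervised method, lasso regression (discussed for behavioral data by McNeish, 2015), written with only the Python standard library. The simulated data, the penalty value of 0.4 and all function names are illustrative assumptions, not part of any study described here; in practice the penalty would be tuned by cross-validation and an established package would be used.

```python
import random

def soft_threshold(value, penalty):
    """Shrink value toward zero; return exactly 0 when |value| <= penalty."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0

def center(values):
    mean = sum(values) / len(values)
    return [v - mean for v in values]

def lasso_coordinate_descent(X, y, penalty, n_sweeps=50):
    """Minimize (1/(2n)) * residual sum of squares + penalty * sum(|b_j|).

    Assumes the columns of X and the outcome y are mean-centered,
    so no intercept is estimated.
    """
    n, p = len(X), len(X[0])
    coefs = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            rho = 0.0     # covariance of predictor j with the partial residual
            col_ss = 0.0  # sum of squares of predictor j
            for i in range(n):
                pred_others = sum(X[i][k] * coefs[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_others)
                col_ss += X[i][j] ** 2
            coefs[j] = soft_threshold(rho / n, penalty) / (col_ss / n)
    return coefs

# Simulated example (assumption): six predictors, only the first two matter.
random.seed(1)
n, p = 200, 6
columns = [center([random.gauss(0, 1) for _ in range(n)]) for _ in range(p)]
X = [[columns[j][i] for j in range(p)] for i in range(n)]
y = center([2.0 * X[i][0] - 1.5 * X[i][1] + random.gauss(0, 1) for i in range(n)])

coefs = lasso_coordinate_descent(X, y, penalty=0.4)
selected = [j for j, b in enumerate(coefs) if b != 0.0]
print("coefficients:", [round(b, 2) for b in coefs])
print("selected predictors:", selected)
```

The penalty sets the coefficients of weak predictors to exactly zero, which is what makes the resulting model smaller and easier to interpret; the price is that the retained coefficients are shrunk toward zero as well.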

In unsupervised learning, there is no outcome variable that we wish to explain; instead our goal is to group variables or participants based on their degree of similarity or covariation. Unlike supervised learning methods, unsupervised learning is commonly used in psychological research. For example, data reduction methods, such as principal components analysis (PCA) and exploratory factor analysis (EFA), are quite common in psychology as are methods for grouping participants, such as cluster analysis and finite mixture modeling.

## Current use of data mining in psychological research

Supervised learning methods have rarely been used in psychology; however, these methods should and will play a greater role in psychological research in the future. As we noted, one reason these methods may not have taken hold in psychology is that researchers may think they require massive amounts of data, with many participants and many variables. It is worth noting that many data mining methods work well in small data settings. For instance, when accounting for missingness due to attrition, classification and regression trees (Breiman, Friedman, Stone & Olshen, 1984) outperformed multiple imputation in small samples (N < 500; Hayes, Usami, Jacobucci & McArdle, 2015). As a second example, the use of shrinkage in Bayesian structural equation modeling has been found to produce less biased estimates in small samples than maximum likelihood estimation (McNeish, 2016).

Although data mining algorithms can be applied with smaller samples, researchers must be careful with their use. With smaller datasets comes a higher propensity to explain noise or unique features of the data (i.e., overfitting). To overcome this issue, it is absolutely necessary to use various forms of cross-validation in concert with these methods. Although not a novel concept in psychology (Browne, 2000), cross-validation is rarely used in psychological research. Cross-validation commonly entails splitting the dataset into two parts, a training dataset and a test dataset. With the training dataset, we can explore to our heart’s content, but we typically use a form of internal cross-validation to prevent overfitting in the training dataset. After we explore, we choose a small number of models (i.e., one to three) that we think fit reasonably well and examine how well these models predict in the test dataset. Note that this does not mean that we re-estimate the model on the test dataset. Instead, we take the model created on the training dataset and use it to generate predictions for the test data. This gives us a more realistic assessment of how well the model would perform if data from a new sample were collected.
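The final hold-out step can be sketched in a few lines of standard-library Python. The simulated data, the 600/400 split and the two competing models (a straight line and a deliberately overfitting nearest-neighbor rule that memorizes the training cases) are assumptions chosen for illustration; the internal cross-validation used during exploration is omitted to keep the example short.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares with one predictor; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def predict_line(model, x):
    intercept, slope = model
    return intercept + slope * x

def predict_nearest(train_xs, train_ys, x):
    """Memorizing model: return the outcome of the closest training case."""
    best = min(range(len(train_xs)), key=lambda i: abs(train_xs[i] - x))
    return train_ys[best]

def mse(predictions, observed):
    return sum((p - o) ** 2 for p, o in zip(predictions, observed)) / len(observed)

# Simulated data (assumption): a linear effect plus noise.
random.seed(2017)
xs = [random.uniform(0, 10) for _ in range(1000)]
ys = [1.0 + 0.5 * x + random.gauss(0, 1) for x in xs]

# Split once: 600 training cases for exploration, 400 test cases held back.
train_x, test_x = xs[:600], xs[600:]
train_y, test_y = ys[:600], ys[600:]

# Both models are built from the training data only; nothing is re-estimated
# on the test data, which is used solely to score the predictions.
line = fit_line(train_x, train_y)
line_train = mse([predict_line(line, x) for x in train_x], train_y)
line_test = mse([predict_line(line, x) for x in test_x], test_y)
near_train = mse([predict_nearest(train_x, train_y, x) for x in train_x], train_y)
near_test = mse([predict_nearest(train_x, train_y, x) for x in test_x], test_y)

print("line:    train MSE %.2f, test MSE %.2f" % (line_train, line_test))
print("nearest: train MSE %.2f, test MSE %.2f" % (near_train, near_test))
```

The memorizing model reproduces the training data perfectly yet predicts the held-out cases worse than the simple line, which is the pattern that only an evaluation on data not used for model building can reveal.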

As we noted, unsupervised learning methods are quite common in psychology. PCA and EFA are common data reduction methods with EFA often a first step in understanding data dimensionality. In many instances, an EFA model is applied to half of the dataset and then a confirmatory factor analysis (CFA) model is estimated on the remaining half of the data as a way to separate the exploratory from the confirmatory aspects of data analysis. This approach is similar to cross-validation, but in psychology researchers often do not validate the exact model. Typically, the model is re-estimated in the CFA and factor loadings that were negligible in the EFA are fixed to 0 in the CFA.

Similar to PCA and EFA, cluster analysis and finite mixture models are common in psychology and the social sciences. Finite mixture models are increasingly being used to search for groups with different data patterns or associations. In psychology, few effects are universal and finite mixture models are a way for researchers to search for conditional effects. One issue with the current use of finite mixture modeling in psychology is that cross-validation is rarely used to evaluate the viability of a model. However, cross-validation has recently been given greater attention in mixture modeling (see Grimm, Mazza & Davoudzadeh, in press; Masyn, 2013).
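As a rough illustration of how cross-validation can inform the choice of the number of classes, the following sketch fits univariate normal mixtures by expectation-maximization and compares them on held-out log-likelihood across five folds. It is a simplified stand-in written for this example, not the procedure of Grimm, Mazza & Davoudzadeh; the simulated scores, the starting values, the variance floor and all function names are assumptions.

```python
import math
import random

def normal_pdf(x, mean, sd):
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def fit_mixture(xs, n_classes, n_iter=50):
    """EM for a univariate normal mixture; returns (weights, means, sds)."""
    n = len(xs)
    ordered = sorted(xs)
    # Deterministic start (assumption): class means at evenly spaced quantiles.
    means = [ordered[int((j + 0.5) * n / n_classes)] for j in range(n_classes)]
    grand_mean = sum(xs) / n
    start_sd = math.sqrt(sum((x - grand_mean) ** 2 for x in xs) / n)
    sds = [start_sd] * n_classes
    weights = [1.0 / n_classes] * n_classes
    for _ in range(n_iter):
        # E-step: posterior class probabilities for each case.
        post = []
        for x in xs:
            dens = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds)]
            total = max(sum(dens), 1e-300)
            post.append([d / total for d in dens])
        # M-step: update weights, means and SDs from the posteriors.
        for j in range(n_classes):
            nj = max(sum(p[j] for p in post), 1e-10)
            weights[j] = nj / n
            means[j] = sum(p[j] * x for p, x in zip(post, xs)) / nj
            var = sum(p[j] * (x - means[j]) ** 2 for p, x in zip(post, xs)) / nj
            sds[j] = max(math.sqrt(var), 0.05)  # floor keeps a class from collapsing
    return weights, means, sds

def log_likelihood(xs, model):
    weights, means, sds = model
    total = 0.0
    for x in xs:
        dens = sum(w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds))
        total += math.log(max(dens, 1e-300))
    return total

def cv_log_likelihood(xs, n_classes, n_folds=5):
    """Sum of held-out log-likelihoods: fit on k-1 folds, score the fold left out."""
    total = 0.0
    for fold in range(n_folds):
        train = [x for i, x in enumerate(xs) if i % n_folds != fold]
        held_out = [x for i, x in enumerate(xs) if i % n_folds == fold]
        total += log_likelihood(held_out, fit_mixture(train, n_classes))
    return total

# Simulated scores (assumption): two well-separated groups of 150 cases each.
random.seed(42)
scores = [random.gauss(0, 1) for _ in range(150)] + [random.gauss(5, 1) for _ in range(150)]
random.shuffle(scores)

cv = {k: cv_log_likelihood(scores, k) for k in (1, 2, 3)}
for k, value in cv.items():
    print("classes: %d, held-out log-likelihood: %.1f" % (k, value))
```

Because every model is scored on cases it was not fit to, adding classes improves the criterion only when the extra classes capture structure that generalizes beyond the training folds.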

## Expansion of big data methods to psychology

Supervised learning methods are not often used in psychology, largely because of the limited attention they have received from methodologists in the psychological sciences. Slowly but surely this is changing, as more and more data mining methods are being adapted to the nuances and intricacies of psychological data and methods (see McNeish, 2015; Strobl, Malley & Tutz, 2009). Specifically, we (and many others) have focused on combining many of these big data methods with latent variable models that are common in psychology.

Latent variable models (e.g., confirmatory factor models, structural equation models [SEMs]) are common in psychology given our multivariate measurements and our fairly common longitudinal designs. Combining data mining algorithms with latent variable models is a necessary step to increase use among psychologists, and there are several recent examples of this integration. For example, Brandmaier, von Oertzen, McArdle & Lindenberger (2013) combined SEMs with classification and regression tree algorithms to develop SEM Trees. In SEM Trees, a set of predictor variables is used to partition the data and a user-specified SEM is fit to each partition of the data. The goal is to find the predictors and cut points that maximize the fit of the model. Essentially, this is an automatic way to search for groups of participants where members of the same group are homogeneous with respect to the SEM and members of different groups are heterogeneous with respect to the SEM (see Jacobucci, Grimm, & McArdle, in press). For example, SEM Trees can be used to find groups with different trajectories across time, or groups where different measurement models are present.
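The search for predictors and cut points can be illustrated with a deliberately simplified sketch. Here a normal model with its own mean and variance stands in for the user-specified SEM, and a single split is chosen by maximizing the summed log-likelihood of the two partitions. This is not the SEM Trees algorithm or its software; the toy data, the minimum partition size and the function names are assumptions made for the example.

```python
import math

def normal_loglik(values):
    """Maximized log-likelihood of a normal model with its own mean and variance."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    var = max(var, 1e-8)  # guard against a zero variance in tiny partitions
    return -0.5 * n * (math.log(2 * math.pi * var) + 1)

def best_split(rows, predictors, outcome, min_size=3):
    """Search every predictor and cut point for the partition with the best fit."""
    best = None  # (log-likelihood, predictor, cut point)
    for name in predictors:
        levels = sorted({row[name] for row in rows})
        for low, high in zip(levels, levels[1:]):
            cut = (low + high) / 2
            left = [row[outcome] for row in rows if row[name] <= cut]
            right = [row[outcome] for row in rows if row[name] > cut]
            if len(left) < min_size or len(right) < min_size:
                continue
            fit = normal_loglik(left) + normal_loglik(right)
            if best is None or fit > best[0]:
                best = (fit, name, cut)
    return best

# Toy data (assumption): scores are about 10 up to age 40 and about 14 afterward.
rows = []
for i, age in enumerate(range(20, 80, 5)):
    wiggle = 0.5 if i % 2 == 0 else -0.5
    level = 10.0 if age <= 40 else 14.0
    rows.append({"age": age, "sex": (i // 2) % 2, "score": level + wiggle})

fit, predictor, cut = best_split(rows, ["age", "sex"], "score")
print(predictor, cut)  # prints: age 42.5
```

A full tree would apply the same search recursively within each partition and would need a stopping rule, such as a likelihood ratio test or cross-validation, to avoid splitting on noise.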

In a similar vein, Jacobucci, Grimm & McArdle (2016) combined regularization, a method common in high-dimensional regression, with SEMs to create regularized SEM (RegSEM). RegSEM allows researchers to penalize specific parameters in an SEM, leading to simpler and more replicable SEMs. There have been similar developments in the multilevel modeling framework. For example, Hajjem, Bellavance & Larocque (2011) and Sela and Simonoff (2012) combined mixed-effects models and regression trees to create mixed-effects regression trees. These approaches can efficiently search high dimensional hierarchically structured data for nonlinear and interactive effects.
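In general terms, a lasso-type penalty of this kind adds a term to the usual maximum likelihood fit function. The expression below is a sketch in our own notation, where $\mathcal{P}$ is the set of parameters the researcher chooses to penalize (for example, selected factor loadings or regression paths) and $\lambda \ge 0$ is a penalty weight, typically chosen by cross-validation; the published formulation may differ in its details.

```latex
F_{\text{RegSEM}}(\theta) = F_{\text{ML}}(\theta) + \lambda \sum_{j \in \mathcal{P}} \lvert \theta_j \rvert
```

With $\lambda = 0$ the expression reduces to ordinary maximum likelihood estimation; as $\lambda$ grows, the penalized parameters are pulled toward zero and some may be set to exactly zero, which yields the simpler model.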

While this recent work makes certain algorithms more applicable to social scientists, we highlight a challenge that has received less attention: incomplete data. Simply put, many data mining algorithms require complete data. Furthermore, different programs handle incomplete data in different ways. Given that incomplete data are common in psychological studies and often not missing completely at random, models can yield biased results or, at the least, results that depend on the method used to handle incomplete data. Thus, one avenue for future research that will drastically increase the utility of many of these methods in psychological research is the incorporation of contemporary missing data methods, such as multiple imputation or full information maximum likelihood estimation, into data mining programs.

## Concluding remarks

Psychological researchers often strive to test theory-driven hypotheses with their statistical models, but at the same time researchers are willing to learn from their data through exploration. A concern with this exploration is that researchers conduct it in idiosyncratic ways, without the necessary safeguards to prevent chance findings, and tend to tailor their models to the data at hand. Data mining methods, for the most part, are strictly exploratory procedures that efficiently search the data for associations and nonlinear effects and have safeguards to prevent overfitting. For these reasons, we encourage psychological researchers to consider and evaluate the use of data mining algorithms in their research.

## References

Brandmaier, A.M., von Oertzen, T., McArdle, J.J., & Lindenberger, U. (2013). Structural equation model trees. *Psychological Methods, 18*, 71-86.

Breiman, L., Friedman, J., Stone, C.J., & Olshen, R.A. (1984). *Classification and regression trees*. Boca Raton, Florida: CRC Press.

Browne, M.W. (2000). Cross-validation methods. *Journal of Mathematical Psychology, 44*, 108-132.

Grimm, K.J., Mazza, G., & Davoudzadeh, P. (in press). Model selection in finite mixture models: A k-fold cross-validation approach. *Structural Equation Modeling: A Multidisciplinary Journal*.

Hajjem, A., Bellavance, F., & Larocque, D. (2011). Mixed effects regression trees for clustered data. *Statistics & Probability Letters, 81*, 451-459.

Hayes, T., Usami, S., Jacobucci, R., & McArdle, J.J. (2015). Using Classification and Regression Trees (CART) and random forests to analyze attrition: Results from two simulations. *Psychology and Aging, 30*, 911-929.

Jacobucci, R., Grimm, K.J., & McArdle, J.J. (2016). Regularized structural equation modeling. *Structural Equation Modeling: A Multidisciplinary Journal, 23*, 555-566.

Jacobucci, R., Grimm, K.J., & McArdle, J.J. (in press). A comparison of methods for uncovering sample heterogeneity: Structural equation model trees and finite mixture models. *Structural Equation Modeling: A Multidisciplinary Journal*.

Masyn, K. (2013). Latent class analysis and finite mixture modeling. In T.D. Little (Ed.) *The Oxford handbook of quantitative methods in psychology* (Vol. 2, pp. 551-611). New York: Oxford University Press.

McNeish, D.M. (2015). Using lasso for predictor selection and to assuage overfitting: A method long overlooked in behavioral sciences. *Multivariate Behavioral Research, 50*, 471-484.

McNeish, D. (2016). On using Bayesian methods to address small sample problems. *Structural Equation Modeling: A Multidisciplinary Journal, 23*, 750-773.

Strobl, C., Malley, J., & Tutz, G. (2009). An introduction to recursive partitioning: rationale, application, and characteristics of classification and regression trees, bagging, and random forests. *Psychological Methods, 14*, 323-348.

Tukey, J.W. (1962). The future of data analysis. *The Annals of Mathematical Statistics, 33*, 1-67.

The views expressed in this article are those of the authors and do not reflect the opinions or policies of APA.

PSA is the monthly e-newsletter of the APA Science Directorate. It is read by psychologists, students, academic administrators, journalists and policymakers in Congress and federal science agencies.