Friday, 13 January 2017

Getting to the Median (Mean, Range, SD) Intercorrelation of a set of Variables Quickly in R



Sometimes you are interested in the inter-relatedness of many variables. That's pretty much nine tenths of psychometrics.

You may want to use Meng's test of the heterogeneity of correlated correlation coefficients (see this previous blog and this one too if you're interested in that test). Perhaps you have many individuals measured on numerous predictor variables and you want to check whether their relations with a shared outcome variable differ. To do this test you'll need to find the median inter-correlation of the predictor variables. More specifically, you need the median Pearson's r.

This is a bit of a pain in the backside because most statistical software packages don't have a checkbox for this. Finding the median inter-correlation of a set of variables is not something SPSS, for example, will readily do for you. It's an unusual request.

Below is some simple code for finding the median inter-correlation of a set of variables. I've nicked most of it from this page on the STHDA website, so the credit goes to them. That's a good website, so check it out when you've got a minute.

But, before that, two things.

1) Create a .csv file containing ONLY the raw data for the variables whose heterogeneity you are interested in testing. So no age, gender, IP address, eye colour, ethnicity, or the shared outcome variable of interest. None of that. Just the columns for the specific variables, so the .csv should have exactly as many columns as variables. If you don't know how to do that, don't worry, it's not difficult, and you can find out here.

2) Install the package Hmisc, which contains many varied and useful functions for data analysis, not just the rcorr function used here.
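
If your data currently sit in a bigger file, step 1 can also be done in R itself. What follows is only a minimal sketch: the data frame `full` and the column names (`pred1`, `pred2`, `pred3`, `outcome`, `age`) are made up for illustration, and the file is written to a temporary folder rather than your real data folder, so swap in your own names and path.

```r
# Hypothetical example data standing in for your full data set
set.seed(1)
full <- data.frame(
  age     = sample(18:65, 20, replace = TRUE),
  pred1   = rnorm(20),
  pred2   = rnorm(20),
  pred3   = rnorm(20),
  outcome = rnorm(20)
)

# Keep ONLY the predictor columns (these names are assumptions)
predictors <- full[, c("pred1", "pred2", "pred3")]

# Write them out without row names, so the .csv has exactly one column per variable
csv_path <- file.path(tempdir(), "mycsv.csv")
write.csv(predictors, csv_path, row.names = FALSE)

# Read the file back to check that only the predictor columns survived
check <- read.csv(csv_path)
names(check)
```

The `row.names = FALSE` part matters: without it, write.csv adds a column of row numbers, which read.csv would then treat as another variable in the correlation matrix.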

Then do this:

#Code for finding the median intercorrelation of a set of variables.
#Before running: the .csv must contain ONLY the columns for the variables of interest,
#and the Hmisc package must be installed, e.g. with install.packages("Hmisc").
#set working directory - change D:/Data to wherever on your computer your .csv is stored
setwd("D:/Data")
#import csv
df1 <- read.csv("mycsv.csv")
#create a correlation matrix of your variables (however many you have).
#Importantly, you absolutely have to have Hmisc installed and loaded for this part
#as that's where the rcorr function comes from.
#If it's not loaded you'll get an error telling you R could not find the function "rcorr".
#load Hmisc
library(Hmisc)
#create matrix from csv
matrx <- rcorr(as.matrix(df1))
#show matrix
matrx
#Right, we're off to a good start, but we need to turn that matrix into data that we can analyse.
#To do so, flatten the correlation matrix by creating a function flattenCorrMatrix.
#This function turns the upper triangle of the matrix into a data frame with one row per pair of variables.
#function for flattening
flattenCorrMatrix <- function(cormat, pmat) {
  ut <- upper.tri(cormat)
  data.frame(
    row = rownames(cormat)[row(cormat)[ut]],
    column = rownames(cormat)[col(cormat)[ut]],
    cor = cormat[ut],
    p = pmat[ut]
  )
}
#flatten rs and ps
df2 <- flattenCorrMatrix(matrx$r, matrx$P)
#show flattened matrix
df2
#find median of r
median(df2$cor)
#And there you have it, the median Pearson's r.
#You can now plug it in to your Meng's test or do whatever you want with it.
#Use other functions to ask other things of your correlation matrix. Mean? Range? SD?
mean(df2$cor)
range(df2$cor)
sd(df2$cor)
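
For the "plug it in to your Meng's test" step, here is a rough sketch of where the median r goes. It follows the heterogeneity formula as given in Meng, Rosenthal and Rubin (1992), so check it against the paper before relying on it. The function name `meng_heterogeneity` and the numbers fed into it are made up for illustration; `r_x` is the value you get from median(df2$cor).

```r
# Sketch of the Meng, Rosenthal & Rubin (1992) heterogeneity test.
# r_xy : correlation of each predictor with the shared outcome
# r_x  : median intercorrelation among the predictors
# n    : number of individuals
meng_heterogeneity <- function(r_xy, r_x, n) {
  z      <- atanh(r_xy)                              # Fisher's r-to-z transformation
  r2_bar <- mean(r_xy^2)                             # mean squared correlation with the outcome
  f      <- min((1 - r_x) / (2 * (1 - r2_bar)), 1)   # f is capped at 1
  h      <- (1 - f * r2_bar) / (1 - r2_bar)
  chi_sq <- (n - 3) * sum((z - mean(z))^2) / ((1 - r_x) * h)
  df     <- length(r_xy) - 1                         # k - 1 degrees of freedom
  list(chi_sq = chi_sq, df = df,
       p = pchisq(chi_sq, df, lower.tail = FALSE))
}

# Made-up numbers purely for illustration
res <- meng_heterogeneity(r_xy = c(0.30, 0.45, 0.55), r_x = 0.40, n = 100)
res
```

A larger median intercorrelation shrinks the denominator, so the same spread of correlations with the outcome produces a larger chi-square. That is why the test needs this number in the first place.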


The great thing about doing it this way is that next time you have to do it (on another data set collected at a different time) you can just rerun the script.
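
One way to make the rerun even less painful is to wrap the steps in a function. The sketch below is an illustration only: the function name `intercor_summary` is made up, the data are simulated stand-ins for your .csv, and it swaps Hmisc's rcorr for base R's cor() with pairwise deletion of missing values, which should give the same r values for numeric data.

```r
# Hypothetical helper: summarise the intercorrelations of all columns in a data frame
intercor_summary <- function(dat) {
  cormat <- cor(as.matrix(dat), use = "pairwise.complete.obs")
  rs <- cormat[upper.tri(cormat)]   # each pair of variables once, diagonal excluded
  c(median = median(rs), mean = mean(rs),
    min = min(rs), max = max(rs), sd = sd(rs))
}

# Simulated data standing in for read.csv("mycsv.csv")
set.seed(42)
shared <- rnorm(50)
sim <- data.frame(
  v1 = shared + rnorm(50),
  v2 = shared + rnorm(50),
  v3 = shared + rnorm(50),
  v4 = shared + rnorm(50)
)

out <- intercor_summary(sim)
round(out, 2)
```

With k variables the upper triangle holds k(k-1)/2 correlations, so the four simulated variables here give six values to summarise.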

