Friday 27 January 2017

Tips for Early-Stage Use of R



As my PhD draws to a close, I've made a commitment to learn how to do all the stats I do in R. This is because (a) the packages available for R are just extremely good pieces of kit and (b) I am worried about the possibility of not being attached to a research institution and therefore having to pay a small fortune for statistical packages. SPSS, for example, is not only a questionable piece of software, but also starts at £949 per year per user. No, there shouldn't be a decimal place in there - you really do have to pay nearly a thousand pounds. R, on the other hand, is brilliant and free. At least for now anyway.

I'll admit that I've had a few false starts along the way. For a childish reason that says more about me than it does about R: I've got really really angry when it did something stupid and/or wouldn't give me the thing I wanted it to give me or wouldn't let me do the thing I wanted to do. I think I might have been avoidantly attached in childhood or something. Which means I have shunned R at the first sign of rejection on more than one occasion.

I think at least two of those false starts could have been avoided had someone provided me with some really basic information. The kind of information that they don't really tell you in the books they write about R. Probably because the authors of those books have a level of expertise that hides the annoying little hurdles that the beginner has to clear. So they don't have answers to some of those basic questions that you may have.

Here are a few basic answers to a few basic questions.

Q: All I hear about on Twitter is this thing called R. What the hell is R? And why the hell should I use it? I've gotten this far without it, haven't I? 

R is a big scientific calculator that you can download and use on your computer. In itself that's not that impressive, granted, but it has all of these packages that you can download and use within it that are really useful. Think of R as a games console (PS4, Xbox) and the packages as the games you have to put in it. But the thing is you don't have to buy them, each game is about statistics or data analysis, and they're not an unbelievable waste of time: they're tools and they help you get your work done. Most of the packages allow you to do the same things you did in SPSS or SAS or AMOS or whatever, but most in some sense go above and beyond in the performance of the equivalent statistical procedures. They are also frequently updated. Think of each package, created by a passionate and probably crazed expert on that topic, as a dedicated SPSS for one statistical procedure or a handful of them. What this means is that each package helps you do something cool with your data or, not your data per se, numbers in general. In a way that you've probably never been able to before.

Want to generate data? Do it in R (e.g. "stats" or "MASS" package). Want to run a structural equation model? Do it in R (e.g. "lavaan" package). Want to do an exploratory factor analysis? Do it in R (e.g. "psych" package). Want some beautiful plots of your data? Do it in R (e.g. "ggplot2"). Want to do all of these things sequentially, save the code, and then be able to do it again, or have someone else do it, in a couple of months? Do it in R (write a script). Want to do something no one has done before? Make your own code or package. There's a creativity in R that you don't get with the conventional data software, because packages in combination open up infinite possibilities. And think how much time this is all saving you in the long run, not needing to go from SPSS to AMOS to LISREL to Excel and back again. And then back again the next time you do it.

Add to that the fact (a term here defined as my opinion) that R, particularly RStudio, is extremely user friendly and aesthetically about as good as any computer data-analysis aid you will find, and it's clear that choosing to get to know R has the potential to be one of the better decisions you'll make this year.
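To give a flavour of the "generate data" point, here is a minimal sketch using only the "stats" package that comes with R. The variable name scores and the numbers (100 values, mean 50, standard deviation 10) are made up for illustration.

```r
# Fix the random seed so you get the same "random" numbers every time
set.seed(27)

# 100 made-up scores from a normal distribution (mean 50, sd 10)
scores <- rnorm(100, mean = 50, sd = 10)

mean(scores)   # should land somewhere near 50
sd(scores)     # should land somewhere near 10
```

Change the seed and the numbers change; keep it and anyone running your code gets exactly what you got.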

Q. I want to quickly get a feel for it. I have some data. How do I get it into R?

Easy. Download R. And download RStudio (it is a separate download and needs R installed first). Hopefully they're already on your computer; it's likely that they are if you're on a computer at a British university. If not, go here https://cran.r-project.org/mirrors.html and choose somewhere near you. Then download the version of R that works with the operating system you have.

Now open RStudio. First, set your working directory. This is the folder on your computer where R will look when you tell it to load a file. Go to Session -> Set Working Directory -> Choose Directory... and choose the folder you want. Perhaps a folder on a memory stick.
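You can do the same thing from the console if you prefer typing to clicking. A minimal sketch: tempdir() is just a stand-in folder here, so swap in the path to your own folder.

```r
# Where is R currently looking?
getwd()

# Point R at a different folder (tempdir() is a placeholder for your own path).
# setwd() hands back the previous folder, so we keep it for later.
old_dir <- setwd(tempdir())

# List the files R can now see without needing a full path
list.files()

# Go back to where you started
setwd(old_dir)
```

If read.csv() ever complains that it cannot find your file, getwd() and list.files() are the first two things to check.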



Now place a .csv file in the very same folder you have just set as your working directory. If you don't often work with .csv files, consult this page here which tells you what one is and how to save an existing Excel file you have as one. In short, .csv stands for comma-separated values. It's a kind of file used widely for the management and manipulation of databases.

Next, get the data from the .csv file into R. Type this in the console: tbl <- read.csv("insertnameoffilehere.csv"). Insert the name of the file exactly as it appears in the working directory (the folder you have told R to look in). The tbl bit is arbitrary, call it whatever you want. It's just a label for your data that you can now use in your code. Press enter.
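If you want to try the import on a throwaway file before touching your own data, here is a minimal sketch. The file name demo_grades.csv, the label demo and the three rows of data are all invented for illustration.

```r
# Write a tiny made-up .csv into a temporary folder
demo_file <- file.path(tempdir(), "demo_grades.csv")
writeLines(c("id,gpa", "1,3.2", "2,3.8", "3,2.9"), demo_file)

# Read it in, exactly as you would your own file
demo <- read.csv(demo_file)

head(demo)    # the first few rows
names(demo)   # just the column headers
str(demo)     # each column's type (number, text, and so on)
```

str() is worth a look after every import: if a column you expected to be numbers has come in as text, it shows up here straight away.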

Now simply type tbl and press enter. It will show you your data. Well, not all of it, but a bit of it. Look at the headers given to each column. Let's pretend that you have a column headed "gpa". To get some simple statistics for this variable type the following.

max(tbl$gpa)
min(tbl$gpa)
mean(tbl$gpa)
median(tbl$gpa)
range(tbl$gpa)

Note that the dollar is there for a reason, not because I am materialistic. So don't forget it!  And if you're British don't get all nationalistic by putting a £ in there instead. Pound sterling just won't do in this context.

You won't need to do this all the time - add that $ - but if you import a .csv and want to get straight to it, absolutely no messing about, the dollar is crucial. In many of the books on R they don't make it clear that this is the case. Mainly because they show you how to perform operations on data that comes preloaded in a package. The analysis of that kind of data does not require this piece of code.
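Here is a short sketch of what the $ is doing, plus two things that commonly trip people up. The data frame demo is made up and stands in for your imported .csv; one gpa value is deliberately missing.

```r
# Made-up data standing in for an imported .csv
demo <- data.frame(gpa = c(3.2, 3.8, NA, 2.9))

# demo$gpa means "the gpa column inside demo"
mean(demo$gpa)                  # NA, because one value is missing
mean(demo$gpa, na.rm = TRUE)    # drops the missing value first

# Other ways of reaching the same column
mean(demo[["gpa"]], na.rm = TRUE)
with(demo, mean(gpa, na.rm = TRUE))

# range() gives the minimum and maximum, not the distance between them
range(demo$gpa, na.rm = TRUE)
diff(range(demo$gpa, na.rm = TRUE))   # this is the distance

# Most of the basics in one call
summary(demo$gpa)
```

If any of the simple statistics come back as NA on your own data, a missing value in the column is the likely culprit, and na.rm = TRUE is the usual fix.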

This is all so simple. I do realize that. And you can do this so easily in the software you were using before, I know. But doing this successfully will get you off the ground, giving you a bit of confidence to get really stuck in.

But let's crank it up. Let's get multivariate. Thanks to classical test theory, if you're a psychologist you probably have a multi-item scale in there, yes? If so here's how you test a single-factor model.

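Install lavaan, then load it so that R can find its functions (installing is a one-off, loading is needed every session). The guide to it can be found here http://lavaan.ugent.be/tutorial/index.html

install.packages("lavaan")
library(lavaan)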

Find the names of the items of the scale. Let's pretend they are V1, V2, V3, V4, V5, V6, V7, V8, V9, V10. Change the code below as appropriate (change the Vs to whatever).

#create model
singleF.model <- 'scale =~ V1 + V2 + V3 + V4 + V5 + V6 + V7 + V8 + V9 + V10'
#fit model
fit <- cfa(singleF.model, data = tbl)
#summary of model
summary(fit, fit.measures=TRUE)
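Typing ten item names by hand invites typos, and a model that names a column your data doesn't have will not fit. Here is a minimal sketch, in base R only, of building the same model text from a list of names and checking the names against your data first. The names V1 to V10 and the empty stand-in data frame demo are assumptions for illustration.

```r
# Hypothetical item names; swap in your own column headers
items <- paste0("V", 1:10)

# Glue them into the model text: "scale =~ V1 + V2 + ... + V10"
singleF.model <- paste("scale =~", paste(items, collapse = " + "))
singleF.model

# A stand-in data frame with the right headers (no real data in it)
demo <- as.data.frame(matrix(0, nrow = 1, ncol = 10,
                             dimnames = list(NULL, items)))

# Which items are NOT columns in the data? An empty result means all is well.
setdiff(items, names(demo))
```

With your own data, replace demo with tbl in the last line and fix any names it reports before calling cfa().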

Q. What are the different areas of the display in RStudio and what can you do in each?

As standard, RStudio appears with four panes (but you can customize to your own specifications eventually, find out here).



The top left pane is the source. This is the location that imported scripts will appear, as well as the location in which you can create your own. The bottom left is the console. This is where you can execute code (well, you can do that in the source too, but you can't save code in the console).

As standard, the top right pane contains two things. First, "environment", which displays the data sets you have imported, as well as the objects and functions you have created in your current session. Second, "history", which displays a timeline of all the code you have executed during your session.

As standard, the bottom right pane contains:

  • "files". This is where files are displayed in case you want them. As standard, RStudio shows you the files contained in your "my documents" folder.
  • "plots". This is where plots will appear when you make them (e.g. boxplot()).
  • "packages". This is where installed packages are displayed, as well as where they can be loaded via the tick-box interface.
  • "help". This is where you can search all of the commands contained in your packages. A help request (e.g. help("read.csv")) enacted in the console will appear here.

Q. What books should I get? Are there websites or Twitter accounts that are useful?

Get the R Cookbook, ggplot2: Elegant Graphics for Data Analysis, and Statistics: An Introduction Using R for a kick-off. Don't rush through them, progress slowly for efficient learning.

There are plenty of really useful websites, including www.r-bloggers.com and their associated twitter feed @Rbloggers, which you can go to directly by clicking here. Click here for a long list of useful sites.


Q. I can't get my analyses to run today, even though I did them yesterday. What's going on?

Check to see if the packages are loaded. Check also if the packages are installed. If you have a buggy computer like mine, packages will sometimes just disappear. Reinstall the package if it fails to appear on the packages list multiple times.
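You can also ask R directly rather than hunting through the packages tab. A minimal sketch: the helper names is_installed and is_loaded are my own, not part of R, and "lavaan" is just the example package.

```r
# Is the package installed anywhere R can see?
is_installed <- function(pkg) pkg %in% rownames(installed.packages())

# Is it loaded in this session? Loaded packages are listed as "package:name".
is_loaded <- function(pkg) paste0("package:", pkg) %in% search()

is_installed("lavaan")
is_loaded("lavaan")

# stats comes with R and is loaded by default in a normal session
is_loaded("stats")
```

Installed but not loaded is the usual answer to "it worked yesterday": packages stay installed between sessions, but they need loading again with library() every time you restart R.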

Check that the package appears, and is ticked, in the list on the packages tab.



Q. I've installed a package, I think. It told me it successfully installed, but it won't let me use it. What do I do?

Restart R. If that doesn't work, make sure all packages that it depends on for functioning are installed too. It will probably tell you if they aren't.

Consult your package library to make sure it is there. Your package library is the folder where all the packages are stored (it can be on your hard drive or somewhere else). When you load the package by clicking the tick box in the packages tab this will come up on the console: (for example) library("stats", lib.loc="C:/Program Files/R/R-3.3.2/library"). The lib.loc is where the package is stored. Check that place to make sure it's there.
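R can tell you where its libraries are, which saves digging around in Program Files. A short sketch using "stats" as the example, since it ships with R:

```r
# Every folder R searches when you ask for a package
.libPaths()

# The folder a particular installed package lives in
# (this gives an error if the package is not installed)
find.package("stats")

# The version you have
packageVersion("stats")
```

If find.package() gives an error for the package you are chasing, it really isn't installed where R is looking, whatever the installer told you.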

Q. How do I remember all the code that I've used?

Write scripts instead of coding in the console. Save scripts for a particular analysis in the same place as you take the data files from, so you can easily find them again.
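A script is nothing more than a text file of R code, usually saved with a .R ending. Here is a minimal sketch that writes a tiny script to a temporary folder and then runs the whole thing in one go. The file name demo_analysis.R and its contents are invented for illustration; normally you would write the script in the source pane and save it yourself.

```r
# Create a tiny made-up script file
script_file <- file.path(tempdir(), "demo_analysis.R")
writeLines(c(
  "# demo_analysis.R: descriptive statistics for some made-up scores",
  "scores <- c(3.2, 3.8, 2.9)",
  "avg <- mean(scores)"
), script_file)

# Run every line of the script, top to bottom
source(script_file)

# Anything the script created is now available
avg
```

The lines starting with # are comments: R ignores them, but you in a couple of months will be grateful for them.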

Follow these great rules when you code.




So there you have it, a few tips that may make your transition to R a little easier. My old German teacher used to say that learning a new language was not difficult, it was just different. I think the same is true of learning how to use a new statistical interface. And much like learning a language, immersion is the best way to learn. So immerse yourself in R and don't give up when the going gets a little confusing.
