Grrr (recoding and cleaning)


For those of you who may be humanities scholars and/or don’t deal with datasets often, “data cleaning” is really just the weird misnomer quant scientists use for prepping the data.  Sometimes we have data that needs to be coded into groups, so we use statistical software to do that; sometimes we need to clean up the way the data is already coded, which is what my problem was today.  Data cleaning is the most tedious and usually the longest part of preparing to analyze data; I’d wager that quant grad students spend about 2/3 of their analysis time cleaning data and about 1/3 doing actual analyses.  (Of course, this varies depending on how complex your data is.)

One of the things I hate is when I sit down to do some data work after I thought I’d cleaned my data and realize that there’s still a lot of cleaning and recoding to do.  It’s inevitable, of course; you always find that there’s something wrong or something you forgot to do once you begin to actually manipulate the data.  My goal today was just to explore my main variables – look at distributions and bivariate relationships (relationships between two variables, like race and age or race and scores on some measure) – before moving on to constructing my model.  Of course, I found issues before I could do that.

First, I hate when scales are scored in nonsensical ways.  Let’s say that I have a score called “Skill at Underwater Basketweaving.”  I want higher scores on that scale to reflect more skill in underwater basketweaving.  I don’t want higher scores to mean LESS skill, because that’s confusing.  Unfortunately I found out that three of my scales were scored that way (WTF?  In my defense, I did not score them myself).  I found out when I went to run bivariate correlations and got weird negative correlations I didn’t expect.  I don’t know if they were intended to be that way or if they were entered wrong in the survey, but I reverse-scored them so that higher scores mean more of whatever the scale measures.
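Reverse scoring is just a one-line transform: for a scale running from min to max, the flipped score is (min + max) − old. A sketch in Python (the author did this in Stata; the 1–5 response range here is an assumption for illustration):

```python
# Hypothetical 1-5 Likert items where a high raw value meant LESS skill
raw = [1, 2, 3, 4, 5]

SCALE_MIN, SCALE_MAX = 1, 5  # assumed response range

# (min + max) - old flips the scale so that higher = more skill
reversed_scores = [(SCALE_MIN + SCALE_MAX) - x for x in raw]
print(reversed_scores)  # a raw 5 becomes 1, a raw 1 becomes 5
```

In Stata the same transform for a 1–5 item would be something like `gen skill_rev = 6 - skill`.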

Luckily I was using Stata (a statistical software package), which I have to say is awesome.  I just started using it – in the past I did all my data management in SPSS and then most of my analyses with SAS (both also statistical software packages), but Stata now includes an SEM package (the kind of analysis I am using in my dissertation), so I was able to get my hands on it.  I imported the data into Stata – mostly clean data already – and when I found issues, it was so much faster to fix them in Stata.  Even just moving variables around was faster, although that’s probably because in Stata I am far more likely to look up the syntax to do something (and then fuck it up 4 times in a row, burning into my mind how to do it correctly*).

Then I started playing around with demographics to see if there are important differences between groups.  I don’t want there to be differences between my groups because I am doing within-person analyses for my dissertation – I want to see if people change within themselves over time, and I’m not yet really interested in the differences between groups of people.  I am especially not interested – yet – in differences based on demographics, so thankfully, on the important indicators, there don’t seem to be differences in the variables I am interested in.  I forgot to look at two particularly important variables, though, so I will save those for tomorrow.
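A first pass at “are my groups different?” can be as crude as comparing group means (a proper check would use a t-test or ANOVA, which the author would run in Stata; this plain-Python sketch with made-up numbers just eyeballs the gap):

```python
import statistics

# Hypothetical scores split by a demographic grouping variable
group_a = [12.1, 13.4, 11.8, 12.9, 13.0]
group_b = [12.4, 12.8, 13.1, 12.2, 13.3]

# Difference in group means; a value near zero (relative to the
# spread of the data) suggests no big between-group difference
diff = statistics.mean(group_a) - statistics.mean(group_b)
print(round(diff, 2))
```

If the gap were large relative to the variability, you’d follow up with a formal test before trusting the within-person model.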

So I did get something done today.  Yay! has been really slow lately.  I’ll type something and it takes several seconds to show up.  It’s annoying!

*As a side note, this is the way I learn statistics and/or a new software program, and in my opinion it’s the best way.  You sit down with a dataset, and you do stuff to it.  You fuck up multiple times, and then you learn not to do that dumb shit again.  I recall syntax so much better when I’ve typed it slightly wrong 5 times in a row before getting it right (fucking capitals, how do they work) than I do when I just copy and alter it.