The geekery surrounding Stata continues.


Today I’m teaching myself to use Stata 13’s SEM commands.  Generally in Stata I use the command syntax, because it’s faster and easier and I understand it.  The commands are also very, very simple (compared to SAS’s wtf command lines, which don’t make sense to me).

I started out learning how to do a simple mediation with the sem command and then a multilevel mediation with the gsem command.  Easy!
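If you’re curious, the command versions really are short.  Here’s a sketch of what the two models look like — the variable names are made up for illustration (x is a predictor, m the mediator, y the outcome, and id the grouping variable):

```stata
* Simple mediation: x -> m -> y, plus the direct path x -> y
sem (m <- x) (y <- m x)

* Report direct, indirect, and total effects after sem
estat teffects

* Multilevel mediation: random intercepts for id in both equations
gsem (m <- x M1[id]) (y <- m x M2[id])
```

The `M1[id]` / `M2[id]` pieces are how gsem writes a random intercept that varies across levels of id.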

But Stata also has a graphical SEM builder.  Their documentation is SO handy; it tells you how to use it step by step.  Not only did I get the same results, but I also got a nifty diagram with the coefficients on it:


Which is fucking amazing.  I wonder if I can get it to flag my significant paths?  All of the paths in this diagram are significant, so maybe they just don’t show nonsignificant ones.

(This isn’t my dissertation data, btw; Stata has freely available datasets on their website that you can use to learn the techniques.  The best thing is that Stata can automatically download these datasets, so you don’t have to poke around looking for them.  You just type “use [url here]” and it GETS it for you.)
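For example — assuming one of the example datasets from the Stata 13 manuals, which live on the Stata Press site:

```stata
* Load an example dataset straight from the Stata Press website
use http://www.stata-press.com/data/r13/auto, clear

* Or, for the small datasets bundled with Stata itself
sysuse auto, clear
```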

As for the lesson in this, I realized today that what takes a dissertation so fucking long isn’t the actual process of data analysis and writing.  That part is relatively easy.  It’s the learning.  I’m going to write a full post on this later, but the dissertation process is simply an alternative way of learning something – different from taking a class, kind of akin to taking comprehensive exams.  It’s struggling through the shit you don’t know that takes the longest amount of time.


Grrr (recoding and cleaning)


For those of you who may be humanities scholars and/or don’t deal with datasets often, “data cleaning” is really just the weird misnomer quant scientists use for prepping the data.  Sometimes we have data that needs to be coded into groups, so we use statistical software to do that; sometimes we need to clean up the way the data is already coded, which was my problem today.  Data cleaning is the most tedious and usually the longest process when preparing to analyze data; I’d wager that as far as analyses go, quant grad students spend about 2/3 of their analysis time cleaning data and about 1/3 doing actual analyses.  (Of course, this varies depending on how complex your data is.)

One of the things I hate is when I sit down to do some data work after I thought I cleaned my data and realize that there’s still a lot of cleaning and recoding to do.  It’s inevitable, of course; you always find that there’s something wrong or something you forgot to do once you begin to actually manipulate the data.  My goal today was just to explore my main variables, look at distributions and bivariate relationships (relationships between two variables, like race and age or race and scores on some measure) before moving on to constructing my model.  Of course, I found issues before I could do that.
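That kind of first pass is only a few commands in Stata.  A sketch, with hypothetical variable names:

```stata
* Distributions of the main variables
summarize score age, detail
histogram score, normal

* Bivariate relationships
pwcorr score age, sig             // correlations with p-values
tabulate race, summarize(score)   // mean of a score within each race group
```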

First, I hate when scales are scored in nonsensical ways.  Let’s say that I have a score called “Skill at Underwater Basketweaving.”  I want higher scores on that scale to reflect more skill in underwater basketweaving.  I don’t want higher scores to be LESS skill, because that’s confusing.  Unfortunately I found out that three of my scales were scored that way (WTF?  In my defense, I did not score them myself).  I found out when I went to run bivariate correlations and found weird negative correlations I didn’t expect.  I don’t know if they were intended to be that way or if they were entered wrong in the survey, but I reverse scored them so that higher scores meant higher amounts of the thing the score measures.
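Reverse scoring is a one-liner once you know the scale’s endpoints.  A sketch, assuming a 1-to-5 scale and a made-up variable name:

```stata
* Flip a 1-5 scale: (min + max) - score, so 1 becomes 5 and 5 becomes 1
generate basketweaving_rev = 6 - basketweaving

* Sanity check: the reversed version should correlate -1 with the original
pwcorr basketweaving basketweaving_rev
```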

Luckily I was using Stata (a statistical software package) which I have to say is awesome.  I just started using it – in the past I did all my data management in SPSS and then most of my analyses with SAS (both also statistical software packages), but Stata has an SEM package (the kind of analysis I am using in my dissertation) included in it now and so I was able to get my hands on it.  I imported the data into Stata – mostly clean data already – and when I found issues, it was so much faster to fix them in Stata.  Even just moving variables around was faster, although that’s likely because in Stata I am far more likely to look up the syntax to do something (and then fuck it up 4 times in a row, burning into my mind how to do it correctly*).

Then I started playing around with demographics to see if there are important differences between groups.  I don’t want there to be differences between my groups because I am doing within-person analyses for my dissertation – I want to see if people change within themselves over time, and I’m not yet really interested in the differences between groups of people.  I am especially not interested – yet – in differences based on demographics, so thankfully, on the important indicators there don’t seem to be differences in the variables I am interested in.  I forgot to look at two particularly important variables, though, so I will save those for tomorrow.
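Those group checks are quick in Stata, too (hypothetical names again):

```stata
* Compare means of a score across two groups
ttest score, by(gender)

* More than two groups: one-way ANOVA
oneway score race, tabulate

* Categorical by categorical: chi-squared test
tabulate race gender, chi2
```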

So I did get something done today.  Yay!

My blog editor has been really slow lately.  I’ll type something and it takes several seconds to show up.  It’s annoying!

*As a side note, this is the way I learn statistics and/or a new software program, and in my opinion the best way.  You sit down with a dataset, and you do stuff to it.  You fuck up multiple times, and then you learn not to do that dumb shit again.  I recall syntax so much better when I’ve typed it slightly wrong 5 times in a row before getting it right (fucking capitals, how do they work) than I do when I just copy and alter it.