This is What Data Cleaning Looks Like | Data Collective: Blog
heyo, I actually found the original blog post for which they compiled this data ... adding to Articles tab now.
Hmm, I can't find a description of primary or secondary fraction in the article. Also, can't find the data. Unless the data is the stuff being sold in the Infochimps marketplace for $1? Namely: http://www.infochimps.com/datasets/candidate-wordbags ?
Hey.. sorry about that.. it's been a while since I worked with this data :-) Let me try and dig back through what I did and explain a bit more about what exactly the data means...
Btw.. that data set at Infochimps is the *raw* wordbag data -- I've made it free now, so feel free to download.
Here's what I did..
Using the candidate list to split by party:
I then ran an analysis on the word incidence for tweets that contained either republican or democrat full names.
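For anyone trying to reproduce that step, here is a rough sketch of what a "word incidence by party" count could look like. It is not the original analysis code: the candidate names, the `word_incidence` function, and the simple whitespace tokenizing are all assumptions made up for illustration.

```python
from collections import Counter

# Hypothetical candidate list: full name -> party (placeholder names, not real data).
CANDIDATES = {"alice adams": "democrat", "bob brown": "republican"}

def word_incidence(tweets, candidates=CANDIDATES):
    """Return {party: {word: fraction of that party's tweets containing the word}}."""
    tweet_counts = Counter()  # number of tweets mentioning each party
    word_counts = {}          # party -> Counter of tweets containing each word
    for tweet in tweets:
        text = tweet.lower()
        # A tweet counts toward a party if it contains a candidate's full name.
        parties = {party for name, party in candidates.items() if name in text}
        words = set(text.split())  # count each word at most once per tweet
        for party in parties:
            tweet_counts[party] += 1
            word_counts.setdefault(party, Counter()).update(words)
    return {
        party: {word: n / tweet_counts[party] for word, n in counts.items()}
        for party, counts in word_counts.items()
    }
```

Multiplying a fraction by 100 gives a percentage of the kind quoted in this thread. A real run would also need proper tokenizing and stop-word handling, which this sketch skips.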
"primary fraction" and "secondary fraction" were actually "democrats" and "republicans" -- sorry, that's confusing, and I should probably just get rid of those two columns. The reason I made a separate "democrats" and "republicans" column was so that it would plot nicely on a bar graph (so.. the "democrats" column is just a negative value of the "primary" column).
The p/s is just a derived column, primary/(primary+secondary), meant to give an interesting sort of the data -- so low values of p/s mean more republican-ish terms, values closer to 1 mean more democrat terms, and a p/s of 0.5 means the term's incidence is roughly equal across the two parties.
So, in plain English.. the term "god" is in 0.185% of tweets that mention a republican name, and 0.038% of tweets that mention a democrat name.
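To make the derived columns concrete, here is a small sketch that plugs the "god" numbers into the p/s formula. The function names `p_over_s` and `bar_columns` are made up for this example, and it assumes "primary" is the democrat fraction and "secondary" is the republican fraction, as described above.

```python
def p_over_s(primary, secondary):
    """primary / (primary + secondary), or None when the term never appears."""
    total = primary + secondary
    if total == 0:
        return None
    return primary / total

def bar_columns(primary, secondary):
    """Plot-friendly pair: democrats as a negative bar, republicans as a positive bar."""
    return -primary, secondary

# "god": 0.038% of democrat-name tweets, 0.185% of republican-name tweets.
god_ps = p_over_s(primary=0.038, secondary=0.185)
print(round(god_ps, 2))
```

That works out to roughly 0.17, well below 0.5, which is why "god" sorts toward the republican end of the list.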
I know, I noticed that! I cloned it because I was going to try playing around with it later, but I haven't altered anything yet. If you like you can post your viz and article to Azad's original and I'll delete this one ...
I could do that, but anyway I'm more interested in the general implications of cloning and what that does to the stream of comments, articles and visualizations that are linked to the clone or the original. For now, I'll just be careful to comment on the original from now on.
Sure — our cloning/version control is in active development, so keep your eyes peeled for what we deploy at commercial launch :)
oooh didn't know it was cloned. I've had a link to the original file (with viz!) in my Google Docs for more than a year now and it was the first thing I threw up on BuzzData. Should have searched to see if anyone else had done it before me.
oh it was cloned _from_ me. I see. Also take a look at this amazing motion chart that was made by the original compiler of the data: https://spreadsheets.google.com/spreadsheet/ccc?key=0Aq3K-CZwPWxOdEx2SXh4VDFDYzA5UWhYczlCdWZ2UGc&hl=en_US#gid=15