Earlier this month I attended and spoke at News:Rewired, a popular digital journalism conference in London, U.K. The journalists there were top-drawer: from Reuters, the BBC, the Guardian, the Telegraph, and others. My talk, on how data curation will be key in driving digital journalism forward, appeared to resonate with quite a few people, which was great.
However, more often than not, attendees came up to me, thanked me for the talk and then prefaced their clear enthusiasm for data journalism with an almost bashful admission that they lack the data exploration, analysis and visualization skills to actually do it.
This is no surprise. Journalism has historically always been a narrative craft and largely still is. But this experience did make me think that perhaps it might be helpful to make step-by-step tutorial posts showing how to probe and visualize data.
(Are you a data geek with a tip or tool to show off? Ping me about guest-posting or contributing to a BD tutorial)
This first data-tutorial post — a very basic one for data newbies — will begin exploring some Canadian government open data recently published on BuzzData: the City of Toronto’s water billing data over the last 11 years. A city journalist’s spidey-sense should tell them right away there’s a budding story to be had in this data, namely:
Which wards are the most water-efficient and water-wasteful in Toronto?
Let’s follow this 5-step process to find out.
(Want to follow the evolution of this data and stay updated on results as I go? Follow my dataset on BuzzData)
Step 1: Get the data
First you have to get your hands on the data you want. This particular dataset is easy to get: just clone the data from the original publisher here: www.buzzdata.com/opento, and then download the xls file to your desktop. In this video I show you how to clone the data and make it private so you can build your project around it without others seeing what you’re working on:
Step 2: Pick a question to answer
In future posts we’ll get more sophisticated with our exploration and visualization. For this first exercise we’re going to pick a very specific, simple question:
“Which wards had the highest and lowest average water consumption last year?”
If you open up the dataset you downloaded, you’ll see that it actually splits water billing accounts into two types: residential and commercial. Let’s stick with residential. (Feel free to repeat this exercise on your own to find out which wards had the highest and lowest average commercial water consumption, and then see if there’s a correlation between the two types …)
Step 3: Pick your visualization method (and use K.I.S.S. — Keep It Simple, Stupid)
I’m going to make a bar chart in Excel. I know, it sounds boring, but here’s why:
To answer my question, I’m going to visualize discrete data (Toronto wards), and only compare one kind of value (the wards’ average residential water consumption). Any other kind of graph would probably be less clear in the long run, because the extra bells and whistles of the method would just add noise to the image.
However, if I wanted to highlight water consumption trends over multiple years, a line graph or time series chart would likely work best.
If I wanted to know which wards were close to each other, a heat map using GIS data would be great. We’ll get to those in the future.
As a rule: pick the method that would best highlight the answer to your question!*
*You may not know which one works best without a little trial & error first.
Step 4: Format your data
Now it’s time to look at the data:
That’s a lot of data. Graphing this entire spreadsheet would be pointless, in fact it would probably be harder to understand than the spreadsheet itself. You have to think about what information pertains to your question. I want to know which ward was most efficient and wasteful last year, respectively. So I need the following data:
Average residential water consumption for each of 44 wards in the year 2010
Everything else — commercial accounts, # of accounts, total consumption, etc. — would be noise on the page. So how do we get just this data? There are lots of ways, but in this instance we’ll make a Pivot Table:
Now we have a nice Pivot Table, but we’ll want to do just a little more formatting and organization before we make our graph. In this video clip I show how to prep your table for graphing, as well as how to sort your data to get an idea of what your findings will be:
Step 5: Graph it & and get your answer!
Before going on, let’s recap what I’m trying to find out here. My original question was:
“Which wards had the highest and lowest average residential water consumption last year?”
By sorting the data earlier, we already know our answer. Now we just want to visualize it. Because we already sorted and formatted our data, graphing is now a piece of cake with Excel’s chart wizard. Here’s how you do it:
And that’s how you make a nice clean bar chart (and start to explore data with a journalistic frame of mind). What other trends can you find in this dataset?
One more note: visualizing data one way to answer a question often prompts new questions! In this case, I can’t help but wonder whether city wards with similar water consumption levels cluster together geographically. To answer this, we’ll need to map the data, so stay tuned for the next tutorial post to learn how!
NEXT UP: Visualizing Toronto’s water consumption with GIS (geographic information systems) data. In other words, shapefiles and mapping. Woohoo!
Enjoyed this but know you could do better? Great! Ping me about guest-posting or contributing to a BD tutorial post!
Want to follow the evolution of the data as we go? Follow my dataset on BuzzData!