Our Best City in the World Contest has come to an end! We’re so excited to start compiling all the submissions and seeing what people came up with. (Backgrounder: in this contest, the Economist Intelligence Unit challenged the world to devise and visualize new ways to rank cities and measure urban liveability.)
You can check out — and in the case of interactive submissions, play with — many of the contest entries, which are now public on our Best City Contest Topic page, where all the submissions are gathered in one place.
Some screenshots of the most recent submissions (picked for no specific reason):
Hello and Happy Holidays from everyone at BuzzData!
The new year is fast approaching, but we couldn’t help but push a few more new features to the site before 2011 officially comes to an end. Without further ado, here are the latest and greatest improvements we’d like you to know about:
Dataset and user profile badges to add to your websites
Now sharing data with your social circle is as easy as a single mouse click. Want your social media followers and friends to be able to download cool data you find on BuzzData? Get the data into the Twittersphere in an instant.
Ask for help with Tasks
This is an early-stage community feature we're quite excited about. Now whenever you create a new dataset on BuzzData, you can list tasks you'd like help with from other users, right in the dataset's Overview tab.
So what? Well, if your dataset is public, it will show up on our new global "Tasks" page (linked just to the left of the BuzzData search box), along with the tasks you need help with.
The Tasks page is where users can browse unfinished data projects from around the world and pitch in, helping the data community work together toward its goals.
To illustrate, when creating a dataset:
The global Tasks page is an early-stage feature that will continue to evolve in 2012. We hope you make use of it often and let us know how we can make it better by emailing us at firstname.lastname@example.org.
Now accepting direct image uploads and video URLs
Have you made a visualization of your dataset on your desktop and want to upload an image or video of it in action? No problem, you can now upload image files directly to the site and view them in our new visualization viewer.
In addition, BuzzData's visualization viewer now lets you post and stream videos from popular content providers such as YouTube, Vimeo and Brightcove, so you can showcase videos of interactive visualizations and other media on BuzzData as well.
Alright, that's it for now! We hope you have a lovely, stress-free winter break and a fun-filled New Year's Eve!
This year it’s easier than ever to put your open-data efforts in the spotlight for all to see. BuzzData makes publishing and coordinating your hackathon projects a snap.
Here are 6 great reasons to use BuzzData to organize your HackFest 2011 projects:
1) BuzzData has an easy-to-use interface that non-coders can contribute to.
This is important for making HackFest as inclusive as possible, and engaging journalists, politicians and the general public. It’s all well and good to encourage everyone to participate, but if your Mom doesn’t know how to use ScraperWiki or read Python, chances are you’re going to have a hard time getting her interested in your open-data project unless we lower the tech barrier a bit.
2) You can upload just about any file type.
BuzzData lets you upload just about any file, as long as it's under the file size limit. So bring your .pdfs, .docs and GIS data on board — it's a one-stop drag & drop, easy enough for anyone to use.
3) You can version datasets as they improve.
What better way to show the evolution of a data source from unusable .pdf to a nice clean dataset than by uploading the original source file and adding subsequent versions on top (no need for extra file names or cluttered FTP servers)?
This is also a great way to get a poorly formatted data source online and solicit help from people to clean it up. Just upload it to BD and share it with your community!
4) You can clone datasets and build on other people's work.
If you see some data on BuzzData you'd like to work on yourself, clone it and build your own Hack Day project with it. By cloning the data, you preserve its connection to the original source, helping others keep track of how the data is evolving and where it originally came from.
You can also link unrelated datasets to each other so that others can keep track of which datasets you’re using for a single project.
5) Developers can store and update data through our beta API.
For developers who want to pull and update data quickly, BuzzData also has a beta API on Github, which you can use as a no-muss, no-fuss data-storage target for your open-data apps. (BuzzData users can access their API keys under their Profile Settings.)
6) This year, BuzzData has a real-time updating public page for HackFest 2011!
We've created a HackFest 2011 (ODHD) Topic page on BuzzData! Now when you publish ODHD-related datasets you plan to work on during the event, you can tag them with the "HackFest 2011" topic and they will be added to this page, which you and your friends can follow to stay updated on the day's developments.
Are you excited yet? We are! Get moving and get your HackFest 2011 data & projects on BuzzData!
Have you joined BuzzData and then gotten a bit stuck because, well, you have no datasets to work on yet? Don’t fret — we heard you.
We don't want users who are intrigued by (but new to) data to be held back, so we took it upon ourselves to track down, collect and clean up BuzzData-curated public data for you to play with and explore. BuzzTopics will be a treasure trove especially for info-viz enthusiasts who want good, clean data to hone their evolving viz-tool skills on.
We will be building out our discovery and search capabilities further down the line (we know search is a bit of a thorn in people's sides right now; don't worry, we'll get to it). In the meantime, you can get the full list of BuzzTopics available by searching for "Buzz" in the search box at top right:
“We made sure that we were going to all the right places, to publishers that had high-quality data,” says BuzzData business analyst Anthony Ilukwe, who published nearly a thousand datasets with the help of three others in a fairly short time span.
Below is the full list of BuzzTopics so far — this is just a start and it will continue to grow. And if you have requests for data, ping us at email@example.com and we’ll track it down. Have fun!
It was hard to keep the datasets limited to just three this time around, with so many users releasing some very intriguing data in the last week. We managed to whittle the list down to a few key datasets recently published by some news outlets and journalists.
Without further ado, here are the rules and data for our second monthly data-storytelling contest:
THE GOAL: To tell the story behind the data through your own BuzzData project
THE RULES: You can include as many data sources in your project as you like, but you must use at least one of the following, and you must list every data source you use in your final submission.
Published by: Chad Skelton, Vancouver Sun investigative reporter specializing in FOI requests and data journalism.
Description: “Data obtained from the Correctional Service of Canada by The Vancouver Sun through the Access to Information Act, detailing all seizures of contraband and unauthorized items in B.C. federal prisons between January 2008 and October 2010.”
Description: "The Globe and Mail has set out to measure the state of philanthropy, and part of that has involved examining data filed by Canadian registered charities in their tax returns. The Globe and Mail obtained complete databases for the years 1997 to 2009, and partial data from 2010 (data from about 20% of the country's charities hadn't been processed by the end of July, 2011). All of the data, maintained by the Charities Directorate of Canada Revenue Agency, is available to the public by law."
NOTE: We are currently trying to get the entire charity database online in a non-.mdb file format, so that projects aren't limited to querying specific columns or years and don't require MS Access to query all the data. If/when we do, we will update and notify participants accordingly.
Description: "The data reflect campaign disbursements by candidates for U.S. House of Representatives from Jan. 1, 2011 to Sept. 30, 2011. The data only include direct expenditures by federal campaigns, as reported to the Federal Election Commission — not transfers to other committees or repayments of loans or campaign contributions to other candidates."
THE DEADLINE: Midnight EST, Friday, December 2, 2011
HOW TO REGISTER AND SUBMIT YOUR ENTRY:
1. To register: Clone one of the datasets above directly from the publisher on BuzzData. (Make it private if you don’t want the public to see it until it’s ready.)
As early as possible, invite me (Momoko on BuzzData) as a viewer by going to Admin > Collaborators of your clone, writing in my name, selecting “viewer,” and clicking “Add”, like so:
2. Over the next month, build your project on BuzzData (posting links and viz’s as appropriate).
3. When it’s done: note in the Overview which visualization/article/attachment is the final product(s), then, before deadline, invite the original data publisher to check it out. And don’t forget to make it public if you want to show it to the world!
4. Tweet and link to your project elsewhere if you want to build interest in it (optional, but always a good idea)
If you’ve never used BuzzData before, here’s a quick video that shows how to start, build and submit your data project:
UPDATE: the awesome peeps at toronto.ca/open (who are also on BuzzData) just let me know that population data (and all kinds of other cool data) is available at map.toronto.ca/wellbeing. You don’t need to code and can mix, match and export all kinds of different indicators (screenshot above).
But if you still want to learn to scrape, which is actually kind of fun, keep reading!
WEB SCRAPING: WELCOME TO GEEK TERRITORY!
Since there are only 44 wards, one option is to manually copy and paste the ward names and population sizes into your own dataset on Excel. It wouldn’t take you too long, but it would be kind of annoying.
Another option is to "scrape" the data, meaning writing a script that copies the parts of the page text you want and organizes them into your own data file. Because of the simple structure of Toronto's websites, this is a perfect opportunity to learn how to do it. Let's get started.
FIRST THINGS FIRST: PROGRAM INSTALLS
First, you need to make sure you have the right programming language installed on your computer so that when you "run" the script we're going to write, your computer can read it. We're going to write our script in a popular language called Python. To install it, follow the instructions for your computer's particular OS (i.e., your version of Windows, Mac or Linux) here.
If you’re not used to installing programs on your computer (especially Windows computers), you can run into the occasional snag and get stressed out. Don’t. Google is your friend. Most of the time you can find a web page where whatever issue you’re dealing with has been discussed and resolved. For example, here’s a good step-by-step guide on installing Python for different OSes.
Next, you need to install a program called Beautiful Soup; it parses (i.e., reads) webpages for you. Find the installation directions here, and make sure to save the "BeautifulSoup.py" file in the same directory where you plan to save your script file. (Again, if you run into problems, Google is your friend! Don't give up.)
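Once BeautifulSoup.py is sitting in that directory, a quick sanity check is to parse a throwaway bit of HTML. This is just a sketch, assuming the classic single-file BeautifulSoup 3 module (the one this tutorial uses) running under Python 2:

from BeautifulSoup import BeautifulSoup
# parse a tiny snippet of HTML and pull out the paragraph text
soup = BeautifulSoup("<html><body><p>hello</p></body></html>")
print(soup.find('p').string)   # should print: hello

If that prints "hello" without any errors, your install is good to go.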
Last, you need a good text editor for coding, if you don't already have one. A sweet free one is Sublime Text 2 (still in beta). It comes with pretty colours. Install that, too!
Phew. That’s a lot of installs. Sorry about that. Now the fun starts.
Open Sublime Text 2 and save the open document as “ward_population.py,” and make sure it’s in the same folder as “BeautifulSoup.py.”
Cool! Looks like every ward profile URL is structured the exact same way! The only thing that changes in it is its ward number. This is going to be very useful for when we write our script.
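To make that concrete, here's a tiny sketch of how you could generate all 44 ward-profile URLs in Python. The URL pattern below is a placeholder, not the real address — swap in the actual ward-profile URL you found, with %d where the ward number goes:

# Placeholder pattern: replace with the real ward-profile URL, using %d where the ward number goes
base_url = "http://www.toronto.ca/wards/ward_profile.htm?ward=%d"
urls = [base_url % ward for ward in range(1, 45)]   # wards 1 through 44
print(urls[0])   # the URL for Ward 1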
Find out where the population for Ward 2 is listed. Is it structured the same as Ward 1? Yes. This is also very important.
Now we have to start writing out our script. Go back to the Sublime Text document you saved called “ward_population.py.” Open it up.
TIME TO CODE!
The first thing we have to do is import the right libraries with the right methods into our script so that your computer knows what to do when it comes across specific terms in your code. Sound complicated? It’s not. All you do is write the following:
from BeautifulSoup import BeautifulSoup
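Depending on how you fetch and save the pages, you'll probably want a couple of standard-library imports alongside that one. My assumption here is that the script downloads each page with urllib2 and writes the results out with the csv module:

import urllib2   # to download each ward-profile page
import csv       # to write the scraped ward populations to a file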
Not too hard. Okay, now to get to the fun part. If you follow these next clips, I'll walk you through a translation of each line of code (as best I can), one line at a time, and show you how it relates to the webpage you want to scrape. (Apologies if I'm less than eloquent; I'm new to this and was surprised at how hard it is to explain code out loud!) And make sure to switch these videos to full-screen, otherwise you'll struggle to see what I'm talking about.
Let's go through the first few steps:
Following so far? Awesome. Now we need to take a look at the actual source code of the ward profile pages we’re going to scrape. Here we go:
Make sense? I hope so. Let’s take it back to the script and see how exactly to code the extraction of the snippet we want.
Phew! We’re close now. Now we use a nifty little Python method called ‘split’ to break up the sentence and pick out only the words we want.
(Note: this particular video clip was made using an older version of the script, so there are a few comments included that don’t apply. Disregard lines 17, 18 and 30!)
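To see what 'split' does on its own, here's a standalone example. The sentence below is made up — it just stands in for whatever line of text your script pulls out of the ward page:

# a made-up stand-in for the line of text extracted from the page
text = "Total Population: 53,000 (2006)"
parts = text.split(":")            # ['Total Population', ' 53,000 (2006)']
population = parts[1].split()[0]   # split on spaces, keep the first word: '53,000'
population = population.replace(",", "")   # '53000', ready to treat as a number
print(population)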
Finally we come to the really fun part: running the script and publishing our dataset!
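Since the full script only appears in the videos, here's a rough sketch of how the pieces fit together. The URL pattern and the tag we search for are placeholders (my assumptions), so you'll need to swap in the real ones from the ward pages:

import csv
import urllib2
from BeautifulSoup import BeautifulSoup

# open a CSV file to hold the results (Python 2 convention: binary mode for csv)
writer = csv.writer(open("ward_population.csv", "wb"))
writer.writerow(["Ward", "Population"])

for ward in range(1, 45):   # wards 1 through 44
    # placeholder URL pattern -- substitute the real ward-profile address
    url = "http://www.toronto.ca/wards/ward_profile.htm?ward=%d" % ward
    page = urllib2.urlopen(url).read()
    soup = BeautifulSoup(page)
    # placeholder selector -- find the tag that holds the population line on the real page
    tag = soup.find('p', {'class': 'population'})
    text = ''.join(tag.findAll(text=True))
    population = text.split(":")[1].split()[0].replace(",", "")
    writer.writerow([ward, population])

Save it next to BeautifulSoup.py, run it from the command line with "python ward_population.py", and you'll have a ward_population.csv ready to upload to BuzzData.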
Okay, wait, so why did we want this data again? Oh yeah! To visualize average water consumption per person per ward. Well, I think you can handle that on your own at this point, don’t you? Try graphing a bar chart in Excel of water consumption per person per ward in 2006. Which wards stand out?
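The suggestion above uses Excel, but if you'd rather stay in Python, here's a minimal matplotlib sketch. The file name and column names are assumptions — it presumes you've already joined the 2006 water-consumption figures with the ward populations you just scraped into one CSV:

import csv
import matplotlib.pyplot as plt

wards, per_person = [], []
# hypothetical merged file with Ward, Consumption and Population columns
for row in csv.DictReader(open("ward_water_2006.csv")):
    wards.append(row["Ward"])
    per_person.append(float(row["Consumption"]) / float(row["Population"]))

plt.bar(range(len(wards)), per_person)
plt.xticks(range(len(wards)), wards, rotation=90)
plt.ylabel("Water consumption per person, 2006")
plt.title("Toronto water consumption per person, by ward")
plt.tight_layout()
plt.show()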
NEXT UP: we’ll do some GIS mapping (sorry I didn’t get to it this week; the opportunity to demonstrate basic scraping as part of an existing project was one I didn’t want to skip.)
Want to learn more? Here are some helpful references to follow up with:
(I started learning how to scrape using Python by reading this book, and followed Yau’s general approach — with permission — while coding the above script. I can’t say enough good things about VT if you’re new to coding and want to make visualizations. Fantastic book.)
Earlier this month I attended and spoke at News:Rewired, a popular digital journalism conference in London, U.K. The journalists there were top-drawer: from Reuters, the BBC, the Guardian, the Telegraph, and others. My talk, on how data curation will be key in driving digital journalism forward, appeared to resonate with quite a few people, which was great.
However, more often than not, attendees came up to me, thanked me for the talk and then prefaced their clear enthusiasm for data journalism with an almost bashful admission that they lack the data exploration, analysis and visualization skills to actually do it.
This is no surprise. Journalism has always been a narrative craft, and largely still is. But this experience made me think it might be helpful to write step-by-step tutorial posts showing how to probe and visualize data.
(Are you a data geek with a tip or tool to show off? Ping me about guest-posting or contributing to a BD tutorial)
This first data-tutorial post — a very basic one for data newbies — will begin exploring some Canadian government open data recently published on BuzzData: the City of Toronto’s water billing data over the last 11 years. A city journalist’s spidey-sense should tell them right away there’s a budding story to be had in this data, namely:
Which wards are the most water-efficient and water-wasteful in Toronto?
Step 1: Get the data
First you have to get your hands on the data you want. This particular dataset is easy to get: just clone the data from the original publisher here: www.buzzdata.com/opento, and then download the xls file to your desktop. In this video I show you how to clone the data and make it private so you can build your project around it without others seeing what you're working on:
Step 2: Pick a question to answer
In future posts we’ll get more sophisticated with our exploration and visualization. For this first exercise we’re going to pick a very specific, simple question:
“Which wards had the highest and lowest average water consumption last year?”
If you open up the dataset you downloaded, you’ll see that it actually splits water billing accounts into two types: residential and commercial. Let’s stick with residential. (Feel free to repeat this exercise on your own to find out which wards had the highest and lowest average commercial water consumption, and then see if there’s a correlation between the two types …)
Step 3: Pick your visualization method (and use K.I.S.S. — Keep It Simple, Stupid)
I’m going to make a bar chart in Excel. I know, it sounds boring, but here’s why:
To answer my question, I’m going to visualize discrete data (Toronto wards), and only compare one kind of value (the wards’ average residential water consumption). Any other kind of graph would probably be less clear in the long run, because the extra bells and whistles of the method would just add noise to the image.
However, if I wanted to highlight water consumption trends over multiple years, a line graph or time series chart would likely work best.
If I wanted to know which wards were close to each other, a heat map using GIS data would be great. We’ll get to those in the future.
As a rule: pick the method that would best highlight the answer to your question!*
*You may not know which one works best without a little trial & error first.
Step 4: Format your data
Now it’s time to look at the data:
That's a lot of data. Graphing this entire spreadsheet would be pointless; in fact, it would probably be harder to understand than the spreadsheet itself. You have to think about what information pertains to your question. I want to know which wards were the most and least water-efficient last year. So I need the following data:
Average residential water consumption for each of the 44 wards in 2010
Everything else — commercial accounts, # of accounts, total consumption, etc. — would be noise on the page. So how do we get just this data? There are lots of ways, but in this instance we’ll make a Pivot Table:
Now we have a nice Pivot Table, but we’ll want to do just a little more formatting and organization before we make our graph. In this video clip I show how to prep your table for graphing, as well as how to sort your data to get an idea of what your findings will be:
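If you'd rather build the same pivot in code instead of Excel, a few lines of Python with pandas will do it. The column names below are my guesses at what's in the spreadsheet, so adjust them to match the actual headers:

import pandas as pd

# the .xls file downloaded from BuzzData
df = pd.read_excel("toronto_water_billing.xls")

# keep only residential accounts (assumed column name and value)
residential = df[df["Account Type"] == "Residential"]

# average residential consumption per ward, one column per year
pivot = residential.pivot_table(index="Ward", columns="Year",
                                values="Average Consumption", aggfunc="mean")

# sort the 2010 column to see the lowest and highest wards at a glance
print(pivot[2010].sort_values())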
Step 5: Graph it and get your answer!
Before going on, let’s recap what I’m trying to find out here. My original question was:
“Which wards had the highest and lowest average residential water consumption last year?”
By sorting the data earlier, we already know our answer. Now we just want to visualize it. Because we already sorted and formatted our data, graphing is now a piece of cake with Excel’s chart wizard. Here’s how you do it:
And that’s how you make a nice clean bar chart (and start to explore data with a journalistic frame of mind). What other trends can you find in this dataset?
One more note: visualizing data one way to answer a question often prompts new questions! In this case, I can’t help but wonder whether city wards with similar water consumption levels cluster together geographically. To answer this, we’ll need to map the data, so stay tuned for the next tutorial post to learn how!
NEXT UP: Visualizing Toronto’s water consumption with GIS (geographic information systems) data. In other words, shapefiles and mapping. Woohoo!
Enjoyed this but know you could do better? Great! Ping me about guest-posting or contributing to a BD tutorial post!
Great news: the first iteration of BuzzData’s API is officially up and running!
While it’s still very young (and its capabilities limited), we do like to release features as fast as humanly possible. It’s just more fun that way.
If you’re a developer who’s been thinking about using BuzzData in creative ways, and possibly collaborating with us in the future, trying out our API early on is a great opportunity to help shape its development right from the start.
To access the API, you will need a key, so email us a request at firstname.lastname@example.org and we’ll set you up right away. The API documentation and access to the Ruby client library are also publicly available on Github, so feel free to check them out.
We've attracted the interest of such a unique, diverse community over the last few months that we can't wait to hear more suggestions from you!
Buzzworthy Act: Simply put, BuzzData user James McKinney took some data I had and made it better. A lot better.
Now, revising a dataset doesn't sound as sexy as, say, publishing a data visualization or coding an app. However, at this early stage of building a visible group-workflow culture for data, the implications of thoughtfully revising data might actually be more significant over the long term than making it visual or entertaining.
Above is the original version of a dataset I put up a few weeks ago called “Open Data Hubs Worldwide”: a simple reference list of regions and URLs for journalists and hackers looking for open data; nothing special.
McKinney, a long-time open-data hacker, immediately followed my dataset and scrutinized it carefully. “I was just getting started on the platform, and since I had a list of Canadian open data cities, I downloaded the dataset and checked it against mine,” he recalls. He saw right away that my dataset was rife with organizational flaws that made working with it quite painful.
“The dataset didn’t have a ‘Country’ column, so to find those belonging to Canada, in my case, I had to go laboriously through all 100 rows. The column header ‘Region/Institution’ meant I couldn’t count on a keyword like ‘Canada’ or a code like ‘CA’ appearing in each case,” he said. (Lesson learned: sorting alphabetically alone rarely makes any sense for data.) “I mentioned this in a comment on BuzzData and reported my findings.”
I realized that McKinney could do a lot to improve my data, so I invited him to collaborate and make changes himself.
“I started cleaning up, reformatting, and adding to the dataset,” McKinney says. “I created two new columns for country and subdivision, using standard ISO 3166 codes to make it easier for people to match other datasets to it. I also labelled data hubs as government-sanctioned or not, as many data users prefer primary source material.”
Now, with more than 30 followers, six clones and contributions from various users, this machine-readable dataset has become a legitimate resource unto itself, rather than data hidden behind a map, or a visualization, or a white paper. If you Google “open data hub,” this comprehensive dataset is now the top search result.
Perhaps what I appreciated most during this collaboration with McKinney was the fact that I automatically learned better “data etiquette” from it. So often when we work with data, we organize it to suit our own local, immediate needs. We rarely think about how to set it up so that others can use it in the future.
McKinney recalls making the same type of list last year, but the experience was different: “I had done something similar for the Open Government Data Working Group of the Open Knowledge Foundation a year ago, but we were using Google Spreadsheets at the time,” he says. “Google Spreadsheets’ strength is real-time document collaboration, but it’s weak on other social aspects. For example, if you close the spreadsheet, you lose your chat history.
“Although BuzzData is only in beta, already the conversation sidebar acts as a useful backchannel between collaborators and the overview page is a convenient opportunity to introduce editors to formatting guidelines and design decisions.”
While McKinney recognizes BuzzData’s potential, he also has his own suggestions on how to steer its development to foster a more cohesive data community: “In order to build a strong and active data authoring community, BuzzData should focus on pushing these collaborative and conversational aspects,” he says.
“One quick-win would be to allow editors to write update messages when uploading new versions, like commit messages in version control systems such as Git.
"Another obvious, but challenging, feature would be to display changes between versions. I believe these two features (being able to see and read about changes) are necessary steps to broader and deeper participation in data sharing and authoring."