Corporate giving coming with more strings attached
About this Dataroom
The Globe and Mail has set out to measure the state of philanthropy today, and part of that has involved examining data filed by Canadian registered charities in their tax returns. The Globe and Mail obtained complete databases for the years 1997 to 2009, and partial data from 2010 (data from about 20% of the country's charities hadn't been processed by the end of July, 2011). All of the data, maintained by the Charities Directorate of Canada Revenue Agency, is available to the public by law.
The datasets uploaded to BuzzData contain columns of identifying information for the registered charities that filed tax returns, plus columns of data that correspond to various lines in the T3010 return forms. The information in these columns is what we used to create some of the visualizations in our Giving dashboard. See the "attachments" area of each dataset for a column key and the full, .mdb-formatted database file.
We invite the data community to examine this data and suggest story ideas or create visualizations to help make sense of it. By suggesting story ideas or submitting visualizations, you grant The Globe and Mail the right to use your visualizations, with credit, in any media, including in print or online. Reach us in the comments section of this dataset.
SOURCE: Canada Revenue Agency
RELATED DATASETS ON BUZZDATA: 2009 2008 2007 2006 2005 2004 2003 2002 2001 2000 1999 1998 1997
The Globe and Mail - Numbers
Hello... it looks like you have identified an error in the original dataset. CRA has a process for fixing data entry errors. You can contact the Charities Directorate from this page: http://www.cra-arc.gc.ca/chrts-gvng/chrts/cntct/cntct-eng.html. We will fix our records after the change has been made by CRA.
I went back a few years and checked the top few lines. It looks like 2009 has a similar problem, but 2007 and 2008 are OK. I downloaded 2008 and the sum of the individual sources is within 0.6% of the total, which could be entry errors, I suppose.
Couple of points
a) I could contact the Charities Directorate, but as the OP I would expect you to do so first. In addition, your G&M status will be more likely to get a response.
b) Did you not use this data for your chart on proportion of funding by broad source (including governments), which does not seem to have a problem? Could something have gone wrong at your end between the receipt of the data and this upload? If you look at 2009 and 2010, the individual money columns are classified as character fields and have dollar signs. The 2008 data is straight integers.
BTW Great series in the Globe
I did an individual check on the 2008 data. I make it that there are 640 cases (out of 72,000) where the total government aid does not equal the sum of the federal, provincial and municipal constituents. A bit worrying, but small beer in the total scheme.
The 2009 and 2010 data problem is more important.
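For anyone who wants to repeat the 2008 check, here is a minimal sketch of the comparison described above. The field names (`federal`, `provincial`, `municipal`, `total_government`, `bn`) and the sample rows are hypothetical; map them to whatever the column key in the Attachments area calls the matching T3010 lines. It assumes the money columns are already plain integers, as they are in the 2008 file.

```python
def find_mismatched_totals(rows, tolerance=0):
    """Return the rows whose reported government total differs from the
    sum of the federal, provincial and municipal amounts."""
    mismatches = []
    for row in rows:
        parts = row["federal"] + row["provincial"] + row["municipal"]
        if abs(row["total_government"] - parts) > tolerance:
            mismatches.append(row)
    return mismatches


# Made-up sample rows, for illustration only (not taken from the CRA data).
sample = [
    {"bn": "A", "federal": 100, "provincial": 50, "municipal": 25, "total_government": 175},
    {"bn": "B", "federal": 100, "provincial": 0, "municipal": 0, "total_government": 0},
]

flagged = find_mismatched_totals(sample)
print([row["bn"] for row in flagged])
```

Setting `tolerance` above zero would ignore small rounding differences, if that turns out to matter.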
Hi pssguy.... I think your digging has raised an interesting point about open data and large datasets. This charities data in particular is known to be prone to inconsistencies, since CRA simply plugs values from tax forms into a database without checking for entry errors either at their end or at the charity record-keeper's level.
At the Globe we do our best to verify the information we've used in our reporting and the data that informs our infographics and other visualizations. But with 14 years' worth of massive datasets, it's not realistic for us to check every value. Similarly, we're not equipped to follow up and change the original datasets as errors are identified.
So my question becomes: when a user posts datasets to BuzzData, what other responsibilities is that user assuming toward making sure that data is up to date?
Mason
The general point is very important. I think the original poster should endeavour to ensure that the data is as clean as possible and state in the overview any known shortcomings.
People wanting to use the data can then decide if it is not worth pursuing or, hopefully, mention it in their product if they decide to proceed regardless.
To cover personal cases, I scraped some data on sport salaries. There were some obvious errors (e.g. double counting) that I could cater for but, for NHL data, I went back to the source for a correction. This was not forthcoming, so I have not published the flawed data.
Another one was a dataset on degree applications in the UK. Here, I combined several years' data into one file. It entailed a lot of work making sure that the correct rows for each year matched up. Over the period in question, several colleges amalgamated and many more changed names. I updated the initial dataset, making clear that I felt confident about the vast majority of the information and asking people for corrections if they found errors.
With the charities data, what I have tried to make clear in my questioning is that it is not individual data errors that appear to be the problem, which I would not expect you to spend time on. As I recall, the entries for total government funding do not tally with the individual level of government figures throughout the 2009 and 2010 results. As I also mentioned, the 2009 and 2010 data fields also differ from previous years, e.g. what would be an integer field of 123456 in another year is shown as a character field of $1,234.56 in the most recent two years. What baffles me is that the G&M produced a graph including 2009 (and maybe 2010) data which cannot be gleaned from the info you uploaded. At least that's how I saw it a week ago.
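One way to put the 2009 and 2010 character fields on the same footing as the earlier integer columns is to strip the formatting before doing any sums. This is only a sketch: `parse_money` is a hypothetical helper, and it assumes the only extra characters are dollar signs, thousands separators and surrounding spaces. Other conventions, such as negatives written in brackets, would need extra handling.

```python
from decimal import Decimal, InvalidOperation


def parse_money(value):
    """Convert a value such as '$1,234.56' or 123456 to a Decimal.

    Returns None for blanks and for anything that still is not a number
    once the dollar sign and commas are removed.
    """
    if value is None:
        return None
    text = str(value).strip().replace("$", "").replace(",", "")
    if text == "":
        return None
    try:
        return Decimal(text)
    except InvalidOperation:
        return None


for raw in ["$1,234.56", 123456, "", "$14886"]:
    print(repr(raw), "->", parse_money(raw))
```

Returning None for blanks, rather than zero, keeps a missing value distinguishable from a reported zero when totals are compared later.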
Just noticed that row 33 has no account name ...? Filed a lot of revenue, though.
In a couple of cases the Account Name is different from the Legal Name in the database. In this case there was nothing in the account name, but the charity's legal name is "Villa Youville Incorporated".
With such a large group of datasets I was forced to choose columns in order to present the data as efficiently as possible. But if you have any other questions about what is there and can't download or open the full dataset from the Attachments tab, please let me know. I'll answer your questions as quickly as possible.
Sweet :) I'm actually going to clone this right now and hack on it a bit privately. Will update ...
I just summed up the columns available in this dataset but am not sure if it reflects total revenue for the charity ... thoughts? If there's a billion other columns I need to include to accurately reflect this I'll just grab the full dataset in the Attachments and work from there.
You're getting familiar with the challenges I faced when analyzing this data ;-)...
I didn't leave out any revenue columns with this dataset. I literally included every column we used from CRA to create our visualization dashboard (I haven't uploaded StatsCan data because it's copyright-protected). However... because operating charities receive grants from foundations, it's important to avoid double-counting donations when calculating a sum total of revenue across charities.
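To illustrate the double-counting point with made-up numbers: if a foundation passes money to an operating charity, that money appears in both organizations' revenue. A rough sketch of one way to net it out follows. The field names `total_revenue` and `received_from_other_charities` are hypothetical stand-ins for whichever T3010 columns hold those amounts, and this is not necessarily how the Giving dashboard was calculated.

```python
def sector_revenue_without_double_counting(charities):
    """Sum revenue across charities, leaving out money that each charity
    received from other registered charities, because that money is
    already counted in the giving charity's own revenue."""
    total = 0
    for charity in charities:
        total += charity["total_revenue"] - charity["received_from_other_charities"]
    return total


# Made-up figures: the foundation raises 1000 and grants 400 of it onward.
sample = [
    {"name": "Foundation", "total_revenue": 1000, "received_from_other_charities": 0},
    {"name": "Operating charity", "total_revenue": 700, "received_from_other_charities": 400},
]

naive = sum(c["total_revenue"] for c in sample)
net = sector_revenue_without_double_counting(sample)
print(naive, net)
```

Here the naive sum counts the 400 grant twice, once for each organization, while the netted figure counts it once.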
I note that the "data" tab has selected columns, then the "attachments" tab has a more comprehensive file. Is it possible to get all 13 years in a single .7z/.zip file? Thanks!
Hi Duncan... here's the message I get when I try to zip all 14 years: "The compression cannot be performed because the size of the resulting Compressed (zipped) Folder is too large." I confess to knowing nothing about .7z files, but I'll look into it. Regards...
7zip, I love it -- sometimes it makes my archives 10% of what WinZip does.
Are the various years physically separate at the source? Or did you split the database by year to make it small enough to upload? I would love to get my hands on a single database. I can help you share it via BitTorrent if it's too large to upload to BuzzData.
I will attempt to message you privately to share my email address.
I'm having a bit of trouble with the 2010 data (I haven't checked the others). Category 4570 should be all government revenue, but if you look at the data, say row 1, it is blank, whilst 4540, which is the federal input, has some data: $14886.