Teaching Open Data

December 17, 2015

Mike Smit is a professor in the School of Information Management at Dalhousie University in Halifax, Canada. His research and teaching explore data management for open and big data, data literacy, the effect of open information on civic engagement, and the interaction of information and emerging technology (including cloud computing and the Internet of Things).

One of the great things about open data is that it's not usually released with a specific purpose in mind. We can't predict what uses people will find for raw data; the power of open data is in unanticipated uses which move beyond the interests, scope, or capabilities of governments.

I use open data in my teaching. As a professor in Dalhousie's School of Information Management, I teach courses to students in our Master of Library and Information Studies (MLIS) and mid-career Master of Information Management degrees. Working with data is an important part of both degrees, and the effective visualization of data is one key learning outcome.

I've been asking students to visit open.canada.ca, find a dataset, and use an effective data visualization to tell me something interesting. What does "something interesting" mean? Fortunately for students, I find lots of things interesting. What I want is to learn something I didn't know, and that isn't obvious just by looking at the data.

Every semester, I have been astounded at the creativity of students who scour Canada's Open Data portal in search of a dataset that captures their interest. For students from Canada, it's an opportunity to better understand their home country; for students from outside the country, it's a chance to learn a bit more about their host country.

Thinking more broadly about the objectives of the assignment, it is worth reflecting on what we expect a modern workforce, and a modern pool of graduates, to know about working with data. The increasing interest in open data, combined with the problem of big data and the power of data science and data analytics, suggests the world is growing more and more data rich. But raw data is of limited use; we unlock the potential of data when we can analyze it, visualize it, create information and knowledge from it, and ultimately inform evidence-based decision making.

I'm part of a team of researchers at Dalhousie that was recently awarded Social Sciences and Humanities Research Council (SSHRC) funding for an initial look at the question "How can post-secondary institutions in Canada best equip graduates with the knowledge, understanding, and skills required for the data-rich knowledge economy?" What levels of what we call "data literacy" will we want Canadians to have as we consider the future of open data and open government?

These are big questions; for now, I'll say that an open data visualization assignment is a good start. I've linked to a copy of my assignment which anyone is welcome to use and adapt. Below, I've embedded some of the data visualizations MLIS students produced so you can see how students rose to the challenge of distilling complex datasets into easily absorbed messages.

Figure [01]: Seasonal sea ice thickness averages

Figure [01] - Text version

Graph of month-by-month sea ice average thicknesses from 1947 to 2002. Darker lines indicate thicker ice, and the graph shows that, while sea ice thickness varies throughout the year, the overall trend is thinner ice over the time scale.

Emily Colford (MLIS 2015) used data from the Canadian Ice Thickness Program, which measured sea ice thickness from 1947 to 2002. Because she used lighter-colored lines for more recent data, you can clearly see the decline in thickness over time (though each individual line is difficult to identify, the focus is the overall trend).

Figure [02]: Comparison of mortgage rates and average rent in Halifax

Figure [02] - Text version

Graph of mortgage rates expressed in percentage and monthly apartment rental prices for the City of Halifax from 1987 to 2012. The graph shows a trend of increase in rental prices over the time scale for all dwelling types (three bedrooms, two bedrooms, one bedroom, and bachelor) and a declining mortgage rate.

Carlisle Kent (Master of Resource and Environmental Management/MLIS 2016) used CMHC data to compare mortgage rates with average rent in the city of Halifax over the past 25 years; this shows how attractive buying property is relative to renting it, and how that balance has changed over time.

(Sources: Conventional mortgage lending rate, 5-year term and Average rents for areas with a population of 10,000 and over)

Figure [03]: Internet use by age and income

Figure [03] - Text version

A graph showing what percentage of people have regular internet access, divided into highest quartile income families and lowest quartile income families and by age.

Regular internet access by age and income
Age group                            Lowest quartile income family   Highest quartile income family
Individuals aged 16 to 24 years      94.65%                          99.1%
Individuals aged 25 to 44 years      88.1%                           98.35%
Individuals aged 45 to 64 years      61.7%                           92.65%
Individuals aged 65 years and over   26.75%                          68.6%

Harrison Enman showed us that the digital divide (a separation between people with regular Internet access and those without) is at its greatest among senior citizens whose family incomes are in the lowest quartile (the bottom 25%).

(Source: Canadian Internet use survey, Internet use, by location and frequency of use)

N.B. Ordinarily one would not connect these categories with a line graph, but the visual effect that results excuses this faux pas.
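
The digital-divide reading of Figure [03] can be checked directly against the text-version table. The sketch below is only an illustration, not part of the student's work: the variable and function names are my own, and the numbers are copied from the table above. It computes the gap in percentage points between the highest and lowest income quartiles for each age group, and finds the group with the lowest access in the lowest quartile.

```python
# Figures copied from the "Text version" table for Figure [03].
# Each entry maps an age group to (lowest-quartile %, highest-quartile %).
# Names here are illustrative, not taken from the original assignment.
access = {
    "16 to 24 years": (94.65, 99.1),
    "25 to 44 years": (88.1, 98.35),
    "45 to 64 years": (61.7, 92.65),
    "65 years and over": (26.75, 68.6),
}


def income_gaps(table):
    """Return the highest-minus-lowest quartile gap (percentage points) per age group."""
    return {age: round(high - low, 2) for age, (low, high) in table.items()}


gaps = income_gaps(access)

# Age group where the two income quartiles are furthest apart.
widest_gap_group = max(gaps, key=gaps.get)

# Age group with the lowest access among lowest-quartile families.
lowest_access_group = min(access, key=lambda age: access[age][0])

for age, gap in gaps.items():
    print(f"{age}: {gap} percentage points")
print("Widest income gap:", widest_gap_group)
print("Lowest access in the lowest quartile:", lowest_access_group)
```

Both measures point to the same group, which is consistent with the observation above about seniors in the lowest income quartile.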

Figure [04]: Marital status for incarcerated men vs. Canadian men over 18

Figure [04] - Text version

A bar chart of the marital status, expressed as percentages, of free versus incarcerated Canadian men over 18, divided into Single, Common law, Married, Previously partnered, and Unknown. Proportionately, more men in prison are single or common law than in the general population, fewer men in prison are married, and approximately the same have been previously partnered.

Keriann Dowling (MLIS 2014) pointed out, tongue in cheek, that proportionally far more men in prison are single than in the general population.

(Sources: Offender Profile 2013-2014 and Estimates of population, by marital status, age group and sex for July 1, Canada, provinces and territories)

Figure [05]: Alcohol expenditures compared to unemployment rates

Figure [05] - Text version

A bar chart comparing average annual alcohol expenditures against unemployment rate percentages in Canada.

Average annual alcohol expenditure   Unemployment rate percentage
$622                                 8.5%
$672                                 8.7%
$677                                 8.5%
$712                                 8.2%
$721                                 7.7%
$806                                 7.3%
$837                                 6.8%

Finally, in another light-hearted analysis, Andrea Kampen (MLIS 2015) wondered if there might be a relationship between the amount of money Canadians spend on alcohol and unemployment levels; at a glance, it certainly appears that when more Canadians have jobs, we spend more money on alcohol. I will leave the reader to form their own conclusions for this one!

(Sources: Survey of household spending (SHS), household spending on tobacco and alcoholic beverages, by province and territory)
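
For readers who want to put a number on the "at a glance" impression from Figure [05], here is a minimal sketch that computes a Pearson correlation coefficient from the seven pairs in the text-version table. The function and variable names are my own and hypothetical, not part of the student's analysis. With only seven points and no control for other factors, the result describes how these two columns move together and nothing more; it says nothing about cause.

```python
from math import sqrt

# Paired values copied from the "Text version" table for Figure [05].
alcohol_spending = [622, 672, 677, 712, 721, 806, 837]    # dollars per year
unemployment_rate = [8.5, 8.7, 8.5, 8.2, 7.7, 7.3, 6.8]   # percent


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


r = pearson_r(alcohol_spending, unemployment_rate)
print(f"Pearson r = {r:.2f}")
```

A value close to -1 means spending tends to rise as unemployment falls across these seven observations, which matches what the chart suggests visually.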

I commend and support this direction in LIS programs. We should be teaching students how to act as responsible data interpreters -- not just for "innovation" purposes, but so they can help their communities begin to use such information to hold parties to account and to address community needs.

I would prefer it if we could do this responsibly, by ensuring that basic numeracy lessons, and a fundamental understanding of statistics and research methods are not divorced from the process. Such understanding is necessary so that visualisations are not just "fun" but also reasonable interpretations of reality.

Based on the examples in this post, I have some concerns. Most of the above examples appear to be "lying with statistics" -- implying ready associations between semi-arbitrary variables, overlooking confounding variables or methodological data constraints, graphically representing different indices/scales as though they're comparable units of measurement, representing categorical variables as though they're continuous variables, etc. Not to mention a lack of labeling and some missing source citations. Essentially, the students' work (as presented) appears to be a showcase of everything we're afraid of when it comes to sharing data.

Maybe this can be addressed by discussing all these issues after the students' "first pass" and then having them re-visit the assignment. But the aim as a whole requires more than a data visualisation course if we hope to produce competent data stewards, facilitators, and users. I hope that we are moving in that direction.

Naomi, thanks for your comments - the short response is you do not need to be concerned. A blog post is a very small window into a large curriculum, and while I chose to focus on the "have fun with data" portion of one assignment, being critical consumers and users of data is a core part of my course and the broader curriculum. For example, we talk about spurious correlations, and how data exploration can show correlations and aid in the development of theories, but cannot account for various moderating/confounding variables alone, and many other aspects of being critical thinkers. We talk about how the data we have is the tip of an iceberg, and all the ways in which data can deviate from reality (just like 5 images and a blog post can give the wrong idea about the depth of a curriculum!).

Even in the context of data visualization, we talk about different audiences: are you trying to make a point, or inform generally? Are you exploring, or communicating?

I am glad you are interested and concerned, though! You may be interested in reading our report on data literacy, http://hdl.handle.net/10222/64578. I would welcome your input.

In case anyone else is concerned: this post is not intended to suggest any kind of conclusions about the data. This is about people having fun playing with open data, learning something basic from that data, and more importantly learning about working with and manipulating data in numeric and visual form.