Automating the collection of literature – or, keeping up to date with the MOOC literature

Spoiler: We’ve been toying with automating the collection of literature on MOOCs (and other topics). Interested? Read further.

Researchers use different ways to keep updated with the literature on a topic. On a daily basis, for example, I use Table of Contents (TOC) alerts, RSS feeds, and Google Scholar alerts. Many colleagues have sought to keep track of literature on a topic and share it. For example, danah boyd maintained this list of papers on Twitter and microblogging; Tony Bates shared a copy of the MOOC literature he collected on his blog; Katy Jordan also kept a collection of MOOC literature.


A Google Scholar Alert

The problem with maintaining an updated list of relevant literature on a topic is that it quickly becomes a daunting and time-consuming task, especially for popular topics (like MOOCs or social media or teacher training).

In an attempt to automate the collection and sharing of literature, my research team and I created a Python script that goes through the Google Scholar alert emails that I receive (see above), parses the content of the emails, and places it in an HTML page on my server, from where others can access it. The script runs daily and any new literature is added to the page.
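To give a sense of what the parsing step might look like, here is a minimal sketch. It is not our actual script. It assumes that each result in an alert email is a link carrying the class `gse_alrt_title` (an assumption about the email markup), and the names `Entry`, `AlertParser`, `parse_alert`, and `render_page` are hypothetical. Fetching the emails from a mailbox is left out.

```python
from dataclasses import dataclass
from html import escape
from html.parser import HTMLParser


@dataclass
class Entry:
    title: str
    link: str


class AlertParser(HTMLParser):
    """Collect title/link pairs from anchors that carry the assumed title class."""

    TITLE_CLASS = "gse_alrt_title"  # assumption about the alert email markup

    def __init__(self):
        super().__init__()
        self.entries = []
        self._href = None  # href of the title link currently open, if any
        self._text = []    # text fragments seen inside that link

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "a" and self.TITLE_CLASS in classes:
            self._href = attrs.get("href") or ""
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Join the fragments and collapse runs of whitespace.
            title = " ".join("".join(self._text).split())
            self.entries.append(Entry(title, self._href))
            self._href = None


def parse_alert(html_body):
    """Return the entries found in the HTML body of one alert email."""
    parser = AlertParser()
    parser.feed(html_body)
    parser.close()
    return parser.entries


def render_page(entries):
    """Render the entries as a bare-bones HTML list."""
    items = "\n".join(
        f'<li><a href="{escape(e.link, quote=True)}">{escape(e.title)}</a></li>'
        for e in entries
    )
    return f"<html><body><ul>\n{items}\n</ul></body></html>"
```

A daily job could call `parse_alert` on each new email body, append the results to a stored list, and rewrite the page with `render_page`. The same title/link records could just as easily be written to a spreadsheet instead of an HTML page.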

We aren’t there just yet, but here is the output for the MOOC literature going back to November 2012. All 400 pages. I placed it in a Google Document because the HTML file is 2.5 MB (and it’s easier for people to just download it in a format that they prefer).

In theory this is supposed to work quite well, but there are a couple of problems with it:

  1. The output is only as good as the input. Google Scholar (and its associated alerts) is a black box – meaning there’s no transparency about what is and isn’t indexed.
  2. It’s automated – which means it’s not clean and some “MOOC literature” may not really be MOOC literature because Google Scholar alerts work on keywords in the body of papers/text rather than keywords describing the papers/text.
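One rough way to reduce the noise from the second problem would be to keep only the entries whose titles mention the topic. The sketch below illustrates that idea; it is not a description of our script, and the function name `title_mentions_topic` and the default keyword list are hypothetical. The trade-off is that it would also drop relevant papers whose titles never use the keyword.

```python
import re


def title_mentions_topic(title, keywords=("mooc", "massive open online course")):
    """Return True if the title contains any keyword as a whole word (optional plural 's')."""
    text = title.lower()
    return any(
        re.search(rf"\b{re.escape(keyword)}s?\b", text) is not None
        for keyword in keywords
    )


titles = [
    "MOOCs and the future of higher education",
    "Teacher training in rural schools",
]
# Keep only the titles that mention the topic.
kept = [t for t in titles if title_mentions_topic(t)]
```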

We plan to make the source code available and describe the installation process so that others can use it for their own literature needs. My question is: How can the output be more helpful to you? Is there anything else that we can do to improve this?



1 Comment

  1. Hi George,

    Keeping up with the MOOC literature is indeed a daunting task! I’ve not updated my page for a while whilst focusing on finishing my PhD data collection and analysis. I do intend to update it at some point, but there is going to be a *lot* of catching up to do!

    My MOOC literature browser also runs from a Google spreadsheet; the interface uses Simile Exhibit. Here is a link to the Google spreadsheet – the key thing to get it to work with Exhibit is the {}’s around the column headers:

    I’m a fan of Exhibit because as soon as you add new data to the spreadsheet it is live in the webpage, and it’s a simple way of being able to search and filter data.

    Anyway, thought I would share the ‘backend’ – I don’t know how difficult it would be to make a script to divide up the records into title / link / abstract and feed them into a spreadsheet – but if it could be done, combining automated collection with feeding into Exhibit could be very useful.
