The Digital Public Library of America. Highlights from Robert Darnton’s recent talk

January 24, 2012

I was fortunate to be among those attending Robert Darnton’s talk on the Digital Public Library of America initiative last week. Harvard Professor and Director of Harvard Library, Darnton is a pivotal figure behind DPLA and his talk – most concurred – was both provocative and inspirational. Rather than simply describing the DPLA initiative, Darnton framed his talk with key issues and questions for us to reflect upon. How can we provide a context where as much knowledge as possible is freely available to all? Where we can leverage the internet to change the emerging patterns of locked down and monopolised chains of supply and demand? And as Professor David Baker highlighted in his introduction of Darnton, there is much alignment here with the broader and more aspirational ethos of Discovery: a striving to support new marketplaces, new patterns of demand, new business models – all in the ideal pursuit of the Public Good. Arguably naïve aspirations, but certainly the tenor in the room was one of consensus, a collective pleasure at being both challenged and inspired. Like Discovery, the DPLA is a vision, a movement, tackling these grand challenges, but also striving to make practical inroads along the way.

The remainder of this post attempts to capture Darnton’s key points, and also highlight some of the interesting themes emerging in the Q&A session that followed.


“He who receives an idea from me, receives instruction himself without lessening mine; as he who lights his taper at mine, receives light without darkening me.” – Thomas Jefferson


To frame his talk, Darnton invoked this oft-cited tenet of Thomas Jefferson – that the spread of knowledge benefits all. He aptly applied this idea to the internet, and specifically to the principles of Open Access for the Public Good and the assumption that one citizen’s benefit does not diminish another’s. But of course, he cautioned, this does not mean information is free, and we face a challenging time where, even as more knowledge is being produced, a shrinking proportion of it is being made openly available to the public. To illustrate this, he pointed to how the cost of academic journals has risen at four times the rate of inflation, and these prices are expected to continue to rise even as universities and libraries face increasing cutbacks. We need to ask, how can that increase in price be sustained? Health care may be a Public Good, but information about health is monopolised by those who will push it as far as the market will bear.

Darnton acknowledged that publishers will reply by deprecating the naiveté of the Jeffersonian framing of the issue. And, he conceded, journal suppliers clearly add value; it’s fair they should benefit – but how much? Publishers often invoke the concept of a ‘marketplace of ideas’. But in a free marketplace, the best will survive. For Darnton, we are not currently operating in a free marketplace, as demand is simply not flexible – publishers create niche journals, territorialise, and then crush the competition.

The questions remain, then, how can we provide a context where more knowledge is as much as possible freely available to all? Where we can leverage the internet to change these locked down and monopolised chains of supply and demand?  The remainder of Darnton’s talk outlined the approaches being taken by the DPLA initiative. It’s early days, he acknowledged, but significant inroads are already being made.

So what is DPLA? A brief overview

Darnton addressed (relatively briefly) the scope and content of DPLA, the costs, the legal issues being tackled, technical approaches, and governance.

Scope and content: Like Discovery, the DPLA is not to be One Big Database – instead, the approach is to establish a distributed system aggregating collections from many institutions. Their vision is to provide one-click access to many different resource types, with the initial focus on producing a resource that gives full text access to books in the public domain, e.g. from Hathi Trust, the Internet Archive, and U.S. and international research libraries. Darnton also carefully highlighted that the DPLA vision is being openly and deliberately defined in a manner that makes the service distinct from those services offered by public libraries, for instance excluding everything from the last 5-10 years (with a moving wall annually as more content enters the public domain).

The key tactic to maximise impact and reduce costs will be to aggregate collections that already exist, and so when it opens, it will likely only contain a stock of public domain items, and will grow as fast as funding commits. To achieve this, it will be designed in a way that as much as possible makes it interoperable with other Digital Libraries (for example, an agreement has already been made with Europeana). So far funding has been dedicated to building this technical architecture, but there is also a strong concentration on ongoing digitisation and collaboratively funding such initiatives.

In terms of legal issues, Darnton anticipates that DPLA will be butting heads against copyright legislation – he clearly has strong personal views in this area (e.g. referring to the Google Books project as a ‘great idea gone wrong’ with Google’s failure to pursue making the content available under Fair Use) but he was careful to distinguish these views from any DPLA policy in this regard. As DPLA will be not-for-profit, he suggested that they might stand a good chance of invoking the Fair Use defence in the case of orphan works, for example, though he also acknowledged this is difficult and new territory. Other open models referenced included the case of a Scandinavian style licence for public digital access to all books. He also stated that he sees the potential for private partnerships in developing value-added monetised services such as apps – while keeping the basic open access principles of the DPLA untouched.

The technical work of DPLA is still very much in progress, with a launch date of April 2013 for a technical prototype along with 6 winning ideas from a beta sprint competition. More information will be released soon.

In terms of governance, a committee has been convened and has only just started to study options for running DPLA.

Some questions from UK stakeholders

The Q&A session was kicked off by Martin Hall, VC of Salford University, who commented that in many ways there is much to be hopeful for in the UK in terms of the Open agenda. Open Access is going strongly in the UK with 120 open access repositories; and, he stated, a government that seems to ‘get it’ largely because of a fascination with forces in the open market. As a result there is a clause in new policy about making available ‘openly’ public datasets.  This is quite an extraordinary statement, Hall commented, given the implications for public health, etc. and this is possibly indicating a step change. But it all perhaps contributes to the quiet revolution occurring around Open Access.

Darnton responded by highlighting that in the USA they may have open access repositories, but that there is a low compliance rate in terms of depositing (and of course this is an issue in the UK too). But Harvard has recently mandated the deposit; and while there was less than 4% before, there is now over 50% compliance, and the repository “is bulging with new knowledge.”

In addition, Darnton reminded the group, while the government might be behind ‘Open,’ we still face opposition from the private sector. A lot of vested interests feel threatened by open access; and there is always a danger of vested interest groups capturing the attention of the government. It’s good, he said, to see hard business reasons being argued as well as cultural ones, but we need to be very careful.

Building on this issue, Eric Thomas, Vice Chancellor of Bristol University, raised the issue of demonstrating the public value – how do we achieve this? He noted that the focus of Darnton’s talk was on the supply side, but what about demand? To what extent are DPLA looking at ways to demonstrate public value, i.e. ‘this is what is happening now that couldn’t happen before…’?

In his response, Darnton referred to a number of grassroots approaches that are addressing this ‘demand’ side of the equation, including a roving Winnebago ‘roadshow’ to get communities participating in curating and digitising local artefacts. In short, DPLA is not about a website, but an organic, living entity… This approach, he later commented, was about encouraging participation from the top down and bottom up.

Alistair Dunning from JISC posed the question of what will ‘stop people from going to Google?’ Darnton was keen to point out that while he critiqued Google’s approach to the million books copyright situation, DPLA was in no way about ‘competing’ with Google. People must and will use Google, and DPLA will open their metadata and indexes to ensure they are discoverable by search engines. DPLA would highly value a collaborative partnership with Google.

Peter Burnhill from EDINA raised the critical question of licensing. Making data ‘open’ through APIs can allow people to do ‘unimaginable things’; what will the licensing provision for DPLA be? CC-0?  Darnton acknowledged that this was still a matter of debate in terms of policy decisions – and especially around content. He agreed that there were unthought of possibilities in terms of Apps using DPLA, and they want to add value by taking this approach (and presumably consider sustainability options moving forward).  In short, the content would be open access, and metadata likely openly licensed, but in terms of reuse of the content itself, this *could* be commercialised in order to sustain the DPLA.

In a later comment, Caroline Brazier from the British Library expressed admiration for the vision and the energy and the drive. She explained that from the BL perspective ‘we’re there for anybody who wants to do research’. She highlighted how the British Library and the community more broadly has a huge amount to do to push on with advocacy, particularly around copyright issues. This forces all institutions of all sizes to rethink their roles in this environment – there are no barriers here, she suggested: we can do things differently. We need to think individually about what we do uniquely. What do we do? What do we invest in? What do we stop doing? Funding will be precious, and we really need to maximise the possibility to get funding.

Darnton agreed, and stated that there is a role for any library that has something unique to make it available (and of course, the British Library is the pinnacle of this). The U.S. has many independent research libraries (the Huntington, Newberry, etc.) and they very much want to make room for them in the DPLA; they want to reach out to these research libraries who may be open minded but are behind closed doors in terms of the broader public.

The final question (and perhaps one of the most thought-provoking) came from Graham Taylor from the Publishers Association. He stated that he concurred with much of what Darnton had to say (perhaps surprising, he suggested, given his role) but he did comment that throughout the afternoon he had “not heard anything good about publishers.” So, he asked, where do publishers fit? In many regards, publishers are the risk-takers, the ones who work to protect intellectual property, and get all works out there – including those that pose ‘risk’ because they are not guaranteed blockbusters.

Darnton strongly agreed that publishers do add value, but, he explained, what he’s attacking is excessive, monopolistic commercial practices to such an extent that they are damaging the world of knowledge. He was struck by Taylor’s comment on risk-taking, though, for indeed publishing is a very risky business. But sometimes the way risk is dealt with is unfortunate, with that emphasis on the blockbuster as opposed to a quality, sound backlist. So what can be done about this risk-taking and sharing the burden? Later this year, he said, Harvard would be hosting a conference that explores business opportunities in publishing in open access. If publishers are gatekeepers of quality, how can open access be used to the benefit of publishing, and so alleviate that risk-taking and raise quality?

Introducing the Technical Principles for the Discovery Ecosystem

November 7, 2011

We’re pleased to announce that our colleagues at UKOLN, and specifically Paul Walk, have taken the comments provided by the Discovery Advisory groups and revised the draft Technical Principles for the Discovery Ecosystem.

In short, the Principles for the Discovery Ecosystem are:

1. Discovery is heterogeneous
2. Discovery is resource-oriented
3. Discovery is distributed
4. Discovery relies on persistent global identifiers
5. Discovery is built on aggregations of metadata
6. Discovery works well with global search engines
7. Discovery data is explicitly licensed for (re)use

Please check out the Technical Foundations post for more detail on these guiding principles and what we mean by them. We certainly invite comments or questions.

Here Paul also comments on the practical challenge of balancing consensus with agility, and at the same time staying open to innovation.  Indeed, principles are all very well, well, ‘in principle’… but our next swathe of activity will be tackling just these challenges as we work to cascade and embed these Principles (along with the Discovery Open Metadata Principles) within the real-world contexts of library, archive, and museum content discovery. To have any impact, the Principles must be supplemented by supporting materials and guidelines — from technical ‘how to’ guides, to a range of case studies representing different types of institutions and starting points when it comes to opening up metadata for online discovery.

We’ll also need to engage people face-to-face as much as possible, and a number of training events will be planned for the Spring and Summer of next year. Tomorrow I’ll be getting together with folks here at Mimas (Jane Stevenson, Diana Massam and Lisa Jeskins) along with David Kay and Owen Stephens to kick off that area of work in earnest: refining our approach to the case studies, determining our approach to training (and key learning objectives), and, on a fundamental level, discussing and agreeing exactly who we will be targeting for our training workshops given the broad spectrum of stakeholders we could potentially engage. I’ll look forward to posting on the progress of this area of work, and inviting feedback on our approach.

In the meantime, comments or questions concerning the Technical Principles are welcome over here

Discovery. Where Next?

October 11, 2011

As of July 31 we were officially ‘done’ with the first phase of the RDTF Management Framework activity. The last six (plus) months have been very productive, if furiously busy. We’ve been working with UKOLN to establish technical principles (soon to be publicised) for Discovery, and at the same time pushing our Open Data advocacy agenda through the launch of the Discovery Open Metadata Principles and supporting guidance materials. We’ve been working with the community to identify opportunities and example activities to help us understand the potential business benefits of Open Data, with the RDTF funded projects and Developer’s Competition helping illustrate in practical terms the potential of newly opened data sets. We also established Discovery itself as the community initiative to help push through the realisation of the RDTF Vision.

There is a great deal more to do. And there’s certainly room for improvement. It is clear that we need to work harder to clarify what Discovery is (and specifically, what it means to different stakeholder groups) and what we’re attempting to achieve in practical terms.

After some reflection, we’ve established a renewed set of aims and targets, and as a result gained a renewed sense of focus. Between now and December 2012:

Discovery will….

    1. Clearly position and define the benefits of Discovery to research and education at the local and national level
    2. Improve the discoverability of UK library, archives and museum content
    3. Drive a shift in ethos to ‘open’ in institutions, services and funding bodies
    4. Improve the quality and sustainability of new and existing resource discovery infrastructure
    5. Be understood, endorsed and promoted by key stakeholders within the library, archives, and museums sector and beyond

Our targets for moving forward are to

    1. Progress the embedding of the technical, licensing and metadata principles
    2. Drive innovation and sustainable, benefits-led reuse of LAM open metadata
    3. Identify and establish core efficiencies in dataflow and aggregation that can be achieved by key shared UK bibliographic data services
    4. Establish open licenses for JISC library and archives service metadata and where possible facilitate this for other key UK library, archive, and museum aggregations
    5. Develop demonstration exemplars of what is possible, strengthening the business case for open data, as well as identifying issues for sustainability
    6. Open up and make discoverable important but hidden collections
    7. Demonstrate and support approaches to handling inaccessible metadata and cases where no metadata exists
    8. Persuade funding bodies and vendors to support the key Discovery principles
    9. Engage with related initiatives to ensure that the approaches recommended in Discovery are compatible with relevant work occurring elsewhere.
    10. Work with related JISC initiatives to explore how they can be integrated into the Discovery framework

Our key tactics to achieve this will include:

– Providing case studies, principles, guidelines and training to help embed learning and build confidence, addressing technical and licensing concerns
– Resourcing developers to drive innovation and overcome technical barriers
– Entering dialogue with aggregators to identify efficiencies in dataflow and aggregation that can be achieved by key shared UK bibliographic data services
– Securing buy-in to the Discovery Open Metadata Principles and supporting implementation, with particular emphasis on making available hidden or strategically important collections under open licences
– Collaborating with JISC library and archives services and other key aggregations to identify opportunities for establishing open licences
– Establishing dialogue with commercial providers
– Engaging with related initiatives internationally as well as in the UK to ensure joined up thinking and solutions

One of our key commitments as we move forward is to regularly update this blog with information about these areas of work as they progress, the lessons learned as well as progress made.  We’ll also use this blog as an opportunity to reflect on and engage with related initiatives in the UK and overseas.

As always, we very much welcome your comment.

Developers’ entries help us explore new possibilities in discovery

September 15, 2011

It really was a tough call to pinpoint a clear winner for the #discodev competition. After we gave people a bit more time, using some of the August lull to work on applications, we ended up with a really good array of entries, demonstrating a wide range of possibilities. A key judging criterion (obviously) concerns the usability of the application. But judging aside, I am personally less concerned with how usable a rapidly developed application is – and some of these applications have worked very effectively with complex and often dense datasets – than with how much they get me thinking about potential use cases and benefits.

To a large degree, the Discovery programme is about identifying the potential, and where appropriate finding ways to build on someone’s seed of an idea. Applications such as Yogesh Patel’s experiment with Archives Hub linked data might only scratch at the surface of the dataset but they still prompt us to think about some of the great potential that exists. Along with What’s About, it hints at the potential of combining historic and contemporary geospatial data to provide new routes through to content; to explore the world of ‘exploration’ spatially as opposed to through the linear and hierarchical structure of the archival description. I think the archival community especially is hungry for examples to help us get past some of our entrenched thinking about what discovery interfaces look like. Along with initiatives such as HistoryPin and OCLC’s MapFast, these applications give us something tangible to react to and explore ideas around discovering library, archival, or museum data geospatially.

We’re also learning more about the potential for Linked Data. The entry from Mathieu D’Aquin, Discobro, complements the research and development activity of the JISC-funded LOCAH project perfectly in this regard. These are projects that enable the archival community to see how EAD rendered as linked data can become more embedded within the wider web of data; and instantly (it seems to me) we’re forced beyond the finding aid and document-centric mindset, and thinking about our descriptions as data that needs to be interlinkable to be found and used. It is remarkable how well Discobro works. My own search for the Stanley Kubrick archives in the Archives Hub using the bookmarklet immediately provided multiple links out to DBpedia entries on Kubrick’s life, cinematography, and films. All this is not achieved through a manual mashing of data, but an automatic ‘meshing’ that can scale (which is perhaps one of the most heady promises of Linked Data).

Will Linked Data be The Way Forward? The jury’s still out, but applications such as Discobro and others help us understand in much more tangible terms what benefits might be delivered.

And some applications demonstrated benefits that we can work on delivering much more immediately. For me the stand out here is the Open URL Router Recommender developed by Dimitrios Sferopoulos and Sheila Fraser at EDINA. My brain’s whirring with the possibility of how we can include this functionality in article search services at the local or national level (for example, embedding it into the newly designed Zetoc which will be launched later this year). The use case for recommender functions is already proven, although we have more to learn about such functions in academic and teaching contexts, but what EDINA have demonstrated is what you can achieve through the network effect – gathering data centrally. Patterns and relationships between articles emerge that are not readily available through other means. It’s simple, and the data’s already there waiting to be exploited. As a result we can provide routes through to discovery based on communities of use, disciplinary context, and not descriptive metadata alone.
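To make the ‘network effect’ idea a little more concrete, here is a minimal sketch of a co-occurrence recommender. It is not EDINA’s implementation: the log format, the session and article identifiers, and the function names are all assumptions made up for illustration. The idea is simply that if requests from many institutions are gathered in one place, articles that are requested together in the same session can be counted and offered as ‘people who looked at this also looked at…’ suggestions.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_cooccurrence(requests):
    """Count how often two articles are requested in the same session.

    `requests` is an iterable of (session_id, article_id) pairs; this
    log shape is a hypothetical stand-in for centrally gathered data.
    """
    by_session = defaultdict(set)
    for session_id, article_id in requests:
        by_session[session_id].add(article_id)

    cooccurrence = defaultdict(Counter)
    for articles in by_session.values():
        # Every pair of distinct articles in one session counts once.
        for a, b in combinations(sorted(articles), 2):
            cooccurrence[a][b] += 1
            cooccurrence[b][a] += 1
    return cooccurrence


def recommend(cooccurrence, article_id, n=3):
    """Return up to n articles most often seen alongside article_id."""
    neighbours = cooccurrence.get(article_id, {})
    # Highest count first; ties broken alphabetically for stable output.
    ranked = sorted(neighbours.items(), key=lambda kv: (-kv[1], kv[0]))
    return [other for other, _ in ranked[:n]]


# Made-up sample log: four sessions, four articles.
LOG = [
    ("s1", "art-A"), ("s1", "art-B"),
    ("s2", "art-A"), ("s2", "art-B"), ("s2", "art-C"),
    ("s3", "art-A"), ("s3", "art-C"),
    ("s4", "art-D"),
]

co = build_cooccurrence(LOG)
print(recommend(co, "art-A"))
```

Note that nothing here looks at descriptive metadata: the suggestions come entirely from patterns of use, which is why the approach only becomes interesting once enough activity data has been pooled.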

Neeta Patel’s simple visualisation of the MOSAIC circulation data demonstrates something similar – through my involvement with the SALT and Copac Collections Management projects, we know that libraries are already using their circ data (if they collect it) to inform collection management decisions, but that often this work involves scrutinising spreadsheets and figures. Visual views of the data can really help support such analysis, and give that at-a-glance overview that can often tell a whole story.

There’s obviously a lot more that could be said about these entries (I wish I could touch on them all) and hopefully we’ll hear some views from my Discovery cohorts. I’m now interested in seeing what conversations open up as a result, and what practical work we can carry forward through new collaborations.

#discodev — announcing the first Discovery developer competition

July 3, 2011

I’m pleased to announce that we’re teaming with UKOLN’s Developer Community Supporting Innovation (DevCSI) to run a developer competition throughout July.

There’s a lot of talk about the potential of open data, and how it can support innovation, but we want to try and drive that innovation in ways that help us understand in practical terms what’s possible, and what future use (and business) cases might look like.

It’s fairly simple — we want developers to build open source software applications or tools using at least one of our 10 open data sources collected from libraries, museums and archives. Other sources may be used (we encourage it).

This is a chance to win a nice chunk of Amazon vouchers (from £30 to £100) or an EEE Pad Transformer (a.k.a. hackable equivalent to the iPad)

Enter simply by blogging about your application and emailing the blog post URI to me by the deadline of 2359 (your local time) on Monday 1 August 2011 (now extended to 2359 on Monday 22 August 2011).

More information about the rules, criteria for judging, and ideas for what might be worth trying etc., is all detailed here.

p.s. If you blog, tweet, etc, then use: #discodev (uh huh).

Discovery conference, May 26th. An overview of the day’s discussions

July 1, 2011

Here is a summary of the main ideas and themes from the presentations and discussions at the Discovery Conference at the Wellcome Institute, London, on 26 May 2011. It’s based on notes taken at the time, and is therefore by necessity to some extent selective, but I’ve tried to be comprehensive and true to the spirit of the day. I’ve included references to some of the key twitter themes as these help to highlight issues of interest to the community.

Jane Plenderleith of Glenaffric Ltd (and member of the Discovery Communications Team)

Opening Address from David Baker, JISC

Our starting point was the RDTF Vision Statement of 2009. Since then there’s been some discussion about scope, suggesting that the vision should not be limited to UK HE. Following some heated discussion at the 2010 JISC conference, the vision is about opening access for all. But we have to start somewhere, hence the focus on UK HE. In our definition of the future state of the art, it’s important not to try to project too far forward, so the focus is on what we aim to achieve by 2012.

We are aiming for integrated seamless access, focusing on UK HE in the first instance, with a thorough and open aggregated layer, designed to work with all search engines, through a diverse range of personalised and innovative discovery services. Increasing efficiencies is clearly important for sector leaders and managers – the potential of open data to address this priority needs to be emphasised.

At the moment we are in Phase 1 of this process, focusing on open data. More detail about the call from JISC for moving into Phase 2 will follow. Phase 1 achievements include:

– Excellent Open Bibliographic Data Guide
– Projects
– Metadata Guidelines
– Newsletter (people can sign up, keep in touch, feedback, engage in interchange of ideas and experiences)

There was a successful event in April, with good engagement, proposals for further work, and suggestions for a ‘Call to Arms’. This has resulted in the Statement with eight Open Metadata Principles in the pack for today’s event.

Eight projects have already been funded, focusing on a broad and appropriate range of issues, providing a test-bed for the Phase 1 work, and giving us a good platform on which to build Phases 2 and 3.

We are working on making metadata open and easier to use, distilling advice and guidance from the eight projects. Key stakeholder engagement is vital to this process. This new phase of work is under the brand ‘Discovery’. RDTF was a clunky old name, but when boiled down to the essence, it’s about developing a metadata discovery ecology for UK education and research.

Engaging stakeholders and developing critical mass is key. With the community we want to explore what open data makes possible. Since the first event in Manchester on 18 April many people have signed up to the Statement of Principles. Today’s event is about using this momentum to move the open data agenda forward.

Part 1: The Demand Side – User Expectations in Teaching, Learning and Research

Keynote 1: Stuart Lee

Stuart was addressing the conference (by filmed videolink) wearing two hats, one as a researcher in the humanities and one as an IT service manager at the University of Oxford.

He started with a historical overview of his data usage techniques: ‘When I was writing my PhD thesis, I had to produce a glossary. The normal method at the time was using a card system, which took a long time (a friend took one and a half years). But I was trained to use a text analysis tool, so it took me three hours. Later I was asked to produce a monograph of my thesis, but instead I made a pdf and put my thesis on the web. I didn’t know at the time that this was in fact open publishing, but this had far more impact than if I had published in book form. It’s been downloaded thousands of times, and made my reputation in this field.’

Stuart went on to make reflections on how researchers in the humanities work, and what open data might mean. Researchers in the humanities never really finish. Projects have a long life span and are often revisited. We work in an iterative cycle, our research is unbounded and incomplete. We don’t just publish and move on to the next thing. We tend to work alone, in our own way, not in teams in labs. Print is a very important medium for us. We use primary and secondary resources, we find stuff through browsing catalogues.

In a nutshell, we just want to find ‘useful stuff’. Modern researchers are less worried about provenance, they are more concerned with usefulness. Many collections that we use are built by other academics working in our field.

We use tools to edit, analyse and compare data. We need to organise material so we can quickly find it. We have to present our work in a particular way – present an argument, combine primary and secondary resources. Citation is important, particularly of recognised names in the field. The material we produce has to be safeguarded, archived, so we can come back to it and others can use it. We want it to be available for a long long time. It does not go out of date like science stuff does.

So what opportunities does open data present for researchers in the humanities? We are very interested in open data and the Discovery agenda. We can now achieve the previously impossible – find relevant resources quickly, deal with mass quantities of data (example: corpus linguistics), achieve low cost distribution (example: iTunesU). Storage is no longer a problem, we can search across data silos from our desk and take advantage of cross-searching possibilities.

Perhaps we undervalue serendipity when we are looking for resources. If you are scanning books in the library, you find useful stuff on either side of the one you are looking for. If you are browsing data on a keyword search this throws up lots of possibilities.

There are a lot of chances for collaboration using online tools. We work very much in our own sub domain, with international connections in our field. We need better bibliographic tools like Mendeley.

Inevitably, open data poses some problems and challenges. Who is a researcher? Increasingly libraries have to incorporate meeting the needs of people beyond the usual HE sphere including the public and corporate bodies. There’s a lack of awareness about what is available. There’s a need for better standardisation. Text analysis tools haven’t advanced much in 20 years, and training undergraduates in their use is still necessary. There is still a problem with accessing data when we don’t know its provenance.

We need to break free of the stranglehold of academic publishers – we in the humanities are every bit as fed up about this as people working in the sciences. The system we have at present is unsustainable. We need to make metadata open to make it easier to find things. There are more challenges relating to the analysis of data, and preserving knowledge. We need support for adopting open content, both top down and bottom up.

Stuart ended with some comments on the changing nature of the library itself, the concept being no longer of a physical building, but a whole plethora of bodies holding information and making it available in what Stuart called the ‘cl**d’ (he doesn’t care for the word).

Keynote 2: Peter Murray-Rust

PMR’s focus was researchers in the STEM field, and he was provocative from the outset. How many practising scientists are in the room? None. That’s the problem – scientists have no use for university libraries and repositories.

There are global and domain solutions to resource discovery. We have the technical solutions – what we need to make this happen is political will. For example, only those universities which have mandated publishing work in repositories (such as Ghent, Queensland, and to some extent Southampton) actually use them.

By comparison, look at the OpenStreetMap project (an open information resource for global maps). People have really contributed to this. They even held mapping parties. Example: after the earthquake they created a digital map of Haiti in two days for the rescue services. That’s the power of crowdsourcing. But there is no sense of the power of this in JISC – their strategy is to rely on publishers getting the stuff for us. But publishers, says PMR, produce garbage (this remark aroused amused assent from most of the people in the room).

PMR continued in this provocative vein. ‘It is quite simple for us to produce our own discovery data. Example: I have an interest in UK theses, so I went to EThOS. I went with a simple and fundamental question – trying to access all Chemistry theses published in the UK in 2010. But they are scattered over different repositories, not searchable, and not available in any integrated way. In France they have the SUDOC catalogue – with 9 million bibliographic data references. If there is one clear message from today, it’s “do what the French do”.’

It is technically trivial to turn documents into PDF, but this is an appalling way of managing data. PDF is like turning a cow into a hamburger. You can’t turn the hamburger back into a cow. (The twittersphere took up this comment and retweeted it many times.)

Another example of where it doesn’t work: I put 2,000 objects into the Cambridge DSpace, but then I couldn’t get them out. I had to write some Bash code to get my own objects back out again.

More provocation: We are paralysed by the absurdity of copyright. I know people who delight in not doing anything because of copyright. Any small interaction that is not automatic kills open data. Google just goes and does it.

PMR’s solution to these problems was to build his own repository – a graduate student built it in a year, and it now costs about 0.25 FTE a year to maintain. Some funding was secured under JISC Expo to make open bibliographic data available. We have ‘liberated’ 25 million bibliographic references. It’s important to aim outwards not inwards. Example: PubMed is funded in the UK by the Wellcome Trust. This organisation has done more than the whole of UK HE to push the open data agenda forward.

For PMR, what would really make this work is support from the major research funders: Wellcome, RCUK, Cancer Research UK. But they are not here today. If the funders were to mandate that all the work they fund is published openly, and state that if you don’t publish your data you won’t get another grant, this would have a serious impact. All that would then be required would be to manage the bibliography, and that’s easy. Open data just requires political will and management to make it happen.

Research Conversation

The opening keynotes gave rise to a lively debate about open data for research, with comments and questions from lots of people. The tweet wall was also animated, echoing key points and making further suggestions and generating ideas. Here’s a summary of the main questions and comments from this session:

What is the value of open data to researchers? What’s the value of a map to geographers? It’s a vital resource – we need to know who is doing what, with links to everything. With that, we have the complete spine of scholarship. Bibliography is the map of scholarship. There are also management uses for data about published papers.

PMR said that data are complicated, diverse, and domain dependent. Every discipline is different and has its own views on what data is. It will take 25 years to sort out what scientific data actually is on a technical level.

How important is provenance? Researchers care about provenance and how something came about. We need to exercise critical appraisal when assisting in the construction of information sources. But while provenance is important, it is also incredibly difficult. In the first instance what’s important is that the data is available.

What do we have to say to the funders to make them listen? Funders want the work they fund to be widely used, discovered, read, computed, built on.

The tweet wall at this point was alive with comments about IPR and copyright risks.

Is there the same ethos of collaboration and openness for museum data? Museums are protective of what is effectively their life’s work. There are copyright worries about the protection of intellectual capital. Providing an open record to a world where it might be challenged or used in a context for which it was not intended is quite challenging for museums. But it was also noted that there are people in museums who do want to share.

Should publicly funded research institutes make their data openly available? PMR praised organisations like BAS and NERC which are dedicated to maintaining data and making it publicly available. He noted that in academic communities this practice is variable. Some researchers would die rather than make their data available, while others are doing this quite freely. In some places there is an embargo on publication for five years, in case people might find out what they are doing. Issues relating to university ethics and data storage policies were mentioned.

It was suggested that what is needed in the sector is strong leadership promoting open data. There’s a particular problem with senior academic managers, working in a factional REF-dominated culture of competition. In industry, competitors manage to work together on issues of common interest, while still maintaining competition.

David Baker summarised the key issue: it is becoming apparent that the political and legal challenges to open data are more difficult than the technical.

Keynote 3: Drew Whitworth

Drew’s focus was the role of open data for teaching and learning in a variety of formal and informal contexts. A key theme was information as a resource in the environment – it does not diffuse itself evenly, it can be controlled, polluted, degraded. Drawing on Rose Luckin’s 2010 work ‘An ecology of resources’, Drew noted that an ecology evolves in a dynamic way. When you use resources you transform them into something else. This can be a problem – if we transform resources into pollution, we are not using them in a sustainable way. Sustainable development means you meet current needs without damaging the process of meeting needs in the future. How are you using information now? Are you developing resources that will lead to enhanced resources in the future? We need to use resources now to build resources for the future.

In his book Information Obesity (2009) Drew presented the argument that while logically, information is a good thing and we need it, a lot of information can be a bad thing (why do we talk about information ‘overload’ not ‘abundance’?). It is the same with food – it is possible to have too much food or the wrong kind of food. Fitness means eating smaller amounts of the right kind of food. We are under pressure to consume, and this works for information as well as food. Obesity is not just about over-consumption. It depends on individuals, and purpose. Athletes process lots of calories. Some of us can process lots of information. But we don’t want to turn learners into information processing machines.

Drew described the JISC-funded MOSI-ALONG project which was trying to connect museum artefacts of local relevance with real people and stories from the community.

In summary: we have to remember that learning and information processing happens all the time through communities. If we don’t look after our information environment it will become polluted. Environments are healthiest when they are diverse. We need to look after these environments, protect against storms and natural disaster. It falls upon people in workplaces, and business leaders, to make sure the information environment on which we depend is sustainable. Our task is to look after these environments, and it’s everyone’s responsibility.

Teaching and Learning Conversation

Does the UK discovery ecotecture need to concern itself with usability or are we simply aiming to get the stuff out there? We definitely need a usability strategy. Otherwise people can just shove data in and it’s unusable.

We also need to be aware of our filtering strategies. We are programmed to filter sensory information all the time. We have known tendencies to filter out information that challenges our primary beliefs. You want to give help and guidance, you have a mental model of the data, you have some organising principles, but you need to build in some flexibility in case your mental models do not match those of your data users. This is key to effective use of these resources for learning and teaching. We need to guide, help, but not fix and control.

There is a danger in the paradigm of respected provenance: we need to be wary of gate-keeping, and think about filtering throughout the chain of use. But from a metadata standpoint – if we try to predict how users will use data, we tie ourselves in knots. For the Discovery initiative there is a sense we just have to get the content out there so that communities of practice can start to repurpose it for their needs.

Usability and discovery are different but related. The challenge is – we are used to thinking of usability in HCI terms, as making it easy for people to navigate and use.

But how do we channel that thinking about flexible usability while still making it possible for people to uncover the complexity of the data?

Whatever usability criteria there are need to be continuously reviewed in the light of how people are using the data. Any organising principle can become too restrictive. Scaffolding learning is a good principle – but when the job is done the scaffolding comes down. The challenge is finding a way to use scaffolding for information retrieval then take it down so people can find for themselves.

We need a discussion about the nature of infrastructure, so the scaffolding notion is useful. If we immediately apply this – we have processes that generate metadata, and much of it is context bound. We are moving towards just-in-time metadata, that is generated from processes. You might need the scaffold for 10 seconds or 10 months.

The elephant in the room is VLEs. People who are populating VLEs are not putting together temporary scaffolding, it’s a bit more permanent. There are competing approaches to describing resources and we need to take this into account. The problem with VLEs for learners is that the second they leave the institution they no longer have access.

Information is a resource in a context. What else is necessary in order to turn information into learning? It’s not really possible to say what turns information into an educational resource. The quality of the teacher, the motivation of the student, the relevance in context. You cannot reduce education to a science: it is unpredictable, conversational, context specific.

There has to be redundancy to make our ecology healthy and diverse. Funders make decisions. JISC has a pretty good approach: they consult and collaborate before they set priorities. For others funding is based more on political expediency, and this is worrying. There is a need to prioritise developments, but let’s do this in the right way and leave some room for flexibility.

At this point in the proceedings there was a welcome break for lunch.

Part 2: The supply side: Opportunities to expand access and visibility

Summing up the morning, David Baker said:

Seamless access, flexible delivery, personalisation – if we can put these three together, there is a very exciting future.

The afternoon session was chaired by Nick Poole. HE/FE says ‘we need this’. Just do it. Politicians say just do it. We say do it, but do it well. The afternoon session was to present examples of people who have just done it.

Veronica Adamson: The Art of the Possible, Special Collections

Key points from the discussion:

Special Collections may be the key that unlocks the potential of open data for many people – it resonates, there is an understanding, and there are examples of where LAM can really work together.

Having a business case is essential – LAM managers need to be able to make the case for open data on the basis of efficiency savings, improving the quality of learning and teaching, enhancing research output, widening participation, raising the profile of the institution.

Collections experts may not be the best people to make the business case for managers. It’s not just about listing benefits, it’s about costs and benefits.

There is some interesting work by Keith Walker at IoE looking at learning trails in museums.

Peter Burnhill: Aggregation as Tactic

Aggregation means combining different sources of data, seeing the machine as user.

There is a purpose beyond discovery.

Do we need to decouple the metadata layer from the presentation layer? This is a techie question but it’s important.

Supply and demand – maybe we are all middle folk. We are adding value by bringing different streams of data together, making them more amenable for access. Aggregation is an intervention with some purpose.
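To make the idea of aggregation with "the machine as user" a little more concrete, here is a minimal Python sketch. It is an illustration only, not anything presented on the day: the source names, record fields and the `aggregate` function are all hypothetical. It merges records from several sources on a shared identifier and keeps a note of which sources contributed, so that provenance is not lost along the way.

```python
def aggregate(sources):
    """Merge records from several sources, keyed by a shared identifier.

    `sources` maps a (hypothetical) source name to a list of record dicts,
    each carrying an "id". The first non-empty value seen for a field wins,
    and every contributing source is listed under "sources".
    """
    merged = {}
    for source_name, records in sources.items():
        for record in records:
            entry = merged.setdefault(
                record["id"], {"id": record["id"], "sources": []}
            )
            entry["sources"].append(source_name)
            for field, value in record.items():
                # Skip the key itself, empty values, and fields already filled.
                if field != "id" and value and field not in entry:
                    entry[field] = value
    return merged


# Made-up sample data standing in for two separate data silos.
library = [{"id": "thesis-1", "title": "A Study", "year": 2010}]
archive = [
    {"id": "thesis-1", "title": "", "subject": "Chemistry"},
    {"id": "thesis-2", "title": "Another Study", "year": 2010},
]

result = aggregate({"library": library, "archive": archive})
```

The "first value wins" rule is only one possible choice; a real aggregator would need a considered policy for conflicting values between sources.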

There was some activity on the tweet wall at this point relating to important parallels between the discovery agenda and OER in terms of aggregation as a tactic.


The final session was a conference discussion about the scope of projects which might catalyse collaboration, focusing on events, celebrations and anniversaries in 2012. The debate continues.