On 5th October 2011 I attended a workshop on 'emerging bibliographic tools' organised by JISC. The idea of the workshop was to bring together a small group of people with experience of a wide variety of tools used to transform, publish, and otherwise manipulate bibliographic data.
The day kicked off (after introductions) with simply capturing the whole range of activity, formats and tools that the attendees thought were relevant to exploiting bibliographic data. The nature of this session made it rather a whistle-stop tour of technology and terminology, including:
- Linked Data and RDF
- NoSQL and related tools such as CouchDB, MongoDB (document stores) and Redis (a key-value store)
- Big data (defined as 'data bigger than you're used to handling') and Hadoop/MapReduce
- Identifiers – the challenges of finding and exploiting appropriate ones such as DOI, ISBN, AuthorClaim and ORCID
- Automatic metadata creation from full text resources
- Visualisation tools – from Google Charts to R
- Ontologies and representations – from MARC to BibJSON to RIS to BibTeX to Bibliographic Ontology to Schema.org
- ‘Data reconciliation’ tools such as Google Refine and the Stanford Data Wrangler
- Indexing technologies: Solr/Lucene, SolrMARC, Sphinx
- Code libraries for MARC: PyMARC, ruby-marc, MARC::Record, MARC4J
- Spidering/web crawling technology: CrystalEye, PubCrawler, Nutch
- … and more
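To make the formats discussion concrete, here is a minimal sketch of converting a BibJSON-style record to RIS, two of the representations listed above. The field mapping is a simplified assumption for illustration, not a complete implementation of either format:

```python
# Illustrative sketch: a minimal BibJSON-style dict rendered as RIS-tagged
# lines. Only a handful of fields are mapped, and the record is assumed to
# be a journal article.

def bibjson_to_ris(record):
    """Render a dict of bibliographic fields as RIS-tagged lines."""
    lines = ["TY  - JOUR"]  # assume a journal article for this sketch
    for author in record.get("author", []):
        lines.append("AU  - " + author["name"])
    if "title" in record:
        lines.append("TI  - " + record["title"])
    if "year" in record:
        lines.append("PY  - " + str(record["year"]))
    lines.append("ER  - ")  # RIS end-of-record tag
    return "\n".join(lines)

record = {
    "title": "Emerging bibliographic tools",
    "author": [{"name": "Smith, J."}],
    "year": 2011,
}
print(bibjson_to_ris(record))
```

Even a toy mapping like this shows why the 'multiple formats for multiple audiences' problem discussed later in the day is non-trivial: each format carries its own tags, conventions and losses.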
However, there was also time to discuss some aspects in more detail, going beyond just the tech, and starting to talk about the skills required to manipulate bibliographic data, and potential developments that might support those working with data, such as identifier lookups, visualisations, and data transformation services.
After lunch we picked up on these latter points, looking for the opportunities, challenges and gaps that existed. The morning discussion had highlighted the incredible range of relevant technologies, and one of the challenges identified in the afternoon was keeping on top of existing and new initiatives; mentoring and online community support were identified as opportunities to help with this.
In the morning a healthcare metaphor was introduced with some discussion of a 'Data Doctor' role for organisations – someone with the technical skills, domain knowledge, and data expertise, who would be responsible for ensuring that the organisation's data was in 'good health' (see also 'data scientist'). In the afternoon, this concept was expanded with the idea of a 'data health check' service: somewhere you could load data to identify possible problems and, crucially, get suggested workflows and resources for improving the data.
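A 'data health check' might start as something as simple as scanning records for common problems. The sketch below is purely hypothetical – the checks and field names are illustrative assumptions, not a description of any real service:

```python
# Hypothetical sketch of a 'data health check': scan a list of
# bibliographic records (dicts) for common problems and report them.
# The specific checks and field names are illustrative assumptions.

def health_check(records):
    """Return a list of (record index, problem description) pairs."""
    problems = []
    for i, rec in enumerate(records):
        if not rec.get("title"):
            problems.append((i, "missing title"))
        isbn = rec.get("isbn", "")
        # ISBNs should have 10 or 13 digits once hyphens are removed
        if isbn and len(isbn.replace("-", "")) not in (10, 13):
            problems.append((i, "malformed ISBN"))
    return problems

sample = [
    {"title": "A book", "isbn": "123"},   # bad ISBN
    {"title": ""},                         # missing title
]
print(health_check(sample))
```

The interesting part of the workshop idea was the second half – pointing users at workflows and resources to fix what the checks find – which a report like this could feed into.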
Perhaps the most crucial issues identified in the afternoon were around skills and sustainability. As we see an increasing need to manipulate data and publish it simultaneously in multiple formats to serve different audiences and needs, we need to find staff with appropriate skills, and ensure managers understand the business case for this work and the skills needed to support it.
At times, the range and scope of the technologies, tools and issues identified by the workshop was overwhelming, as acronyms and jargon flew freely around the room. However, the opportunities opened up by new ways of working with bibliographic (and other) data are exciting, and I strongly believe that we can take advantage of these to produce richer expressions of our data than ever before.
The technologies and tools identified by the workshop will form the basis of a short guide which will be published by the Discovery initiative.