Tim has a nice post about OCLC's recent decision to ditch their new license policy and how it reflects on the nature of the OCLC hegemony in the Library community. Jennifer Younger's presentation about the decision is well worth listening to, as it highlights some of the feedback from the member community as well as the self-interests of OCLC. It still concerns me that The Board is unable to fully recognize the nature of the current information ecosystem and our role as librarians in ensuring it is open and accessible. The OCLC world is still tilted towards the physical manifestation of OCLC (and the revenue stream derived from members' records), and away from a rich online ecosystem that encourages sharing and innovation. I was particularly concerned with the statement "Identify threats to the sustainability of WorldCat and strategies for protecting it against unreasonable use." Duh. That's the whole point that OCLC still fails to get - there is no unreasonable use of this information, as it belongs to everyone, and any attempt to describe a use as unreasonable is itself unreasonable. Listening to Younger's presentation gave me the chills, as it described the "threats" to the "members'" information. Please. Stop already. Find another revenue stream and release the records - the information commons will be better off, and we will get more out of it than we could ever possibly lose.
I was at OCLC in Ohio last week speaking at the Rethinking Resource Sharing conference (a nice bunch of librarians it is) and, without going into detail, my closing keynote was not appreciated by all, due to a slam of the proposed OCLC license (I didn't know about the planned reversal, but likely would have slammed anyway). Why is it that some can't accommodate constructive criticism of our own profession, particularly as reflected by OCLC? I'm not really sure, but this whole thing does make me a little sad...
Matthias Razum (FIZ Karlsruhe) gave A Closer Look at Fedora's Ingest Performance. The group used vanilla hardware with a single processor and 2 GB RAM to look at ingest speeds and optimization options. The ingest consisted of about 4.9 million objects/500 million triples (PDFs from the patent database, which took 3 weeks to ingest). CPU was not really the limiting factor; it was I/O. There was no difference from JDK 1.5 to 1.6. There was no real difference between the various triplestores, or no triplestore at all, meaning that maintaining triples does not add significant overhead. The most promising area of optimization was Postgres tuning - they switched off Postgres's guarantee of recoverability if the machine goes down during an operation, which resulted in a highly significant change in ingest rates (from 130ish ms down to 40ish ms). With MySQL, tuning the InnoDB/MyISAM tables resulted in similar levels of performance improvement. Putting the DB on a separate machine, even with the network overhead, yielded a significant improvement as well. Other findings: there was absolutely no degradation as the number of objects grew, indicating the scalability of Fedora; combining database I/O and other tuning can yield an improvement factor of 4. Another thing the group has considered is creating a number of Fedora instances and then merging the indexes later. Dan Davis provided an update on the work with Sun and highlighted the conclusions of the Karlsruhe work. They will be using the open source Grinder app to create a testbed for ongoing work in this area.
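For anyone curious what that Postgres durability trade-off looks like in practice, here is a minimal config sketch. The talk didn't give the exact settings, so these specific parameters are my assumption - fsync and synchronous_commit are the usual knobs for this:

```
# Hypothetical postgresql.conf settings for bulk-ingest benchmarking.
# These are my guess at the kind of tuning described, not the talk's exact settings.
# WARNING: with fsync off, a crash mid-operation can corrupt the database,
# so this only makes sense for rebuildable bulk loads.
fsync = off                 # don't force WAL writes to disk on commit
synchronous_commit = off    # acknowledge commits before the WAL hits disk
```

The speed-up comes from trading crash safety for far fewer disk syncs, which fits the talk's finding that I/O, not CPU, was the bottleneck.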
Gert Schmeltz Pedersen (Technical University of Denmark) spoke about Fedora and GSearch in a Research Project about Integrated Search. Gert looked at integrating multiple Fedora/GSearch implementations in a federated search kind of opportunity. Zoned out on this one - too much data on little slides.
Tom Cramer (Stanford University), Richard Green (University of Hull), Lynn McRae (Stanford University), Tim Sigmon (University of Virginia), Ross Wayland (University of Virginia) presented on Case Studies in Repository Workflows: Three Approaches. This is critical stuff for the Fedora community - workflows are what it is all about and I think will be an area of major activity for the next couple of years. One nice thing about the new Hydra project is the intention to build a standard workflow tool, and the fact that the 3 partners are each using a different framework for building workflows means they have a greater chance of coming up with something cool. See: being different is good :-)
Jeesh. I was in business-y type meetings most of days 2 and 3 so I wasn't able to take in many sessions. A few comments from the ones I did catch follow below.
The first session on the last day was Courtney Michael and Chris Beer (WGBH Educational Foundation) speaking on Disseminating Broadcast Archives: Exposing WGBH Materials for Scholarly Use and the Open Vault project. They use PBCore for the metadata, which sounds like a good option for describing rich media. The interface was built in PHP. They have a Tab/Annotate feature which allows them to add a tag or annotation to a specific timestamp. They use video, transcript and metadata datastreams. The video CM has 3 separate video streams (archival, proxied and streaming). The WGBH group has done a great job of modeling the whole "video assets for scholars" context - well done. UPEI is working on a number of video-based projects, so we will definitely be giving Chris and Courtney a call to use their content models and applications.
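The timestamp-anchored tagging idea is simple to sketch. Here's a rough Python illustration of the concept (the class and method names are my own invention, not WGBH's actual design):

```python
class AnnotatedMedia:
    """Toy sketch of tagging/annotating a media stream at specific timestamps."""

    def __init__(self):
        # map from timestamp (seconds into the stream) -> list of tags/annotations
        self.annotations = {}

    def annotate(self, seconds, note):
        """Attach a tag or annotation to a specific point in the stream."""
        self.annotations.setdefault(seconds, []).append(note)

    def at(self, seconds):
        """Return everything attached at a given timestamp."""
        return self.annotations.get(seconds, [])
```

The nice property is that annotations stay a separate datastream keyed to the video, exactly the separation of video, transcript and metadata described above.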
David Paul Descheneau (University of Alberta) talked about Agile Fedora: AJAX, Low-cost Clustering, and Dynamic Metadata Forms for a Multicultural Website Project. Their project is an ethnomusicology study that worked with a number of cultural groups and ultimately stored 500 hours of video, 500 hours of audio and 2500 images. The group found that one of the key issues for them was the workflow, with a number of concurrent independent processes. Their SAMC Media Processor back-end sounds like a must-see for the workflows associated with video/audio file formats, including the OpenPBS software that was used to create a media cluster using surplus equipment. The SAMC Cataloguing tool was VERY nice, lots of AJAX and smart screen design with live updates of the XML datastream. I liked the way they had high-quality documentary style videos for each major theme, and then they listed all the video clips used in the piece at the end of the page. They used h.264 low/med/high Quicktime and H263 low/high Flash outputs for the video targets. They also created their own metadata schemas from the ground up to reflect the data they needed to capture for music, events and cultural expression. This was also done to provide a more sensitive context in which to engage the cultural groups and reflect their ideas about how to describe their culture.
Jon W. Dunn (Indiana University) talked about PhotoCat: Implementing a Cataloging Tool for a Live Fedora Repository. They created a photo cataloguing tool which allows things like: the creation of "bags" to run the same edit operation on a collection of images; managing a controlled vocabulary database with auto-completion.
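A rough Python sketch of those two features, "bags" for batch edits and vocabulary auto-completion (the function names and data shapes are my invention, not PhotoCat's actual code):

```python
import bisect

def batch_edit(bag, field, value):
    """Run the same edit operation on every image record in a 'bag'."""
    for record in bag:
        record[field] = value
    return bag

def autocomplete(vocab, prefix, limit=10):
    """Suggest controlled-vocabulary terms beginning with `prefix`."""
    terms = sorted(vocab)
    # binary search for the first term >= prefix, then scan forward
    i = bisect.bisect_left(terms, prefix)
    matches = []
    while i < len(terms) and terms[i].startswith(prefix) and len(matches) < limit:
        matches.append(terms[i])
        i += 1
    return matches
```

In a real cataloguing tool the vocabulary would come from the controlled-vocabulary database rather than an in-memory list, but the sorted-prefix-scan idea is the same.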
I love this Fedora community! So many bright people making their brightness available to the larger community. Brilliant.
The afternoon session featured John Kunze and Sayeed Choudhury talking on NSF DataNet: Curating Scientific Data. John started out with a number of big-data examples from the climate change arena. Among the 4 challenges he highlighted in this domain:
Dispersed Sources - agencies, data centres, individuals
Diversity of Data Types
He also described the DataONE project, which is designed to provide access to data about life on earth. The project will look at data types from biological and environmental domains, and it looks like they will join substantial existing research groups into one data curation context. "Data is like software, but even more sophisticated" was one quote to illustrate the complexities of the data curation challenges. John talked about the idea of digital preservation being about building an outcome, not a place with a "deadly embrace". He talked about their efforts to, in essence, build a repository using the simple and effective tools available to them at the operating system level. I may have missed something, but it sounded suspiciously like they were kinda rebuilding Fedora? I'm sure there are good reasons for going where they did, but I would be interested to see an initiative like this consider working with something like Fedora to effect what they are looking for as an outcome: an open and flexible repository of research data. The benefit to the larger community would be considerable.
Sayeed talked about the Data Conservancy project, which I unfortunately missed :-(
I spoke in the second set of sessions Monday morning, so have a more abbreviated report. Michael Witt (Purdue University) gave a talk entitled Eliciting Faculty Requirements for Research Data Repositories, which presented some initial results of a survey of faculty on desired uses of research repositories. I found Michael's talk very interesting as it highlighted what we are finding in practice with the development of our VRE and the various research groups we are dealing with. They are interested in a number of things, but high among the most desired are the ability to store their data in a safe environment, being able to transform their data to future/emerging formats, being able to search and share their research data with a range of groups in different ways over time, and more.
I talked about developing the VRE service at UPEI and how we leveraged research funding and interests in developing an approach that led to an agile development environment and a sustainable research ecosystem. The PDF is available.
Wendy White (University of Southampton) presented Institutional Repositories: Contributing to Institutional Knowledge Management and the Global Research Commons. She talked about their approach to knowledge management and how to work with research communities of practice to effect a strong IR landscape.
As I always (usually...sometimes...rarely) try to do, I will be commenting on the sessions at the 2009 Open Repositories Conference.
Jeremy Frumkin (Arizona) - the Global Registries Initiative is a registry of collections of web resources and related services. The mantra of the project is "register locally, discover globally". Rather than letting Google do it, or using Linked Data as the registry, they decided to build their own to accommodate specific metadata needs and requirements, such as accessibility via OAI-PMH or other appropriate standards. Vic Lyte talked about the IESR (Information Environment Service Registry), which is a registry of UK resources and services, accessible via web search, web widget, OpenSearch, RSS, OAI-PMH, SRU/W, OpenURL and yes, you know it, Z39.50. I like the multiple access tools/protocols - why can't library systems/databases do the same? Chris Blackall (Australian National Data Service) talked about their effort, which uses the ISO 2146 Registry Services standard and is called the ANDS Collection Services Registry. Jeremy also showed a demo of the LibraryFind system doing a federated search across registries - all 3 mentioned here. There is no doubt that this is a step in the right direction, especially since many of these would be inaccessible (dark web) to the standard engines like Google. Now all we have to do is make them even more accessible by moving them into a semantic webby context.
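As a small illustration of why exposing registries over standard protocols matters: an OAI-PMH request is just a base URL plus a few query parameters, so harvesting takes almost no code. A Python sketch (the endpoint URL is a placeholder, not one of the registries mentioned above):

```python
from urllib.parse import urlencode

def oai_pmh_url(base_url, verb, **params):
    """Build an OAI-PMH request URL, e.g. for harvesting registry records."""
    query = {"verb": verb, **params}
    return base_url + "?" + urlencode(query)

# e.g. harvest Dublin Core records from a (hypothetical) registry endpoint
url = oai_pmh_url("http://registry.example.org/oai", "ListRecords",
                  metadataPrefix="oai_dc")
```

Any client that can fetch a URL and parse XML can then aggregate across every registry that speaks the protocol, which is exactly what makes the federated-search demo possible.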
Simeon Warner (Cornell) gave a session called Author Identifiers in Scholarly Repositories which reflects arXiv's approach to this issue. The need for author IDs is familiar to anyone who has looked for papers by a specific author in Google or even the standard databases. The key is to get help from the source (i.e. repository), but there are challenges, such as the fact that IRs have an institutional boundary and subject repositories have a domain boundary - and research often crosses boundaries. Simeon raised some of the challenges with creating an ID repository, including privacy, accuracy, longevity, openness, etc. There are a number of existing author ID services (one of the best is RePEc, which is for economists), including domain specific, national, etc. arXiv's approach is to get authors to claim papers at some point in time by providing useful services around it. One thing they could do with this is provide a list of publications for that author, with CSS to present it in a specific style. This is very similar to what we want to do with our IslandScholar project, providing a widget that lists author pubs and even reformats for SSHRC or NSERC style, for example. They also have built a Facebook application that is fed by the arXiv author ID. A great example of what can be done with linked data.
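The "reformat an author's publication list per style" idea we have in mind for IslandScholar could be as simple as this sketch (the style names and record fields here are hypothetical placeholders, not SSHRC's or NSERC's actual formats):

```python
def format_citation(pub, style="plain"):
    """Render one publication record in a (made-up) named style."""
    if style == "author-year":
        return "{authors} ({year}). {title}.".format(**pub)
    return "{title}, {authors}, {year}.".format(**pub)

def publication_list(pubs, style="plain"):
    """Render an author's claimed publications, most recent first."""
    ordered = sorted(pubs, key=lambda p: p["year"], reverse=True)
    return [format_citation(p, style) for p in ordered]
```

Once an author ID reliably groups the right papers, a widget only needs to pull that list and run it through whichever style the funding agency wants.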
Matt Zumwalt (MediaShelf) talked about rapid application development for Fedora repositories. Matt gave an example (live) of building a couple of Ruby-Fedora apps, using the Jewish Women's Archive and a couple of others as examples. The challenge is to bridge the two worlds of agile development vs the "enterprise" style approach, which typically has a plethora of big and complex standards and applications. "Eyes on the Horizon (big goals), Feet on the Ground (iterative development)" is one mantra. Matt also likened big-system design to the Death Star, which was used before it was finished...and then it blew up. A nice analogy to what can happen with non-agile, big systems :-) Another nice quote re standards: "When they serve you, make standards the assumed, thus hidden convention. When they're harmful, say so." All in all a great talk about how to balance enterprise and agile approaches and what each brings to the table.
Fedora Commons recently announced the latest release of the flagship Fedora repository software (version 3.2). The latest release adds some key new features and a host of bug fixes:
Multiple Fedora WARs in one Servlet Container
Web-based repository administrative client
Initial integration with the Akubra abstraction of the underlying storage layer
SWORD 1.3 Support
Mulgara updated to 2.1.1
Fedora and DSpace also announced joining their organizations together to pursue a common mission under the name DuraSpace. From the PR:
The joined organization, named "DuraSpace," will sustain and grow its flagship repository platforms - Fedora and DSpace. DuraSpace will also expand its portfolio by offering new technologies and services that respond to the dynamic environment of the Web and to new requirements from existing and future users. DuraSpace will focus on supporting existing communities and will also engage a larger and more diverse group of stakeholders in support of its not-for-profit mission.
Tis a good time indeed to be in the repository landscape - good luck to the DuraSpace team!
Register now for the 2009 Red Island Repository Institute on Prince Edward Island - July 20-24.
Reserve your spot at the Fedora-focused Repository Institute on Prince Edward Island, one of Canada's premier travel destinations, known for its sandy beaches, golfing, seafood and iconic red dirt roads. The 1-week hands-on workshop will be led by Chris Wilper and Thorny Staples (Fedora Commons), Matt Zumwalt (MediaShelf) and Mark Leggott (Islandora, UPEI). Register soon, as seats are limited to 20 participants and they tend to go fast. If you plan on attending you should also book a room as soon as you can, as summer is peak season on the Island.
The Institute is hands-on and is targeted at individuals from institutions planning or running a repository program and is intended for users with a wide range of experience, from managers to programmers. Based on last year's feedback, we have added a number of break-out sessions tailored for either managers or developers. Attendees will be provided all the information and tools needed to implement and maintain a flexible repository program using Fedora. Since the Institute is a combination of lecture and hands-on experience, we encourage all participants to bring their own laptops. This will allow participants to return to their place of work with a fully-functional Fedora installation for further development and testing. Those participants who are not able to bring a laptop will be provided with one to use for the duration of the Institute.
Registration includes meals (except dinners for Tuesday to Thursday), special events and all materials. The workshop agenda and registration form are now available at http://vre.upei.ca/riri/.
If you have questions about this year's RIRI, please contact Mark Leggott at email@example.com.