Australian National Library uses open source for treasure Trove

New search engine provides access to over 90 million items about Australians and Australia

The National Library of Australia's Trove search engine has an open source backing

The National Library of Australia has opted for an open source platform to drive its newly unveiled search engine.

Called Trove, the search engine provides access to more than 90 million items about Australians and Australia, sourced from more than 1000 libraries and cultural institutions across the country.

The project’s team of five developers used Solr 1.4, which internally uses Lucene 2.9, for the main bibliographic search database and the web page archive, and MySQL 5 for managing all data relationships.
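As a rough illustration of what querying a Solr index looks like, the sketch below builds a request URL for Solr's standard `/select` handler. The host and port, and the field names `title` and `year`, are assumptions made for the example; the article does not describe Trove's actual schema or endpoints.

```python
from urllib.parse import urlencode


def solr_select_url(base_url, query, rows=10, start=0):
    """Build a URL for Solr's /select handler, asking for a JSON response."""
    params = {
        "q": query,      # Lucene query syntax
        "start": start,  # offset of the first result
        "rows": rows,    # number of results to return
        "wt": "json",    # response writer
    }
    return base_url.rstrip("/") + "/select?" + urlencode(params)


# Hypothetical local Solr instance and hypothetical field names.
url = solr_select_url("http://localhost:8983/solr", "title:gold AND year:1854")
print(url)
```

Sending an HTTP GET to a URL of this shape returns the matching documents; paging through a result set is a matter of increasing `start`.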

“That was something that was pretty important to us, we didn’t want to go and build something in-house,” Trove manager, Rose Holley, said. “We wanted it to be shareable when we were finished.”

Holley said the search engine evolved out of the library’s newspaper digitisation program, which began two years ago and runs off Lucene 2.9 "natively".

That program involved the use of Optical Character Recognition software to automatically convert old newspaper images into digital text. The small fonts and uneven printing of many of the newspaper pages made conversion difficult and not always accurate. As a result, more than 5000 online users helped correct the text, and the top correctors were subsequently slated to receive Australia Day awards for their efforts.

“If that was successful then our master plan was always to transfer the rest of our service into that infrastructure,” Holley said. “So the infrastructure for the newspapers service is the same for Trove.”

The project team also opted for Jetty as a web server, Nginx as the HTTP front-end and reverse proxy, JavaServer Pages (JSP) for the newspapers part of the site, and Restlet and FreeMarker for the remaining portions of the service.

Additionally, one of the main steps taken was to use solid state disks (SSDs) – four Intel X25-M 160GB drives in each machine – for the Lucene indices to achieve the necessary performance. Trove issues more than 8000 I/Os per second to the SSDs, which the team says would be expensive to achieve with even the fastest SAN setup.
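To put the I/O figure in perspective, the back-of-the-envelope sketch below works out the load per drive and the number of spinning disks that would be needed to match it. It assumes the 8000 I/Os per second are spread evenly over the four SSDs of one machine, which the article does not state, and it uses an assumed rule-of-thumb figure of about 180 random I/Os per second for a single 15,000 rpm disk, which is not a number from the project team.

```python
import math

TOTAL_IOPS = 8000        # figure quoted by the Trove team
SSDS_PER_MACHINE = 4     # from the article
ASSUMED_HDD_IOPS = 180   # assumption: rough rule of thumb for one 15k rpm disk


def per_drive_iops(total, drives):
    """Average load per drive if requests are spread evenly."""
    return total / drives


def disks_needed(total, iops_per_disk):
    """Smallest whole number of disks whose combined IOPS reaches the total."""
    return math.ceil(total / iops_per_disk)


print(per_drive_iops(TOTAL_IOPS, SSDS_PER_MACHINE))  # 2000.0
print(disks_needed(TOTAL_IOPS, ASSUMED_HDD_IOPS))    # 45
```

Under those assumptions, dozens of spindles would be needed to do the work of four SSDs, which is consistent with the team's remark about the cost of a SAN.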

The Trove website, meanwhile, is split into eight searchable categories:

Books, journals, magazines, articles
Pictures and photos
Australian newspapers (1803 – 1954)
Diaries, letters, archives
Maps
Music, sound and video
Archived websites (1996 – now)
About people and organisations

Unlike Google search, which provides a list of websites for search results, Trove displays links to items.

“We are searching across metadata mostly from cultural heritage organisations,” Holley said. “We have over a thousand organisations that have been providing their data. Obviously libraries have been looking to standards of data sharing for several years and we use a mechanism called OAI – the Open Archives Initiative.

“Whereas Google is trolling metadata for a website, we are doing a similar thing for data but for unique Australian objects, many of which are objects that cultural heritage institutions have digitised. Most of them you wouldn’t normally be able to find on the web through Google search because they are in the deep web or some database that wraps them up.

“So we don’t hold the actual items here, just the metadata. People search across that and when they find an item they want to see they follow the links and that will take them to where that object is.”
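The harvesting Holley describes can be sketched as follows, assuming that "OAI" here refers to the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) and that contributors expose Dublin Core records. The repository URL, the sample record and the helper names are hypothetical; this is an illustration of the protocol, not Trove's harvester. A harvester issues `ListRecords` requests, keeps the metadata and the link back to the item, and follows the `resumptionToken` to fetch the next page.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build an OAI-PMH ListRecords request URL."""
    if resumption_token:
        # Follow-up requests carry only the verb and the token.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_records(xml_text):
    """Return (records, resumption_token) from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        records.append({
            "oai_id": rec.findtext("oai:header/oai:identifier", namespaces=NS),
            "title": rec.findtext(".//dc:title", namespaces=NS),
            # Link back to where the item itself lives.
            "link": rec.findtext(".//dc:identifier", namespaces=NS),
        })
    token = root.findtext(".//oai:resumptionToken", namespaces=NS)
    return records, token or None


# Hypothetical response from a contributing institution.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Harbour photograph</dc:title>
          <dc:identifier>http://example.org/items/1</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>
""".strip()

records, token = parse_records(SAMPLE_RESPONSE)
print(records[0]["title"], "->", records[0]["link"])
print(list_records_url("http://example.org/oai", resumption_token=token))
```

Only the metadata and the link are stored, which matches the point that the items themselves stay with the contributing institution.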

One of the guiding concepts for the Trove team was ‘find and get’. In short, this means having no ‘dead ends’ on the search engine, so that users are always given somewhere to go next.

“A lot of people are only interested in getting it immediately, so that means it has to be digital or online,” Holley said. “That check box right next to the search box, only online, is really crucial for us. But there is a lot of stuff that are physical objects – we have the whole of the Australian national bibliographic database in there as well, which is basically books. If it is not online, then there will be options to find it at your local library or to purchase it. It doesn’t matter what the format is, we are making a big effort not just to offer purchasing books, but other items like music.”
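The ‘find and get’ idea, together with the "only online" check box, can be sketched as a small piece of logic. The record fields (`online_url`, `holdings`, `sellers`), the sample records and the final fallback option are all hypothetical, chosen only to illustrate the "no dead ends" principle; the article does not describe how Trove implements it.

```python
def search(records, term, online_only=False):
    """Match on title; optionally keep only items that can be viewed online."""
    hits = [r for r in records if term.lower() in r["title"].lower()]
    if online_only:
        hits = [r for r in hits if r.get("online_url")]
    return hits


def get_options(record):
    """List every way to get an item, so that no result is a dead end."""
    options = []
    if record.get("online_url"):
        options.append(("view online", record["online_url"]))
    for library in record.get("holdings", []):
        options.append(("borrow", library))
    for seller in record.get("sellers", []):
        options.append(("buy", seller))
    if not options:
        # Hypothetical fallback so the user still has somewhere to go.
        options.append(("ask a librarian", "enquiry form"))
    return options


# Hypothetical sample records.
CATALOGUE = [
    {"title": "Gold rush diary", "online_url": "http://example.org/diary"},
    {"title": "Gold fields history", "holdings": ["Local library"],
     "sellers": ["Example Books"]},
    {"title": "Gold mining map"},
]

for item in search(CATALOGUE, "gold"):
    print(item["title"], get_options(item))
```

Ticking the check box corresponds to calling `search(CATALOGUE, "gold", online_only=True)`, which would leave only the first record.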

The National Library has relationships with many bookshops already and is open to discussing further deals with other retailers. Many of the bookshops open up their databases so that Trove can index them and include them in search results.

One of the next steps for the Trove team is to roll out forums within the search engine site to enable users to interact and also provide further context around links and items.

The other priority is to get more content into Trove. This includes content from galleries and museums, as well as government data, which may become more accessible should the Government 2.0 Taskforce recommendations be implemented across agencies.

“Also, for example, people like the ABC,” Holley said. “They are people we have an eye firmly on. What we want is for any member of the public seeking information on, about, or by Australians or Australia to come here to find that.”

Try out the Trove search engine.