Digitising collections: how technology is speeding up the process of big data

In this article Advisor talks to Steen Dupont, project manager of the digital collections programme at the Natural History Museum, London, about how he and his team are developing new technology to speed up the digitisation of the museum’s collections. We also get an in-depth insight into Axiell’s expertise in digitising collections in an Ask the Expert Q&A.

Last month the Natural History Museum, London announced that it had digitised four million (4,039,334 at the time of writing) items from its collection and added them to a dedicated Museum Data Portal (launched in December 2015) as part of an ongoing project to make the collection more accessible online.

However, large as that number is, it only equates to around five per cent of the 80 million specimens in its collection, which spans 4.5 billion years and has been collected over a 300-year period, from the time of Sir Hans Sloane (1660-1753), whose vast collection became the basis of the museum upon his death.
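As a quick check on the “around five per cent” figure, the proportion can be worked out from the two numbers quoted above. This is only an illustrative sketch: the variable names are our own, and the 80 million total is the rounded figure given in the article rather than an exact count.

```python
# Figures as quoted in the article; the names are illustrative only.
DIGITISED_ITEMS = 4_039_334    # items on the Data Portal at the time of writing
TOTAL_SPECIMENS = 80_000_000   # rounded size of the whole collection

# Share of the collection already digitised, as a percentage.
share_percent = DIGITISED_ITEMS / TOTAL_SPECIMENS * 100

# Specimens that still have to be digitised.
remaining = TOTAL_SPECIMENS - DIGITISED_ITEMS

print(f"Digitised so far: {share_percent:.2f}% of the collection")
print(f"Still to digitise: {remaining:,} specimens")
```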

The four millionth specimen to be released onto the Data Portal was a moth that was part of the project to digitise British and Irish Lepidoptera. Dr Vincent Smith, Head of Diversity & Informatics at the Museum, says that for every scientist who comes to Exhibition Road to physically visit the Museum’s collections, around ten visit its digital collections. “The Data Portal has become the largest single gateway to Museum specimens, and use of this freely accessible data is creating opportunities for research and collaboration that would have been unthinkable just three years ago.”

As Advisor noted in 2016, it was the dawn of the internet that made the digitisation of collections a possibility, but it has been the subsequent development of the equipment and technology that has facilitated its rapid progress. At the Natural History Museum, the five-strong core digitisation team is at the forefront of developing and improving tools, such as cameras, scanners and artificial intelligence, that will allow them to fast-track what, at first glance, seems like an insurmountable challenge.

The scientific community is already using the online collection. So far more than 12 billion of the Museum’s specimen records have been downloaded during more than 150,000 separate download events. There have also been more than 85 scientific publications that cite the Museum’s data, covering topics from human health to species discovery. Natural history collections hold important information which can help scientists to answer key questions about the past, present and future of the solar system, the geology of our planet and life on Earth, including pressing issues such as pollination and climate change. Understanding the past can enable more accurate prediction of the future. By introducing its collection to the world on an industrial scale, alongside other museums digitising their collections, the Museum is contributing to big data, which can, through computerisation, reveal patterns, trends and associations on a mass scale globally.
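To give a feel for the scale of those downloads, the two usage figures can be combined into a rough average. This is a back-of-envelope sketch only: both inputs are quoted as “more than”, so the result is an order-of-magnitude estimate, and the variable names are our own.

```python
# Usage figures as quoted in the article (both are lower bounds).
RECORDS_DOWNLOADED = 12_000_000_000   # specimen records downloaded in total
DOWNLOAD_EVENTS = 150_000             # separate download events

# Rough average number of records fetched per download event.
avg_records_per_event = RECORDS_DOWNLOADED / DOWNLOAD_EVENTS

print(f"About {avg_records_per_event:,.0f} records per download event")
```

An average of this size suggests that a typical download is a bulk query rather than a single-record lookup, which fits the big-data use described here.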

“Once you have data from multiple museums you start being able to generate big data and answer big problems such as global warming, climate change and pollination declines,” says Dupont. “This is because you start seeing patterns and trends that you are not aware of when you don’t have all that data. And that is generated by computers. You can’t sit down in a museum and look at that – you need to do it digitally.”

The Museum’s Data Portal is now part of the Distributed System of Scientific Collections, a new initiative that has so far brought together 144 museums from 21 countries and more than 5,000 scientists with a vision to unify their collections into one digital research infrastructure. Dupont says that, while this is on a massive scale, it highlights how museums, even smaller local museums, can align their collections with others through digital platforms to share information.

The Museum’s digital collections programme began in 2014 and is essentially a response to one of the museum’s high-level strategies: to make the collections available and accessible as widely as possible. The Museum prioritises what it digitises based on a balance of factors, including scientific relevance and interest; public or cultural interest, such as unique historic collections; and feasibility.

There are multiple stations around the museum where digitisation takes place, to cope with the variation in the specimens. In some cases the digitisation equipment goes to the specimen because it is too fragile to be moved. Specimens are not only being photographed but also scanned using scanning electron microscopes or 3D surface scanners – some of these specimens, such as fossil mammals collected by Charles Darwin, have been made into 3D models. There is also an X-ray microscope, which produces 3D images of the internal structure of certain specimens with a resolution of up to 700 nanometres. Samples are placed on LEGO mounts on a conveyor belt and then X-rayed. Some of the cradle scanners that are used to capture the pages of tightly bound books in the collection have the additional help of LEGO arms to pull the pages as a human hand would.

The imaging workflows that the NHM is using have been developed specifically for the project, to solve a range of challenges including multiple labels on pins below specimens, and getting good resolution across larger specimens or drawers of multiple specimens. “Before we might have had to take multiple images to capture a large area with a lot of specimens in. The resolution on cameras is becoming so good we can potentially use a single image – that could reduce the amount of images that need to be taken by six times. It’s a huge step forward,” says Dupont.

Dupont and the team are also working on automated solutions for extracting information from images, including optical character recognition, which can look at a label image and convert text to digital information. The Museum’s labels contain a great variety of typed and handwritten texts, in multiple languages and going back hundreds of years, presenting a challenge for these types of software.

The Museum is now able to digitise pinned specimens and slides at an industrial scale, with specially designed set-ups that image them quickly and reliably. That is really important because it is dealing with approximately 34 million pinned insect specimens and two million slides.

“I estimated it would take one person 7.6 years to image all the slides and about 434 years to image all the insects,” says Dupont. “So it has to be as fast as possible, and that’s one of the tasks of the project: to conduct imaging on a mass scale.”
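Dupont’s estimates imply a working rate for a single person, which can be backed out from the collection sizes quoted above. The sketch below is illustrative only: it assumes a constant imaging rate, and the team calculation assumes the work splits evenly across the five-strong core team with no overheads, which is our simplification rather than anything the Museum has stated. The function and variable names are our own.

```python
# Figures as quoted in the article.
SLIDES = 2_000_000
PINNED_INSECTS = 34_000_000
YEARS_FOR_SLIDES = 7.6      # one person, Dupont's estimate
YEARS_FOR_INSECTS = 434     # one person, Dupont's estimate
TEAM_SIZE = 5               # the "five-strong" core digitisation team


def implied_rate_per_year(items: int, years: float) -> float:
    """Items one person would need to image per year to hit the estimate."""
    return items / years


def years_with_team(years_one_person: float, people: int) -> float:
    """Duration if the work split evenly across a team (an assumption)."""
    return years_one_person / people


slides_rate = implied_rate_per_year(SLIDES, YEARS_FOR_SLIDES)
insects_rate = implied_rate_per_year(PINNED_INSECTS, YEARS_FOR_INSECTS)
insects_team_years = years_with_team(YEARS_FOR_INSECTS, TEAM_SIZE)

print(f"Slides: about {slides_rate:,.0f} per person per year")
print(f"Pinned insects: about {insects_rate:,.0f} per person per year")
print(f"Insects with a team of {TEAM_SIZE}: about {insects_team_years:.1f} years")
```

Even under this generous even-split assumption, the pinned insects alone would occupy the core team for decades, which is why faster imaging set-ups matter so much to the project.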

The Museum is also documenting the progress of its digitisation project by producing methodology and scientific papers and attending conferences such as the Museums + Heritage Show, where Dupont gave a talk last month. The team has also organised digitisation workshops outside the Museum, and Dupont has given talks internationally, such as at the Natural History Museum of Jamaica, to pass on his knowledge.

“We are to some extent leading on the methodology of digitisation – sometimes in collaboration with industrial partners, whose equipment we can test and challenge in new ways,” he says.

Because this is a continuing project and because the Museum’s collections are so varied, the project has required identifying the commonality between the various parts of the collections, building and improving standards for image policies, and building workflows that will apply to different areas.

The Museum runs on an open-by-default data policy. Information that it puts on the online Data Portal is by default licensed as Creative Commons Zero (or By Attribution for images), meaning it’s as open as it can be for anyone to use.

“People should be able to use collections data as widely as possible – we know of people using it for artwork, and for teaching, as well as for science,” says Dupont. “Collections are a resource for everyone – and digitisation makes this possible more than ever before.”