Coders and librarians team up to save scientific data
20 March, 2017 21:04
On a windy, snowy night in Dover, N.H., about 15 people gathered in an old converted mill, staring at computer screens and furiously tapping at their keyboards.
The group – some students, some programmers, and at least one part-time dishwasher and data entry clerk – were braving the snowstorm and volunteering their time to try to keep scientific data from being lost.
It was one of dozens of data rescue events spread out in cities from Toronto to Los Angeles, and Houston to Chicago. These events, many on university campuses, have been going on since December, bringing together software programmers, librarians and other volunteers who are trying to safely archive scientific data from government websites.
"There's loss of data with any administration," said Daniel Pontoh, a data entry clerk, dishwasher and student at Great Bay Community College in Portsmouth, N.H. "We just know how fast that loss of data could happen with this administration."
Concern has grown since President Donald Trump took office. His administration has stated that it doubts the reality of climate change and has proposed deep cuts to the budget of the Environmental Protection Agency and the nation's top weather and climate agency, the National Oceanic and Atmospheric Administration (NOAA).
Scientists fear losing critical studies and long-term research in such wide-ranging areas as ocean temperature change, greenhouse gas emissions, changes in polar ice caps, gun violence and animal treatment in research facilities.
References to climate change were removed from the WhiteHouse.gov website on Inauguration Day. The Trump administration also reportedly told the EPA to remove online educational resources and links to climate-change data.
Some fear the data will be intentionally lost or altered. Others want to make sure the data is available in more than one location, especially more than one government website, since budget cuts could mean server space and upkeep of these data sets might no longer be a priority.
"We're most concerned that data might be taken offline and public accessibility will be gone and it'll only be available as [Freedom of Information Act] requests," said Margaret Janz, a data curation librarian at the University of Pennsylvania. "Our goal is to make trustworthy copies of data so it will be available to the public and suitable for research. ... This data should never have been in just one place."
Janz is on the planning committee for DataRefuge, one of the organizations working to archive scientific data that has been sitting on government websites.
The group, working with the Environmental Data and Governance Initiative, helps organize data rescue events.
DataRefuge has held about 30 data archiving events, each one bringing in about 100 attendees, according to Janz. The New Hampshire event, which was held March 10, had one of the smaller turnouts. The organizers are also working on ways to keep their community engaged for the long haul.
"Deleting data is like burning books," said Matt Jones, a software developer at Massachusetts-based Yieldbot, who was archiving data at the New Hampshire event. "I'm passionate about data and information.... I don't believe in throwing anything out. All data is relevant to somebody."
Volunteers with DataRefuge don't hack into sites, nor do they steal data. They are working to make copies of data that's in the public domain.
The volunteers receive training and then work during the events, sometimes continuing the effort at home.
Part of the work being done is called seeding, where participants nominate URLs to be stored in the Internet Archive, a San Francisco-based nonprofit, public digital library. If the archive's web crawler can extract the necessary data from a nominated page, it will.
If the page is too complicated for the web crawler to handle – say it has 100 different files or is highly interactive – then the seeders will note that and volunteers will get to work "harvesting" the information.
Using scripts and tools built with either Python or R, both programming languages, the harvesters will go through those pages manually, collecting the data sets, such as weather maps or GIS files, that they need to save.
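To give a sense of what a harvesting script might look like, here is a minimal Python sketch. It is not DataRefuge's actual tooling: the function name, the list of file extensions and the example.org addresses are all assumptions made for illustration. It scans a page's HTML and lists the links that appear to point to downloadable data files, which a volunteer could then review and save.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Assumed list of file types worth saving; a real harvest would tailor this per site.
DATA_EXTENSIONS = (".csv", ".zip", ".json", ".xls", ".xlsx", ".shp", ".kml")


class DataLinkParser(HTMLParser):
    """Collect absolute URLs of links that look like data files."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page address.
        absolute = urljoin(self.base_url, href)
        # Compare only the path, so query strings do not hide the extension.
        path = urlparse(absolute).path.lower()
        if path.endswith(DATA_EXTENSIONS) and absolute not in self.links:
            self.links.append(absolute)


def find_data_links(html, base_url):
    """Return the data-file links found in one page of HTML (hypothetical helper)."""
    parser = DataLinkParser(base_url)
    parser.feed(html)
    return parser.links


# Made-up page fragment standing in for a government data page.
sample_page = """
<a href="/data/temps.csv">Ocean temperatures</a>
<a href="about.html">About this page</a>
<a href="https://example.org/gis/coast.zip?v=2">Coastline GIS files</a>
"""

for link in find_data_links(sample_page, "https://example.org/reports/"):
    print(link)
```

The sketch only finds candidate files. Downloading them, recording where they came from and verifying the copies are separate steps that the volunteers' own tools would handle.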
At the New Hampshire event, volunteers were divided into two groups – one using Python and one using R. Then they got to work harvesting from complicated pages.
Event organizers couldn't say how much data was harvested at that event, but at an earlier DataRescue event that was held at the University of New Hampshire in February, about 40 people volunteering one night were able to seed about 1,100 pages that could be harvested by the web crawler.
At both the UNH and the Dover, N.H., events, they were working to save data from the EPA website.
Volunteers said when they went through the EPA sites, they found instances where pages or data sets already had been removed.
The EPA did not respond to a request for comment on whether scientific data on its site has been removed or altered. Both NASA and NOAA, however, said data has not been removed.
Lauren Moore, a front-end web developer and digital marketing manager with Durham, N.H.-based Blue Truck Studios, said she is passionate about protecting decades' worth of scientific research and has had to learn back-end coding skills to help with the DataRefuge effort.
"It's kind of overwhelming, but I'm getting the hang of it," said Moore, who volunteered at the recent New Hampshire event. "It's definitely worth it to learn a new language and do the work."
Clarice Perryman, a National Science Foundation fellow and a graduate student in earth sciences at the University of New Hampshire, said it's worth it to volunteer what little free time she has because she's concerned about protecting scientific research.
"The websites are deep, and the web mapping is not great. You need people to go in and figure out where things link together," Perryman said. "Regardless of political context, environmental data loss is a big issue … Public access to that flow of information is important, especially when you have politicians saying climate change isn't real and issues with water aren't real.
"It's about integrity," she said.
Daniel Mannarino, a programmer with IBM, was helping with training at the New Hampshire DataRescue event. He said saving scientific data isn't a matter of politics.
"Things can get lost perfectly innocently," he said. "We need data to actually stick around… otherwise you're doing everything from scratch and there just aren't enough resources to do everything from scratch. Science is standing on the shoulders of giants, so you have to make sure the shoulders are still there or we're lost."
It has been about two months since the Trump administration took office, but DataRefuge volunteers say it's not too late to keep trying to save as much data as they can.
"There just hasn't been a chance for it all to be changed yet," Perryman said. "The White House took all mention of climate change off the [WhiteHouse.gov] website on Inauguration Day… But if the data is so great and so deep that we're having a hard time archiving it all, it's probably so deep that they're having a hard time getting to it all. Maybe we're getting to it faster than they are."