Thoughts on Science
I have recently been spending quite a bit of time on data sharing. It is something I hadn't considered much in the past, but as I collect more and more data, it has moved closer to the forefront for me. Come to think of it, it should have been on my mind all along, since the data shared by all of the amazing Missouri Stream Team volunteers has been foundational to my entire PhD journey.
Part of the work I've been doing over the past three years has been funded by an EPA Urban Waters Grant. One of the stipulations of the grant was that the data I collected would be uploaded to EPA's database, called STORET (or WQX). I had used data from STORET in the past when I worked for the Iowa Department of Natural Resources, so I was familiar with the site. Unfortunately, my current dataset consists of Excel spreadsheets for 50 combinations of site and measurement parameter, and nearly all of these files have over 70,000 records! That is a lot of data to upload. After several attempts to upload the data and a series of long phone calls with a tremendously helpful EPA staff member, we came up with a reasonable solution for getting the data into the system. I ended up using their data template to upload about 4% of the data (one data point every 2 hours instead of one every 5 minutes), along with a link to a file that has the full data set.
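For anyone curious about the arithmetic behind that 4%, here is a minimal sketch of the thinning step in Python. It is not the script I actually used: the function name, the fake one-day series, and the assumption that the readings are evenly spaced with no gaps are all made up for illustration.

```python
from datetime import datetime, timedelta


def thin_records(records, source_minutes=5, target_minutes=120):
    """Keep one record per target interval from an evenly spaced series.

    Assumes (hypothetically) that records are sorted and have no gaps.
    """
    if target_minutes % source_minutes != 0:
        raise ValueError("target interval must be a multiple of the source interval")
    step = target_minutes // source_minutes  # 120 / 5 = 24
    return records[::step]


# Made-up example: one day of 5-minute readings (288 records).
start = datetime(2015, 1, 1)
records = [(start + timedelta(minutes=5 * i), 20.0 + 0.01 * i) for i in range(288)]

subset = thin_records(records)
fraction = len(subset) / len(records)
print(len(subset), f"{fraction:.1%}")  # prints: 12 4.2%
```

Keeping every 24th reading is where the "about 4%" comes from (1/24 is roughly 4.2%). A real logger file with missing or irregular timestamps would need to be thinned by time window rather than by position.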
A later conversation revealed that EPA is still working to improve the interface; it seems that EPA's data storage technology has not quite kept pace with the advances in data logging. While this may be understandable given both the bureaucratic/political constraints and the breadth of high-priority work being done by EPA, I see this as a major flaw in an agency that needs a lot of data to operate properly. Since we are in an era of "big data," it seems that having access to mid- and long-term, high-frequency data sets should be near the top of the agency's list of priorities. Since one of the priorities of the Urban Waters Federal Partnership is to take advantage of the overlapping interests of the many federal agencies that intersect in this realm (e.g., EPA, USACE, HUD, CDC, FEMA, NOAA, USDA), maybe they can partner with the National Science Foundation to modernize their data handling with an eye to the next anticipated breakthroughs in technology.