Tuesday, August 3, 2010

PROJECT PROFILE: Using Computer Science at Harvard Forest to Increase Integrity of Scientific Conclusions

By Sofiya Taskova and Morgan Vigil

This summer, we have had the privilege of working with Dr. Emery Boose and Dr. Barbara Lerner on a project that is a mash-up of ecology and technology. For the past few weeks, we have been inundated with the buzzwords "data provenance", "sensor network", "Process Derivation Graphs", "Data Derivation Graphs", "stream discharge", and "weirs". Our headquarters is the Shaler Common room, where we do most of our programming, but we make a weekly trip out to the six hydrology sites in the forest to collect manual and logged data, as well as water samples and measurements for stream ecologist Dr. Henry Wilson of Yale University.


Our work addresses the problem of data provenance: how data is collected and processed, who interacts with it, what is added to it, where it originates, when it was collected, and similar questions. Increasingly, we use electronic sensors to collect data, which allows for massive data accumulation. However, this flood of data comes with a price: sensor drift, sensor failure, logger failure, server failure, and other "technical difficulties". By making data, and everything that affects it, transparent, data provenance lends integrity to the scientific conclusions drawn from digital data.
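To make the idea concrete, here is a minimal sketch in Java (the class and field names are our own invention for illustration, not part of the project's code) of what it means for a derived value to carry its provenance with it: every processing step appends to a history instead of overwriting it, so the chain of custody survives alongside the derived value.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * A toy provenance record attached to a single sensor reading: it answers
 * where the value came from, when it was measured, and what processing
 * steps (and actors) have touched it since.
 */
public class ProvenancedReading {
    final double value;         // the measurement itself
    final String sensorId;      // where the data originates
    final String measuredAt;    // when it was collected
    final List<String> history = new ArrayList<>(); // what was done, by whom

    ProvenancedReading(double value, String sensorId, String measuredAt) {
        this.value = value;
        this.sensorId = sensorId;
        this.measuredAt = measuredAt;
    }

    /** Derive a new reading, copying and extending the provenance trail. */
    ProvenancedReading apply(String operation, String actor, double newValue) {
        ProvenancedReading derived =
            new ProvenancedReading(newValue, sensorId, measuredAt);
        derived.history.addAll(history);
        derived.history.add(operation + " by " + actor);
        return derived;
    }

    public static void main(String[] args) {
        // A raw reading from a hypothetical weir sensor.
        ProvenancedReading raw =
            new ProvenancedReading(0.42, "weir-3", "2010-08-03T09:00");
        // A drift correction: the derived value keeps the full history.
        ProvenancedReading corrected =
            raw.apply("drift-correction(+0.01)", "qc-script", 0.43);
        System.out.println(corrected.value + " " + corrected.history);
    }
}
```

Because each step is recorded rather than applied destructively, a reviewer can later ask of any number in a published table: which sensor, which day, which corrections.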

Currently, Morgan is working with Little-JIL, a visual programming language that allows informative "metadata" to be captured while raw sensor data is processed into a more usable format. She is focusing on making the process more usable for non-programmers (specifically stream ecologists) by adding a graphical user interface with a setup wizard. She has also revised and built upon the existing Little-JIL process so that more provenance data about the input files and .properties files used in the process can be passed on to a DDG.

On a day-to-day basis, she programs (mainly in Java) and, once a week, collects water samples and digital data from the hydrology sensors in the forest.

So far, she has learned a lot about the software design process, especially with regard to collaborating with scientists from a different field. She has also learned about the importance of creating a data management protocol for the sake of reproducibility and reliability.

From here, Morgan's side of the project could go in many directions. She would like to see the process abstracted so that a scientist could use it to collect provenance data from any kind of sensor, to see it connected to a DDG-creating process, and to see it developed into an even more usable tool for scientists.



Sofiya is using a graph, a structure that describes the relations between the procedures and data instances that constitute a process, to represent the specific process that a piece of data undergoes. Her goal is to find the best representation of the different manipulations of the data and to collect the provenance information of a running process.

The central question she is addressing is how much information to collect about a process. Capturing maximal detail appears preferable, so as not to undermine any analysis that might later be of interest. In practice, however, these processes manipulate massive data sets, and the amount of provenance information explodes accordingly.

On a day-to-day basis, she works on a computer program that collects information about a process strictly defined in the graphical language Little-JIL. She also describes traversals of Process Derivation Graphs (graphs showing all possible execution paths) in terms of Data Derivation Graphs (graphs showing the execution paths of particular pieces of data), to help reason about the possible handlings of the data and the errors that may occur during the process.
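As an illustration only (the real Little-JIL and DDG tooling is far richer, and all names here are ours), a Data Derivation Graph can be thought of as a map from each node, whether a data instance or a procedure, to the nodes it was derived from; walking the edges backward from any result then recovers its full lineage:

```java
import java.util.*;

/**
 * A toy Data Derivation Graph: derive(output, procedure, inputs...) records
 * that a procedure consumed some inputs and produced an output. lineage()
 * walks the recorded edges backward to list everything a node depends on.
 */
public class TinyDDG {
    private final Map<String, List<String>> inputsOf = new HashMap<>();

    void derive(String output, String procedure, String... inputs) {
        inputsOf.put(procedure, Arrays.asList(inputs));
        inputsOf.put(output, List.of(procedure));
    }

    /** Every node (data or procedure) that `node` transitively depends on. */
    Set<String> lineage(String node) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> stack = new ArrayDeque<>(List.of(node));
        while (!stack.isEmpty()) {
            for (String parent : inputsOf.getOrDefault(stack.pop(), List.of())) {
                if (seen.add(parent)) stack.push(parent);
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        TinyDDG g = new TinyDDG();
        // Hypothetical hydrology pipeline: raw log -> calibrated -> discharge.
        g.derive("calibrated", "apply-calibration", "raw-log", "calibration-table");
        g.derive("discharge", "rating-curve", "calibrated");
        System.out.println(g.lineage("discharge"));
    }
}
```

Asking for the lineage of "discharge" surfaces both the procedures ("rating-curve", "apply-calibration") and the data ("calibrated", "raw-log", "calibration-table") behind it, which is exactly the question data provenance is meant to answer.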

While working on the project, she has become familiar with a recurring problem in scientific research: keeping data measurements consistent and suitable for analysis and comparison, despite variations in scientific methods and the myriad circumstances that can compromise the validity of the data.

In the future, Sofiya is looking to continue working on the software for collecting, storing, and using data provenance information. She is hoping that the basic functionality supported by the software will solve the problem of data provenance in strictly defined processes in many domains including science, engineering, and health care.
