Preserving Data: Who’s Got Your Back(up)?

Patricia Kovatch

Data management has become an essential part of research. Scientists need to be able to rely on their data infrastructures to recover data in case of disaster or to assist with reproducibility of their results. Ensuring a dependable data infrastructure and backup process may not be the most exciting part of research, but the consequences of operating without one can be severe. Developing an effective backup strategy will protect your most valuable investment: your time.
In project management, risk is defined as the probability of an event multiplied by its impact.1 What would happen if you lost some or all of your research data? How long would it take to recreate the data, if you even could? The ability to reproduce your research results might be severely jeopardized.
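To make that arithmetic concrete, here is a back-of-the-envelope sketch in Python. Both figures are illustrative assumptions, not measurements from the references:

```python
# Back-of-the-envelope risk estimate: risk = probability of an event
# multiplied by its impact. Both inputs are illustrative assumptions.
p_loss_per_year = 0.086   # e.g., annualized failure rate of an aging drive (ref. 6)
hours_to_recreate = 200   # hypothetical time to regenerate the lost data

risk_hours_per_year = p_loss_per_year * hours_to_recreate
print(f"Expected loss: {risk_hours_per_year:.1f} hours per year")  # 17.2
```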

Most people either have their own stories or have heard stories about lost theses, presentations, and data. The emotional panic and lost time are draining at best. Preparing for highly probable failure will save time and drama. Moreover, most federal agencies and privately funded organizations that sponsor research require data retention and management plans. The University of Oregon Library has compiled a useful list of guidelines from 12 agencies, which generally require data retention for five to seven years.2 Luckily, it is not difficult to create an effective backup strategy.

If It Can Go Wrong, It Will

Unfortunately, there are an infinite number of ways in which things can go wrong, which translates into a high likelihood of data loss over time. Human failure may be the most common reason for data loss, beginning with the simple accidental deletion of files or the failure to create a data management plan. Even once a plan is in force, backed-up data sets need to be restored periodically to verify that they can still be recovered. Add in electro-mechanical failures, fire, and natural disasters, and one begins to see the inevitability of data loss. A particularly sad case study can be found on the Stanford University Libraries website.3

If you are saving data now, it is probably to disk drives or tapes. Although manufacturers ship disk and tape drives with mean-time-between-failures ratings measured in decades, the real rate of failure for any individual drive or component is much higher.4,5 For example, Google published a paper in 2007 showing three-year-old drives failing at an annualized rate of over 8.6%.6 Another effort looked at over 100,000 disks and found failure rates up to 13% for some systems.7 To improve reliability, most storage systems use redundant configurations, but these configurations are also subject to failure because the additional disks are usually the same age and fail at similar times.7,8 Modern backup tapes are similar to 1970s cassettes, and they break in similar ways. Over time, the tapes become harder to read, and because of their high failure rate, backed-up data are normally copied to two tapes to improve the likelihood of recovery.
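To see why failure becomes nearly inevitable at scale, consider the probability that at least one of n drives fails in a year. The sketch below assumes independent failures at a fixed annual rate, an idealization the studies above caution against, since same-age drives tend to fail together:

```python
# Probability that at least one of n drives fails within a year,
# assuming independent failures at annual rate p. Independence is an
# idealization: same-age drives fail at correlated times (refs. 7, 8).
p = 0.086  # annualized failure rate for three-year-old drives (ref. 6)
for n in (1, 10, 100):
    p_any_failure = 1 - (1 - p) ** n
    print(f"{n:>3} drives: P(at least one failure) = {p_any_failure:.1%}")
# Prints roughly: 1 drive: 8.6%, 10 drives: 59.3%, 100 drives: ~100%
```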

Different parts of the world are subject to different natural disasters, with 2012’s Hurricane Sandy still fresh in the minds of many New Yorkers. One lesson from that disaster: “offsite” backups in New Jersey were subject to the same flooding conditions as New York City, and some computer rooms and data stored at lower elevations were completely wiped out.

Developing a Strategy

There are many backup options available to researchers, but selecting one or combining a few into an effective strategy depends on several factors. The best approach is to first determine the following: 1) which data are unique and/or time-consuming to reproduce, 2) how much storage the data require (in megabytes, gigabytes, terabytes, or more), and 3) how often the data need to be used. Is it a static snapshot to be saved over a long period of time, or is it active data needed to recover from a disaster?
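For the second question, a few lines of code can measure how much storage a data set actually occupies. This is a minimal sketch, assuming Python 3 and a hypothetical data directory:

```python
# Sum the sizes of all regular files under a directory tree.
from pathlib import Path

def total_size_bytes(root: str) -> int:
    """Total size, in bytes, of all regular files under root."""
    return sum(p.stat().st_size for p in Path(root).rglob("*") if p.is_file())

size = total_size_bytes("data")  # "data" is a hypothetical directory name
print(f"{size / 1e9:.2f} GB to back up")
```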

Data that will be preserved for the long term should be saved as an archive, while a backup generally saves active data and enables you to recover from a failure. An archive can be made using the same software as a backup, but it is normally performed only once, while a backup is normally repeated regularly. Stanford University and the University of Oregon have excellent descriptions of the best practices to consider when developing your plan for backing up and archiving.9,10

Another helpful source with a broader view of all aspects of data management, geared specifically to address the Responsible Conduct of Research, was developed by Clinical Tools and funded by the U.S. Department of Health and Human Services’ Office of Research Integrity. It has examples of specific situations regarding data collection, ownership, storage, retention, sharing, protection, reporting, and analysis, and is an invaluable asset for building a comprehensive data management strategy.11 Once you have identified your requirements and understand the options, it will be easier to craft a plan suited to your needs.

There are at least three options for backing up or archiving your data. You could purchase a couple of hard drives, manually make copies, and leave the drives in geographically distributed locations to reduce the risk of disaster in one location. This may give you ultimate control and might be preferable depending on privacy concerns, but it also leaves you primarily responsible.
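As a minimal sketch of this first option, the following script copies a data directory to an external drive and verifies each copy against a SHA-256 checksum of the original. The paths are illustrative assumptions; the same copy would then be repeated onto a second drive stored at another site:

```python
import hashlib
import shutil
from pathlib import Path

def sha256(path: Path) -> str:
    """SHA-256 hex digest of a file, read in 1 MB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

src = Path("data")                    # hypothetical data directory
dst = Path("/mnt/backup_drive/data")  # hypothetical external drive mount

shutil.copytree(src, dst, dirs_exist_ok=True)  # requires Python 3.8+

# Verify every copied file against the original before trusting it.
for original in src.rglob("*"):
    if original.is_file():
        copy = dst / original.relative_to(src)
        assert sha256(original) == sha256(copy), f"mismatch: {original}"
print("Copy verified; repeat onto a second drive kept at another site.")
```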

As a second option, most academic institutions offer services to save data to their computer centers or libraries. The advantage is that IT professionals worry about the reliability of the hard drives for you, although this solution depends on sufficient network connectivity and trust in your local institution.
A third option that also depends on network connectivity and trust is the “cloud.” A variety of companies offer archive and/or backup services for Mac, Windows, and Linux. Costs can vary widely, so understanding how much data you need to back up will help you select an option that will work for you. Most services encrypt the data automatically, which may alleviate privacy concerns. The United States Computer Emergency Readiness Team published a valuable document on the advantages and disadvantages of backing up data to the cloud, local drives, and tapes. They suggest keeping multiple copies of important files in different locations and on different media to reduce the impact of the inevitable failures.12

It is usually best to have backups run automatically, because a backup that must be run manually is more likely to be forgotten. Most backup software works this way, but the software still has to be monitored to make sure no errors are preventing it from running or from actually backing up your data. The backup should be retrieved at least once a year to verify that it is still viable; a sketch of this verification step follows. At the same time, it is also valuable to review your backup strategy to see if it still meets your requirements. There may be more cost-effective options available, or new data that need to be saved and added to the backup process.
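One way to make the annual check concrete: record a manifest of checksums when the backup is made, then retrieve the backup a year later and confirm every file still matches. A minimal sketch, with hypothetical file and directory names:

```python
import hashlib
import json
from pathlib import Path

def make_manifest(root: Path) -> dict:
    """Map each file's path (relative to root) to its SHA-256 digest.
    Reads whole files into memory, which is fine for modest data sets."""
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in root.rglob("*") if p.is_file()
    }

# At backup time: save a manifest alongside the backup.
Path("manifest.json").write_text(json.dumps(make_manifest(Path("data"))))

# At verification time: compare the retrieved copy against the manifest.
expected = json.loads(Path("manifest.json").read_text())
actual = make_manifest(Path("restored_data"))  # hypothetical restore location
bad = [name for name in expected if actual.get(name) != expected[name]]
print("Backup verified." if not bad else f"Corrupted or missing: {bad}")
```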

It Is Worth Your Time

Take the time to set up and regularly review your data management strategy. One data-dependent scientist I know had his laptop fall off his motorcycle and get run over by a car. Thanks to his CrashPlan cloud backup, he was up and running on a different laptop in less than two hours. Contemplate the impact of losing your data sets and develop a plan to protect them. Be vigilant, and schedule time to test your backups at least once a year. Can you afford to risk the alternative?

— Patricia Kovatch, Icahn School of Medicine at Mount Sinai

Note
The author is Associate Dean for Scientific Computing, Associate Professor of Genetics and Genomic Sciences, Associate Professor of Structural and Chemical Biology, and Co-Director, Master of Science in Biomedical Informatics.


References

  1. Wikipedia article on Risk Management. http://bit.ly/1SKGnYK.
  2. University of Oregon. Best Practices for Research Data Management: Funding Agency Guidelines. http://bit.ly/1P4Lhdk.
  3. Stanford University Libraries. Case study: Data storage and backup. http://stanford.io/1l1k0k4.
  4. Wikipedia article on Hard Disk Drive Failure. http://bit.ly/1H9ZrXE.
  5. Kovatch P, Ezell M, Braby R (2011). The Malthusian catastrophe is upon us! Are the largest HPC machines ever up? Euro-Par 2011: Parallel Processing Workshops, Volume 7156 of the series Lecture Notes in Computer Science, pp 211–220.
  6. Pinheiro E, Weber W, Barroso L (2007). Failure trends in a large disk drive population. Proceedings of the 5th USENIX Conference on File and Storage Technologies (FAST’07).
  7. Schroeder B, Gibson G (2007). Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you? Proceedings of the 5th USENIX Conference on File and Storage Technologies (FAST’07).
  8. Wikipedia article on RAID. http://bit.ly/1VoF559.
  9. Stanford University. Storage and backup. http://stanford.io/1ne143o.
  10. University of Oregon. Best Practices for Research Data Management: Data Storage and Backup. http://bit.ly/1N3Ee2Z.
  11. Coulehan M, Wells J (2006). Guidelines for Responsible Data Management in Scientific Research. http://1.usa.gov/1mRRvXT.
  12. Ruggiero P, Heckathorn M (2012). Data Backup Options. http://1.usa.gov/19Ge62x.
