
Drowning in Data

Bob Congdon writes on something we’re all living through: the decline of “hard media” (paper, LPs, even CDs, etc.) and the rise of digital media.

From a green perspective, getting rid of all of this hard media is a good thing. Why print out documents when you can read them on your computer? Why should publishers print hundreds of thousands of copies of a newspaper each day to be read once and tossed out? The same with weekly and monthly magazines. Why produce millions of CDs that just end up in landfills?


I agree that the trend is here to stay, but I, personally, am scared to death. I think we’re headed for disaster. The problem is that few of us have an adequate backup regime for all of this data. When disaster hits, and a single disk drive holds all of our downloaded commercial software, our e-books, our electronic documents, our financial data, our music, our photographs, etc., then we’ve lost everything. So what used to require a devastating house fire will now hit unprepared users every time their hard drive fails. We tend to have all our eggs in one basket now, and a single failure has a far greater impact.

Sure, we could back everything up. I used to do that. Floppies, ZIP drives, tapes, CD-ROMs, DVD-ROMs, external drives, online backup services: I’ve done them all over the years. The problem is that my data needs keep on increasing. Back in 1990 my entire data needs, a few dozen WordPerfect files, could fit on a single floppy disk. Today, a single photograph, in RAW format, can take 10x that amount of storage. Add to this music files (at high bit rates), video files (now in high definition, of course), and so on, and I’m nearing a terabyte of data at home. Forget about backing up to 125 double-sided DVD-Rs. Forget about online backups — the latency would make a backup take a month. We’re not going to change the speed of light so that option will never scale. All I can really do is archive to a portable hard drive, and even then I have space only for the most recent snapshot, not a history of recent backups. This is fine for recovering from a system failure, but I’d be in trouble if serious data or file corruption made it into my backups before I noticed.
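For a rough sense of scale, here is a back-of-the-envelope sketch in Python; the 3 Mbps upstream rate is an assumed figure for a fast consumer link of the era, not one from the post:

```python
# Time to push a full backup over a consumer upstream link,
# assuming the link stays fully saturated and ignoring protocol overhead.

def upload_days(size_bytes: float, upstream_mbps: float) -> float:
    """Days needed to transfer size_bytes at upstream_mbps megabits/second."""
    seconds = size_bytes * 8 / (upstream_mbps * 1_000_000)
    return seconds / 86_400  # seconds per day

ONE_TB = 1_000_000_000_000  # decimal terabyte

print(round(upload_days(ONE_TB, 3.0)))    # ~31 days at 3 Mbps: about a month
print(round(upload_days(ONE_TB, 0.384)))  # ~241 days at 384 kbps
```

At around 3 Mbps, a full terabyte upload really does take about a month of saturated uplink.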

So, yes, we use less paper. But my unread e-book folder gets larger and larger. My unlistened-to playlist grows longer and longer. My unwatched shows on TiVo continue to accumulate. I have no assurance that I will catch up before data disaster hits. I know I should be feeling green, but instead I’m feeling blue. I could sure use some quantum storage right around now.

12 comments
  • Anonymous 2009/02/10, 04:31

    I think you are not scared enough.

    In my experience, backups are only useful with live data. So after a crash you can reconstruct the data.

    But only as long as you know what data you have and where to find it.

    And you have to store the programs to access the data with the data themselves.

    See, e.g., this rather old paper:

    Best Practices for Digital Archiving: An Information Life Cycle Approach

    People organize workshops about this:

    Sustainability of Language Resources and
    Tools for Natural Language Processing

    Winter

  • Jesper Lund Stocholm 2009/02/10, 06:14

    Rob,

    I don’t agree with you that online backups are not an option. After being in your position, wondering what to do with all the digital stuff our family has accumulated, I ended up using an online backup service.

    Granted, the initial backup of around 20 GB of data took almost an eternity (and I don’t even have a fixed DSL line but “only” a 3G broadband connection with 386 kbit/s upstream), but now it is all a breeze and the client I have installed takes care of it for me.

    The cost is also quite low – $4.95/month for unlimited storage.

  • Mark S. 2009/02/10, 10:44

    If your data is really that important, here is how to ensure its safety:

    1) Most PCs now have RAID support built into the motherboard, so pony up for a second internal hard drive and set it up as RAID-1 (mirrored drives). That takes care of the common “dead-drive-ate-all-my-data” scenario.

    2) Keep your important data segregated from “junk” data (ripped CDs/DVDs) and the OS. This can drastically reduce backup times and simplify data recovery when needed.

    3) Use two (or more) external hard drives in a weekly rotation, and keep one off-site at all times. This handles the “house-fire” and “virus-corrupted-data” scenarios that RAID is helpless against.

    4) Subscribe to an online backup service. When it comes to backups, the belt-and-suspenders approach can save your butt.

    And most important of all:

    5) TEST YOUR BACKUPS! Make sure that the files you THINK are being backed up can ACTUALLY be restored. I guarantee that any untested backup will not work the way you think it will. Trust me. :(

    And if all of that is just too complicated, then I guess your data is not that important after all.

    Cheers!
    Mark S.

  • Lucas 2009/02/10, 13:30

    I completely agree with Rob on this issue… I have a server with 1 TB of storage (yes, quite small these days), now about 60–70% full of documents, videos, pictures, etc. that I want to keep. With an online service (and especially with the bandwidth caps popular with ISPs these days), how long would transferring 700 GB take? Even with 384 KB/s upstream at 100% capacity (unlikely), it would take 22 days just to copy the data. That’s not even mentioning the security and/or privacy issues of using an online service.

    Of course I could burn all that data onto 175 DVDs — but the reliability of the DVDs comes into question: how long would those discs last in storage? No one is really willing to quantify it.

    My favorite backup medium of the past was the MO disc, which has a life of 10+ years. However, with the explosion in hard drive capacity and the relatively tiny regard for backups, storage needs now far outpace the capacity of an MO disc, which tops out at 9 GB.

    I guess I’m doomed to keep a bunch of backup hard drives — one set for frequent (weekly) backups, another set kept in safe deposit boxes at the bank and updated perhaps monthly, and maybe another set at some other location.

    Sad and frustrating… but, that is the world we live in…

  • Rob 2009/02/10, 15:35

    I guess what I am saying is that today the average computer user is far more exposed to the risk of catastrophic data loss than ever before in history. Our data accumulation has outstripped our backup strategies.

    Sure, there are ways to cope — Mark has some good ideas here. But would anyone care to estimate how many people actually manage to do this? My guess is very few.

    I want a backup solution that is idiot-proof. I want a subscription plan that mails me a new terabyte drive every two weeks. I plug it in via USB, and it automatically encrypts and copies everything. I put the drive into a pre-addressed, pre-paid padded mailer and send it away for offsite storage. The data could then be duplicated off-line, the disks erased, and sent out to the next person. I don’t actually own any of the disks. I’m just paying for this as a service, like Netflix.

    Say the drive costs $200 with a useful life of 2 years with that level of abuse. Mailing is $10 either direction. So costs over 2 years per subscriber would be 200+(2*10*26) = $720, or $30 per month. Would you pay $50/month for this level of security? $75 a month? Compare this to Amazon’s S3 cloud storage, where 1TB costs $150/month just for storage, not including data transfer fees. The latency of the US Postal Service may be large, but the bandwidth is huge!

  • Chris Ward 2009/02/10, 17:46

    Places like CERN expect that some of their data will be unreadable; some of the tapes they store things on will ‘unaccountably fail’ when called on.

    So the scientists have to design their experiments with this in mind; to be able to reconstruct enough given ‘most of’ the data.

    You hope they’re not processing mortgage payments. But I think their targets are different.

    And what kind of warranty do you want? I guess at the ‘low bid’ end of things, you would hope that the backup service would replace ‘lost or damaged’ media with the equivalent amount of unused material. At the high end, is it possible to insure enough so that you could repeat the relevant life experiences, and reconstruct the archive that way? Would you want to?

    And then in 100 years, it’s all over anyway. Ashes to ashes. That puts a limit on the value.

  • Anonymous 2009/02/11, 02:13

    I think the solution is to get rid of the word “backup” and replace it with “distributed storage”.

    The two points of interest are:
    – Unused backups become stale. They will fail silently, get lost, or their metadata (what’s in there) gets lost

    – Backing up to live online services is unrealistic (22 days of saturated uplinks)

    In contrast, the next generation of file systems has virtually unlimited distributed and replicated storage, e.g., Hammer, Btrfs, ZFS.

    Connect such a local file system with an off-site/on-line file system and let it replicate snapshots constantly.

    You do not create 2 TB of data a day. If your data creation rate is modest, e.g., tens of GB a day, you can handle that with background uplinks.
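    As a sanity check on that claim, here is a quick sketch; the 30 GB/day figure is an assumed example of a “modest” rate, not a number from the comment:

    ```python
    # Sustained upstream rate needed to replicate a daily data-creation
    # rate continuously in the background.

    def sustained_mbps(gb_per_day: float) -> float:
        """Megabits/second needed to move gb_per_day (decimal GB) in 24 hours."""
        return gb_per_day * 1_000_000_000 * 8 / 86_400 / 1_000_000

    print(round(sustained_mbps(30), 1))  # ~2.8 Mbps keeps up with 30 GB/day
    ```

    A few Mbps of sustained upstream is enough, which is why background replication is plausible where a one-shot full backup is not.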

    If any of the storage facilities fail, the data are transparently replicated again.

    And then it does not even matter whether you use the internet or surface mail to “replicate”.

    Winter

  • stevenj 2009/02/11, 12:29

    You wrote: “Forget about online backups — the latency would make a backup take a month. We’re not going to change the speed of light so that option will never scale.” But this makes no sense to me. Latency is not a serious issue for online backup, only bandwidth, and bandwidth is not limited by the speed of light.

  • Victor 2009/02/12, 10:59

    “So costs over 2 years per subscriber would be 200+(2*10*26) = $720, or $30 per month. Would you pay $50/month for this level of security?”

    This is bogus. You still need some storage on the far end (probably the same 1 TB per user), to pay some salaries, etc. In the end it’ll be the same $150/month.

    The only reason online storage services can offer “unlimited” tariffs for $4.95 or so is that you have limited bandwidth and will hopefully not use it to back up petabytes of data (hmm… interesting idea: rent such a service and upload 1 PB of random data there).

    BTW I’ve recently checked stats from my ISP: I’m actually uploading/downloading something like 300–500 GB per month — is that really so far from the 1 TB needed for a full backup? And do I really need a full backup every month?

    I’m pretty sure online backup is the future. The latency argument is bogus and bandwidth is growing constantly…

  • Rob 2009/02/13, 09:51

    I’m not convinced online backup solutions will work. The best arguments I’ve heard say that they are marginally useful now. If you have good bandwidth and your data changes slowly, then the initial backup might take a few weeks, with incremental backups after that.

    However, a quick calculation shows that sending a 1 TB hard drive in the mail to a central processing center has a bandwidth of around 50 Mbps, assuming Netflix-like service centers with a 2-day delivery time. How many people have 50 Mbps bandwidth to the outside world today from their homes?
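    That figure is easy to verify with a sketch, using the same assumptions as above (a 1 TB drive, two days in transit, copy time on either end ignored):

    ```python
    # Effective bandwidth of mailing a hard drive:
    # total bits moved divided by transit time.

    def sneakernet_mbps(size_bytes: float, transit_days: float) -> float:
        """Effective megabits/second of shipping size_bytes in transit_days."""
        return size_bytes * 8 / (transit_days * 86_400) / 1_000_000

    print(round(sneakernet_mbps(1_000_000_000_000, 2)))  # ~46 Mbps for 1 TB in 2 days
    ```

    And unlike a DSL line, this “link” gets faster every time drive capacities grow.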

    Then consider the next 10 years. If the current trend continues, my data backup needs will double or quadruple in that time. Sending a hard drive through the mail scales up to that task nicely. But do we really think we’ll have 200 Mbps internet connections at home by then?

    I think the “last mile” connectivity problem is the hard one. The Post Office is the master of the last mile. They can get any small package from one place in the country to any other place in the country overnight. But DSL, cable modems, etc., all rely on miles and miles of cables buried under roads and hanging from poles. As we saw in the recent ice storms, this infrastructure is barely able to stay up. But even though I lost power, phone and cable for a week in the ice storm, I’ll tell you that my mail was still being delivered!

    So that’s the question — what will scale up as data needs increase?

  • putt1ck 2009/02/13, 11:54

    And I say unto you: Unison (not the oddly confused British union) — assuming you have a permanent Internet connection at home and the wherewithal to put in a low-power server (VIA/Atom). Sync multiple boxes with diverse operating systems, transfer diffs rather than whole changed files, etc.

    If you don’t have a permanent Internet connection, you can do it between two boxes when at home or at work.

  • Anonymous 2009/02/16, 04:31

    @Rob:
    “However, a quick calculation shows that sending a 1TB harddrive in the mail to a central processing center has a bandwidth of around 50 Mbps, assuming Netflix-like service centers with 2-days delivery time.”

    You are in good company here:

    “Never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway.” —Tanenbaum, Andrew S. (1996). Computer Networks. New Jersey: Prentice-Hall. p. 83. ISBN 0-13-349945-6.

    A 1 TB disk would do the job too.

    Winter
