Data de-duplication

Data de-duplication is a recent technological advance that is fundamentally changing the economics of disk storage and data transmission. We have all experienced some of its benefits in a primitive form called Single Instance Store (SIS).

With SIS, when an exact duplicate of a file is stored in a system, only the first instance is retained. Subsequent instances are stored as a tiny pointer, using virtually no additional space. Users of Microsoft’s Exchange have, often unknowingly, relied on this for years whenever an email attachment is sent to a list of internal recipients.
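As a rough sketch of the idea (illustrative code only, not how Exchange or any particular product implements it), a single-instance store boils down to a content-addressed table plus pointers:

```python
import hashlib

class SingleInstanceStore:
    """Toy single-instance store: identical files are kept once and
    every later copy is recorded only as a tiny pointer."""

    def __init__(self):
        self.blobs = {}   # content hash -> file bytes, stored once
        self.refs = {}    # logical name -> content hash, a tiny pointer

    def put(self, name, data):
        digest = hashlib.sha256(data).hexdigest()
        if digest not in self.blobs:   # first instance: keep the bytes
            self.blobs[digest] = data
        self.refs[name] = digest       # every instance: keep only a pointer

    def get(self, name):
        return self.blobs[self.refs[name]]

store = SingleInstanceStore()
attachment = b"...bytes of the quarterly report..."
for user in ("alice", "bob", "carol"):
    store.put(f"{user}/report.pdf", attachment)   # bytes stored once, pointed to three times
```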

Only one instance of the attachment is stored in the Exchange database until a user modifies their copy in some way. Understandably, this can save considerable space.

This same concept can be extended to the storage of files on a server, and a small number of applications support this (for example EMC Centera and HP RISS).

However, while it is good to remove the redundancy of identical files, what about similar files? It is common to have two files where one is a derivative of the other: for example, a PPT with a new title slide and a couple of extra slides, a database copy with a subset of rows, a new version of a video or audio file with an audio overlay in minutes 7-9, or a versioned document that differs from its original only by the five new paragraphs added that day.

In truth, granular data changes within files are frequent across organisations, and technologies that store data more efficiently are heralded as the next growth engine for the storage industry, greening data centres through significantly smaller footprints and clear savings on power and cooling.

Breaking data down into granular chunks much smaller than individual files requires advanced analysis of the natural boundaries inherent in data structures, identifying them in a repeatable and consistent way across multiple contexts: for example, an element of a JPEG image versus that same image element when embedded in a PPT or Word document. Such algorithms need to find the same patterns consistently to be of value. This is the true definition of data de-duplication.
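As a simplified sketch (illustrative only, not Quantum’s patented algorithm), variable-length, content-defined chunking declares boundaries from a rolling hash over a small window of nearby bytes, so the same content yields the same chunks wherever it appears:

```python
def chunk(data: bytes, window=48, mask=0x1FFF, min_size=2048, max_size=65536):
    """Content-defined chunking sketch: declare a chunk boundary wherever a
    rolling hash of the last `window` bytes matches a fixed bit pattern.
    Because the decision depends only on nearby content, identical data
    produces identical chunks even when shifted inside a larger file
    (e.g. a JPEG embedded in a PPT document)."""
    BASE, MOD = 257, 1 << 32
    pow_out = pow(BASE, window - 1, MOD)   # factor for the byte leaving the window

    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        if i < window:
            h = (h * BASE + byte) % MOD
        else:
            h = ((h - data[i - window] * pow_out) * BASE + byte) % MOD
        size = i - start + 1
        if (size >= min_size and (h & mask) == 0) or size >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks

# Each chunk is then hashed and stored once, so a file becomes a list of chunk
# references and two similar files share every chunk they have in common.
```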

The most obvious initial deployment of such a technology was in backup and recovery, because of some key business drivers:
1. Compliance, recovery service levels and improved best practice standards are mandating more and faster data recovery points
2. Users are looking to move more toward disk-based backup and recovery (and de-duplication is essentially a disk-based technology)
3. Importantly, backup and recovery is a highly repetitive and high-profile systems management function, and lends itself to de-duplication with the best returns.

Users of de-duplication solutions deployed in this mode typically see data reduction in the order of 10 to 50 times, and the technology enables a higher standard of backup and recovery. De-duplication users are doing more with backup and recovery than they ever thought possible.
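A back-of-the-envelope calculation shows where ratios of that order come from; the retention and change-rate figures below are assumptions for illustration, not measurements:

```python
# Assumed figures: 30 retained daily full backups of a 1 TB data set,
# with roughly 2% of the chunks changing between consecutive backups.
full_size_tb = 1.0
retained_backups = 30
daily_change_rate = 0.02

logical_tb = full_size_tb * retained_backups                     # data written by the backup app: 30 TB
stored_tb = full_size_tb + full_size_tb * daily_change_rate * (retained_backups - 1)
                                                                 # first full plus new chunks only: ~1.58 TB
print(f"de-duplication ratio ~ {logical_tb / stored_tb:.0f}:1")  # ~19:1
```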

As such, selling backup isn’t what it used to be.

A fact of IT life is that backup windows continue to shrink while the number of servers and the sheer quantity of data continue to grow. Until recently, the only way to back up this rising tide of data and servers was to attach more tape drives and run multiple backup jobs. Then along came ‘backup to disk’, which solved part of the problem by allowing many jobs to run concurrently, but these systems eventually ran out of bandwidth because the low-cost disk was usually slower than the servers it was backing up.

Low-cost disk evolved into Virtual Tape Libraries (VTLs) with awesome throughput. They arrived in the mid-range market at about the same time as LTO-3, which was fortunate, since the vast majority of servers do not have sufficient throughput to keep an LTO-3 tape drive streaming.

The accepted wisdom is that most primary backup should first be staged to disk or VTL, where it is retained for anything from a week to several months, and then copied to tape for off-site storage or archive. De-duplication has further enhanced the appeal of disk-based backup, as it increases the quantity of backup data that can be retained by 10 to 50 times. In addition, de-duplication devices can replicate data between sites, which offers the possibility of removing tape from remote sites.
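To see why replication can make tape unnecessary at remote sites, here is a hedged sizing sketch; the change rate and link speed below are assumptions, not product figures:

```python
# Assumed figures for a single remote site.
nightly_backup_gb = 500     # logical size of the nightly backup
new_chunk_fraction = 0.02   # share of chunks the target site does not already hold
link_mbps = 50              # usable WAN bandwidth

replicated_gb = nightly_backup_gb * new_chunk_fraction   # ~10 GB actually sent
hours = replicated_gb * 8 * 1024 / (link_mbps * 3600)
print(f"~{replicated_gb:.0f} GB over the WAN, about {hours:.1f} h on a {link_mbps} Mbps link")
```

Because only chunks the target does not already hold are sent, an overnight replication window is realistic even on a modest WAN link.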

To put all this into perspective: previously, if you were selling a backup solution to an organisation with a large head office and several remote sites, you would have offered a large tape library with many drives for head office, smaller libraries with single drives for the remote sites, and a courier service to move tapes between sites.

With current technology, the solution would be entirely different. There would be VTLs with de-duplication capability at each site and backups would replicate between them. A single tape library with a small number of high-capacity LTO-4 drives would be situated at head office for archive, and to generate media for off-site storage.

In the same way that servers and storage have evolved to be virtual and easily manageable, so has backup. Don’t be left behind by thinking that backup is just about tape.

As a sign of the times, Quantum, one of the world’s largest backup vendors, now has more disk-based devices in its product catalogue than conventional tape libraries.

Granular de-duplication technology is being extended to other areas of storage, not just backup and recovery. For example, the StorNext HSM system is a world-first data management platform offering a feature called DRS (Data Reduction Storage), which can eliminate granular redundancy in primary storage applications. This offers users more overall capacity for a lower price, with side benefits of reduced power, cooling and footprint. Unlike backup, which is highly repetitive, de-duplication in such primary storage applications typically delivers a saving of around 10 to 50 per cent, though this varies widely depending on how these systems are used and the types of redundancy found in the data.

The algorithms that perform de-duplication today are essentially software-based, and as such consume computer resources to parse data into granular chunks. Understandably, in high-performance storage and retrieval systems, performance is paramount. The art will be to supercharge these algorithms by running them in a hardware ASIC, akin to how hardware compression works today in tape drives and the disk drives in VTLs. In backup and recovery, users avoid software compression because it impedes backup performance. Similarly, once de-duplication is driven down to the hardware level, users will bypass software offerings because of the lack of speed, and this applies whether the de-duplication operates at the backup and recovery level or in primary storage.

The area of de-duplication is maturing fast, and we already see strong market acceptance. Much new development will occur over the coming 12 to 24 months, including introduction into a broader range of applications.

A word from Quantum
Quantum is the leading global specialist in backup, recovery and archive and holds the pioneering patent in the most effective method of data de-duplication, known as variable-length data de-duplication.

Quantum is leveraging its leadership in data de-duplication technology to provide customers with a range of solutions that deliver the full benefits of this technology and more.

DXi-Series disk-based backup and recovery appliances
Quantum’s data de-duplication technology is a critical element of its DXi-Series disk-based backup and replication appliances.

It allows users in mid-range and data centre environments to retain 10 to 50 times more backup data on fast recovery disk and cost-effectively store data for months instead of days.

These appliances also enable WAN-based remote replication of backup data as a practical tool for disaster recovery between distributed sites such as data centres and regional offices.

De-duplication is a core element of the DXi-Series’ integrated software layer that also includes a high-performance embedded file system, support for high-speed data compression, interface flexibility and technology links to enable such services as remote diagnostics, monitoring and alerts.

StorNext SAN File System and HSM software
Quantum has also integrated its data de-duplication technology into its StorNext data management software, making it the first solution in its class to provide such functionality for archiving.

StorNext helps users share, retain and re-use revenue-generating digital assets such as video, images, audio files, and research-analysis data sets by providing resilient, high-speed access to shared pools of such data on both LAN and SAN servers.

StorNext is also built on an open architecture and has embedded data movement technology which transparently moves data between storage tiers for better cost control and data protection.



By Craig Tamlin, country manager Australia & NZ, Quantum, and Ross Smith, channel sales manager WA & SA, Quantum