UITS Research Technologies

Cyberinfrastructure enabling research and creative activities

Scholarly Data Archive

The Scholarly Data Archive (SDA) is a distributed storage service offered by Indiana University to faculty, staff, and graduate students who need large-scale archival or near-line data storage, arranged in large files, for their research projects. The SDA at IU is delivered using a consortium-developed product called the High Performance Storage System (HPSS). (The terms SDA and HPSS are often used interchangeably to describe the same service.)

A hierarchical storage management (HSM) system by design, HPSS makes transparent (to the user) use of a hierarchy of storage media to provide massive data storage capacity. At Indiana University, this hierarchy includes disk caches totaling roughly 150 terabytes (one terabyte is a thousand gigabytes, a million megabytes, or a trillion bytes), backed by two high-end tape libraries which provide a total uncompressed data storage capacity of nearly 5.7 petabytes (PB). Such a near-line, tape-based storage system mediated by fast, efficient disk caches gives the user the appearance of massive disk capacity at a fraction (usually a hundredth) of the cost of storing the same data on spinning disks.

While the names of files placed in the SDA remain visible to the user, the actual data bits migrate to tape if not accessed for a period. Once on tape, the data take some time (up to two minutes per file) to retrieve when the file is accessed again, as the tape robot locates, mounts, and reads the tape. Due to the overhead involved in manipulating data this way, the SDA is not well suited to storing a large number of small files.
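A common way to work around the small-file limitation is to bundle many small files into a single archive before moving them to the SDA. The Python sketch below illustrates one such bundling step with the standard tarfile module; the directory and archive names are hypothetical placeholders, and the script only prepares the archive locally rather than talking to HPSS itself.

    import tarfile
    from pathlib import Path


    def bundle_directory(src_dir: str, archive_path: str) -> None:
        """Pack every file under src_dir into one gzip-compressed tar archive."""
        src = Path(src_dir)
        with tarfile.open(archive_path, "w:gz") as tar:
            for path in sorted(src.rglob("*")):
                if path.is_file():
                    # Store paths relative to src_dir so the archive extracts cleanly.
                    tar.add(path, arcname=str(path.relative_to(src)))


    if __name__ == "__main__":
        # Hypothetical paths -- substitute your own data directory and staging area.
        bundle_directory("results/run_042", "run_042.tar.gz")

The resulting single large file transfers and retrieves far more efficiently than thousands of individual small files, since each file stored on tape carries its own mount-and-seek overhead.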

Once a user has an SDA account, the service can be accessed from any networked host that offers at least a TCP/IP-based file transfer protocol client. Higher-performance access methods are also available, namely parallel FTP (PFTP) and the Hierarchical Storage Interface (HSI), as well as an HPSS API for programmers.
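As a rough illustration of the simplest access path, the sketch below pushes a single large file to the archive using Python's standard ftplib client. The hostname, credentials, and the assumption that the endpoint accepts TLS-protected FTP sessions are placeholders rather than documented SDA settings; for large or parallel transfers, the PFTP and HSI clients mentioned above are the higher-performance choices.

    from ftplib import FTP_TLS

    # Hypothetical values -- substitute the real SDA hostname and your IU credentials.
    SDA_HOST = "sda.example.edu"
    USERNAME = "your_username"
    PASSWORD = "your_passphrase"


    def upload(local_path: str, remote_name: str) -> None:
        """Send one (preferably large) file to the archive over FTP with TLS."""
        ftps = FTP_TLS(SDA_HOST)      # connect on the standard FTP control port
        ftps.login(USERNAME, PASSWORD)
        ftps.prot_p()                 # protect the data channel as well
        with open(local_path, "rb") as fh:
            ftps.storbinary(f"STOR {remote_name}", fh)
        ftps.quit()


    if __name__ == "__main__":
        upload("run_042.tar.gz", "run_042.tar.gz")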

IU's implementation of HPSS has a couple of notable features: We are the first site in the world to have the hpssfs interface to HPSS in production. Also, we are the first HPSS site to implement a remote data mover (at IUPUI) in production. IU's remote mover has demonstrated the feasibility of a widely distributed (across a wide area network, or WAN) HPSS in which data stored or accessed by users at IUPUI are served locally by the IUPUI data mover, at high, local area network (LAN) speeds. A small stream of metadata (administrative data about the files stored) flows on the WAN segment between IUB and IUPUI; this is necessary since the metadata engine is located at IUB. Such a widely geographically distributed storage system design is highly cost-effective and is of great interest to many.

With the introduction of a high-performance network between IUB and IUPUI, called I-Light, in late 2001, the SDA is now able to create two tape copies of user data over I-Light (one at IUB and the other at IUPUI), adding a degree of disaster tolerance to either site.

Getting Started:

Access Methods:

Architecture:

SDA Documentation:

SDA Technical Information: