Computer Science Technical Reports Project

Architecture of the Digital Library


THE FOLLOWING IS A DRAFT OF A PAPER THAT DESCRIBES AN ARCHITECTURE WHICH IS STILL SUBJECT TO CHANGE. ALTHOUGH THE FINAL VERSION WILL PROBABLY BE CLOSE TO THIS DRAFT, SIGNIFICANT CHANGES ARE POSSIBLE.

AS A COURTESY TO THE AUTHORS, PLEASE BE VERY CAREFUL IN REFERRING TO THIS PAPER AND ALWAYS CITE THE VERSION OF THE DRAFT.


Accessing Digital Library
Services and Objects:
A Frame of Reference

Robert Kahn, Corporation for National Research Initiatives
Robert Wilensky, University of California at Berkeley

DRAFT 4.4 FOR DISCUSSION PURPOSES
February 2, 1995

1. Introduction

This document describes fundamental aspects of a network-based infrastructure to support transactions for digital library services. It defines basic entities to be found in a distributed library system, provides naming conventions for identifying and locating digital objects, describes a service for using object names to locate, prepare and disseminate objects and provides basic elements of an access protocol.

Only the most basic elements of the infrastructure are described here. These elements constitute a minimal set of requirements and services that must be in place to effect the infrastructure of a universal wide-area digital library system (the System). We anticipate that many other services and elaborations will come into existence as the System is further developed, either building upon these elements or otherwise added to them. This paper focuses on the network-based aspects of the infrastructure, namely those for which knowledge of the contents of digital objects is not required. Definition of the content-based aspects of the infrastructure is purposely not addressed in this paper.

An important goal in limiting the description of the infrastructure in this way is not to constrain the higher level user and service level choices that, for many reasons, might be inappropriate to fix upon at this point in time. With only the most basic elements of the infrastructure in place, technological evolution would not be overly constrained. Further, the likelihood of achieving widespread interoperability of services at some early point in the future will be preserved. Perhaps the resulting capability will have a greater potential for enhancement and evolution through the participation of many others in helping to define it.

2. Definitions

In the definitions below, we introduce the notions of digital objects, repositories, handles and metadata. In addition, we discuss the role of originators and naming authorities. A digital object is the basic data item in the System and metadata is data about the digital object. A digital object consists of a number of sequences of bits; these include data and a unique identifier known as a handle. Formally, a digital object has two parts, typed-data and key metadata. The typed-data has a type specification, and data to be interpreted in accordance with this type specification. The key metadata includes a handle unique to the digital object, and may include other metadata. Possible primitive and composite data types for digital object data are discussed below.
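
As an illustration only, the two-part structure of a digital object can be sketched in Python. The class and field names (DigitalObject, KeyMetadata, TypedData, other) are assumptions made for this sketch and are not defined by the architecture.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass(frozen=True)
class KeyMetadata:
    """Key metadata: the mandatory handle, plus optional other metadata."""
    handle: str
    other: Dict[str, str] = field(default_factory=dict)  # hypothetical field


@dataclass(frozen=True)
class TypedData:
    """A type specification and the data to be interpreted according to it."""
    type_spec: str  # e.g. "bit-sequence"
    data: Any


@dataclass(frozen=True)
class DigitalObject:
    """A digital object has exactly two parts: key metadata and typed-data."""
    key_metadata: KeyMetadata
    typed_data: TypedData

    @property
    def handle(self) -> str:
        return self.key_metadata.handle


report = DigitalObject(
    KeyMetadata(handle="berkeley.cs/csd-93-712"),
    TypedData(type_spec="bit-sequence", data=b"%!PS ..."),
)
```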

A repository is a digital storage system in which digital objects may be stored for possible subsequent access or retrieval. The repository has a mechanism for adding new objects to its collection (depositing) and for making them available (accessing). Authors and other rights holders or their agents supply digital objects to repositories. The repository may contain other related information, services and management systems. Repositories provide users access to stored objects under terms and conditions that may be set by the depositor (generally, by the originator, rights holder or its agent) and/or a given repository.

Each repository contains a properties record for each of its stored digital objects. The properties record includes all metadata for a digital object: its key metadata and any other metadata. Notionally, the key metadata component is the subset of metadata that is invariant for a digital object across repositories. No attempt is made in this paper to delineate how much of the metadata should be included in the key metadata, other than requiring that it include the mandatory handle. Possible examples of repository-dependent metadata are the general terms and conditions for access and usage of the digital object and the date and time of deposit.
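
A minimal sketch of a properties record, assuming a simple split into key metadata (the same in every repository) and repository-dependent metadata. The class name, the field names and the deposit timestamps are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class PropertiesRecord:
    key_metadata: Dict[str, str]  # must include the mandatory "handle"
    repository_metadata: Dict[str, str] = field(default_factory=dict)

    def __post_init__(self) -> None:
        if not self.key_metadata.get("handle"):
            raise ValueError("key metadata must include the mandatory handle")

    def all_metadata(self) -> Dict[str, str]:
        """All metadata for the object as held by this repository."""
        merged = dict(self.repository_metadata)
        merged.update(self.key_metadata)  # key metadata wins on a name clash
        return merged


key = {"handle": "berkeley.cs/csd-93-712"}
# The same object deposited in two repositories (timestamps are made up).
at_repo_a = PropertiesRecord(key, {"deposited": "1995-02-02T10:00:00"})
at_repo_b = PropertiesRecord(key, {"deposited": "1995-03-03T09:30:00"})
```

The key metadata is identical in both records, while the deposit time differs per repository.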

Every repository is expected to offer (or arrange to have offered on its behalf) reference information about its own collection of digital objects. This service is provided by an Information Reference Server (IR Server) accessible via the repository.

A simple digital object protocol (SDOP) is supported by each repository (see section 3.1 below). Only the minimal necessary aspects of the SDOP are specified below. We anticipate that these aspects of the SDOP, or the SDOP itself, will be a subset of the interface protocol used by repositories, and require only that the functions or operation of the SDOP not be affected by any implemented supersets of the protocol. In particular, the SDOP allows for accessing a stored digital object or its metadata by specifying its handle, a service request type and additional parameters. If this request is complied with, the output of the service request is a disseminated digital object. A disseminated digital object may be a digital object, possibly with additional data affixed to it, such as the identity of the repository, information about the communications pathway, or digitally signed terms and conditions, if required for specific use of the object. Such a disseminated digital object may result from a service request to retrieve the object corresponding to a given handle. It is also possible that a disseminated digital object is not a digital object proper. Examples of service requests that might produce such a disseminated digital object are a request for a proper subpart of a digital object, and a request to return the results of invoking a digital object whose data is a computer program.

An originator is an entity that authorizes or validates a set of digital objects within its domain or sphere of influence. An originator may propose identifiers to be assigned to its digital objects. There may be a number of kinds of originators worth distinguishing.

Individuals and organizations (or their machines) may be originators. Each originator is responsible for each digital object it authorizes or validates including the responsibility for making it available in the System or for making changes to it or for setting terms and conditions on its use. An originator may authorize others within their organization to have this ability and may also delegate some or all of this responsibility to others outside their organization. Specifically, any organization that acts as an agent for another organization or which operates a repository may be delegated the responsibility to act as an originator for another entity.

A naming authority assigns locally unique identifiers to digital objects. These may be identifiers proposed by the originator or they may be self-generated by the naming authority. The naming authority may be a person, an organization, or a fully-automated process running on some machine. An originator may control a naming authority, but there may be naming authorities that are not controlled by originators.

A global naming authority for the System ensures that naming authority names are themselves globally unique. Prospective naming authorities must have their global names validated for registration by the global naming authority.

A digital object's data may incorporate information or material in which copyright, design patent or other rights or interests are claimed. There may also be rights associated with the digital object itself. An author may have submitted a digital object for purposes of registering a claim to copyright in a work that may be incorporated in the object. Since the copyright pertains to the underlying work fixed in the form of the particular submitted representation, the rights would normally pertain to all representations of the work, including, but not limited to, those representations of the work that are contained in other digital objects.

As mentioned earlier, the data of each digital object is typed. Data types assumed to be in the System include bit-sequence, digital-object, and handle. In addition, the composite data-type constructor set is defined. Therefore, set-of-bit-sequences, set-of-digital-objects and set-of-handles are valid composite types for digital object data. No other types are currently defined. However, it is expected that data subtypes will subsequently be defined, derived from these types, but are not considered part of the infrastructure described here. As an example, the type GIF might be defined and used as a subtype of bit-sequence, for data containing an image in GIF format; the contents of a digital object containing a pair of GIF images in no defined order could be of a defined type called set-of-GIF, which would ultimately be a subtype of set-of-bit-sequences. Similarly, an executable program might be defined and used as a subtype of bit-sequence.
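
The type scheme above can be sketched as follows. This is an illustration under stated assumptions: the Python names (Base, SetOf, is_subtype) are invented here, and the sketch assumes that the set constructor applies only to non-set types, which the text does not state either way.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class Base:
    """A primitive type, or a subtype derived from one via `parent`."""
    name: str
    parent: Optional["Base"] = None


@dataclass(frozen=True)
class SetOf:
    """The composite constructor `set`, applied to an element type."""
    element: Base


BIT_SEQUENCE = Base("bit-sequence")
DIGITAL_OBJECT = Base("digital-object")
HANDLE = Base("handle")
GIF = Base("GIF", parent=BIT_SEQUENCE)  # example subtype from the text

DataType = Union[Base, SetOf]


def is_subtype(t: DataType, u: DataType) -> bool:
    """True if t is u, derives from u, or is a set of a subtype of u's element."""
    if isinstance(t, SetOf) and isinstance(u, SetOf):
        return is_subtype(t.element, u.element)
    if isinstance(t, Base) and isinstance(u, Base):
        current: Optional[Base] = t
        while current is not None:
            if current == u:
                return True
            current = current.parent
    return False
```

Under these assumptions, set-of-GIF is a subtype of set-of-bit-sequences because GIF derives from bit-sequence.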

We shall informally refer to digital objects whose data is a set, one of whose elements is of type digital-object, as composite digital objects. We explicitly exclude the application of the adjective composite to a digital object that contains nothing but another digital object (i.e., whose data is of type digital-object). A digital object that is not composite is said to be elemental.
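
The distinction can be expressed as a small predicate. In this sketch the DigitalObject class is a simplified stand-in, a Python frozenset stands for data built with the set constructor, and the handle "berkeley.cs/wrapper" is hypothetical.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class DigitalObject:
    handle: str
    data: Any  # bytes, a handle string, a DigitalObject, or a frozenset of these


def is_composite(obj: DigitalObject) -> bool:
    """Composite: the data is a set with at least one digital-object element."""
    return isinstance(obj.data, frozenset) and any(
        isinstance(element, DigitalObject) for element in obj.data
    )


def is_elemental(obj: DigitalObject) -> bool:
    return not is_composite(obj)


ps = DigitalObject("berkeley.cs/csd-93-712/all.ps", b"postscript bytes")
tif = DigitalObject("berkeley.cs/csd-93-712/all.tif", b"tiff bytes")
work = DigitalObject("berkeley.cs/csd-93-712", frozenset({ps, tif}))
# Data of type digital-object (not a set) does not make an object composite.
wrapper = DigitalObject("berkeley.cs/wrapper", ps)
```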

The terms and conditions of a composite object may implicitly or explicitly be unioned with those of its constituent objects to arrive at the terms and conditions for those constituent objects. Terms and conditions may be explicitly imposed only on the composite object, in which case they would apply to each constituent object; or each constituent may have its own separate terms and conditions in addition. (Of course, creating composite digital objects would be subject to the copyright and any other legal restrictions pertaining to its constituent objects.)
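
One way to read the union rule is sketched below. Terms are modeled as plain strings, and the example terms are invented; the architecture does not define how terms and conditions are represented.

```python
from typing import FrozenSet, Iterable


def effective_terms(
    composite_terms: Iterable[str],
    constituent_terms: Iterable[str] = (),
) -> FrozenSet[str]:
    """Terms applying to a constituent: the composite's terms plus its own."""
    return frozenset(composite_terms) | frozenset(constituent_terms)


collection_terms = {"attribution-required"}      # hypothetical
page_image_terms = {"no-commercial-use"}         # hypothetical

# A constituent with no terms of its own inherits the composite's terms.
inherited = effective_terms(collection_terms)
# A constituent with separate terms gets both sets.
combined = effective_terms(collection_terms, page_image_terms)
```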

While we intentionally avoid issues of content in the digital library infrastructure, we note that the entities provided thus far give users a number of means to include digital objects that contain or may be interpreted to manifest the same or similar information or material. As an example, a literary work may be fixed in a number of different formats, e.g., LaTeX, PostScript and GIF page images. Each fixation may correspond to a distinct (elemental) digital object, each with its own unique handle and other metadata. A composite digital object may then be created whose data is the set of these digital objects. Similarly, one could create a composite digital object whose constituent objects were the fixations of the literary works of Shakespeare in PostScript. The handle of this composite digital object, in effect, names the PostScript collection of Shakespeare's literary works.

Note that it is possible to construct objects with similar effects without using composite digital objects. For example, the single digital object intended to correspond to a work could have data of type set-of-bit-sequences, rather than of type set-of-digital-objects, and contain each of the forms of fixation therein. In this case, digital objects may not exist corresponding to the individual fixations. Another possibility is to have a digital object whose data is of type set-of-handles. In this case, the handles would name the individual fixations (which may not even be available from the same repository). Such a digital object may contain other data fields that further describe (or annotate) the handles. Yet another possibility is to create a markup language which admits handles, plus other conventions for expressing how they relate to each other (for example, whether the individual handles are meant to be interpreted as different fixations of the same work, or a list of bibliographic citations, etc.). A digital object whose data comprises sentences in this markup language could serve to represent the same entities as do composite digital objects.

We use the informal term meta-object to refer to a digital object whose primary purpose is to provide references to other digital objects. Both digital objects whose data are of type set-of-handles and digital objects in a markup language that admits handles, would be instances of meta-objects.

A digital object may be mutable in that it may be changed after it is placed in a repository. Although none of the key metadata may be changed, nor may any known digital object that it contains be changed (unless the original digital object is also changed), most other changes are permissible. Minor changes might be made to correct a misspelling or other such error; changes to the title of a mutable digital object may be permissible. A mutable composite digital object could be modified to add the representation of an underlying work in a new format. Mutability would also be a useful way to allow digital objects that are designed to change with time or are dynamically computed. A digital object that cannot be changed is said to be immutable. The properties record may be used to indicate whether a digital object is mutable or not.

Naming authorities have unique names of the form X.Y.Z... where X, Y, Z are arbitrary strings (not containing the character "/"). These strings need not necessarily have a semantic derivation, but semantically motivated names generally will be assigned where possible. A naming authority with globally unique name X may create additional derived naming authorities (using the dot convention from left to right to concatenate additional descriptors as many times as desired) and without the need to register each of the derived names separately. If the name X was globally unique, the name X.Y is guaranteed to be globally unique as well.
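
The naming convention can be checked mechanically. The sketch below assumes that each dot-separated component is non-empty and that a new descriptor contains no "." of its own; the text requires only that components not contain "/". The function names are invented for illustration.

```python
def is_valid_authority_name(name: str) -> bool:
    """Dot-separated, non-empty components, none containing '/'."""
    return bool(name) and "/" not in name and all(name.split("."))


def derive(parent: str, descriptor: str) -> str:
    """Create a derived naming authority name by appending one descriptor."""
    if not is_valid_authority_name(parent):
        raise ValueError(f"invalid parent naming authority: {parent!r}")
    if not descriptor or "." in descriptor or "/" in descriptor:
        raise ValueError(f"invalid descriptor: {descriptor!r}")
    return f"{parent}.{descriptor}"


cs = derive("berkeley", "cs")          # "berkeley.cs"
sequoia = derive(cs, "sequoia")        # "berkeley.cs.sequoia"
```

Because "berkeley" is registered as globally unique, the derived names need no separate registration.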

A handle is a unique string composed of two logical parts separated by the character "/". These two parts are: 1) the globally unique name of a naming authority (which does not contain "/"); and 2) a locally unique string assigned by the naming authority. The globally unique part is mandatory; the locally unique part may be null, in which case the "/" is optional and may be omitted if desired. Handles have no prescribed maximum length in principle, but there will be a default length in existence at any time which can be adjusted upwards if necessary.
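
A minimal parsing sketch follows. Since the naming authority name cannot contain "/", the handle is split at the first "/"; everything after it, including any further "/" characters, is the local part. The names Handle, parse_handle and format_handle are assumptions of this sketch.

```python
from typing import NamedTuple


class Handle(NamedTuple):
    authority: str  # globally unique naming authority name (mandatory)
    local: str      # locally unique string; may be empty (null)


def parse_handle(text: str) -> Handle:
    authority, _, local = text.partition("/")  # split at the first "/"
    if not authority:
        raise ValueError("the naming authority part is mandatory")
    return Handle(authority, local)


def format_handle(handle: Handle) -> str:
    """Omit the optional "/" when the local part is null."""
    if handle.local:
        return f"{handle.authority}/{handle.local}"
    return handle.authority
```

For example, parse_handle("berkeley.cs/csd-93-712") yields the authority "berkeley.cs" and the local string "csd-93-712", while "berkeley.cs" and "berkeley.cs/" parse to the same handle with a null local part.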

Some servers may treat an entire handle as nothing more than a unique string without semantics, and, for example, provide a service in which a handle is mapped to one or more repositories containing the object associated with the handle. A simple way to create a handle without semantics is for a program run by the naming authority to generate a current date-time-stamp as the local handle; however, it is also possible to impose semantic conventions upon the locally unique string. These semantic conventions may provide useful information to humans; and some servers may try to exploit the semantics directly to help locate resources likely to contain the named objects.

A digital object has associated with it in a repository a transaction record, which records transactions involving the digital object. The transaction record may contain entries such as the time and date of deposit of the object, the time and date of each request for retrieval of the object, the identity of the requesting party, the handle for the object, and the applicable terms and conditions including amount and method of payment. Transaction records will only be made available to authorized parties.
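
A transaction record might be modeled as an append-only log with an access check. The class names, field names and the example parties below are hypothetical; how authorization is actually established is outside the scope of this sketch.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Set


@dataclass(frozen=True)
class TransactionEntry:
    when: datetime
    kind: str    # "deposit" or "retrieval"
    handle: str
    party: str   # identity of the depositing or requesting party
    terms: str = ""  # applicable terms and conditions, payment included


@dataclass
class TransactionRecord:
    authorized_parties: Set[str]
    entries: List[TransactionEntry] = field(default_factory=list)

    def log(self, entry: TransactionEntry) -> None:
        self.entries.append(entry)

    def read(self, party: str) -> List[TransactionEntry]:
        """Only authorized parties may see the transaction record."""
        if party not in self.authorized_parties:
            raise PermissionError(f"{party} may not read this transaction record")
        return list(self.entries)


record = TransactionRecord(authorized_parties={"originator"})
record.log(TransactionEntry(datetime(1995, 2, 2, 12, 0), "deposit",
                            "berkeley.cs/csd-93-712", "originator"))
record.log(TransactionEntry(datetime(1995, 2, 3, 9, 15), "retrieval",
                            "berkeley.cs/csd-93-712", "some-reader"))
```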

There must always be at least one official IR Server where a repository's contents are indexed. However, this IR Server need have no other formal relation to the repository. In particular, an IR Server at one site might agree to index all of another site's materials, and hence be designated as that other site's official IR Server. The contents of an IR Server may be made available to other value-added service providers, if desired.

Each naming authority or other authority may also maintain an IR Server that contains a copy of the properties record for each digital object within its domain; such an IR Server need not even be co-located with a repository containing other digital objects. In this paper, we will treat such an IR Server as part of a repository many or all of whose stored digital objects are meta-objects. Thus, the naming authority "berkeley.cs" might correspond to a repository containing meta-objects for all berkeley cs digital objects and the repository named "berkeley.cs.sequoia" might contain a subset of berkeley digital objects.

Each repository must provide an interface that implements the SDOP described below. This interface will normally provide access to the repository's IR Server (as well as its digital objects) so that it may be used, subject to appropriate administrative controls, to identify the material stored in the repository. An IR server may provide access to digital signatures or other fingerprints of its digital objects suitable for verification purposes. These signatures may be centrally maintained or may be replicated in properties records. There is one logical IR service for each repository or naming authority, but the implementation may be replicated or otherwise distributed for reliability or efficiency. The command language of the IR server is not defined here.

IR Servers may be nested into logical hierarchies as appropriate. In particular, an IR Server for a given repository or authority need not be made available publicly and the information contained within it may be provided at several logically higher levels. The IR Service may also be used by its authorized users for browsing, verification, and to provide alerts to changes in the system, such as the addition or deletion of objects. It may provide bibliographic information; it may contain information about a local collection; or it may be more global in its scope. The IR server may also provide intelligent agent services involving informational material contained in other repositories and other IR servers.

Repositories have official, unique names, assigned or approved to assure uniqueness by the global naming authority. This convention follows the same general format as the naming authority (i.e., "X.Y.Z...."), but does not name a particular host. It allows reference to repositories without having to commit to their particular location or address format specifications. For example, the repository name "USAToday" may correspond to numerous different repository locations in major cities on the network.

There is no requirement that a digital object be stored in a repository in any particular manner. Conceptually, the description of a digital object is strictly a logical one and is not intended to describe any particular implementation. In particular, it is possible that, in response to a request to access a particular digital object, a server runs a program that computes the digital object on the fly. It is possible for multiple digital objects to be embedded in a program (e.g., a data base manager or knowledge based system) that emits them upon request. The program may itself be a digital object. Thus, accessing and depositing are virtual processes, and may or may not involve the actual depositing and retrieval of actual objects per se, although such actual storage and retrieval is likely to be prevalent.

3. Accessing Digital Objects

3.1. SDOP

Each repository must support a simple protocol to allow deposit and access of digital objects or information about digital objects from that repository. This is called the Simple Digital Object Protocol (SDOP). Many repositories may support other more powerful query languages that allow users to access objects that meet meaningful criteria. SDOP is meant to provide only the most basic capabilities and may evolve over time. At present, it includes deposit of digital objects, access to digital objects by handle, and related repository services.

In particular, the protocol supports requests (i) to obtain metadata (GET_META) and (ii) to access the digital object (GET_DO). The results will depend upon the service request type and additional parameters. Examples of metadata service requests include obtaining the (possibly redacted) properties record for a digital object whose handle is presented, including the terms and conditions or the data type specifier for the given digital object. Access to the digital object will generally invoke a service program that performs stated operations on the digital object depending on the parameters supplied with the service request. Defined service requests include key-metadata and all; the former requests only the key-metadata, and the latter, the entire digital object (i.e., the key-metadata and the typed-data). It is possible that other systems-level services will be defined. Possible examples of such additional services might be data (request only the data) and encrypt, although we do not define such requests at this point. In addition, it is possible that data-type-dependent service requests will be introduced. Possible examples of such data-type-dependent service requests might be execute (for digital objects whose data component is of type program), or subpart (which requests only a component of the data of the digital object, further specified by some parameter). We emphasize that such data-type-dependent service requests are not defined as part of the System infrastructure.

Other request types are (iii) to deposit a digital object and its properties record (DEPOSIT_DO) and (iv) to access the IR Server (ACCESS_IR). Accessing the IR Server will allow information about the repository to be retrieved depending on the service request and additional parameters. In particular, one service request type is to return a list of alternative access methods supported by the repository.
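
The four request types can be illustrated with a toy in-memory repository. This is a sketch, not a wire protocol: the ToyRepository class, the dictionary layout of objects, the "alternative-access-methods" service name and the behavior of ACCESS_IR without that service (listing the handles held) are all assumptions. Only the request names and the key-metadata and all service requests come from the text.

```python
from enum import Enum
from typing import Any, Dict, List, Optional, Tuple


class Request(Enum):
    GET_META = "GET_META"
    GET_DO = "GET_DO"
    DEPOSIT_DO = "DEPOSIT_DO"
    ACCESS_IR = "ACCESS_IR"


class ToyRepository:
    def __init__(self, alternative_methods: Tuple[str, ...] = ()) -> None:
        self._objects: Dict[str, Dict[str, Any]] = {}     # handle -> object
        self._properties: Dict[str, Dict[str, Any]] = {}  # handle -> record
        self._alternatives: List[str] = list(alternative_methods)

    def handle_request(self, request: Request, handle: Optional[str] = None,
                       service: str = "all", payload: Any = None) -> Any:
        if request is Request.DEPOSIT_DO:
            digital_object, properties = payload
            key = digital_object["key_metadata"]["handle"]
            self._objects[key] = digital_object
            self._properties[key] = properties
            return key
        if request is Request.ACCESS_IR:
            if service == "alternative-access-methods":
                return list(self._alternatives)
            return sorted(self._objects)  # assumed default: handles held
        if handle not in self._objects:
            raise KeyError(f"no object with handle {handle!r}")
        if request is Request.GET_META:
            return dict(self._properties[handle])
        # Remaining case: GET_DO with one of the two defined service requests.
        stored = self._objects[handle]
        if service == "key-metadata":
            return {"key_metadata": stored["key_metadata"]}
        if service == "all":
            return dict(stored)
        raise ValueError(f"undefined service request {service!r}")


repo = ToyRepository(alternative_methods=("Z39.50", "Dienst"))
report = {
    "key_metadata": {"handle": "berkeley.cs/csd-93-712"},
    "typed_data": {"type": "bit-sequence", "data": b"..."},
}
repo.handle_request(
    Request.DEPOSIT_DO,
    payload=(report, {"handle": "berkeley.cs/csd-93-712", "mutable": False}),
)
```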

Initially, the protocol has been purposely kept simple, and all the more complex transactions are assumed to be handled by other protocols, or by subsequent extensions of the SDOP. In the first case, a primary use of the SDOP for more sophisticated repositories is to have it present the other protocols that it supports (e.g., Z39.50, SQL3, ZQL, Dienst) as alternative access methods. Another example of an alternative access method might be to supply a software agent to the repository. It may be desirable to extend the SDOP in any number of ways, for example, to explicitly include a payment mechanism, a negotiation mechanism, or a more sophisticated model-based interaction mechanism.

Note that the repository inputs and outputs are all structured to be in the format of digital objects so that they may be interpreted in a standardized way. In particular, the output of a repository may be a digital object that was prepared by the repository in response to a service request immediately prior to dissemination.

Notionally, a digital object is analogous to a self-contained package which must first be opened to access its contents. Access to contents is assumed to be available only to parties (such as users and service providers) that are authorized to open the package, not necessarily within the network-based infrastructure itself. For example, above we described the possibility that a user may construct a single digital object whose data is the set of all fixations (i.e., known formats) of a given work. If so, then there is as yet no formally defined method within the SDOP to determine what formats are available, and then, to extract one of them. We expect a set of mechanisms to be developed which expand upon the internal structure of the objects in the infrastructure, but this level of description has intentionally been omitted here.

When a digital object is accessed via GET_DO, the recipient receives a disseminated digital object, that is, the result of the service request, along with information such as the identity of the repository, the service request that produced the result, the method of communication (if appropriate) and a transaction string corresponding to an entry in the transaction record. The transaction string is unique to the repository. In addition, the disseminated digital object may contain an appropriately authenticated copy of some portion of the properties record for that object, including the specific terms and conditions that apply to this use of the digital object and the materials contained therein. As noted above, depending on the nature of the GET_DO service request, the disseminated digital object may include the digital object in its entirety, i.e., as stored in the repository; however, it might instead include data that is not properly a digital object, such as a portion of a digital object's data, the digital object data in a compressed format, or the result of executing the data of the digital object. In all cases, however, the key-metadata (including, of course, the handle) of the digital object is included. Since the service request that produced the resulting disseminated object is included in the disseminated digital object, the relation of the data of the disseminated digital object to the data of the digital object stored in the repository is recorded in the disseminated digital object. We leave unspecified whether there is an additional specification in the disseminated digital object clarifying this relationship.
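
The contents of a disseminated digital object can be sketched as a record that always carries the key-metadata and the service request that produced it. The class names, the dictionary layout of the stored object and the form of the transaction string (repository name plus a counter) are assumptions of this sketch.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class DisseminatedObject:
    key_metadata: Dict[str, str]  # always included, handle and all
    result: Any                   # whole data, a subpart, program output, ...
    repository: str
    service_request: str          # records how `result` relates to the stored data
    transaction_string: str       # matches an entry in the transaction record
    terms: Optional[str] = None   # terms applying to this particular use


class Disseminator:
    def __init__(self, repository: str) -> None:
        self.repository = repository
        self._next_transaction = 1

    def disseminate(self, stored: Dict[str, Any], service_request: str,
                    terms: Optional[str] = None) -> DisseminatedObject:
        if service_request == "all":
            result = stored["typed_data"]
        elif service_request == "key-metadata":
            result = None  # nothing beyond the key-metadata itself
        else:
            raise ValueError(f"undefined service request {service_request!r}")
        transaction_string = f"{self.repository}#{self._next_transaction}"
        self._next_transaction += 1
        return DisseminatedObject(dict(stored["key_metadata"]), result,
                                  self.repository, service_request,
                                  transaction_string, terms)


stored = {"key_metadata": {"handle": "berkeley.cs/csd-93-712"},
          "typed_data": {"type": "bit-sequence", "data": b"..."}}
disseminator = Disseminator("berkeley.cs")
first = disseminator.disseminate(stored, "all")
second = disseminator.disseminate(stored, "key-metadata")
```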

3.2. The Handle Server Infrastructure

A highly reliable distributed system of handle servers is maintained as part of the infrastructure. These servers map handles to network resources at which the corresponding digital objects are available. Handle directory servers are also stipulated; these will be located at certain well known locations and will maintain a table of network addresses of handle servers. This table will generally be downloaded by each participating site frequently enough to be acceptably up to date at all times. The handle directory server may be replicated for reliability. Caching handle servers may be run locally to store location information for frequently used handles.

A handle is sent to a handle server to locate network addresses of repositories containing that object. The handle is used to select the handle server from the handle directory server table but is not otherwise interpreted. One can also supply a handle to a separate system, which invokes the above procedures to find the stated object and associated rights management system. Local handle servers may use any technique to do the mapping. The handle servers maintained as part of the infrastructure map the handles by hashing them.
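
The lookup path can be sketched as follows, assuming the hash of the handle selects a server from the directory table. The text does not name a hash function or a selection rule; SHA-256 and the modulo step are assumptions, as are the server names and the HandleService class.

```python
import hashlib
from typing import Dict, List


def pick_handle_server(handle: str, directory: List[str]) -> str:
    """Choose a handle server by hashing the uninterpreted handle string."""
    digest = hashlib.sha256(handle.encode("utf-8")).digest()
    return directory[int.from_bytes(digest[:8], "big") % len(directory)]


class HandleService:
    """All handle servers in one process, each with its own table."""

    def __init__(self, directory: List[str]) -> None:
        self.directory = list(directory)  # the handle directory server table
        self.tables: Dict[str, Dict[str, List[str]]] = {
            address: {} for address in self.directory
        }

    def register(self, handle: str, repository: str) -> None:
        table = self.tables[pick_handle_server(handle, self.directory)]
        table.setdefault(handle, []).append(repository)

    def resolve(self, handle: str) -> List[str]:
        table = self.tables[pick_handle_server(handle, self.directory)]
        return list(table.get(handle, []))


service = HandleService(["hs-a.example", "hs-b.example", "hs-c.example"])
service.register("berkeley.cs/csd-93-712", "berkeley.cs")
```

Because every participant hashes the same string against the same table, registration and resolution reach the same server without interpreting the handle.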

No guarantee is made that the resulting repositories will provide the designated object. Rather, the user is assured that the result is what authorized maintainers of repository services have indicated are the appropriate choices.

The handle server system is intended to be a means of universal basic access to objects in the System. That is, in the worst case, a user can present a handle to a handle server and be advised of some repository which an authorized party has asserted contains the object designated by the handle. The handle server is not meant to be the only, or even the primary, means to access repositories. Primary access may be provided locally and also by value-added service providers, likely in a variety of different and possibly incompatible ways. Users interacting with such services may not encounter handles and such services may interact with repositories via SDOP or via protocols that do not involve handles.

Since a handle is just a unique string, it can be mapped to an actual repository by any of several mechanisms, including a mechanism that attempts to interpret the string. Since repository names are not actual network addresses, they must first be mapped to network locations. The method for accomplishing these mappings is not specified. The handle service is one available means for both kinds of mappings; it would specify at least the location of the interface that supports the SDOP protocol for a given repository. There may also be a need to explicitly provide a country identifier for repositories, naming authorities and/or originators. For the present, however, country identifiers will be omitted while the legal issues are considered.

When a repository is found by lookup in a handle server, it may be more efficient to map the handle directly into the network address (or addresses) of the repository. This mapping avoids having to do a double lookup from repository name to repository location. However, if the location of the repository were to change, the handle server would have to be notified so it could make the corresponding changes. It is possible that certain repository names may resolve to broadcast addresses to locate specific machines. This might be the case where a single repository actually consisted of multiple machines on a local area network at a given site.
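
The trade-off between the two mappings can be made concrete with two small tables. The table names and the network addresses are invented for this sketch (the addresses come from a range reserved for documentation).

```python
from typing import Dict, List

# Two-step scheme: handle -> repository name -> network addresses.
handle_to_repository: Dict[str, str] = {
    "berkeley.cs/csd-93-712": "berkeley.cs",
    "berkeley.cs/csd-93-712/all.ps": "berkeley.cs",
}
repository_to_addresses: Dict[str, List[str]] = {"berkeley.cs": ["192.0.2.10"]}

# Direct scheme: handle -> network addresses, saving the second lookup.
handle_to_addresses: Dict[str, List[str]] = {
    handle: list(repository_to_addresses[name])
    for handle, name in handle_to_repository.items()
}


def resolve_two_step(handle: str) -> List[str]:
    return repository_to_addresses[handle_to_repository[handle]]


def resolve_direct(handle: str) -> List[str]:
    return handle_to_addresses[handle]


def move_repository(name: str, new_addresses: List[str]) -> int:
    """Relocate a repository; return how many direct entries had to change."""
    repository_to_addresses[name] = list(new_addresses)  # one update suffices here
    rewritten = 0
    for handle, repository in handle_to_repository.items():
        if repository == name:
            handle_to_addresses[handle] = list(new_addresses)
            rewritten += 1
    return rewritten
```

Moving the repository costs a single update in the two-step scheme but one update per handle in the direct scheme, which is why the handle server must be notified of the move.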

4. Imposing Semantics on Handles

As discussed above, a handle is presumed to have two logical components, a naming authority, and an identifier unique to that naming authority. These naming authorities will be assigned in a manner similar to the way in which Internet domain names are assigned, but will be clearly distinguishable from them. For example, there may be a naming authority named "berkeley", which will authorize other naming authorities within the berkeley domain. Within the berkeley domain, names are locally assigned to other naming authorities. Thus, the name "berkeley.cs" might be assigned to the authority responsible for naming the UCB Computer Science technical report series (or to several such series). Note that this particular naming authority need not correspond to a valid Internet address, even though it may follow similar semantic conventions.

Particular naming authorities may follow their own conventions for assigning semantic or non-semantic strings to their objects. For example, "berkeley.cs" may follow a proposed convention for its technical reports, and give each of the corresponding digital objects (whether composite objects or meta-objects) a local handle, e.g., "csd-93-712". (The "csd", for "Computer Science Division", is perhaps redundant; however, we use it here to indicate the possibility of a single naming authority issuing several distinct series.)

The full unique handle for this digital object would be

berkeley.cs/csd-93-712

where the "/" separates the naming authority name from the string unique to that authority.

In addition, digital objects may exist for this work in each of a number of fixations (formats). The handles for these fixations may also be semantically interpretable, e.g., the string "csd-93-712/all.ps" might be the unique local part of the handle for the digital object corresponding to the PostScript version of this work; "csd-93-712/all.tif" the local part of the handle for the TIFF representation. (Note that the character "/" is allowed in the local name. It may also be desirable to distinguish other characters, but this is not discussed further in this paper.)

Other schemes may be used to generate handles in other ways. For example, the local portion of a handle might correspond to a date-time format, so that the digital object above might instead have the handle

berkeley.cs/1994.12.05.23.42.12;7
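
A handle of this shape could be produced as sketched below. The paper does not explain the ";7" suffix; treating it as a sequence number that distinguishes handles issued within the same second is purely an assumption of this sketch, as are the function names.

```python
from datetime import datetime


def timestamp_local_name(moment: datetime, sequence: int) -> str:
    """Date-time local name in the layout YYYY.MM.DD.HH.MM.SS;sequence."""
    return f"{moment:%Y.%m.%d.%H.%M.%S};{sequence}"


def timestamp_handle(authority: str, moment: datetime, sequence: int) -> str:
    return f"{authority}/{timestamp_local_name(moment, sequence)}"


example = timestamp_handle("berkeley.cs", datetime(1994, 12, 5, 23, 42, 12), 7)
```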

These handle forms can be embedded within various syntactic wrappers to distinguish them in various contexts from other notations. For example, the handle might be expressed in URN syntax as follows:

<URN:ASCII:ELIB-v.2.0:berkeley.cs/csd-93-712>

Here "ELIB-v.2.0" is supposed to suggest (via "ELIB") that this is a URN for electronic library material, and also (via "-v.2.0") that some particular naming convention is used by the naming authority. Another possibility is the notation used by Grass and Arms (GA1994), which resembles that for URLs, and precedes the handle with the prefix "hdl://" (to denote that a handle follows), or just "//" (if it is important to distinguish a global root for the handle), e.g.:

hdl://berkeley.cs/csd-93-712

//berkeley.cs/1994.12.05.23.42.12;7

The user of this notation is cautioned to avoid confusion with URLs: URLs name network services, while handles name digital objects.
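
Recovering the bare handle from the wrappers shown above is straightforward. The sketch assumes the URN form always has exactly three colon-separated fields before the handle, as in the single example given; the function name unwrap is invented.

```python
def unwrap(text: str) -> str:
    """Strip a URN, "hdl://" or "//" wrapper and return the bare handle."""
    text = text.strip()
    if text.startswith("<URN:") and text.endswith(">"):
        fields = text[1:-1].split(":", 3)  # URN, ASCII, ELIB-v.2.0, handle
        if len(fields) != 4:
            raise ValueError(f"unexpected URN layout: {text!r}")
        return fields[3]
    for prefix in ("hdl://", "//"):
        if text.startswith(prefix):
            return text[len(prefix):]
    return text  # already a bare handle
```

All of "<URN:ASCII:ELIB-v.2.0:berkeley.cs/csd-93-712>", "hdl://berkeley.cs/csd-93-712" and "berkeley.cs/csd-93-712" unwrap to the same bare handle.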

Various services might exploit semantic conventions to locate an object given its handle, without consulting a handle server. For example, a naming authority may have its own repository and IR server associated with it; the latter might be looked up (perhaps via an additional service), and queried for the location(s) of this particular report.

Users may, of course, attempt to incorporate all manner of semantic or system content in handles. However, it is plausible that imposing any content in handles per se could be troublesome. Instead, handles per se could be declared to be uninterpreted, and an additional level of indirection be introduced to interpret them. Additional name services could be created to translate user-oriented nicknames to system-oriented handles, as is done for file systems today. We stop short of advocating such a system here, however, assuming that a semantically-motivated convention, such as that which has served for URLs, will continue to be useful at some level, and does not require an additional level of mediation.

5. Conclusion and Summary

This proposal provides a method for naming, identifying and/or invoking digital objects in a system of distributed repositories that provides great flexibility and is well-suited to a national-level enterprise. It allows the possibility of locating digital objects without making any presumptions about the object or its location(s). It also admits value-added conventions that various users may use to their own advantage. For example, an IR server might internally refer to an object by its global handle, and, additionally, keep track of repositories in which this object is believed or known to reside. If a user requests this object, the IR server might look up the repository name or address, determine the repository service, and ask that repository to deliver a version of the object to the user. Alternatively, the IR server might instead use the object's handle at run time to query syntactically a handle server for the name of repositories or services that house the object.

This system also allows for public and private naming authorities. Many naming authorities will be private, and only assign identifiers to their chosen clientele (e.g., department members eligible to produce technical reports); however, public naming authorities could provide a service whereby they generate an identifier for anyone who requests one. Individual citizens not associated with any official body might use a public naming authority to generate identifiers for objects they wish to store for private purposes or for public dissemination on their own (this is an example of a situation in which the originator does not control the naming authority).

In the CS-TR project, CNRI is providing the global naming authority plus a handle management service that supports handles with and without semantics. Participating institutions may wish to take advantage of handle semantics, if any, to retrieve objects directly. Each participating institution would be free to propose or request names of its own choice. Each of these names may also have associated with it a non-semantic identifier (such as a date-time-stamp) which is not otherwise specified in this document.



wya
3/3/95