Tuesday, January 20, 2009

Glossary of Data Warehousing

http://inmoncif.com/library/glossary/



access - the operation of seeking, reading, or writing data on a storage unit.

access method - a technique used to transfer a physical record from or to a mass storage device

access mode - a technique in which a specific logical record is obtained from or placed onto a file assigned to a mass storage device

access pattern - the general sequence in which the data structure is accessed (eg., from tuple to tuple, from record to record, from segment to segment, etc.)

access plan - the control structure produced during program preparation and used by a data base manager to process SQL statements during application execution.

access time - the time interval between the instant an instruction initiates a request for data and the instant the first of the data satisfying the request is delivered. Note that there is a difference - sometimes large - between the time data is first delivered and the time when ALL the data is delivered.

accuracy - a qualitative assessment of freedom from error or a quantitative measure of the magnitude of error, expressed as a function of relative error.

active data dictionary - a data dictionary that is the sole source for an application program insofar as metadata is concerned.

activity - (1) the lowest level function on an activity chart (sometimes called the "atomic level") (2) a logical description of a function performed by an enterprise (3) a procedure (automated or not) designed to fulfill such a function

activity ratio - the fraction of records in a data base which have activity or are otherwise accessed in a given period of time or in a given batch run.

ad hoc processing - one time only, casual access and manipulation of data on parameters never before used

address - an identification (eg., number, name, storage location, byte offset, etc.) for a location where data is stored

addressing - the means of assigning data to storage locations, and locating the data upon subsequent retrieval, on the basis of the key of the data

after image - the snapshot of data placed on a log upon the completion of a transaction.

agent of change - a motivating force large enough not to be denied, usually the aging of systems, changes in technology, radical changes in requirements, etc.

AIX - Advanced Interactive eXecutive - IBM's version of the UNIX operating system.

algorithm - a set of statements organized to solve a problem in a finite number of steps

alias - an alternative label used to refer to a data element

alphabetic - a representation of data using letters - upper and/or lower case - only

alphanumeric - a representation of data using numbers and/or letters, and punctuation

analytical processing - the usage of the computer to produce an analysis for management decision, usually involving trend analysis, drill down analysis, demographic analysis, profiling, etc.

ANSI - American National Standards Institute

anticipatory staging - the technique of moving blocks of data from one storage device to another with a shorter access time, in anticipation of their being needed by a program in execution or a program soon to go into execution.

API - Application Program Interface - the common set of parameters and conventions by which programs communicate with one another.

application - a group of algorithms and data interlinked to support an organizational requirement

application blocking of data - the grouping of multiple occurrences of data, controlled at the application level, into the same physical unit of storage

application data base - a collection of data organized to support a specific application

archival data base - a collection of data containing data of a historical nature. As a rule, archival data cannot be updated. Each unit of archival data is relevant to a moment in time, now passed.

area - in network data bases, a named collection of records that can contain occurrences of one or more record types. A record type can occur in more than one area.

artifact - a design technique used to represent referential integrity in the DSS environment.

artificial intelligence - the capability of a system to perform functions typically associated with human intelligence and reasoning.

association - a relationship between two entities that is represented in a data model

associative storage - (1) a storage device whose records are identified by a specific part of their contents rather than their name or physical position in the data base (2) Content addressable memory. See also parallel search storage.

atomic - (1) data stored in a data warehouse (2) the lowest level of process analysis

atomic data base - a data base made up of primarily atomic data; an enterprise data warehouse; a DSS foundation data base.

atomic level data - data with the lowest level of granularity. Atomic level data sits in a data warehouse and is time variant (i.e., accurate as of some moment in time now passed)

atomicity - the property in which a group of actions is invisible to other actions executing concurrently, yielding the effect of serial execution. It is recoverable, with successful completion (i.e., commit) or total backout (i.e., rollback) of previous changes associated with that group.

attribute - a property that can assume values for entities or relationships. Entities can be assigned several attributes (eg., a tuple in a relation consists of values). Some systems also allow relationships to have attributes.

audit trail - data that is available to trace activity, usually update activity.

authorization identifier - a character string that designates a set of privilege descriptors.

availability - a measure of the reliability of a system: the amount of time the system is up and available divided by the amount of time the system should be up and available. Note there is a difference between a piece of hardware being available and the systems running on the hardware also being available.


back end processor - a data base machine or an intelligent disk controller

back up - to restore the data base to its state as of some previous moment in time

backup - a file serving as a basis for the activity of backing up a data base. Usually a snapshot of a data base as of some previous moment in time.

Backus-Naur Form (BNF) - a metalanguage used to specify or describe the syntax of a language. In BNF, each symbol on the left side of the forms can be replaced by the symbol strings on the right to develop sentences in the grammar of the defined language. Synonymous with Backus Normal Form.

backward recovery - a recovery technique that restores a data base to an earlier state by applying before images.

base relation - a relation that is not derivable from other relations in the data base

batch - computer environment in which programs (usually long running, sequentially oriented) access data exclusively, and user interaction is not allowed while the activity is occurring.

batch environment - a sequentially dominated mode of processing; in batch, input is collected and stored for later processing. Once collected, the batch input is transacted sequentially against one or more data bases.

batch window - the time during which the computer is available for batch or sequential processing, usually while the online system is quiesced. The batch window occurs during nonpeak processing hours.

before image - a snapshot of a record prior to update, usually placed on an activity log.

bill of materials - a listing of the parts used in a manufacturing process along with the relation of one product to another insofar as assembly of the final product is concerned. The bill of materials is a classical recursive structure.

binary element - a constituent element of data that takes one of two values or states - true or false, or one or zero.

binary search - a dichotomizing search with steps in which the sets of remaining items are partitioned into two equal parts.
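
By way of illustration, a minimal binary search sketch in Python (the function name and data are invented for the example):

    def binary_search(items, target):
        # items must already be in sorted order
        low, high = 0, len(items) - 1
        while low <= high:
            mid = (low + high) // 2       # partition the remaining items into two parts
            if items[mid] == target:
                return mid                # found: return the position
            elif items[mid] < target:
                low = mid + 1             # discard the lower half
            else:
                high = mid - 1            # discard the upper half
        return -1                         # target is not present

    print(binary_search([3, 7, 11, 15, 22, 31], 15))   # prints 3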

bind - (1) to assign a value to a data element, variable, or parameter. (2) the attachment of a data definition to a program prior to the execution of the program.

binding time - the moment in time when the data description known to the dictionary is assigned to or bound to the procedural code.

bit - (b)inary digi(t) - the lowest level of storage. A bit can be in a 1 state or a 0 state.

bit map - a specialized form of an index indicating the existence or nonexistence of a condition for a group of blocks or records. Bit maps are expensive to build and maintain, but provide very fast comparison and access facilities.
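
As a rough sketch of the idea (in Python, with invented conditions), one bit per record makes a comparison across a whole group of records a single bitwise operation:

    # Bit i is 1 if record i satisfies the condition.
    is_active   = 0b10110    # records 1, 2, and 4 are active
    in_region_7 = 0b11010    # records 1, 3, and 4 are in region 7

    both = is_active & in_region_7   # one AND compares every record at once
    print([i for i in range(5) if both & (1 << i)])   # [1, 4]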

block - (1) a basic unit of structuring storage (2) the physical unit of transport and storage. A block usually contains one or more records (or contains the space for one or more records). In some DBMS a block is called a page.

blocking - the combining of two or more physical records so that they are physically colocated. The result of their physical colocation is that they can be accessed and fetched by a single execution of a machine instruction.

block splitting - the data management activity in which a filled block is written into two unfilled blocks, leaving space for future insertions and updates in the two partially filled blocks.

B-tree - a balanced storage structure and access method that maintains order in a data base by splitting nodes as they fill and reestablishing pointers to the respective sets, so that all leaf entries remain at the same depth. Despite the name, the nodes of a B-tree may have many more than two branches.

buffer - an area of storage that holds data temporarily in main memory while data is being transmitted, received, read, or written. A buffer is often used to compensate for the differences in the timing of transmission and execution of devices. Buffers are used in terminals, peripheral devices, storage units, and CPUs.

bus - the hardware connection that allows data to flow from one component to another (eg., from the CPU to the line printer.)

byte - a basic unit of storage - made up of 8 bits.


C - a programming language.

cache - a buffer usually built and maintained at the device level. Retrieving data out of a cache is much quicker than retrieving data from the disk itself.

call - to invoke the execution of a module.

canonical model - a data model that represents the inherent structure of data without regard to either individual use or hardware or software implementation.

cardinality (of a relation) - the number of tuples (i.e., rows) in a relation. See also degree of a relation.

CASE - Computer Aided Software Engineering

catalog - a directory of all files available to the computer.

chain - an organization in which records or other items of data are strung together.

chain list - a list in which the items cannot be located in sequence but in which each item contains an identifier (i.e., pointer) for finding the next item.

channel - a subsystem for input and output to and from the computer. Data from storage units, for example, flows into the computer by way of a channel.

character - a member of the standard set of elements used to represent data in the data base.

character type - the characters that can represent the value of an attribute.

checkpoint - an identified snapshot of the data base, or a point at which the transactions against the data base have been frozen or quiesced.

checkpoint/restart - a means of restarting a program at some point other than the beginning - for example, after a failure or interruption has occurred. N checkpoints may be used at intervals throughout an application program. At each of those points sufficient information is stored to permit the program to be restored to the moment in time the checkpoint was taken.

child - a unit of data existing in a 1:n relationship with another unit of data called a parent where the parent must exist before the child can exist, but the parent can exist even if no child unit of data exists.

CICS - Customer Information Control System - an IBM teleprocessing monitor.

CIO - chief information officer - an organizational position managing all of the information processing functions.

circular file (queue) - an organization of data in which a finite number of units of data are allocated. Data is then loaded into those units. Upon reaching the end of the allocated units, new data is written over older data at the start of the queue. Sometimes called a "wrap around" queue.
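
A minimal sketch of a wrap around queue in Python (the class and its size are hypothetical):

    class CircularFile:
        def __init__(self, size):
            self.units = [None] * size      # a finite number of allocated units
            self.next = 0                   # the next unit to be written

        def write(self, record):
            self.units[self.next] = record  # on wrap around, older data is overwritten
            self.next = (self.next + 1) % len(self.units)

    q = CircularFile(3)
    for rec in ["a", "b", "c", "d"]:
        q.write(rec)
    print(q.units)   # ['d', 'b', 'c'] - 'd' has overwritten 'a'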

"CLDS" - the facetiously named system development life cycle for analytical, DSS systems. CLDS is so named because in fact it is the reverse of the classical systems development life cycle - SDLC.

claimed block - a second or subsequent physical block of data designated to store table data when the originally allocated block has run out of space.

class (of entities) - all possible entities held by a given proposition.

cluster - (1) in Teradata, a group of physical devices controlled by the same AMP (2) in DB2 and Oracle, the practice of physically colocating data in the same block based on the content of data.

cluster key - the key around which data is clustered in a block (DB2/Oracle).

coalesce - to combine two or more sets of items into a single set.

COBOL - Common Business Oriented Language - a computer language for the business world. A very common language.

CODASYL model - a network data base model that was originally defined by the Data Base Task Group (DBTG) of the Conference on Data Systems Languages (CODASYL) organization.

code - (1) to represent data or a computer program in a form that can be accepted by a data processor (2) to transform data so that it cannot be understood by anyone who does not have the algorithm used to decode the data prior to presentation (sometimes called "encode").

collision - the event that occurs when two or more records of data are assigned the same physical location. Collisions are associated with randomizers or hashers.

column - a vertical component of a table, in which all values are selected from the same domain. A row is made up of one or more columns.

command - (1) the specification of an activity by the programmer (2) the actual execution of the specification.

commit - a condition raised by the programmer signalling to the DBMS that all update activity done by the program up to that point be made permanent in the data base. Prior to the commit, all update activity can be rolled back or cancelled with no ill effect on the contents of the data base.

commit protocol - an algorithm to ensure that a transaction is successfully completed.

commonality of data - similar or identical data that occurs in different applications or systems. The recognition and management of commonality of data is one of the foundations of conceptual and physical data base design.

communication network - the collection of transmission facilities, network processors, and so on, which provides for data movement among terminals and information processors.

compaction - a technique for reducing the number of bits required to represent data without losing the content of the data. With compaction, repetitive data is represented very concisely.

component - a data item or array of data items whose component type defines a collection of occurrences with the same data type.

compound index - an index over multiple columns.

concatenate - to link or connect two strings of characters, generally for the purpose of using them as a single value.

conceptual schema - a consistent collection of data structures expressing the data needs of the organization. This schema is a comprehensive, base level, and logical description of the environment in which an organization exists, free of physical structure and application system considerations.

concurrent operations - activities executed simultaneously, or during the same time interval.

condensation - the process of reducing the volume of data managed without reducing the logical consistency of the data. Condensation is essentially different from compaction.

connect - to forge a relationship between two entities, particularly in a network system.

connector - a symbol used to indicate that one occurrence of data has a relationship with another occurrence of data. Connectors are used in conceptual data base design and can be implemented hierarchically, relationally, in an inverted fashion, or by a network.

content addressable memory - main storage that can be addressed by the contents of the data in the memory, as opposed to conventional location addressable memory.

contention - the condition that occurs when two or more programs try to access the same data at the same time.

continuous time span data - data organized so that a continuous definition of data over a span of time is represented by one or more records.

control character - a character whose occurrence in a particular context initiates, modifies, or stops an operation.

control data base - a utilitarian data base containing data not directly related to the application being built. Typical control data bases are audit data bases, terminal data bases, security data bases, etc.

cooperative processing - the ability to distribute resources (programs, files and data bases) across the network.

coordinator - the two phase commit protocol defines one data base management system as the coordinator for the commit process. The coordinator is responsible for communicating with the other data base managers involved in a unit of work.

corporate information factory (CIF) - the architectural framework that houses the ODS, data warehouse, data marts, i/t interface, and the operational environment. The CIF is held together logically by metadata and physically by a network such as the Internet.

CPU - central processing unit.

CPU-bound - the state of processing in which the computer can produce no more output because the CPU portion of the processor is being used at 100% capacity. When the computer is CPU-bound, typically the memory and storage processing units are less than 100% utilized. With modern DBMS, it is much more likely that the computer will be I/O-bound rather than CPU-bound.

CSP - Cross System Product - an IBM application generator.

CUA - Common User Access. Specifies the ways in which the user interface to systems is to be constructed.

current value data - data whose accuracy is valid as of the moment of execution. As opposed to time variant data.

cursor - (1) an indicator that designates a current position on a screen (2) a system facility that allows the programmer to thumb from one record to the next when the system has retrieved a set of records.

cursor stability - an option that allows data to move under the cursor. Once the program is through using the data examined by the cursor, the data is released. As opposed to repeatable read.

cylinder - the area of storage of DASD that can be read without the movement of the arm. The term originated with disk files, in which a cylinder consisted of one track on each disk surface so that each of these tracks could have a read/write head positioned over it simultaneously.


DASD - see direct access storage device.

data - a recording of facts, concepts, or instructions on a storage medium for communication, retrieval, and processing by automatic means and presentation as information that is understandable by human beings.

data administrator - (DA) - the individual or organization responsible for the specification, acquisition, and maintenance of data management software and the design, validation and security of files or data bases. The data model and the data dictionary are classically the charge of the DA.

data aggregate - a collection of data items.

data base - a collection of interrelated data stored (often with controlled, limited redundancy) according to a schema. A data base can serve a single or multiple applications.

data base administrator (DBA) - the organizational function charged with the day to day monitoring and care of the data bases. The dba function is more closely associated with physical data base design than the DA is.

data base key - a unique value that exists for each record in a data base. The value is often indexed, although it can be randomized or hashed.

data base machine - a dedicated-purpose computer that provides data access and management through total control of the access method, physical storage, and data organization. Often called a "back end processor." Data is usually managed in parallel by a data base machine.

data base management system (DBMS) - a computer based software system used to establish and manage data.

data base record - a physical root and all of its dependents (in IMS).

data definition - the specification of the data entities, their attributes, and their relationships in a coherent data base structure to create a schema.

data definition language (DDL) (also called a data description language) - the language used to define the data base schema and additional data features, allowing the DBMS to generate and manage the internal tables, indexes, buffers, and storage necessary for data base processing.

data description language - see data definition language.

data dictionary - a software tool for recording the definition of data, the relationship of one category of data to another, the attributes and keys of groups of data, and so forth.

data division (COBOL) - the section of a COBOL program that consists of entries used to define the nature and characteristics of the data to be processed by the program.

data driven development - the approach to development that centers around identifying the commonality of data through a data model and building programs that have a broader scope than the immediate application. Data driven development differs from classical application oriented development.

data driven process - a process whose resource consumption depends on the data on which it operates. For example, consider a hierarchical root and its dependent. For one occurrence of the root there are two occurrences of the dependent; for another occurrence there are 1,000. The same program that accesses the root and all its dependents will use very different amounts of resources when operating against the two roots, although the code is exactly the same.

data element - (1) an attribute of an entity (2) a uniquely named and well defined category of data that consists of data items and that is included in a record of an activity.

data engineering (see information engineering) - the planning and building of data structures according to accepted mathematical models, on the basis of the inherent characteristics of the data itself, and independent of hardware and software systems.

data independence - the property of being able to modify the overall logical and physical structure of data without changing any of the application code supporting the data.

data item - a discrete representation having the properties that define the data element to which it belongs. See data element.

data item set (dis) - a grouping of data items, each of which directly relates to the key of the grouping of data in which the data items reside. The data item set is found in the mid level model.

data manipulation language (DML) - (1) a programming language that is supported by a DBMS and used to access a data base (2) language constructs added to a higher-order language (eg., COBOL) for the purpose of data base manipulation.

data mart - a department specific data warehouse. There are two types of data marts - independent and dependent. An independent data mart is fed data directly from the legacy environment. A dependent data mart is fed data from the enterprise data warehouse. In the long run, dependent data marts are architecturally much more stable than independent data marts.

data model - (1) the logical data structures, including operations and constraints provided by a DBMS for effective data base processing (2) the system used for the representation of data (eg., the ERD or relational model.)

data record - an identifiable set of data values treated as a unit, an occurrence of a schema in a data base, or a collection of atomic data items describing a specific object, event, or tuple.

data security - the protection of the data in a data base against unauthorized disclosure, alteration, or destruction. There are different levels of security.

data set - a named collection of logically related data items, arranged in a prescribed manner, and described by control information to which the programming system has access.

data storage description language (DSDL) - a language to define the organization of stored data in terms of an operating system and device independent storage environment. See also device media control language.

data structure - a logical relationship among data elements that is designed to support specific data manipulation functions (eg., trees, lists, and tables.)

data type - the definition of a set of representable values that is primitive and without meaningful logical subdivision.

data view - see user view.

data volatility - the rate of change of the content of data.

data warehouse - a collection of integrated subject oriented data bases designed to support the DSS function, where each unit of data is relevant to some moment in time. The data warehouse contains atomic data and lightly summarized data. A data warehouse is a subject oriented, integrated, non volatile, time variant collection of data designed to support management DSS needs

data warehouse administrator (dwa) - the organizational function designed to create and maintain the data warehouse. The dwa combines several disciplines, such as the DA, DBA, end user, etc.

DB2 - a data base management system by IBM.

DB/DC - data base / data communications

DBMS language interface (DB I/O module) - software that applications invoke in order to access a data base. The module in turn has direct access to the DBMS. Standards enforcement and standard error checking are often features of an I/O module.

deadlock - see deadly embrace.

deadly embrace - the event that occurs when transaction A desires to access data currently protected by transaction B, while at the same time transaction B desires to access data that is currently being protected by transaction A. The deadly embrace condition is a serious impediment to performance.

decision support system (DSS) - a system used to support managerial decisions. Usually DSS involves the analysis of many units of data in a heuristic fashion. As a rule, DSS processing does not involve the update of data.

decompaction - the opposite of compaction; once data is stored in a compacted form, it must be decompacted to be used.

decryption - the opposite of encryption. Once data is stored in an encrypted fashion, it must be decrypted in order that it can be used.

degree (of a relation) - the number of attributes or columns of a relation. See cardinality of a relation.

delimiter - a flag, symbol, or convention used to mark the boundaries of a record, field, or other unit of storage.

demand staging - the movement of blocks of data from one storage device to another device with a shorter access time when programs request the blocks and the blocks are not already in the faster access storage.

denormalization - the technique of placing normalized data in a physical location that optimizes the performance of the system.

derived data - data whose existence depends on two or more occurrences of a major subject of the enterprise.

derived data element - a data element that is not necessarily stored but that can be generated when needed (eg., age given current date and date of birth.)
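
The age example can be sketched in Python (a minimal illustration; only the two dates are stored, and the age is generated on demand):

    from datetime import date

    def age(birth_date, current_date):
        # Derived, not stored: generated when needed from two stored elements.
        had_birthday = (current_date.month, current_date.day) >= (birth_date.month, birth_date.day)
        return current_date.year - birth_date.year - (0 if had_birthday else 1)

    print(age(date(1960, 6, 15), date(2009, 1, 20)))   # 48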

derived relation - a relation that can be obtained from previously defined relations by applying some sequence of retrieval and derivation operators (eg., a table that is the join of others plus some projections.) See also virtual relation.

design review - the quality assurance process in which all aspects of a system are reviewed publicly prior to the striking of code.

device media control language (DMCL) - a language used to define the mapping of the data onto the physical storage media. See data storage description language.

dimension table - the table that is joined to a fact table in a star join. The dimension table is the structure that represents the non populous occurrences of data in a data mart.

direct access - retrieval or storage of data by reference to its location on a volume. The access mechanism goes directly to the data in question, as is generally required with online use of data. Also called random access or hashed access.

direct access storage device (DASD) - a data storage unit on which data can be accessed directly without having to progress through a serial file such as a magnetic tape file. A disk unit is a direct access storage device.

directory - a table specifying the relationships between items of data. Sometimes a table or index giving the addresses of data.

distributed catalog - a distributed catalog is needed to achieve site autonomy. The catalog at each site maintains information about objects in the local data bases. The distributed catalog keeps information on replicated and distributed tables stored at that site and information on remote tables located at another site that cannot be accessed locally.

distributed data base - a data base controlled by a central DBMS but in which the storage devices are geographically dispersed or not attached to the same processor. See parallel I/O.

distributed data warehouse - where more than one enterprise data warehouse is built, the combination is called a distributed data warehouse

distributed environment - a set of related data processing systems, where each system has its own capacity to operate autonomously, but with some applications which execute at multiple sites. Some of the systems may be connected with teleprocessing links into a network in which each system is a node.

distributed free space - space left empty at intervals in a data layout to permit insertion of new data.

distributed metadata - metadata that resides at different architectural entities, such as data marts, enterprise data warehouses, ODS, etc.

distributed request - a transaction across multiple nodes

distributed unit of work - the work done by a transaction that operates against multiple nodes

division - an operation that partitions a relation on the basis of the contents of data found in the relation.

DL/1 - IBM's Data Language One, for describing logical and physical data structures

domain - the set of legal values from which actual values are derived for an attribute or a data element.

dormant data - data loaded into a data warehouse that has a future probability of access of zero

download - the stripping of data from one data base to another based on the content of data found in the first data base.

drill down analysis - the type of analysis where examination of a summary number leads to the exploration of the components of the sum.

dual data base - the practice of separating high performance, transaction oriented data from decision support data

dual data base management systems - the practice of using multiple data base management systems to control different aspects of the data base environment

dumb terminal - a device used to interact directly with the end user where all processing is done on a remote computer. A dumb terminal acts as a device that gathers data and displays data only.

dynamic SQL - SQL statements that are prepared and executed within a program while the program is executing. In dynamic SQL the SQL source is contained in host language variables rather than being coded into the application program.

dynamic storage allocation - a technique in which the storage areas assigned to computer programs are determined during processing.

dynamic subset of data - a subset of data selected by a program and operated on only by the program, and released by the program once the program ceases execution.


EDI - Electronic Data Interchange.

EIS (Executive Information Systems) - systems designed for the top executive, featuring drill down analysis and trend analysis.

embedded pointer - a record pointer (i.e., a means of internally linking related records) that is not available to an external index or directory. Embedded pointers are used to reduce search time, but require maintenance overhead.

encoding - a shortening or abbreviation of the physical representation of a data value (eg., male = "M", female = "F")

encryption - the transformation of data from a recognizable form to a form unrecognizable without the algorithm used for the encryption. Encryption is usually done for the purposes of security.

enterprise - the generic term for the company, corporation, agency, or business unit. Usually associated with data modelling.

enterprise data warehouse - a data warehouse holding the most atomic data the corporation has. Two or more enterprise data warehouses may be combined in order to create a distributed data warehouse

entity - a person, place or thing of interest to the data modeller at the highest level of abstraction.

entity-relationship-attribute (ERA) model - a data model that defines entities, the relationship between the entities, and the attributes that have values to describe the properties of entities and/or relationships.

entity-relationship diagram (ERD) - a high level data model - the schematic showing all the entities within the scope of integration and the direct relationship between those entities.

event - a signal that an activity of significance has occurred. An event is noted by the information system.

event discrete data - data relating to the measurement or description of an event.

expert system - a system that captures and automates the usage of human experience and intelligence.

explorer - a DSS end user who operates on a random basis looking at large amounts of detailed data for patterns, associations, and other previously unnoticed relationships

extent - (1) a list of unsigned integers that specifies an array (2) a physical unit of disk storage attached to a data set after the initial allocation of data has been made.

external data - (1) data originating from other than the operational systems of a corporation (2) data residing outside the central processing complex

external schema - a logical description of a user's method of organizing and structuring data. Some attributes or relationships can be omitted from the corresponding conceptual schema or can be renamed or otherwise transformed. See view.

extract - the process of selecting data from one environment and transporting it to another environment.


fact table - the central component of the star join. The fact table is the structure where the vast majority of the occurrences of data in the data mart reside

farmer - a DSS user who repetitively looks at small amounts of data and who often finds what he/she is looking for

field - See data item.

file - a set of related records treated as a unit and stored under a single logical file name.

first in first out (FIFO) - a fundamental ordering of processing in a queue.

first in last out (FILO) - a standard order of processing in a stack.
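
The two orderings above can be contrasted in a short Python sketch (the data is invented for the example):

    from collections import deque

    arrivals = [1, 2, 3]

    queue = deque(arrivals)                    # first in first out
    print(queue.popleft(), queue.popleft())   # 1 2 - the oldest items leave first

    stack = list(arrivals)                     # first in last out
    print(stack.pop(), stack.pop())            # 3 2 - the newest items leave first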

flag - an indicator or character that signals the occurrence of some condition.

flat file - a collection of records containing no data aggregates, nested repeated data items, or groups of data items.

floppy disk - a device for storing data on a personal computer.

foreign key - an attribute that is not a primary key in a relational system, but whose values are the values of the primary key of another relation.

format - the arrangement or layout of data in or on a data medium or in a program definition.

forward recovery - a recovery technique that restores a data base by reapplying all transactions, using after images from a specified point in time, to a copy of the data base taken at that moment in time.

fourth generation language - language or technology designed to allow the end user unfettered access to data.

functional decomposition - the division of operations into hierarchical functions (i.e., activities) that form the basis for procedures.


gigabyte - a measurement of data between a megabyte and a terabyte - 10**9 bytes of data

global data warehouse - a data warehouse that is distributed around the world. In a global data warehouse the system of record resides in the local site.

granularity - the level of detail contained in a unit of data. The more detail there is, the lower the level of granularity. The less detail there is, the higher the level of granularity.

graphic - a symbol produced on a screen representing an object or a process in the real world.


hash - to convert the value of the key of a record into a location on DASD.
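
A minimal sketch of the idea in Python (the randomizing function is invented for the example; see also collision and modulo):

    def home_address(key, locations=1000):
        # Convert the key to a number, then take it modulo the number of
        # physical locations available on the device.
        return sum(ord(ch) for ch in key) % locations

    # Two different keys can hash to the same location - a collision.
    print(home_address("SMITH"), home_address("TIMSH"))   # 389 389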

hash total - a total of the values of one or more fields, used for the purposes of auditability and control.

header record or header table - a record containing common, constant or identifying information for a group of records that follow.

heuristic - the mode of analysis in which the next step is determined by the results of the current step of analysis. Used for decision support processing.

hierarchical model - a data model providing a tree structure for relating data elements or groups of data elements. Each node in the structure represents a group of data elements or a record type. There can be only one root node at the start of the hierarchical structure.

hit - an occurrence of data that satisfies some search criteria.

hit ratio - a measure of the number of records in a file expected to be accessed in a given run. Usually expressed as a percentage - number of input transactions/number of records in the file x 100 = hit ratio

homonyms - identical names that refer to different attributes.

horizontal distribution - the splitting of a table across different sites by rows. With horizontal distribution rows of a single table reside at different sites in a distributed data base network.

host - the processor receiving and processing a transaction.

Huffman code - a code for data compaction in which frequently used characters are encoded with fewer bits than infrequently used characters.
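
A compact Python sketch of building Huffman codes (the sample text is invented; this is an illustration, not a production encoder):

    import heapq
    from collections import Counter

    def huffman_codes(text):
        # Repeatedly merge the two least frequent nodes; characters used
        # frequently end up near the root and receive the fewest bits.
        heap = [[freq, [ch, ""]] for ch, freq in Counter(text).items()]
        heapq.heapify(heap)
        while len(heap) > 1:
            lo, hi = heapq.heappop(heap), heapq.heappop(heap)
            for pair in lo[1:]:
                pair[1] = "0" + pair[1]
            for pair in hi[1:]:
                pair[1] = "1" + pair[1]
            heapq.heappush(heap, [lo[0] + hi[0]] + lo[1:] + hi[1:])
        return dict(heap[0][1:])

    print(huffman_codes("AAAABBC"))   # {'C': '00', 'B': '01', 'A': '1'}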


IDMS - a network DBMS from CA.

IEEE - Institute of Electrical and Electronics Engineers.

image copy - a procedure in which a data base is physically copied to another medium for the purposes of backup.

IMS - Information Management System - an operational DBMS by IBM.

index - the portion of the storage structure maintained to provide efficient access to a record when its index key item is known.

index chains - chains of data within an index.

indexed sequential access method (ISAM) - a file structure and access method in which records can be processed sequentially (eg., in order, by key) or by directly looking up their locations on a table, thus making it unnecessary to process previously inserted records.

index point - a hardware reference mark on a disk or drum; used for timing purposes.

indirect addressing - any method of specifying or locating a record through calculation (eg., locating a record through the scan of an index)

information - data that human beings assimilate and evaluate to solve a problem or make a decision.

information center - the organizational unit charged with identifying and accessing information needed in DSS processing.

information engineering (IE) - the discipline of creating a data driven development environment.

Informix - a leading data warehouse vendor

input/output (I/O) - the means by which data is stored and/or retrieved on DASD. I/O is measured in milliseconds (i.e., mechanical speeds) whereas computer processing is measured in nanoseconds (i.e., electronic speeds).

instance - a set of values representing a specific entity belonging to a particular entity type. A single value is also the instance of a data item.

integration/transformation (i/t) program - a program designed to convert and move data from the legacy environment to the data warehouse environment. I/T programs are notoriously unstable and require constant maintenance

integrity - the property of a data base that ensures that the data contained in the data base is as accurate and consistent as possible.

intelligent data base - a data base that contains shared logic as well as shared data and automatically invokes that logic when the data is accessed. Logic, constraints, and controls relating to the use of the data are represented in an intelligent data model.

interactive - a mode of processing that combines some of the characteristics of online transaction processing and batch processing. In interactive processing the end user interacts with data over which he/she has exclusive control. In addition, the end user can initiate background activity to be run against the data.

interleaved data - data from different tables mixed into a single table space, where there is commonality of physical colocation based on a common key value.

internal schema - the schema that describes logical structures of the data and the physical media over which physical storage is mapped.

internet - a network that connects many public users

interpretive - a mode of data manipulation in which the commands to the DBMS are translated as the user enters them (as opposed to the programmed mode of process manipulation.)

intersection data - data that is associated with the junction of two or more record types or entities, but which has no meaning when disassociated from the records or entities forming the junction.

intranet - a network that connects many private users

inverted file - a file structure that uses an inverted index, where entries are grouped according to the content of the key being referenced. Inverted files provide for the fast spontaneous searching of files.

inverted index - an index structure organized by means of a nonunique key to speed the search for data by content.

inverted list - a list organized around a secondary index instead of around a primary key.

I/O - input / output operation. Input / output operations are the key to performance because they operate at mechanical speeds, not at electronic speeds.

I/O bound - the point after which no more processing can be done because the I/O subsystem is saturated.

ISAM - see Indexed Sequential Access Method.

"is a type of" - an analytical tool used in abstracting data during the process of conceptual data base design (eg., a cocker spaniel is a type of dog.)

ISDN (Integrated Services Digital Network) - telecommunications technology that enables companies to transfer data and voice through the same phone lines.

ISO - International Organization for Standardization

item - see data item.

item type - a classification of an item according to its domain, generally in a gross sense.

iterative analysis - the mode of processing in which the next step of processing depends on the results obtained by the existing step in execution; heuristic processing.


jad (joint application design) - an organization of people - usually end users - to create and refine application system requirements.

join - an operation that takes two relations as operands and produces a new relation by concatenating the tuples and matching the corresponding columns when a stated condition holds between the two.
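
For illustration, a join of two small relations in Python (the relation and column names are hypothetical):

    employees   = [("jones", 10), ("smith", 20), ("brown", 10)]
    departments = [(10, "sales"), (20, "audit")]

    # Concatenate tuples wherever the stated condition (equal dept numbers) holds.
    joined = [(name, dept, dept_name)
              for (name, dept) in employees
              for (d, dept_name) in departments
              if dept == d]

    print(joined)
    # [('jones', 10, 'sales'), ('smith', 20, 'audit'), ('brown', 10, 'sales')]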

judgment sample - a sample of data where data is accepted or rejected for the sample based on one or more parameters.

junction - from the network environment, an occurrence of data that has two or more parent segments. For example, an order for supplies must have a supplier parent and a part parent.

justify - to adjust the value representation in a character field to the right or to the left, ignoring blanks encountered.


keeplist - a sequence of data base keys maintained by the DBMS for the duration of the session.

key - a data item or combination of data items used to identify or locate a record instance (or other similar data groupings.)

key compression - a technique for reducing the number of bits in keys; used in making indexes occupy less space.

key, primary - a unique attribute used to identify a single record in a data base.

key, secondary - a nonunique attribute used to identify a class of records in a data base.


label - a set of symbols used to identify or describe an item, record, message, or file. Occasionally a label may be the same as the address of the record in storage.

language - a set of characters, conventions, and rules used to convey information and consisting of syntax and semantics.

latency - the time taken by a DASD device to position the read arm over the physical storage medium. For general purposes, average latency time is used.

least frequently used (LFU) - a replacement strategy in which new data must replace existing data in an area of storage; the least frequently used items are replaced.

least recently used (LRU) - a replacement strategy in which new data must replace existing data in an area of storage; the least recently used items are replaced.

legacy environment - the transaction oriented, application based environment

level of abstraction - the level of abstraction appropriate to a dimension. The level of abstraction which is appropriate is entirely dependent on the ultimate user of the system.

line - the hardware by which data flows to or from the processor. Lines typically go to terminals, printers, and other processors.

line polling - the activity of the teleprocessing monitor in which different lines are queried to determine whether they have data and/or transactions that need to be transmitted.

line time - the length of time required for a transaction to go from either the terminal to the processor or the processor to the terminal. Typically line time is the single largest component of online response time.

linkage - the ability to relate one unit of data to another.

linked list - set of records in which each record contains a pointer to the next record on the list. See chain.
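
A minimal Python sketch of the chain (the record addresses are invented):

    # Each record carries a pointer - here an address into a table - to the next record.
    records = {1: ("first", 3), 3: ("second", 7), 7: ("third", None)}

    addr = 1                            # address of the head of the chain
    while addr is not None:
        value, next_addr = records[addr]
        print(value)                    # first, then second, then third
        addr = next_addr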

list - an ordered set of data items.

living sample - a representative data base typically used for heuristic statistical analytical processing in place of a large data base. Periodically the very large data base is selectively stripped of data so that the resulting living sample data base represents a cross section of the very large data base as of some moment in time.

load - to insert data values into a data base that was previously empty.

local site support - within a distributed unit of work, a local site update allows a process to perform SQL update statements referring to the local site.

local transaction - in a distributed DBMS, a transaction that requires reference only to data that is stored at the site where the transaction originated.

locality of processing - in distributed data base, the design of processing so that remote access of data is eliminated or reduced substantively.

lockup - the event that occurs when update is done against a data base record and the transaction has not yet reached a commit point. The online transaction needs to prevent other transactions from accessing the data while update is occurring.

log - a journal of activity.

logging - the automatic recording of data with regard to the access of the data, the updates to the data, etc.

logical representation - a data view or description that does not depend on a physical storage device or a computer program.

loss of identity - when data is brought in from an external source and the identity of the external source is discarded, loss of identity occurs. A common practice with microprocessor data.

LU6.2 - Logical Unit Type 6.2 - peer to peer data stream with network operating system for program to program communication. LU6.2 allows mid-range machines to talk to one another without the involvement of the mainframe.


machine learning - the ability of a machine to improve its performance automatically based on past performance.

magnetic tape - (1) the storage medium most closely associated with sequential processing (2) a large ribbon on which magnetic images are stored and retrieved.

main storage data base (msdb) - a data base that resides entirely in main storage. Such data bases are very fast to access, but require special handling at the time of update. Another limitation of msdb's is that they can manage only small amounts of data.

master file - a file that holds the system of record for a given set of data (usually bound by an application.)

maximum transaction arrival rate (MTAR) - the rate of arrival of transactions at the moment of peak period processing.

megabyte - a measurement of data - 10**6 bytes of data

message - (1) the data input by the user in the online environment that is used to drive a transaction (2) the output of a transaction.

metadata - (1) data about data (2) the description of the structure, content, keys, indexes, etc. of data

metalanguage - a language used to specify other languages.

microprocessor - a small processor serving the needs of a single user.

migration - the process by which frequently used items of data are moved to more readily accessible areas of storage and infrequently used items of data are moved to less readily accessible areas of storage.

mips (million instructions per second) - the standard measurement of processor speed for minicomputers and mainframe computers.

mode of operation - a classification for systems that execute in a similar fashion and share distinctive operational characteristics. Some modes of operation are operational, DSS, online, interactive, etc.

modulo - an arithmetic term describing the remainder of a division process. 10 modulo 7 is 3. Modulo is usually associated with the randomization process.

multilist organization - a chained file organization in which the chains are divided into fragments and each fragment is indexed. This organization of data permits faster access to the data.

multiple key retrieval - retrieval that requires searches of data on the basis of the values of several key fields (some or all of which are secondary keys.)

MVS - Multiple Virtual Storage - IBM's mainline operating system for mainframe processors. There are several extensions of MVS.


Named Pipes - a program to program protocol provided with Microsoft's LAN Manager. The Named Pipes API supports intra and inter machine process to process communications.

natural forms:

first normal form - data that has been organized into two dimensional flat files without repeating groups (see the sketch after this list)

second normal form - data that functionally depends on the entire candidate key

third normal form - data that has had all transitive dependencies on data items other than the candidate key removed.

fourth normal form - data whose candidate key is related to all data items in the record and that contains no more than one nontrivial multivalued dependency on the candidate key.
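
As a minimal sketch of the first of these forms (hypothetical customer/order data in Python), a repeating group is flattened into one row per occurrence:

    # A repeating group: one record carries many order numbers.
    unnormalized = {"cust": "C1", "orders": ["O1", "O2", "O3"]}

    # First normal form: flat, two dimensional rows with no repeating groups.
    first_nf = [(unnormalized["cust"], order) for order in unnormalized["orders"]]
    print(first_nf)   # [('C1', 'O1'), ('C1', 'O2'), ('C1', 'O3')]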

natural join - a join in which the redundant logical components generated by the join are removed.

natural language - a language generally spoken, whose rules are based on current usage and not explicitly defined by a grammar.

navigate - to steer a course through a data base, from record to record, by means of an algorithm which examines the content of data.

network - a computer network consists of a collection of circuits, data switching elements and computing systems. The switching devices in the network are called communication processors. A network provides a configuration for computer systems and communication facilities within which data can be stored and accessed and within which DBMS can operate.

network model - a data model that provides data relationships on the basis of records, and groups of records (i.e., sets) in which one record is designated as the set owner, and a single member record can belong to one or more sets.

nine's complement - transformation of a numeric field calculated by subtracting the initial value from a field consisting of all nines (eg., the nine's complement of 0123 is 9876).

node - a point in the network at which data is switched.

nonprocedural language - syntax that directs the computer as to what to do, not how to do it. Typical non procedural languages include RAMIS, FOCUS, NOMAD, and SQL.

normalize - to decompose complex data structures into natural structures.

NT - an operating system built by Microsoft

null - an item or record for which no value currently exists or possibly may ever exist.

numeric - a representation using only numbers and the decimal point.


occurrence - see instance.

offset pointer - an indirect pointer. An offset pointer exists inside a block and the index points to the offset. If data must be moved, only the offset pointer in the block must be altered; the index entry remains untouched.

online storage - storage devices and storage medium where data can be accessed in a direct fashion.

operating system - software that enables a computer to supervise its own operations and automatically call in programs, routines, languages, and data as needed for continuous operation throughout the execution of different types of jobs.

operational application - see legacy application

operational data - data used to support the daily processing a company does.

operational data store (ODS) - the form that the data warehouse takes in the operational environment. Operational data stores can be updated, do provide rapid and consistent response time, and contain only a limited amount of historical data.

operations - the department charged with the running of the computer.

optical disk - a storage medium using lasers as opposed to magnetic devices. Optical disk is typically write once, is much less expensive per byte than magnetic storage, and is highly reliable.

ORACLE - a DBMS by ORACLE Corp.

order - to place items in an arrangement specified by such rules as numeric or alphabetic order. See sort.

OS/2 - the operating system for IBM's Personal System / 2.

OSF - Open Software Foundation

OSI - Open Systems Interconnection

overflow - (1) the condition in which a record or a segment cannot be stored in its home address because the address is already occupied. In this case the data is placed in another location referred to as overflow. (2) the area of DASD where data is sent when the overflow condition is triggered.

ownership - the responsibility for the update of operational data.


padding - a technique used to fill a field, record, or block with default data (eg., blanks or zeros)

page - (1) a basic unit of data on DASD (2) a basic unit of storage in main memory.

page fault - a program interruption that occurs when a page that is referred to is not in main memory and must be read in from external storage.

page fixed - the state in which programs or data cannot be removed from main storage. Only a limited amount of storage can be page fixed.

paging - in virtual storage systems, the technique of making memory appear to be larger than it really is by transferring blocks (pages) of data or programs between main memory and external storage.

parallel data organization - an arrangement of data in which the data is spread over independent storage devices and is managed independently.

parallel I/O - the process of accessing or storing data on multiple physical data devices.

parallel search storage - a storage device in which one or more parts of all storage locations are queried simultaneously for a certain condition or under certain parameters. See associative storage.

parameter - an elementary data value used as a criterion for qualification, usually of searches of data or in the control of modules.

parent - a unit of data in a 1:n relationship with another unit of data called a child, where the parent can exist independently, but the child cannot exist unless there is a parent.

parsing - the algorithm that translates syntax into meaningful machine instructions. Parsing determines the meaning of statements issued in the data manipulation language.

partition - a segmentation technique in which data is divided into physically different units. Partitioning can be done at the application or the system level.

path length - the number of instructions executed for a given program or instruction.

peak period - the time when the most transactions arrive at the computer with the expectation of execution.

performance - the length of time from the moment a request is issued until the first of the results of the request are received.

periodic discrete data - a measurement or description of data taken at a regular time interval.

physical representation - (1) the representation and storage of data on a medium such as magnetic storage (2) the description of data that depends on such physical factors as length of elements, records, pointers, etc.

pipes - vehicles for passing data from one application to another.

plex or network structure - a relationship between records or other groupings of data in which a child record can have more than one parent record.

plug compatible manufacturer (PCM) - a manufacturer of equipment that functionally is identical to that of another manufacturer (usually IBM).

pointer - the address of a record or other groupings of data contained in another record so that a program may access the former record when it has retrieved the latter record. The address can be absolute, relative, or symbolic, and hence the pointer is referred to as absolute, relative, or symbolic.

pools - the buffers made available to the online controller.

populate - to place occurrences of data values in a previously empty data base. See load.

precision - the degree of discrimination with which a quantity is stated. For example, a three-digit numeral discriminates among 1,000 possibilities, from 000 to 999.

precompilation - the processing of source text prior to compilation. In an SQL environment, SQL statements are replaced with statements that will be recognized by the host language compiler.

prefix data - data in a segment or a record used exclusively for system control, usually unavailable to the user.

primary key - an attribute that contains values that uniquely identify the record in which the key exists.
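
In SQL terms the primary key is typically declared with the table; a minimal sketch, with hypothetical names:

    CREATE TABLE customer (
        customer_id  INTEGER NOT NULL PRIMARY KEY,
        name         VARCHAR(60),
        city         VARCHAR(30)
    );
    -- every customer_id value must be unique and not null,
    -- so it identifies exactly one row (record)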

primitive data - data whose existence depends on only a single occurrence of a major subject area of the enterprise.

privacy - the prevention of unauthorized access and manipulation of data.

privilege descriptor - a persistent object used by a DBMS to enforce constraints on operations.

problems data base - the component of a DSS application where previously defined decision parameters are stored. A problems data base is consulted to review characteristics of past decisions and to determine ways to meet current decision making needs.

processor - the hardware at the center of execution of computer programs. Generally speaking processors are divided into three categories - mainframes, minicomputers, and microcomputers.

processor cycles - the hardware's internal cycles that drive the computer (eg., initiate I/O, perform logic, move data, perform arithmetic functions, etc.)

production environment - the environment where operational, high performance processing is run.

program area - the portion of main memory in which application programs are executed.

progressive overflow - a method of handling overflow in a randomly organized file that does not require the use of pointers. An overflow record is stored in the first available space and is retrieved by a forward serial search from the home address.

projection - an operation that takes one relation as an operand and returns a second relation that consists of only the selected attributes or columns, with duplicate rows eliminated.
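
Reusing the hypothetical customer table sketched under primary key, a projection in SQL keeps only the named columns, with DISTINCT eliminating the duplicate rows the definition mentions:

    SELECT DISTINCT city
    FROM customer;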

proposition - a statement about entities that asserts or denies that some condition holds for those entities.

protocol - the call format used by a teleprocessing monitor.

punched cards - an early storage medium on which data and input were stored. Today punched cards are rare.

purge date - the date on or after which a storage area may be overwritten. Used in conjunction with a file label, it is a means of protecting file data until an agreed upon release date is reached.


query language - a language that enables an end user to interact directly with a DBMS to retrieve and possibly modify data managed under the DBMS
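
SQL is the most widely used example; a retrieval against the hypothetical customer table might read:

    SELECT name, city
    FROM customer
    WHERE city = 'Denver';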


record - an aggregation of values of data organized by their relation to a common key

record-at-a-time processing - the access of data a record at a time, a tuple at a time, etc.

recovery - the restoration of the database to an original position or condition, often after major damage to the physical medium

Red Brick - a DSS data base management system by Red Brick Corp.

redundancy - the practice of storing more than one occurrence of data. In the case where data can be updated, redundancy poses serious problems. In the case where data is not updated, redundancy is often a valuable and necessary design tool.

referential integrity - the facility of a DBMS to ensure the validity of a predefined relationship.
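
A minimal SQL sketch, again with hypothetical names: a foreign key constraint makes the DBMS enforce the relationship between orders and customer:

    CREATE TABLE orders (
        order_id     INTEGER NOT NULL PRIMARY KEY,
        customer_id  INTEGER NOT NULL,
        FOREIGN KEY (customer_id) REFERENCES customer (customer_id)
    );
    -- an orders row whose customer_id has no matching
    -- customer row is now rejected by the DBMS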

reorganization - the process of unloading data in a poorly organized state and reloading the data in a well organized state. Reorganization in some DBMSs is used to restructure data. Reorganization is often called a "reorg" or an "unload/reload" process.

repeating groups - a collection of data that can occur several times within a given record occurrence
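
A hypothetical sketch of the two ways such data is commonly held: the repeating group flattened into the record itself, versus a separate normalized table:

    -- repeating group carried inside one record occurrence
    CREATE TABLE employee (
        emp_id   INTEGER NOT NULL PRIMARY KEY,
        phone_1  VARCHAR(15),
        phone_2  VARCHAR(15),
        phone_3  VARCHAR(15)
    );

    -- normalized alternative: each occurrence becomes its own row
    CREATE TABLE employee_phone (
        emp_id  INTEGER NOT NULL,
        phone   VARCHAR(15) NOT NULL
    );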

rolling summary - a form of storing archival data in which the most recent data is kept at the finest level of detail and progressively older data is kept at progressively higher levels of summarization.
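
As a hypothetical SQL sketch (all names are invented), aging detail might be rolled up into monthly summary rows before the detail is purged:

    -- one summary row per account per month of old activity
    INSERT INTO account_summary (account_id, yr, mo, total_amount)
    SELECT account_id,
           EXTRACT(YEAR FROM tx_date),
           EXTRACT(MONTH FROM tx_date),
           SUM(amount)
    FROM account_detail
    WHERE tx_date < DATE '1999-01-01'
    GROUP BY account_id,
             EXTRACT(YEAR FROM tx_date),
             EXTRACT(MONTH FROM tx_date);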


scope of integration - the formal definition of the boundaries of the system being modelled

SDLC - system development life cycle - the classical operational system development life cycle that typically includes requirements gathering, analysis, design, programming, testing, integration, and implementation. Sometimes called a "waterfall" development life cycle.

sequential file - a file in which records are ordered according to the values of one or more key fields. The records can be processed in this sequence starting from the first record in the file, continuing to the last record in the file.

serial file - a sequential file in which the records are physically adjacent, in sequential order.

set-at-a-time processing - access of data by groups, each member of which satisfies some selection criteria
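
SQL is the classic example; a single statement qualifies and returns the whole set at once, in contrast to record-at-a-time navigation (names are hypothetical):

    SELECT account_id, balance
    FROM account
    WHERE balance > 10000;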

snapshot - a database dump or the archiving of data as of some one moment in time

snowflake structure - the grouping together of two or more star joins

star join - a denormalized form of organizing data, optimized for access by a group of users, usually a department. Star joins are usually associated with data marts. Star joins were popularized by Ralph Kimball
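
A minimal sketch of a star, with hypothetical names: a central fact table carries the measures, the surrounding dimension tables carry the descriptions, and the star join itself relates the two:

    CREATE TABLE sale_fact (
        date_key     INTEGER,
        product_key  INTEGER,
        store_key    INTEGER,
        amount       DECIMAL(9,2)
    );

    CREATE TABLE product_dim (
        product_key  INTEGER NOT NULL PRIMARY KEY,
        description  VARCHAR(60)
    );

    -- the star join: the fact table joined to a dimension
    SELECT p.description, SUM(f.amount) AS total_sales
    FROM sale_fact f
    JOIN product_dim p ON p.product_key = f.product_key
    GROUP BY p.description;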

storage hierarchy - storage units linked to form a storage subsystem, in which some units are fast to access but small and expensive, while other units are slow to access but large and inexpensive

subject database - a database organized around a major subject of the corporation. Classical subject databases are for customer, transaction, product, part, vendor, etc.

Sybase - a data base management system by Sybase Corp

system log - an audit trail of relevant system events (for example, transaction entries, database changes, etc.)

system of record - the definitive and singular source of operational data or metadata. If data element abc has a value of 25 in a database record but a value of 45 in the system of record, by definition the first value must be incorrect. The system of record is useful for the management of redundancy of data. For metadata, at any one moment in time, each unit of metadata is owned by one and only one organizational unit.


table - a relation that consists of a set of columns with a heading and a set of rows (i.e., tuples)

Teradata - a data base management system by NCR

terabyte - a measurement of a large amount of data; one terabyte is 10**12 bytes

time stamping - the practice of tagging each record with some moment in time, usually when the record was created or when the record was passed from one environment to another.
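
In SQL, a common approach (hypothetical names; DEFAULT CURRENT_TIMESTAMP is widely but not universally supported) is to let the DBMS tag each row as it is created:

    CREATE TABLE shipment (
        shipment_id  INTEGER NOT NULL PRIMARY KEY,
        created_at   TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    );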

time variant data - data whose accuracy is relevant to some one moment in time. The three common forms of time variant data are continuous time span data, event discrete data, and periodic discrete data. See current value data.

transaction processing - the activity of executing many short, fast running programs, providing the end user with consistent two to three second response time

transition data - data possessing both primitive and derived characteristics; usually very sensitive to the running of the business. Typical transition data include interest rates for a bank, policy rates for an insurance company, retail sale prices for a manufacturer/distributor, etc.

trend analysis - the process of looking at homogeneous data over a spectrum of time

true archival data - data at the lowest level of granularity in the current level detail data base


UNIX - a popular operating system for data warehouses

update - to change, add, delete, or replace values in all or selected entries, groups, or attributes stored in a database
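
In SQL, reusing the hypothetical customer table from earlier entries:

    UPDATE customer
    SET city = 'Boulder'
    WHERE customer_id = 101;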

user - a person or process issuing commands or messages and receiving stimuli from the information system


Zachman framework - a specification for organizing blueprints for information systems, popularized by John Zachman
