Building the data warehouse 1st Chapter summary

The data warehouse requires an architecture that begins by looking at the whole and then works down to the particulars. Certainly, details are important throughout the data warehouse. But details are important only when viewed in a broader context.

The Evolution

ecision support systems (DSS)

The magnetic tapes were good for storing a large volume of data cheaply, but the drawback was that they had to be accessed sequentially.

===============================================================

The proliferation of master files and redundant data presented some very insidious problems:

The need to synchronize data upon update

The complexity of maintaining programs

The complexity of developing new programs

The need for extensive amounts of hardware to support all the master files

================================================================

Disk storage was fundamentally different from magnetic tape storage in that data could be accessed directly on a DASD.

The Advent of DASD

Disk storage was fundamentally different from magnetic tape storage in that data could be accessed directly on a DASD.

===============================================================

The purpose of the DBMS was to make it easy for the programmer to store and access data on a DASD. In addition, the DBMS took care of such tasks as storing data on a DASD, indexing data, and so forth.

PC/4GL Technology

Enter the Extract Program

The extract program is the simplest of all programs. It rummages through a file or database, uses some criteria for selecting data, and, on finding qualified data, transports the data to another file or database.

The extract program became very popular for at least two reasons:

Because extract processing can move data out of the way of high performance online processing, there is no conflict in terms of performance when the data needs to be analyzed en masse.

When data is moved out of the operational, transaction-processing domain with an extract program, a shift in control of the data occurs. The end user then owns the data once he or she takes control of it. For these (and probably a host of other) reasons, extract processing was soon found everywhere.

The Spider Web

The larger and more mature the organization, the worse the problems of the naturally evolving architecture become.

Problems with the Naturally Evolving Architecture

The naturally evolving architecture presents many challenges, such as:

Data credibility:

No time basis of data

The algorithmic differential of data

The levels of extraction

The problem of external data

No common source of data from the beginning

Productivity:

Locate and analyze the data for the report.

Compile the data for the report.

Get programmer/analyst resources to accomplish these two tasks.

Inability to transform data into information:

The systems found in the naturally evolving architecture are simply inadequate for supporting information needs. They lack integration and there is a discrepancy between the time horizon (or parameter of time) needed for analytical processing and the available time horizon that exists in the applications.

A Change in Approach

Primitive data and derived data:

============================================================

Following are some other differences between the two.

Primitive data is detailed data used to run the day-to-day operations of the company. Derived data has been summarized or otherwise calculated to meet the needs of the management of the company.

Primitive data can be updated. Derived data can be recalculated but cannot be directly updated.

Primitive data is primarily current-value data. Derived data is often historical data.

Primitive data is operated on by repetitive procedures. Derived data is operated on by heuristic, non-repetitive programs and procedures.

Operational data is primitive; DSS data is derived. Primitive data supports the clerical function. Derived data supports the managerial function.

=============================================================

Primitive data and derived data are so different that they do not reside in the same database or even the same environment.

The Architected Environment

These different levels of data are the basis of a larger architecture called the corporate information factory (CIF).

The operational level of data holds application-oriented primitive data only and primarily serves the high-performance transaction-processing community.

The data warehouse level of data holds integrated, historical primitive data that cannot be updated. In addition, some derived data is found there.

The departmental or data mart level of data contains derived data almost exclusively. The departmental or data mart level of data is shaped by end-user requirements into a form specifically suited to the needs of the department.

While data in the data mart certainly relates to data found in the operational level or the data warehouse, the data found in the departmental or data mart environment is fundamentally different from the data found in the data warehouse environment, because data mart data is denormalized, summarized, and shaped by the operating requirements of a single department.

The individual level of data is where much heuristic analysis is done.

==============================================================

Some people believe the architected environment generates too much redundant data. Though it is not obvious at first glance, this is not the case at all. Instead, it is the spider web environment that generates the gross amounts of data redundancy.

=============================================================

Note that the records in the data warehouse do not overlap. Also note that there is some element of time associated with each record in the data warehouse.

Data Integration in the Architected Environment

Who Is the User?

It is important to peer inside the head of the DSS analyst and view how he or she perceives the use of the data warehouse. The DSS analyst has a mindset of “Give me what I say I want, and then I can tell you what I really want.” In other words, the DSS analyst operates in a mode of discovery. Only on seeing a report or seeing a screen can the DSS analyst begin to explore the possibilities for DSS.

The Development Life Cycle

the operational environment is supported by the classical systems development life cycle (the SDLC). The SDLC is often called the “waterfall” development approach because the different activities are specified and one activity—upon its completion—spills down into the next activity and triggers its start.

The CLDS starts with data. Once the data is in hand, it is integrated and then tested to see what bias there is to the data, if any. Programs are then written against the data. The results of the programs are analyzed, and finally the requirements of the system are understood. Once the requirements are understood, adjustments are made to the design of the system, and the cycle starts all over again for a different set of data. Because of the constant resetting of the development life cycle for different types of data, the CLDS development approach is usually called a “spiral” development methodology.

Patterns of Hardware Utilization

Building the data warehouse 1st Chapter summary

推荐阅读更多精彩内容