Translate
Sunday, March 30, 2008
Difference between Transactional and Warehoused Data
TRANSACTION PROCESS Vs DECISION SUPPORT
1) Operational & Analytical
2) Simple Queries & Complex Queries
3) Results of 1-30 rows & Results of 1-30 million rows
4) Data manipulation & Data retrieval
5) Dynamic Data & Static Data.
DATA WAREHOUSING
Data warehousing is essentially what you need to do in order to create a data warehouse, and what you do with it. It is the process of creating, populating, and then querying a data warehouse and can involve a number of discrete technologies such as:
· Source System Identification: In order to build the data warehouse, the appropriate data must be located. Typically, this will involve both the current OLTP (On-Line Transaction Processing) system where the "day-to-day" information about
· Data Warehouse Design and Creation: This describes the process of designing the warehouse, with care taken to ensure that the design supports the types of queries the warehouse will be used for. This is an involved effort that requires both an understanding of the database schema to be created, and a great deal of interaction with the user community. The design is often an iterative process and it must be modified a number of times before the model can be stabilized. Great care must be taken at this stage, because once the model is populated with large amounts of data, some of which may be very difficult to recreate, the model can not easily be changed.
· Data Acquisition: This is the process of moving company data from the source systems into the warehouse. It is often the most time-consuming and costly effort in the data warehousing project, and is performed with software products known as ETL (Extract/Transform/Load) tools. There are currently over 50 ETL tools on the market. The data acquisition phase can cost millions of dollars and take months or even years to complete. Data acquisition is then an ongoing, scheduled process, which is executed to keep the warehouse current to a pre-determined period in time, (i.e. the warehouse is refreshed monthly).
· Changed Data Capture: The periodic update of the warehouse from the transactional system(s) is complicated by the difficulty of identifying which records in the source have changed since the last update. This effort is referred to as "changed data capture". Changed data capture is a field of endeavor in itself, and many products are on the market to address it. Some of the technologies that are used in this area are Replication servers, Publish/Subscribe, Triggers and Stored Procedures, and Database Log Analysis.
· Data Cleansing: This is typically performed in conjunction with data acquisition (it can be part of the "T" in "ETL"). A data warehouse that contains incorrect data is not only useless, but also very dangerous. The whole idea behind a data warehouse is to enable decision-making. If a high level decision is made based on incorrect data in the warehouse, the company could suffer severe consequences, or even complete failure. Data cleansing is a complicated process that validates and, if necessary, corrects the data before it is inserted into the warehouse. For example, the company could have three "Customer Name" entries in its various source systems, one entered as "IBM", one as "I.B.M.", and one as "International Business Machines". Obviously, these are all the same customer. Someone in the organization must make a decision as to which is correct, and then the data cleansing tool will change the others to match the rule. This process is also referred to as "data scrubbing" or "data quality assurance". It can be an
· Data Aggregation: This process is often performed during the "T" phase of ETL, if it is performed at all. Data warehouses can be designed to store data at the detail level (each individual transaction), at some aggregate level (summary data), or a combination of both. The advantage of summarized data is that typical queries against the warehouse run faster. The disadvantage is that information, which may be needed to answer a query, is lost during aggregation. The tradeoff must be carefully weighed, because the decision can not be undone without rebuilding and repopulating the warehouse. The safest decision is to build the warehouse with a high level of detail, but the cost in storage can be extreme.Now that the warehouse has been built and populated, it becomes possible to extract meaningful information from it that will provide a competitive advantage and a return on investment. This is done with tools that fall within the general rubric of "Business Intelligence".
Data Warehouse:
The term Data Warehouse was coined by Bill Inmon in 1990, which he defined in the following way: "A warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process".
He defined the terms in the sentence as follows:
· Subject Oriented: Data that gives information about a particular subject instead of about a company's ongoing operations.
· Integrated: Data that is gathered into the data warehouse from a variety of sources and merged into a coherent whole.
· Time-variant: All data in the data warehouse is identified with a particular time period.
· Non-volatile: Data is stable in a data warehouse. More data is added but data is never removed. This enables management to gain a consistent picture of the business.
(Source: "What is a Data Warehouse?" W.H. Inmon, Prism, Volume 1, Number 1, 1995). This definition remains reasonably accurate almost ten years later. However, a single-subject data warehouse is typically referred to as a data mart, while data warehouses are generally enterprise in scope. Also, data warehouses can be volatile. Due to the large amount of storage required for a data warehouse, (multi-terabyte data warehouses are not uncommon), only a certain number of periods of history are kept in the warehouse. For instance, if three years of data are decided on and loaded into the warehouse, every month the oldest month will be "rolled off" the database, and the newest month added.
Ralph Kimball provided a much simpler definition of a data warehouse. As stated in his book, "The Data Warehouse Toolkit", on page 310, a data warehouse is "a copy of transaction data specifically structured for query and analysis". This definition provides less insight and depth than Mr. Inmon's, but is no less accurate.
Biographical Information
Bill Inmon is universally recognized as the "father of the data warehouse." He has over 26 years of database technology management experience and data warehouse design expertise, and has published 36 books and more than 350 articles in major computer journals. His books have been translated into nine languages. He is known globally for his seminars on developing data warehouses and has been a keynote speaker for every major computing association. Before founding Pine Cone Systems, Bill was a co-founder of Prism Solutions, Inc.
Ralph Kimball
Ralph Kimball was co-inventor of the Xerox Star workstation, the first commercial product to use mice, icons, and windows. He was vice president of applications at Metaphor Computer Systems, and founder and CEO of Red Brick Systems. He has a Ph.D. from Stanford in electrical engineering, specializing in man-machine systems. Ralph is a leading proponent of the dimensional approach to designing large data warehouses. He currently teaches data warehousing design skills to IT groups, and helps selected clients with specific data warehouse designs. Ralph is a columnist for Intelligent Enterprise magazine and has a relationship with Sagent Technology, Inc., a data warehouse tool vendor. His book "The Data Warehouse Toolkit" is widely recognized as the seminal work on the subject.
Wednesday, March 26, 2008
INTRODUCTION TO DATA WAREHOUSE
INTRODUCTION TO DATA WAREHOUSE
DATA WAREHOUSE DEFINITION ( 1 )
· To take facts based decisions
· Building intelligence to business.
· Converting information into actionable decisions.
· Discovering unknown things.
· Increase Sales.
· Deepen Customer relationships
· Build better products
· Provide better services
· Streamline operations
· Reduce costs.
· Data Mart
· OLAP
· Mining
Start Learning DataWarehousing
I am a Data Warehousing developer. I have some basic knowledge in Data warehousing and ETL and Reporting tools. In my blog i decide to share that knowledge with other. I think sure it will be use for readers. If u have any doubts mail me sundar.skec@gmail.com. friends learn the content in date wise then it will be easy to understand.
Introduction to Data Warehousing:
- Data Warehousing Concept
- Difference between Transactional and Warehoused Data
- Data Warehousing Architecture
- Logical & Physical Design
- Difference between Data Warehouse & Data Mart
- Source Data ,Staging Area & Target Data
- Operational Data Source (ODS)
- Star Schema & Snow Flake Schema
Data Warehouse Design and Modeling:
Dimension Model (Schema) for Data Warehouse
- Defining Dimensions And Measures
- Defining Levers and Hierarchies
- Granularity of data
- Introducing Star Schema
- Introducing Snow Flake Schema
- Dimension Table And Fact Table
ETL (Extract, Transform & Load) Using Informatica:
- Designer : Defining Target for Staging data base & Data Warehouse
- Designer : Creating Multi-dimensional Cubes
- Designer : Creating Mapping (with different transformation) and Mapplets
- Transformation
- Filter
- Router
- Sorter
- Rank
- Aggregator
- Expression
- update staterge
- Lookup
- Normalizer
- stored Procedure
- source qualifier
- Task
- Session
- command
- Event Wait
- Event Raise
- Timer
- Assignment
- Control
- Server Manager : Creating Sessions and batches
- Concepts Workflow monitor
OLAP & PRESENTION using COGNOS:
- Impromptu Administration : Defining database and setting up catalog
- Impromptu Administration : Creating and Managing Folders, Prompts, Filter, Formula
- Impromptu Administration : Defining Classes, Users & Privileges
- Impromptu Administration : Generating & Formulating Falt Reports
- Impromptu Administration : Generating Query Definition File
- Cognos Powerplay Transformer : Creating Dimensions and Measures
- Cognos Powerplay Transformer : Generating Cubes, Manipulating Cube Properties
- Cognos Powerplay Transformer : Generating Multi-Dimensional reports using cube
- Cognos Powerplay Transformer : Formatting & Printing Reports
- Cognos Powerplay Transformer : Drill-down, Drill-up and Drill-through Reports
- Cognos Powerplay Transformer : Generating Chats using cube
OLAP & PRESENTATION using BUSINESS OBJECTS
- Supervisor: Creating Repository and Supervisor
- Supervisor: Database Connections
- Supervisor: Defining Groups and Users
- Supervisor: Defining Privileges and Profiles for users
- Supervisor: Scheduling Users Connection Sessions
- Designer: Creating and Maintaining Universe
- Designer: Defining Tables & Columns as Classes and Objects
- Designer: Creating and Maintaining measures
- Designer: Defining Relations, Formulas etc
- Business Object: Defining Query, Reports Type
- Business Object: Formatting Reports
- Business Object: Slicing & Dicing Reports
- Business Object: Drill-down, Drill-up Reports
- Business Object: Filtering Records, Grouping Records etc.
Tuesday, March 25, 2008
History
The concept of data warehousing dates back to the mid-1980s [3] when IBM researchers Barry Devlin and Paul Murphy developed the "information warehouse". In essence, the data warehousing concept was intended to provide an architectural model for the flow of data from operational systems to decision support environments. The concept attempted to address the various problems associated with this flow - mainly, the high costs associated with it. In the absence of a data warehousing architecture, an enormous amount of redundancy of information was required to support the multiple decision support environment that usually existed. In larger corporations it was typical for multiple decision support environments to operate independently. Each environment served different users but often required much of the same data. The process of gathering, cleaning and integrating data from various sources, usually long existing operational systems (usually referred to as legacy systems), was typically in part replicated for each environment. Moreover, the operational systems were frequently reexamined as new decision support requirements emerged. Often new requirements necessitated gathering, cleaning and integrating new data from the operational systems that were logically related to prior gathered data.
Based on analogies with real-life warehouses, data warehouses were intended as large-scale collection/storage/staging areas for corporate data. Data could be retrieved from one central point or data could be distributed to "retail stores" or "data marts" which were tailored for ready access by users.
Key developments in early years of data warehousing were:
- 1983 - Teradata introduces a database management system specifically designed for decision support.
- 1986 - Barry Devlin and Paul Murphy publish the article An architecture for a business and information systems in IBM Systems Journal where they introduce the term "information warehouse".
- 1990 - Red Brick Systems introduces Red Brick Warehouse, a database management system specifically for data warehousing.
- 1991 - Prism Solutions introduces Prism Warehouse Manager, software for developing a data warehouse.
- 1991 - Bill Inmon publishes the book Building the Data Warehouse.
- 1995 - The Data Warehousing Institute, a for-profit organization that promotes data warehousing, is founded.
- 1996 - Ralph Kimball publishes the book The Data Warehouse Toolkit.
- 1997 - Oracle 8, with support for star queries, is released.
Monday, March 24, 2008
Data warehouses versus operational systems
Operational systems are optimized for preservation of data integrity and speed of recording of business transactions through use of database normalization and an entity-relationship model. Operational system designers generally follow the Codd rules of data normalization in order to ensure data integrity. Codd defines five increasingly stringent rules of normalization. Fully normalized database designs (that is, those satisfying all five Codd rules) often result in information from a business transaction being stored in dozens to hundreds of tables. Relational databases are efficient at managing the relationships between these tables. The databases have very fast insert/update performance because only a small amount of data in those tables is affected each time a transaction is processed. Finally, in order to improve performance, older data are usually periodically purged from operational systems.
Data warehouses are optimized for speed of data retrieval. Frequently data in data warehouses are denormalised via a dimension-based model. Also, to speed data retrieval, data warehouse data are often stored multiple times - in their most granular form and in summarized forms called aggregates. Data warehouse data are gathered from the operational systems and held in the data warehouse even after the data has been purged from the operational systems.
Top-down versus bottom-up design methodologies
Top-down design
Bill Inmon, one of the first authors on the subject of data warehousing and the man credited with coining the term "data warehouse", has defined a data warehouse as a centralized repository for the entire enterprise.[3] Inmon is one of the leading proponents of the top-down approach to data warehouse design, in which the data warehouse is designed using a normalized enterprise data model. "Atomic" data, that is, data at the lowest level of detail, are stored in the data warehouse. Dimensional data marts containing data needed for specific business processes or specific departments are created from the data warehouse. In the Inmon vision the data warehouse is at the center of the "Corporate Information Factory" (CIF), which provides a logical framework for delivering business intelligence (BI) and business management capabilities. The CIF is driven by data provided from business operations.[4]
Inmon states that the data warehouse is:
- Subject-oriented
- The data in the data warehouse is organized so that all the data elements relating to the same real-world event or object are linked together.
- Time-variant
- The changes to the data in the data warehouse are tracked and recorded so that reports can be produced showing changes over time.
- Non-volatile
- Data in the data warehouse is never over-written or deleted - once committed, the data is static, read-only, and retained for future reporting.
- Integrated
- The data warehouse contains data from most or all of an organization's operational systems and this data is made consistent.
The top-down design methodology generates highly consistent dimensional views of data across data marts since all data marts are loaded from the centralized repository. Top-down design has also proven to be robust against business changes. Generating new dimensional data marts against the data stored in the data warehouse is a relatively simple task. The main disadvantage to the top-down methodology is that it represents a very large project with a very broad scope. The up-front cost for implementing a data warehouse using the top-down methodology is significant, and the duration of time from the start of project to the point that end users experience initial benefits can be substantial. In addition, the top-down methodology can be inflexible and unresponsive to changing departmental needs during the implementation phases.[3]
Bottom-up design
Ralph Kimball, another well known author on data warehousing, defines a data warehouse as "a copy of transaction data specifically structured for query and analysis."[5] Kimball is a proponent of the bottom-up approach to data warehouse design. In the bottom-up approach data marts are first created to provide reporting and analytical capabilities for specific business processes. Data marts contain atomic data and, if necessary, summarized data. These data marts can eventally be unioned together to create a comprehensive data warehouse. The combination of data marts is managed through the implementation of what Kimball calls "a data warehouse bus architecture".[3]
The bottom-up approach to data warehouse design provides the advantage of a quick turnaround. Business value can be returned as quickly as the first data marts can be created. However, a long term risk of this approach is inconsistencies in the multiple data marts and the resulting multiple "version of the truths" seen by users retrieving data from the data marts.[3] Conforming dimensions among data marts and maintaining tight management over the data warehouse bus architecture can help mitigate these risks.
Hybrid design
Over time it has become apparent to proponents of bottom-up and top-down data warehouse design that both methodologies have benefits and risks. Hybrid methodologies have evolved to take advantage of the the fast turn-around time of bottom-up design and the enterprise-wide data consistency of top-down design.
Friday, March 21, 2008
Data warehouse Layer
There is no widespread agreement on exactly what constitutes a data warehouse architecture. Though they may not be contradictory, views differ as to the relative importance of the possible components. One possible conceptualization of a data warehouse architecture consists of the following interconnected layers:
- Operational database layer
- The source data for the data warehouse
- Informational access layer
- The data accessed for reporting and analyzing and the tools for reporting and analyzing data
- Data access layer
- The interface between the operational and informational access layer
- Metadata layer
- The data directory (which is often much more detailed than an operational system data directory).
Benefits of data warehousing
- A data warehouse provides a common data model for data, regardless of the data's source. This makes it easier to report and analyze information than it would be if multiple data models from disparate sources were used to retrieve information such as sales invoices, order receipts, general ledger charges, etc.
- Prior to loading data into the data warehouse inconsistencies are identified and resolved. This greatly simplifies reporting and analysis.
- Information in the data warehouse is under the control of data warehouse users so that, even if the source system data is purged over time, the information in the warehouse can be stored safely for extended periods of time.
- Because they are separate from operational systems, data warehouses provide fast retrieval of data without slowing down operational systems.
- Data warehouses facilitate decision support system applications such as trend reports (e.g., the items with the most sales in a particular area within the last two years), exception reports, and reports that show actual performance versus goals.
- Data warehouses can work in conjunction with and, hence, enhance the value of operational business applications, notably customer relationship management (CRM) systems.