Data Analysis

by Padmini Dalai

The exponentially increasing amounts of data being generated each year make getting useful information from that data more and more critical. The information is frequently stored in a data warehouse, a repository of data gathered from various sources, including corporate databases, summarized information from internal systems, and data from external sources. Analysis of the data includes simple query and reporting, statistical analysis, more complex multidimensional analysis, and data mining.

Data analysis and data mining are subsets of business intelligence (BI), which also incorporates data warehousing, database management systems, and Online Analytical Processing (OLAP).

These technologies are frequently used in customer relationship management (CRM) to analyze patterns and query customer databases. Large quantities of data are searched and analyzed to discover useful patterns or relationships, which are then used to predict future behavior.

Some estimates indicate that the amount of new information doubles every three years. To deal with the mountains of data, the information is stored in a repository of data gathered from various sources, including corporate databases, summarized information from internal systems, and data from external sources. Properly designed and implemented, and regularly updated, these repositories, called data warehouses, allow managers at all levels to extract and examine information about their company, such as its products, operations, and customers’ buying habits.

With a central repository to keep the massive amounts of data, organizations need tools that can help them extract the most useful information from the data. A data warehouse can bring data together in a single format, supplemented by metadata, through a set of input mechanisms known as extraction, transformation, and loading (ETL) tools. These and other BI tools enable organizations to make well-informed business decisions quickly, based on sound analysis of the data.

Analysis of the data includes simple query and reporting functions, statistical analysis, more complex multidimensional analysis, and data mining (also known as knowledge discovery in databases, or KDD). Online analytical processing (OLAP) is most often associated with multidimensional analysis, which requires powerful data manipulation and computational capabilities.

With the increasing amount of data being produced each year, BI has become a hot topic. The increasing focus on BI has led a number of large organizations to increase their presence in the space, resulting in consolidation around some of the largest software vendors in the world. Notable purchases in the BI market include Oracle’s purchase of Hyperion Solutions; Open Text’s acquisition of Hummingbird; IBM’s purchase of Cognos; and SAP’s acquisition of Business Objects.

Definition

The purpose of gathering corporate information together in a single structure, typically an organization’s data warehouse, is to facilitate analysis, so that information collected from a variety of business activities can be used to improve the understanding of underlying trends in the business. Analysis of the data can include simple query and reporting functions, statistical analysis, more complex multidimensional analysis, and data mining. OLAP, one of the fastest growing areas, is most often associated with multidimensional analysis. According to The BI Verdict (formerly The OLAP Report), the defining characteristic of an OLAP application is “fast analysis of shared multidimensional information.”
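To make multidimensional analysis more concrete, the following minimal Python sketch rolls up and slices a tiny in-memory set of sales facts. The dimension names, the sample rows, and the helper functions roll_up and slice_cube are all hypothetical and chosen only for illustration; a real OLAP engine pre-aggregates and indexes far larger data sets.

```python
from collections import defaultdict

# Hypothetical fact rows: (region, product, quarter, sales)
FACTS = [
    ("East", "Widget", "Q1", 100.0),
    ("East", "Gadget", "Q1", 150.0),
    ("West", "Widget", "Q1", 80.0),
    ("East", "Widget", "Q2", 120.0),
    ("West", "Gadget", "Q2", 200.0),
]
# Order of the dimension columns in each fact row (assumed layout)
DIMENSIONS = ("region", "product", "quarter")


def roll_up(facts, keep):
    """Sum sales over every dimension that is not listed in `keep`."""
    positions = [DIMENSIONS.index(name) for name in keep]
    totals = defaultdict(float)
    for row in facts:
        key = tuple(row[i] for i in positions)
        totals[key] += row[3]  # column 3 holds the sales measure
    return dict(totals)


def slice_cube(facts, dimension, value):
    """Keep only the rows where one dimension has a fixed value."""
    i = DIMENSIONS.index(dimension)
    return [row for row in facts if row[i] == value]


by_region = roll_up(FACTS, ["region"])
by_region_quarter = roll_up(FACTS, ["region", "quarter"])
q1_by_product = roll_up(slice_cube(FACTS, "quarter", "Q1"), ["product"])

print(by_region)
print(by_region_quarter)
print(q1_by_product)
```

The same facts answer different questions depending on which dimensions are kept, which is the essence of viewing data "from many dimensions."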

Data warehouses are usually separate from production systems, and production data is added to the data warehouse at intervals that vary according to business needs and system constraints. Raw production data must be cleaned and qualified, so the warehouse data often differs from the operational data from which it was extracted. The cleaning process may actually change field names and data characters in the data record to make the revised record compatible with the warehouse data rule set. This is the province of ETL.
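The cleaning step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names, the FIELD_MAP mapping, and the two rules (upper-case state codes, numeric amounts) are hypothetical and do not come from any particular ETL product.

```python
# Hypothetical mapping from production field names to warehouse field names
FIELD_MAP = {"cust_nm": "customer_name", "st": "state", "amt": "amount"}


def clean_record(raw):
    """Return a copy of one production record that follows the warehouse rules."""
    cleaned = {}
    for source_name, value in raw.items():
        target_name = FIELD_MAP.get(source_name)
        if target_name is None:
            continue  # drop fields the warehouse does not model
        if isinstance(value, str):
            value = value.strip()  # remove stray whitespace
        cleaned[target_name] = value
    # Illustrative rule set: state codes are upper case, amounts are numeric
    if "state" in cleaned:
        cleaned["state"] = cleaned["state"].upper()
    if "amount" in cleaned:
        cleaned["amount"] = float(cleaned["amount"])
    return cleaned


def load(records, warehouse):
    """Transform each extracted record and append it to the warehouse list."""
    for raw in records:
        warehouse.append(clean_record(raw))


# "Extract": rows as they might arrive from a production system
production_rows = [
    {"cust_nm": " Acme Corp ", "st": "ny", "amt": "1200.50", "tmp_flag": "x"},
    {"cust_nm": "Globex", "st": "Ca", "amt": "310", "tmp_flag": "y"},
]

warehouse = []
load(production_rows, warehouse)
print(warehouse)
```

After loading, the warehouse records use different field names and normalized values, which is why warehouse data can differ from its operational source.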

A data warehouse also contains metadata (the structure and sources of the raw data; essentially, data about data), the data model, rules for data aggregation, replication, distribution, and exception handling, and any other information necessary to map the data warehouse, its inputs, and its outputs. As the complexity of data analysis grows, so does the amount of data being stored and analyzed; ever faster and more powerful analysis tools and hardware platforms are required to maintain the data warehouse.

A successful data warehousing strategy requires a powerful, fast, and easy way to develop useful information from raw data. Data analysis and data mining tools use quantitative analysis, cluster analysis, pattern recognition, correlation discovery, and associations to analyze data with little or no IT intervention. The resulting information is then presented to the user in an understandable form, processes collectively known as BI. Managers can choose between several types of analysis tools, including queries and reports, managed query environments, and OLAP and its variants (ROLAP, MOLAP, and HOLAP). These are supported by data mining, which develops patterns that may be used for later analysis, and completes the BI process.

BUSINESS INTELLIGENCE COMPONENTS

The ultimate goal of data warehousing is BI production, and analytic tools represent only part of this process. Three basic components are used together to prepare a data warehouse for use and to develop information from it:

ETL tools, used to bring data from diverse sources together in a single, accessible structure, and load it into the data mart or data warehouse.

Data mining tools, which use a variety of techniques, including neural networks and advanced statistics, to locate patterns within the data and develop hypotheses.

Analytic tools, including querying tools and the OLAP variants, used to analyze data, determine relationships, and test hypotheses about the data.

Analytic tools continue to grow within this framework, with the overall goal of improving BI, improving decision analysis, and, more recently, promoting linkages with business process management (BPM), also known as workflow.

DATA MINING

Data mining can be defined as the process of extracting data, analyzing it from many dimensions or perspectives, then producing a summary of the information in a useful form that identifies relationships within the data. There are two types of data mining: descriptive, which gives information about existing data; and predictive, which makes forecasts based on the data.
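The difference between the two types can be illustrated with a minimal Python sketch. The monthly sales figures are invented, and a straight-line least-squares fit stands in for the much richer models that real data mining tools use; the function names describe, fit_line, and forecast are hypothetical.

```python
from statistics import mean

# Hypothetical monthly sales figures, oldest first
monthly_sales = [10.0, 12.0, 14.0, 16.0, 18.0]


def describe(values):
    """Descriptive: summarize the data that already exists."""
    return {
        "count": len(values),
        "mean": mean(values),
        "min": min(values),
        "max": max(values),
    }


def fit_line(values):
    """Least-squares fit of value against period index (0, 1, 2, ...)."""
    xs = list(range(len(values)))
    x_bar = mean(xs)
    y_bar = mean(values)
    numerator = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, values))
    denominator = sum((x - x_bar) ** 2 for x in xs)
    slope = numerator / denominator
    intercept = y_bar - slope * x_bar
    return slope, intercept


def forecast(values, periods_ahead=1):
    """Predictive: extend the fitted trend beyond the last observed period."""
    slope, intercept = fit_line(values)
    return intercept + slope * (len(values) - 1 + periods_ahead)


summary = describe(monthly_sales)
next_month = forecast(monthly_sales)
print(summary)
print(next_month)
```

The summary reports only on what has already happened, while the forecast makes a claim about a period that has not yet been observed.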

BASIC REQUIREMENTS

A corporate data warehouse or departmental data mart is useless if its data cannot be put to work. One of the primary goals of all analytic tools is to develop processes that ordinary individuals can use in their jobs without advanced statistical knowledge. At the same time, the data warehouse and the information gained from data mining and data analysis need to be compatible across a wide variety of systems. For this reason, products within this arena are evolving toward ease of use and interoperability, though both remain major challenges.

References:

Arcplan: http://www.arcplan.com/

IBM Cognos: http://www.cognos.com/

Informatica: http://www.informatica.com/

Information Builders: http://www.informationbuilders.com/

The BI Verdict: http://www.bi-verdict.com/

Open Text: http://www.opentext.com/

OLAP Council: http://www.olapcouncil.org/

Oracle: http://www.oracle.com/

SAP BusinessObjects: http://www.sap.com/solutions/sapbusinessobjects/index.epx

SAS: http://www.sas.com/

SmartDrill: http://www.smartdrill.com/

Sybase: http://www.sybase.com/
