ETL Systems Concepts, Processes, and Tools
IME provides data management solutions, including data integration, data quality, archiving, and masking, as well as machine learning and artificial intelligence solutions, handling big data and operational data stores efficiently. IME serves more than 60 customers around the world, across various business lines.
Informatica supports ETL tools and has won several awards in recent years; it has more than 500 partners offering services, support, training, and certification. In this post, we briefly explain ETL systems and their advantages.
Information is power, and an organization's richest asset is its information. There are two main classes of systems in any organization: operational systems and data warehouse / business intelligence (DW/BI) systems. Operational systems support day-to-day activities such as daily transactions; they do not maintain a history of the data.
DW/BI systems, on the other hand, focus on performance evaluation, data analysis for decision making, and data visualization. In this article, we explore the main concepts and architecture of the data warehouse and of ETL tools, and discuss the advantages of using ETL tools. We close with numerical evidence: statistics on the impact and market growth of a widely adopted ETL tool.
Goals of DW/BI systems
The mission of the data warehouse is to publish the organization's data assets to most effectively support decision making. The main goals are:
- Data accessibility: Queries on the data must run quickly.
- Data consistency: Data must be carefully assembled from many sources, cleansed, quality assured, and released only when it is fit for user consumption.
- Support of decision making: Decisions are based on the analytic evidence the system presents, and it is this evidence that delivers the business impact and value attributable to the DW/BI system.
- Data Security and Privacy: The DW/BI system must effectively control access to the organization’s confidential information.
DW/BI systems architecture
Figure 1: Core elements of common DW/BI architecture 
The data warehouse’s ETL system (ETL stands for Extract-Transform-Load) resembles the restaurant’s kitchen, as in Figure 1. Source data is magically transformed into meaningful, presentable information. The back room ETL system must be architected long before any data is extracted from the source. Like the kitchen, the ETL system is designed to ensure throughput. It transforms raw source data into the target model efficiently, minimizing unnecessary movement.
The ETL system is highly concerned with data quality, integrity, and consistency. Incoming data is checked for reasonable quality as it enters, and conditions are continually monitored to ensure that ETL outputs have high integrity. Business rules that consistently derive value-added metrics and attributes are applied once, by skilled professionals, in the ETL system.
The ETL system defines the processes by which data is loaded from several source systems into the data warehouse. Today, ETL commonly includes cleaning as a separate step, so the sequence becomes Extract-Clean-Transform-Load. In brief, the steps of the ETL process are:
In the Extract step, data is pulled from the source system and made accessible for further processing. The main objective of this step is to retrieve all the required data from the source system using as few resources as possible. The extract should be designed so that it does not negatively affect the source system in terms of performance, response time, or locking.
There are several methods to perform the extract:
- Update notification: the source system sends a notification when a record has changed and describes the change. This is the easiest way to obtain the data.
- Incremental extract: the system identifies which records have been modified and provides an extract of only those records. During further ETL steps, the system needs to identify these changes and propagate them downstream.
- Full extract: the entire data set is extracted each time, which requires keeping a copy of the last extract in the same format in order to identify changes. Full extracts handle deletions as well.
For incremental and full extracts, the extract window is very important; this is especially true for full extracts, where data volumes can reach tens of gigabytes.
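An incremental extract can be sketched with a "watermark" approach: only rows modified since the previous run are fetched. The sketch below uses an in-memory SQLite database as a stand-in source system; the `orders` table and its `last_modified` column are a hypothetical schema, not from any particular tool.

```python
import sqlite3

def incremental_extract(conn, since):
    """Fetch only rows modified after the previous extract's watermark."""
    cur = conn.execute(
        "SELECT id, amount, last_modified FROM orders WHERE last_modified > ?",
        (since,),
    )
    return cur.fetchall()

# Demo with an in-memory source system (illustrative data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL, last_modified TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 10.0, "2024-01-01"), (2, 20.0, "2024-03-01")],
)
changed = incremental_extract(conn, "2024-02-01")
print(changed)  # only the order modified after the 2024-02-01 watermark
```

The watermark (here an ISO date string) would be persisted between runs so each extract resumes where the last one stopped, keeping the load on the source system small.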
The cleaning step is very important because it ensures the quality of the data in the data warehouse. Cleaning should apply basic data unification rules, such as:
- Making identifiers unique (e.g., the sex categories Male/Female/Unknown, M/F/null, and Man/Woman/Not Available are all translated to a standard Male/Female/Unknown)
- Converting null values into a standard form (e.g., a Not Available/Not Provided value)
- Converting phone numbers and ZIP codes to a standardized form
- Validating address fields and converting them to consistent naming (e.g., Street/St/St./Str./Str)
- Validating address fields against each other (State/Country, City/State, City/ZIP code, City/Street).
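The first three unification rules above can be sketched as a small cleaning function. The mapping table and field names are illustrative only, not taken from any real tool:

```python
# Illustrative lookup for unifying sex codes to Male/Female/Unknown.
SEX_MAP = {"m": "Male", "male": "Male", "man": "Male",
           "f": "Female", "female": "Female", "woman": "Female"}

def clean_record(rec):
    out = dict(rec)
    # Unify identifier variants (M, Man, male, ...) to a standard value.
    raw = (out.get("sex") or "").strip().lower()
    out["sex"] = SEX_MAP.get(raw, "Unknown")
    # Standardize nulls into an explicit Not Available value.
    if not out.get("phone"):
        out["phone"] = "Not Available"
    else:
        # Crude phone standardization: keep digits only.
        out["phone"] = "".join(ch for ch in out["phone"] if ch.isdigit())
    return out

print(clean_record({"sex": "M", "phone": "(555) 010-2030"}))
# {'sex': 'Male', 'phone': '5550102030'}
```

Address validation (the last two rules) would typically rely on reference tables of valid Street/City/State/ZIP combinations rather than inline logic like this.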
In the Transform step, a set of rules is applied to convert data from the source format into the target model. This includes converting measured data to the same dimensions and units so that values can later be joined, as well as joining data from several sources, generating aggregations, generating surrogate keys, sorting, deriving new computed values, and applying advanced validation rules.
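A minimal sketch of three of these operations follows: unit conversion, surrogate key generation, and aggregation. The field names and the fixed EUR-to-USD rate of 1.1 are purely illustrative assumptions:

```python
import itertools

# Surrogate keys: warehouse-generated integers, independent of source keys.
_surrogate = itertools.count(1)

def transform(rows):
    """Convert all amounts to USD and assign a surrogate key to each fact."""
    out = []
    for r in rows:
        # Illustrative fixed conversion rate; a real system would look this up.
        amount_usd = r["amount"] * 1.1 if r["currency"] == "EUR" else r["amount"]
        out.append({"sk": next(_surrogate), "customer": r["customer"],
                    "amount_usd": round(amount_usd, 2)})
    return out

rows = [{"customer": "A", "amount": 100.0, "currency": "EUR"},
        {"customer": "A", "amount": 50.0, "currency": "USD"}]
facts = transform(rows)

# A derived aggregation: total spend per customer in a common unit.
total_by_customer = {}
for f in facts:
    total_by_customer[f["customer"]] = total_by_customer.get(f["customer"], 0) + f["amount_usd"]
print(facts, total_by_customer)
```

Only after amounts share a unit does the aggregation become meaningful, which is why unit conversion precedes joining and summarizing.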
For accurate loading, it is necessary to ensure that the load is performed correctly and with as few resources as possible. The target of the load process is usually a database. To make loading efficient, it is recommended to disable any constraints and indexes before the load and re-enable them only after loading completes. Referential integrity must then be maintained by the ETL tool to ensure consistency.
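The drop-index/bulk-load/rebuild-index pattern can be sketched with SQLite as a stand-in target database; the table and index names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_sales (sk INTEGER, amount REAL)")
conn.execute("CREATE INDEX idx_amount ON fact_sales (amount)")

rows = [(i, float(i)) for i in range(1000)]  # illustrative transformed facts

# Disable the index before the bulk load so each insert skips index upkeep.
conn.execute("DROP INDEX idx_amount")
conn.executemany("INSERT INTO fact_sales VALUES (?, ?)", rows)
# Rebuild the index once, after loading completes.
conn.execute("CREATE INDEX idx_amount ON fact_sales (amount)")
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM fact_sales").fetchone()[0]
print(count)  # 1000
```

Rebuilding an index once over the full table is generally cheaper than maintaining it row by row during a large load, which is the rationale behind this recommendation.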
Benefits of ETL tools
Using ETL tools brings substantial benefits to any company or organization, such as the visual view available to decision makers for monitoring the data flow and detecting crises or abnormal behavior in operations. ETL tools also introduce a structured system design that ensures integrity in the data warehouse. Moreover, ETL tools can gather data from many sources efficiently, follow this with a cleaning step, and perform the required mappings, and even complex transformations, using a rich set of cleansing functions.
ETL tools can process not only traditional data but also big data: most ETL tools now provide Hadoop connectors or similar interfaces. Finally, ETL tools that support performance-enhancing technologies, such as parallel processing and cluster awareness, help the data warehouse achieve high performance.
ETL tools: A success story
As an example of an ETL tool, Informatica is a data integration product built on the ETL architecture. It provides data integration software and services for various businesses, industries, and government organizations, including telecommunications, health care, and financial and insurance services.
Growth in demand for Informatica certification is indicated by the following points:
- 2015 revenue: $1.06 billion, more than the combined revenue of Ab Initio, DataStage, SSIS, and other ETL tools
- 7-year CAGR: 30%
- Partners: 450+, including major SI, ISV, OEM, and on-demand leaders
- Customers: Over 5,000
- Customers in 82 countries and a direct presence in 28 countries
- #1 in customer loyalty rankings, 7 years in a row
Informatica was founded in 1993. In 2015, Permira and the Canada Pension Plan Investment Board acquired Informatica for approximately $5.3 billion. Over 25 years, the company has grown to more than 9,000 customers in 82 countries, including 85 of the Fortune 100, with more than $1 billion in annual revenue and over 4,200 employees, processing more than 8 trillion cloud transactions per month across all ecosystems.
References
- Kimball, Ralph, and Margy Ross. The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling. John Wiley & Sons, 2013.