In the data warehousing sphere, smart adoption of ETL (extract, transform and load) tools improves productivity and eases metadata management and overall manageability. These benefits are generally applicable for organizations in the initial stages of building ETL tools and data warehouse applications. However, organizations that already have custom ETL code in production (for instance, Oracle PL/SQL scripts or Unix shell scripts) could be looking to replace these with ETL tools such as Informatica PowerCenter, Datastage, Ab Initio or BO Data Services, while simultaneously enhancing performance and increasing ROI. Here are some guidelines on how to go about this exercise, with a focus on the factors that influence ETL tool adoption.
In the extract, transform and load functionality of an ETL tool, transformations are the most compute intensive operation in the data warehouse environment. Transformations can be classified into:
- Database intensive transformations, involving straight table-to-table transformation or multiple relational joins while extracting from the source.
- File intensive transformations such as file reading and writing, temporary file creation, and so on.
If there are several transformations, these are best done outside the database, while just a few transformations are best done within the database.
Right tool for the right job
Database scripts such as Oracle PL/SQL are the best choice for database intensive transformations, while ETL tools are best for file intensive transformations. Databases are meant to support reliable processes through “ACID” -- atomicity, consistency, isolation and durability. These processes, however, are additional overheads for file intensive transformations.
ETL tool advantages for file intensive transformations
Features available in most ETL tools include:
- Lookup (connected and unconnected) transformations that cache the data in memory, thus reducing database hits when there are multiple lookups, resulting in faster processing.
- For transformations, many ETL tools provide native support for parallel processing. Some ETL tools offer several configurable partitioning schemes, and enabling of set-based processing.
- For file processing, an ETL tool obviates the need for separate temporary tables, thus avoiding database read/writes (and ACID overhead).
- Some ETL tools offer in-memory support for union of data, resulting in better performance.
Transformation complexity classification
Use Table 1 to identify whether your organization’s transformations are file intensive or not, and whether ETL tools can help in enhancing performance.
Table 1. Candidates for ETL tools.
Factors influencing ROI
The following factors need to be considered in determining whether ETL tools can provide better ROI.
• Volume growth
Organic: Does the organization foresee tremendous business growth (and hence data volume)? Is the organization moving towards pervasive business intelligence that would require real-time integration needs?
Inorganic: Are any acquisitions on the radar or is consolidation needed due to an existing acquisition?
• Transformation complexity
Database (DB) intensive: Do current ETL processes involve much DB read/write?
File intensive: Is there an opportunity to leverage an efficient ETL tool to avoid excessive DB read/write?
Environment complexity: How many data sources and targets are interfacing?
Architectural differences: For instance, relational databases, hierarchical databases, ERP, Excel, and so on.
Technological differences: For instance, different database vendors such as Oracle, IBM DB2, Microsoft SQL Server, and so on.
SLA Management: What are the schedule-related challenges? For example, critical reports need to be available by 6 am in the morning; critical data loads need to be completed in agreed SLA so that downstream applications are accessible during daytime. Is the current ETL process able to meet requirements easily? Are there challenges in maintaining the SLA for specific ETL processes, or are bottlenecks caused due to the cumulative number of ETL processes?
Processes: Are set processes being followed? Examples include release management processes, migration processes and change management strategy.
Documentation: How comprehensive is the process documentation and is it updated regularly to include lessons learned?
Resource skill: What are the experience levels of available resources and inclination towards GUI-based development?
Enablement: What is the knowledge management strategy in place?
Software: What are the ETL tool license costs, forecasts, vendor relationships and promises?
Hardware: What are the costs of ownership and opportunities for consolidation?
Administration capacity: What resources are available to maintain the legacy applications?
Overall, these factors define the total cost of ownership (TCO), which can influence the decision making process for ETL tools. A thorough assessment on each of the parameters leads to the determination of the “tipping point” at which the ETL tool needs to be deployed.
About the authors:
Sankara Narayanan Arumugam, a senior technical architect at Infosys, has 12 years’ experience in architecture, design and consulting in database and data warehouse technologies.
Jayant Devanathan, a senior project manager at Infosys, has 12 years’ experience in managing application architecture and design for data warehouse and business intelligence projects.