The need for speed is provoking IT leaders to take a fresh look at modes of data integration. Is the data warehousing model of ETL (extract, transform and load) to be eclipsed by data virtualisation? Or does the pressure to be more agile undermine a more considered, long-term, strategic approach to integrating data from different...
We asked our round table of data management experts how organisations should best think about and approach data integration programmes. They offer some answers below.
Ted Friedman, research vice president and analyst, Gartner
Organisations have traditionally approached data integration in an application- or project-specific manner. More than 90% of businesses form separate teams and buy different tools or write custom code for each data integration project. However, this approach creates information silos and raises costs by involving potentially redundant work, additional maintenance and the purchase of unnecessary licences. CIOs, CTOs and data integration leaders can save money and improve the flexibility of their data management environments by creating a team to implement data integration capabilities as a shared service for the organisation at large.
Integration leaders should take three main steps to establish a shared-service team:
1. Bring together the five core roles and skills:
- A data integration shared-service director manages the team, articulates its mission and value to the rest of the organisation, and is responsible for ensuring ongoing sponsorship from both the CIO and the business by demonstrating the team's value. He or she controls the team's budget, oversees the tool selection process, works with the purchasing department in contract negotiations and hires consultants where needed.
- A data integration shared-service project manager oversees the overall planning and staffing of the various ongoing projects. This role needs both management and technical skills.
- A data integration architect designs the delivery of the data assets.
- A data integration developer ensures that the solution follows best practices for data integration and meets the requirements specified by the data integration lead.
- A data integration tester participates in -- and reviews -- data integration design, and defines consistent test strategies based on the data integration specifications and service-level agreements (SLAs).
2. Define the activities that the shared-service team must perform and how the team should interact with other projects and stakeholders. For each data integration project, leaders should do the following:
- Model. Collect, document and model business requirements.
- Identify. Find critical source data and metadata relevant to data integration.
- Govern. Ensure that the data delivered to the business follows corporate and regulatory policies, as well as data quality policies established with the business.
- Convey. Define the best way of delivering data by considering latency requirements, data volumes and the amount of transformation needed.
- Control. Guarantee that in the operational phase, the data delivered meets governance requirements and SLAs.
3. Create metrics to gauge the success of data integration projects:
Focus on metrics that show the business benefits of data integration shared services, rather than just the benefits to the IT organisation. Examples of business benefits are increased agility, lower software licence and maintenance costs, reduced project implementation and support costs and improved data quality.
Chris Bradley, head of business consultancy practice, IPL
Many different approaches are now available for data integration, yet by far the most popular remains extract, transform and load (ETL).
However, the pace of business change and the requirement for agility demand that organisations support multiple styles of data integration.
Three major styles of integration present themselves. Here is how they differ.
1. Physical movement and consolidation
Probably the most commonly used approach is physical data movement. This is used when you need to replicate data from one database to another. There are two major genres of physical data movement: ETL and change data capture (CDC).
ETL is typically run according to a schedule and is used for bulk data movement, usually in batch. CDC is event-driven and delivers real-time incremental replication. Example products in these areas are Informatica (ETL) and GoldenGate (CDC).
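To make the contrast concrete, here is a minimal sketch in Python. It assumes plain in-memory dictionaries stand in for the source and target databases. The names `ChangeEvent`, `etl_batch_load` and `cdc_apply` are hypothetical and are not taken from Informatica, GoldenGate or any other product.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ChangeEvent:
    """One captured change from the source (hypothetical structure)."""
    op: str                      # "insert", "update" or "delete"
    key: int
    row: Optional[dict] = None   # new row contents; None for deletes


def etl_batch_load(source: dict, transform: Callable[[dict], dict]) -> dict:
    """Scheduled bulk movement: rebuild the target from a full extract."""
    return {key: transform(row) for key, row in source.items()}


def cdc_apply(target: dict, event: ChangeEvent,
              transform: Callable[[dict], dict]) -> None:
    """Event-driven replication: apply a single captured change in place."""
    if event.op == "delete":
        target.pop(event.key, None)
    else:
        target[event.key] = transform(event.row)


def to_upper_name(row: dict) -> dict:
    """Example transformation step shared by both styles."""
    return {**row, "name": row["name"].upper()}


# Batch: the whole source is extracted, transformed and loaded in one run.
source = {1: {"name": "acme"}, 2: {"name": "globex"}}
target = etl_batch_load(source, to_upper_name)

# CDC: only the rows that changed are sent, one event at a time.
cdc_apply(target, ChangeEvent("update", 2, {"name": "initech"}), to_upper_name)
cdc_apply(target, ChangeEvent("delete", 1), to_upper_name)
```

The batch function touches every row on every run, whereas the CDC function does work proportional to the number of changes, which is why the latter suits near-real-time replication.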
2. Message-based synchronisation and propagation
Whilst ETL and CDC are database-to-database integration approaches, the next approach, message-based synchronisation and data propagation, is used for application-to-application integration. Once again, there are two main genres: enterprise application integration (EAI) and enterprise service bus (ESB) approaches, but both of these are used primarily for the purpose of event-driven business process automation. A leading product example in this area is the ESB from Tibco.
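As a rough illustration of event-driven propagation between applications, the sketch below uses a tiny in-process publish/subscribe bus. `MessageBus`, the topic name `order.created` and the two handlers are all assumptions made for the example. A real ESB adds routing, transformation, persistence and delivery guarantees that are not modelled here.

```python
from collections import defaultdict
from typing import Callable


class MessageBus:
    """Toy publish/subscribe bus (hypothetical, in-process only)."""

    def __init__(self) -> None:
        self._subscribers = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[dict], None]) -> None:
        """Register an application's handler for a topic."""
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: dict) -> int:
        """Deliver the message to every subscriber; return how many received it."""
        handlers = self._subscribers.get(topic, [])
        for handler in handlers:
            handler(message)
        return len(handlers)


# Two consuming applications react to the same business event.
invoices = []
shipments = []

bus = MessageBus()
bus.subscribe("order.created", lambda msg: invoices.append(msg["order_id"]))
bus.subscribe("order.created", lambda msg: shipments.append(msg["order_id"]))

# The ordering application publishes once and knows nothing about the consumers.
delivered = bus.publish("order.created", {"order_id": 42})
```

The publishing application is decoupled from its consumers: adding a third subscriber requires no change to the code that raises the event.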
3. Abstraction or virtual consolidation (aka federation)
Thirdly, you have data virtualisation (DV). The key here is that the data source, which is usually a database, and the target or consuming application, which is usually a business application, are isolated from each other. The information is delivered on demand to the business application when the user needs it. The consuming business application can consume the data as though it were a database table, a star schema, an XML message or many other forms. The key point with a DV approach is that the form of the underlying source data is isolated from the consuming application. The key rationale for data virtualisation within an overall data integration strategy is to overcome complexity, increase agility and reduce cost. Leading product examples in this area come from Composite Software.
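The idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how Composite Software or any other DV product works: `VirtualCustomerView` is a hypothetical class, and the two "source systems" are simple callables returning lists of dictionaries.

```python
import xml.etree.ElementTree as ET
from typing import Callable, List


class VirtualCustomerView:
    """Hypothetical virtual view: queries the sources on demand, keeps no copy."""

    def __init__(self, crm_fetch: Callable[[], List[dict]],
                 billing_fetch: Callable[[], List[dict]]) -> None:
        self._crm_fetch = crm_fetch
        self._billing_fetch = billing_fetch

    def rows(self) -> List[dict]:
        """Present the combined data as table-like rows."""
        balances = {b["customer_id"]: b["balance"] for b in self._billing_fetch()}
        return [
            {"id": c["id"], "name": c["name"], "balance": balances.get(c["id"], 0.0)}
            for c in self._crm_fetch()
        ]

    def as_xml(self) -> str:
        """Present the same data as an XML message."""
        root = ET.Element("customers")
        for row in self.rows():
            ET.SubElement(root, "customer", {k: str(v) for k, v in row.items()})
        return ET.tostring(root, encoding="unicode")


# Stand-ins for two operational systems.
crm = [{"id": 1, "name": "Acme"}]
billing = [{"customer_id": 1, "balance": 250.0}]

view = VirtualCustomerView(lambda: list(crm), lambda: list(billing))
first = view.rows()

# A change in the source is visible on the next request, with no reload step.
billing[0]["balance"] = 100.0
second = view.rows()
xml_message = view.as_xml()
```

The consuming code sees only `rows()` or `as_xml()`, so the shape of the underlying sources can change without touching the consumer. The trade-off is that every request pays the cost of querying the sources.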
ETL or DV?
The suitability of data integration approaches needs to be considered for each case. Here are six key considerations to ponder:
1. Will the data be replicated in both the data warehouse (DW) and the operational system?
- Will data need to be updated in one or both locations?
- If data is physically in two locations, beware of regulatory and compliance issues -- such as SOX, HIPAA, Basel II and FDA -- associated with having additional copies of the data.
2. Data governance
- Is the data only to be managed in the originating operational system?
- What is the certainty that a DW will be a reporting DW only (versus an operational DW)?
3. Currency of the data -- does it need to be up to the minute?
- How up to date does the data in the DW need to be?
- Is there a need to see the operational data?
4. Time to solution -- how quickly is the system required?
- Is it an immediate requirement?
- Who are the confirmed users and what is their usage?
5. What is the life expectancy of the source system(s)?
- Are any of the source systems likely to be retired?
- Will new systems be commissioned?
- Are new sources of data likely to be required?
6. Need for historical, summary, aggregate data
- How much historical data is required in the DW system?
- How much aggregated or summary data is required in the DW system?
Leading analyst firms like Gartner are recommending that data virtualisation be added to your integration tool kit and that you should use the right style of data integration for the job for optimal results.
Andy Hayler, CEO, The Information Difference
Experience has shown that the cost of maintaining interfaces eats up a major chunk of IT budgets. Despite the wave of ERP implementations, large organisations still require many separate applications to support their business processes. One large company I work with admits to having 600 significant applications, of which only one is SAP -- and of course, they have multiple instances of that.
The more applications you have, the worse the interface spaghetti problem becomes. One alternative approach is to define integration hubs that do the hard work of integration just once, and then act as a master source of shared data to other applications. Rather than writing even more individual application interfaces, applications go to the hub for the data that they need.
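A back-of-the-envelope calculation shows why the spaghetti grows so quickly. The sketch below assumes the worst case, in which every pair of applications needs its own interface. Real estates are rarely fully connected, so treat the point-to-point figure as an upper bound rather than a measurement. The function names are illustrative.

```python
def point_to_point_interfaces(apps: int) -> int:
    """Upper bound: one interface for every distinct pair of applications."""
    return apps * (apps - 1) // 2


def hub_interfaces(apps: int) -> int:
    """With a hub, each application needs only its own connection to the hub."""
    return apps


# Compare the two designs for a small, medium and large application estate.
for n in (10, 100, 600):
    print(n, point_to_point_interfaces(n), hub_interfaces(n))
```

For 600 applications the pairwise bound is 179,700 interfaces against 600 hub connections. Pairwise growth is quadratic while hub growth is linear, which is the arithmetic behind doing the hard work of integration just once.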
Such hubs can take different forms. There are many master data hubs on the market that focus on certain types of shared data such as “customer” or “product.” Some are more generic, and can in principle handle further types of shared data, such as “asset” or “location” or “person.” There are different ways of implementing these depending on how deeply you are prepared to re-engineer your operational systems, which currently handle such data separately.
Another approach is “virtual integration,” where data integration from the operational systems is carried out dynamically on demand. This sounds appealing as it skips the costly step of building permanent data hubs, but it can encounter significant performance issues, depending on the use case. I was talking to a large company just today that had to abandon a multimillion-dollar project that took this approach because it ran into insuperable performance issues.
Whichever approach you adopt, defining a data model for key shared data is going to be important. More important still is getting the business to take ownership of such data, since resolving competing definitions of key data needs business input. This accounts for the dramatic rise in interest in data governance recently, which can support such integration initiatives by defining processes and organisational responsibilities for data, covering both its definitions and its quality. Although there are many challenges, more and more enterprises are now moving in the direction of implementing data governance, along with data integration projects that attempt to tame the data mess that most enterprises have to deal with today.
Ollie Ross, head of research, The Corporate IT Forum
Data virtualisation as a progress step to agility? For our members, it's clearly very early days. We've noted [in discussions internal to the Forum] that “virtualisation combined with automation and simplification can give business agility,” and I've seen the recommendation to “virtualise storage, servers and the network.” I have also read an enlightened report of a member case study around building and moving to a new, agile data centre, with extensive virtualisation and a view to future use of cloud capabilities. And, of course, there's a steadily growing interest in doing more with your data -- so a “pre-beginning,” if you like, of an agile data strategy.
But it’s very early days for corporate IT and data virtualisation.
Ted Friedman is chairing the Gartner Master Data Management Summit 2012, which will take place on 8 to 9 February in London. For more information, visit www.gartner.com/eu/mdm.
Chris Bradley has recently co-authored Data Modeling For The Business. He also writes the Information Asset Management “Expert channel” on the BeyeNETWORK, blogs on information management and tweets at @InfoRacer.
Andy Hayler is co-founder and CEO of analyst firm The Information Difference and a regular speaker at international conferences on master data management, data governance and data quality. He is also a restaurant critic and author.
Ollie Ross is head of research at The Corporate IT Forum.