Aug 21, 2010

Automating ETL for OLTP-to-OLTP Conversions - Part 1

ETL: Extraction, Transformation, and Loading


Traditionally, ETL refers to the tasks of extracting, transforming, and loading data from an OLTP system into an OLAP system or environment. In this particular context, however, we are concerned with performing ETL between a source OLTP system and a different destination OLTP system. This is the typical scenario when you are asked to convert data from an old system to a new one, where both the source and the target are main-line operational systems.



Having defined the subject of our discussion, the question I'd like to visit today is: which of the three ETL tasks should be automated, and how much of each? Some colleagues of mine have argued that it is silly to try to automate the "E" and the "T". "Why waste your time?" they've argued. "The only one worth your time is the L." The more I think about it, though, the more it seems to me that all three are automatable, up to a point.


The "L" of the "ETL" process:
Let's take the easy one to automate: Loading. The underlying assumption here is that if you have a semi-static (or slowly changing) target OLTP system, you can easily set up a data staging area (an intermediary database proxy) and use it to reshape the data of the source OLTP to resemble the data model of the target OLTP. The next step is to build programs and rule engines that check the data in the staging area and validate it for correctness before this body of information is transferred to its final resting place: the target OLTP.
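Here is a minimal sketch of that validate-then-load step. Everything in it is an assumption for illustration: the table names (staging_customers, target_customers), the columns, and the two validation rules are made up, and sqlite3 stands in for whatever database actually hosts your staging area.

```python
import sqlite3

def validate_and_load(conn):
    """Run rule checks against the staging table, then copy only the
    clean rows into the target table. Returns (loaded, rejected) counts."""
    cur = conn.cursor()
    # Rule 1: a customer must have a name.
    # Rule 2: the email must contain an '@'.
    cur.execute("""
        SELECT id FROM staging_customers
        WHERE name IS NULL OR name = '' OR instr(email, '@') = 0
    """)
    rejected = {row[0] for row in cur.fetchall()}
    loaded = 0
    for cid, name, email in cur.execute(
        "SELECT id, name, email FROM staging_customers"
    ).fetchall():
        if cid in rejected:
            continue
        cur.execute(
            "INSERT INTO target_customers (id, name, email) VALUES (?, ?, ?)",
            (cid, name, email),
        )
        loaded += 1
    conn.commit()
    return loaded, len(rejected)

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE staging_customers (id INTEGER, name TEXT, email TEXT);
        CREATE TABLE target_customers (id INTEGER, name TEXT, email TEXT);
        INSERT INTO staging_customers VALUES
            (1, 'Ada',  'ada@example.com'),
            (2, '',     'bob@example.com'),
            (3, 'Cleo', 'no-at-sign');
    """)
    print(validate_and_load(conn))  # -> (1, 2)
```

In a real conversion, the rules would live in a rule engine or a configuration table rather than in code, but the shape is the same: reject first, then load whatever survived.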


The "E" of the "ETL" process:
Let's take another, slightly more complex one to automate: Extraction. The concern here is to write a software library that can perform some basic tasks:

  • Define a connection mechanism (i.e. connection string + suitable RDBMS driver)
  • Connect to the source OLTP database (network connectivity, unreachable IP addresses, port numbers, and firewall problems will be your major pitfalls here)
  • Explore and collect interesting information from the source database schema and metadata
  • Prepare a "landing" area for a copy of the data from the source OLTP system
  • Create programs that allow you to "pump" data from the source database into a copy of it
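The last three steps above can be sketched roughly as follows. This is an assumption-laden illustration, not a real extraction library: explore_schema and pump are hypothetical helpers, and sqlite3 stands in for whatever driver your source RDBMS actually requires.

```python
import sqlite3

def explore_schema(source):
    """Collect table names and their CREATE statements from the source,
    skipping SQLite's internal bookkeeping tables."""
    cur = source.execute(
        "SELECT name, sql FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'"
    )
    return cur.fetchall()

def pump(source, landing):
    """Recreate each source table in the landing area, then copy its rows."""
    for name, ddl in explore_schema(source):
        landing.execute(ddl)  # prepare the landing table with the same DDL
        rows = source.execute(f"SELECT * FROM {name}").fetchall()
        if rows:
            placeholders = ",".join("?" * len(rows[0]))
            landing.executemany(
                f"INSERT INTO {name} VALUES ({placeholders})", rows
            )
    landing.commit()

if __name__ == "__main__":
    src = sqlite3.connect(":memory:")
    src.executescript("""
        CREATE TABLE customers (id INTEGER, name TEXT);
        INSERT INTO customers VALUES (1, 'Ada'), (2, 'Bob');
    """)
    land = sqlite3.connect(":memory:")
    pump(src, land)
```

Against a real source you would read the schema from the RDBMS's own catalog views instead of sqlite_master, but the flow (explore metadata, prepare a landing area, pump the rows) is the one described above.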

The "E" part of the "ETL" has become more manageable thanks to the large variety of readily available database drivers. You can transfer data even from stone tablets (provided you obtain a working version of a suitable ODBC driver).


The "T" of the "ETL" process:
The one that remains is the toughest of them all: Transformation. Blogging about that one will have to wait; I have to close up shop for now. I'll be posting a follow-up soon.   *** UPDATE ***  You can read the final installment of this series at this post.


Thanks for reading and happy coding!
