Generating Data Vault Hubs with complex Business Keys using standard SQL


This next entry in the 'simple ETL generation series' introduces additional scenarios for the automation of Hub ETL development.

Beyond the relatively straightforward approach of defining a Business Key using a direct attribute-to-attribute key mapping, a common requirement is the ability to combine multiple data elements into a Business Key definition.

This is a typical scenario when it is necessary to 'uniquefy' a Business Key. This may occur when certain Business Keys are not 'strong' enough, for instance because a technical key has been used (there may not be anything better available). In this scenario, the Business Key definition may need to be extended by adding a source system identifier, or through some other means of combining attributes to guarantee uniqueness (including hard-coded values).

I use the term Business Key definition to indicate how the intended (target) Business Key is represented in the feeding / source system. In other words: how the Business Key is represented in the source. This definition, which can be a combination of values and attributes, can then be mapped to the target (standardised) Data Warehouse Business Key attribute(s) in metadata to facilitate ETL generation.

When mapping the (source) Business Key definition to the target Business Key attribute(s), the most common scenarios besides the straight-up one-to-one mapping are ‘composition’, ‘concatenation’ and ‘pivoting’. In this post I will focus on the first two, as pivoting can be implemented in different ways that (depending on the solution) do not necessarily require pattern changes.

  • Composition is where the target Business Key is defined as a collection of columns (i.e. a composite key). Each part of the Business Key is an individual attribute mapping.
  • Concatenation is where the Business Key definition contains multiple attributes that are combined into a single string value (including sanding components / delimiters for uniqueness). A concatenated key consists of more than one source attribute but maps to a single target Business Key attribute.
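As a rough illustration of the difference between these two scenarios, consider the following Python sketch. The record and attribute names are hypothetical, chosen only to show the shape of the resulting keys:

```python
# Hypothetical source record; attribute names are illustrative only.
record = {"Plan_Code": "PLN-042", "RECORD_SOURCE": "PROFILER"}

# Composition: each key part remains its own target Business Key column.
composite_key = (record["Plan_Code"], record["RECORD_SOURCE"])

# Concatenation: the parts collapse into a single target attribute,
# with a sanding delimiter ('|') so that e.g. ('AB','C') and ('A','BC')
# do not produce the same combined value.
concatenated_key = "|".join([record["Plan_Code"], record["RECORD_SOURCE"]])

print(composite_key)     # two target attributes
print(concatenated_key)  # one target attribute
```

The sanding delimiter is what prevents different attribute combinations from collapsing into identical concatenated strings.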

This is shown in the following diagram as well (sanding elements not displayed):


In all cases, the combined Business Keys map to a single Data Warehouse key (i.e. Hash Key, Sequence Value or Natural Business Key).

It is interesting to see to what extent supporting these scenarios complicates the ETL generation code. The more procedural logic is required, the more you may run into development limitations using SQL.

For now, let’s look at how the concatenation and composition scenarios can be supported via SQL-based code generation.

The setup and metadata

Similar to how we generated the regular Hub ETL process using standard SQL, we can use the TEAM interface (meta)data (INTERFACE_SOURCE_HUB_XREF) as a sample. This data is available when you install TEAM and generate the sample data, but I have scripted these samples and added them to this post as well.

The sample data can be downloaded here (rename to SQL): Download

This script contains the examples we used in the previous post, as well as the more complex use-cases we'll discuss here. The updated code will be able to support all scenarios so far.

If we query this metadata, we will see the following results:

SELECT [HUB_NAME],[HUB_BUSINESS_KEY_DEFINITION],[SOURCE_NAME],[SOURCE_BUSINESS_KEY_DEFINITION] 
  FROM [INTERFACE_SOURCE_HUB_XREF]
  WHERE [HUB_NAME] IN ('HUB_MEMBERSHIP_PLAN', 'HUB_SEGMENT') -- The complex examples; simple ones are supported also

We expect five separate ETL processes to be generated from this metadata: one for each source-to-target mapping.

This query is limited to the examples that contain the concatenation and composition scenarios. These can be identified by the COMPOSITE and CONCATENATE keywords in the (source) Business Key definition that have been prepared by TEAM. In these examples, the definition of the Business Key (as it appears in the source) is the composition (or concatenation) of two attributes.

Generating Hubs with complex Business Keys

Before we dive into code, it is worth investigating the process in a more schematic way – using the Activity Diagram UML notation we used earlier. Compared to the regular Hub process, you can see this diagram has been extended quite a bit to accommodate the handling of different Business Key scenarios.


The major change here is the addition of the ‘key parts’ concept, and various local variables to support the code generation.

The diagram can be downloaded and / or edited here: Download

The overarching idea is to be able to break up the Business Key definition into ‘key parts’ and cycle through these key parts to construct the various clauses in the SQL using the local variables.

How can this be done in SQL? We are looking at two ‘loops’ over the five rows of metadata. One ‘outer’ loop to create the individual Hub ETL process using the source-to-target mapping (one ETL process for one row in the metadata) and an ‘inner’ loop to break up the Business Key definition and iterate over the key parts (Business Key components).
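The two-loop idea can be sketched in plain Python before we attempt it in SQL. This is a minimal sketch only, assuming a simplified metadata shape and using illustrative column names (not the actual TEAM interface structure):

```python
# Simplified metadata: one mapping row with its key parts already split out.
# Each key part pairs a source attribute with its target Business Key attribute.
mappings = [
    {"hub": "HUB_MEMBERSHIP_PLAN",
     "key_parts": [("Plan_Code", "PLAN_CODE"),
                   ("RECORD_SOURCE", "PLAN_SUFFIX")]},
]

for mapping in mappings:                    # outer loop: one ETL process per mapping row
    join_part, where_part = [], []
    for src, tgt in mapping["key_parts"]:   # inner loop: one clause fragment per key part
        join_part.append(f"stg.{src} = hub.{tgt}")
        where_part.append(f"stg.{src} IS NOT NULL AND hub.{tgt} IS NULL")
    sql = (f"-- Working on mapping to {mapping['hub']}\n"
           f"... ON {' AND '.join(join_part)}\n"
           f"WHERE {' AND '.join(where_part)}")
    print(sql)
```

In the T-SQL version below, the outer loop becomes the `hub_cursor`, the inner loop becomes the `keypart_cursor`, and the accumulating lists become local `VARCHAR` variables.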

Breaking up the Business Key definition

In the sample data, the Business Key definition is captured in a single value – for example COMPOSITE(Plan_Code;RECORD_SOURCE). There is a neat SQL trick we can use to transpose this data into multiple rows by splitting the value at the semicolon (or any other selected value).

This works by converting the original Business Key definition value to XML, and then splitting the XML across its nodes. You can try this with the sample data using this query, which is a part of the overall process:

SELECT 
  [HUB_NAME],
  [SOURCE_NAME],
  LTRIM(Split.a.value('.', 'VARCHAR(100)')) AS [BUSINESS_KEY_PART],
  ROW_NUMBER() OVER (PARTITION BY [HUB_NAME], [SOURCE_NAME] ORDER BY (SELECT 100)) AS [ROW_NR]
FROM 
(
SELECT 
  [HUB_NAME],
  [SOURCE_NAME],
  [SOURCE_BUSINESS_KEY_DEFINITION], 
  CAST ('<M>' + REPLACE([SOURCE_BUSINESS_KEY_DEFINITION], ';', '</M><M>') + '</M>' AS XML) AS [BUSINESS_KEY_SOURCE_XML]
FROM [INTERFACE_SOURCE_HUB_XREF]
WHERE [HUB_NAME] = 'HUB_MEMBERSHIP_PLAN' AND SOURCE_NAME='STG_PROFILER_PLAN'
) sub
CROSS APPLY sub.[BUSINESS_KEY_SOURCE_XML].nodes ('/M') AS Split(a)

The inner query from the snippet above returns the XML representation of the Business Key as two separate segments:

<M>COMPOSITE(Plan_Code</M>
<M>RECORD_SOURCE)</M>

The CROSS APPLY then turns each XML node into a row in the selection, and returns the following:


The individual key parts can then be further cleaned up (i.e. removing brackets and keywords) and iterated over to construct the complex Business Key SQL. During this iteration, the available metadata determines whether each key is treated as a composite key or a concatenated key.
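For illustration, the clean-up and classification step can be sketched in Python. The function name and the regular expression are my own (the SQL below uses nested REPLACE calls instead), but the logic is the same:

```python
import re

def split_key_definition(definition: str):
    """Classify a Business Key definition and break it into key parts.

    A plain-Python sketch of the keyword detection (CHARINDEX) and
    clean-up (nested REPLACE) logic in the generation SQL.
    """
    if definition.startswith("COMPOSITE("):
        kind = "Composite"
    elif definition.startswith("CONCATENATE("):
        kind = "Concatenate"
    else:
        kind = "Regular"
    # Strip the keyword and surrounding brackets, then split on the delimiter.
    cleaned = re.sub(r"^(COMPOSITE|CONCATENATE)\(|\)$", "", definition)
    return kind, [part.strip() for part in cleaned.split(";")]

print(split_key_definition("COMPOSITE(Plan_Code;RECORD_SOURCE)"))
# ('Composite', ['Plan_Code', 'RECORD_SOURCE'])
print(split_key_definition("Member_ID"))
# ('Regular', ['Member_ID'])
```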

The full SQL can be downloaded here, but is also shown below. Download

--USE []

-- Parameters
DECLARE @targetDatabase			VARCHAR(100) = '[DVI_200_Integration_Layer]';
DECLARE @targetSchema			VARCHAR(100) = 'dbo'
DECLARE @sourceDatabase			VARCHAR(100) = '[DVI_100_Staging_Area]';
DECLARE @sourceSchema			VARCHAR(100) = 'dbo'
DECLARE @loadDateTimeAttribute	VARCHAR(100) = 'LOAD_DATETIME'
DECLARE @etlProcessIdAttribute	VARCHAR(100) = 'ETL_INSERT_RUN_ID'
DECLARE @recordSourceAttribute	VARCHAR(100) = 'RECORD_SOURCE'

-- Variables / metadata (from the metadata database)
DECLARE @targetTable		VARCHAR(100);
DECLARE @sourceTable		VARCHAR(100);
DECLARE @targetBusinessKey	VARCHAR(MAX);
DECLARE @sourceBusinessKey	VARCHAR(MAX);

-- Variables / local
DECLARE @pattern		VARCHAR(MAX); -- The complete selection / generated output 
DECLARE @targetHashKeyName	VARCHAR(100); -- The derived name of the Hash Key

DECLARE @keyPartSource		VARCHAR(100); -- The key component (source name), in case of composite or concatenated keys
DECLARE @keyPartTarget		VARCHAR(100); -- The key component (target name), in case of composite or concatenated keys
DECLARE @selectPart		VARCHAR(4000);
DECLARE @groupByPart		VARCHAR(4000);
DECLARE @aliasPart		VARCHAR(4000);
DECLARE @joinPart		VARCHAR(4000);
DECLARE @wherePart		VARCHAR(4000);


DECLARE hub_cursor CURSOR FOR   
  SELECT [HUB_NAME],[HUB_BUSINESS_KEY_DEFINITION],[SOURCE_NAME],[SOURCE_BUSINESS_KEY_DEFINITION] 
  FROM [INTERFACE_SOURCE_HUB_XREF]
  WHERE [HUB_NAME] IN ('HUB_MEMBERSHIP_PLAN', 'HUB_SEGMENT') -- The complex examples; simple ones are supported also

OPEN hub_cursor  

FETCH NEXT FROM hub_cursor   
INTO @targetTable, @targetBusinessKey, @sourceTable, @sourceBusinessKey

WHILE @@FETCH_STATUS = 0  
BEGIN  
	--Clear local variables before each iteration
	SET @selectPart = '';
	SET @groupByPart = '';
	SET @aliasPart = '';
	SET @joinPart = '';
	SET @wherePart = '';

	--Create JOIN and WHERE conditions
	DECLARE keypart_cursor CURSOR FOR
	WITH [MAINQUERY] AS
	(
	SELECT
		[HUB_NAME],
		[SOURCE_NAME],
		[BUSINESS_KEY_COMPOSITION],
		CASE
		  WHEN [BUSINESS_KEY_COMPOSITION]='Concatenate' THEN REPLACE(SOURCE_BUSINESS_KEY_DEFINITION,',','+')
		  ELSE [SOURCE_BUSINESS_KEY_DEFINITION]
		END AS [SOURCE_BUSINESS_KEY_DEFINITION],
		[HUB_BUSINESS_KEY_DEFINITION]
	FROM 
	(
	  SELECT 
	    [HUB_NAME],
	    [SOURCE_NAME],
	    CASE 
	  	WHEN CHARINDEX('COMPOSITE(',[SOURCE_BUSINESS_KEY_DEFINITION], 1) > 0 THEN 'Composite' 
	  	WHEN CHARINDEX('CONCATENATE(',[SOURCE_BUSINESS_KEY_DEFINITION], 1) > 0 THEN 'Concatenate' 
	  	ELSE 'Regular'
	    END AS [BUSINESS_KEY_COMPOSITION],
	    REPLACE(
	  	REPLACE(
	  		REPLACE(
	  			REPLACE([SOURCE_BUSINESS_KEY_DEFINITION],'COMPOSITE(','')
	  		,'CONCATENATE(','')
	  	,')','')
	    ,';',',') AS [SOURCE_BUSINESS_KEY_DEFINITION], 
            -- Strip out any metadata information i.e. classification, commas and brackets
	    [HUB_BUSINESS_KEY_DEFINITION]
	  FROM [INTERFACE_SOURCE_HUB_XREF]
	  WHERE [HUB_NAME] = @targetTable AND [SOURCE_NAME] = @sourceTable
	) sub
	-- Define the source business key as XML
	), [SOURCEKEY] AS
	(   
	SELECT 
	  [HUB_NAME],
	  [SOURCE_NAME],
	  [BUSINESS_KEY_COMPOSITION],
	  [SOURCE_BUSINESS_KEY_DEFINITION], 
	  CAST ('<M>' + REPLACE([SOURCE_BUSINESS_KEY_DEFINITION], ',', '</M><M>') + '</M>' AS XML) 
            AS [BUSINESS_KEY_SOURCE_XML]
	FROM [MAINQUERY] 
	-- Define the target business key as XML
	), [TARGETKEY] AS
	(     
	SELECT 
	  [HUB_NAME],
	  [SOURCE_NAME],
	  [BUSINESS_KEY_COMPOSITION],
	  [HUB_BUSINESS_KEY_DEFINITION],
	  CAST ('<M>' + REPLACE([HUB_BUSINESS_KEY_DEFINITION], ',', '</M><M>') + '</M>' AS XML) AS [BUSINESS_KEY_TARGET_XML]
	FROM [MAINQUERY] 
	-- Break up the source business key in parts to support composite keys
	), [SOURCEKEYPARTS] AS
	( 
	SELECT 
	  [HUB_NAME],
	  [SOURCE_NAME],
	  [BUSINESS_KEY_COMPOSITION],
	  LTRIM(Split.a.value('.', 'VARCHAR(100)')) AS [BUSINESS_KEY_PART],
	  ROW_NUMBER() OVER (PARTITION BY [HUB_NAME], [SOURCE_NAME] ORDER BY (SELECT 100)) AS [ROW_NR]
	FROM [SOURCEKEY]
	CROSS APPLY [SOURCEKEY].[BUSINESS_KEY_SOURCE_XML].nodes ('/M') AS Split(a)
	-- Break up the target business key to match the composite keys on ordinal position
	), [TARGETKEYPARTS] AS
	( 
	SELECT
	  [HUB_NAME],
	  [SOURCE_NAME],
	  [BUSINESS_KEY_COMPOSITION],
	  LTRIM(Split.a.value('.', 'VARCHAR(100)')) AS [BUSINESS_KEY_PART],
	  ROW_NUMBER() OVER (PARTITION BY [HUB_NAME], [SOURCE_NAME] ORDER BY (SELECT 100)) AS [ROW_NR]
	FROM [TARGETKEY]
	CROSS APPLY [TARGETKEY].[BUSINESS_KEY_TARGET_XML].nodes ('/M') AS Split(a)
	)
	SELECT
	  SOURCEKEYPARTS.BUSINESS_KEY_PART AS SOURCE_BUSINESS_KEY_PART,
	  TARGETKEYPARTS.BUSINESS_KEY_PART AS TARGET_BUSINESS_KEY_PART
	FROM SOURCEKEYPARTS
	JOIN TARGETKEYPARTS ON SOURCEKEYPARTS.[HUB_NAME]=TARGETKEYPARTS.[HUB_NAME] 
          AND SOURCEKEYPARTS.[SOURCE_NAME] = TARGETKEYPARTS.[SOURCE_NAME]
	WHERE SOURCEKEYPARTS.ROW_NR = TARGETKEYPARTS.ROW_NR

	OPEN keypart_cursor
	  
	FETCH NEXT FROM keypart_cursor INTO @keyPartSource, @keyPartTarget
	
	WHILE @@FETCH_STATUS = 0  
	BEGIN  

	-- Also support concatenate keys
	SELECT @keyPartSource = REPLACE(@keyPartSource,'+',' + stg.')

	-- Evaluate the various pattern snippets
	SET @selectPart = @selectPart+'ISNULL(RTRIM(CONVERT(NVARCHAR(100), stg.'+ @keyPartSource +')),''NA'')+''|'' +'+CHAR(13);
	SET @groupByPart = @groupByPart+'  stg.'+@keyPartSource+','+CHAR(13);
	SET @aliasPart = @aliasPart + '  stg.'+@keyPartSource+' AS '+@keyPartTarget+','+CHAR(13);
	SET @joinPart = @joinPart + '  stg.'+@keyPartSource+' = hub.'+@keyPartTarget+' AND'+CHAR(13);
	SET @wherePart = @wherePart + '  stg.'+@keyPartSource+' IS NOT NULL AND hub.'+@keyPartTarget+' IS NULL AND'+CHAR(13);

	FETCH NEXT FROM keypart_cursor   
	INTO  @keyPartSource, @keyPartTarget
	END  

	CLOSE keypart_cursor;  
	DEALLOCATE keypart_cursor;  
	--End of key part cursor

	--Remove trailing parts from key parts
	SET @selectPart		= LEFT(@selectPart,DATALENGTH(@selectPart)-2)
	SET @groupByPart	= LEFT(@groupByPart,DATALENGTH(@groupByPart)-2)
	SET @aliasPart		= LEFT(@aliasPart,DATALENGTH(@aliasPart)-1)
	SET @joinPart		= LEFT(@joinPart,DATALENGTH(@joinPart)-4)
	SET @wherePart		= LEFT(@wherePart,DATALENGTH(@wherePart)-4)
					
	--Derive the hash key column name
	SET @targetHashKeyName = REPLACE(@targetTable,'HUB_','') +'_HSH';

	--Insert into pattern
 	SET @pattern = '-- Working on mapping to ' +  @targetTable + ' from source table ' + @sourceTable+CHAR(13)+CHAR(13);
	SET @pattern = @pattern+'USE '+@sourceDatabase+CHAR(13)+CHAR(13);

	SET @pattern = @pattern+'INSERT INTO '+@targetDatabase+'.'+@targetSchema+'.'+@targetTable+CHAR(13);
	SET @pattern = @pattern+'('+@targetHashKeyName+', '+@targetBusinessKey+', '+@loadDateTimeAttribute+', '+@etlProcessIdAttribute+', '+@recordSourceAttribute+')'+CHAR(13);
	SET @pattern = @pattern+'SELECT'+CHAR(13);
	SET @pattern = @pattern+'  HASHBYTES(''MD5'','+CHAR(13);
	SET @pattern = @pattern+''+@selectPart+CHAR(13);
	SET @pattern = @pattern+'  ) AS '+@targetHashKeyName+','+CHAR(13);
	SET @pattern = @pattern+''+@aliasPart+CHAR(13);
	SET @pattern = @pattern+'  MIN(stg.'+@loadDateTimeAttribute+') AS '+@loadDateTimeAttribute+','+CHAR(13);
	SET @pattern = @pattern+'  -1 AS '+@etlProcessIdAttribute+','+CHAR(13);
	SET @pattern = @pattern+'  stg.'+@recordSourceAttribute+''+CHAR(13);
	SET @pattern = @pattern+'FROM '+@sourceTable+' stg'+CHAR(13);
	SET @pattern = @pattern+'LEFT OUTER JOIN '+@targetDatabase+'.dbo.'+@targetTable+' hub ON '+CHAR(13);
	SET @pattern = @pattern+''+@joinPart+CHAR(13);
	SET @pattern = @pattern+'WHERE '+CHAR(13);
	SET @pattern = @pattern+''+@wherePart+CHAR(13);
	SET @pattern = @pattern+'GROUP BY'+CHAR(13);
	SET @pattern = @pattern+'  HASHBYTES(''MD5'','+CHAR(13);
	SET @pattern = @pattern+''+@selectPart+CHAR(13);
	SET @pattern = @pattern+'  ),'+CHAR(13);
	SET @pattern = @pattern+'  stg.'+@recordSourceAttribute+','+CHAR(13);
	SET @pattern = @pattern+@groupByPart

   	PRINT @pattern+CHAR(13);     
		 
    FETCH NEXT FROM hub_cursor   
    INTO @targetTable, @targetBusinessKey, @sourceTable, @sourceBusinessKey
END  
 
CLOSE hub_cursor;  
DEALLOCATE hub_cursor;

Executing this code against the sample data will yield the five expected ETL processes with correct handling of complex Business Keys. An example is added below. The complex key handling is inserted in various places, such as the Hash calculation, the join condition and the Where clause.

INSERT INTO [DVI_200_Integration_Layer].dbo.HUB_MEMBERSHIP_PLAN
(MEMBERSHIP_PLAN_HSH, PLAN_CODE,PLAN_SUFFIX, LOAD_DATETIME, ETL_INSERT_RUN_ID, RECORD_SOURCE)
SELECT
  HASHBYTES('MD5',
    ISNULL(RTRIM(CONVERT(NVARCHAR(100), stg.Plan_Code)),'NA')+'|' +
    ISNULL(RTRIM(CONVERT(NVARCHAR(100), stg.RECORD_SOURCE)),'NA')+'|' 
  ) AS MEMBERSHIP_PLAN_HSH,
  stg.Plan_Code AS PLAN_CODE,
  stg.RECORD_SOURCE AS PLAN_SUFFIX,
  MIN(stg.LOAD_DATETIME) AS LOAD_DATETIME,
  -1 AS ETL_INSERT_RUN_ID,
  stg.RECORD_SOURCE
FROM STG_PROFILER_PERSONALISED_COSTING stg
LEFT OUTER JOIN [DVI_200_Integration_Layer].dbo.HUB_MEMBERSHIP_PLAN hub ON 
  stg.Plan_Code = hub.PLAN_CODE AND
  stg.RECORD_SOURCE = hub.PLAN_SUFFIX 
WHERE 
  stg.Plan_Code IS NOT NULL AND hub.PLAN_CODE IS NULL AND
  stg.RECORD_SOURCE IS NOT NULL AND hub.PLAN_SUFFIX IS NULL 
GROUP BY
  HASHBYTES('MD5',
    ISNULL(RTRIM(CONVERT(NVARCHAR(100), stg.Plan_Code)),'NA')+'|' +
    ISNULL(RTRIM(CONVERT(NVARCHAR(100), stg.RECORD_SOURCE)),'NA')+'|' 
  ),
  stg.RECORD_SOURCE,
  stg.Plan_Code,
  stg.RECORD_SOURCE

What’s next?

The code is quite a bit more complex compared to the original Hub example. Indeed, the SQL code will get progressively more complex the further we continue to support additional scenarios and patterns. The VEDW (code generation) and TEAM (metadata management) applications aim to separate these concerns and are evolving to (hopefully) make managing this complexity a bit easier. The more complex the logic becomes, the more attractive it can be to adopt a programming language such as C#, which allows greater flexibility in managing code complexity than SQL.

This is the last post before I head to Europe to deliver the Virtual Data Warehouse implementation and generation training a few times (June 17-19 and 24-26 in The Netherlands). I'll pick things up again when I'm back in Australia.

In the next post I'm planning to look into applying the templating engine approach to generate the scenarios covered in this post, using the generic interface format for ETL automation metadata.

After this, we can probably leave the Hub behind and start focusing on generating the remaining Data Vault components.

 
