Parsing Strings by the Last Occurrence of a Search String

November 20th, 2012

I needed to do some work recently that involved iterating through a set of files scattered across a set of folders and subfolders.

To get the list of files & folders in the first place, the following script would generate a list of all the files under C:\ETLFILEROOT\ITERATEFILESROOT\:

$ROOT_FOLDER = 'C:\\ETLFILEROOT\\ITERATEFILESROOT\\';
$FILENAME_SEARCH = '*.*';
$FILE_DIR = 'C:\\ETLFILEROOT\\ITERATE_DIR.TXT';

# "Regular" list of files meeting the spec in the designated folder.
# $CMD = 'DIR "' || $ROOT_FOLDER || $FILENAME_SEARCH || '" /A-D /B > "' || $FILE_DIR || '"';

# List of the files meeting the spec in the root folder and all subfolders.
$CMD = 'DIR "' || $ROOT_FOLDER || $FILENAME_SEARCH || '" /A-D /B /S > "' || $FILE_DIR || '"';
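# For reference, with the values above $CMD resolves to (roughly):
# DIR "C:\ETLFILEROOT\ITERATEFILESROOT\*.*" /A-D /B /S > "C:\ETLFILEROOT\ITERATE_DIR.TXT"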

if ($G_ETLDR_DEBUG = 1) begin
print( 'Executing command $cmd: ' || $CMD );
end
$ERR = exec('cmd.exe', $CMD, 8);

$ERR_RC = substr( $ERR, 0, 7 );
$ERR_MSG = substr( $ERR, 8, 100 );

if (cast($ERR_RC, 'INT') > 0) begin
raise_exception( 'Executing command ' || $CMD || ' generated error code & error: ' || $ERR );
end

The file it produces, ITERATE_DIR.TXT, looks like this:

C:\ETLFILEROOT\ITERATEFILESROOT\file1.txt
C:\ETLFILEROOT\ITERATEFILESROOT\file2DJDJD.txt
C:\ETLFILEROOT\ITERATEFILESROOT\SUBFOLDER1\file1.txt
C:\ETLFILEROOT\ITERATEFILESROOT\SUBFOLDER1\file2.txt
C:\ETLFILEROOT\ITERATEFILESROOT\SUBFOLDER1\SUBSUBFOLDER1\file1.txt
C:\ETLFILEROOT\ITERATEFILESROOT\SUBFOLDER1\SUBSUBFOLDER1\file2.txt
C:\ETLFILEROOT\ITERATEFILESROOT\SUBFOLDER2\fileDFDFDF2.txt
C:\ETLFILEROOT\ITERATEFILESROOT\SUBFOLDER2\fileEYREUYRIWUE1.txt

For subsequent file manipulations (in a WHILE loop), I needed to read this list-of-files file into a database table and separate the filename from the path. That’s a recurring need, and so it should be handled by a pair of functions: one to return everything in a string up to and including the last occurrence of a search string, and another to return everything after that last occurrence. Parsing strings based on the first occurrence of a search string is easy with the existing index() function, but the last occurrence is just a little more complex.

 

FN_ETLDR_LEFT_OF_LAST()

# ETL Doctor
# etldoctor.com
# Jeff Prenevost
# 2012-08-01

# Free software issued under the GNU General Public License
# FN_ETLDR_LEFT_OF_LAST()
# Returns all characters to the left of, and including, the last occurrence of a search string. Requires FN_ETLDR_REVERSE() or another
# suitable function that can reverse strings.
# Parameters:
# $P_str varchar(8000)
# $P_search varchar(8000)

$L_return = substr( $P_str, 0, length( $P_str ) - index( FN_ETLDR_REVERSE( $P_str ), FN_ETLDR_REVERSE( $P_search ), 0) + 1);

return $L_return;

FN_ETLDR_RIGHT_OF_LAST()

# ETL Doctor
# etldoctor.com
# Jeff Prenevost
# 2012-08-01

# Free software issued under the GNU General Public License
# FN_ETLDR_RIGHT_OF_LAST()
# Returns all characters to the right of the last occurrence of a search string, not including that string. Requires FN_ETLDR_REVERSE() or another
# suitable function that can reverse strings.
# Parameters:
# $P_str varchar(8000) — the string in which to search
# $P_search varchar(8000) — the string to search for; everything following the last occurrence of this will be returned.

$L_return = substr( $P_str, length( $P_str ) - index( FN_ETLDR_REVERSE( $P_str ), FN_ETLDR_REVERSE( $P_search ), 0) + 2, 8000);
return $L_return;
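Both functions lean on FN_ETLDR_REVERSE(), which isn’t reproduced here. If you don’t already have a string-reversal custom function, a minimal sketch along these lines should do (the parameter name $P_str is assumed; substr() and length() are used 1-based):

$L_out = '';
$L_i = 1;
while ($L_i <= length( $P_str ))
begin
$L_out = substr( $P_str, $L_i, 1) || $L_out;
$L_i = $L_i + 1;
end
return $L_out;

With the pair in place, splitting each line of ITERATE_DIR.TXT into its path and file name becomes a one-liner apiece (the variable names here are illustrative):

$L_path = FN_ETLDR_LEFT_OF_LAST( $L_file_line, '\\');
$L_file = FN_ETLDR_RIGHT_OF_LAST( $L_file_line, '\\');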

Target File Naming

October 19th, 2012

You know you should properly name your target or output files: informatively, consistently across time and subject areas, according to some standard. Developers often give this housekeeping little attention and make a mess. Confusion over “the right files” can easily flush a few hours down the drain. Picky, seemingly minor stuff, but small grains of sand do big damage in an engine.

After a brief discussion of file names as metadata, we’ll present a ready-to-go file naming standard that may work for you. A custom “generate file name” function, free to download, ties to this standard to help enforce consistent, informative names.

Download the code associated with this article: 

Immature vs. Mature File Naming Code

In the area of file naming, immature BODS development manifests in two primary ways: 1) the absence of a standard or system for file names, and 2) hard-coded file names.

In immature development, a developer, faced with the need to generate a flat file, simply concocts what seems an appropriate name and enters it in as a literal string in the “File name(s)” field. In mature development, we want to see the following:

  1. A written file naming standard, compatible with the project and environment needs, for naming files both a) informatively and b) consistently across developers, subject areas, and time.
  2. The use of variables, and only variables, for file paths and names.
  3. Consistent handling of those variables in script; that is, a standard not only for what the names should be but for how to handle their creation.
  4. Team management practices to efficiently audit for compliance, to be able to take corrective action; standards are useless if not followed.

File Names as Metadata

Let’s take a step back and consider file names as metadata for the file contents. For file-based data, you’ve got the following metadata options:

  • The file name
  • File properties (handled by the operating system)
  • Descriptions stored externally to the data files themselves, such as file pairs (one with data, one with description) or external lists (spreadsheets, databases, data modeling tools).
  • Internal description, where column names in a single-row header are the simplest and most ubiquitous, to multi-line headers, to the full embedded metadata of an XML file
  • Conformance with an external standard, like HL7 for health care data, or an XML Schema

Of all these, the file name provides the least rich, most restrictive avenue of description – it’s just a single short string. But the file name stands out as the universally present and only absolutely necessary bit of metadata, “in your face” as you navigate a file system, the essential handle for grabbing a particular hunk of file-based data. The other avenues of metadata offer unlimited potential, but require either inspecting file contents or referring to something external. In many circumstances, file names alone, or possibly just file names in a well-organized folder hierarchy, will provide all the metadata you need. Bear in mind, though, that if your metadata needs exceed what’s possible through file naming, you have options.

File Names as Records

You can usefully think of a file name as a data record, where the “table” is a directory or folder, and the fields are contained in the file name handled as either a fixed-length or delimited string.

As such, the main questions involve how to form this filename-as-record:

  • Delimited or fixed-length?
  • If delimited, using what delimiter?
  • Most importantly, according to what schema? What and where are the “fields” inside the file name-as-record?

Question: Delimited or Fixed Length?

Answer: It doesn’t matter much. You can pack more information in with a fixed-length format, especially if you keep to short codes, at the cost of readability. With a file name a hundred characters long, you could pack fifty fields’ worth of metadata in with two-character codes. A computer wouldn’t have any trouble parsing out such names, but heaven help the hapless human. You’ll always value transparent, easily understood names, so start with a more verbose, understandable naming schema, and migrate to denser, less readable formats as need dictates.

Question: If delimited, using what delimiter?

Answer: Again, it doesn’t matter much. It has to be a legal character for the operating system at hand; you might want to avoid spaces unless you don’t mind consistently quoting file names; and it should be a character that doesn’t naturally occur in your embedded field contents. Personally, I like the underscore “_” character.

Question: According to what schema? What and where are the “fields” inside the file name?

Answer: Although I’ll suggest a standard later on, it really depends on your project and environment. The main thing is to have a schema that works across time, objects, and developers, and then use it. Stay cognizant of the place of the file name in the arsenal of metadata possibilities, and use folder structures, internal description, or external description, too – don’t try to do “everything” in the file name if you have complex needs.

Construct the fields of the file name in order from most to least general, thinking of each field as a “folder” in a hierarchical structure, so that the files sort intuitively. If you can name the file so it can be understood and “placed” in context even when someone receives it or views it outside of the folder structure in which it’s normally found (a folder structure that may provide lots of metadata), that’s great; as an email attachment, it’s more helpful to receive “DW_DIM_CUST_201108230235.csv” than “cust1.txt”.

ETL Doctor File Naming Standard

If you’re looking for a default file name schema and system, you may want to consider and adopt the ETL Doctor standard. This has five fields in the file name proper, with the sixth holding the extension. Separate local (never global!) variables handle the file name fields in workflows that wrap dataflows.  We set all file names in consistent script blocks upstream of the dataflows, into which the file names are passed as parameters, one per target file. If a single dataflow generates multiple output files in parallel, we pass in multiple variables.

We use separate variables for each file name and path component to keep file naming consistent and organized; forcing oneself to handle each file name field separately encourages thoughtfulness and consistency. We then use a custom function to assemble a complete path-and-file-name, in a consistent way, out of all the components. In brief, to move from immature to mature code, we:

  1. Wrap dataflows in workflows
  2. Set up file naming variables in workflows
  3. In workflow script, build file paths-and-names from variables using a special “build the file name” custom function
  4. Pass file names in as parameters to dataflows

It might seem like overkill, but it saves time over the long haul, and workflow templates make short work of the coding.

ETL Doctor File Naming Variables

To set up your file naming code according to this standard, the bulk of the work is to set up the variables and script in the workflow.

Below are all the ETL Doctor standard workflow variables (at least those related to file naming, for workflows that wrap dataflows). To adopt this standard, you’d need to set these up as local variables in workflows that wrap dataflows in which you use file-based targets. Most are varchars; the length is indicated for each. Do not use global variables. In the screen shot, I’ve highlighted the variables used to form the file names – the “fields” or components of the file names.

image

Variables used to form the filename itself

We use this set as parameters to function FN_ETLDR_FILENAME_OUTPUT(), to which we “outsource” the actual construction of the file names.

$filename_path (200)

Optional. The path to the file, relative to the root directory, as specified in the BODS “root directory” field (not the root of some operating system volume). Example:

$filename_path = 'OUTPUT/FILEBATCHES/' || TO_CHAR(SYSDATE(),'YYYYMMDD') || '/' || LTRIM_BLANKS(TO_CHAR($BATCHNUM,'9999')) || '/';

In this case, if $BATCHNUM were equal to, say, 15, on July 4th, 2012, we’d get a path of “OUTPUT/FILEBATCHES/20120704/15/”.

$filename_object (50)

Required. The ETL “object” in question, e.g., “customer,” “accounts payable,” “diagnoses.” Should be pulled from an official list of ETL objects (in a given environment or situation) and their standard abbreviations or codes.

Example:

$filename_object = 'MATL';

$filename_segment (50)

Optional. An “object” is often a cluster or tree of data. We may be working on the “customer” object, a nested structure inside of which we find a flat structure for customer addresses, another for customer relationships to internal sales organizations, another establishing customer relationships to each other (“partnerships”). The “segment” identifies the sub-object, if it exists.

Example:

$filename_segment = 'E1KNA1M';

$filename_variant (50)

Optional. Basically, an official extra field to cover further classification needs.

$filename_phase (50)

Optional. “Object,” “segment,” “variant,” and “phase” form a 4-part categorization scheme in the ETL Doctor standard system. In this scheme, “phase” refers to the “phase-of-the-ETL-job” at hand, not the project phase. It’s from the point of view of the entire movement of some object’s data, source-to-target, which passes through a number of defined phases, for instance: raw extraction; initial rule-based filtering; source cleansing; data quality; target mapping and business rules; flat file formation.

Example:

$filename_phase = 'SRCXTRACT';

$sysdate_char (25)

Required. This is a more general purpose variable used in many contexts, holding a varchar date-time stamp in the form ‘YYYYMMDDHH24MI’ (or ‘YYYYMMDDHH24MISS’ if you want to record to the second, or all the way up to ‘YYYYMMDDHH24MISSFFFFFFFFF’ ). It’s used as part of the file name.

Holds the date-time “of record” for the file. Use whatever’s appropriate for “of record”; usually, this is the system date-time taken shortly before file creation, in the parent workflow of the dataflow.

If you are running simultaneous processes or creating multiple files with otherwise identical names so quickly that they might get identical $sysdate_char values (and, thus, identical file names), you’ll need another field: a tie-breaking counter-type field, or one to hold a unique identifier, which you could get from gen_uuid().

Example:

$sysdate_char = to_char(sysdate(),'YYYYMMDDHH24MI');
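If you do need the tie-breaking unique identifier mentioned above, a minimal version might look like this (the variable name is illustrative, not part of the standard):

$filename_unique = gen_uuid();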

$filename_ext (3)

Required. The filename extension.

Example:

$filename_ext = 'TXT';

Other Related Variables

For as many different output files as we’ll be creating in the child dataflow, we declare multiple (local) variables as filename_output_[x] varchar(500), and matching parameters in the child dataflow. Here’s the Variables & Parameters window for an ETL Doctor standard template, stripped down to the variables related to files and file naming.

image

$filename_output_[x] (500)

These hold the file names themselves, and are populated as the output of a call to the function FN_ETLDR_FILENAME_OUTPUT(). Each is wired to an identically-named dataflow parameter, as below:

image

For those not familiar with using dataflow parameters, note that the connections are made in the “Calls” tab of the parent container, as highlighted above.

If you had more than two output files in the child dataflow, you would make additional sets of relevant variables.

$filename_output_[x]_all (500)

The “all” variables store wildcard versions of the file names, for use in archiving and purging files. We’ll cover some code for handling directories and file movement in a separate article.

 

FN_ETLDR_FILENAME_OUTPUT() Custom Function

 

With all the variables in hand, we call a function to form the file name. You can download the ATL above, or build your own by copying-and-pasting the code below, and setting the appropriate parameters and variables.

# ETL Doctor Output File Namer
# Jeff Prenevost
# jeff@etldoctor.com
# Free software under the Gnu Public License.

# Generates output file paths-and-names consistently, according to the ETL Doctor 8-part file naming standard and system:
# 1) $P_1_path varchar(200). Path from "global root," $G_ETLDR_ROOTDIRECTORY. The relative path
#    from the root directory, which, if you are using ETL Doctor file-based data organization, will
#    always be $G_ETLDR_ROOTDIRECTORY.
# 2) $P_2_object varchar(50). The abbreviated identifier for the "object" of the code.
# 3) $P_3_segment varchar(50). The code segment – the identifier for the part or subset of the object
#    being addressed in the dataflow at hand. Optional.
# 4) $P_4_variant varchar(50). Variant — an abbreviated description included to distinguish multiple outputs in the same dataflow, varying by content or purpose (not merely format). Optional.
# 5) $P_5_phase varchar(50). The code phase — the abbreviated identifier for the phase of the code. Optional.
# 6) $P_6_dt varchar(14). A datetime stamp, as fine-grained as necessary (and possible).
# 7) $P_7_ext varchar(3). The file extension.

# Path handling – $P_1_path
# Optional. Nulls are acceptable if there’s no desire to put
# files anywhere other than directly into the folder
# pointed at by $G_ETLDR_ROOTDIRECTORY.

# We're not going to worry about whether there's a leading
# slash — DS handles the internal concatenation (in effect)
# between the root directory and any path in the file name just fine,
# with slashes or no slashes, backwards or forwards.

# We do, however, need a slash after the path, if there *is* a path. We
# like to use forward slashes, which don't need to be doubled.
# What we'll do, therefore, is:
# 1) replace any backward slashes with forward slashes.
# 2) add a forward slash to the end.
# 3) replace any doubled forward slashes with single forward slashes.

# Convert null to an empty string, trim any blanks,
# and convert backward slashes to forward slashes:
$P_1_path = replace_substr( ltrim_blanks( rtrim_blanks( nvl( $P_1_path, ''))), '\\', '/');

# If there *is* a path at all, and there's not already
# a slash on the end, tack one on:
if ($P_1_path <> '' and $P_1_path not like '%/') begin $P_1_path = $P_1_path || '/'; end

# For all the rest: trim them, force to uppercase,
# and replace nulls with an empty string:
$P_2_object = upper( ltrim_blanks( rtrim_blanks( nvl( $P_2_object, ''))));
$P_3_segment = upper( ltrim_blanks( rtrim_blanks( nvl( $P_3_segment, ''))));
$P_4_variant = upper( ltrim_blanks( rtrim_blanks( nvl( $P_4_variant, ''))));
$P_5_phase = upper( ltrim_blanks( rtrim_blanks( nvl( $P_5_phase, ''))));
$P_6_dt = upper( ltrim_blanks( rtrim_blanks( nvl( $P_6_dt, ''))));
$P_7_ext = upper( ltrim_blanks( rtrim_blanks( nvl( $P_7_ext, ''))));

# If an interior field isn't blank, append an underscore:
$P_2_object = ifthenelse( $P_2_object = '', $P_2_object, $P_2_object || '_');
$P_3_segment = ifthenelse( $P_3_segment = '', $P_3_segment, $P_3_segment || '_');
$P_4_variant = ifthenelse( $P_4_variant = '', $P_4_variant, $P_4_variant || '_');
$P_5_phase = ifthenelse( $P_5_phase = '', $P_5_phase, $P_5_phase || '_');

# Form the file name:
$L_filename = $P_1_path || $P_2_object || $P_3_segment || $P_4_variant || $P_5_phase || $P_6_dt || '.' || $P_7_ext;

# Eliminate any doubled slashes — we are appending this to
# the root directory specification, not replacing it wholesale.
while ($L_filename like '%//%') begin
$L_filename = replace_substr( $L_filename, '//', '/');
end

return $L_filename;
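As a quick sanity check, here is a hypothetical call using the example values from earlier in the article, along with the name the logic above would produce (the local variable name is illustrative):

$L_name = FN_ETLDR_FILENAME_OUTPUT('Output', 'MATL', 'E1KNA1M', '', 'SRCXTRACT', '201207041430', 'TXT');
# $L_name now holds 'Output/MATL_E1KNA1M_SRCXTRACT_201207041430.TXT'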

Data Services Example

Here’s an example of all this in action. In overview, the dataflow in which we’ll actually create the output file is always wrapped in a workflow that “sets up” the dataflow. I’ve opened the Variables & Parameters window to display the variables for this workflow, as described above. (Ignore the variables unrelated to file naming.)

image

In the Calls tab of the variables and parameters window, you can see the values we’re setting in the workflow (in the INIT_WF script) and then passing into the dataflow:

image

The file naming work is done in the INIT_WF script:

## ETL Doctor template
## Workflow intended to wrap dataflows
## Ver. 2012-09-18
## etldoctor.com
## Jeff Prenevost
## jeff@etldoctor.com
## VARIABLES
## *********
## Non-generic
## ———–

## ETL Doctor Generic
## ——————

## File naming variables. All are varchar. Each corresponds, one-to-one, to a parameter of
## function FN_ETLDR_FILENAME_OUTPUT; the name of the corresponding function parameter is
## included after the length.

## When we call FN_ETLDR_FILENAME_OUTPUT, these are the parameters:
## 1) $P_1_path varchar(200)
## 2) $P_2_object varchar(50). The abbreviated identifier for the “object” of the code.
## 3) $P_3_segment varchar(50). The code segment — the identifier for the part or subset of the object being addressed in the dataflow at hand. Optional.
## 4) $P_4_variant varchar(50). Variant — an abbreviated description included to distinguish multiple outputs in the same dataflow, varying by content or purpose (not merely format). Optional.
## 5) $P_5_phase varchar(50). The code phase — the abbreviated identifier for the phase of the code. Optional.
## 6) $P_6_sysdate varchar(25). A datetime stamp in the form YYYYMMDDHH24MISS
## 7) $P_7_ext varchar(3). The file extension.

## $filename_path (200) ($P_1_path)
## Optional. The path to the file, relative to the root directory, as specified in
## the BODS “root directory” field (not the root of the operating system volume).

## Example:
## $filename_path = 'OUTPUT/FILEBATCHES/' || TO_CHAR(SYSDATE(),'YYYYMMDD') || '/' || LTRIM_BLANKS(TO_CHAR($BATCHNUM,'9999')) || '/';

## In this case, if $BATCHNUM were equal to, say, 15, on July 4th, 2012, we’d get
## a path of “OUTPUT/FILEBATCHES/20120704/15/”.

## $filename_object (50) ($P_2_object)
## Required. The ETL “object” in question, e.g., “customer,” “accounts payable,” “diagnoses.”
## Should be pulled from an official list of ETL objects (in a given environment or situation)
## and their standard abbreviations or codes.

## Example:
## $filename_object = 'MATL';

## $filename_segment (50) ($P_3_segment)
## Optional. An “object” is often a cluster or tree of data. We may be working on the “customer”
## object, a nested structure inside of which we find a flat structure for customer addresses, another
## for customer relationships to internal sales organizations, another establishing customer
## relationships to each other (“partnerships”). The “segment” identifies the sub-object, if it exists.

## Example:
## $filename_segment = 'E1KNA1M';

## $filename_variant (50) ($P_4_variant)
## Optional. Basically, an official extra field to cover further classification needs.

## $filename_phase (50) ($P_5_phase)
## Optional. “Object,” “segment,” “variant,” and “phase” form a 4-part categorization scheme in
## the ETL Doctor standard system. In this scheme, “phase” refers to the “phase-of-the-ETL-job”
## at hand, not the project phase. It’s from the point of view of the entire movement of some
## object’s data, source-to-target, which passes through a number of defined phases, for
## instance: extraction; initial cleansing; data quality; target mapping and business
## rules; flat file formation.

## Example:
## $filename_phase = 'SRCXTRACT';

## $sysdate_char (25) ($P_6_sysdate)
## Required. This is a more general purpose variable used in many contexts, holding a varchar
## date-time stamp in, by default, the form ‘YYYYMMDDHH24MI’. It’s used as part of the filename.

## Holds the date-time “of record” for the file. Use whatever’s appropriate for “of record”;
## usually, this is the system date-time taken shortly before file creation. Increase the
## scale to seconds or sub-seconds as needed. If you are running simultaneous processes
## or creating multiple files with otherwise identical names so quickly that they might get identical
## $now values (and, thus, identical filenames), you’ll need another field, a tie-breaking
## counter-type field, or one to hold a unique identifier, which you could get from gen_uuid().

## Example:
## $sysdate_char = to_char(sysdate(),'YYYYMMDDHH24MI');

## $filename_ext (3) ($P_7_ext)

## Required. The filename extension.

## Example:
## $filename_ext = 'TXT';

 

## ETL Doctor Standard 4-Tier File Path System
## ===========================================
## We’ll be calling FN_ETLDR_FILENAME_OUTPUT, the first parameter of which, $P_1_path, is the path to the file.
## This is the relative path from “global root,” $G_ETLDR_ROOTDIRECTORY, when using ETL Doctor standard file-based
## data organization.

## The path string is prepended to (put at the front of) the filename proper, separated from the filename with a forward slash. You
## can use either forward or backward slashes if you want to specify more than one subfolder’s depth under the root folder.
## If you are only specifying a single subfolder of the root, you can enter or leave off trailing slashes — doesn’t matter.
## However, do not enter anything cute or fancy at the beginning — the path should define a relative sub-path to the root.

## Examples:
## Acceptable:
## 'Output'
## 'Output/'
## 'Output\'
## 'Output\Store1\May2012'

## Not acceptable:
## '..\..\Root2'
## '/Output'
## 'Output//'

## ETL Doctor recommends a standard set of subfolders of the folder designated for an object, itself living under a root.
## Roots vary by, or “move” by, system configuration.

## The folder structure looks like this:

## Root
## |
## +-- Object
##     |
##     +-- Archive
##     +-- Error
##     +-- InOut
##     +-- InProcess
##     +-- Input
##     +-- Output
##     +-- Report

## Archive: move files into here after they’re done processing
## Error: move files (or records) in error to here
## InOut: for “regular data files” that, from the ETL perspective, are used for both input to an ETL process and output from another ETL process.
## Input: data being fed into ETL from outside; a “landing space”
## Output: regular data output of an ETL process
## Report: report output of an ETL process; not “data” per se, although often accompanying the output (or input) data of an ETL run

## SET VARIABLE VALUES
## *******************
$sysdate_dt = sysdate( );
$sysdate_char = to_char( $sysdate_dt, 'YYYYMMDDHH24MI');

## FILENAME_OUTPUT_1
## =================
## Repeat setting these variables and calling FN_ETLDR_FILENAME_OUTPUT for each separate output file.
$filename_path = 'Output'; # optional
$filename_object = $G_ETLDR_OBJECT; # required
$filename_segment = 'SAMPLESEGMENT'; # optional
$filename_variant = 'SAMPLEVARIANT'; # optional
$filename_phase = 'SAMPLEPHASE'; # optional
$filename_ext = 'txt';

## Build the full path-and-name for this output file.
$filename_output_1 = FN_ETLDR_FILENAME_OUTPUT($filename_path, $filename_object, $filename_segment, $filename_variant, $filename_phase, $sysdate_char, $filename_ext);
# The following is used for wildcard operations, such as deleting or archiving.
$filename_output_1_all = FN_ETLDR_FILENAME_OUTPUT($filename_path, $filename_object, $filename_segment, $filename_variant, $filename_phase, '*', $filename_ext);

## FILENAME_OUTPUT_2
## =================
## Repeat setting these variables and calling FN_ETLDR_FILENAME_OUTPUT for each separate output file.
$filename_path = 'Output'; # optional
$filename_object = $G_ETLDR_OBJECT; # required
$filename_segment = 'SAMPLESEGMENT'; # optional
$filename_variant = 'SAMPLEVARIANT2'; # optional
$filename_phase = 'SAMPLEPHASE'; # optional
$filename_ext = 'txt';

## Set new variable values, etc.
$filename_output_2 = FN_ETLDR_FILENAME_OUTPUT($filename_path, $filename_object, $filename_segment, $filename_variant, $filename_phase, $sysdate_char, $filename_ext);
$filename_output_2_all = FN_ETLDR_FILENAME_OUTPUT($filename_path, $filename_object, $filename_segment, $filename_variant, $filename_phase, '*', $filename_ext);

## Set new variable values
# $filename_output_3 = FN_ETLDR_FILENAME_OUTPUT($filename_path, $filename_object, $filename_segment, $filename_variant, $filename_phase, $sysdate_char, $filename_ext);
# $filename_output_3_all = FN_ETLDR_FILENAME_OUTPUT($filename_path, $filename_object, $filename_segment, $filename_variant, $filename_phase, '*', $filename_ext);

# Set new variable values
# $filename_output_4 = FN_ETLDR_FILENAME_OUTPUT($filename_path, $filename_object, $filename_segment, $filename_variant, $filename_phase, $sysdate_char, $filename_ext);
# $filename_output_4_all = FN_ETLDR_FILENAME_OUTPUT($filename_path, $filename_object, $filename_segment, $filename_variant, $filename_phase, '*', $filename_ext);

In the dataflow, of course, just use your $filename_output_[#] parameters as the filenames:

image

Duplicate Records

September 2nd, 2012

You can avoid unique key violations by not having unique keys on your tables. And cars won’t crash into guard rails if we take the guard rails down.

[Newbie sidebar: a unique key on a table or equivalent is a column or set of columns in which any given set of values must never repeat. You can have multiple such sets of columns. One such set is designated the “primary key,” and, if you have others, they’re “alternate keys.”]

It’s fine for lots of blonde females named “Alice” to live on Apple Tree Lane, and, of course, hair color + first name + street name would make a crummy unique key for a PEOPLE table. But you’d better not have three records in your PEOPLE table with the same Social Security number, say. The primary key can define what the table “is” – what each record represents – and ensures you don’t have duplicates (on whatever the table “is”). Tables without primary keys usually don’t make much sense.

Unique key violations come back from database management systems, and cause Data Services jobs to bomb. Here’s an example of what a primary key violation looks like coming back from SQL Server:

image

New BODS developers tend to be flummoxed by these, it seems, prone to throwing their hands up and removing the keys to “make the problem go away.” The problem doesn’t go away, folks – you’ve still got duplicate records. The key is the guard rail: when a DBMS throws a unique key violation and causes your job to bomb, it’s doing you a favor. When you add primary keys to tables and your jobs start failing, that’s a good thing.

Let’s go through a couple examples and explore how to deal with all this, using “pure” Data Integrator techniques. In a separate article, we’ll explore the Match transform.

image

In this simple PEOPLE table, we start without a primary key defined in our Query transform, which means the table created via the template table object won’t have a primary key. That’s how it works.

image

image

Our opening move will be to set that SSN field right there as the primary key. (You’ll notice that Mr. Pristano & Ms. Chapman share the same SSN, and so we can predict that we’re about to get a primary key violation, which, again, is a good thing. When your bad code loads bad data, errors are your friends.)

image

We’ve declared the primary key in the query. Given how BODS works, if you leave “Drop and re-create” selected in that template table, you’ll get schema-changes-on-the-fly applied to the target, and the table will be created with a primary key.

image

If we didn’t enable “drop and re-create,” the data would load just fine, as the table in the DBMS would remain unchanged, without a primary key. BODS doesn’t guard against primary key violations merely by virtue of setting primary key columns in Query transforms. Inside of BODS, you can have all kinds of duplicate junk, columns designated as primary keys or not.

If, however, we enable “drop and re-create,” then the table will, indeed, get dropped and re-created, this time with a primary key, and then, when we try to load data in, the dbms will choke on the duplicate values in the SSN column:

image

Even though the job bombed, the table did get re-created:

image

It’s indicated with a “key” symbol. (Imagine that.) Our job fails, and we can’t load data. We’re making progress.

To investigate various approaches to handling this, and get meaningful execution times, we’re going to need a bigger table.

The following dataflow writes a table of a million rows with three columns:

  1. ROW_ID, int, in which we just number the rows using gen_row_num()
  2. RAND, in which we call rand() * 100 and convert it to decimal(8,5)
  3. RAND_EXT, the same as RAND but calling rand_ext()

image

The GEN_RANDOM_NUMBERS query:

image
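Since the screenshot carries the actual mappings, here is roughly what the GEN_RANDOM_NUMBERS query contains, written out as plain expressions (the exact cast syntax is an assumption, not copied from the screenshot):

# ROW_ID   : gen_row_num()
# RAND     : cast( rand() * 100, 'decimal(8,5)')
# RAND_EXT : cast( rand_ext() * 100, 'decimal(8,5)')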

With a million rows, we’ll get lots of duplicates from rand() and some from rand_ext(), the weak and robust random number generators, respectively.

Detecting Duplicates

You’ll usually (though certainly not always) want to detect duplicates to report them out, filter them out, send them to a separate stream for advanced processing, etc. So let’s focus on detection and classification first.

The easiest way to detect duplicates is to declare the columns that are supposed to be unique as the primary key, in a Query transform, and then send them into a template table to let the dbms throw an error. You could, I suppose, catch that error and process it, using that as your “duplication detection technique.” But that’s brutal. Let’s actually code for the detection.

For approach 1, we’ll split (multiplex) the record stream, add a column in one branch, via a BODS grouped query, that counts rows by the supposed key, and then immediately join the two streams back together to “apply” the counts-by-key to the main stream.

image

Because, right now, we’re only investigating various approaches to the detection of duplicates, I’m going to throw away the records after detection with Map Operation transforms set to discard everything. Essentially all the execution time will come from the upstream operations.

The “group by” query looks like this:

image
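In text form, the “group by” query amounts to something like this (a sketch of the screenshot above; in a BODS query you would count a handy non-null column such as ROW_ID rather than use SQL’s COUNT(*)):

# GROUP BY tab   : RAND_EXT
# RAND_EXT       : RAND_EXT
# CT_BY_RAND_EXT : count( ROW_ID )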

And the joining query:

image

Downstream from the joining query, we have a CT_BY_RAND_EXT column that we can use to distinguish duplicates from non-duplicates; duplicates will, of course, have a CT_BY_RAND_EXT value greater than 1.

With a million records, this takes 32 seconds on the second run, after first collecting statistics. (Collecting statistics for the first time, it runs in 44 seconds; neither collecting nor using collected statistics, it runs in, again, 44 seconds, which means that collecting statistics itself doesn’t slow it down, whereas the use of the collected statistics improves matters. Interesting, but I’m not going to investigate the why of that right now.)

The second time I ran it, I moved the join rank on the “group by” side of the join up to 100. It’ll have somewhat fewer records. This shortened the execution time to… 31 seconds.

Just to try to juice performance, what if we filtered-out all the group-by records except those where CT_BY_RAND_EXT > 1? That would give us a much smaller set of records in that join, and the join rank might have some teeth. Let’s do it with an outer join – if we did an inner join, we’d get only the records carrying duplicated RAND_EXT values, which isn’t quite what we want.

For approach 1A, our dataflow now looks like this:

image

I set the join rank of the FILTER_ONLY_DUPLICATES input to 100 to keep things the same. It ran in 24 seconds, a 25% improvement over where we started. With a hundred million rows, that might start to matter. After the outer join query we have field CT_BY_RAND_EXT which holds either a null (which means “good” – no duplicates), or a number greater than one, i.e., a duplicate.

You may have thought that the first STRAIGHT_SELECT_ALL query doesn’t do anything, and you’re right. Let’s drop it in approach 1B:

image

This accomplishes little. We’re still at 24 seconds. The Group By query is *not* pushed-down to the dbms, though you might have expected otherwise. It seems to me that a cleverer BODS might have noticed two independent reads against the database table and, thus, formed two independent SQL queries to be able to push down the grouping. But we can force the issue…

Approach 2: two push-down queries – one “straight,” one grouped – with a join in BODS.

If BODS refuses to make two push-down-able queries itself, we can force the issue:

image

I’ve got the “Show optimized SQL” window up so you can see that we’re pushing-down two queries, one of which is pushing-down the grouping to the DBMS. This approach runs in 19 seconds.

What might be nice is to push that join down to the database. To do that, we could either set up two dataflows, staging the results of the group-by, and then joining them in a subsequent dataflow, or we could just use a Data Transfer transform, making sure to send the results of the group-by query back down into the same database:

image

We regressed to 21 seconds of execution time. What if, though, we bulk-loaded that Data Transfer transform table, disabled drop-and-recreate, and cranked-up the number of records committed at once?  You know – speed up that whole in-line staging exercise?

image

With these tweaks, the dataflow takes 20 seconds.

For our third approach, we could just cheat and use a SQL transform to, obviously, push the whole shebang down:

select
    bt.row_id
    ,bt.[rand]
    ,bt.rand_ext
    ,d.ct_by_rand_ext
from
    BIG_TBL bt left outer join (
        select rand_ext, COUNT(*) 'ct_by_rand_ext'
        from big_tbl
        group by rand_ext
    ) d
    on bt.rand_ext = d.rand_ext

It executes in 16 seconds. Not bad. The dataflow degenerates into this:

image

But this…

select
    bt.row_id
    ,bt.[rand]
    ,bt.rand_ext
    ,d.ct_by_rand_ext
from
    BIG_TBL bt left outer join (
        select rand_ext, COUNT(*) 'ct_by_rand_ext'
        from big_tbl
        group by rand_ext
        having COUNT(*) > 1
    ) d
    on bt.rand_ext = d.rand_ext

image

…runs in only 6 seconds.

In approach 3a, I modified the SQL query somewhat as follows:

select
    bt.row_id
    ,bt.[rand]
    ,bt.rand_ext
    ,cast('duplicate' as varchar(10)) as 'flag'
from
    BIG_TBL bt
where rand_ext in (
    select rand_ext
    from big_tbl
    group by rand_ext
    having COUNT(*) > 1
)

This query alone runs in only 3 seconds, but, unfortunately, it only provides the duplicates, not the non-duplicates.  To get the non-duplicates, I could form the reverse query, in a separate SQL transform:

select
    bt.row_id
    ,bt.[rand]
    ,bt.rand_ext
    ,cast('singles' as varchar(10)) as 'flag'
from
    BIG_TBL bt
where rand_ext not in (
    select rand_ext
    from big_tbl
    group by rand_ext
    having COUNT(*) > 1
)

and then merge the results:

 

image

With this, we’re at 8 seconds, which is a little slower than the approach using joins.

Ugly dogs can run fast. We boosted performance by over 80%. But it’s a deal with the devil: no lineage, no impact analysis, no “View where used” on the underlying tables; bad form all around. I never like to see SQL transforms.

Pushing-down HAVING Clauses Without SQL Transforms

It’s clear that the main performance improvement comes from pushing down the HAVING clause to the database. Can we do this in a Query transform? Not directly – though we have a GROUP BY tab in the BODS query editor, we don’t have any HAVING tab. But we can apply a sort of hack: use pushdown_sql() and a subquery containing HAVING. The dataflow for approach 4 looks like this:

image

The critical part is in the PUSH_DOWN_HAVING_DUPLICATES query:

image

BODS just takes the output of pushdown_sql() and blindly appends it to the WHERE clause of the SELECT query it pushes-down. The optimized SQL looks like this:

image

You can see the result of the pushdown_sql() function tacked-on to the end.

BODS has now pushed the join down, too. Not exactly the most elegant thing – explicit support in the Query transform for aggregated columns and HAVING clauses would be nice – but it works.

This dataflow executes… not at all. It throws an error:

image

Right. The generated SQL statement uses aliases, and it’s confused with that literal WHERE clause appendage. We need to use the same aliases BODS is using:

pushdown_sql( 'ETLDR', '"BIG_TBL".rand_ext in (select rand_ext from big_tbl group by rand_ext having count(*) > 1)')

The full optimized SQL of the whole dataflow now reads:

image

This executes in only 3 seconds (with no SQL transforms!). But can you spot the problem? We won’t get all the records – we’ll only get the duplicates. What we wanted was to do a left outer join of all the records to a sort of subquery – those RAND_EXT values having duplicates. But that’s not what we get with this approach – we get BODS pushing-down a regular join and then filtering the results down to only the duplicates.

Let’s solve this problem in approach 5, where we push-down HAVING clauses in two independent queries, one to get the duplicates, the other to get the non-duplicates:

image

This executes in only 8 seconds, a quarter of the original time, and preserves integrity by avoiding SQL transforms. The WHERE clause of the PUSH_DOWN_NOT_HAVING_DUPLICATES query just reverses the logic of the other, specifying that we only want records where we don’t have duplicates:

image
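For reference, the reversed WHERE clause presumably looks something like this (not shown verbatim in the screenshot; the 'ETLDR' datastore name carries over from the earlier example):

pushdown_sql( 'ETLDR', '"BIG_TBL".rand_ext not in (select rand_ext from big_tbl group by rand_ext having count(*) > 1)')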

So now we’ve got fast code to detect duplicates. Each record on our desired key – RAND_EXT – has an attribute flagging it as being duplicated or not.

Sometimes, though, that’s not what you want or need…

Pick a Winner: GEN_ROW_NUM_BY_GROUP()

You don’t always need to identify all of a set of duplicated records. It may be sufficient to merely pick a winner out of a set of duplicated records and send the remainder elsewhere, or drop them.

The gen_row_num_by_group() function makes short work of this:

1) Optionally, sort your records by a) the columns that constitute the supposed key, and b) anything else that would cause the “best” record within a set of duplicates to be listed first for that set. The sorting for “best” can get interesting and complex, and may require the addition of extra fields to flag conditions or calculate metrics explicitly to support it.

Example: the “best” record in a duplicated set may be the one with the most characters entered across three different attribute fields. You’d add a field, say LEN_3_ATTR, that sums the lengths of the three attribute fields. Then you’d sort your record set by the key columns and your LEN_3_ATTR column, in descending order. When you number the rows-by-group, the “best” (according to your rule) in each group would float to the top. (A sketch of such a helper column appears after these steps.)

If there’s no rule for “best” record, skip sorting.

2) Add a column containing gen_row_num_by_group() and number the records by the key columns. Let’s call this column “ROW_NUM_BY_KEY.”

3) Split-off or filter-out records where ROW_NUM_BY_KEY > 1, taking only those where ROW_NUM_BY_KEY = 1 as your winners.
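To make the “best record” helper column from step 1 concrete, a mapping might look like the following (the attribute column names ATTR_1, ATTR_2, and ATTR_3 are hypothetical):

# LEN_3_ATTR : length( nvl( ATTR_1, '')) + length( nvl( ATTR_2, '')) + length( nvl( ATTR_3, ''))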

In approach 6, we code an example of this solution.

image

In SORT, we’re sorting by RAND_EXT and ROW_ID, both in ascending order. This implies that our rule for “best” in a set of duplicates on RAND_EXT is the record with the lowest ROW_ID.

In COL_AD_ROW_NUM_GRP, we add our ROW_NUM_BY_KEY column:

image

And in FILT_WINNERS, we filter on ROW_NUM_BY_KEY = 1. Done. This dataflow executes in 11 seconds.
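In case the screenshots don’t come through, the two key pieces of approach 6, written out, are roughly as follows (sorting happens upstream in SORT, as described):

# COL_AD_ROW_NUM_GRP, mapping of the new ROW_NUM_BY_KEY column : gen_row_num_by_group( RAND_EXT )
# FILT_WINNERS, WHERE clause : ROW_NUM_BY_KEY = 1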

Make a Winner: Grouping & Aggregating

You can always condense a set of records and remove duplicates by grouping on the columns that are supposed to be the key, if it’s OK to handle all the other columns with aggregate functions – AVG, COUNT, COUNT_DISTINCT, MAX, or MIN – or replace them with literals, or even just get rid of them. Just like with grouping in a query on a database system, all the columns that aren’t in your GROUP BY clause need to be wrapped in an aggregate function.

Sometimes, this is acceptable. Last week, for instance, I used this method, working with my current client (a large manufacturing concern).  We were collecting data for material “issues” – materials (aka “parts”) being taken out of stock. What we wanted was one record per production order + material, with the quantity issued as an attribute. But it’s entirely possible for the same material to be issued multiple times for the same order, and so a “raw” read of the transactional table in question led to primary key violations. Did we want to add a transaction ID to the key? No – it didn’t matter. It was perfectly acceptable, in this business case, to summarize the issued quantities by production order + material.

This can feel quite similar to the gen_row_num_by_group() method, but isn’t at all the same. In the case of using gen_row_num_by_group() and sorting for the “best” record in a set, we don’t change incoming values – you can see “the same record” at the end as at the beginning (unless, of course, other code intervenes). When grouping to eliminate duplicates, you generally can’t see “the same record” coming out that went in.

Let’s look at a few variations on the theme. In approach 7, we group by our desired primary key, RAND_EXT. This guarantees uniqueness by the (desired) key. The only real question is what to do about whatever other columns are present.

“We don’t care about the other columns.”

Then, of course, you don’t have any problem – you just group by the desired key columns, which are the only columns in your query.

image

The GROUP BY is pushed-down to the database server, and BODS isn’t really doing anything. This runs in 6 seconds, but all we’re measuring is the database server performance.

I’m curious – how fast would BODS be at doing the grouping?

image

(The SQL transform just performs a SELECT * FROM BIG_TBL, preventing the GROUP BY from getting pushed-down.) This runs in a whopping 44 seconds.

Given that we’re not taking any columns other than those in the grouping, it would be equivalent to do a distinct selection on these columns. Would that be better?

image

BODS pushes the SELECT DISTINCT to the database server, which provides records, again, in 6 seconds. If we force BODS to do that distinct selection, though…

image

…. the dataflow runs in 27 seconds. Interesting that BODS would do distinct selection much faster than the equivalent with grouping.

“We need the other information.”

Then you need rules – business rules – for each of the other columns. You’ll often need to help business users get their head around the fact that there are multiple records per key, and that you need rules to handle each non-grouped column. In our simple case, we have multiple ROW_ID and RAND values per distinct RAND_EXT. Do we – does the business – want to…

  • Take the maximum of each? Or the minimum?
  • Take the maximum ROW_ID and the minimum RAND?
  • Take the average ROW_ID and RAND?
  • Set them to NULL (which is functionally the same as eliminating them)?
  • Something else?

If the business can come by a rule like this for each non-grouped field, then you can use grouping. Sometimes, simple aggregate functions on each non-grouped field – MAX(), MIN(), AVG(), SUM() – will suffice. But often not. Don’t underestimate how difficult this may be in non-technical and political ways – how much consternation the existence of duplicate records may cause, what negotiations and analysis may be involved in arriving at the rules for what value a single record, condensed from many, should have for each of many attributes. As the rules for “the right values” grow in complexity, you’ll quickly reach a tipping point over which you’ll want to start using the Match transform, providing power (and complexity) far exceeding all we’ve discussed.

In approach 8, we group by RAND_EXT and apply a rule to take the minimum of the RAND and ROW_ID values.

image

The GRP_WITH_AGGS query:

image
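Written out, the GRP_WITH_AGGS query in the screenshot amounts to roughly this, grouping on RAND_EXT and taking minimums per the rule above:

# GROUP BY tab : RAND_EXT
# RAND_EXT     : RAND_EXT
# RAND         : min( RAND )
# ROW_ID       : min( ROW_ID )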

This runs in 14 seconds; the whole thing is pushed-down to the database server. Running the GROUP BY and the calculation of aggregates in BODS itself runs in an unsurprising 51 seconds.

Note that we lose what the other RAND and ROW_ID values were – the ones other than the minimum-per-RAND_EXT. Grouping-and-aggregating is a destructive operation: source records “go away,” and so can source values. With a SUM or AVG function, for instance, a record may appear in the target with values not present per se in the source.

Expanding the Key

Perhaps we’re simply mistaken about what the key should be.

Business Bob: “…and in the ROUTING table, what you’ve got is one record per order number and operation number.”

Developer Dave: “OK. (Quickly runs SELECT ORDERNUM, OPERNUM, COUNT(*) 'CT' FROM ROUTING GROUP BY ORDERNUM, OPERNUM HAVING COUNT(*) > 1). Well, Bob, looks here like we’ve got lots of order and operation numbers repeated many times. I mean, order 'EAD8817', operation '70', has 213 records.”

B. Bob: “What?! Oh, wait – we have alternate operations. Sorry. A given operation number can be repeated for multiple alternate operations.”

D. Dave: (hums while he writes SELECT ORDERNUM, OPERNUM, IsNull(ALTOPERNUM,'') 'ALTOPERNUM', COUNT(*) 'CT' FROM ROUTING GROUP BY ORDERNUM, OPERNUM, IsNull(ALTOPERNUM,'') HAVING COUNT(*) > 1). “Hey, we’re getting closer, Mr. Bob. But we’ve still got duplicates – order 'EAD8817', operation '70', no alternate operation – this we see five times.”

B. Bob: “That’s impossible.”

D. Dave: “Yeah, I feel your pain. Want to see them?”

B. Bob: “Yes, show me these. I don’t believe it.”

D. Dave: (types “SELECT * FROM ROUTING WHERE ORDERNUM = 'EAD8817' AND OPERNUM = '70' AND IsNull(ALTOPERNUM,'') = ''”). “Well, here they are – you see them?” (scans the records). “Looks to me like they’re identical except for this FINDNUMBER column. See that?”

B. Bob: “They’re not supposed to be repeating those. I was told they don’t do that.”

D. Dave: “Mmm-hmm.”

B. Bob: “OK, we’ll just need to include FINDNUMBER, then.”

Thus, you’d add the FINDNUMBER column to your key.

You can only recognize a real duplicate-records problem if you really understand what the primary key really is.

Add an Artificial Key

A final approach is to add an identity-column-style artificial key. This just sweeps the problem under the rug, but it may be an appropriate tactical move. Add a “ROW_NUM” integer column, use the gen_row_num() function as its mapping, and declare it, rather than the real key, to be the primary key on the table. This doesn’t accomplish much in business terms, but for ETL, it allows you to save all the records for further processing or inspection. It should be a temporary workaround, or for support of cleansing efforts.

Summary

Although I’ve seen hundreds of tables without them, I believe almost all tables, including the ubiquitous “temporary” kill-and-fill tables that are dropped-and-recreated via BODS’s template table mechanism, should have primary keys. If you don’t know what the primary key is, then you likely don’t understand your data, and are running without some critical guard rails.

It may be that the existence of a duplicate is simply an unacceptable error condition, disqualifying entire sets of records. In that case, you need to detect the duplicates, as described above. It might be that that’s all you need to do – detect the condition, get the duplicates out of the stream, and set things up on the data quality front, with anywhere from simple reports to full-blown solutions like Information Steward, so the duplicates can be handled in the source systems.

In other cases, you might be able to immediately handle the duplicates. Perhaps you can use GEN_ROW_NUM_BY_GROUP() to pick the single best record out of a set of duplicates. Maybe you can use grouping and, if needed, aggregating to condense a set of duplicates to a single record. Perhaps you just don’t understand the primary key, and you need to add columns to it. As a temporary workaround, you can abandon the real primary key and just use a row number as a primary key.

When all this fails — for advanced de-duplication — you’ll want to investigate the Match transform, a subject for the future.

Query Naming

August 10th, 2012

“What’s in a name? That which we call a rose by any other name would smell as sweet.” As they say.

Alas, alack, a rose ain’t no query.

You often see dataflows like this:

image

Query transforms in Data Services do many things, so you see them a lot. But nobody could tell what’s going on here without opening each. This is like anti-documentation: the least possible effort given to helping people understand the code. It’s unacceptable.

Something like this would be better:

image

Query Functionality

If we’re going to name queries for what they’re doing, which seems pretty obvious, we should take a tour of Query transform functionality and get a little organized. By way of contrast, a Merge transform, with only one function – to merge – is self-documenting: if you recognize the icon, you’re all set. But with queries, you can’t tell what’s going on because they can do so much:

Schema Modification

Heaven help us all, we can modify schemas in queries. Other transforms – Merge, Case, Map Operation, Key Generation – leave schemas untouched, but with queries, you certainly can’t assume the schema in is the schema out. Questions about what happened to that source column or where this other one came from, column-level lineage and impact, and anything to do with nested structures – all that falls under “schema modification.” This has nothing to do with the content of the columns except insofar as data types come into play, nor do schema modifications have much to do with what or how many records you have.

The schema modifying query functions are:

  • Add columns
  • Rename columns
  • Drop columns
  • Change datatypes
  • Change primary key
  • Change nested structures

Content Modification

Holding the schema constant, with the same number of records in the same order, you can still raise havoc with the contents. The contents of a column that’s “straight mapped” through a query, with no change in data type, are not subject to change – that’s a sort of do-nothingness in a query. Even if you rename or change the order of columns, on the other side of the query you still have “the same data,” albeit perhaps with different column labels (which, of course, can feel like content modification, but you shouldn’t trust your feelings). Real content modifications, on the other hand, all involve at least one of the following:

  • A data type change that automatically forces a change in the column value for at least some records.
  • A formula
  • A procedure call
  • A function call

Record Set Modification

The number of records in vs. out can vary if you do any of these:

  • Join
  • Filter
  • Group
  • Select distinct

The only way to increase the number of records through a query transform is with a join, although inner joins can also easily result in fewer or the same number of records (from the perspective of one of the input tables), depending on the number of records here which don’t have matches there, and the cardinality of the relationships.

All the other query functions cause, or at least can cause, a decrease in the record count.

Mature ETL code strictly accounts for the numbers of records coming & going, eliminating mysteries and surprises. This topic deserves its own separate and exhaustive treatment, but right now we note that when we have problems keeping track of records or can’t easily explain why we have 1,255 rather than 1,240, it’s often because of query transforms. It’s good to highlight, via good naming and in other ways, when the record count might be changing in a query.

Record Order

Finally, in a query, we can sort.

Naming for Functionality

To name queries for what they’re doing, we should probably have some standard abbreviations of all this functionality we’ve just cataloged. Consistency would improve readability over time and across a team. I’ve got my own “ETL Doctor” standard set here, which has held up pretty well over time:

ETL Doctor Standard Query Naming Function Abbreviations

Functionality            Function Group            Short Code   Long Code
Add column               Schema Modification       CA           COL_AD
Drop column              Schema Modification       CD           COL_DR
Rename column            Schema Modification       CR           COL_RN
Change datatype          Schema Modification       DT           DTYPE
Change primary key       Schema Modification       PK           PK
Change nested structure  Schema Modification       NS           NRDM
Group                    Record Set Modification   GR           GRP
Select distinct          Record Set Modification   DI           DIS
Filter                   Record Set Modification   FI           FILT
Perform inner join       Record Set Modification   IJ           IJ
Perform outer join       Record Set Modification   OJ           OJ
Perform cross join       Record Set Modification   XJ           XJ
Content modification     Content Modification      CO           CONT
Sort                     Sorting                   SO           SORT

I tend to use the longer, more easily readable codes except when cramming lots of functionality into a single query, whereupon I resort to the shorter. A single query in which we add a column, drop another, sort, and filter, working, say, with vendor data drawn from “source system #1,” might be named “VEN_SRC1_CA_CD_SO_FI”.

Functionality Partitioning

As just implied, if you pile lots of functionality onto a single query, it’s harder to name clearly. You can resort to a query’s properties, or an adjacent comment box, but if you’re keen to let the names themselves tell the story, you should consider “partitioning” what you need to accomplish with your data across relatively more queries, even if you might, technically, be able to stack your manipulations onto fewer. I think this also makes the code easier to maintain.

At one extreme, you could have a hard-and-fast rule of never doing more than one “thing” per query. But if you wanted to drop a dozen columns, it would be silly to spread that out over a dozen different queries (although those queries could certainly get entirely transparent names: “DROP_COL1”, “DROP_COL2”, etc.).  Crystalline names notwithstanding, you don’t enhance code readability by having a hundred queries where ten would do.

Less rigidly, your rule could be to try to avoid doing more than one or two “basic functions” per query. You want to add some columns? Great – add a whole bunch of them in one query, and call the query “VEN_SRC1_ADD_COLS”. But then sort your records in a different query. And make content modifications in another. If you took that approach, the names themselves would pretty easily tell the story of what was going on in the dataflow, and help you easily find where some column came from (in, of course, a “COL_AD” query!), or where the records could possibly have decreased in number (in a query with one of the record count-affecting functions). This is pretty much the rule I try to follow.

Multiple Streams in Single Dataflows

Within a single dataflow, you might have multiple independent streams – chains of sequential transforms which go on for a while, functioning as mini-dataflows. If you end up with lots of these, and they get long and complicated, combining and splitting and joining and re-splitting until your dataflow looks like the Nile delta, you should probably find a way to stop the madness, but there’s nothing wrong with a few such branches.

If you’ve got these and they’re on the long side, consider coming up with a code for each, and incorporating that code in the relevant query names.

Thus, A Query Naming Standard

We seem, then, to have three basic parts to a query name under our seminal querulous naming regimen:

  1. An optional “stream” or “branch” code
  2. A non-optional indication of the function, preferably using consistent codes or abbreviations
  3. Optional supplemental description

Regarding the third part, if, for instance, you were adding the single column “VBELN,” your query name might be “COL_AD_VBELN.” If you were adding a bunch of columns related to the ultimate target table schema, you could say, “COL_AD_TARGET_TBL_SET.” If you were sorting to get ready to number or rank rows, you could say “SORT_FOR_RANKING”.
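
Put together – a hypothetical chain, just to show all three parts in sequence – the queries in a vendor dataflow might read:

VEN_SRC1_COL_AD_TARGET_TBL_SET
VEN_SRC1_CONT_DERIVE_FIELDS
VEN_SRC1_FILT_ACTIVE_ONLY
VEN_SRC1_SORT_FOR_RANKING

Each name announces the stream, the function, and a bit of supplemental description, in that order.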

“QRY_”?

Hold the presses: what’s happened to “Q” or “QRY” at the beginning of the name of each and every query so-help-me-God? Well… I, personally, don’t see the point of that sort of Hungarian notation-style transform naming in BODS, although you see “standards” documents all the time with such mandates. Amazing how people think they’ll tame chaos by acting like a martinet. In every context and user interface the Data Services product provides, you can see at a glance what types of transforms you’ve got – for Query transforms, the ubiquitous sideways hourglass icon is instantly recognizable. So some bureaucrat wants to burn the best part of the name on a redundant preface like “QRY_”?

Nah.


Happy coding.

Automatically Deleting Old Dump (.DMP) Files

August 2nd, 2012 § 0 comments § permalink

When a Data Services job crashes “hard,” it writes a .DMP file to the %LINK_DIR%\LOG folder. These big files can easily fill a disk. Tea leaf readers at SAP use them to help solve nasty problems, but, typically, they go unused and stay useless, getting deleted when it’s discovered how much space they’re using (often when the DS job server stops running jobs for lack of disk space).

On a Windows job server, to automatically delete dump files older than, say, three days, create a batch or command file with the contents below, and schedule it to run periodically using any task scheduler:

REM Batch file to delete all .DMP files older than 3 days
REM in the SAP BusinessObjects Data Services log folder, %LINK_DIR%\log.
REM To change the retention period, change the value of the parameter
REM following the "/D" argument; "-3" = "older than 3 days".
REM ETL Doctor, etldoctor.com, Jeff Prenevost, jeff@etldoctor.com, 2012-08-02

forfiles /P "%LINK_DIR%\log" /M *.DMP /D -3 /C "cmd /c del /q @path"

RIGHT() String Function

August 1st, 2012 § 0 comments § permalink

SAP BusinessObjects Data Services doesn’t have left() or right() functions, to return the left- or right-most X characters from a string. It’s easy to understand why left() was left out (ar, ar) – it’s just a variant of substr() where we set the starting position to “1” – but a right() function isn’t quite as trivial. It’s not hard to do the equivalent of a right() function with substr(), so let’s make it not-hard on ourselves:
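
For the record, the left() half really is that trivial – a hypothetical one-liner using the same $P_str_in and $P_num_chars parameters the function below declares:

# A left() stand-in: substr() anchored at position 1.
Return substr( $P_str_in, 1, $P_num_chars );

The right() version has to count backwards from the end of the string instead, which is what follows.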

ETL Doctor Custom Function

FN_ETLDR_RIGHT()

Parameters

$P_str_in varchar(8000) input
$P_num_chars int input
Return varchar(8000) output

# ETL Doctor
# etldoctor.com
# Jeff Prenevost
# jeff@etldoctor.com
# 2012-08-01

# Free software issued under the Gnu General Public License
# FN_ETLDR_RIGHT()
# Returns the right-most $P_num_chars of input string $P_str_in.

Return substr( $P_str_in, length( $P_str_in ) - ($P_num_chars - 1), $P_num_chars );
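
A quick hypothetical usage, just to show the behavior (the literal is made up, and $EXT is assumed to be a declared varchar variable):

# Grab the 3-character extension off a file name; yields 'txt'.
$EXT = FN_ETLDR_RIGHT( 'SOMEFILE.txt', 3 );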

REVERSE() String Function

August 1st, 2012 § 0 comments § permalink

You know how when you look in a mirror, text is reversed?

Sometimes you need to do that to strings. Like, for instance, to get whatever characters follow the last of an unpredictable number of pipe symbols. That particular puzzle would require a reverse(), index(), substr(), and another reverse(). But it’s pretty hard without a reverse.

Here’s a simple function for SAP BusinessObjects Data Services 4 to reverse a string.

ETL Doctor Custom Function

FN_ETLDR_REVERSE()

Parameters

$P_str_in varchar(8000) input
Return varchar(8000) output

Local variables

$L_return varchar(8000)
$L_i int
$L_str_length int

# ETL Doctor
# etldoctor.com
# Jeff Prenevost
# 2012-08-01

# Free software issued under the Gnu General Public License
# FN_ETLDR_REVERSE()
# Returns the reverse of a string, varchar(8000)
# Parameters:
# $P_str_in varchar(8000)

$L_str_length = length( $P_str_in );
$L_return = '';
$L_i = 0;

# Walk the input from the front, prepending each character to the result.
while ($L_i < $L_str_length ) begin
$L_i = $L_i + 1;
$L_return = substr( $P_str_in , $L_i, 1) || $L_return;
end

Return $L_return ;
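
And a hypothetical sketch of the pipe puzzle mentioned above – pulling out whatever follows the last of an unpredictable number of pipes. The literal and the $L_ variables are made up and assumed to be declared:

# 'a|b|c|LAST_PIECE' reversed is 'ECEIP_TSAL|c|b|a'.
$L_rev = FN_ETLDR_REVERSE( 'a|b|c|LAST_PIECE' );

# In the reversed string, the first '|' marks where the last piece ends.
$L_pos = index( $L_rev, '|', 1 );

# Everything before that '|', flipped back: 'LAST_PIECE'.
$L_last_piece = FN_ETLDR_REVERSE( substr( $L_rev, 1, $L_pos - 1 ) );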

Replacing All Control Characters En Masse

August 1st, 2012 § 0 comments § permalink

Joe, an ABAP coder I’m working with, strolls over and says the lines in the text file I’d given him were wrapping funny, that the data seemed to be misaligned to the columns in some records.

We’d originally wrapped each field in text qualifiers, because a “memo” field in the data carried years of miscellaneous free-text gobbledygook (tabs, line feeds, carriage returns, etc.) that would too easily collide with standard field and row delimiters like tabs and carriage returns. But Joe didn’t like the pipe-delimited, caret-wrapped fields in the file I sent him. And then… he didn’t like the file without the text qualifiers, either. No problem: we’ll just strip out the junk for him.

Nothing complicated here – we just use replace_substr_ext(), looking for whatever control characters we think we need to nuke:

replace_substr_ext( INPUTFIELD, '/r', ' ', NULL, NULL )

The above would, for instance, replace all the carriage returns with a single space (“/r” = “carriage return”, ASCII decimal 13).

If you’ve got three different characters, though, it gets ugly:

$crapless = replace_substr_ext( replace_substr_ext( replace_substr_ext( $crapful, '/r', ' ', NULL, NULL ), '/n', ' ', NULL, NULL ), '/t', '', NULL, NULL )

(You see nesting like this all the time with ifthenelse(), too, where one would hope to see decode().) Ugly code that works is still bad code. Three levels of nesting are distasteful enough – what if we wanted to search for any of a dozen characters? Or, say, all the control characters intrinsically supported by the replace_substr_ext() function?

Hey… that would be nice: a simple function to replace all of the “base set” of control characters. It’s basically what I need for Joe. I can imagine other functions that would nuke any character not in an approved set, or any character in an externally-maintained set, but let’s do this simple one for now.

ETL Doctor Custom Function

FN_ETLDR_REPLACE_SUBSTR_EXT_ALL_SPECIAL()

Parameters

$P_in_str varchar(8000) input
$P_replace_str varchar(1) input
Return varchar(8000) output

Local variables

$L_new_string varchar(8000)

# FN_ETLDR_REPLACE_SUBSTR_EXT_ALL_SPECIAL varchar(8000)
# Replaces *all* the control characters listed below
# in $P_in_str with $P_replace_str. The function is just a wrapper
# for repeated calls to replace_substr_ext(). Could, of course, expand
# the list of characters-to-be-replaced.
# Replaces all these:
# /a Bell (alert)
# /b Backspace
# /f Formfeed
# /n New line
# /r Carriage return
# /t Horizontal tab
# /v Vertical tab

# Replace /a Bell (alert)
$L_new_string = replace_substr_ext( $P_in_str , '/a', $P_replace_str , NULL, NULL );

# Replace /b Backspace
$L_new_string = replace_substr_ext( $L_new_string , '/b', $P_replace_str , NULL, NULL );

# Replace /f Formfeed
$L_new_string = replace_substr_ext( $L_new_string , '/f', $P_replace_str , NULL, NULL );

# Replace /n New line
$L_new_string = replace_substr_ext( $L_new_string , '/n', $P_replace_str , NULL, NULL );

# Replace /r Carriage return
$L_new_string = replace_substr_ext( $L_new_string , '/r', $P_replace_str , NULL, NULL );

# Replace /t Horizontal tab
$L_new_string = replace_substr_ext( $L_new_string , '/t', $P_replace_str , NULL, NULL );

# Replace /v Vertical tab
$L_new_string = replace_substr_ext( $L_new_string , '/v', $P_replace_str , NULL, NULL );

Return $L_new_string ;
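
A hypothetical call for Joe’s memo field, then – the column name is made up – mapped in a query or assigned in a script:

# Replace every control character in the list above with a single space.
FN_ETLDR_REPLACE_SUBSTR_EXT_ALL_SPECIAL( MEMO_TEXT, ' ' )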

Dealing with flat files after they’ve been generated

March 27th, 2012 § 0 comments § permalink

This particular BODS job I’ve inherited generates a set of flat files across three folders. It’s a bit of a pain to manually collect all of them into a .ZIP file to distribute. I start thinking about how to handle flat files after they’ve been generated. It’s pretty common to want to take a set of flat files, compress them into a .ZIP file, and email them.

Let’s work it out.

(To be continued.)