You can avoid unique key violations by not having unique keys on your tables. And cars won’t crash into guard rails if we take the guard rails down.
[Newbie sidebar: a unique key on a table or equivalent is a column or set of columns in which any given set of values must never repeat. You can have multiple such sets of columns. One such set is designated the “primary key,” and, if you have others, they’re “alternate keys.”]
It’s fine for lots of blonde females named “Alice” to live on Apple Tree Lane, and, of course, hair color + first name + street name would make a crummy unique key for a PEOPLE table. But you’d better not have three records in your PEOPLE table with the same Social Security number, say. The primary key can define what the table “is” – what each record represents – and ensures you don’t have duplicates (on whatever the table “is”). Tables without primary keys usually don’t make much sense.
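In SQL terms, that might look something like the sketch below (table and column names here are just for illustration; the badge number stands in for the “alternate key” idea from the sidebar):

CREATE TABLE PEOPLE (
    SSN          char(9)     NOT NULL PRIMARY KEY,  -- the primary key: no two rows may share an SSN
    BADGE_NUMBER int         NOT NULL UNIQUE,       -- an alternate key: must also be unique
    FIRST_NAME   varchar(50) NULL,
    HAIR_COLOR   varchar(20) NULL,
    STREET_NAME  varchar(60) NULL                   -- these can repeat freely
);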
Unique key violations come back from database management systems, and cause Data Services jobs to bomb. Here’s an example of what a primary key violation looks like coming back from SQL Server:
New BODS developers tend to be flummoxed by these, it seems, prone to throwing their hands up and removing the keys to “make the problem go away.” The problem doesn’t go away, folks – you’ve still got duplicate records. The key is the guard rail: when a DBMS throws a unique key violation and causes your job to bomb, it’s doing you a favor. When you add primary keys to tables and your jobs start failing, that’s a good thing.
Let’s go through a couple examples and explore how to deal with all this, using “pure” Data Integrator techniques. In a separate article, we’ll explore the Match transform.
In this simple PEOPLE table, we start without a primary key defined in our Query transform, which means the table created via the template table object won’t have a primary key. That’s how it works.
Our opening move will be to set that SSN field right there as the primary key. (You’ll notice that Mr. Pristano & Ms. Chapman share the same SSN, so we can predict that we’re about to get a primary key violation, which, again, is a good thing. When your bad code loads bad data, errors are your friends.)
We’ve declared the primary key in the query. Given how BODS works, if you leave “Drop and re-create” selected in that template table, you’ll get schema-changes-on-the-fly applied to the target, and the table will be created with a primary key.
If we didn’t enable “drop and re-create,” the data would load just fine, as the table in the DBMS would remain unchanged, without a primary key. BODS doesn’t guard against primary key violations merely by virtue of setting primary key columns in Query transforms. Inside of BODS, you can have all kinds of duplicate junk, columns designated as primary keys or not.
If, however, we enable “drop and re-create,” then the table will, indeed, get dropped and re-created, this time with a primary key, and then, when we try to load data in, the DBMS will choke on the duplicate values in the SSN column:
Even though the job bombed, the table did get re-created:
It’s indicated with a “key” symbol. (Imagine that.) Our job fails, and we can’t load data. We’re making progress.
To investigate various approaches to handling this, and get meaningful execution times, we’re going to need a bigger table.
The following dataflow writes a table of a million rows with three columns:
- ROW_ID, int, in which we just number the rows using gen_row_num()
- RAND, in which we call rand() * 100 and convert it to decimal(8,5)
- RAND_EXT, the same as RAND but calling rand_ext()
The GEN_RANDOM_NUMBERS query:
With a million rows, we’ll get lots of duplicates from rand() and some from rand_ext(), the weak and robust random number generators, respectively.
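If you’d rather build a similar test table directly in SQL Server instead of via the dataflow, a rough T-SQL sketch along these lines would do. (It won’t mimic the weak-vs.-robust distinction between rand() and rand_ext(); CHECKSUM(NEWID()) just stands in for a per-row random value, and the cross join is only there to manufacture a million rows.)

-- One million rows: a row number and two "random" decimal(8,5) columns in the 0-100 range.
;WITH n AS (
    SELECT TOP (1000000)
           ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS ROW_ID
    FROM sys.all_objects a CROSS JOIN sys.all_objects b
)
SELECT ROW_ID,
       CAST(ABS(CHECKSUM(NEWID())) % 10000000 / 100000.0 AS decimal(8,5)) AS [RAND],
       CAST(ABS(CHECKSUM(NEWID())) % 10000000 / 100000.0 AS decimal(8,5)) AS RAND_EXT
INTO BIG_TBL
FROM n;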
You’ll usually (though certainly not always) want to detect duplicates to report them out, filter them out, send them to a separate stream for advanced processing, etc. So let’s focus on detection and classification first.
The easiest way to detect duplicates is to declare the columns that are supposed to be unique as the primary key, in a Query transform, and then send them into a template table to let the DBMS throw an error. You could, I suppose, catch that error and process it, using that as your “duplication detection technique.” But that’s brutal. Let’s actually code for the detection.
For approach 1, we’ll split or multiplex the record stream, with a BODS grouped query in the one branch in which we add a column that counts by the supposed key, and then immediately join the two streams to “apply” the counts-by-key to the main stream.
Because, right now, we’re only investigating various approaches to the detection of duplicates, I’m going to throw away the records after detection with Map Operation transforms set to discard everything. Essentially all the execution time will come from the upstream operations.
The “group by” query looks like this:
And the joining query:
Downstream from the joining query, we have a CT_BY_RAND_EXT column that we can use to distinguish duplicates from non-duplicates; duplicates will, of course, have a CT_BY_RAND_EXT value greater than 1.
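If it helps to see approach 1’s logic in plain SQL, here’s an illustration of what the grouped query and the join accomplish. (This is only an illustration – in approach 1 the work happens inside BODS, not pushed down to the database.)

-- Count rows per RAND_EXT in one branch, then apply that count back to every row.
SELECT bt.ROW_ID, bt.[RAND], bt.RAND_EXT, g.CT_BY_RAND_EXT
FROM BIG_TBL bt
     INNER JOIN (SELECT RAND_EXT, COUNT(*) AS CT_BY_RAND_EXT
                 FROM BIG_TBL
                 GROUP BY RAND_EXT) g
          ON bt.RAND_EXT = g.RAND_EXT;
-- CT_BY_RAND_EXT = 1 means the value is unique; > 1 flags a duplicate.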
With a million records, this takes 32 seconds on the second run, after first collecting statistics. (Collecting statistics for the first time, it runs in 44 seconds; neither collecting nor using collected statistics, it again runs in 44 seconds. So collecting statistics doesn’t itself slow the job down, whereas using the collected statistics improves matters. Interesting, but I’m not going to investigate the why of that right now.)
For the next run, I moved the join rank on the “group by” side of the join up to 100 – that side carries somewhat fewer records. This shortened the execution time to… 31 seconds.
Just to try to juice performance, what if we filtered-out all the group-by records except those where CT_BY_RAND_EXT > 1? That would give us a much smaller set of records in that join, and the join rank might have some teeth. Let’s do it with an outer join – if we did an inner join, we’d get only the records carrying duplicated RAND_EXT values, which isn’t quite what we want.
For approach 1A, our dataflow now looks like this:
I set the join rank of the FILTER_ONLY_DUPLICATES input to 100 to keep things the same. It ran in 24 seconds, a 25% improvement over where we started. With a hundred million rows, that might start to matter. After the outer join query, we have the CT_BY_RAND_EXT field, which holds either a null (which means “good” – no duplicate) or a number greater than one, i.e., a duplicate.
You may have thought that the first STRAIGHT_SELECT_ALL query doesn’t do anything, and you’re right. Let’s drop it in approach 1B:
This accomplishes little. We’re still at 24 seconds. The Group By query is *not* pushed-down to the dbms, though you might have expected otherwise. It seems to me that a cleverer BODS might have noticed two independent reads against the database table and, thus, formed two independent SQL queries to be able to push down the grouping. But we can force the issue…
Approach 2: two push-down queries – one “straight,” one grouped – with a join in BODS.
If BODS refuses to make two push-down-able queries itself, we can force the issue:
I’ve got the “Show optimized SQL” window up so you can see that we’re pushing-down two queries, one of which is pushing-down the grouping to the DBMS. This approach runs in 19 seconds.
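The two pushed-down statements would read something like this (a sketch; the exact quoting and aliasing in the optimized SQL is BODS’s own):

-- Reader 1: the "straight" read
SELECT ROW_ID, [RAND], RAND_EXT FROM BIG_TBL;

-- Reader 2: the grouped read, with the COUNT pushed down to the database
SELECT RAND_EXT, COUNT(*) AS CT_BY_RAND_EXT FROM BIG_TBL GROUP BY RAND_EXT;

-- BODS then performs the join between the two result sets itself, in the engine.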
What might be nice is to push-down that join to the database. To do that, we could either set up two dataflows, staging the results of the group-by and then joining them in a subsequent dataflow, or we could just use a Data Transfer transform, making sure to send the results of the group-by query back down into the same database:
We regressed to 21 seconds of execution time. What if, though, we bulk-loaded that Data Transfer transform table, disabled drop-and-recreate, and cranked-up the number of records committed at once? You know – speed up that whole in-line staging exercise?
With these tweaks, the dataflow takes 20 seconds.
For our third approach, we could just cheat and use a SQL transform to, obviously, push the whole shebang down:
select bt.*, d.ct_by_rand_ext
from BIG_TBL bt left outer join (
    select rand_ext, COUNT(*) 'ct_by_rand_ext'
    from BIG_TBL
    group by rand_ext
) d
on bt.rand_ext = d.rand_ext
This executes in 16 seconds. Not bad. The dataflow degenerates into this:
Restricting the subquery with a HAVING clause, so the database only hands back the duplicated RAND_EXT values…

select bt.*, d.ct_by_rand_ext
from BIG_TBL bt left outer join (
    select rand_ext, COUNT(*) 'ct_by_rand_ext'
    from BIG_TBL
    group by rand_ext
    having COUNT(*) > 1
) d
on bt.rand_ext = d.rand_ext
…runs in only 6 seconds.
In approach 3a, I modified the SQL query somewhat as follows:
select *, cast('duplicate' as varchar(10)) as 'flag'
from BIG_TBL
where rand_ext in (
    select rand_ext from BIG_TBL
    group by rand_ext
    having COUNT(*) > 1
)
This query alone runs in only 3 seconds, but, unfortunately, it only provides the duplicates, not the non-duplicates. To get the non-duplicates, I could form the reverse query, in a separate SQL transform:
select *, cast('singles' as varchar(10)) as 'flag'
from BIG_TBL
where rand_ext not in (
    select rand_ext from BIG_TBL
    group by rand_ext
    having COUNT(*) > 1
)
and then merge the results:
With this, we’re at 8 seconds, which is a little slower than the approach using joins.
Ugly dogs can run fast. We boosted performance by over 80%. But it’s a deal with the devil: no lineage, no impact analysis, no “View where used” on the underlying tables – bad form all around. I never like to see SQL transforms.
Pushing-down HAVING Clauses Without SQL Transforms
It’s clear that the main performance improvement comes from pushing down the HAVING clause to the database. Can we do this in a Query transform? Not directly – though we have a GROUP BY tab in the BODS query editor, we don’t have any HAVING tab. But we can apply a sort of hack: use pushdown_sql() and a subquery containing HAVING. The dataflow for approach 4 looks like this:
The critical part is in the PUSH_DOWN_HAVING_DUPLICATES query:
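The WHERE clause of that query maps a pushdown_sql() call – on the first attempt, presumably something along these lines, with the rand_ext column left unqualified (the ‘ETLDR’ datastore name is the one used in this example):

pushdown_sql( 'ETLDR', 'rand_ext in (select rand_ext from big_tbl group by rand_ext having count(*) > 1)')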
BODS just takes the output of pushdown_sql() and blindly appends it to the WHERE clause of the SELECT query it pushes-down. The optimized SQL looks like this:
You can see the result of the pushdown_sql() function tacked-on to the end.
BODS has now pushed the join down, too. Not exactly the most elegant thing – explicit support in the Query transform for aggregated columns and HAVING clauses would be nice – but it works.
This dataflow executes… not at all. It throws an error:
Right. The generated SQL statement uses table aliases, and our literal WHERE-clause appendage doesn’t match them. We need to use the same aliases BODS is using:
pushdown_sql( 'ETLDR', '"BIG_TBL".rand_ext in (select rand_ext from big_tbl group by rand_ext having count(*) > 1)')
The full optimized SQL of the whole dataflow now reads:
This executes in only 3 seconds (with no SQL transforms!). But can you spot the problem? We won’t get all the records – we’ll only get the duplicates. What we wanted was to do a left outer join of all the records to a sort of subquery – those RAND_EXT values having duplicates. But that’s not what we get with this approach – we get BODS pushing-down a regular join and then filtering the results down to only the duplicates.
Let’s solve this problem in approach 5, where we push-down HAVING clauses in two independent queries, one to get the duplicates, the other to get the non-duplicates:
This executes in only 8 seconds, a quarter of the original time, and it preserves lineage and impact analysis by avoiding SQL transforms. The WHERE clause of the PUSH_DOWN_NOT_HAVING_DUPLICATES query just reverses the logic of the other, specifying that we only want records where we don’t have duplicates:
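It would read something along these lines – the same subquery, with “in” flipped to “not in”:

pushdown_sql( 'ETLDR', '"BIG_TBL".rand_ext not in (select rand_ext from big_tbl group by rand_ext having count(*) > 1)')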
So now we’ve got fast code to detect duplicates. Each record on our desired key – RAND_EXT – has an attribute flagging it as being duplicated or not.
Sometimes, though, that’s not what you want or need…
Pick a Winner: GEN_ROW_NUM_BY_GROUP()
You don’t always need to identify all of a set of duplicated records. It may be sufficient to merely pick a winner out of a set of duplicated records and send the remainder elsewhere, or drop them.
The gen_row_num_by_group() function makes short work of this:
1) Optionally, sort your records by a) the columns that constitute the supposed key, and b) anything else that would cause the “best” record within a set of duplicates to be listed first for that set. The sorting for “best” can get interesting and complex, and may require adding extra fields to flag conditions or calculate metrics explicitly to support it.
Example: the “best” record in a duplicated set may be the one with the most characters entered across three different attribute fields. You’d add a field – say, LEN_3_ATTR – that sums the lengths of the three attribute fields. Then you’d sort your record set by the key columns and your LEN_3_ATTR column, in descending order. When you number the rows by group, the “best” record (according to your rule) in each group floats to the top.
If there’s no rule for “best” record, skip sorting.
2) Add a column containing gen_row_num_by_group() and number the records by the key columns. Let’s call this column “ROW_NUM_BY_KEY.”
3) Split-off or filter-out records where ROW_NUM_BY_KEY > 1, taking only those where ROW_NUM_BY_KEY = 1 as your winners.
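For readers who think in SQL first, here’s the same pick-a-winner pattern expressed with a window function – an analogue only, not what BODS does internally (and here “best” is simply the lowest ROW_ID, matching the example that follows):

-- Number the rows within each RAND_EXT group, lowest ROW_ID first, then keep only the winners.
SELECT ROW_ID, [RAND], RAND_EXT
FROM (
    SELECT ROW_ID, [RAND], RAND_EXT,
           ROW_NUMBER() OVER (PARTITION BY RAND_EXT ORDER BY ROW_ID) AS ROW_NUM_BY_KEY
    FROM BIG_TBL
) x
WHERE ROW_NUM_BY_KEY = 1;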
In approach 6, we code an example of this solution.
In SORT, we’re sorting by RAND_EXT and ROW_ID, both in ascending order. This implies that our rule for “best” in a set of duplicates on RAND_EXT is the record with the lowest ROW_ID.
In COL_AD_ROW_NUM_GRP, we add our ROW_NUM_BY_KEY column:
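The mapping for the new column would be along these lines (SORT being the upstream query from which we take RAND_EXT):

ROW_NUM_BY_KEY:  gen_row_num_by_group(SORT.RAND_EXT)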
And in FILT_WINNERS, we filter on ROW_NUM_BY_KEY = 1. Done. This dataflow executes in 11 seconds.
Make a Winner: Grouping & Aggregating
You can always condense a set of records and remove duplicates by grouping on the columns that are supposed to be the key, if it’s OK to handle all the other columns with aggregate functions – AVG, COUNT, COUNT_DISTINCT, MAX, or MIN – or replace them with literals, or even just get rid of them. Just like with grouping in a query on a database system, all the columns that aren’t in your GROUP BY clause need to be wrapped in an aggregate function.
Sometimes, this is acceptable. Last week, for instance, I used this method, working with my current client (a large manufacturing concern). We were collecting data for material “issues” – materials (aka “parts”) being taken out of stock. What we wanted was one record per production order + material, with the quantity issued as an attribute. But it’s entirely possible for the same material to be issued multiple times for the same order, and so a “raw” read of the transactional table in question led to primary key violations. Did we want to add a transaction ID to the key? No – it didn’t matter. It was perfectly acceptable, in this business case, to summarize the issued quantities by production order + material.
This can feel quite similar to the gen_row_num_by_group() method, but isn’t at all the same. In the case of using gen_row_num_by_group() and sorting for the “best” record in a set, we don’t change incoming values – you can see “the same record” at the end as at the beginning (unless, of course, other code intervenes). When grouping to eliminate duplicates, you generally don’t see the original records at all.
Let’s look at a few variations on the theme. In approach 7, we group by our desired primary key, RAND_EXT. This guarantees uniqueness by the (desired) key. The only real question is what to do about whatever other columns are present.
“We don’t care about the other columns.”
Then, of course, you don’t have any problem – you just group by the desired key columns, which are the only columns in your query.
The GROUP BY is pushed-down to the database server, and BODS isn’t really doing anything. This runs in 6 seconds, but all we’re measuring is the database server performance.
I’m curious – how fast would BODS be at doing the grouping?
(The SQL transform just performs a SELECT * FROM BIG_TBL, preventing the GROUP BY from getting pushed-down.) This runs in a whopping 44 seconds.
Given that we’re not taking any columns other than those in the grouping, it would be equivalent to do a distinct selection on these columns. Would that be better?
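The two forms return the same thing – roughly this, in pushed-down terms (a sketch; the actual optimized SQL quotes and qualifies the names):

-- Grouping on just the key column...
SELECT RAND_EXT FROM BIG_TBL GROUP BY RAND_EXT;

-- ...returns the same rows as a distinct selection on it.
SELECT DISTINCT RAND_EXT FROM BIG_TBL;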
BODS pushes the SELECT DISTINCT to the database server, which provides records, again, in 6 seconds. If we force BODS to do that distinct selection, though…
… the dataflow runs in 27 seconds. Interesting that BODS does distinct selection much faster than the equivalent grouping.
“We need the other information.”
Then you need rules – business rules – for each of the other columns. You’ll often need to help business users get their head around the fact that there are multiple records per key, and that you need rules to handle each non-grouped column. In our simple case, we have multiple ROW_ID and RAND values per distinct RAND_EXT. Do we – does the business – want to…
- Take the maximum of each? Or the minimum?
- Take the maximum ROW_ID and the minimum RAND?
- Take the average ROW_ID and RAND?
- Set them to NULL (which is functionally the same as eliminating them)?
- Something else?
If the business can come up with a rule like this for each non-grouped field, then you can use grouping. Sometimes, simple aggregate functions on each non-grouped field – MAX(), MIN(), AVG(), SUM() – will suffice. But often not. Don’t underestimate how difficult this may be in non-technical and political ways – how much consternation the existence of duplicate records may cause, and what negotiation and analysis may be involved in arriving at the rules for what value a single record, condensed from many, should carry for each of many attributes. As the rules for “the right values” grow in complexity, you’ll quickly reach a tipping point beyond which you’ll want to start using the Match transform, which provides power (and complexity) far exceeding everything we’ve discussed.
In approach 8, we group by RAND_EXT and apply a rule to take the minimum of the RAND and ROW_ID values.
The GRP_WITH_AGGS query:
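In SQL terms, the pushed-down statement amounts to something like this (a sketch of the equivalent SQL, not the literal optimized SQL BODS emits):

SELECT RAND_EXT,
       MIN([RAND]) AS [RAND],
       MIN(ROW_ID) AS ROW_ID
FROM BIG_TBL
GROUP BY RAND_EXT;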
This runs in 14 seconds; the whole thing is pushed-down to the database server. Doing the GROUP BY and the aggregate calculations in BODS itself takes an unsurprising 51 seconds.
Note that we lose what the other RAND and ROW_ID values were – the ones other than the minimum-per-RAND_EXT. Grouping-and-aggregating is a destructive operation: source records “go away,” and so do source values. With a SUM or AVG function, for instance, a record may appear in the target with values not present per se in the source.
Expanding the Key
Perhaps we’re simply mistaken about what the key should be.
Business Bob: “…and in the ROUTING table, what you’ve got is one record per order number and operation number.”
Developer Dave: “OK. (Quickly runs SELECT ORDERNUM, OPERNUM, COUNT(*) 'CT' FROM ROUTING GROUP BY ORDERNUM, OPERNUM HAVING COUNT(*) > 1). Well, Bob, looks here like we’ve got lots of order and operation numbers repeated many times. I mean, order 'EAD8817', operation '70', has 213 records.”
B. Bob: “What?! Oh, wait – we have alternate operations. Sorry. A given operation number can be repeated for multiple alternate operations.”
D. Dave: (hums while he writes SELECT ORDERNUM, OPERNUM, IsNull(ALTOPERNUM,'') 'ALTOPERNUM', COUNT(*) 'CT' FROM ROUTING GROUP BY ORDERNUM, OPERNUM, IsNull(ALTOPERNUM,'') HAVING COUNT(*) > 1). “Hey, we’re getting closer, Mr. Bob. But we’ve still got duplicates – order 'EAD8817', operation '70', no alternate operation – this we see five times.”
B. Bob: “That’s impossible.”
D. Dave: “Yeah, I feel your pain. Want to see them?”
B. Bob: “Yes, show me these. I don’t believe it.”
D. Dave: (types “SELECT * FROM ROUTING WHERE ORDERNUM = 'EAD8817' AND OPERNUM = '70' AND IsNull(ALTOPERNUM,'') = ''”). “Well, here they are – you see them?” (scans the records). “Looks to me like they’re identical except for this FINDNUMBER column. See that?”
B. Bob: “They’re not supposed to be repeating those. I was told they don’t do that.”
D. Dave: “Mmm-hmm.”
B. Bob: “OK, we’ll just need to include FINDNUMBER, then.”
Thus, you’d add the FINDNUMBER column to your key.
You can only tell whether you have a real duplicate-records problem if you really understand what the primary key really is.
Add an Artificial Key
A final approach is to add an identity-column-style artificial key. This just sweeps the problem under the rug, but it may be an appropriate tactical move. Add a “ROW_NUM” integer column, use the gen_row_num() function as its mapping, and declare it, rather than the real key, to be the primary key on the table. This doesn’t accomplish much in business terms, but for ETL purposes it lets you keep all the records for further processing or inspection. It should remain a temporary workaround, or a support for cleansing efforts.
Although I’ve seen hundreds of tables without them, I believe almost all tables, including the ubiquitous “temporary” kill-and-fill tables that are dropped-and-recreated via BODS’s template table mechanism, should have primary keys. If you don’t know what the primary key is, then you likely don’t understand your data, and are running without some critical guard rails.
It may be that the existence of a duplicate is simply an unacceptable error condition, disqualifying entire sets of records. In that case, you need to detect the duplicates, as described above. It might be that that’s all you need to do – detect the condition, get the duplicates out of the stream, and set things up on the data quality front, with anywhere from simple reports to full-blown solutions like Information Steward, so the duplicates can be handled in the source systems.
In other cases, you might be able to handle the duplicates immediately. Perhaps you can use GEN_ROW_NUM_BY_GROUP() to pick the single best record out of a set of duplicates. Maybe you can use grouping and, if needed, aggregating to condense a set of duplicates to a single record. Perhaps you just don’t understand the primary key, and you need to add columns to it. As a temporary workaround, you can abandon the real primary key and just use a row number as a primary key.
When all this fails — for advanced de-duplication — you’ll want to investigate the Match transform, a subject for the future.