Delete Absent Rows/Update Existing Rows/Insert New Rows

A farily common scenario:  You maintain a table of data, we will say pricing data for a hosted online shopping cart.  You offer the ability for sellers to upload pricing information (a CSV/XML file) on a daily basis.  You want to allow the seller to 1) Remove prices for items not present in the file 2) Update prices for items already in the data table 3) Add new records that are not already in the pricing table.

Pretty straight forward but I have seen some poor solutions like:

  • Reading each line of the file using the programming language of choice and issuing 2 SQL commands per line (exists check + delete/update/insert) (very very bad)
  • Loading the entire file into a temp table and using a cursor to issue 2 SQL commands per line (very bad)

I think the problem stems from the fact that most developers don't know you can issue DELETE, INSERT, and UPDATE statements that include joins and exists checks.

For starters, I like to have a staging table with no constraints and usually all nvarchar columns (even for numeric/int columns).  This allows me to pound the data file into the staging table using the BULK INSERT capability of SQL Server.  There's no faster way to move a lot of data into the db.

Next you can run a series of 'validation' queries to ensure that required fields are present and to validate data types.  I will usually copy invalid rows to an 'error' table, delete them from the staging table, and then pull them from the error table for a logging purposes.

Use the scripts below to setup our little demonstration on your SQL Server:

SET NOCOUNT ON
GO
CREATE TABLE Item
(
  Id BIGINT IDENTITY PRIMARY KEY,
  Number NVARCHAR(256) NOT NULL,
  Price NUMERIC(9,2) NOT NULL,
  Description NVARCHAR(256) NOT NULL
)
GO
 
CREATE TABLE Item_Staging
(
  Number NVARCHAR(256),
  Price NVARCHAR(256),
  Description NVARCHAR(256)
)
GO
 
SET IDENTITY_INSERT Item ON
GO
INSERT INTO Item(Id, Number, Price, Description)
VALUES( 1, '00001', 1.00, 'Part One' )
INSERT INTO Item(Id, Number, Price, Description)
VALUES( 2, '00002', 2.00, 'Part Two' )
INSERT INTO Item(Id, Number, Price, Description)
VALUES( 3, '00003', 3.00, 'Part Three' )
GO
SET IDENTITY_INSERT Item OFF
GO
 
INSERT INTO Item_Staging(Number, Price, Description)
VALUES( '00001', 1.10, 'Part One' )
INSERT INTO Item_Staging(Number, Price, Description)
VALUES( '00002', 2.20, 'Part Two' )
INSERT INTO Item_Staging(Number, Price, Description)
VALUES( '00004', 4.40, 'Part Four' )
GO
SET NOCOUNT OFF
GO

This will create two tables, Item (our production table) and Item_Staging (...you get the point).  It will also add some test data.  For our purposes, assume we have already ran our validation routines on Item_Staging to remove any bad data. 

Now we are ready to delete records from the Item table that are not present in the Item_Staging table:

DELETE Item
FROM Item LEFT JOIN Item_Staging ON Item.Number = Item_Staging.Number
WHERE Item_Staging.Number IS NULL

Pretty simple?  We use the Number column to join the two tables and delete records based on the abscense of a Number in Item_Staging.

Next, let's update prices for all products in the Item table that also appear in Item_Staging (you could update any/all fields, we are just updating price for simplicity here)

UPDATE Item SET
  Price=Item_Staging.Price
FROM Item INNER JOIN Item_Staging ON Item.Number = Item_Staging.Number

 This time we can use an INNER JOIN to connect all matching records.

Finally, we add any records in Item_Staging that are not in Item:

INSERT INTO Item(Number, Price, Description)
SELECT S.Number, S.Price, S.Description
FROM Item_Staging S LEFT JOIN Item I ON S.Number = I.Number
WHERE I.Number IS NULL

 Similar to the DELETE, we use a LEFT JOIN this time looking for recrods not present in the Item table.

Some additional pointers for you:

  • I recommend dropping and re-creating any indexes on the production table prior to running the import.  You should get a performance boost since the DBMS will only have to update indexes once after the processes have completed
  • Wrap the process in a transaction (always recommended when performing more than one operation at a time)

The full script below should:

  • DELETE Item 00003
  • UPDATE prices for Items 00001 and 00002 to 1.1 and 2.2 respectively
  • INSERT Item 00004
  • DELETE all data from Item_Staging
SET NOCOUNT ON
GO
BEGIN TRAN
 
DELETE Item
FROM Item LEFT JOIN Item_Staging ON Item.Number = Item_Staging.Number
WHERE Item_Staging.Number IS NULL
 
UPDATE Item SET
  Price=Item_Staging.Price
FROM Item INNER JOIN Item_Staging ON Item.Number = Item_Staging.Number
 
INSERT INTO Item(Number, Price, Description)
SELECT S.Number, S.Price, S.Description
FROM Item_Staging S LEFT JOIN Item I ON S.Number = I.Number
WHERE I.Number IS NULL
 
TRUNCATE TABLE Item_Staging
 
COMMIT TRAN
GO
SET NOCOUNT OFF
GO

Please feel free to share your bulk import and processing stories via the feedback section below.

posted @ Friday, December 22, 2006 11:11 AM


Print

Comments on this entry:

# re: Delete Absent Rows/Update Existing Rows/Insert New Rows

Left by Willie Tilton at 6/20/2007 3:23 PM

My implementation:

/*
Source Table <maps to> Destination Table
TtblStgFocusEnrollment | tblPatient
IPMS Generated | PatientID
IVRSiteNumber | SiteID
IVRPatientNumber | PatientCode
IVRSubjectInitials | MI
IVRSCreenNumber | PatExternalID
EmbedRandNo | RandNumber
IVRSubjectDOB | DOB
*/
USE CORE

DECLARE @StudyID INT
SET @StudyID = 11

SELECT COUNT(*) CountBefore FROM tblPatient
WHERE SiteID IN (SELECT SiteID FROM tblSite WHERE StudyID = @StudyID)

SET NOCOUNT ON
GO

BEGIN TRAN

/* Patients that were imported that did not exist in IPMS (New Records) */
SELECT
(SELECT SiteID FROM tblSite WHERE s.IVRSiteNumber = tblSite.SiteNo AND tblSite.StudyID = @StudyID) AS SiteID,
s.IVRSiteNumber AS SiteNo,
IVRPatientNumber AS PatientNo,
IVRSubjectInitials AS Initials,
IVRScreenNumber AS PatientExtID,
IVRSubjectDOB AS DOB,
EmbedRandNo AS RandNo
FROM TtblStgFocusEnrollment s
LEFT JOIN tblPatient p
ON s.EmbedRandNo = p.RandNumber
WHERE p.RandNumber IS NULL

/* Records that will be updated */
SELECT
(SELECT SiteID FROM tblSite WHERE s.IVRSiteNumber = tblSite.SiteNo AND tblSite.StudyID = @StudyID) AS SiteID,
s.IVRSiteNumber AS SiteNo,
IVRPatientNumber AS PatientNo,
IVRSubjectInitials AS Initials,
IVRScreenNumber AS PatientExtID,
IVRSubjectDOB AS DOB,
EmbedRandNo AS RandNo
FROM tblPatient p
INNER JOIN TtblStgFocusEnrollment s
ON p.RandNumber = s.EmbedRandNo

/*
IVRSiteNumber | SiteID
IVRPatientNumber | PatientCode
IVRSubjectInitials | MI
IVRScreenNumber | PatExternalID
EmbedRandNo | RandNumber
IVRSubjectDOB | DOB
*/

-- Update our existing records
UPDATE tblPatient SET
SiteID = (SELECT SiteID FROM tblSite WHERE s.IVRSiteNumber = tblSite.SiteNo AND tblSite.StudyID = @StudyID),
PatientCode = IVRPatientNumber,
MI = IVRSubjectInitials,
PatExternalID = IVRScreenNumber,
DOB = IVRSubjectDOB
FROM tblPatient p
INNER JOIN TtblStgFocusEnrollment s
ON p.RandNumber = s.EmbedRandNo

-- Insert new records
INSERT INTO tblPatient(SiteID, PatientCode, MI, PatExternalID, DOB, RandNumber)
SELECT
SiteID = (SELECT SiteID FROM tblSite WHERE s.IVRSiteNumber = tblSite.SiteNo AND tblSite.StudyID = @StudyID),
IVRPatientNumber,
IVRSubjectInitials,
IVRScreenNumber,
IVRSubjectDOB,
EmbedRandNo
FROM TtblStgFocusEnrollment s
LEFT JOIN tblPatient p
ON p.RandNumber = s.EmbedRandNo
WHERE p.RandNumber IS NULL

TRUNCATE TABLE TtblStgFocusEnrollment

-- Add all PatientVisit record placeholders to newly added patients

CREATE TABLE #Visits
(
VisitId INT,
VisitOrder INT
)

INSERT INTO #Visits (VisitID, VisitOrder)
SELECT VisitID, VisitOrder FROM tblVisit
WHERE StudyID = @StudyID

INSERT INTO tblPatientVisit (PatientID, VisitID, VisitFailed, PatientVisitType, VisitOrder)
SELECT p.PatientID, v.VisitID, 0 AS VisitFailed, 'Standard' AS PatientVisitType, v.VisitOrder FROM tblPatient p
LEFT JOIN tblPatientVisit pv
ON p.PatientID = pv.PatientID
INNER JOIN tblSite s
ON p.SiteID = s.SiteID
CROSS JOIN #Visits v
WHERE StudyID = @StudyID
AND pv.PatientID IS NULL

DROP TABLE #Visits

-- Update existing PatientVisits with new patient information

-- TODO...

IF @@error <> 0
ROLLBACK TRAN

COMMIT TRAN

GO
SET NOCOUNT OFF
GO

SELECT COUNT(*) CountAfter FROM tblPatient
WHERE SiteID IN (SELECT SiteID FROM tblSite WHERE StudyID = @StudyID)

Your comment:



 (will not be displayed)


 
 
 
Please add 5 and 6 and type the answer here:
 

Live Comment Preview:

 
«July»
SunMonTueWedThuFriSat
293012345
6789101112
13141516171819
20212223242526
272829303112
3456789