In a recent blog post on database maintenance tips, I mentioned that one important facet of cleaning up the database is to remove records that we no longer need — those that don’t contribute value to the applications and users who use the database. This is an important maintenance process, but there are some equally important considerations when thinking about deleting any data from the database.
These considerations are driven by a few key questions that we need to ask:
Why are we deleting the data?
Performance? Disk space cost? Security? Organization, Simple lack of value? Let’s look at each of these reasons and look for alternatives to deleting the data in order to avoid losing any untapped future value stored in that data.
Performance
If we are thinking of removing data as a way to improve performance, can we instead use partitioning, archiving or indexing to achieve adequate performance while preserving the data? Can we tune the most expensive queries to reduce load on the system? Can we increase resources on the server or move the database to a more powerful server?
Disk space cost
If our purpose is to reduce the cost of disk space to store the data, can we partition the data and archive the older or less-used data to lower cost storage? Is compression an option on our hardware platform (eg, Oracle’s exadata platform)? Can we remove some indexes that are taking up space but not adding a performance boost?
Security
If we are seeking to remove the data to improve security by reducing the data footprint, can we leave the data there and achieve the level of security that we need by using an encryption scheme, virtual private database (Oracle), or another tighter access control scheme like label security?
Organization (reducing clutter)
If we are removing the data because we don’t want to see it — because the reports, queries, and dropdowns in the application screens are unwieldy, can we tag records as deleted instead of actually removing them and filter queries based on these tags? Can we create views for each table that filters these records and use synonyms to redirect apps to those views to minimize application changes?
Lack of value
If we are certain that the data really has no value, let’s get rid of it. This is the data that just sucks energy from the system and gets in the way. But when doing so, let’s be sure to do so subject to the considerations below. Even deleting valueless data can cause problems if we are not careful.
Considerations before deleting anything
Once we are convinced that the data has to go, we have to ask all of the following questions in order to create the proper process for deleting it.
Interfaces with external systems
Downstream and Upstream systems may break as a result of deleting data.
Downstream systems may contain supplemental data that will be left dangling if you delete a record from your system and do not also delete it (and the associated supplemental data) from the downstream system. This can cause applications to fail or worse, can cause invalid results to appear on reports.
Upstream systems may be subject to numerous problems as well. They may re-introduce the same records that you delete, or they may send child records that are associated with records that you deleted, causing interfaces to fail. Worse, without proper logging in your interfaces, errors like this can go undetected.
Of course, this problem can be recursive. Each of the upstream and downstream systems may have upstream and downstream systems of their own having the same potential risks and complications.
Constraints and Dependencies
Are there database constraints with triggers that would result in child data being deleted and do we want this? As we think about whether we want to delete older customer invoices for example, do we want to delete the order history at the individual item level? We may not want order history with respect to a customer after 7 years, but do we want to lose the information about the quantity of each item that was ordered over time? If we want to keep the item order counts but not the invoices, then we may need to store that data differently in order to be able to delete the invoices.
Or conversely, is there any data that would be left dangling because there are no integrity constraints defined to delete child data? If we delete old customer records for example, will there be customer address data, demographic data, or customer preference data left behind resulting in data inconsistencies and invalid reporting? This is just one reason why database integrity constraints are so important in a database design, but that is a topic for another post.
All these questions must be applied to the child data as well. If we delete this data, is there dangling parent data that would be left useless or meaningless without its detail and therefore should be deleted as well?
Retention Requirements
Does your organization have any retention or destruction requirements for this data? Better check with legal! Aside from your organization’s own data management policies, numerous regulations specify how long data of different classifications must be retained. For example, as mentioned in this article by USSignal, The Fair Labor Standards Act, the Bank Secrecy Act, the Payment Card Industry Data Security Standard (PCI DSS), the Health Insurance Portability and Accountability Act (HIPAA), and the Federal Information Security Management Act (FISMA), among others, all specify data retention requirements. Be careful to delete any data while you may be required to produce it in the event of a legal action.
The Deletion Process
So we have confirmed that we want to delete a set of data, we have confirmed what ancillary data needs to go with it and in what systems, and we have confirmed that we can legally delete the data. Now we have to think about how to do it safely. Here are some guidelines that we follow before hitting that delete button.
Script It
All delete commands must be scripted! All delete commands must be scripted! It was worth saying that twice. First, accidentally deleting records is an obviously bad thing, and accidental commands are much more likely when working in a command prompt in SSMS or SQLPlus than when we have carefully crafted the commands and placed them in a script with comments and logging.
Preview and Approval
A preview of any data to be deleted should be made available to the application/database owner with authority to approve the removal of the data. This can be done by issuing the same command (same criteria) that will be used to delete but as a simple select command. It can be presented in a csv file, spreadsheet, or a live query. And the preview should either detail all of the child and parent data that will be removed, or just the parent records and a written description of the child data that will be deleted along with it. The approver must be made aware of whether or not the delete process is reversible. This should be approved in writing before the data is actually deleted in the production environment.
Make it Repeatable
One reason to script the process is so that it is repeatable. The script should be based on criteria such that if additional data that matches that criteria is introduced after the initial delete, we can rerun the script and catch the additional data. This is also very useful if we have the ability to issue the commands in staging or test environments before doing so in production.
Do it on Staging/Test first
It is very possible that after deleting data, the user will realize that they did not account for all of the implications of doing so. Whenever possible, the exact delete process should be done in a staging or test environment before doing so in the production environment, with end user testing before the actual production delete.
Log it
Finally, we must log the delete. This means actually keeping a record of the records that were deleted. This does not mean keeping all of the data, just a record of a few key fields so the removal of the records can be traced should there be questions later. For example, if old invoices are removed, keep a record of the customer number, the invoice, number, and the date of those records that are removed. This can be done with simple select statements executed prior to the delete command using the same exact criteria.
Is it worth it?
That all sounds like a lot of work, and it is. But the implications of deleting data are significant and restoring or reproducing deleted data can be extremely difficult and time consuming or even impossible, so a thoughtful, diligent process is required and worth all of that work.