SQL in a snap

Written on October 4, 2019

Update 29/11/19

AWS have announced Fast Snapshot Restores, which I cover in more detail in this follow-up post

Ditching read replicas

I came across an interesting proposal in the AWS whitepaper “Best Practices for deploying Microsoft SQL Server on AWS”. In the section “Eliminating Active Replica licenses”, it suggests using EBS snapshots as an alternative method, given the licensing restrictions around SQL replicas.

Most production SQL instances would be clustered to provide high availability, but there are some licensing requirements you need to take into account when doing this. I’m not a licensing expert, so I would suggest you exercise your own good judgement and talk to a specialist to understand the particulars of your environment.

At the time of writing (Jan 2020), if you want to set up Microsoft SQL Server with Always On Availability Groups, you can at a minimum license just the primary instance; the secondary instance does not need to be separately licensed in this Active/Passive scenario.

However, this really does limit what can be done with the secondary instance: you cannot offload backups to it or perform any read actions. For that you need to license the secondary instance, at which point it can be used as a true read replica. You could add a third instance as a read replica, but the goal here was to reduce the number of licenses required to save cost.

For the scenario I was investigating, we needed access to the SQL data for reporting in a separate SQL data warehouse, which was already licensed, without the additional expense of licensing another SQL Server just to extract the data. We also wanted to protect the primary instance from large daily extracts, as it was already busy, which is how I found the whitepaper above and its suggestion to use snapshots.

AWS Systems Manager for automation

The objective here was to provide a proof of concept: snapshot the core SQL database running on one instance and present it to a second instance running a separate SQL Server. We would then need to mount these new volumes and run a process to attach the database and import the delta changes into the data warehouse.

In order to complete this, we needed to make sure that the backup of the core SQL database was “application consistent” and that any in-flight work SQL was doing was flushed to disk, so that when we brought up the volumes on the other instance they would be in a healthy state.

Luckily AWS Systems Manager has a lot of this functionality ready to go!

Using “Run Command”, Systems Manager can connect to our instances and execute commands, and it also has the ability to run tasks against the AWS APIs. This sealed the deal for us, as we could have a single automation pipeline that talks to Windows and orchestrates the AWS resources in one pane of glass.
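To give a feel for what that looks like from the outside, here’s a minimal sketch of kicking off such a pipeline with the AWS Tools for PowerShell. The document name SqlSnapshotPipeline and its parameters are placeholders for whatever your own Automation document defines:

```powershell
Import-Module AWS.Tools.SimpleSystemsManagement

# Start the automation pipeline against the two instances
# (document name and parameter names are illustrative only)
$execId = Start-SSMAutomationExecution -DocumentName 'SqlSnapshotPipeline' -Parameter @{
    SourceInstanceId = 'i-0123456789abcdef0'   # core SQL instance
    TargetInstanceId = 'i-0fedcba9876543210'   # data warehouse instance
}

# Check on the execution's progress
Get-SSMAutomationExecution -AutomationExecutionId $execId
```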

So below is the basic flow of the automation pipeline:

  1. VSS snapshot of the EBS volumes attached to our core SQL instance
    • This is an AWS-provided automation “Document”: a PowerShell script that runs on the instance, completes the VSS tasks, then calls the AWS APIs to create EBS snapshots with specific tags, one of which is the Automation job’s unique ID.
  2. Get the snapshot IDs from the previous task
    • You cannot specify the ID of an EBS snapshot as it is created, so we use the tags applied in the previous step to locate the correct snapshots for this job (see the first sketch after this list).
  3. Check the snapshots have finished creating
    • Depending on how much data has changed since the last snapshot, creating a new one can take a while, so here we let SSM poll until the snapshots are available.
  4. Create volumes from the new snapshots
    • In order to use the snapshots on our second instance we need to create new EBS volumes from them (see the second sketch below).
  5. Check the volumes were created successfully
    • More on this later, but as snapshots are stored in S3 they need to be “rehydrated” back into EBS, which can take some time, so again we have SSM poll until the volumes are created successfully.
  6. Attach the new volumes to our data warehouse instance
    • Now that the volumes exist, we attach each one to our instance with a unique ‘device name’.
  7. Get Windows to scan for new volumes
    • Without doing this, the SQL script that follows would sometimes fail, so this just lets Windows know we have added some disks and it might want to do something about it (see the third sketch below).
  8. SQL import script
    • This uses Run Command to execute a PowerShell script that runs Invoke-Sqlcmd to call a stored procedure that does the delta import from our snapshot into the data warehouse.
  9. SQL cleanup script
    • We then run another PowerShell script to invoke a stored procedure that closes connections to the snapshot volumes and detaches the database.
  10. Volume and snapshot cleanup
    • We then run a series of cleanup tasks that detach the volumes, delete the volumes, and delete the snapshots (see the final sketch below).
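In our case these steps ran as SSM Automation actions, but the same logic is easy to prototype with the AWS Tools for PowerShell. Here’s a minimal sketch of steps 2 and 3, assuming the VSS step stamped each snapshot with the job ID under a tag key of AutomationId (the tag key and variable names are placeholders):

```powershell
Import-Module AWS.Tools.EC2

# The unique ID the VSS step stamped onto each snapshot as a tag
$jobId = '<automation-execution-id>'

# Step 2: locate the snapshots created by this job via their tags
$snapshots = Get-EC2Snapshot -Filter @{ Name = 'tag:AutomationId'; Values = $jobId }

# Step 3: poll until every snapshot reports 'completed' - how long this
# takes depends on how much data changed since the last snapshot
while ($snapshots | Where-Object { $_.State -ne 'completed' }) {
    Start-Sleep -Seconds 30
    $snapshots = Get-EC2Snapshot -Filter @{ Name = 'tag:AutomationId'; Values = $jobId }
}
```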
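Steps 4 to 6 then turn those snapshots into volumes and attach them. A sketch carrying on from the variables above; the availability zone, instance ID and device names are made up and must match your own environment:

```powershell
Import-Module AWS.Tools.EC2

$az       = 'eu-west-1a'            # must be the warehouse instance's AZ
$targetId = 'i-0fedcba9876543210'   # data warehouse instance
$devices  = @('xvdf', 'xvdg')       # one unique device name per volume

# Step 4: create a new EBS volume from each snapshot
$volumeIds = $snapshots | ForEach-Object {
    (New-EC2Volume -SnapshotId $_.SnapshotId -AvailabilityZone $az -VolumeType gp2).VolumeId
}

# Step 5: poll until every volume reports 'available'
while (Get-EC2Volume -VolumeId $volumeIds | Where-Object { $_.State -ne 'available' }) {
    Start-Sleep -Seconds 15
}

# Step 6: attach each volume to the warehouse instance under its device name
for ($i = 0; $i -lt $volumeIds.Count; $i++) {
    Add-EC2Volume -VolumeId $volumeIds[$i] -InstanceId $targetId -Device $devices[$i]
}
```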
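Steps 7 to 9 run on the warehouse instance itself, via Run Command. Something like the following, where the stored procedure names are placeholders for the real import and cleanup procs:

```powershell
Import-Module SqlServer

# Step 7: rescan the storage bus so Windows sees the newly attached
# disks, then bring them online
Update-HostStorageCache
Get-Disk | Where-Object IsOffline | Set-Disk -IsOffline $false

# Step 8: run the stored procedure that attaches the snapshot database
# and imports the delta into the warehouse
Invoke-Sqlcmd -ServerInstance 'localhost' -Query 'EXEC dbo.usp_ImportDelta'

# Step 9: close connections to the snapshot database and detach it
Invoke-Sqlcmd -ServerInstance 'localhost' -Query 'EXEC dbo.usp_DetachSnapshot'
```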
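Finally, step 10 tears everything down once the import has finished. Again a sketch, using the same placeholder variables as above:

```powershell
Import-Module AWS.Tools.EC2

# Step 10: detach the volumes from the warehouse instance...
$volumeIds | ForEach-Object { Dismount-EC2Volume -VolumeId $_ }

# ...wait until they report 'available' again (i.e. detached), then
# delete both the volumes and the source snapshots
while (Get-EC2Volume -VolumeId $volumeIds | Where-Object { $_.State -ne 'available' }) {
    Start-Sleep -Seconds 15
}
$volumeIds | ForEach-Object { Remove-EC2Volume -VolumeId $_ -Force }
$snapshots | ForEach-Object { Remove-EC2Snapshot -SnapshotId $_.SnapshotId -Force }
```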

You can find the SSM documents and SQL stored procedures I used and some instructions on building this here.

Dehydration

One of the other reasons for building this proof of concept was to analyse the time it would take to backup and restore on a busy system.

As I mentioned above, EBS snapshots are stored in S3, so when you come to restore them they are instantly accessible, but not all the blocks have been transferred from S3 to EBS yet, so you can run into performance issues.

“For volumes that were restored from snapshots, the storage blocks must be pulled down from Amazon S3 and written to the volume before you can access them. This preliminary action takes time and can cause a significant increase in the latency of I/O operations the first time each block is accessed. Volume performance is achieved after all blocks have been downloaded and written to the volume.” taken from here.

This was going to be less than ideal for our ETL solution, but I wanted to test it out, as we are looking at a delta copy of the data rather than table scans of the entire database. As we are primarily focused on the transaction logs, which are written sequentially, I was hopeful that those blocks would be the first to rehydrate from S3.

HammerDB

In order to benchmark this solution, I decided to use the open source tool HammerDB, which is great for putting SQL Server under some load. I ran the following experiments a number of times, with HammerDB configured for 200 warehouses.

Benchmark

The first table shows the time it takes to insert data into a fresh EBS volume on the same server. This is the control, establishing a baseline against which to measure the overhead of moving data from S3 to EBS.

# Transactions   # Rows (Avg)   Size (KB) (Avg)   Time Taken (Avg)
2,000,000        78,269,604     5,185,270         0:07:33
100,000,000      217,246,539    14,185,112        0:18:30

The second table shows the time taken when inserting data into a freshly minted volume created from the snapshot.

# Transactions   # Rows (Avg)   Size (KB) (Avg)   Time Taken (Avg)
2,000,000        78,269,604     5,185,270         0:08:54
100,000,000      217,246,539    14,185,112        0:29:01

Observations

For 2M inserts and ~5GB of data:

  • Control: 0:07:33
  • From snap: 0:08:54
  • Difference: ~17% overhead from snap

For 100M inserts and ~14GB of data:

  • Control: 0:18:30
  • From snap: 0:29:01
  • Difference: ~57% overhead from snap

So the overhead does not scale linearly with the amount of data. The tests also proved we were well within our tolerance for the automation process, and the daily rate of change was nowhere near these volumes.
