Back up your data, scientifically

In the realm of IT infrastructure, data integrity and availability are paramount. Companies rely on databases to store critical information, and losing this data can result in substantial financial loss. Hence, implementing an effective backup policy is crucial. However, backups incur costs and come with inherent risks, such as accidental data deletion. Understanding how to balance these costs and risks can help IT departments optimize their backup strategies, ensuring data protection without overspending.

Backing up a hard disk drive, as imagined by AI

Variables and assumptions in backup policy modeling

To model and optimize backup policies, we need to consider several key variables:

  1. Yearly Revenue (R): The total revenue generated by the company annually. This value is used to estimate the financial impact of data loss.
  2. Cost per Backup (C_b): The expense incurred each time a backup is performed, including labor and resources.
  3. Probability of Data Loss (P_d): The likelihood of an event that necessitates the use of a backup within a year.
  4. Probability of Accidental Deletion (P_a): The risk of accidentally deleting the database during the backup process, which we will explore later.

In creating our model, we make several assumptions to simplify and focus the analysis:

  1. Uniform Revenue Distribution: We assume that the yearly revenue is evenly distributed throughout the year, simplifying the calculation of data loss value.
  2. Fixed Backup Cost: The cost per backup is constant and does not vary with the number of backups performed.
  3. Independent Events: The probability of data loss and accidental deletion are independent events.
  4. Immediate Backup Restoration: In case of data loss or accidental deletion, the backup is restored immediately, and the associated revenue loss is based on the time since the last backup.
  5. Linear Revenue Loss: Revenue loss due to data loss is linearly proportional to the time since the last backup.
  6. Fixed Backup Frequency: We consider a fixed backup frequency (e.g., daily, weekly, monthly) for simplicity. The optimal frequency can be determined through optimization.
  7. No Backup Failure: We assume that backups are always successful and can be restored without any issues.
  8. No Data Growth: We do not consider data growth in this model. In practice, data growth should be factored into backup policies to ensure sufficient storage capacity.
  9. Even more assumptions that we are going to ignore for simplicity, because reality is complex.

Given these variables and assumptions, we can create a mathematical model to find the optimal backup frequency that minimizes the total expected yearly cost.

Simple Case: Without Accidental Deletion

In the simpler scenario, we assume that accidental deletion does not occur during the backup process. The goal is to find the optimal number of backups (N) per year, balancing the costs of performing backups against the potential revenue loss from data loss.

Total Backup Cost

The total backup cost is calculated by multiplying the number of backups performed per year by the cost of each backup:

\text{Total Backup Cost per Year} = N \times C_b

Expected Revenue Loss Due to Data Loss

The expected revenue loss is calculated based on the average time interval between backups. If backups are performed N times a year, the average period between backups is 365/N days. The average data loss interval, in case of a data loss event, is half this period:

\text{Average Data Loss Interval} = \frac{365}{2N}

Since revenue is assumed to be evenly distributed, the company generates R/365 per day, so the revenue loss per data loss event is proportional to this interval:

\text{Revenue Loss per Data Loss Event} = \frac{R}{2N}

Considering the probability of a data loss event, the expected yearly revenue loss is:

\text{Expected Yearly Revenue Loss} = P_d \times \frac{R}{2N}

Total Expected Yearly Cost

The total expected yearly cost is the sum of the total backup cost and the expected revenue loss:

\text{Total Expected Yearly Cost} = N \times C_b + P_d \times \frac{R}{2N}
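
To make this concrete, here is a minimal Python sketch of the cost function (the function and variable names are my own; the figures are just the illustrative values used later in the article):

```python
def total_expected_yearly_cost(n_backups, revenue, cost_per_backup, p_data_loss):
    """Total expected yearly cost = backup cost + expected revenue loss from data loss."""
    backup_cost = n_backups * cost_per_backup                  # N * C_b
    expected_loss = p_data_loss * revenue / (2 * n_backups)    # P_d * R / (2N)
    return backup_cost + expected_loss

# Illustrative comparison: weekly vs. daily backups for a $5M/year business
print(total_expected_yearly_cost(52, 5_000_000, 50, 0.1))    # weekly
print(total_expected_yearly_cost(365, 5_000_000, 50, 0.1))   # daily
```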

Charting the Cost Function

Let's use the last formula to plot the total expected yearly cost as a function of the number of backups per year, assuming $5M of yearly revenue, $50 of cost per backup, and a 10% probability of data loss within a year.

As we can see in the chart, there is a point where the total expected yearly cost is minimized. This point represents the optimal number of backups per year that balances the cost of performing backups with the potential revenue loss from data loss events.
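
A chart like this can be reproduced with a few lines of matplotlib; the following sketch assumes the same illustrative figures, and the variable names are mine:

```python
import numpy as np
import matplotlib.pyplot as plt

R, C_b, P_d = 5_000_000, 50, 0.1       # illustrative values from the example

N = np.arange(1, 301)                  # candidate numbers of backups per year
cost = N * C_b + P_d * R / (2 * N)     # total expected yearly cost

best = N[np.argmin(cost)]
plt.plot(N, cost)
plt.axvline(best, linestyle="--", label=f"minimum at N = {best}")
plt.xlabel("Backups per year (N)")
plt.ylabel("Total expected yearly cost ($)")
plt.legend()
plt.show()
```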

Break-Even Analysis

To find the optimal N, i.e. the optimal frequency of backups, we set the total backup cost equal to the expected revenue loss:

N \times C_b = P_d \times \frac{R}{2N}

This equation reflects the balance point where the cost of performing backups is equal to the potential loss from data loss events. Multiplying both sides by N and dividing by C_b gives N^2 = \frac{P_d \times R}{2 \times C_b}, and taking the square root:

N = \sqrt{\frac{P_d \times R}{2 \times C_b}}

Why is this the optimal frequency? Because it minimizes the total expected yearly cost, as shown in the chart above. This value of N corresponds to the lowest point on the curve, where the cost of performing backups is balanced against the potential revenue loss from data loss events.

Also, this formula tells us that the higher the revenue or the probability of data loss, and the lower the cost per backup, the more frequent the backups should be.
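
A small sketch to illustrate this sensitivity, using a hypothetical helper around the formula above:

```python
from math import sqrt

def optimal_backups_per_year(revenue, cost_per_backup, p_data_loss):
    """Break-even backup frequency: N = sqrt(P_d * R / (2 * C_b))."""
    return sqrt(p_data_loss * revenue / (2 * cost_per_backup))

print(optimal_backups_per_year(5_000_000, 50, 0.1))     # baseline: ~70.7
print(optimal_backups_per_year(10_000_000, 50, 0.1))    # double the revenue: ~100.0
print(optimal_backups_per_year(5_000_000, 10, 0.1))     # cheaper backups: ~158.1
print(optimal_backups_per_year(5_000_000, 50, 0.2))     # riskier environment: ~100.0
```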

Example Calculation

Given:

  • Yearly revenue (R) = $5,000,000
  • Cost per backup (C_b) = $50
  • Probability of data loss (P_d) = 10% (or 0.1)

Plug these values into the formula:

N = \sqrt{\frac{0.1 \times 5,000,000}{2 \times 50}} = \sqrt{5,000} \approx 70.71

Therefore, the optimal number of backups per year is approximately 71, that is, roughly once every 5 days.
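
The same arithmetic in Python, including the implied backup interval:

```python
from math import sqrt

R, C_b, P_d = 5_000_000, 50, 0.1
N = sqrt(P_d * R / (2 * C_b))
print(f"Optimal backups per year: {N:.2f}")        # ~70.71
print(f"Days between backups:     {365 / N:.2f}")  # ~5.16
```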

The Human Factor: Including Accidental Deletion during a Backup

The previous scenario was uneventful, except for the main event of data loss. In this more complex scenario, we incorporate the human factor by considering the probability of accidentally deleting the database during the backup process itself. This adds another layer of risk and potential cost.

Reality check: Accidental deletion is just one example of an additional event; others may also matter in reality. For example, a backup might be corrupted, forcing us to restore an older one and increasing the data loss cost; or a data loss might be partial rather than complete. The model can be adapted to include such events, but this increases its complexity, and exact algebraic solutions may no longer be feasible, so other methods such as numerical optimization or Monte Carlo simulation may be needed (a rough simulation sketch follows below). The cost of data loss could also be modeled more realistically: it might include not only lost revenue but also reputational damage or the loss of future revenue. These are all valid concerns, but for simplicity, we are going to ignore them.
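
To give a flavour of that last option, here is a rough Monte Carlo sketch of the simple model only (fixed-interval backups, at most one data loss per year, no corruption or partial losses); the event model and names are my own simplification:

```python
import random

def simulate_year(n_backups, revenue, cost_per_backup, p_data_loss, rng):
    """One simulated year: fixed-interval backups, at most one data-loss
    event at a uniformly random time, linear revenue loss."""
    cost = n_backups * cost_per_backup
    if rng.random() < p_data_loss:
        day = rng.random() * 365                 # when the loss happens
        interval = 365 / n_backups
        days_lost = day % interval               # time since the last backup
        cost += revenue * days_lost / 365        # linear revenue loss
    return cost

def estimated_yearly_cost(n_backups, revenue=5_000_000, cost_per_backup=50,
                          p_data_loss=0.1, trials=100_000, seed=42):
    rng = random.Random(seed)
    total = sum(simulate_year(n_backups, revenue, cost_per_backup, p_data_loss, rng)
                for _ in range(trials))
    return total / trials

print(estimated_yearly_cost(71))   # should land close to the analytical ~7,070
```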

Expected Cost Due to Accidental Deletions:

\text{Expected Number of Accidental Deletions per Year} = N \times P_a

\text{Cost per Accidental Deletion} = \frac{R}{2N}

\text{Expected Yearly Cost of Accidental Deletions} = N \times P_a \times \frac{R}{2N} = P_a \times \frac{R}{2}

Total Expected Yearly Cost:

The total expected yearly cost now includes the cost of accidental deletions:

\text{Total Expected Yearly Cost} = N \times C_b + P_d \times \frac{R}{2N} + P_a \times \frac{R}{2}
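
In code, this is a one-term extension of the earlier sketch (names and figures are again illustrative):

```python
def total_expected_yearly_cost(n_backups, revenue, cost_per_backup,
                               p_data_loss, p_accidental_deletion):
    backup_cost = n_backups * cost_per_backup                   # N * C_b
    data_loss = p_data_loss * revenue / (2 * n_backups)         # P_d * R / (2N)
    deletions = p_accidental_deletion * revenue / 2             # P_a * R / 2
    return backup_cost + data_loss + deletions

# Example: $5M revenue, $50 per backup, 10% data loss, 0.1% accidental deletion
print(total_expected_yearly_cost(100, 5_000_000, 50, 0.1, 0.001))
```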

Break-Even Analysis:

Setting the total backup cost equal to the sum of the expected revenue loss and the expected cost of accidental deletions gives:

N \times C_b = P_d \times \frac{R}{2N} + P_a \times \frac{R}{2}

Plugging in the same values as before, and assuming a 0.1% probability of accidental deletion (P_a = 0.001):

N \times 50 = \frac{250,000}{N} + 2,500

Multiplying both sides by N:

50N^2 = 250,000 + 2,500N

Dividing by 25 and rearranging gives the quadratic equation:

2N^2 - 100N - 10,000 = 0

Using the quadratic formula:

N = \frac{100 + \sqrt{100^2 + 4 \times 2 \times 10,000}}{4} = \frac{100 + 300}{4} = 100

Taking the positive root, the optimal number of backups per year is 100, that is, once every 3.65 days.
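
The same break-even point can also be found numerically, for example with numpy's polynomial root finder; this sketch assumes the figures above, including the 0.1% accidental deletion probability:

```python
import numpy as np

R, C_b, P_d, P_a = 5_000_000, 50, 0.1, 0.001

# Break-even: N * C_b = P_d * R / (2N) + P_a * R / 2
# Multiply by N and rearrange: C_b * N^2 - (P_a * R / 2) * N - P_d * R / 2 = 0
roots = np.roots([C_b, -P_a * R / 2, -P_d * R / 2])
N = max(roots)                     # keep the positive root
print(N, 365 / N)                  # ~100 backups per year, one every ~3.65 days
```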

Conclusion

Optimizing backup policies in IT involves balancing the cost of performing backups against the potential financial loss from data loss and accidental deletion. In simpler scenarios without accidental deletion, the optimal backup frequency can be derived with straightforward algebra. It is not always possible to model all the complexities of real-world scenarios, but the effort to quantify and optimize backup policies, like any other process, can help organizations make informed decisions. For example, in this case we found that introducing human error raises the optimal backup frequency, because backups now also have to offset the expected cost of accidental deletions on top of the data loss itself. This may be counterintuitive at first, but it makes sense once we think it through.