- Downtime can lead to lost revenue, reputational damage, and legal risks, and businesses with on-premise setups are fully responsible for risk mitigation.
- What can go wrong will go wrong. You need to be proactive and provide for hardware failures, cyberattacks, natural disasters, and other threats to data integrity and system uptime.
- Our expert explains how implementing tailored high availability (HA) and disaster recovery (DR) strategies, including distributed data centers, recovery plans, and network solutions, helps ensure resilience and continuity.
Downtime due to data loss or any other reason is a nightmare for any data-driven business, whether it manages fleets and logistics or safeguards personal safety and security. With cloud deployment, the service provider handles risks, but in on-premise setups the responsibility falls on the business. Imagine running a fleet management company, only to wake up one morning and find that your tracking platform is down. Vehicles are stranded, customers are calling, and your team is scrambling to recover lost data. Every hour lost is revenue lost, clients lost, and, in some cases, legal trouble knocking on your door. No wonder businesses are demanding 99.999% availability—even a few minutes of downtime can have severe consequences.
In this article, I’m providing some clarity and sharing expert recommendations and practical steps to achieve the highest possible service uptime with high availability (HA) and disaster recovery (DR) strategies.
How can downtime and data loss cost you customers and revenue?
I'd like to start by sharing a real customer case (kept anonymous to avoid reputational risks) that actually inspired me to write this article. Let’s hear them out:
My organization had a high-tech data center. We had recently upgraded our server fleet and fully expected it to be reliable. After all, what could go wrong when you’ve invested heavily in modern hardware?
But then the flash floods hit, and our entire ground-floor infrastructure was wiped out. Every single piece of data stored internally was gone—including our entire Navixy On-Premise tracking history. We feared that our customers had permanently lost their records. Only by sheer luck did one of our sysadmins find a six-month-old backup on an external drive, allowing us to recover a fraction of the data. But the more recent tracking records? They were irretrievably lost.
If only we had thought earlier about geographical distribution of data storage, regular backups and database replication, we would have recovered much faster and not lost so much valuable information.
This real-world incident highlights an undeniable truth—downtime and data loss can have devastating consequences for any business.
When using on-premise software, you are solely responsible for maintaining your servers. This gives you complete control over your and your customers’ data—an advantage often required by regulatory policies or security considerations. However, with that control comes full accountability. If disaster strikes and your infrastructure isn’t resilient enough, the responsibility for service downtime, data recovery, and potential financial loss or reputational damage falls entirely on you.
Let’s break down the potential consequences and impact for telematics-related businesses.
For telematics service providers (TSPs)
- Reputational damage – Clients expect real-time, uninterrupted tracking services. Frequent outages erode trust, pushing customers to switch to more reliable competitors.
- Financial loss – Losing clients due to unreliable service means a direct hit to your revenue.
- Legal liability – If clients suffer financial loss due to missing tracking data, they may hold you accountable, leading to potential lawsuits or compensation claims.
For fleet operators
- Payroll disruptions – Without access to tracking data, businesses lose insight into driver shifts, task completion, and work hours, making payroll processing chaotic and error-prone.
- Accounting and expense tracking nightmares – Fuel consumption records, route logs, and operational expenses tied to fleet activity can disappear, forcing businesses to reconstruct financial records manually—a time-consuming, costly, and (again) error-prone process.
- Vehicle maintenance chaos – Losing maintenance logs can lead to missed servicing schedules, resulting in unexpected breakdowns, higher repair costs, and compliance issues with safety regulations.
For system integrators
- Reputational and business risk – A poorly managed infrastructure can result in client system failures, prolonged downtime, or critical data loss. This damages your credibility as a trusted provider and increases client churn and dissatisfaction.
- Financial and contractual penalties – Many system integrators operate under strict SLAs (service level agreements). Failing to meet uptime guarantees can result in penalties, loss of contracts, or even legal disputes.
- Legal and compliance liabilities – If an integrator fails to ensure data security, redundancy, or retention policies, clients might face regulatory fines, audits, or lawsuits.
- Operational burden and crisis management – System downtime leads to urgent troubleshooting, recovery efforts, and escalated customer support requests, straining resources and increasing operational costs.
- Lost future business opportunities – Enterprise clients demand reliable, always-available solutions. If an integrator lacks a strong, well-tested infrastructure, they risk losing big contracts to competitors who can guarantee better uptime.
Beyond immediate financial impact
- Regulatory non-compliance – Many industries, including logistics and security services, must comply with strict data retention laws. If a business loses tracking records, it can face fines, audits, or even operating restrictions.
- Operational disruptions – Vehicles can be stranded mid-route, shipments delayed, and customer SLAs broken, leading to cascading failures in supply chains and logistics planning.
As you can see, the list is long, and, believe me, it’s just the tip of the iceberg: these are only the most obvious troubles downtime can cause.
What can cause the damage?
Understanding the risks that can lead to downtime and data loss is the first step in preventing them. When you know what can go wrong, you can take proactive measures to protect your infrastructure, minimize disruptions, and ensure business continuity.
Here are the key risk factors you might want to consider:
Hardware failures
Even the most advanced infrastructure is vulnerable to hardware malfunctions. Server crashes, network device failures, or storage system breakdowns can lead to critical service disruptions. Power issues, such as UPS failures or short circuits, can further increase the risk of unexpected downtime.
Software vulnerabilities
Operating system bugs, outdated software, and security loopholes create unstable environments that can lead to failures. Poorly maintained or misconfigured software increases the risk of system crashes, data corruption, and performance issues.
Human errors and malicious actions
Many system failures result from human mistakes—misconfigurations, accidental deletions, or improper maintenance. Additionally, internal threats such as sabotage, unauthorized access, or deliberate data tampering can compromise system integrity.
Network disruptions
Connectivity issues, ISP outages, routing failures, or network hardware malfunctions can disconnect users from your platform, rendering tracking and operational data inaccessible.
Natural and man-made disasters
Unpredictable events like earthquakes, floods, and hurricanes can physically damage infrastructure, while fires, explosions, or industrial accidents can destroy entire data centers. Businesses relying on a single-location setup are particularly vulnerable to such risks.
Cyberattacks and security threats
Malicious actors constantly target on-premise infrastructures with DDoS attacks, ransomware, and data breaches. A lack of proper security measures can lead to data loss, unauthorized access, and prolonged system outages.
What is high availability and disaster recovery?
Identifying potential risks is the first step. The next is ensuring your systems can withstand disruptions and recover quickly when failures occur. This is where high availability (HA) and disaster recovery (DR) come into play. Both are essential strategies in IT infrastructure, designed to minimize downtime and data loss. Let’s break them down.
High availability (HA)—ensuring continuous operation
High availability refers to a system’s ability to remain operational with minimal downtime. It ensures critical applications and data are always accessible, even in the event of hardware failures or unexpected disruptions.
This is typically achieved through redundancy and failover mechanisms—for example, if one server fails, the system automatically switches to a backup. The availability level is measured as a percentage, with businesses often targeting 99.99% (also known as "four nines," allowing under an hour of downtime per year) or even 99.999% ("five nines," just over five minutes per year).
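To make those figures concrete, here is a quick back-of-the-envelope calculation of the downtime budget each availability tier allows per year. The tier list is just illustrative; the formula is simply the unavailable fraction of a year expressed in minutes:

```shell
# Downtime budget per year for common availability tiers.
# Minutes per year = (1 - availability) * 365.25 days * 24 h * 60 min
for a in 0.999 0.9999 0.99999; do
  awk -v a="$a" 'BEGIN { printf "%s%%  ->  %.1f minutes of downtime per year\n", a*100, (1-a)*365.25*24*60 }'
done
```

Running this shows why "five nines" is such a demanding target: it leaves barely five minutes per year for all failures, maintenance mishaps, and switchovers combined.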
Maintaining high availability is crucial for business operations, customer trust, and service reliability.
Disaster recovery (DR)—restoring operations after failure
Disaster recovery focuses on restoring IT systems and data following a major disruption—whether due to hardware failure, cyberattacks, or natural disasters. Unlike high availability, which aims to prevent downtime, disaster recovery ensures a rapid return to normal operations after an incident.
A strong DR strategy includes:
- Regular backups to prevent data loss
- Data replication across multiple locations
- A structured recovery plan to minimize downtime and disruption
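As a minimal illustration of the first point, a nightly backup job can be as simple as a dump-compress-prune script run from cron. This is a generic sketch, not any client's actual setup: the paths and the 14-copy retention are assumptions, and the mysqldump line is commented out because it needs a live database, so a placeholder file stands in for the dump:

```shell
# Nightly backup sketch: dump the database, compress, and prune old copies.
BACKUP_DIR=${BACKUP_DIR:-./backups}   # on a real server: e.g. a mounted remote volume
STAMP=$(date +%Y%m%d)
mkdir -p "$BACKUP_DIR"

# On a live system the dump would be produced like this (hypothetical setup):
#   mysqldump --single-transaction --all-databases | gzip > "$BACKUP_DIR/db-$STAMP.sql.gz"
gzip -c /dev/null > "$BACKUP_DIR/db-$STAMP.sql.gz"   # placeholder so the prune step can be rehearsed

# Keep only the 14 most recent backups; copies should also be shipped off-site.
ls -1t "$BACKUP_DIR"/db-*.sql.gz | tail -n +15 | xargs -r rm -f
```

The `--single-transaction` flag lets mysqldump take a consistent InnoDB snapshot without locking tables; shipping the resulting archives to another location is what turns this from a convenience into actual disaster recovery.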
Without an effective DR plan, businesses risk significant data loss, operational delays, and financial consequences.
Why do both matter?
While high availability keeps systems running smoothly, disaster recovery ensures quick restoration after an outage. Together, they provide a resilient infrastructure that safeguards business continuity, data integrity, and long-term success.
Geared up with this knowledge, let’s figure out how to choose the right HA/DR strategy for your particular business.
Key considerations when choosing the right HA/DR strategy for your business
The right approach to high availability and disaster recovery depends on the specifics of the local infrastructure. There is no one-size-fits-all solution—each implementation is a process that requires careful planning, adaptation, and expertise.
For example, in a local data center, maintaining high availability relies on robust hardware redundancy and physical infrastructure monitoring. On the other hand, cloud environments such as AWS depend on virtualized failover mechanisms, making hardware issues less of a concern.
Since on-premise deployments come with unique challenges, it’s important to look at real-world scenarios to understand how businesses adapt their HA/DR strategies based on infrastructure risks. At this point, I want to get back to the case I mentioned at the beginning of the article. Remember? Our client, confident in their modern, well-equipped data center, never expected a flash flood to wipe out their entire infrastructure. They faced irretrievable data loss, operational chaos, and a scramble to recover whatever they could from outdated backups.
This experience was a turning point. Determined to never face such a crisis again, they overhauled their HA/DR strategy to build a system that could withstand even the most severe disruptions. Below are the key considerations that shaped their new approach.
1. Geographical distribution—reducing single points of failure
The backbone of their strategy was deploying two geographically distributed data centers in different regions. The primary site handled all customer operations, while the secondary site remained on standby, continuously replicating databases, backend services, and other system components.
In the event of a failure at the primary site, the secondary site was prepared to take over. This setup ensured an uninterrupted user experience with minimal downtime, providing resilience against both hardware failures and external threats.
2. Disaster recovery plan—preparing for worst-case scenarios
A well-defined, regularly tested recovery plan was at the heart of their DR strategy. The plan outlined exact steps to restore IT infrastructure and resume operations in case of failure. Regular testing and updates ensured all recovery procedures remained effective and aligned with evolving risks.
To improve fault detection and failover efficiency, businesses often implement an intermediate arbiter and a dedicated monitoring system to detect failures and trigger automatic transitions. However, in this case, automated failover was not an option. Instead, they relied on manual data center switching based on alerts from the monitoring system. While this introduced a slight delay, it kept downtime to just a few minutes when executed properly.
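To illustrate the manual-switch logic, here is a hedged sketch of the decision rule a monitoring system might implement: alert the duty engineer only after several consecutive failed probes, to avoid flapping on transient network errors. The probe command, threshold, and messages are my assumptions, not the client’s actual configuration:

```shell
# decide_failover: given the number of consecutive failed health probes and a
# threshold, print whether to page the on-call engineer for a manual switch.
decide_failover() {
  fails=$1; threshold=$2
  if [ "$fails" -ge "$threshold" ]; then
    echo "ALERT: primary unreachable, initiate manual switch to standby"
  else
    echo "OK: keep primary"
  fi
}

# In production, each probe might look like (hypothetical URL):
#   curl -fsS --max-time 10 https://tracking.example.com/health || fails=$((fails+1))
decide_failover 1 3   # a single transient blip: no action
decide_failover 3 3   # a sustained outage: page the engineer
```

Requiring several consecutive failures trades a few minutes of detection latency for far fewer false alarms, which matters when the switchover itself is a manual, disruptive step.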
3. Network accessibility—ensuring seamless connectivity
Ensuring continuous user access is critical for any HA/DR solution. One of the biggest challenges in a multi-data center environment is maintaining network continuity when switching between sites.
The ideal approach is to migrate the IP address from the old server to the new one, allowing the system to seamlessly redirect traffic. However, IP address migration is not always feasible, especially in geographically distributed setups like our client’s.
To address this, their disaster recovery plan included a domain name system (DNS) reconfiguration. When switching data centers, they updated the DNS A-record to point to the new server’s IP address. This allowed users and devices configured with the domain name to continue functioning normally. However, devices relying on a hardcoded IP address temporarily went offline and had to be manually reconfigured. Fortunately, the number of such devices was minimal, making this issue manageable.
I hope this helps you understand what to consider when approaching your own HA/DR initiatives. Now let’s look at how we implemented this strategy in practice.
How was the HA/DR solution implemented in practice?
With extensive experience helping clients implement high availability and disaster recovery (HA/DR) solutions, I’ve encountered a wide range of challenges and opportunities. Each business comes with its own infrastructure, risks, and operational needs, requiring a tailored approach to ensure resilience.
Through this experience, I have identified three critical areas that play a decisive role in building an effective HA/DR strategy:
- Database maintenance
- Nginx and Java services
- Network and connectivity
These areas extend beyond technical considerations—they also encompass organizational and operational factors. A well-structured HA/DR solution isn’t just about keeping systems running; it must also be aligned with business objectives, scalable for future growth, and practical for day-to-day operations.
Back to our case, here’s how we approached it.
To provide high availability, the client needed to use the services of another data center, located in a different area of the same city. This is not best practice (normally we would recommend storing the data in another city or in cloud storage), but the solution is still reliable.
First, it was necessary to prepare the servers. They need sufficient resources to ensure stable operation of the platform, so resources similar to those of the existing servers are a reasonable baseline. Our client had two of them: one for the database and the other for the platform application.
What software did we use?
The most critical thing in high availability and disaster recovery is to maintain database integrity and accessibility. There are many known methods and solutions for this, the baseline option being a MySQL cluster.
This is a commonly used and reliable solution, but our client opted for a third-party solution from Percona called XtraDB Cluster. It is purpose-built for database high availability, providing synchronous replication and automatic node switching, so it is fully applicable in this case. Additionally, it is more convenient to configure and maintain than the alternatives (including a regular MySQL cluster).
It is important to note that the standard setup implies three replication nodes for high availability. But it is also common to go with two main nodes and a so-called Galera Arbitrator—not a full-fledged cluster node but a lightweight cluster member that caches transactions and helps determine data consistency between the nodes. With an arbitrator, the minimum number of data-storing nodes drops to two, reducing the cost of maintaining the cluster. This was sufficient for our client.
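For reference, the arbitrator runs as a separate lightweight daemon, garbd. On Debian-style systems its defaults file might look like the sketch below; the cluster name and node addresses are placeholders, not the client's real topology:

```
# /etc/default/garb: Galera Arbitrator settings (illustrative values only)
GALERA_NODES="198.51.100.10:4567 198.51.100.20:4567"   # the two data-storing nodes
GALERA_GROUP="tracking_cluster"                        # must match the cluster name on the nodes
LOG_FILE="/var/log/garbd.log"
```

Because garbd stores no data, it can run on a small machine in a third location, which is exactly what makes the two-node-plus-arbitrator layout cheaper than a full three-node cluster while still providing quorum.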
Having deployed the cluster, the client ensured the most important thing—the security of all telematics and business data. The main task was solved.
Next, it was essential to take care of the platform services. To do this, the following simple steps were taken:
- Install Java and Nginx software on the server.
- Copy the Java services, web server and website directories to the new server.
- Configure the services to operate.
Java services must be enabled to run as systemd services.
Systemd is the service initialization and management subsystem in Linux. It is built into modern distros and requires no separate installation. The configuration files for Navixy services are already present in the copied service directories; all that was required was to enable the services by creating symbolic links in /etc/systemd/system/ and then running a simple command:
systemctl enable api-server sms-server tcp-server
This registers the services with systemd but does not launch them. They must be launched only if the main server fails, using a different command:
systemctl start api-server sms-server tcp-server
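The symbolic-link step mentioned above can be rehearsed safely in a scratch directory. In the sketch below the paths are placeholders standing in for the copied service directory and for /etc/systemd/system; the real systemctl calls are left as comments because they require root on an actual host:

```shell
# Rehearsal of the unit-linking step in a scratch directory. On a real server,
# SRC is the copied service directory and DEST is /etc/systemd/system.
SRC=${SRC:-$PWD/navixy/api-server}        # placeholder path (assumption)
DEST=${DEST:-$PWD/etc-systemd-system}     # stands in for /etc/systemd/system
mkdir -p "$SRC" "$DEST"
touch "$SRC/api-server.service"           # stand-in for the shipped unit file
ln -sf "$SRC/api-server.service" "$DEST/api-server.service"

# On the real host, this is followed by:
#   systemctl daemon-reload
#   systemctl enable api-server sms-server tcp-server
```

Rehearsing the mechanics like this is cheap insurance: during an actual failover nobody should be figuring out link paths for the first time.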
The Java services setup is a common operation and is not hard to implement, but our support is always available to provide assistance with such matters.
As for the website, no specific customizations were needed; Nginx could simply be started with the existing configs.
How do the addresses get switched?
It is important to note that in the case of our client, the backup data center was geographically far from the current one and was served by another telecom company, so migration of the existing IP address in case of disaster was not a feasible option. For this reason, switching the domain to a new IP address was included in the recovery plan.
The entire switchover takes place on the domain authority side. It is not an instantaneous operation and depends entirely on the DNS provider, so it may cause additional delays during recovery, but it was the only reliable way to retain user access in case of an outage.
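To make the switchover concrete, the change amounts to editing a single A record at the DNS provider. A zone-file fragment might look like the sketch below; the domain, addresses, and the deliberately low TTL are illustrative assumptions (keeping the TTL low ahead of time shortens propagation delay during failover):

```
; Illustrative zone fragment, not the client's real records
tracking.example.com.   300   IN   A   203.0.113.10    ; primary data center
; after failover, the record is repointed to the standby site:
; tracking.example.com. 300   IN   A   198.51.100.20   ; standby data center
```

Resolvers are allowed to cache the old answer for up to the TTL, which is the main source of the extra recovery delay mentioned above.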
Since all of our client’s platform configurations were done using the domain name, everything is expected to work the same even if the IP address changes, without revising any of the Navixy services’ configurations. Problems could arise only with devices configured with a hardcoded IP address, but our client had none of those: the whole fleet consisted of Teltonika devices pointed to the domain name.
How does the licensing work?
The final challenge was licensing. The license is valid for only one server, and attempting to run a duplicate instance corrupts the key. Additionally, the license key is dynamic and changes over time, so it cannot be saved once and reused forever.
For our client, constant database replication solved the licensing issue, since both databases contained the same authentication data. If service went down at one site and was urgently launched at the other, the authentication system would treat it as a regular restart of the platform, and the license would remain in effect. For additional verification, you can query the key variable in the database on both servers and compare the results; they should be identical. The query looks like this:
SELECT * FROM google.variables WHERE var='fingerprint';
The result is a license key hash—it is expected to be the same on both servers.
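A small script can automate that comparison. This is a hedged sketch: the helper below simply compares two hash strings, and the commented lines show how the values might be fetched from each node (the hostnames and credentials are assumptions, not the client's actual setup):

```shell
# compare_fp: report whether two license fingerprint hashes match.
compare_fp() {
  if [ "$1" = "$2" ]; then
    echo "match: license will survive a failover"
  else
    echo "MISMATCH: check database replication before relying on DR"
  fi
}

# On a live cluster the hashes would come from each node, e.g. (hosts assumed):
#   fp1=$(mysql -h db-primary -N -e "SELECT * FROM google.variables WHERE var='fingerprint';")
#   fp2=$(mysql -h db-standby -N -e "SELECT * FROM google.variables WHERE var='fingerprint';")
compare_fp "$fp1_example" "$fp2_example"
```

Running such a check as part of regular DR testing gives early warning if replication has silently drifted, rather than discovering a license problem in the middle of an outage.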
Additionally, it was agreed that we would reissue the license key whenever necessary. Licensing issues are always of the highest priority for us, so a timely response is guaranteed.
Final word: How regular maintenance and monitoring prevent downtime and ensure reliability
While implementing high availability and disaster recovery is a complex practice in itself, it is not enough to simply set it up and leave it unattended. High availability architectures require constant maintenance and monitoring to function effectively. That is why our client keeps a staff of qualified technicians who maintain the entire server infrastructure in good working order, as well as duty engineers responsible for monitoring and taking emergency measures. Additionally, the client performs regular DR tests of failover and recovery procedures, which is essential to ensure they work as expected in the event of a real failure.
Maintenance and monitoring are critical components of high availability and disaster recovery solutions. They ensure that the system remains reliable and effective in the face of potential failures, and enable proactive intervention to prevent or minimize the impact of downtime. Regular maintenance and monitoring also help to ensure compliance with regulatory requirements and industry standards for data protection and business continuity.
We hope that our client's successful experience described here will encourage you to consider the high availability and disaster recovery options available in your infrastructure. As noted earlier, this should be a locally tailored solution, depending on your capabilities and the resources available to you. This way you can not only maintain customer access in all unpredictable situations, but also maximize customer confidence in your business and meet the most demanding requirements for data availability and security.