A server does not need oil changes or tyre pressure checks, but in many ways owning computer equipment that acts as a server is a bit like owning a vehicle. Driving a car does not require engineering skills, and running a server does not require you to be a hardware engineer or software developer.
With a car you need to get some basics right: those oil changes and tyre pressure checks, plus the occasional service. In the same way, a server needs routine maintenance; it is not just a matter of paying monthly hosting fees. In fact, your server requires far more regular attention than a vehicle does.
You don’t need to be an engineer to do this maintenance, but you should know that a server that runs 24/7 serving millions of clients will need a server maintenance plan. Maintenance rarely involves physical wear and tear: you’re not going to hear your server squeaking away as it searches for files. But on a software level there is a layer of wear and tear. Let’s take a look.
Why you need a server maintenance plan
The moving parts in your server often last for its whole working life; nobody opens up hard drives and oils their bearings, for example. At worst you may need to replace a fan or two, but even these rarely give up the ghost. However, servers do incur “mileage” in a software sense of the word.
Over time, your server will build up large repositories and records, including cache files, which can slow down transaction rates. The fragmentation of SQL tables over time is an issue too. As transaction volume builds, old server settings may no longer be valid, and software that is left unpatched becomes a soft target for attackers. Finally, both HDDs and SSDs eventually degrade, though this happens over a long period of time.
What happens when server degradation occurs? Well, at best you can suffer from slower server performance, which can cause glitches in your workload and lead to unhappy customers. At worst you can face heavy data corruption and data loss, or data theft due to hacking. Thankfully most of the server maintenance problems we pointed out can be managed away using a server maintenance plan.
Server maintenance plans: an introduction
We said earlier that servers are like cars in many ways: they don’t need the same physical maintenance, but they do need software maintenance. Just like with your car, some maintenance tasks will be urgent and need frequent attention, while others need only an annual review. You won’t check your engine and lights every month, but you will check your tyre pressure at least once a month, for example. Let’s look at some intervals for server maintenance:
Daily server checks
There are a bunch of things you need to check every day when you are responsible for a server maintenance plan. First, check for updates, including your virus scanner’s database and other critical software updates, which close known vulnerabilities before attackers can exploit them.
In fact you should closely look at vulnerability statements made by software and hardware vendors so that you can patch your servers against attacks. Also watch your security logs for evidence of intrusion attempts so that you have the opportunity to block these users.
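As a rough illustration of the daily log review described above, the sketch below counts failed logins per source address. It assumes OpenSSH-style "Failed password" lines in your authentication log; the function names and the threshold of five attempts are hypothetical choices for the example, not part of any standard tool.

```python
import re
from collections import Counter

# Assumption: log lines follow the OpenSSH pattern
# "Failed password for [invalid user] <name> from <address> port ...".
FAILED_LOGIN = re.compile(r"Failed password for (?:invalid user )?\S+ from (\S+)")


def count_failed_logins(lines):
    """Return a Counter mapping source address -> number of failed logins."""
    hits = Counter()
    for line in lines:
        match = FAILED_LOGIN.search(line)
        if match:
            hits[match.group(1)] += 1
    return hits


def suspicious_sources(lines, threshold=5):
    """Return addresses with at least `threshold` failures, busiest first."""
    counts = count_failed_logins(lines)
    return [address for address, n in counts.most_common() if n >= threshold]
```

You could feed it the lines of your authentication log and review, or block, whatever addresses it returns.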
Weekly server checks
Less frequently, you should verify that your backups are working. It’s not necessary to do this on a daily basis: it is unlikely that your backups will suddenly stop working at the very moment you need them. Nonetheless, a weekly check is essential.
Another check you should do weekly, or at least every two weeks, is your disk usage. Again, disk usage rarely changes suddenly, so it’s not something you need to check every day. However, running out of disk space can bring your server down. Watch for problems like stale accounts and outdated temp files.
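A disk usage check like the one above is easy to script. Here is a minimal sketch using Python's standard library; the 80% and 90% thresholds are assumptions for illustration and should be tuned to how quickly your data grows.

```python
import shutil


def disk_usage_percent(path="/"):
    """Return the percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def classify_usage(percent, warn_at=80.0, critical_at=90.0):
    """Map a usage percentage to 'ok', 'warning' or 'critical'.

    The default thresholds are illustrative assumptions, not recommendations.
    """
    if percent >= critical_at:
        return "critical"
    if percent >= warn_at:
        return "warning"
    return "ok"
```

Calling `classify_usage(disk_usage_percent("/"))` once a week gives you a simple status you can log or mail to yourself.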
Monthly server checks
We recommend that you optimize your database at least every two months, and ideally monthly. Database fragmentation can build up at a rate of up to 5% per month, and over time it will really hit performance. Tuning individual applications is also important because unoptimized apps can hurt performance.
However, because traffic levels can vary a lot, it can be useful to limit application tuning to once every two months so that you have enough data for a good measure of app load levels.
Real-time server checks
We’ve listed plenty of points you need to check every day, but some checks must be done in real time. In other words, throughout the day. These server health data points can signal when the load is spiking, and noticing problems early can help prevent a complete server failure. Downtime is costly.
Most of these factors are easy to check using a server monitoring tool; in fact, you can even set up automated alerts. You can check, for example, CPU and overall server temperature, the health of your RAID volumes and load factors including the number of open network connections.
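To make the idea of automated alerts concrete, here is a small sketch of the threshold logic a monitoring tool applies. The metric names and limits are hypothetical; `os.getloadavg()` is a real standard-library call but is only available on Unix-like systems, so treat `sample_load()` as an assumption about your platform.

```python
import os


def load_per_core(load_1min, cores):
    """Normalise the 1-minute load average by the number of CPU cores."""
    return load_1min / max(cores, 1)


def check_metrics(metrics, limits):
    """Return one alert message for every metric that exceeds its limit.

    `metrics` and `limits` are plain dicts keyed by metric name; metrics
    without a reading are skipped rather than treated as failures.
    """
    alerts = []
    for name, limit in limits.items():
        value = metrics.get(name)
        if value is not None and value > limit:
            alerts.append(f"{name}={value} exceeds limit {limit}")
    return alerts


def sample_load():
    """Read the current per-core load (Unix-like systems only)."""
    load_1min, _, _ = os.getloadavg()
    return load_per_core(load_1min, os.cpu_count() or 1)
```

A real monitoring tool does the same comparison continuously and sends the alert by email or chat instead of returning a list.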
How checks turn into a plan
So what is a server maintenance plan in reality? Well, a maintenance plan is simply a fixed schedule which outlines which of the above checks are done on a real-time, daily, weekly or monthly basis. Doing it is not that hard: large operators will have in-house technicians, while smaller businesses can rely on remote staff or another company to do it.
But if you are all on your own don’t despair: you can build your own server maintenance plan, and it is not difficult at all.
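Since a plan is just a fixed schedule, you can write it down as data. The sketch below groups the checks from the previous sections by interval and works out what is due on a given date. The choice of Monday for weekly checks and the first of the month for monthly checks is an assumption made for the example.

```python
from datetime import date

# The checks listed earlier in this article, grouped by interval.
PLAN = {
    "daily": ["apply critical updates", "review vendor advisories", "review security logs"],
    "weekly": ["verify backups", "check disk usage"],
    "monthly": ["optimize database", "tune applications"],
}


def checks_due(today, weekly_weekday=0, monthly_day=1):
    """Return the checks due on `today`.

    Assumption: weekly checks run on `weekly_weekday` (0 = Monday) and
    monthly checks on day `monthly_day` of the month.
    """
    due = list(PLAN["daily"])
    if today.weekday() == weekly_weekday:
        due += PLAN["weekly"]
    if today.day == monthly_day:
        due += PLAN["monthly"]
    return due
```

Real-time checks are left out on purpose: they belong in a monitoring tool rather than a calendar.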
Building a server maintenance plan
A good starting point is to classify your maintenance activities according to what you are trying to achieve with each activity and to move on from there. In this article, we will split them into three areas.
First, we’ll look at the actions you need to take to respond when there is an emergency; call it an emergency response plan. These include steps such as getting alerts when there is an emergency, and the ability to rapidly restore service when something does go wrong.
Next, we will consider steps you should take that can prevent emergencies from occurring in the first instance. For example, you can proactively do security checks, analyse performance numbers and check the usage of your server resources.
Finally, we will look at some actions that act as a type of insurance in case you experience a server problem. These activities, including auditing your backups and doing fail-over checks, will make sure you can rapidly restore your server if the need arises.
Responding to emerging problems: what you need to look out for
Different vehicles have different points of failure: a rocket has likely failure points that are very different from those on a racing bike. In the same way, different servers have different root causes for failure: the reasons why a mail server could fail are very different from the reasons why a web server will fall over.
For this reason we can’t suggest a single plan that tells you exactly what you need to monitor to make sure you respond quickly in an emergency. Instead, we’ll guide you in the right direction by outlining what you should consider. We will use a web server as a typical example.
Problems with server capacity and user demand
Your server is not built to manage unlimited demand: it has a capacity limit. Sometimes demand can rise unexpectedly: perhaps someone sent out a wildly popular email to a million people, or something on social media triggers demand. This can cause memory overload, disks that can’t respond and a server which does not serve pages.
Similarly, in environments where hosting is shared, some users can run applications which draw an enormous amount of resources. In fact, some users can unintentionally abuse server resources by not watching the amount of server load they generate.
Finally, sometimes server overload is caused by coding errors. Scripts that are not well written can cause memory leaks and other problems with resources. As part of your server maintenance plan you must watch out for both scripts and users that consume more than their fair share of server resources, while simultaneously keeping an eye on overall server utilisation.
Server attacks and malware
We live in an age where server attacks are incredibly common. These can come in several different shapes. For example, bots can try to brute force entry into your machine and the thousands of simultaneous queries this involves will cause capacity issues. A successful attack can lead to unauthorised access to your machine.
Malware is another big threat. Software injections via undisclosed or unpatched vulnerabilities can allow hackers to gain entry to your machine, again giving unauthorised access and potentially leading to your server being used as a staging site for attacks on other machines.
Aside from the risks of unauthorised access including data loss and capacity issues, these attacks can lead to a loss of reputation: in other words, your server can be excluded from search engine results and you will find that your traffic drops precipitously. Watch out for attacks as part of your server maintenance plan.
Errors and failures
Servers are highly connected devices: both internally on a hardware and software basis and externally. Watch out for network problems, including broken connections to database backends or other apps that your server relies on.
Hardware is another point you need to watch. Ensure that your RAID volume stays healthy, for example, and watch key indicators such as CPU and chassis temperature. Finally, if a redundant power supply fails, replace it immediately, and do likewise with RAID volume issues.
In essence you need to monitor server statistics on all levels: network traffic, utilisation, loads and more so that you can notice when something is unusual. Only then can you investigate further. However it helps to have a plan that you can put into place when you notice an emergency situation developing.
Preventative maintenance: the key to avoiding problems
We’ve outlined what you need to be on the lookout for when it comes to monitoring emerging problems, but prevention is better than cure. Again, it depends slightly on what server you are running, but let’s look at some of the preventative maintenance you can add to your server maintenance plan where the server in question is a database server.
Defragment and check indexes and integrity
Databases involve an enormous volume of read and write operations which need to be handled quickly, and as a result a database can become fragmented. Delete queries in particular can lead to fragmentation, which is why it is important to regularly optimize the tables in your database: this reduces the fragmentation that causes performance problems and wastes free space.
Likewise, your preventative server maintenance plan should regularly do an index analysis to keep the indexes which MySQL is so reliant on in good shape. MySQL has an ANALYZE TABLE statement which you should run on a monthly basis. It refreshes the index statistics that MySQL uses to plan queries, which helps ensure that queries are executed quickly.
Database integrity can be an issue too: MySQL sometimes loses track of data sets as a result of database crashes and other app errors. Weekly checks of database integrity can prevent queries from failing, as they give you an opportunity to find and fix errors early.
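The three maintenance tasks above map onto three MySQL statements: CHECK TABLE, ANALYZE TABLE and OPTIMIZE TABLE. The sketch below only builds the statement text for a list of tables; the helper names are hypothetical, and how you execute the statements (the mysql client, a scheduled job, a database driver) is left open.

```python
def quote_identifier(name):
    """Quote a MySQL identifier with backticks, doubling any embedded backtick."""
    return "`" + name.replace("`", "``") + "`"


def maintenance_statements(tables):
    """Return CHECK, ANALYZE and OPTIMIZE statements for each table, in that order.

    Integrity is checked first so that a damaged table is noticed before
    statistics are refreshed or the table is reorganised.
    """
    statements = []
    for table in tables:
        ident = quote_identifier(table)
        statements.append(f"CHECK TABLE {ident};")
        statements.append(f"ANALYZE TABLE {ident};")
        statements.append(f"OPTIMIZE TABLE {ident};")
    return statements
```

Keep in mind that OPTIMIZE TABLE can rebuild the whole table, so on large tables it is usually best scheduled for a quiet period.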
Check disk health and space
Just like database integrity, you can’t take disk health for granted. Always make sure you check your server logs because this is where you will find notices of HDD and RAID errors. These errors offer an indication of looming hard drive or RAID volume failure, giving you the opportunity to replace a drive before it brings down your server.
It’s not unknown for a server to fall over because it has run out of drive space. You must leave room for your database to increase in size, for backups to take place and for large database transactions to get processed. Free up space by removing temporary files, backups which are no longer relevant and other stale data.
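Finding stale files is another task that scripts well. This is a minimal sketch that lists, but deliberately does not delete, files older than a given age; the 30-day figure in the usage note is an assumption, and you should review the list before removing anything.

```python
import time
from pathlib import Path

SECONDS_PER_DAY = 86400


def is_stale(mtime, now, max_age_days):
    """Return True if a file last modified at `mtime` is older than `max_age_days`."""
    return now - mtime > max_age_days * SECONDS_PER_DAY


def stale_files(directory, max_age_days, now=None):
    """Return a sorted list of files under `directory` older than `max_age_days`.

    `now` is a Unix timestamp; it defaults to the current time and exists
    mainly so the function can be tested with a fixed clock.
    """
    now = time.time() if now is None else now
    return sorted(
        path
        for path in Path(directory).rglob("*")
        if path.is_file() and is_stale(path.stat().st_mtime, now, max_age_days)
    )
```

For example, `stale_files("/tmp", 30)` would list temporary files untouched for more than 30 days.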
Cluster efficiency is important too: database clusters should sync efficiently if you want to prevent slow-running queries and database errors. Again, early detection is key as it can prevent a costly database crash.
Scrutinise SQL logs
Your MySQL server will log errors when it finds table corruption or problems with indexes. Auditing your logs will ensure that you get an early warning of possible database failure: an error-filled log is a sure warning sign.
Slow queries are another point to watch out for. Aside from highlighting overall performance issues, the slow query log also indicates which specific queries are causing performance problems, allowing you to tweak these to improve server performance.
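If you want to pick the worst offenders out of a slow query log yourself, a rough sketch follows. It assumes the usual text layout of the MySQL slow query log, where each entry has a "# Query_time: ..." header followed by a "SET timestamp" line and the statement itself; if your log format differs, the pattern will need adjusting.

```python
import re

# Assumption: each entry carries a header line such as
# "# Query_time: 3.200000  Lock_time: 0.000100 Rows_sent: 1  Rows_examined: 500000".
QUERY_TIME = re.compile(r"^# Query_time: (\d+(?:\.\d+)?)")


def slowest_queries(lines, top=3):
    """Return up to `top` (seconds, statement) pairs, slowest first."""
    entries = []
    current_time = None
    statement = []
    for raw in lines:
        line = raw.strip()
        match = QUERY_TIME.match(line)
        if match:
            # A new entry starts: store the previous one, if any.
            if current_time is not None and statement:
                entries.append((current_time, " ".join(statement)))
            current_time = float(match.group(1))
            statement = []
        elif (
            current_time is not None
            and line
            and not line.startswith("#")
            and not line.startswith("SET timestamp")
        ):
            statement.append(line)
    if current_time is not None and statement:
        entries.append((current_time, " ".join(statement)))
    return sorted(entries, key=lambda entry: entry[0], reverse=True)[:top]
```

The queries it returns are the first candidates for better indexes or a rewrite.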
Finally, a monthly health check on your server speeds will give you a record to go back on so that you can detect when your server is starting to experience bottlenecks. You can then fix these bottlenecks more easily before more serious issues emerge.
Overall you will need a degree of server management experience to really understand what it is about server performance that can throw up a red flag, indicating that a potential problem is approaching. Whether you run a web server, a DB server or something else, preventative maintenance is key.
Disaster recovery: building a plan to get up and running
Preventative plans are key to avoiding disaster, but even the best-run server environments occasionally face disasters. How do you respond? Clearly, the most important objective is getting things running again.
With a thoroughly thought-out disaster recovery plan you can be up and running in a minute or less. A turnaround this quick is not necessary for every use case: some website owners will see no great harm if their site is down for an hour or two. For others, every minute of downtime is lost revenue.
There is a wide range of options that can minimize downtime. These include high availability clusters, which are great at ensuring business continuity. Hardware with fault tolerance, including redundant power supplies, can work alongside fail-over mirrors to ensure that hardware failure never results in long downtime.
Crucial to disaster recovery: your backups
Some of the points we mentioned in the previous paragraph are expensive to implement, and outside the reach of many website operators. But one point is crucial to a sane server maintenance plan. It’s to do with your backups.
First, make sure your backups are in fact completing every day. Check for errors and ensure your backup tool reports the right status. Next, you need to check that your backups can be restored: can you retrieve the data, is there any corruption? Always monitor your available disk space as this is a prime reason for backups to fail. Finally, do a test run on the recovery process to verify how long it takes and whether it succeeds in the first instance. Watch out for unexpected glitches such as problems with connectivity that could make a recovery difficult.
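Two of the backup checks above are easy to automate: confirming that the latest backup is recent, and confirming that a restored file is identical to the original. The sketch below is one way to do it with checksums; the 26-hour freshness window is an assumption that allows a daily backup some slack.

```python
import hashlib


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_matches(original_path, restored_path):
    """Return True if the restored file is byte-for-byte identical to the original."""
    return sha256_of(original_path) == sha256_of(restored_path)


def backup_is_fresh(backup_mtime, now, max_age_hours=26):
    """Return True if a backup finished within the last `max_age_hours`.

    Both arguments are Unix timestamps; the 26-hour default is illustrative.
    """
    return now - backup_mtime <= max_age_hours * 3600
```

A matching checksum only proves that one file survived the round trip; a full test restore is still the real proof.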
Settling on your recovery plan
Finally, in deciding how to set up your recovery plan and how much to invest in it, you should carefully think through your application’s requirements. Start by thinking about how much downtime you can tolerate: how quickly do you need to restore services before the damage becomes intolerable?
Next, figure out what plans, software and finally what hardware you need to get your disaster recovery plan in place. In doing so you can weigh the trade-offs you can accept against those you cannot. But whatever you do, always ensure you check and verify your backup strategy.
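It can help to put numbers on your downtime tolerance. The sketch below converts an availability target into a downtime budget and estimates what an outage costs; the function names and the revenue figure you pass in are your own assumptions.

```python
def allowed_downtime_minutes(availability_percent, period_days=30):
    """Return the minutes of downtime an availability target allows per period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (100 - availability_percent) / 100


def downtime_cost(minutes_down, revenue_per_minute):
    """Estimate lost revenue for an outage, assuming revenue accrues evenly over time."""
    return minutes_down * revenue_per_minute
```

For example, a 99.9% target over a 30-day month leaves a budget of roughly 43 minutes of downtime, which tells you how fast your recovery process needs to be.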