Fault finding is an incredibly subjective topic. I can’t think of a perfect fault finding process, but there are a number of hints and best practices that a network engineer will pick up over time. From my recent TSHOOT studies, Cisco defines the fault finding process as:
- Define the Problem
- Gather Information
- Analyse Information
- Eliminate Possible Problem Causes
- Formulate a Hypothesis
- Test Hypothesis
- Solve the Problem
This is a good starting point for discussing my thoughts about fault finding in a network.
I try to make sure I understand the issue that the user or system has presented. If the fault description is too vague, I ask what other details would be useful. Useful details might be:
- Specific IP Address/Hostname
- Specific Times, e.g. did the fault occur during a scheduled backup period, or were there other network issues at the time of the fault?
- Is it access to a particular service or a general connectivity issue?
- Is it intermittent or has it been an issue for some time?
- Are there many sites experiencing this fault or is it just this site?
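One way to keep track of which of these details are still outstanding is a small checklist structure. The following is only a minimal sketch in Python: the `FaultReport` class, its field names and the `missing_details` helper are all made up for illustration and are not part of any Cisco process or tool.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class FaultReport:
    """Details worth collecting before troubleshooting (hypothetical field names)."""
    target: Optional[str] = None          # specific IP address or hostname
    time_window: Optional[str] = None     # when it happened, e.g. during a backup
    scope: Optional[str] = None           # one service, or general connectivity
    pattern: Optional[str] = None         # intermittent, or ongoing for some time
    sites_affected: Optional[str] = None  # just this site, or many sites


def missing_details(report: FaultReport) -> List[str]:
    """Return the names of the fields that still need a follow-up question."""
    return [f.name for f in fields(report) if getattr(report, f.name) is None]


# Example: the user gave an address and said the fault comes and goes.
report = FaultReport(target="10.1.1.20", pattern="intermittent")
print(missing_details(report))  # ['time_window', 'scope', 'sites_affected']
```

Anything left in the returned list is a question you may still want to ask, if asking will not hold up the fix.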
Sometimes you will know that chasing more detailed information will only cause delay, so you will have to rely on your instinct and experience.
Once the “people” side of the fault definition is good enough, it’s time to use technical tools.
You need to isolate the fault.
This can be easy, but it is usually difficult, especially if there are multiple faults compounding the root issue. Often you will be in information overload and will need to rapidly pick out the useful data to make the correct decisions and ultimately restore normal operation of the service.
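As a rough illustration of isolating a fault, the sketch below walks an ordered list of hops from source to destination and reports the first one that fails a reachability check. Everything here is an assumption for the sake of the example: the hop names are invented, and `is_reachable` is a stand-in for whatever check you actually use (a ping, an SNMP poll, a login attempt).

```python
from typing import Callable, Optional, Sequence


def first_failing_hop(
    path: Sequence[str],
    is_reachable: Callable[[str], bool],
) -> Optional[str]:
    """Walk the path in order and return the first hop that fails the check.

    Returns None if every hop on the path passes.
    """
    for hop in path:
        if not is_reachable(hop):
            return hop
    return None


# Hypothetical path, with canned results standing in for real checks.
path = ["access-sw1", "dist-sw1", "core-rtr1", "wan-rtr1", "remote-srv"]
responding = {"access-sw1", "dist-sw1", "core-rtr1"}

print(first_failing_hop(path, lambda hop: hop in responding))  # wan-rtr1
```

The result only narrows down where to look next. A hop that fails the check may simply be filtering the probe, and with multiple faults there can be more problems further along the path.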
Having a thorough understanding of the network and how it behaves under normal conditions is very valuable knowledge for isolating the fault. This understanding can come from many different sources:
- Detailed Network Documentation
- Baselines from Network Management Tools
- Commissioning/Maintenance Documentation
- Experience and history with the network
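To make "normal conditions" a bit more concrete, here is a minimal sketch of comparing a current measurement against a recorded baseline. The sample round-trip times, the function name and the three-standard-deviation threshold are all assumptions chosen for illustration; a real NMS will have its own (and better) way of baselining.

```python
from statistics import mean, stdev
from typing import Sequence


def deviates_from_baseline(
    baseline: Sequence[float],
    current: float,
    threshold: float = 3.0,
) -> bool:
    """Return True if `current` is more than `threshold` sample standard
    deviations away from the mean of the baseline measurements.

    Needs at least two baseline values (statistics.stdev requires that).
    """
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline: any difference counts as a deviation.
        return current != mu
    return abs(current - mu) > threshold * sigma


# Made-up round-trip times (ms) recorded while the link was healthy.
baseline_rtt_ms = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3]

print(deviates_from_baseline(baseline_rtt_ms, 12.2))  # False
print(deviates_from_baseline(baseline_rtt_ms, 48.0))  # True
```

A check like this only says that something looks different from normal, not why, so it is a prompt for further investigation rather than a diagnosis.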
There will be times when the documentation is wrong, missing or confusing. If you have a decent NMS then this stage will be easier: you will be able to use graphs, Top N reports, baselining and all kinds of magic that I don’t usually have. If you are stuck with bad documentation and broken network management tools, you will have to drop into diagnostic commands (e.g. ping, traceroute) and then look at individual devices (e.g. show log, show proc cpu history). I will detail ways to look through tool outputs in future posts, but in the interest of keeping this brief I’d like to stick with the general process.
The key takeaway here is that the more useful data you can collect and analyse, the better your fault finding output will be. There have been many times where a wild “shoot from the hip” guess has at least yielded a workaround for the fault, but that is not the way to go about fault finding in groups or in serious outages.
Often the following questions must be asked:
- “Can I get a quick fix to get this up and running?”
- “Do I have the time to try and understand the issue or is the goal to get the system working again?”
- “Is this a recurring issue?”
There are no universally correct answers to the above questions. The right answers depend on your organisation and its processes, so nothing I say here can settle them for you.
It is very important to understand what the business expectations are in these situations and who has the authority to approve any required outages. These things are best understood before the fault occurs. If the process is vague or undefined, then I strongly advise that your immediate supervisor/manager makes the call.
Hopefully at this point you have a good idea of where the issue is and, ideally, what the issue is, or at least a workaround to get services working, and you have the appropriate approvals to begin resolving the fault.
If your fix doesn’t work, this is a good time to go through the issue with a colleague, if you haven’t already done so, to see if there is anything you are missing. Depending on the criticality of the fault, you may need to escalate to the vendor or to a service partner if it is outside your level of skill.
With time and experience your fault finding skills will improve and you will be able to find the root cause of issues with greater speed and accuracy. This is where you need to understand your limits and know when to ask for help and when to keep on pushing through to find the fix.
Hopefully you get to the point where the service is working at an acceptable level, even if it is only a temporary fix. Be sure to advise the people you work with of the change. Often an email is all that is required to make sure the team is aware of the issue and the fix. Sometimes other alerts are useful, such as a Message of the Day (MOTD) banner. If this is a root issue with a standard configuration, make those changes to the standard documentation and inform the users of that documentation.
The point of this post is not to tell you how to pass TSHOOT or to create bulletproof fault finding voodoo. This is what I do and it works for me, so take some time out to think about how you fault find and then see if anyone else could benefit from your insight.
My next few posts will centre around how to isolate faults, identify useful fault finding data and list out some protips I have picked up over the years.