While working a support contract for a geographically dispersed government ministry, we had no technicians on site, nor did we have an accurate network topology map. It was primarily an NT network running on about a dozen LANs connected by various links. The network ran TCP/IP and had a DHCP server at each site.
I received a call from a remote office where the user reported that she could not access a remote VAX system which she uses
only a couple of times per month.
The problem could be anything from the user to the mainframe. Where to start?
Here is how it went:
- have user ping VAX hostname -- fail
- have user ping VAX IP -- fail
- have user traceroute to VAX IP -- fail
- ping the VAX from HO (head office) -- pass
- have user ping 127.0.0.1 -- pass
- confirm user is logged into the NT Domain
- user pings her router -- fail
- user pings local PCs in her office -- fail
- So there is no access beyond the NIC, yet both the NIC and the hub show green link lights.
- I have seen bad cables that still provide a link signal. There are no spare cables on site, so we start simple...
- change the hub port the PC uses -- fail
- change to a known-good port by unplugging a working system's cable (the user knew which one :) -- fail
- have user run ipconfig (the Windows IP configuration utility). It shows an IP of 0.0.0.0 (should have done that sooner, or had her ping her own hostname)
- user tries the Renew IP option -- fails, cannot renew
- at this point I am informed another user has similar problems (always ask questions, early!)
- after a few more questions it turns out they had both been away, with their PCs powered off most of the week.
- check the DHCP pool -- there are still numbers available
- look closer at the remote DHCP server to see if we can renew the IP from the server side.
- the C: drive on the server was full. There was therefore no room to write the DHCP scope usage, so the DHCP server had no way
to track IP allocation, and it quit handing out leases.
- cleared some room on the server and the problem was solved.
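The ladder of checks above works bottom-up through the stack: loopback proves the TCP/IP stack, local pings prove the segment, the router proves the gateway, and so on. A minimal sketch of that reasoning (the check names and the results dict are my own illustration, fed in by hand rather than gathered from live pings):

```python
# Sketch of the layered troubleshooting ladder from the call above.
# Results are supplied as a dict of hypothetical outcomes, not live pings.

def diagnose(results):
    """Map ping/ipconfig results to the most likely fault layer."""
    if results["ping_loopback"] != "pass":
        return "TCP/IP stack broken on the PC"
    if results["ip_address"] == "0.0.0.0":
        return "no DHCP lease -- check the DHCP server"
    if results["ping_local_pcs"] != "pass":
        return "no access beyond the NIC -- cable, port, or NIC"
    if results["ping_router"] != "pass":
        return "local segment OK, but the router is unreachable"
    if results["ping_remote_ip"] != "pass":
        return "routing problem between sites"
    if results["ping_remote_hostname"] != "pass":
        return "name resolution problem"
    return "host-level problem on the remote system"

# The results reported during the actual call:
observed = {
    "ping_loopback": "pass",
    "ip_address": "0.0.0.0",
    "ping_local_pcs": "fail",
    "ping_router": "fail",
    "ping_remote_ip": "fail",
    "ping_remote_hostname": "fail",
}
print(diagnose(observed))  # no DHCP lease -- check the DHCP server
```

Running ipconfig earlier would have short-circuited the whole ladder, which is exactly the lesson of the story.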
In a similar environment to the one above, a user called saying he could not check his email.
- have user try to access the Internet -- fail (the email server was outside their gateway)
- user cannot access LAN drives
- I ping his PC -- fail
- he pings my PC -- fail
- user pings other local PCs -- pass
- Now I realize this is a sub office. The LAN drives are on a server at a nearby shared site.
- I traceroute to his PC -- fail
- he traceroutes to my PC -- fails, but at a different IP than the trace from my side.
- another user tries with the same results. This is office-wide.
- called the organization that maintains the WAN. They confirm my traceroute & ping results.
- at this point they took over the call. They later reported that there is a radio bridge between the two local offices, and the
bridge at the far side was not powered up. It turns out the power cord had been dislodged.
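The key clue was that traceroutes from each end died at different hops: the break sits between the last hop each side could reach. A sketch of that bracketing logic (the hop addresses here are invented for illustration):

```python
def last_reachable(hops):
    """Return the last hop that answered; '*' marks a timed-out hop."""
    answered = [h for h in hops if h != "*"]
    return answered[-1] if answered else None

def localize_break(trace_a_to_b, trace_b_to_a):
    """Bracket the failing link between the two last-reachable hops."""
    return (last_reachable(trace_a_to_b), last_reachable(trace_b_to_a))

# Illustrative hop lists (addresses are made up):
from_my_side = ["10.1.0.1", "10.1.9.1", "*", "*"]   # dies after 10.1.9.1
from_his_side = ["10.2.0.1", "*", "*"]              # dies after 10.2.0.1

print(localize_break(from_my_side, from_his_side))
# The fault sits on the link between those two hops -- here, the radio bridge.
```

When both endpoints can be tested, this bracketing narrows an office-wide outage to a single link before anyone drives out to check hardware.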
Several weeks ago I was getting periodic failures on my file server. The
SCSI card would scroll errors across the console about non-reentrant errors
and disk 0 problems. The server would take about 10 minutes to shut down because it
couldn't access the SCSI chain. The server had a very old (7 or 8 yrs) Future
Domain SCSI host adapter card, so I figured it had outlived its life span.
My workstation had an Adaptec 2948U card. I put an ATA100 card and drive in
the workstation and migrated the Adaptec card, its attached drive, and 2 SCSI CD
components to the file server.
Yesterday I found I couldn't ssh to the server. Moving to the console, I saw
the same SCSI errors scrolling across the screen. I figured the drive must
be failing, since it was now on a different host adapter (SCSI card). I painfully got
the thing to shut down properly, and on bootup noticed 2 SCSI devices were
missing: the CD and the drive migrated from my workstation. So it wasn't the
drive, and it wasn't the card, as these were different from the card and drive
that had originally been suspects. What could cause this? Was something
frying my SCSI cards?
Opening the case, I was reminded of the cramped quarters in there. The server
has 3 SCSI drives, 1 IDE drive, 1 SCSI CD-ROM, 2 IDE CDs and a SCSI tape drive. On a
hunch I disconnected a handful of power splitters, disabling the 2 IDE
CD-ROMs. I powered it up and all the SCSI devices were there again. It appears
the power supply couldn't cut it, and instead of failing outright it just dropped 1 or
2 of the leads.
I also learned that although a tape backup is good, it is of little value if
you have no hard drive to restore to :-0
So life is good again, although I am rethinking my current layout of keeping
my home directory on the file server.... if the "/home" mount fails, only root
can log on.
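One possible mitigation, purely my own suggestion rather than anything from the story above: if /home lives on a separate or remote device, marking the entry `nofail` in /etc/fstab lets the machine finish booting when that mount is absent, and keeping root's home on the root filesystem preserves an emergency login. The device name below is an assumption for illustration:

```
# /etc/fstab -- illustrative entry; device and filesystem type are assumptions
/dev/sdb1   /home   ext2   defaults,nofail   0   2
```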