While working a support contract for a geographically dispersed government ministry, we had no technicians on site, nor did we have an accurate network topology map. It was primarily an NT network running on about a dozen LANs connected by various links. The network ran TCP/IP and had a DHCP server at each site.
I received a call from a remote office where the user reported that she could not access a remote VAX system which she uses
only a couple of times per month.
The problem could be anything from the user to the mainframe. Where to start?
Here is how it went:
- have user ping VAX hostname -- fail
- have user ping VAX IP -- fail
- have user traceroute to VAX IP -- fail
- ping the VAX from HO (head office) -- pass
- have user ping 127.0.0.1 -- pass
- confirm user is logged into the NT Domain
- user pings her router -- fail
- user pings local PCs in her office -- fail
- So there is no access beyond the NIC, yet both the NIC and the hub show green link lights.
- I have seen bad cables that still provide a link signal. There are no spare cables on site, so we start simple...
- change the hub port the PC uses -- fail
- change to a known-good port by unplugging a working system's cable (the user knew which one :) -- fail
- have user run ipconfig (the Windows IP configuration utility). It shows an IP of 0.0.0.0 (should have done that sooner, or had her ping her own hostname)
- user tries the Renew IP option -- fails, cannot renew
- at this point I am informed another user has similar problems (always ask questions, early!)
- after a few more questions it turns out they had both been away, with their PCs powered off most of the week.
- check the DHCP pool -- there are still numbers available
- look closer at the remote DHCP server to see if we can renew the IP from the server side.
- the C: drive on the server was full. There was therefore no room to write the DHCP scope usage, so the DHCP server had no way
to track IP allocation, and it quit handing out leases.
- cleared some room on the server and the problem was solved.
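The ladder of checks above works bottom-up through the stack: loopback proves the TCP/IP stack, local pings prove the segment, the router proves the gateway, and so on. A minimal sketch of that reasoning (the check names and the results dict are my own illustration, fed in by hand rather than gathered from live pings):

```python
# Sketch of the layered troubleshooting ladder from the call above.
# Results are supplied as a dict of hypothetical outcomes, not live pings.

def diagnose(results):
    """Map ping/ipconfig results to the most likely fault layer."""
    if results["ping_loopback"] != "pass":
        return "TCP/IP stack broken on the PC"
    if results["ip_address"] == "0.0.0.0":
        return "no DHCP lease -- check the DHCP server"
    if results["ping_local_pcs"] != "pass":
        return "no access beyond the NIC -- cable, port, or NIC"
    if results["ping_router"] != "pass":
        return "local segment OK, but the router is unreachable"
    if results["ping_remote_ip"] != "pass":
        return "routing problem between sites"
    if results["ping_remote_hostname"] != "pass":
        return "name resolution problem"
    return "host-level problem on the remote system"

# The results reported during the actual call:
observed = {
    "ping_loopback": "pass",
    "ip_address": "0.0.0.0",
    "ping_local_pcs": "fail",
    "ping_router": "fail",
    "ping_remote_ip": "fail",
    "ping_remote_hostname": "fail",
}
print(diagnose(observed))  # no DHCP lease -- check the DHCP server
```

Running ipconfig earlier would have short-circuited the whole ladder, which is exactly the lesson of the story.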
In a similar environment to the one above, a user called saying he could not check his email.
- have user try to access the Internet -- fail (the email server was outside their gateway)
- user cannot access LAN drives
- I ping his PC -- fail
- he pings my PC -- fail
- user pings other local PCs -- pass
- Now I realize this is a sub office. The LAN drives are on a server at a nearby shared site.
- I traceroute to his PC -- fail
- he traceroutes to my PC -- fails, but at a different IP than the trace from my side.
- another user tries with the same results. This is office-wide.
- called the organization that maintains the WAN. They confirm my traceroute & ping results.
- at this point they took over the call. They later reported that there is a radio bridge between the two local offices, and the
bridge at the far side was not powered up. It turns out the power cord had been dislodged.
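The key clue was that traceroutes from each end died at different hops: the break sits between the last hop each side could reach. A sketch of that bracketing logic (the hop addresses here are invented for illustration):

```python
def last_reachable(hops):
    """Return the last hop that answered; '*' marks a timed-out hop."""
    answered = [h for h in hops if h != "*"]
    return answered[-1] if answered else None

def localize_break(trace_a_to_b, trace_b_to_a):
    """Bracket the failing link between the two last-reachable hops."""
    return (last_reachable(trace_a_to_b), last_reachable(trace_b_to_a))

# Illustrative hop lists (addresses are made up):
from_my_side = ["10.1.0.1", "10.1.9.1", "*", "*"]   # dies after 10.1.9.1
from_his_side = ["10.2.0.1", "*", "*"]              # dies after 10.2.0.1

print(localize_break(from_my_side, from_his_side))
# The fault sits on the link between those two hops -- here, the radio bridge.
```

When both endpoints can be tested, this bracketing narrows an office-wide outage to a single link before anyone drives out to check hardware.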
Several weeks ago I was getting periodic failures on my file server. The
SCSI card would scroll errors across the console about non-reentrant errors
and disk 0 problems. The server would take about 10 minutes to shut down because it
couldn't access the SCSI chain. The server had a very old (7 or 8 yrs) Future
Domain SCSI host adapter card, so I figured it had outlived its life span.
My workstation had an Adaptec 2948U card. I put an ATA100 card and drive in
the workstation and migrated the Adaptec card, its attached drive, and 2 SCSI CD
components to the file server.
Yesterday I found I couldn't ssh to the server. Moving to the console, I saw
the same SCSI errors scrolling across the screen. I figured the drive must
be failing, since it was now on a different host adapter (SCSI card). I painfully got
the thing to shut down properly, and on bootup noticed 2 SCSI devices were
missing: the CD and the drive migrated from my workstation. So it wasn't the
drive, and it wasn't the card, as these were different from the card and drive
that had originally been suspects. What could cause this? Was something
frying my SCSI cards?
Opening the case, I was reminded of the cramped quarters in there. The server
has 3 SCSI drives, 1 IDE drive, 1 SCSI CD-ROM, 2 IDE CDs and a SCSI tape drive. On a
hunch I disconnected a handful of power splitters, disabling the 2 IDE
CD-ROMs. I powered it up and all the SCSI devices were there again. It appears
the power supply couldn't cut it, and instead of failing outright it just dropped 1 or
2 of the leads.
I also learned that although a tape backup is good, it is of little value if
you have no hard drive to restore to :-0
So life is good again, although I am rethinking my current layout of keeping
my home directory on the file server.... if the "/home" mount fails, only root
can log on.
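One possible mitigation, purely my own suggestion rather than anything from the story above: if /home lives on a separate or remote device, marking the entry `nofail` in /etc/fstab lets the machine finish booting when that mount is absent, and keeping root's home on the root filesystem preserves an emergency login. The device name below is an assumption for illustration:

```
# /etc/fstab -- illustrative entry; device and filesystem type are assumptions
/dev/sdb1   /home   ext2   defaults,nofail   0   2
```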