Dropouts in web access to JANET and BBC...


richardw
01-03-2008, 13:53
I've been having a weird problem for the past few months. I'm at a loss to know what's going on, but more importantly I'm at a loss to know how to effectively communicate the problem to VM.

What I'm seeing is dropouts in web access to certain sites, in particular the BBC and to web servers on JANET (the UK academic network).

This happens several times a day (almost without fail it'll happen at least once between 5pm and 6.30pm). It happens at weekends and on weekdays. It happened this morning (Saturday) at just after 11am.

I get "cannot connect to web server" for about five to ten minutes of retrying, then it suddenly starts working fine again. Access to other web sites on the Internet is fine during these intervals.

I work at a University, where we have several web servers running on several machines; I run one of them. Without fail they *all* become inaccessible at the same time. Without fail the BBC website becomes inaccessible at the same time (I'm referring to their basic web page, not iPlayer or any of the streaming stuff). I found this morning that other sites on JANET (in this instance Nottingham University) become inaccessible to me at the same time.

During the affected interval I can SSH to the web server I run at my University, I can login, monitor the logs and watch the web server serving requests from elsewhere across JANET and around the world. I can ping www.bbc.co.uk.

So basic network connectivity seems to be OK, the web servers are up and working. It's just port 80 traffic between them and my cable modem which is being affected. And I reiterate, other web sites I frequent hosted elsewhere around the world work fine during these periods.

My basic question is -- who at VM can I get this information to for further investigation? And how? I really don't want to pay 50p/min to be told to reboot my cable modem, and then that the problem is fixed because I can see www.bbc.co.uk! I feel that from the symptoms above there's obviously something quite subtle going on, that it's somewhere between NTL and JANET, and that it's not obvious that it's my end.

I'm at my wits' end.

ceedee
01-03-2008, 14:18
Next time it happens, try to get tracerts to discover where the connection is falling apart and then submit them to TS via their dedicated newsgroup?

Good luck!
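
To capture evidence "next time it happens", it may help to log exactly when port 80 stops and starts answering. The following is only a sketch under assumptions: the function names, the 30-second polling interval and the idea of probing with a plain TCP connect are illustrative and not something anyone in the thread ran.

```python
import socket
import time
from datetime import datetime


def http_port_reachable(host, port=80, timeout=5.0):
    """Return True if a TCP connection to host:port completes within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution errors.
        return False


def outage_intervals(samples):
    """Collapse (timestamp, ok) samples into (start, end) outage intervals.

    An outage still in progress at the last sample gets end=None.
    """
    intervals = []
    start = None
    for ts, ok in samples:
        if not ok and start is None:
            start = ts                      # outage begins
        elif ok and start is not None:
            intervals.append((start, ts))   # outage ends
            start = None
    if start is not None:
        intervals.append((start, None))
    return intervals


def monitor(host, port=80, interval=30.0, rounds=10):
    """Probe host:port every `interval` seconds and return the outages seen."""
    samples = []
    for _ in range(rounds):
        samples.append((datetime.now(), http_port_reachable(host, port)))
        time.sleep(interval)
    return outage_intervals(samples)
```

Calling something like `monitor("www.bbc.co.uk", rounds=240)` over an evening would give a list of start and end times that can be pasted into a support report alongside the traceroutes.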

richardw
01-03-2008, 15:43
I've tried traceroutes, they show that the basic network connection between my home PC and web servers at work is fine.

This is supported by my being able to be logged in to the web server via SSH, actively doing stuff at the very same time my HTTP connections are failing to reach the very same machine (and the others at work, and presumably the others on JANET and the BBC).

It seems to be port specific, that's what's most confusing. traceroute gets through, SSH gets through, ping gets through, HTTP doesn't.

Is there a port-specific version of traceroute? (I don't see how there can be from my rudimentary understanding of how traceroute actually works).

dev
01-03-2008, 16:22
I've had a similar issue, but with mail (POP3) as opposed to the web. Accessing my mail from a uni server (so via JANET) can sporadically die, yet the BBC's site is always working for me. I just leave it a couple of minutes and then it usually works. Strangely enough, while I was at uni, never once did NTL/VM's mail server fail to work.

techyguy4
01-03-2008, 17:30
Try using the following dns servers. (OPEN DNS)
Primary :208.67.222.222
Secondary :208.67.220.220

good luck.

tomo
02-03-2008, 16:55
I also work for a University, located in London, on the JANET network. And to be honest I never have any issues with connectivity. I can work from home from time to time and SSH and SSL sessions stay active for hours.

That being said I have seen an issue occur twice in the past 12 months when basic connectivity is fine (ICMP works) but TCP/UDP connectivity to a random selection of places (not just BBC or JANET) does not work.

When this occurs I give VM Tech Support a ring, and provided that the person answering has got some clue (once it required a second phone call) I usually explain that


the basic network path is there, i.e. ICMP ping/tracert is OK,
TCP and UDP to random destinations appear to be dropped,
the destinations that are non-responsive are constantly non-responsive during the problem; those that work do work consistently,
using a different DNS server or IP addresses does not change matters (on both occasions the primary VM DNS server was a non-responsive destination),
I've restarted the cable modem,
I've removed my router and connected my laptop directly to the cable modem,
the sites that are down are OK from the same laptop using an independent internet connection (GPRS/3G), so it's not my laptop at fault,
I'm attached to UBR 2 in Watford ;)


The VM tech has a look around for scheduled maintenance, checks my connection and pronounces all is OK. But usually within 10 minutes the problem appears to have cleared, although no one ever admits anything.

I suspect that someone restarts something upstream of me that has got a messed-up network stack. There's nothing reported in the CM logs to indicate that the CM has had to renegotiate.

Hope this helps in some way.

dev
02-03-2008, 18:30
Try using the following dns servers. (OPEN DNS)
Primary :208.67.222.222
Secondary :208.67.220.220

good luck.

DNS has nothing to do with it; the problem is that the packets aren't getting to the JANET network.

Stuart
02-03-2008, 18:49
Try using the following dns servers. (OPEN DNS)
Primary :208.67.222.222
Secondary :208.67.220.220

good luck.

If DNS was not working, the whole connection would be affected, not just HTTP.

Joxer
02-03-2008, 22:32
It would appear from your spelling of traceroute and your use of ssh that you may well be using a Unix-based system, in which case (a quick look at the man page reveals) traceroute does offer port selection via the -p option. However, as far as I know Unix-based traceroute is (or was) UDP-based, so it may not help much.

---------- Post added at 22:32 ---------- Previous post was at 22:26 ----------

Just had another thought - if it's your server could you set up a test page on a non standard port and see if you can access that? Not sure how useful it may be but it would narrow it down to a port or service issue. How about ftp? does that work ok?
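
A test page on a non-standard port can be served with nothing more than Python's standard library. This is a minimal sketch under assumptions: port 8080, the handler name and the page text are arbitrary choices for illustration, and the chosen port must be one the University firewall lets through.

```python
import http.server
import threading


class TestPageHandler(http.server.BaseHTTPRequestHandler):
    """Answer every GET with a tiny plain-text page."""

    def do_GET(self):
        body = b"test page OK\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def start_test_server(port=8080, bind=""):
    """Start the test page on `port` in a background thread; return the server."""
    server = http.server.ThreadingHTTPServer((bind, port), TestPageHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return server
```

Each request is logged to stderr by the default handler, so during a dropout it is easy to see from an SSH session whether requests from home arrive at all. If the page on port 8080 loads while port 80 on the same machine does not, the problem is tied to the port rather than to the host.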

danmed
02-03-2008, 23:33
As techguy4 said, try those DNS settings. i have had similar problems recently but with a wider range of sites randomly not being accessible.. OpenDNS did the trick for me and everything has been fine since (changed over on wednesday or thursday of last week)

Stuart
02-03-2008, 23:40
As techguy4 said, try those DNS settings. i have had similar problems recently but with a wider range of sites randomly not being accessible.. OpenDNS did the trick for me and everything has been fine since (changed over on wednesday or thursday of last week)

Check the post.


During the affected interval I can SSH to the web server I run at my University, I can login, monitor the logs and watch the web server serving requests from elsewhere across JANET and around the world. I can ping www.bbc.co.uk.

So basic network connectivity seems to be OK, the web servers are up and working. It's just port 80 traffic between them and my cable modem which is being affected. And I reiterate, other web sites I frequent hosted elsewhere around the world work fine during these periods.


Unless he is ssh'ing to the IP address (thus bypassing DNS altogether) then it is NOT a DNS issue. Switching to OpenDNS will not help.

Also, from the second post reporting problems with JANET:


Using a different DNS server or IP addresses does not change matters (on both occasions the primary VM DNS server was a non-responsive destination)


So, using Open DNS would not help.
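
The DNS question can be settled mechanically by separating name resolution from the TCP connection. The sketch below is an illustration only; the function name and the three result labels are assumptions, not terms used in the thread.

```python
import socket


def diagnose(host, port=80, timeout=5.0):
    """Return (status, address) where status is one of
    'dns-failure', 'connect-failure' or 'ok'."""
    try:
        # Step 1: name resolution only.
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return "dns-failure", None

    # Step 2: connect to the first resolved address, bypassing DNS.
    address = infos[0][4][0]
    try:
        with socket.create_connection((address, port), timeout=timeout):
            return "ok", address
    except OSError:
        return "connect-failure", address
```

A result of 'connect-failure' with a resolved address during a dropout means the name resolved and the failure lies in reaching that address on that port, which is the case where changing DNS servers cannot help.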

Joxer
03-03-2008, 00:01
Stuart C: Good points well made, if only everyone would read posts before replying, though I am guilty of that myself.

danmed
03-03-2008, 01:44
Apologies to that man.

Cobbydaler
03-03-2008, 07:01
It would appear from your spelling of traceroute and your use of ssh that you may well be using a Unix-based system, in which case (a quick look at the man page reveals) traceroute does offer port selection via the -p option. However, as far as I know Unix-based traceroute is (or was) UDP-based, so it may not help much.

You could always try tcptraceroute (http://freshmeat.net/projects/tcptraceroute/)...

richardw
03-03-2008, 08:55
Thanks for the replies guys!

Just had another thought - if it's your server could you set up a test page on a non standard port and see if you can access that? Not sure how useful it may be but it would narrow it down to a port or service issue. How about ftp? does that work ok?

Yes that's a decent idea. FTP is blocked by our University firewall(s) but an unprivileged port should get through.

I've tried traceroute/tracert from Windows, Mac OSX and Linux at home, all with the same outcome.

You could always try tcptraceroute (http://freshmeat.net/projects/tcptraceroute/)...

Thanks, I'll give that a go.

The problem happened again this morning. I managed to get a tracert to my server and the BBC while it was happening, then another once it had cleared up.

Here's the route to my server (I've blanked out my server name for reasons of paranoia! Not that I don't trust folk here, but you know...). The first trace is during the problem, the second once it had cleared up:

Microsoft Windows XP [Version 5.1.2600]
(C) Copyright 1985-2001 Microsoft Corp.

C:\Documents and Settings\Richard>tracert xxx.le.ac.uk

Tracing route to xxx.le.ac.uk [143.210.xx.xxx]
over a maximum of 30 hops:

1 <1 ms <1 ms <1 ms agate.skynet [192.168.1.1]
2 9 ms 15 ms 18 ms 10.227.172.1
3 57 ms 62 ms 39 ms leic-t2cam1-b-v109.inet.ntl.com [82.3.34.197]
4 11 ms 18 ms 6 ms cpc3-ches1-3-1-cust233.lutn.cable.ntl.com [82.3.32.233]
5 9 ms 14 ms 36 ms nth-bb-b-so-130-0.inet.ntl.com [213.105.172.41]
6 16 ms 14 ms 12 ms tele-ic-1-as0-0.inet.ntl.com [62.253.184.2]
7 16 ms 15 ms 14 ms 212.250.14.34
8 22 ms 21 ms 12 ms so-1-3-0.lond-sbr4.ja.net [146.97.33.9]
9 21 ms 21 ms 17 ms so-2-1-0.leed-sbr1.ja.net [146.97.33.29]
10 26 ms 23 ms 26 ms EMMAN-N1.site.ja.net [146.97.42.10]
11 22 ms 39 ms 27 ms uol3-gw-v320.emman.net [195.195.228.114]
12 19 ms 20 ms 21 ms uol3-gw-5-2-r.emman.net [194.82.121.178]
13 26 ms 21 ms 25 ms 143.210.7.1
14 40 ms 19 ms 22 ms xxx.le.ac.uk [143.210.xx.xxx]

Trace complete.

C:\Documents and Settings\Richard>tracert xxx.le.ac.uk

Tracing route to xxx.le.ac.uk [143.210.xx.xxx]
over a maximum of 30 hops:

1 <1 ms <1 ms <1 ms agate.skynet [192.168.1.1]
2 23 ms 21 ms 31 ms 10.227.172.1
3 18 ms 20 ms 33 ms leic-t2cam1-b-v109.inet.ntl.com [82.3.34.197]
4 11 ms 12 ms 14 ms cpc3-ches1-3-1-cust233.lutn.cable.ntl.com [82.3.32.233]
5 13 ms 11 ms 13 ms nth-bb-b-so-130-0.inet.ntl.com [213.105.172.41]
6 38 ms 12 ms 29 ms tele-ic-1-as0-0.inet.ntl.com [62.253.184.2]
7 27 ms 12 ms 22 ms 212.250.14.34
8 19 ms 12 ms 17 ms so-1-3-0.lond-sbr4.ja.net [146.97.33.9]
9 17 ms 17 ms 19 ms so-2-1-0.leed-sbr1.ja.net [146.97.33.29]
10 21 ms 40 ms 19 ms EMMAN-N1.site.ja.net [146.97.42.10]
11 33 ms 18 ms 32 ms uol3-gw-v320.emman.net [195.195.228.114]
12 19 ms 25 ms 22 ms uol3-gw-5-2-r.emman.net [194.82.121.178]
13 21 ms 48 ms 22 ms 143.210.7.1
14 37 ms 48 ms 29 ms xxx.le.ac.uk [143.210.xx.xxx]

Trace complete.
And here's the trace to the BBC, again first during problem, second once clear:

Microsoft Windows XP [Version 5.1.2600]
(C) Copyright 1985-2001 Microsoft Corp.

C:\Documents and Settings\Richard>tracert www.bbc.co.uk

Tracing route to www.bbc.net.uk [212.58.253.73]
over a maximum of 30 hops:

1 <1 ms <1 ms <1 ms agate.skynet [192.168.1.1]
2 8 ms 8 ms 10 ms 10.227.172.1
3 12 ms 9 ms 40 ms leic-t2cam1-a-v109.inet.ntl.com [82.3.34.69]
4 26 ms 9 ms 12 ms cpc3-ches1-3-1-cust233.lutn.cable.ntl.com [82.3.32.233]
5 17 ms 19 ms 40 ms lee-bb-a-so-120-0.inet.ntl.com [213.105.172.17]
6 30 ms 24 ms 39 ms pop-bb-b-so-100-0.inet.ntl.com [62.253.185.238]
7 15 ms 27 ms 40 ms tele-ic-2-as0-0.inet.ntl.com [62.253.184.6]
8 20 ms 15 ms 15 ms ntl-ge2-8.prt0.thdo.bbc.co.uk [212.58.239.217]
9 15 ms 17 ms 17 ms 212.58.238.129
10 17 ms 19 ms 19 ms 212.58.239.222
11 18 ms 43 ms 16 ms www4.cwwtf.bbc.co.uk [212.58.253.73]

Trace complete.

C:\Documents and Settings\Richard>tracert www.bbc.co.uk

Tracing route to www.bbc.net.uk [212.58.253.73]
over a maximum of 30 hops:

1 <1 ms <1 ms <1 ms agate.skynet [192.168.1.1]
2 12 ms 43 ms 13 ms 10.227.172.1
3 16 ms 11 ms 10 ms leic-t2cam1-a-v109.inet.ntl.com [82.3.34.69]
4 26 ms 18 ms 29 ms cpc3-ches1-3-1-cust233.lutn.cable.ntl.com [82.3.32.233]
5 13 ms 10 ms 30 ms lee-bb-a-so-120-0.inet.ntl.com [213.105.172.17]
6 27 ms 24 ms 17 ms pop-bb-b-so-100-0.inet.ntl.com [62.253.185.238]
7 17 ms 18 ms 18 ms tele-ic-2-as0-0.inet.ntl.com [62.253.184.6]
8 13 ms 18 ms 15 ms ntl-ge2-8.prt0.thdo.bbc.co.uk [212.58.239.217]
9 22 ms 23 ms 51 ms 212.58.238.129
10 17 ms 16 ms 19 ms 212.58.239.222
11 30 ms 21 ms 36 ms www4.cwwtf.bbc.co.uk [212.58.253.73]

Trace complete.

C:\Documents and Settings\Richard>

A few things to draw from those, I think. First off, the DNS resolve seems to work OK during the problem. Second, the routes aren't changing between working and not working intervals. Thirdly, the routes only share the first four hops in common (agate is my Linksys router; presumably 10.227.172.1 is my UBR?).

I did a trace to an unaffected site during the problem:

Microsoft Windows XP [Version 5.1.2600]
(C) Copyright 1985-2001 Microsoft Corp.

C:\Documents and Settings\Richard>tracert www.beyond3d.com

Tracing route to beyond3d.com [74.200.65.90]
over a maximum of 30 hops:

1 <1 ms <1 ms <1 ms agate.skynet [192.168.1.1]
2 27 ms 29 ms 14 ms 10.227.172.1
3 38 ms 14 ms 10 ms leic-t2cam1-a-v109.inet.ntl.com [82.3.34.69]
4 26 ms 22 ms 29 ms leic-t3core-1a-ge-017-0.inet.ntl.com [195.182.179.245]
5 56 ms 12 ms 10 ms lee-bb-a-so-120-0.inet.ntl.com [213.105.172.17]
6 25 ms 17 ms 18 ms bre-bb-b-so-200-0.inet.ntl.com [213.105.175.26]
7 19 ms 17 ms 18 ms telc-ic-1-as0-0.inet.ntl.com [62.253.185.74]
8 26 ms 21 ms 25 ms tge2-3.fr4.lon.llnw.net [195.66.226.133]
9 40 ms 43 ms 20 ms ve5.fr3.lon.llnw.net [69.28.171.137]
10 99 ms 89 ms 82 ms tge7-2.fr3.lga.llnw.net [69.28.171.125]
11 114 ms 97 ms 96 ms ve2002.fr4.lga.llnw.net [69.28.171.202]
12 125 ms 124 ms 107 ms tge2-3.fr4.iad.llnw.net [69.28.171.153]
13 * 100 ms 110 ms readnews.defender.tge1-4.fr4.iad.llnw.net [69.28.158.174]
14 109 ms 102 ms 102 ms colo12c.iad.keepitsecure.net [69.65.112.54]
15 102 ms 132 ms 108 ms cerberus.fileburst.net [74.200.65.90]

Trace complete.

C:\Documents and Settings\Richard>
Which is interesting because it differs from the other two at hop 4 (cpc3-ches1-3-1-cust233.lutn.cable.ntl.com on the affected route).
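
Comparing traces by eye is error-prone, so a small script can pull the address out of each hop and list where two routes diverge. This is a sketch under assumptions: it expects Windows tracert output with each hop on a single line (hops wrapped across two lines by the console are skipped), and the function names are hypothetical.

```python
import re

# A hop line starts with the hop number and ends with an IPv4 address,
# optionally wrapped in square brackets after a host name.
HOP_LINE = re.compile(r"^\s*(\d+)\s.*?\[?(\d{1,3}(?:\.\d{1,3}){3})\]?\s*$")


def hop_addresses(tracert_output):
    """Return {hop_number: ip_address} for every parsable hop line."""
    hops = {}
    for line in tracert_output.splitlines():
        match = HOP_LINE.match(line)
        if match:
            hops[int(match.group(1))] = match.group(2)
    return hops


def differing_hops(route_a, route_b):
    """Return sorted hop numbers present in both routes whose addresses differ."""
    shared = route_a.keys() & route_b.keys()
    return sorted(hop for hop in shared if route_a[hop] != route_b[hop])
```

Fed the first five hops of the BBC trace and the beyond3d trace above, `differing_hops` reports hop 4 as the only difference, which agrees with the observation about cpc3-ches1-3-1-cust233.lutn.cable.ntl.com.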

Dai
03-03-2008, 09:03
I'm watching this with some interest. I don't have any advice to offer right now but I have been suffering the same problem with one particular site for some time now.
Again, when the site times out I can still ping and traceroute but can't get http access. It's definitely not a DNS issue as the name still resolves.
This clears itself normally after a few minutes although on a couple of occasions the time-out has lasted as long as 20 minutes.

richardw
23-03-2008, 09:34
Well I posted to the support newsgroup and got a few "we'll look into it" replies. But it's still happening. Intermittent, random, localised problems like this can be very hard to track down so it doesn't surprise me that VM support won't have the time to solve it.

Guess I might have to look for an alternate entry point to the Internet, as that's the simplest solution as far as I can see. Which is a shame, as I've had good service from Diamond Cable/NTL/Virgin Media over the years. But this problem plus traffic shaping plus dial-up speeds in the evening adds up to more irritation than I feel I have the right to be paying for. :(