Problems and Technical Issues with Rosetta@home

Sid Celery
Joined: 11 Feb 08
Posts: 2121
Credit: 41,179,074
RAC: 11,480
Message 109972 - Posted: 3 Nov 2024, 11:23:57 UTC - in response to Message 109971.  

Over the last 90 min, the Validator backlog has dropped by over 100k. Looks like it's dropping by around 35k per hour (when the Validators were down completely, the rate of increase was roughly 12k per hour).
It's taken 16 hours since the Validators were restarted, but we're starting to get some significant falls in the backlog- and looking at my systems pendings, they've actually started to drop too.
*fingers crossed*

I looked a few hours ago and my 132 pending had dropped to 80 and now I've arrived home it's already further down to just 31.
Backlog down to 370k so it's all looking good now. My fears from yesterday have largely been allayed.

Err... backlog to validate - nil
ID: 109972 · Rating: 0
Sid Celery
Joined: 11 Feb 08
Posts: 2121
Credit: 41,179,074
RAC: 11,480
Message 109974 - Posted: 4 Nov 2024, 0:28:08 UTC - in response to Message 109972.  

Over the last 90 min, the Validator backlog has dropped by over 100k. Looks like it's dropping by around 35k per hour (when the Validators were down completely, the rate of increase was roughly 12k per hour).
It's taken 16 hours since the Validators were restarted, but we're starting to get some significant falls in the backlog- and looking at my systems pendings, they've actually started to drop too.
*fingers crossed*

I looked a few hours ago and my 132 pending had dropped to 80 and now I've arrived home it's already further down to just 31.
Backlog down to 370k so it's all looking good now. My fears from yesterday have largely been allayed.

Err... backlog to validate - nil

Not quite sure what's happening atm: the validation backlog is up at 10k, but I don't think it's stopped working - just not quite keeping up for some reason.
The weirdness continues
ID: 109974 · Rating: 0
Grant (SSSF)
Joined: 28 Mar 20
Posts: 1677
Credit: 17,755,265
RAC: 22,849
Message 109975 - Posted: 4 Nov 2024, 7:01:21 UTC - in response to Message 109974.  
Last modified: 4 Nov 2024, 7:03:42 UTC

Not quite sure what's happening atm: the validation backlog is up at 10k, but I don't think it's stopped working - just not quite keeping up for some reason.
The weirdness continues
26k now.
The server has had issues for months now. I'm wondering if this is a symptom of those issues as they progressively worsen?

Someone there really needs to take a close look at the system logs to see just what is going on- WTF does the server keep crashing? And why is it now having so much trouble Validating work?
I'm thinking it's time for it to be replaced- a decade and a half is a very long time in computer hardware development.
Grant
Darwin NT
ID: 109975 · Rating: 0
[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1994
Credit: 9,555,377
RAC: 6,312
Message 109976 - Posted: 4 Nov 2024, 8:21:28 UTC - in response to Message 109975.  

Someone there really needs to take a close look at the system logs to see just what is going on- WTF does the server keep crashing? And why is it now having so much trouble Validating work? I'm thinking it's time for it to be replaced- a decade and a half is a very long time in computer hardware development.


As I said a long time ago, we don't know if the server page is up to date.
If not, the hw and (above all) the os/sw are very old.
Ubuntu 16....
ID: 109976 · Rating: 0
Sid Celery
Joined: 11 Feb 08
Posts: 2121
Credit: 41,179,074
RAC: 11,480
Message 109978 - Posted: 4 Nov 2024, 17:42:44 UTC - in response to Message 109976.  
Last modified: 4 Nov 2024, 17:47:55 UTC

Someone there really needs to take a close look at the system logs to see just what is going on- WTF does the server keep crashing? And why is it now having so much trouble Validating work? I'm thinking it's time for it to be replaced- a decade and a half is a very long time in computer hardware development.

As I said a long time ago, we don't know if the server page is up to date.
If not, the hw and (above all) the os/sw are very old.
Ubuntu 16....

Just being 'old' isn't the worst thing in the world.
Being old and having failure issues every few weeks is a sign that if you don't fix this stuff, it's going to fail altogether.
Which will inevitably result in someone asking whether they can afford the time and trouble to update it all or whether they should go in another direction entirely.
I'm not sure how convinced I am they'll update the hw & sw to continue here tbh

In the meantime, I think all tasks have just run out, so we'll soon see if the validation backlog (currently 59k) will start to edge back down again

Edit: Just checked and no-one in my team has <any> tasks pending validation. Am I just lucky? Or is the backlog not real?
ID: 109978 · Rating: 0
Sid Celery
Joined: 11 Feb 08
Posts: 2121
Credit: 41,179,074
RAC: 11,480
Message 109979 - Posted: 4 Nov 2024, 21:27:06 UTC - in response to Message 109978.  

In the meantime, I think all tasks have just run out, so we'll soon see if the validation backlog (currently 59k) will start to edge back down again

Edit: Just checked and no-one in my team has <any> tasks pending validation. Am I just lucky? Or is the backlog not real?

Well, that changed quick.
Validation backlog back down to nil and 700k tasks have popped up
We live to crunch another day
ID: 109979 · Rating: 0
[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1994
Credit: 9,555,377
RAC: 6,312
Message 109980 - Posted: 5 Nov 2024, 8:14:45 UTC - in response to Message 109978.  

Just being 'old' isn't the worst thing in the world.

Not for servers constantly exposed to the internet.
Security fixes, bug fixes, and support are fundamental (if you care about the project).
There is also the performance factor: do you see the difference between a recent file system (ZFS 2.5) and an old one (0.7 - if true)?


Which will inevitably result in someone asking whether they can afford the time and trouble to update it all or whether they should go in another direction entirely.
I'm not sure how convinced I am they'll update the hw & sw to continue here tbh

+1
ID: 109980 · Rating: 0
[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1994
Credit: 9,555,377
RAC: 6,312
Message 109981 - Posted: 5 Nov 2024, 9:47:15 UTC - in response to Message 109979.  

Validation backlog back down to nil and 700k tasks have popped up
We live to crunch another day


And another day with over 46k wus pending validation... :-(
ID: 109981 · Rating: 0
Sid Celery
Joined: 11 Feb 08
Posts: 2121
Credit: 41,179,074
RAC: 11,480
Message 109982 - Posted: 5 Nov 2024, 17:13:06 UTC - in response to Message 109981.  

Validation backlog back down to nil and 700k tasks have popped up
We live to crunch another day

And another day with over 46k wus pending validation... :-(

Yes, and now 88k
But I just looked through my team's tasks again and it's the same as a few days ago.
A high figure showing on the server status page, but none of my team have <any> tasks awaiting validation.

Is this 2 coincidences in a row? I'm certainly confused.
ID: 109982 · Rating: 0
Bryn Mawr
Joined: 26 Dec 18
Posts: 390
Credit: 12,073,013
RAC: 4,827
Message 109983 - Posted: 5 Nov 2024, 19:19:48 UTC - in response to Message 109982.  

Validation backlog back down to nil and 700k tasks have popped up
We live to crunch another day

And another day with over 46k wus pending validation... :-(

Yes, and now 88k
But I just looked through my team's tasks again and it's the same as a few days ago.
A high figure showing on the server status page, but none of my team have <any> tasks awaiting validation.

Is this 2 coincidences in a row? I'm certainly confused.


You are the lucky one.

The problem appears to have started for me at 02:00 GMT; for the next hour I had about 50% pending, and since then I’ve only had 5 validated out of nearly 100 completed.
ID: 109983 · Rating: 0
robertmiles
Joined: 16 Jun 08
Posts: 1232
Credit: 14,270,985
RAC: 1,405
Message 109984 - Posted: 5 Nov 2024, 22:30:34 UTC - in response to Message 109982.  

Validation backlog back down to nil and 700k tasks have popped up
We live to crunch another day

And another day with over 46k wus pending validation... :-(

Yes, and now 88k
But I just looked through my team's tasks again and it's the same as a few days ago.
A high figure showing on the server status page, but none of my team have <any> tasks awaiting validation.

Is this 2 coincidences in a row? I'm certainly confused.

Could it mean that the validator processes results for some operating systems correctly, but not for others?
ID: 109984 · Rating: 0
Sid Celery
Joined: 11 Feb 08
Posts: 2121
Credit: 41,179,074
RAC: 11,480
Message 109985 - Posted: 5 Nov 2024, 22:46:02 UTC - in response to Message 109983.  

Validation backlog back down to nil and 700k tasks have popped up
We live to crunch another day

And another day with over 46k wus pending validation... :-(

Yes, and now 88k
But I just looked through my team's tasks again and it's the same as a few days ago.
A high figure showing on the server status page, but none of my team have <any> tasks awaiting validation.

Is this 2 coincidences in a row? I'm certainly confused.

You are the lucky one.

The problem appears to have started for me at 02:00 GMT; for the next hour I had about 50% pending, and since then I’ve only had 5 validated out of nearly 100 completed.

I've just looked at your pending tasks and I'm amazed at the backlog.
I just returned 8 tasks and, while they didn't validate immediately, it only took 20-30 minutes, not 20hrs!
I'm almost apologetic about my success. I've done nothing to warrant it, certainly.
Definitely some strange and inexplicable business going on.
ID: 109985 · Rating: 0
Grant (SSSF)
Joined: 28 Mar 20
Posts: 1677
Credit: 17,755,265
RAC: 22,849
Message 109986 - Posted: 6 Nov 2024, 4:35:01 UTC - in response to Message 109984.  
Last modified: 6 Nov 2024, 4:39:58 UTC

Could it mean that the validator processes results for some operating systems correctly, but not for others?
Most likely a disk issue: if your results are on the disk that is having issues, then you get stuck with all the pendings. If you're lucky and they're on the disks that are OK, then the database has no problem reading & Validating them, then transitioning the result & removing it.
To me it's looking more and more like a dodgy drive in an array (or, if it's a hardware RAID controller, the disk(s) might be OK but the controller might be having issues with a channel or two...).

All wild speculation on my part.


Unfortunately the site that provides the BOINC graphs has been having issues, but it's come back up and it shows the Validation backlog hit 125k this time, but is now falling at roughly 40k per hour.
Grant
Darwin NT
ID: 109986 · Rating: 0
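
For anyone wanting to sanity-check the backlog figures quoted in this thread, here is a rough back-of-envelope sketch in Python. It assumes the drain rate stays constant (which the thread itself shows is not guaranteed) and that results keep arriving at the same rate whether or not the Validators are running. The function names are made up for illustration; nothing here comes from the BOINC server code.

```python
def hours_to_clear(backlog: float, net_drop_per_hour: float) -> float:
    """Hours until the backlog reaches zero at a constant net drain rate."""
    if net_drop_per_hour <= 0:
        raise ValueError("backlog is not shrinking, so it never clears")
    return backlog / net_drop_per_hour


def implied_validation_rate(inflow_per_hour: float, net_drop_per_hour: float) -> float:
    """Validator throughput implied by a net drop while results keep arriving.

    Assumption: inflow_per_hour is the growth rate seen while the Validators
    were down, and new results arrive at that same rate once they are back.
    """
    return inflow_per_hour + net_drop_per_hour


if __name__ == "__main__":
    # Figures reported in the thread: 125k backlog falling at about 40k per hour.
    print(hours_to_clear(125_000, 40_000))
    # Growth of about 12k per hour while down, net drop of about 35k per hour after.
    print(implied_validation_rate(12_000, 35_000))
```

With the numbers reported above, a 125k backlog falling at 40k per hour would clear in a little over 3 hours, and a 35k per hour net drop on top of a 12k per hour inflow implies the Validators were getting through roughly 47k results per hour.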
Grant (SSSF)
Joined: 28 Mar 20
Posts: 1677
Credit: 17,755,265
RAC: 22,849
Message 109987 - Posted: 6 Nov 2024, 10:47:32 UTC

Validator backlog is back with a vengeance, and the boinc-process host is officially dead again on the Server Status page.





Oh, and for an idea of how much better CPUs have become over the years, the Xeon E3-1280 v5 in the graph below is 200 MHz slower & has 3 GB/s less memory bandwidth than the E3-1270 v6 CPU being used for the database server here at Rosetta (so close enough for there to be bugger all difference in performance between them).



The EPYC 4124P has the same thread & core count as the Xeon E3-1280 v5, the same TDP rating, but double the performance.
And the EPYC 4564P, a bit over double the power, but with 8 times the performance....
Grant
Darwin NT
ID: 109987 · Rating: 0
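
The CPU comparison above boils down to performance per watt. A minimal sketch in Python using only the ratios given in the post: no actual benchmark scores or TDP values are assumed, and "a bit over double the power" is rounded down to exactly 2.0, so the second result is an upper bound. The function name is hypothetical.

```python
def relative_perf_per_watt(perf_ratio: float, power_ratio: float) -> float:
    """Performance-per-watt of a new CPU relative to an old one.

    perf_ratio  = new performance / old performance
    power_ratio = new power draw  / old power draw
    """
    if power_ratio <= 0:
        raise ValueError("power_ratio must be positive")
    return perf_ratio / power_ratio


if __name__ == "__main__":
    # EPYC 4124P vs Xeon E3-1280 v5: same TDP, double the performance.
    print(relative_perf_per_watt(2.0, 1.0))
    # EPYC 4564P: 8 times the performance at (assumed) exactly double the power.
    print(relative_perf_per_watt(8.0, 2.0))
```

On those ratios the first chip does about twice the work per watt and the second just under four times, since its power draw is a bit more than double.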
[VENETO] boboviz
Joined: 1 Dec 05
Posts: 1994
Credit: 9,555,377
RAC: 6,312
Message 109989 - Posted: 6 Nov 2024, 16:08:29 UTC - in response to Message 109987.  
Last modified: 6 Nov 2024, 16:12:58 UTC

The EPYC 4124P has the same thread & core count as the Xeon E3-1280 v5, the same TDP rating, but double the performance.
And the EPYC 4564P, a bit over double the power, but with 8 times the performance....


I don't know if the problem is the cpu. I think it's much more likely the hd/ssd systems.

And I still think the sw/os is as important as the hw.
If you scroll this page, you can see how old the file system of the R@H server is.
And here, here, etc, some ideas about optimization of file system resources
ID: 109989 · Rating: 0
Grant (SSSF)
Joined: 28 Mar 20
Posts: 1677
Credit: 17,755,265
RAC: 22,849
Message 109991 - Posted: 7 Nov 2024, 6:47:25 UTC - in response to Message 109989.  

I don't know if the problem is the cpu. I think it's much more likely the hd/ssd systems.
So do I; however, a couple of new systems could replace all of the existing systems, provide much better performance, and use less power.
They could spend days, weeks, months (and money) sorting out exactly what is dying on the current system, or just buy one new half-decent system to replace the existing problem hardware and sort out the old one at leisure & keep it for emergencies/other needs.


And I still think the sw/os is as important as the hw.
If you scroll this page, you can see how old the file system of the R@H server is.
Actually extremely old.
There have been plenty of performance updates over the years, let alone security-based ones, that would make it worthwhile upgrading to the current releases IMHO.
Grant
Darwin NT
ID: 109991 · Rating: 0
Sid Celery
Joined: 11 Feb 08
Posts: 2121
Credit: 41,179,074
RAC: 11,480
Message 109998 - Posted: 8 Nov 2024, 0:56:32 UTC

Boinc-process server is back and validation seems to be working, with a 330k backlog to work through
ID: 109998 · Rating: 0
JLDun
Joined: 31 May 08
Posts: 7
Credit: 68,063
RAC: 447
Message 109999 - Posted: 8 Nov 2024, 6:00:25 UTC

Getting some "transient https errors" in attempting to download some tasks.
ID: 109999 · Rating: 0
Grant (SSSF)
Joined: 28 Mar 20
Posts: 1677
Credit: 17,755,265
RAC: 22,849
Message 110000 - Posted: 8 Nov 2024, 8:50:46 UTC - in response to Message 109999.  

Getting some "transient https errors" in attempting to download some tasks.
No issues with your net connection in general?
Looked at my Event log, and no signs of issues with uploads or downloads.
Grant
Darwin NT
ID: 110000 · Rating: 0
Grant (SSSF)
Joined: 28 Mar 20
Posts: 1677
Credit: 17,755,265
RAC: 22,849
Message 110001 - Posted: 8 Nov 2024, 8:52:18 UTC - in response to Message 109998.  

Boinc-process server is back and validation seems to be working, with a 330k backlog to work through
This time it appears to be doing well straight off- the backlog is almost cleared and all of my Pendings have already cleared.
Grant
Darwin NT
ID: 110001 · Rating: 0



©2024 University of Washington
https://www.bakerlab.org