Chaos in Rosetta@Home???

David E K
Volunteer moderator
Project administrator
Project developer
Project scientist

Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 62711 - Posted: 2 Aug 2009, 6:01:13 UTC

Ralph is definitely used. Our standard procedure is to send all jobs through Ralph before running them on Rosetta. If some people in our group are skipping this, and possibly causing problems, they shouldn't be, and I'll make sure they don't do it again.

joseps, the code signing error was a simple human error. Ralph uses a different signature than R@h, so all new apps have to be code signed (you can't test this on Ralph, you just have to do it right the first time and verify the signature). I have since fixed our code signing script so that the signature files always get overwritten, which ensures incorrect copies are replaced with the correct signature (and verified). We do not have the resources to keep backup servers sitting idle, and they wouldn't have helped much with this latest mishap anyway.
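
To illustrate the "always overwrite, then verify" idea, here is a minimal sketch in Python. It is not the project's actual script: the function names, the `.sig` file suffix, and the use of HMAC-SHA256 as a stand-in for the real signing step are all assumptions made for the example.

```python
import hashlib
import hmac
import os
import tempfile
from pathlib import Path


def compute_signature(app_path: Path, key: bytes) -> str:
    """Stand-in signature (assumption): HMAC-SHA256 of the file contents."""
    return hmac.new(key, app_path.read_bytes(), hashlib.sha256).hexdigest()


def sign_and_verify(app_path: Path, key: bytes) -> Path:
    """Always overwrite the signature file, then verify what is on disk."""
    # Hypothetical naming convention: "<app file name>.sig" next to the app.
    sig_path = app_path.with_name(app_path.name + ".sig")
    signature = compute_signature(app_path, key)

    # Write to a temp file and replace the target, so a stale signature
    # (for example one made with another project's key) never survives.
    fd, tmp_name = tempfile.mkstemp(dir=str(sig_path.parent))
    with os.fdopen(fd, "w") as tmp:
        tmp.write(signature)
    os.replace(tmp_name, sig_path)

    # Verify the copy on disk rather than the value still held in memory.
    on_disk = sig_path.read_text()
    if not hmac.compare_digest(on_disk, compute_signature(app_path, key)):
        raise RuntimeError(f"signature mismatch for {app_path}")
    return sig_path
```

The point of the sketch is only that the signing step never skips an existing signature file and that it fails loudly if the file on disk does not match.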
Sid Celery

Joined: 11 Feb 08
Posts: 2127
Credit: 41,266,340
RAC: 8,573
Message 62718 - Posted: 2 Aug 2009, 13:35:44 UTC - in response to Message 62711.  

Ralph is definitely used. Our standard procedure is to send all jobs through Ralph first before running them on Rosetta. If there are some people in our group skipping this and may be causing problems, they shouldn't and I'll make sure they don't do it again.

So this is really just a discipline problem.

It seems to me this kind of thing can be solved if people are made aware of the limits of their authority, and updates can't go live unless they're first signed off by someone who does have that authority, along with the capability and availability to rectify any unexpected issue that arises.

That covers everything, doesn't it?

A hard lesson this time, but the consequences are much bigger than the problem, so it has to be done.
Greg_BE
Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 62722 - Posted: 2 Aug 2009, 15:29:05 UTC - in response to Message 62718.  

Ralph is definitely used. Our standard procedure is to send all jobs through Ralph first before running them on Rosetta. If there are some people in our group skipping this and may be causing problems, they shouldn't and I'll make sure they don't do it again.

So this is really just a discipline problem.

It seems to me this kind of thing can be solved if people are made aware of the limits of their authority, and updates can't go live unless they're first signed off by someone who does have that authority, along with the capability and availability to rectify any unexpected issue that arises.

That covers everything, doesn't it?

A hard lesson this time, but the consequences are much bigger than the problem, so it has to be done.



Sounds like a hard-core, strictly enforced protocol needs to be enacted. Make RAH off limits except to a few key personnel who can check the work before it is released. I'm guessing this was a serious workload for the IT people to fix, and it was egg on the face for the project. Let's hope this was a lesson that will not be repeated any time soon.
robertmiles

Joined: 16 Jun 08
Posts: 1233
Credit: 14,284,221
RAC: 1,121
Message 62820 - Posted: 7 Aug 2009, 16:48:46 UTC - in response to Message 62640.  

We are still in debug mode for our minirosetta application. There is a small memory leak and a two-fold slowdown in performance. The slowdown was caused by a recent refactoring of the hydrogen bond energy code.



Maybe that small memory leak is responsible for the problems I've been seeing lately on both of my computers. The total memory in use by processes, as reported by Windows Task Manager, is significantly less than the total physical memory in use. Whenever the total physical memory in use gets much above 50%, all programs that run in 32-bit mode slow down significantly.

For now I've decided to handle the problem by telling BOINC that it can use no more than 40% of the memory on either computer, even when the computer isn't in use. This makes such problems slower to appear, but does not stop them entirely. Restarting the boinc.exe program more often helps too.

There's also the possibility that the versions of BOINC on both these computers have trouble using more than 50% of the memory to run workunits in 32-bit mode. That would be another reason to speed up the availability of application programs that run in 64-bit mode, and to give future versions of BOINC separate control of how much memory can be used in 32-bit mode.
googloo
Joined: 15 Sep 06
Posts: 133
Credit: 22,732,248
RAC: 3,460
Message 62834 - Posted: 8 Aug 2009, 11:55:54 UTC

Another discipline problem is that new versions are being released without being posted to the Rosetta Application Version Release Log. This has happened several times in the past few weeks. Version 1.91 is still not there.

It's very important that this is done so that those of us who subscribe to that thread get an email, and can update our firewalls.

There may be other reasons that this is important, but that's why it's important to me.



©2024 University of Washington
https://www.bakerlab.org