Message boards : Number crunching : Chaos in Rosetta@Home???
Author | Message |
---|---|
Emigdio Lopez Laburu Send message Joined: 25 Feb 06 Posts: 61 Credit: 40,240,061 RAC: 0 |
Good morning. After all the issues this July, my impression is that Rosetta@home is in true chaos. I hope that this chaos is only in the "IT part" and not in the "science" part. Perhaps I'm wrong, but this is my personal impression. Hopefully all the issues will be solved soon. Regards. |
David E K Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 1 Jul 05 Posts: 1018 Credit: 4,334,829 RAC: 0 |
Here is what caused our current issues. I personally wouldn't call it chaos but a simple mistake, plus the kind of issues that arise with most large-scale software development projects. A developer/scientist in the lab accidentally updated the R@h application using the wrong signature file for the database, which is unfortunately our largest input file. The update happened during the weekend and no one was around to fix the problem (I personally was on a backpacking trip with my family; otherwise I would have dealt with the problem immediately). This caused all jobs to fail and hammered our servers. Our servers are still struggling to keep up with scheduler requests and downloads/uploads. Coincidentally, a very large code check-in was made to introduce symmetric folding to our minirosetta application, and unfortunately it contained a bug that caused a 10-fold slowdown. The R@h app was updated before this bug was caught, so we had to revert to the previous application version as a quick fix. To make sure this doesn't happen again, we are planning to implement a quick benchmark test on Ralph for every application update that will test various protocols for performance and speed. We are still in debug mode for our minirosetta application. There is a small memory leak and a 2-fold slowdown in performance. The slowdown was caused by a recent refactoring of the hydrogen bond energy code. |
googloo Send message Joined: 15 Sep 06 Posts: 133 Credit: 22,729,581 RAC: 3,310 |
Nobody's updating the Rosetta Application Version Release Log either. |
Emigdio Lopez Laburu Send message Joined: 25 Feb 06 Posts: 61 Credit: 40,240,061 RAC: 0 |
Hi, David. First of all, I would like to thank you for your explanation. I appreciate it. As has been discussed before in other threads, it would be a good idea to pass this information on to all the volunteers, perhaps on the main page of R@H. Not everybody visits the forums and not everybody will read this thread, I suppose. I do not understand the science behind this project; I work as an IT professional in a field unrelated to protein folding. But let me give you a couple of pieces of advice (without understanding your "business"!): - Never, never make a change to the software/hardware just before a weekend. If something fails, nobody will be there to fix it. - You must build a pre-production environment to test the changes. As I said, I don't understand your environment, software and so on. I give you this advice with total humility, but I think that someone on your team should take some action. The price of these errors is high if you want to keep thousands of volunteers working for you. Thanks again. |
Chilean Send message Joined: 16 Oct 05 Posts: 711 Credit: 26,694,507 RAC: 0 |
|
adrianxw Send message Joined: 18 Sep 05 Posts: 653 Credit: 11,840,739 RAC: 23 |
I have one wu stuck in upload, which presumably will resolve itself. When I finally got a new wu down, it crashed after 6 seconds, Mini 1.88. 30/07/2009 21:01:21 rosetta@home [sched_op_debug] Reason: Unrecoverable error for result lr13_seq_score12_F_rlbd_1a68_IGNORE_THE_REST_DECOY_14592_2633_0 (Incorrect function. (0x1) - exit code 1 (0x1)) Wave upon wave of demented avengers march cheerfully out of obscurity into the dream. |
Greg_BE Send message Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
I have one wu stuck in upload, which presumably will resolve itself. When I finally got a new wu down, it crashed after 6 seconds, Mini 1.88. Be sure to post the error message part of this message over in the 1.88 thread so they know. |
Chilean Send message Joined: 16 Oct 05 Posts: 711 Credit: 26,694,507 RAC: 0 |
I have one wu stuck in upload, which presumably will resolve itself. When I finally got a new wu down, it crashed after 6 seconds, Mini 1.88. These get solved in v1.90. Apparently one tiny mistake had a snowball effect on the whole project. |
Yifan Song Volunteer moderator Project developer Project scientist Send message Joined: 26 May 09 Posts: 62 Credit: 7,322 RAC: 0 |
Well, for about two days, it was a little chaotic here. A bunch of us were running around trying to figure out where the bug was and how to fix it. But be assured that the chaos is not in the science part and is hopefully just temporary. I'll post something more detailed on what we've tried to solve this problem. |
David E K Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 1 Jul 05 Posts: 1018 Credit: 4,334,829 RAC: 0 |
Hi, David. Your advice is great and I agree completely with both points. 1. We will make it a point never to do an update during the weekend or at the end of the week. 2. We do have a pre-production environment: Ralph@home. But this problem was caused by user error. The signature file was accidentally copied over from Ralph, when the standard protocol should automatically create the correct signature file. The 10x slowdown wasn't caught by our internal unit tests and benchmark tests, but we are going to modify the tests to make sure it will get caught in the future. There has recently been a very high rate of code development. |
David E K Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 1 Jul 05 Posts: 1018 Credit: 4,334,829 RAC: 0 |
I have one wu stuck in upload, which presumably will resolve itself. When I finally got a new wu down, it crashed after 6 seconds, Mini 1.88. The "tiny" mistake was a very, very, very detrimental one, unfortunately. |
Henry Huff Send message Joined: 31 May 06 Posts: 6 Credit: 2,298,502 RAC: 0 |
Yes, something is wrong. I have a total of 4 computers running Rosetta@home and all have been having trouble uploading and downloading as well as getting new tasks. A frequent message is "internet access ok - project servers may be down". This has been going on for over a week. |
Evan Send message Joined: 23 Dec 05 Posts: 268 Credit: 402,585 RAC: 0 |
Yes, something is wrong. I have a total of 4 computers running Rosetta@home and all have been having trouble uploading and downloading as well as getting new tasks. A frequent message is "internet access ok - project servers may be down". Perhaps the answer lies here: Message 62660 - all services were temporarily shut down to add more web servers. |
Gen_X_Accord Send message Joined: 5 Jun 06 Posts: 154 Credit: 279,018 RAC: 0 |
Regarding David E's explanation (thank you for that, by the way), I have added Ralph@home too. Maybe helping with the early development can help prevent screw-ups like this in the future. |
Chilean Send message Joined: 16 Oct 05 Posts: 711 Credit: 26,694,507 RAC: 0 |
|
Greg_BE Send message Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
There should be a link on the Rosetta home page pointing to Ralph; then we can get more volunteers to help with pre-production testing. I run Ralph to help find any bugs in the program. Little is known about Ralph apart from the occasional mention on the boards here. |
joseps Send message Joined: 25 Jun 06 Posts: 72 Credit: 8,173,820 RAC: 0 |
Here is what caused our current issues. I personally wouldn't call it chaos but a simple mistake and issues that arise with most large-scale software development projects. I turned off my 5 computers when I went on vacation. When I returned today, I could not upload work. Need work units to run computers. joseps |
joseps Send message Joined: 25 Jun 06 Posts: 72 Credit: 8,173,820 RAC: 0 |
Hi, I do not know servers, research operations or their management. Rosetta@home has become very big and draws a large number of volunteer crunchers worldwide. Some kind of preventive maintenance should be implemented to make sure that distributed computing is not interrupted. If possible, a backup server or similar should be available. And no one person should be doing work or checking alone. There should be at least two people working together, cross-checking and discussing each move before it is carried out. This is done to prevent any break in the operation. I used to run a large production plant with a 3-shift operation, and I made sure that 2-3 engineers discussed an action before implementing it. No one person is fail-proof. I love Rosetta. I just want to volunteer my 2 cents' worth of ideas. If I am barging in or out of line, I am very, very sorry. I'll just shut my big mouth. joseps I turned off my 5 computers when I went on vacation. When I returned today, I could not upload work. Need work units to run computers. joseps |
Greg_BE Send message Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
I have to agree, this coding/signature problem should have been avoided in the first place by double-checking the code or signatures. Projects should always be alpha/beta tested on Ralph before coming over here to Rosetta. Once the major errors have been worked out, bring the tasks here for running. Then only very odd errors will show up. The group of users should be larger, but the technology problems are driving some of the big users away. Perhaps Rosetta is now too big for just the group that is running it now. |
j2satx Send message Joined: 17 Sep 05 Posts: 97 Credit: 3,670,592 RAC: 0 |
Hi, David. They have ralph@home to test on, but they don't use it. |
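
The two safeguards discussed in this thread (making sure the signature matches the database file, and a quick benchmark gate before each application update) can be sketched roughly as below. This is only an illustration under stated assumptions, not the project's actual tooling: a plain SHA-256 digest stands in for whatever signing scheme the project really uses, and every function name, as well as the 1.5x slowdown threshold, is hypothetical.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def signature_matches(data_path, expected_digest):
    """True if the file's digest equals the recorded one.

    Assumption: the 'signature' is modelled as a plain digest string;
    real code-signing schemes are key-based and work differently.
    """
    return sha256_of(data_path) == expected_digest.strip().lower()


def is_too_slow(baseline_seconds, candidate_seconds, max_slowdown=1.5):
    """True if the candidate build is slower than the allowed ratio.

    The 1.5x default is an arbitrary, hypothetical threshold.
    """
    if baseline_seconds <= 0:
        raise ValueError("baseline_seconds must be positive")
    return candidate_seconds / baseline_seconds > max_slowdown


def release_blockers(data_path, expected_digest,
                     baseline_seconds, candidate_seconds):
    """Collect human-readable reasons not to ship an update."""
    blockers = []
    if not signature_matches(data_path, expected_digest):
        blockers.append("signature does not match the database file")
    if is_too_slow(baseline_seconds, candidate_seconds):
        ratio = candidate_seconds / baseline_seconds
        blockers.append(f"benchmark is {ratio:.1f}x slower than baseline")
    return blockers
```

An update would go out only when `release_blockers(...)` returns an empty list. With a check of this shape, both a mismatched signature file and a slowdown on the order of the 10-fold one described above would be reported before release instead of after.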
©2024 University of Washington
https://www.bakerlab.org