Message boards : Number crunching : minirosetta 2.05
Author | Message |
---|---|
Aroundomaha Joined: 11 Sep 08 Posts: 14 Credit: 55,623,619 RAC: 0 |
For the past two days my Windows 7 machine has been bombing with occasional blue screen of death crashes. I ran the Microsoft debugger and it points to an issue with minirosetta 2.05. --------- enclosed debug information ----------------- 3: kd> !analyze -v ******************************************************************************* * * * Bugcheck Analysis * * * ******************************************************************************* MULTIPLE_IRP_COMPLETE_REQUESTS (44) A driver has requested that an IRP be completed (IoCompleteRequest()), but the packet has already been completed. This is a tough bug to find because the easiest case, a driver actually attempted to complete its own packet twice, is generally not what happened. Rather, two separate drivers each believe that they own the packet, and each attempts to complete it. The first actually works, and the second fails. Tracking down which drivers in the system actually did this is difficult, generally because the trails of the first driver have been covered by the second. However, the driver stack for the current request can be found by examining the DeviceObject fields in each of the stack locations. Arguments: Arg1: fffffa800afb3320, Address of the IRP Arg2: 0000000000000eae Arg3: 0000000000000000 Arg4: 0000000000000000 Debugging Details: ------------------ IRP_ADDRESS: fffffa800afb3320 CUSTOMER_CRASH_COUNT: 1 DEFAULT_BUCKET_ID: VISTA_DRIVER_FAULT BUGCHECK_STR: 0x44 PROCESS_NAME: minirosetta_2. CURRENT_IRQL: 2 LAST_CONTROL_TRANSFER: from fffff8000285fb95 to fffff80002875f00 |
Mad_Max Joined: 31 Dec 09 Posts: 209 Credit: 26,262,530 RAC: 19,111 |
Hi, At last I have received enough WUs of this type to check. My conclusion: there are still problems with checkpointing. Unlike in version 2.03, the "CPU time at last checkpoint" information is now displayed correctly, which gives the BOINC client the chance to switch between projects, but after a restart the calculation still starts from the beginning. Here is an example task that I watched: 8gbnnotyr_3gbn_2iug_9Jan2010_16915_7_0 Before the restart it had used 0:33 hours of CPU time with 27 models done; after restarting, another 1:27 hours passed and 72 more models were calculated. But apparently only the 72 models computed after the restart are reflected in the report, the first 27 models are missing, and the task also completed with a Validate error. Here is another example: 8gbnnotyr_3gbn_1ijt_9Jan2010_16915_1_0 The same result: the report contains only the models computed after the restart, and there is a Validate error too. For comparison, here is a task of this type that was computed without breaks: 8gbnnotyr_3gbn_1woj_9Jan2010_16909_12_0 Without interruption, 2 hours of CPU time produced 94 models (compare with 72 and 67 in the previous cases for the same 2 hours of CPU time), and Validate state = Valid. The difference corresponds to roughly 0.5 hours of CPU time, which is about how much time passed before the restarts. |
Mad_Max Joined: 31 Dec 09 Posts: 209 Credit: 26,262,530 RAC: 19,111 |
Please don't presume that the information from the Project Team is an inaccurate description and that your memory observations are a new and permanent condition for all to enjoy going forward. As Sarel points out, they introduced a new type of work unit which has a new low-memory phase to execution. And so you are only going to see the lower memory usage when that specific type of task is being worked on. And this new type of work unit was introduced in prior versions, so the actual delta to v2.05 is small. Since this new type of work is a current area of review, you may see a high concentration of this type of work for a period of time. But it doesn't mean we can presume more than was stated. Yes, here I was mistaken. It is simply that for some time after version 2.05 arrived I received ONLY the new types of WU, which use little RAM, and from that I came to a (wrong) conclusion. But now some WUs of the old types are arriving, and for them memory usage is about the same as in version 2.03. |
Evan Joined: 23 Dec 05 Posts: 268 Credit: 402,585 RAC: 0 |
https://boinc.bakerlab.org/rosetta/result.php?resultid=310901552 This one stalled twice at about 5 hrs 35 mins but was running for over 9 hours. I restarted boinc and it then stalled again in the same place. |
Mike_Solo Joined: 16 Nov 09 Posts: 2 Credit: 67,261 RAC: 0 |
Soooo... this new version hangs too often. 2.03 was much more stable. It hangs on my 2xAthlonMP 2800 as well as on the Intel E8400, so the CPU is not the issue. I think about 15% of tasks get stuck in the middle, consuming >200 MB of RAM but no CPU. I'm thinking of leaving Rosetta for a while until a new version is ready, as I'm tired of kicking off broken tasks every morning :( |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Looks like Mike Solo has 3 machines: one WinXP using BOINC version 6.10.18; one WinXP using BOINC version 6.10.18; one WinServer 2003 using BOINC version 6.10.18. Rosetta Moderator: Mod.Sense |
Mad_Max Joined: 31 Dec 09 Posts: 209 Credit: 26,262,530 RAC: 19,111 |
2 more tasks of type *gbnnotyr* with the same result: when run without stops everything works normally, but if there was a break during calculation, the results from before the break disappear and the task ends with a validate error. In total I have: 2 WUs processed without stops, both of which seem OK: https://boinc.bakerlab.org/rosetta/result.php?resultid=310752146 https://boinc.bakerlab.org/rosetta/result.php?resultid=311145245 And 3 WUs with a break in processing, all of which completed with validate errors: https://boinc.bakerlab.org/rosetta/result.php?resultid=310935403 https://boinc.bakerlab.org/rosetta/result.php?resultid=310946429 https://boinc.bakerlab.org/rosetta/result.php?resultid=311163725 P.S. The last of these 3 (id 311163725) was stopped at the very beginning of operation, before the 1st checkpoint had been written. However, after restarting, it still completed with a validate error. So it is possible that validate errors in this type of WU are not directly linked to checkpoints, and that these are 2 different bugs. |
Sarel Joined: 11 May 06 Posts: 51 Credit: 81,712 RAC: 0 |
Thanks! We'll have a look at this as soon as possible and let you know what we find. Best, Sarel. 2 more tasks of type *gbnnotyr* with the same result - by operation without stops all work normally, but if during calculation there was a break - results before a break disappear, and the task is ended with validate error. |
svincent Joined: 30 Dec 05 Posts: 219 Credit: 12,120,035 RAC: 0 |
In the last week I've had to abort 11 tasks on W7 because the tasks are hung consuming 0% CPU time. I was hoping that the combination of upgrading to the latest BOINC and the new 2.05 version of R@h would fix the problem but no: it continues as before. Tasks on Mac OS X seem to be unaffected by this problem. Until there's some indication this problem is fixed I'm not getting any more tasks for W7. |
AdeB Joined: 12 Dec 06 Posts: 45 Credit: 4,428,086 RAC: 0 |
Task: 311103842 Workunit: homopt_nat2.t368_.t368_.IGNORE_THE_REST.S_00003_0000018_07.pdb_00003.pdb.JOB_16835_29 ERROR: No values of the appropriate type specified for multi-valued option -loops:loop_file AdeB |
P . P . L . Joined: 20 Aug 06 Posts: 581 Credit: 4,865,274 RAC: 0 |
Here's another Validate error, it didn't seem to have any problems running. Edit/ This was on 64bit linux. https://boinc.bakerlab.org/rosetta/workunit.php?wuid=283574991 8gbnnotyr_3gbn_1s68_9Jan2010_16915_22_0 # cpu_run_time_pref: 14400 ====================================================== DONE :: 37 starting structures 14469.9 cpu seconds This process generated 37 decoys from 37 attempts ====================================================== BOINC :: Watchdog shutting down... BOINC :: BOINC support services shutting down cleanly ... called boinc_finish Validate error__Done__14,470.06 ========================================================================= Edit/ added this. This one was on linux 32bit, again didn't seem to have a problem. Very low credits. 8gbnnotyr_3gbn_1opd_9Jan2010_16915_42_0 https://boinc.bakerlab.org/rosetta/workunit.php?wuid=283817716 # cpu_run_time_pref: 14400 ====================================================== DONE :: 8 starting structures 12134.6 cpu seconds This process generated 8 decoys from 8 attempts ====================================================== BOINC :: Watchdog shutting down... BOINC :: BOINC support services shutting down cleanly ... called boinc_finish Success__Done__12,135.35__28.60__4.61 |
Admin Joined: 13 Apr 07 Posts: 42 Credit: 260,782 RAC: 0 |
Validate Error on Win7, successfully completed by a wingman on win xp https://boinc.bakerlab.org/rosetta/result.php?resultid=311128874 name: 8gbnnotyr_3gbn_1iuk_9Jan2010_16915_131_0 |
Sid Celery Joined: 11 Feb 08 Posts: 2130 Credit: 41,424,155 RAC: 16,102 |
About time I updated my recent fault lists. I've had several errors under 2.03, but only this under 2.05: On Intel T5500 laptop running W7 and Boinc 6.10.18 Outcome Validate error 8gbnnotyr_3gbn_2onu_9Jan2010_16909_17_0 # cpu_run_time_pref: 28800 Note: On several occasions the following line appears: No heartbeat from core client for 30 sec - exiting Edit: Wingman running XP also received a validate error on apparently successful completion. |
MVeiga Joined: 15 Oct 07 Posts: 1 Credit: 2,448,806 RAC: 4,828 |
Hi guys, let me just tell you: if you're using Windows 7, the beta version 6.10.24 or even the new beta 6.10.29 is much more stable. I've used beta 6.10.24 for a long time and had no problems at all with Rosetta. For me it's much more stable than 6.10.18, on Windows 7 of course. Anyway, that's just my case. |
Mad_Max Joined: 31 Dec 09 Posts: 209 Credit: 26,262,530 RAC: 19,111 |
Task: 311103842 I too had the same error in this type of WU: https://boinc.bakerlab.org/rosetta/result.php?resultid=310238605 And on the 2nd computer processing this WU, too: https://boinc.bakerlab.org/rosetta/result.php?resultid=310471681 True, that was still version 2.03, so I did not write about it, but above is an example of the same error in version 2.05 as well. |
Mad_Max Joined: 31 Dec 09 Posts: 209 Credit: 26,262,530 RAC: 19,111 |
Here's another Validate error, it didn't seem to have any problems running. It seems there is only one problem with that WU: it was restarted too (maybe a switch to another project?) and hit the bug related to that.
I have such an example too: https://boinc.bakerlab.org/rosetta/result.php?resultid=311202691 Claimed credit = 54.35 vs Granted credit = 1.83 (about 30 times lower). And I can even tell exactly what happened to it: usually in this type of WU the models are computed very fast, about 1 to several minutes per model. This task started the same way: in approximately 15 minutes 13 models were calculated (~500 steps each), but on the 14th something happened. The calculation did not stop at the 500th step and went on much longer; I saw the counter pass 40000 steps and did not watch any further (I think it was about 60000-70000 steps in total). I was already thinking of aborting the task, since I thought the calculation had gone into a loop, but after 5 hours (instead of several minutes) the calculation of the 14th model did complete. I.e. 13 models took about 15 minutes, and the 14th about 5 hours. Hence the small Granted credit, since it is calculated in proportion to the number of models. (If not for this 14th model, about 300 models would have been calculated in 5 hours instead of 14, and Granted credit would be close to Claimed credit.) I think much the same happened in your task... P.S. Quite probably it is NOT an error but a feature of the algorithm: if it finds something interesting, a more detailed calculation of that model probably starts. It would be good to check this with the scientists responsible for this type of WU. |
Sarel Joined: 11 May 06 Posts: 51 Credit: 81,712 RAC: 0 |
Hello, based on the reports of validator issues, David Kim has now fixed the validator. He also asked me to remind people that credit is granted based on the client's claimed credit, regardless of validator results. Let us know if you see more such problems. Thanks, Sarel. |
Sid Celery Joined: 11 Feb 08 Posts: 2130 Credit: 41,424,155 RAC: 16,102 |
Thanks for the information Sarel - and David for the fix. No further errors today, but a cursory check has revealed I haven't re-booted my desktop since Dec 15th! I'm sure I've had various updates since then, but that's a ridiculous amount of uptime for me... Back in 5... ;) |
Link Joined: 4 May 07 Posts: 356 Credit: 382,349 RAC: 0 |
credit is granted based on the client's claimed credit, regardless of validator results. Does that not apply only to results with compute errors or validate errors? . |
Mad_Max Joined: 31 Dec 09 Posts: 209 Credit: 26,262,530 RAC: 19,111 |
hellotheworld wrote:
Oxfez wrote: One of my tasks has "meatballed" too: I have another "meatball" too. Task: https://boinc.bakerlab.org/rosetta/result.php?resultid=311361747 Some screenshots: http://s001.radikal.ru/i193/1001/1f/cffd2181b53b.jpg http://i073.radikal.ru/1001/d9/c87d3083bfb9.jpg http://s41.radikal.ru/i094/1001/8e/a86dfd3a7d6a.jpg Plus, for about the last 2 hours of computation (or ~20 steps) there were no changes in Energy or RMSD at all. (I did not take more screenshots since nothing changed further except CPU Time and Steps count.) I do not think it is an error in the software, but probably a weak spot in the scientific algorithm itself, so it should be addressed not to the programmers but to the scientists. |
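
The figures Mad_Max reports earlier in this thread (94 models in 2 CPU hours without interruption, 27 models lost at a restart, 13 models in about 15 minutes followed by one model that took about 5 hours, and claimed credit 54.35 vs granted 1.83) can be sanity-checked with a little arithmetic. The sketch below is not project code: the function names are made up for illustration, and it only assumes that models are normally produced at a steady rate. It does not reproduce how the project actually computes granted credit.

```python
def models_per_hour(models: int, cpu_hours: float) -> float:
    """Average number of models completed per CPU hour."""
    return models / cpu_hours


def credit_ratio(claimed: float, granted: float) -> float:
    """How many times smaller the granted credit is than the claimed credit."""
    return claimed / granted


# Checkpointing report: 94 models in 2 CPU hours without interruption.
steady_rate = models_per_hour(94, 2.0)

# CPU time represented by the 27 models lost at the restart, at that rate.
lost_hours = 27 / steady_rate

# Slow-model report: 13 models in roughly 15 minutes (0.25 h), then one
# model that ran for about 5 hours, so about 5.25 hours in total.
fast_rate = models_per_hour(13, 0.25)
expected_models = fast_rate * 5.25  # assumption: whole run at the fast rate

print(f"steady rate: {steady_rate:.0f} models/hour")
print(f"lost work: about {lost_hours:.2f} CPU hours")
print(f"expected models in 5.25 h: {expected_models:.0f}")
print(f"claimed/granted: {credit_ratio(54.35, 1.83):.1f}")
```

The lost work comes out at a little over half an hour, in line with the 0:33 of CPU time reported before the restart, and the claimed-to-granted ratio is close to the "about 30 times lower" quoted in the post.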
©2024 University of Washington
https://www.bakerlab.org