Message boards : Number crunching : minirosetta 2.03
Author | Message |
---|---|
robertmiles Joined: 16 Jun 08 Posts: 1233 Credit: 14,324,975 RAC: 3,637 |
I get an issue on this workunit during CPU benchmarks: To me, this one looks like SOMETHING won't run correctly during the CPU benchmarks, but the application is able to recover afterwards. Since the CPU benchmarks aren't run very often, this problem won't be seen very often either. |
svincent Joined: 30 Dec 05 Posts: 219 Credit: 12,120,035 RAC: 0 |
broker_idealclose_kic_in20_hb_t312__IGNORE_THE_REST_16513_879_0, task 305259567, failed on Windows 7 with a Compute Error after about an hour. Setting up checkpointing ... Setting up graphics native ... FNAME: native.pdb FNAME: ss_core_native.pdb FNAME: ss_core_native_radical.pdb FNAME: native_notails.pdb FNAME: native.pdb BOINC:: Worker startup. Starting watchdog... Watchdog active. CLOSING with IDEALIZATION CLOSING with IDEALIZATION CLOSING with IDEALIZATION CLOSING with IDEALIZATION Unhandled Exception Detected... - Unhandled Exception Record - Reason: Access Violation (0xc0000005) at address 0x5B202020 read attempt to address 0x5B202020 Engaging BOINC Windows Runtime Debugger... Followed by pages of W7 debug info |
SekeRob Joined: 7 Sep 06 Posts: 35 Credit: 19,984 RAC: 0 |
As a trial, I set a 1 hour run time preference in my home profile [because Rosetta is on a small share] and saved it. I received 2 tasks with about a 1 hour run time, but now this one is running: 21/12/2009 12:23:40 rosetta@home [checkpoint_debug] result broker_idealclose_hb_t293__IGNORE_THE_REST_16362_82629_0 checkpointed. It is at 1:25 CPU time, 1:26 elapsed, and 28 percent. It's checkpointing regularly, so I don't consider this a bad task. Why this long one in between? Mini 2.03 release. PS: The first 2 validated; this long one is now Pending Validation at 2.11 hours, twice as long as specified. Coelum Non Animum Mutant, Qui Trans Mare Currunt |
robertmiles Joined: 16 Jun 08 Posts: 1233 Credit: 14,324,975 RAC: 3,637 |
"Just for try out, set 1 hour run time pref in home profile [because Rosetta is on a small share], saved, received 2 with about a 1 hour run time, but now this one is running" At least partly because the test for whether to end the task normally occurs only at the end of a decoy, so if the last decoy it started ran significantly longer than expected, you'd be likely to exceed your time preference. Also, I remember some discussion about setting the minimum allowed time preference longer than one hour, so you might want to check for signs that it's actually now set longer. |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
I don't believe any change in minimum runtime was made, robert. But the minimum amount of useful work is one model (or "decoy"). If that takes longer than an hour, then that does happen and is normal for this project. The watchdog is still there keeping everyone in line if need be. With such a short runtime you should expect the % completed to vary widely. BOINC's expectations and Rosetta's estimates will have trouble settling in when one task completes in 45 min. and the next in 2 hours. Rosetta Moderator: Mod.Sense |
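The behaviour described in the two replies above (the run time preference is tested only when a decoy finishes, and one decoy is the minimum unit of work) can be illustrated with a minimal Python sketch. This is not Rosetta's code: `run_task`, its parameters, and the decoy durations are hypothetical, and the real application may apply further rules (the watchdog, for example) that are not modelled here.

```python
def run_task(decoy_durations, target_runtime):
    """Run decoys in order, testing the run time preference only between decoys.

    decoy_durations: seconds each decoy would take (hypothetical input).
    target_runtime: preferred run time in seconds.
    Returns (decoys_completed, total_cpu_seconds).
    """
    elapsed = 0.0
    completed = 0
    for duration in decoy_durations:
        # The only check: has the preference already been reached?
        if elapsed >= target_runtime:
            break
        # Once started, a decoy always runs to completion.
        elapsed += duration
        completed += 1
    return completed, elapsed


print(run_task([1500, 1500, 4600], 3600))
```

With a 3600 second preference and decoys of 1500, 1500 and 4600 seconds, the third decoy starts at 3000 seconds and the task finishes at 7600 seconds, roughly the pattern of the long task discussed above.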
SekeRob Joined: 7 Sep 06 Posts: 35 Credit: 19,984 RAC: 0 |
The present test standing is: 306433095 279395159 21 Dec 2009 13:45:43 UTC 21 Dec 2009 14:53:27 UTC Over Success Done 3,366.56 17.47 15.56; 306421142 279383866 21 Dec 2009 12:39:42 UTC 21 Dec 2009 13:49:54 UTC Over Success Done 3,503.61 18.18 15.74; 306417911 279381434 21 Dec 2009 12:22:40 UTC 21 Dec 2009 13:33:02 UTC Over Success Done 3,542.00 18.38 16.13; 306408510 279371956 21 Dec 2009 11:33:58 UTC 21 Dec 2009 12:43:56 UTC Over Success Done 3,594.54 18.65 16.79; 306393079 279357285 21 Dec 2009 10:11:58 UTC 21 Dec 2009 12:26:53 UTC Over Success Done 7,661.61 39.76 40.86; 306381783 279347432 21 Dec 2009 9:11:43 UTC 21 Dec 2009 11:38:12 UTC Over Success Done 3,444.83 17.88 15.05; 306365927 279333133 21 Dec 2009 7:44:11 UTC 21 Dec 2009 9:07:33 UTC Over Success Done 3,497.34 18.15 16.63. Looks like it's pretty well figured out that an hour is an hour most of the time. I do appreciate that there is a non-deterministic element, and if the long ones are just incidental, this is a good project to act as filler when on a shutdown schedule. From 4 o'clock it's power-off. Coelum Non Animum Mutant, Qui Trans Mare Currunt |
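As a quick check on the listing above, the following Python sketch summarises the seven run times. It assumes the first of the three trailing numbers in each row is CPU time in seconds, as in the usual BOINC results listing; the variable names and the 10% tolerance are illustrative choices, not project settings.

```python
import statistics

# CPU times in seconds, copied from the seven rows of the listing (assumed column).
cpu_seconds = [3366.56, 3503.61, 3542.00, 3594.54, 7661.61, 3444.83, 3497.34]
target = 3600.0   # the 1 hour run time preference
tolerance = 0.10  # illustrative: anything within 10% of target counts as on time

median_time = statistics.median(cpu_seconds)
mean_time = statistics.mean(cpu_seconds)
overruns = [t for t in cpu_seconds if t > target * (1 + tolerance)]

print(f"median: {median_time:.2f} s")
print(f"mean:   {mean_time:.2f} s")
print(f"tasks more than 10% over target: {len(overruns)} of {len(cpu_seconds)}")
```

The median stays close to the one hour preference while the single long task pulls the mean up, which fits the observation that an hour is an hour most of the time.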
svincent Joined: 30 Dec 05 Posts: 219 Credit: 12,120,035 RAC: 0 |
Workunit 281289676 failed on Windows 7: it appeared to hang (not using processor time) and had to be aborted. It was successfully completed by a wingman on XP. |
AdeB Joined: 12 Dec 06 Posts: 45 Credit: 4,428,086 RAC: 0 |
Task: 309276026 Workunit: homopt_nat2.t370_.t370_.IGNORE_THE_REST.S_00003_0000009_04.pdb_00003.pdb.JOB_16836_1 stderr out: ... AdeB |
coturnix Joined: 8 Oct 09 Posts: 4 Credit: 760,915 RAC: 0 |
Task: 309472876 Workunit: homopt_nat2.t368_.t368_.IGNORE_THE_REST.S_00002_0000023_04.pdb_00002.pdb.JOB_16835_15 ERROR: No values of the appropriate type specified for multi-valued option -loops:loop_file |
svincent Joined: 30 Dec 05 Posts: 219 Credit: 12,120,035 RAC: 0 |
Same error as AdeB reported: ERROR: [ERROR] Error opening RBSeg file 'native_0001_2.pdb.loopfile' ERROR:: Exit from: src/protocols/loops/Loops.cc line: 483 BOINC:: Error reading and gzipping output datafile: default.out called boinc_finish </stderr_txt> Task 309256017, Mac OS X 10.6 |
P . P . L . Joined: 20 Aug 06 Posts: 581 Credit: 4,865,274 RAC: 0 |
Had these two errors within seconds of each other. homopt_cstmc_1.t308_.t308_.IGNORE_THE_REST.S_00002_0000618_00037.pdb.JOB_16846_1 https://boinc.bakerlab.org/rosetta/workunit.php?wuid=282054734 ERROR: [ERROR] Unable to open constraints file: /work/tex/projects/cm/benchmark/cross_filt/t308_/t308_.aln_list_mike_chosen_bestaln.alns.combined.csts ERROR:: Exit from: src/core/scoring/constraints/ConstraintIO.cc line: 332 BOINC:: Error reading and gzipping output datafile: default.out called boinc_finish ======================================================================================= homopt_nat2.t312_.t312_.IGNORE_THE_REST.S_00022_0000017_04.pdb_00022.pdb.JOB_16828_5 https://boinc.bakerlab.org/rosetta/workunit.php?wuid=282085871 ERROR: [ERROR] Error opening RBSeg file 'S_00002_0000020_0_0_030.pdb_00002.pdb_00002.pdb.loopfile' ERROR:: Exit from: src/protocols/loops/Loops.cc line: 483 BOINC:: Error reading and gzipping output datafile: default.out called boinc_finish |
Mad_Max Joined: 31 Dec 09 Posts: 209 Credit: 26,262,530 RAC: 19,111 |
It seems that some of Rosetta's WUs do not checkpoint at all, or do it incorrectly. For example, with WUs named "ha_notyr_...", after several hours of computing, "CPU time at last checkpoint" stays at "---" (none). If I restart or shut down the computer (or only the BOINC client) while such a WU is running, all results are lost, and after the restart the computation starts from the very beginning. Here are examples of such tasks: https://boinc.bakerlab.org/rosetta/result.php?resultid=308985993 https://boinc.bakerlab.org/rosetta/result.php?resultid=309233711 And this is how they look in BOINC Manager: Some other tasks report checkpoints but apparently do not actually make them (or, more likely, make them but do not use them after a restart). It looks like this: "CPU time at last checkpoint" looks correct (only a few minutes less than the total CPU time), BUT if I look at "Show graphics" before exiting BOINC, it shows, say, that 38 models have already been calculated. After the restart, the model count starts again from 0, 1, 2 and so on. Then fewer models are reported to the server than had been computed before the restart (for example only 20), apparently only those computed after the last restart. This is a significant problem for me, because I NEED to restart this computer from time to time, and it is always shut down when I sleep (because of the noise). Besides, most likely the same thing also happens on automatic switching between projects when BOINC runs several projects at once (for example R@H and E@H). For now I have worked around it like this: I set a small "CPU target time" (2 hours at present instead of the 10 I used earlier) to reduce the possible loss of useful calculations to a minimum. But I think this is only a partial solution and far from the best one... P.S. I am running minirosetta 2.03, BOINC 6.10.18 on Windows XP SP3. |
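For readers unfamiliar with the mechanism under discussion, here is a minimal, generic checkpoint-and-resume sketch in Python. It is not how minirosetta implements checkpointing; the file name, the `models_done` field, and all function names are hypothetical. It only shows the behaviour a restart is expected to have: the count of completed models continues from the last checkpoint instead of starting again from zero.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically: temp file in the same directory, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as tmp:
        json.dump(state, tmp)
    os.replace(tmp_path, path)  # readers never see a half-written checkpoint


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"models_done": 0}


def run(path, models_to_add):
    """Resume from the checkpoint, compute more models, checkpoint after each."""
    state = load_checkpoint(path)
    for _ in range(models_to_add):
        state["models_done"] += 1  # stand-in for computing one model
        save_checkpoint(path, state)
    return state["models_done"]
```

Calling `run(path, 38)` and then, after a simulated restart, `run(path, 20)` on the same file gives 58 models in total rather than 20.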
Snags Joined: 22 Feb 07 Posts: 198 Credit: 2,888,320 RAC: 0 |
9gbnnotyr_3gbn_2bk8_9Jan2010_16860_5_0 Completed successfully... 2713 models! It ran for the entire 10 hour target run time without stopping despite having a 60 minute switch interval, plenty of work on board from other projects and a STD and resource share which leads most 10 hour rosetta units to break at least once, usually twice before finishing up. I don't have checkpoint flags enabled and I'm running BOINC 6.2.18 so I have no information on checkpoints. I posted a similar report (more than 100 models, no switching) about a different type of WU over on Ralph. Snags |
svincent Joined: 30 Dec 05 Posts: 219 Credit: 12,120,035 RAC: 0 |
These homopt_nat2.t* models are causing real problems. On Mac I get a bunch erroring out immediately, e.g. Task 309606751; Task 309641678 gave this same ERROR: [ERROR] Error opening RBSeg file 'S_00002_0000022_0_0_00009.pdb_00001.pdb_00002.pdb.loopfile' ERROR:: Exit from: src/protocols/loops/Loops.cc line: 483 BOINC:: Error reading and gzipping output datafile: default.out called boinc_finish while Task 309628562 (https://boinc.bakerlab.org/rosetta/result.php?resultid=309628562) failed like this: Options::initialize() End reached ERROR: No values of the appropriate type specified for multi-valued option -loops:loop_file On Windows 7 I still get a bunch that have to be aborted as they're hanging but not taking up CPU time: Task 308815202 (https://boinc.bakerlab.org/rosetta/result.php?resultid=308815202) Task 308815003 (https://boinc.bakerlab.org/rosetta/result.php?resultid=308815003) Task 308586455 (https://boinc.bakerlab.org/rosetta/result.php?resultid=308586455) |
Sid Celery Joined: 11 Feb 08 Posts: 2130 Credit: 41,424,155 RAC: 16,102 |
A disappointing validate error on my W7 laptop: ha_notyr_3gbn_2hpj_6Jan2010_16806_6_1 And a Compute Error on my Vista desktop: ha_notyr_3gbn_1.gz_6Jan2010_16806_2_1 ERROR: data And another 2 for the laptop: homopt4.t367_.t367_.IGNORE_THE_REST.S_00006_0000013_0_0_00086.pdb_00004.pdb_00006.pdb.JOB_16819_17_1 homopt4.t328_.t328_.IGNORE_THE_REST.S_00001_0000007_0_0_0_0020.pdb_00001.pdb_00001.pdb.JOB_16816_23_1 Both showing: ERROR: No values of the appropriate type specified for multi-valued option -loops:loop_file |
Sarel Joined: 11 May 06 Posts: 51 Credit: 81,712 RAC: 0 |
"A disappointing validate error on my W7 laptop:" Thanks, Sid! The error you report in ha_notyr_3gbn_1.gz_6Jan2010_16806_2_1 is due to some scripting bug, which I've now fixed. Thanks for the bug report! Sarel. |
P . P . L . Joined: 20 Aug 06 Posts: 581 Credit: 4,865,274 RAC: 0 |
Two more overnight: one errored, and the other is odd and I'm not impressed. --------------------------------------------------------------------- homopt4.t290_.t290_.IGNORE_THE_REST.S_00002_000.pdb.JOB_16809_13_0. https://boinc.bakerlab.org/rosetta/workunit.php?wuid=282226265 ERROR: [ERROR] Error opening RBSeg file 'S_00001_0000038_04.pdb_00001.pdb.loopfile' ERROR:: Exit from: src/protocols/loops/Loops.cc line: 483 BOINC:: Error reading and gzipping output datafile: default.out called boinc_finish ===================================================================== This ran for over 8 hrs non-stop and didn't let other tasks run; the last model seems to have taken four hours. My run time is set at 4 hrs. ha_notyr_3gbn_2oeb_8Jan2010_16808_18_0 https://boinc.bakerlab.org/rosetta/workunit.php?wuid=282041421 BOINC:: CPU time: 29392.7s, 14400s + 14400s[2010- 1-10 18:36:45:] :: BOINC InternalDecoyCount: 87 ====================================================== DONE :: 2 starting structures 29392.7 cpu seconds This process generated 87 decoys from 87 attempts ====================================================== called boinc_finish Over__Success__Done__29,394.75__69.44__9.95 |
Sarel Joined: 11 May 06 Posts: 51 Credit: 81,712 RAC: 0 |
"It seems that some of Rosetta's WUs do not checkpoint at all, or do it incorrectly." Hello, Thanks for your comments! In this sort of WU, trajectories are typically not very long, but successful trajectories are extremely rare (credit is allocated per time spent computing, so you get credit for the work regardless of success). Accordingly, don't worry about turning off the computer in the middle of one of these trajectories. The most you're going to lose is a couple of minutes of computation. Thanks, Sarel. |
frederick corse Joined: 7 Oct 05 Posts: 10 Credit: 1,545,999 RAC: 0 |
Hello, I ran 9gbnnotyr_3gbn_3bfm_9jan2010_16860_11. It ran for 14399.07 secs and had 844 decoys; the most I ever saw was 100. Regards |
Sarel Joined: 11 May 06 Posts: 51 Credit: 81,712 RAC: 0 |
"Hello" That's right. Again, this is a different sort of simulation (see https://boinc.bakerlab.org/forum_thread.php?id=4477&nowrap=true#64838 for details). In these runs, many of the trajectories are cut short early on because they are unlikely to yield useful results. Credit for runs is allocated for computational time, and we need to know how many times simulations were started on your computers, so those are reported as decoys. The amount of information that is sent back to our servers per triaged trajectory is very small, though, to limit bandwidth usage. |
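The decoy counts reported in this thread can be put side by side with a small Python calculation. The figures are taken from the posts above; the 36,000 second run time for the 2713-model task is an assumption based on the stated 10 hour target, and `seconds_per_decoy` is a hypothetical helper.

```python
def seconds_per_decoy(cpu_seconds, decoys):
    """Average CPU time spent per reported decoy."""
    return cpu_seconds / decoys


# (CPU seconds, decoys reported) for three tasks mentioned in the thread.
reports = {
    "9gbnnotyr, 844 decoys": (14399.07, 844),
    "9gbnnotyr, 2713 decoys (assumed 10 h run)": (36000.0, 2713),
    "ha_notyr, 87 decoys": (29392.7, 87),
}

for label, (cpu, decoys) in reports.items():
    print(f"{label}: {seconds_per_decoy(cpu, decoys):.1f} s per decoy")
```

Averages ranging from under twenty seconds to several minutes per decoy are consistent with the explanation that many trajectories are cut short early and that each started trajectory is reported as a decoy.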
©2024 University of Washington
https://www.bakerlab.org