Message boards : Number crunching : why do some tasks show two results?
Author | Message |
---|---|
JStateson Joined: 7 May 07 Posts: 15 Credit: 4,061,331 RAC: 0 |
For example, this has two "DONE" sections: the first one is 18200 cpu seconds, the second has 30749 cpu seconds. Only the 30749 shows up. Also, why is the claimed credit so high compared to the granted? The ratio of 73.9 to 7.9 is almost an order of magnitude. Thanks for looking. |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
In a nutshell, you had a long-running model there. Only one model completed after about 10 hours of CPU time. And since everyone else happens to get models that take 10x less time to complete, it just looks to the credit system as though you have a slow machine. The second problem is one I believe I've seen before as well. Some tasks seem to have two "done sections" as you called them. And it seems as though the credit system only sees one of them. I've asked the Project Team to look into this issue to see if they can determine the cause. Rosetta Moderator: Mod.Sense |
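To make the credit explanation above concrete, here is a minimal sketch. It assumes, and this is only an assumption since the project's credit code is not shown in this thread, that granted credit equals the number of models returned times an average claim per model taken across all hosts. The function name `granted_credit` and the average value are hypothetical; the 73.9 and 7.9 figures come from the first post.

```python
# Hypothetical model of per-model credit; not the project's actual code.
def granted_credit(models_completed: int, avg_claim_per_model: float) -> float:
    """Credit granted when each model is paid at the average claim per model."""
    return models_completed * avg_claim_per_model

claimed = 73.9        # claimed credit, from the first post
avg_per_model = 7.9   # assumed average claim per model across other hosts
granted = granted_credit(1, avg_per_model)  # only one model was counted
ratio = claimed / granted
print(f"granted={granted:.1f}, claimed/granted={ratio:.2f}")
```

Under that assumption, a host that spends the whole run on a single slow model is paid for one model only, which is why the claimed figure ends up roughly nine times the granted one.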
Aegis Maelstrom Joined: 29 Oct 08 Posts: 61 Credit: 2,137,555 RAC: 0 |
The second problem is one I believe I've seen before as well. Some tasks seem to have two "done sections" as you called them. And it seems as though the credit system only sees one of them. I've asked the Project Team to look into this issue to see if they can determine the cause. Hi Mod, you probably remember this thread of mine. Obviously it is not a really widespread bug, but I hope it will be fixed. Best wishes to all of you. a.m. |
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Yes, Aegis. Thanks for digging up the link. As you say, it is quite rare. I've mentioned it to the Project Team, but it doesn't sound like a root cause has been identified as of yet. Rosetta Moderator: Mod.Sense |
Aegis Maelstrom Joined: 29 Oct 08 Posts: 61 Credit: 2,137,555 RAC: 0 |
Hi All, unfortunately the problem seems to be not as rare as I thought. Once again two DONE sections. This task : stderr out <core_client_version>6.2.19</core_client_version> <![CDATA[ <stderr_txt> # cpu_run_time_pref: 21600 # cpu_run_time_pref: 21600 ====================================================== DONE :: 1 starting structures 20372.3 cpu seconds This process generated 8 decoys from 8 attempts ====================================================== BOINC :: Watchdog shutting down... # cpu_run_time_pref: 21600 ====================================================== DONE :: 1 starting structures 23460.2 cpu seconds This process generated 1 decoys from 1 attempts ====================================================== BOINC :: Watchdog shutting down... BOINC :: BOINC support services shutting down... called boinc_finish </stderr_txt> ]]> Claimed credit 61.61. Granted 5.80 (only 1 decoy seen). Can the team check if all the generated decoys sit in the database, not only one? :S I would add that - as this is my second attack of this bug - at least some points added would be highly appreciated... Take care, a.m.@Poland P.S. Unfortunately I got no valid RALPH tasks. When I had time to babysit, the database spewed only some wrong WUs... :/ |
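For anyone who wants to check their own tasks for this, the following is a rough sketch that counts the "DONE" summaries in a task's stderr output and adds up the decoys they report. The regular expression is written against the text quoted in the post above; the names `DONE_RE`, `done_sections` and `sample` are made up for illustration, and the separator lines in `sample` are shortened.

```python
import re

# Pattern for one "DONE" summary as it appears in the stderr output quoted above.
DONE_RE = re.compile(
    r"DONE ::\s+(\d+) starting structures\s+([\d.]+) cpu seconds\s+"
    r"This process generated\s+(\d+) decoys from\s+(\d+) attempts"
)

def done_sections(stderr_text):
    """Return one dict per DONE summary found in the text."""
    return [
        {"cpu_seconds": float(cpu), "decoys": int(decoys), "attempts": int(attempts)}
        for _structures, cpu, decoys, attempts in DONE_RE.findall(stderr_text)
    ]

# Abbreviated copy of the stderr text from the post (separator lines shortened).
sample = (
    "# cpu_run_time_pref: 21600 ====== DONE :: 1 starting structures 20372.3 cpu seconds "
    "This process generated 8 decoys from 8 attempts ====== BOINC :: Watchdog shutting down... "
    "# cpu_run_time_pref: 21600 ====== DONE :: 1 starting structures 23460.2 cpu seconds "
    "This process generated 1 decoys from 1 attempts ====== BOINC :: Watchdog shutting down..."
)

sections = done_sections(sample)
total_decoys = sum(s["decoys"] for s in sections)
print(len(sections), "DONE sections,", total_decoys, "decoys in total")
```

More than one entry in `sections` marks a task affected by the issue discussed here. Whether the decoys from the earlier section reach the project database cannot be told from stderr alone.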
JStateson Joined: 7 May 07 Posts: 15 Credit: 4,061,331 RAC: 0 |
Possibly, and this is really a guess, this bug is caused by not running the CPU at 100% utilization. I have been reading over at Einstein that they identified a problem in checkpointing. When the CPU is descheduled by the < 100% rule and the task is later resumed, it could not find the checkpoint (there was none) and exited. Possibly something like this happened on Rosetta and the task simply started from scratch. I am pretty sure that I had been setting my systems to 95% utilization. I have since changed back to 100% after reading the warning at Einstein, where I posted this problem. You can also look for the thread "solved" there, which relates to the 100% requirement. |
Aegis Maelstrom Joined: 29 Oct 08 Posts: 61 Credit: 2,137,555 RAC: 0 |
Possibly - and this is really a guess, it might be that this bug is caused by not running the CPU at 100% utilization. Nah, I don't think that's it - I have 100% CPU utilization set. But I think you are right that it is some kind of checkpointing problem. Look at my previous post: it looks like it made a checkpoint after 8 decoys, before the six hours for the WU were up, and then the WU ran longer than the scheduled time, just because this decoy took some more time... I don't know why the scheduler thought it could make an additional decoy within the time... I'm not even sure if this CPU time number is perfectly correct... :/ Maybe there is something with the preferences (now I have 6 hrs for WU, 3 hrs for rotation - but I'm not sure if there actually was a QMC unit to rotate with, so maybe it was run all the time without a break...). Or maybe it is a different kind of bug. Certainly this is quite annoying - not only do you see that your machine has quite limited crunching power, but then it is "robbed" here and there. :] However, the transfer of generated but "ignored" results into the results database is most important. |
Aegis Maelstrom Joined: 29 Oct 08 Posts: 61 Credit: 2,137,555 RAC: 0 |
O.K., I think I've caught this error while it was just happening. It's a different machine, a Pentium IV with BOINC 5.10.15. I've waited for this Rosetta task to finally get done after circa 6 hrs of work. Finally the task got halted after 5:52 of runtime and finishing the 15th model, and the other project started. However, to my surprise, the task has not been sent to the server - it was still waiting for some more crunching! I wanted to complete it and see the results, so I halted the other tasks and started this WU. Then it attempted to crunch... but from model one, probably step one. The stage was named "urk", whatever it means, and everything looked like an error. The graphics seemed to be wrong as well - at first nothing, only lines of energy and RMSD, then the picture was moved so one could see only a part of the protein, and then it got O.K. The progress dropped to 58%. I've switched off the client and started writing this bug report. I am pasting here the stderr.txt of this WU: BOINC:: Initializing ... ok. [2009- 2-25 11:47:10:] :: BOINC :: boinc_init() BOINC:: Setting up shared resources ... ok. BOINC:: Setting up semaphores ... ok. BOINC:: Updating status ... ok. BOINC:: Registering timer callback... ok. BOINC:: Worker initialized successfully. Registering options.. Registered extra options. Initializing core... Initializing options.... ok Loaded options.... ok Processed options.... ok Initializing random generators... ok Initialization complete. Setting WU description ... Unpacking zip data: ../../projects/boinc.bakerlab.org_rosetta/minirosetta_database_rev26003.zip <unzip> <-oq> <../../projects/boinc.bakerlab.org_rosetta/minirosetta_database_rev26003.zip> <-d./> Firstarg=true; pp=-d./ firstarg: <-d./> End of unzipping. Unpacking WU data ... 
Unpacking data: ../../projects/boinc.bakerlab.org_rosetta/hw_mamaln_t290_3.loopbuild_SAVEALLOUT.1lop_.mtyka.boinc_files.zip <unzip> <-oq> <../../projects/boinc.bakerlab.org_rosetta/hw_mamaln_t290_3.loopbuild_SAVEALLOUT.1lop_.mtyka.boinc_files.zip> <-d./> Firstarg=true; pp=-d./ firstarg: <-d./> End of unzipping. Setting database description ... Setting up checkpointing ... BOINC:: Worker startup. Starting watchdog... Watchdog active. # cpu_run_time_pref: 21600 BOINC:: Initializing ... ok. [2009- 2-25 15:28:41:] :: BOINC :: boinc_init() BOINC:: Setting up shared resources ... ok. BOINC:: Setting up semaphores ... ok. BOINC:: Updating status ... ok. BOINC:: Registering timer callback... ok. BOINC:: Worker initialized successfully. Registering options.. Registered extra options. Initializing core... Initializing options.... ok Loaded options.... ok Processed options.... ok Initializing random generators... ok Initialization complete. Setting WU description ... Unpacking zip data: ../../projects/boinc.bakerlab.org_rosetta/minirosetta_database_rev26003.zip <unzip> <-oq> <../../projects/boinc.bakerlab.org_rosetta/minirosetta_database_rev26003.zip> <-d./> Firstarg=true; pp=-d./ firstarg: <-d./> End of unzipping. Unpacking WU data ... Unpacking data: ../../projects/boinc.bakerlab.org_rosetta/hw_mamaln_t290_3.loopbuild_SAVEALLOUT.1lop_.mtyka.boinc_files.zip <unzip> <-oq> <../../projects/boinc.bakerlab.org_rosetta/hw_mamaln_t290_3.loopbuild_SAVEALLOUT.1lop_.mtyka.boinc_files.zip> <-d./> Firstarg=true; pp=-d./ firstarg: <-d./> End of unzipping. Setting database description ... Setting up checkpointing ... BOINC:: Worker startup. Starting watchdog... Watchdog active. # cpu_run_time_pref: 21600 BOINC:: Initializing ... ok. [2009- 2-25 17:21:57:] :: BOINC :: boinc_init() BOINC:: Setting up shared resources ... ok. BOINC:: Setting up semaphores ... ok. BOINC:: Updating status ... ok. BOINC:: Registering timer callback... ok. BOINC:: Worker initialized successfully. 
Registering options.. Registered extra options. Initializing core... Initializing options.... ok Loaded options.... ok Processed options.... ok Initializing random generators... ok Initialization complete. Setting WU description ... Unpacking zip data: ../../projects/boinc.bakerlab.org_rosetta/minirosetta_database_rev26003.zip <unzip> <-oq> <../../projects/boinc.bakerlab.org_rosetta/minirosetta_database_rev26003.zip> <-d./> Firstarg=true; pp=-d./ firstarg: <-d./> End of unzipping. Unpacking WU data ... Unpacking data: ../../projects/boinc.bakerlab.org_rosetta/hw_mamaln_t290_3.loopbuild_SAVEALLOUT.1lop_.mtyka.boinc_files.zip <unzip> <-oq> <../../projects/boinc.bakerlab.org_rosetta/hw_mamaln_t290_3.loopbuild_SAVEALLOUT.1lop_.mtyka.boinc_files.zip> <-d./> Firstarg=true; pp=-d./ firstarg: <-d./> End of unzipping. Setting database description ... Setting up checkpointing ... BOINC:: Worker startup. Starting watchdog... Watchdog active. # cpu_run_time_pref: 21600 Continuing computation from checkpoint: chk_chk1_FastRelax__S_1ist_1_00010_fa ... success! Continuing computation from checkpoint: chk_chk2_FastRelax__S_1ist_1_00010_fa ... success! Continuing computation from checkpoint: chk_chk3_FastRelax__S_1ist_1_00010_fa ... success! Continuing computation from checkpoint: chk_chk4_FastRelax__S_1ist_1_00010_fa ... success! Continuing computation from checkpoint: chk_chk5_FastRelax__S_1ist_1_00010_fa ... success! BOINC:: Initializing ... ok. [2009- 2-25 19:27:52:] :: BOINC :: boinc_init() BOINC:: Setting up shared resources ... ok. BOINC:: Setting up semaphores ... ok. BOINC:: Updating status ... ok. BOINC:: Registering timer callback... ok. BOINC:: Worker initialized successfully. Registering options.. Registered extra options. Initializing core... Initializing options.... ok Loaded options.... ok Processed options.... ok Initializing random generators... ok Initialization complete. Setting WU description ... 
Unpacking zip data: ../../projects/boinc.bakerlab.org_rosetta/minirosetta_database_rev26003.zip <unzip> <-oq> <../../projects/boinc.bakerlab.org_rosetta/minirosetta_database_rev26003.zip> <-d./> Firstarg=true; pp=-d./ firstarg: <-d./> End of unzipping. Unpacking WU data ... Unpacking data: ../../projects/boinc.bakerlab.org_rosetta/hw_mamaln_t290_3.loopbuild_SAVEALLOUT.1lop_.mtyka.boinc_files.zip <unzip> <-oq> <../../projects/boinc.bakerlab.org_rosetta/hw_mamaln_t290_3.loopbuild_SAVEALLOUT.1lop_.mtyka.boinc_files.zip> <-d./> Firstarg=true; pp=-d./ firstarg: <-d./> End of unzipping. Setting database description ... Setting up checkpointing ... BOINC:: Worker startup. Starting watchdog... Watchdog active. # cpu_run_time_pref: 21600 Continuing computation from checkpoint: chk_chk1_FastRelax__S_1ist_1_00012_fa ... success! Continuing computation from checkpoint: chk_chk2_FastRelax__S_1ist_1_00012_fa ... success! Continuing computation from checkpoint: chk_chk3_FastRelax__S_1ist_1_00012_fa ... success! Continuing computation from checkpoint: chk_chk4_FastRelax__S_1ist_1_00012_fa ... success! Continuing computation from checkpoint: chk_chk5_FastRelax__S_1ist_1_00012_fa ... success! Continuing computation from checkpoint: chk_chk6_FastRelax__S_1ist_1_00012_fa ... success! Continuing computation from checkpoint: chk_chk7_FastRelax__S_1ist_1_00012_fa ... success! ====================================================== DONE :: 1 starting structures 21149.3 cpu seconds This process generated 15 decoys from 15 attempts ====================================================== BOINC :: Watchdog shutting down... BOINC:: Initializing ... ok. [2009- 2-25 20:57:32:] :: BOINC :: boinc_init() BOINC:: Setting up shared resources ... ok. BOINC:: Setting up semaphores ... ok. BOINC:: Updating status ... ok. BOINC:: Registering timer callback... ok. BOINC:: Worker initialized successfully. Registering options.. Registered extra options. Initializing core... Initializing options.... 
ok Loaded options.... ok Processed options.... ok Initializing random generators... ok Initialization complete. Setting WU description ... Unpacking zip data: ../../projects/boinc.bakerlab.org_rosetta/minirosetta_database_rev26003.zip <unzip> <-oq> <../../projects/boinc.bakerlab.org_rosetta/minirosetta_database_rev26003.zip> <-d./> Firstarg=true; pp=-d./ firstarg: <-d./> End of unzipping. Unpacking WU data ... Unpacking data: ../../projects/boinc.bakerlab.org_rosetta/hw_mamaln_t290_3.loopbuild_SAVEALLOUT.1lop_.mtyka.boinc_files.zip <unzip> <-oq> <../../projects/boinc.bakerlab.org_rosetta/hw_mamaln_t290_3.loopbuild_SAVEALLOUT.1lop_.mtyka.boinc_files.zip> <-d./> Firstarg=true; pp=-d./ firstarg: <-d./> End of unzipping. Setting database description ... Setting up checkpointing ... BOINC:: Worker startup. Starting watchdog... Watchdog active. # cpu_run_time_pref: 21600 As you can see, one table reporting 15 decoys has already been generated. I suppose that after some (mal)crunching a second table would be generated and hopefully both of them would be reported - but the first one would be ignored. Something obviously went wrong, and I would like to report this WU properly and show the generated results (I don't want them to be wasted, they are pretty good). I have to turn off this machine anyway, so I can wait, but could you tell me how I should transfer the results? Should I edit the stderr.txt file and cut off the latest lines? How could I force BOINC to just send back the existing 15 decoys? Mod.Sense? Anybody? |
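The pattern in the log above (several application starts, one DONE table, then yet another start) can be picked out mechanically. Below is a small sketch, assuming the timestamp format seen in this stderr.txt; `INIT_RE`, `start_times`, `restarted_after_done` and `sample_log` are illustrative names, and `sample_log` is a heavily abbreviated stand-in for the full log.

```python
import re
from datetime import datetime

# Matches e.g. "[2009- 2-25 11:47:10:] :: BOINC :: boinc_init()" as seen in the log above.
INIT_RE = re.compile(
    r"\[(\d{4})-\s*(\d{1,2})-\s*(\d{1,2})\s+(\d{1,2}):(\d{2}):(\d{2}):\]"
    r"\s*:: BOINC :: boinc_init\(\)"
)
DONE_RE = re.compile(r"DONE ::")

def start_times(stderr_text):
    """Timestamps of every application start recorded in the log."""
    return [datetime(*map(int, m.groups())) for m in INIT_RE.finditer(stderr_text)]

def restarted_after_done(stderr_text):
    """True if the application was started again after a DONE table was written."""
    done = DONE_RE.search(stderr_text)
    if done is None:
        return False
    # Look for a boinc_init() line anywhere after the first DONE marker.
    return INIT_RE.search(stderr_text, done.end()) is not None

# Abbreviated stand-in for the log quoted above: only the key lines are kept.
sample_log = (
    "BOINC:: Initializing ... ok. [2009- 2-25 11:47:10:] :: BOINC :: boinc_init() ... "
    "[2009- 2-25 15:28:41:] :: BOINC :: boinc_init() ... "
    "[2009- 2-25 17:21:57:] :: BOINC :: boinc_init() ... "
    "[2009- 2-25 19:27:52:] :: BOINC :: boinc_init() ... "
    "DONE :: 1 starting structures 21149.3 cpu seconds "
    "This process generated 15 decoys from 15 attempts ... "
    "[2009- 2-25 20:57:32:] :: BOINC :: boinc_init() ..."
)

starts = start_times(sample_log)
print(len(starts), "starts; restarted after DONE:", restarted_after_done(sample_log))
```

A start after a DONE table is the signature of the problem reported in this thread. The check only describes what the log shows and says nothing about the cause.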
Mod.Sense Volunteer moderator Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
How could I force BOINC to... In general, trying to force things results in disappointment and frustration. What Rosetta version is the task you are describing? The graphic always takes a minute or so to look right as a task starts. But I agree, if you had 15 models done, it should not have been starting back at model 1. Did it ever show a status of uploading and 100%?? Sometimes BOINC interrupts tasks just as they are ending. So they reach 100% (or 99.5555% rounds up to 100%) but aren't quite finished. The task records its last checkpoint and then would end, but BOINC wants to schedule another project as soon as it sees the checkpoint taken. But it should later return to the task (even if forced by you suspending other tasks) and finish out normally. Rosetta Moderator: Mod.Sense |
Aegis Maelstrom Joined: 29 Oct 08 Posts: 61 Credit: 2,137,555 RAC: 0 |
How could I force BOINC to... You are probably right. :):) However, my BOINC manager refuses to negotiate and flattery does not work either. ;) What Rosetta version is the task you are describing? Mini 1.54. The manager is 5.10.45, but as I've said, the original double results error has been seen on later managers as well. The graphic always takes a minute or so to look right as a task starts. But I agree, if you had 15 models done, it should not have been starting back at model 1. Hmm... what really bugged me was some misplacement of the folding protein. It was seen only in one quarter; the rest was out of the box where it was supposed to be. But it's just graphics anyway. And yet there is this "urk" thing...
Did it ever show a status of uploading and 100%?? No, as far as I remember it didn't - it was at about 98.8%, I guess. I wanted to play it safe and send it back - that's why I halted the other tasks and made BOINC return to this WU and finish it. The worst thing I'd have suspected would be crunching another decoy (however, it was obvious there was no time within the set 6 hrs for that). To my surprise, I've seen this Model 1 - and then this table with results in the output file. Best from Warsaw, a.m. |
Aegis Maelstrom Joined: 29 Oct 08 Posts: 61 Credit: 2,137,555 RAC: 0 |
Alright, I couldn't wait longer for some advice. I haven't engineered the out file, just accepted the loss and finished the WU. Here you have it, 15 results ignored, the second bracket with 1 result accepted. I hope the remaining 15 results (IMVHO they looked nice) went to the database and are scientifically used. If not, well, please correct this bug in the future. Best Regards from Warsaw, a.m. |
©2024 University of Washington
https://www.bakerlab.org