Message boards : Number crunching : Granted Credit taking forever....
Author | Message |
---|---|
Path7 Send message Joined: 25 Aug 07 Posts: 128 Credit: 61,751 RAC: 0 |
Hello all, The Wu's with validate state: Workunit error - check skipped; now have credit granted. Claimed credit = Granted credit. Thanks team. I do wonder if these Wu's still have any scientific value. Path7. |
Yifan Song Volunteer moderator Project developer Project scientist Send message Joined: 26 May 09 Posts: 62 Credit: 7,322 RAC: 0 |
I actually do still use these data. I just need to figure out a way to make them give me back fewer but better structures in the future. |
P . P . L . Send message Joined: 20 Aug 06 Posts: 581 Credit: 4,865,274 RAC: 0 |
The jobs that are canceled are the ones that created the IO problem for the server. DEK and I thought that if we removed the job, it would stop the validator server from processing it. But it turned out it didn't. So we'll have to wait for the server to finish processing the rest of the data. Are those the ones I saw that were up to 11 MB result files? |
Mark Brown Send message Joined: 8 Aug 09 Posts: 21 Credit: 602,685 RAC: 0 |
Hello all, Not all have credit: https://boinc.bakerlab.org/rosetta/result.php?resultid=280016038 |
Yifan Song Volunteer moderator Project developer Project scientist Send message Joined: 26 May 09 Posts: 62 Credit: 7,322 RAC: 0 |
The jobs that are canceled are the ones that created the IO problem for the server. DEK and I thought that if we removed the job, it would stop the validator server from processing it. But it turned out it didn't. So we'll have to wait for the server to finish processing the rest of the data. Yep, they are large because those are often large protein complexes. Plus we needed to save the full Cartesian coordinates for this system. As for how credit is handled here, let me check with DEK. |
Trotador Send message Joined: 30 May 09 Posts: 108 Credit: 291,214,977 RAC: 0 |
Those 0 work units were granted credit at some point today. They are strange ones. The work units only ran for about an hour and they had really weird names and the graphics were strange looking too, maybe that is why they show up like they do. Right, mine too; most of them had actually run much longer than half an hour, however. Let's hope the team can bring out useful science from them. Thanks |
Mark Brown Send message Joined: 8 Aug 09 Posts: 21 Credit: 602,685 RAC: 0 |
Those 0 work units were granted credit at some point today. They are strange ones. The work units only ran for about an hour and they had really weird names and the graphics were strange looking too, maybe that is why they show up like they do. I'm backing up again. The pending I understand, but why so many 0 credits? Task ID: 280068534; Name: 1STF.bound.mppk.min.pdb_dock_score12_ddg.xml_yfsong_14675_2669_0; Workunit: 255383041; CPU time: 7406.906; Outcome: Success; Client state: Done; Validate state: Workunit error - check skipped; Claimed credit: 19.27640105732; Granted credit: 0 |
Yifan Song Volunteer moderator Project developer Project scientist Send message Joined: 26 May 09 Posts: 62 Credit: 7,322 RAC: 0 |
I'm seeing some odd behavior on the server side. Half of the jobs came back with just a quarter of the results compared to the other half, and the good and bad ones alternate in file name order. Somehow those jobs are not giving the results back and are not given credit either. I wonder if it is a bug in the validator. Let me spend some time today to dig a little deeper. Here is the number of results I get back for each sub-batch (count, then sub-batch name): 188812 1A0O; 47621 1ACB; 231038 1AHW; 45743 1ATN; 231584 1AVW; 43186 1AVZ; 229673 1BQL; 48359 1BRC; 233888 1BRS; 47867 1BVK; 229795 1CGI; 49298 1CHO; 228705 1CSE; 46636 1DFJ; 229493 1DQJ; 47060 1EFU; 223912 1EO8; 48390 1FBI; 219990 1FIN; 45040 1FQ1; 213427 1FSS; 45614 1GLA; 195231 1GOT; 47590 1IAI; 223155 1IGC; 47341 1JHL; 224621 1MAH; 44385 1MDA; 236081 1MEL; 48276 1MLC; 203274 1NCA; 43052 1NMB; 231571 1PPE; 47207 1QFU; 228290 1SPB; 48206 1STF; 226570 1TAB; 48645 1TGS; 236948 1UDI; 49967 1UGH; 229513 1WEJ; 43105 1WQ1; 219482 2BTF; 48810 2JEL; 234686 2KAI; 48160 2PCC; 230399 2PTC; 48777 2SIC; 231341 2SNI; 48764 2TEC; 224725 2VIR; 47338 3HHR; 233194 4HTC |
Gen_X_Accord Send message Joined: 5 Jun 06 Posts: 154 Credit: 279,018 RAC: 0 |
I actually do still use these data. I just need to figure out a way to make them give me back fewer but better structures in the future. Or build a new validator server that can handle these intense work units. I'm sure there would be a few opinions as to exactly which processors and memory you should choose to build a new one too. |
Yifan Song Volunteer moderator Project developer Project scientist Send message Joined: 26 May 09 Posts: 62 Credit: 7,322 RAC: 0 |
I actually do still use these data. I just need to figure out a way to make them give me back fewer but better structures in the future. DEK and I thought about this too. But since the validator is spending most of its time reading and merging the data, the bottleneck is the disk IO. Adding another server wouldn't help much since the data is stored on the same file system. The only way to improve the rate of processing is to add another file system and divide the validator to work on multiple file systems. This is a lot harder to do and has a high potential to screw up the entire R@H server. So we decided not to do that at this moment. |
Michael H.W. Weber Send message Joined: 18 Sep 05 Posts: 13 Credit: 6,672,462 RAC: 0 |
Hello all, Not for me. I hooked up my new AMD 955 BE (4x 3.2 GHz) to Rosetta@home on the 9th of September. Since that time, I have returned 220 WUs; the machine is processing 24/7 for your project. So far, only 18 (!!!) jobs have been handled by the server - the rest are set to "pending". For an additional 4 jobs, credit was set to ZERO for no obvious reason. Those 4 tasks are: https://boinc.bakerlab.org/rosetta/result.php?resultid=279920775 https://boinc.bakerlab.org/rosetta/result.php?resultid=279914807 https://boinc.bakerlab.org/rosetta/result.php?resultid=279861161 https://boinc.bakerlab.org/rosetta/result.php?resultid=279861159 None of these was cancelled on my side. I would really like to know what is going on here. I was wondering whether it might have something to do with the operating system which I use (it is Win XP Pro x64)? Are there stricter homogeneous redundancy validation checks enabled such that these WUs are only validated correctly when processed by another 64 bit Win XP? That might cause a significant slowdown during the validation process. If you cannot solve this problem quickly, please let me know ASAP because in that case I will have to move my systems to a more productive project due to limited electricity funds. Unlike other DC projects you have RALPH as a good testing environment to make sure no such problems occur in the productive Rosetta@home environment. In the future, please make better use of that. If you do not have enough processing power with RALPH, please also let me know such that I can put some systems on that project (then at least I know I have to expect issues). Michael. President of Rechenkraft.net e.V. http://www.rechenkraft.net - The world's first and largest distributed computing association. We make those things possible that supercomputers don't. |
LizzieBarry Send message Joined: 25 Feb 08 Posts: 76 Credit: 201,862 RAC: 0 |
Do I spy a second validator, rah_validator_mini on server bk1? Good luck with that as I can't see a lot of catching up with the existing validation. |
Michael G.R. Send message Joined: 11 Nov 05 Posts: 264 Credit: 11,247,510 RAC: 0 |
The TeraFLOPS estimate on the frontpage is down to 9, which probably means that nobody's WUs are getting validated right now. I suspect that they'll fix it soon and it will process the backlog. |
tiger Send message Joined: 16 Jul 06 Posts: 17 Credit: 1,083,385 RAC: 0 |
For a project that aspires to reach 150 Tflops, I think a new attitude is needed. One does not just accidentally stumble upon success, no matter what the goal is. The TeraFLOPS estimate on the frontpage is down to 9, which probably means that nobody's WUs are getting validated right now. |
Greg_BE Send message Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
The project seems to be more focused on the scientific outcome rather than the IT credits and teraflops performance. It seems that there is one major IT person in the project trying to keep up with it all and then there are others that help him. They are doing their best with what people they have. Hopefully they will learn that they need a more active IT approach to keep this project rolling smoothly. The science results will only come from those that stay or new people that join and crunch, but if IT troubles drive them away it's a big loss for the science that this project is working on. For a project that aspires to reach 150 Tflops, I think a new attitude is needed. One does not just accidentally stumble upon success, no matter what the goal is. |
Yifan Song Volunteer moderator Project developer Project scientist Send message Joined: 26 May 09 Posts: 62 Credit: 7,322 RAC: 0 |
The validator server has pretty much cleaned up the job that created the IO problem. Now it is catching up with the rest of the jobs. It's a little hard to estimate how long that is going to take, but hopefully the worst is over. I agree that we need to somehow balance our effort between science and IT. I'm still relatively new to this team and still feeling my way through the IT part of the project. Hopefully over time, I'll be able to help DEK on this. |
LizzieBarry Send message Joined: 25 Feb 08 Posts: 76 Credit: 201,862 RAC: 0 |
For a project that aspires to reach 150 Tflops, I think a new attitude is needed. One does not just accidentally stumble upon success, no matter what the goal is. I agree, Greg, and that's exactly as it should be too. I'm here because I believe in the project, not because I believe in the volume of credits it offers. The science results will only come from those that stay or new people that join and crunch, but if IT troubles drive them away it's a big loss for the science that this project is working on. Agreed again, but if someone was genuinely driven away by the slowness of awarding credits, that would be quite facile. At some point the current issues will clear up and everyone who stayed will be rewarded for their persistence (and those who walked away won't). That seems quite equitable to me. The validator server has pretty much cleaned up the job that created the IO problem. Now it is catching up with the rest of the jobs. It's a little hard to estimate how long that is going to take, but hopefully the worst is over. Thanks Yifan. Let's hope so. Though I note the bk1 and bk2 servers aren't running right now. Part of the problem or part of the solution? |
David E K Volunteer moderator Project administrator Project developer Project scientist Send message Joined: 1 Jul 05 Posts: 1018 Credit: 4,334,829 RAC: 0 |
You can ignore the server status page for now. I stopped the non-minirosetta daemons and fired up more assimilators and validators for the minirosetta jobs. 8 assimilators and 4 validators are running on bk1 and bk2. The load on these servers is very high and we're doing what we can with what we have. The only issue is pending credits. Users will just have to wait a bit longer for their credits to be awarded as our system catches up. The more important issue is that our work unit generators continue to make new work, and on that front we're doing fine. |
Yifan Song Volunteer moderator Project developer Project scientist Send message Joined: 26 May 09 Posts: 62 Credit: 7,322 RAC: 0 |
It's part of the solution. DEK rearranged the validator servers a bit. They are just temporarily not showing properly on the webpage. |
Greg_BE Send message Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
You can ignore the server status page for now. I stopped the non-minirosetta daemons and fired up more assimilators and validators for the minirosetta jobs. 8 assimilators and 4 validators are running on bk1 and bk2. The load on these servers is very high and we're doing what we can with what we have. Kind of bouncing back and forth with the various servers these days, it seems. Fighting between work generation and then some unchecked code and now the validators. Being that things supposedly happen in 3's (so to speak), the problems should theoretically be over (knock on wood, fingers crossed and all that). Hope to see some stability in the project before the year ends... good luck keeping up with it all. You are doing a good job for one or two people. |
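
Yifan Song's observation earlier in the thread (roughly half of the sub-batches returned about a quarter as many results as the other half, alternating in file name order) can be checked directly against the counts posted there. The following is a minimal sketch, not anything the project actually ran: the function names and the 100,000 cut-off separating "high" from "low" sub-batches are assumptions chosen for illustration, and only the count/name pairs come from the post.

```python
# Result counts per sub-batch, copied from the post in file name order.
RAW = """
188812 1A0O  47621 1ACB 231038 1AHW  45743 1ATN 231584 1AVW  43186 1AVZ
229673 1BQL  48359 1BRC 233888 1BRS  47867 1BVK 229795 1CGI  49298 1CHO
228705 1CSE  46636 1DFJ 229493 1DQJ  47060 1EFU 223912 1EO8  48390 1FBI
219990 1FIN  45040 1FQ1 213427 1FSS  45614 1GLA 195231 1GOT  47590 1IAI
223155 1IGC  47341 1JHL 224621 1MAH  44385 1MDA 236081 1MEL  48276 1MLC
203274 1NCA  43052 1NMB 231571 1PPE  47207 1QFU 228290 1SPB  48206 1STF
226570 1TAB  48645 1TGS 236948 1UDI  49967 1UGH 229513 1WEJ  43105 1WQ1
219482 2BTF  48810 2JEL 234686 2KAI  48160 2PCC 230399 2PTC  48777 2SIC
231341 2SNI  48764 2TEC 224725 2VIR  47338 3HHR 233194 4HTC
"""

# Assumed cut-off between "high" and "low" sub-batches (not from the thread).
THRESHOLD = 100_000


def parse_counts(raw):
    """Turn 'count name count name ...' text into a list of (name, count)."""
    tokens = raw.split()
    return [(tokens[i + 1], int(tokens[i])) for i in range(0, len(tokens), 2)]


def alternates(counts, threshold):
    """True if consecutive sub-batches always fall on opposite sides of threshold."""
    flags = [n >= threshold for _, n in counts]
    return all(a != b for a, b in zip(flags, flags[1:]))


counts = parse_counts(RAW)
high = [n for _, n in counts if n >= THRESHOLD]
low = [n for _, n in counts if n < THRESHOLD]

# Ratio of the average low count to the average high count.
ratio = (sum(low) / len(low)) / (sum(high) / len(high))

print(f"sub-batches: {len(counts)} ({len(high)} high, {len(low)} low)")
print(f"strict high/low alternation: {alternates(counts, THRESHOLD)}")
print(f"mean low / mean high: {ratio:.2f}")
```

A check like this only confirms the pattern in the posted numbers; it says nothing about whether the cause lies in the validator or elsewhere.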
©2024 University of Washington
https://www.bakerlab.org