Granted Credit taking forever....

Path7

Joined: 25 Aug 07
Posts: 128
Credit: 61,751
RAC: 0
Message 63292 - Posted: 11 Sep 2009, 23:55:10 UTC

Hello all,
The WUs with validate state "Workunit error - check skipped" now have credit granted.
Claimed credit = Granted credit.
Thanks team.
I do wonder if these WUs still have any scientific value.

Path7.
Yifan Song
Volunteer moderator
Project developer
Project scientist

Joined: 26 May 09
Posts: 62
Credit: 7,322
RAC: 0
Message 63293 - Posted: 12 Sep 2009, 1:00:39 UTC

I actually do still use these data. I just need to figure out a way to make them give me back fewer but better structures in the future.
P . P . L .

Joined: 20 Aug 06
Posts: 581
Credit: 4,865,274
RAC: 0
Message 63294 - Posted: 12 Sep 2009, 1:14:13 UTC - in response to Message 63291.  

The jobs that were canceled are the ones that created the IO problem for the server. DEK and I thought that if we removed the job, it would stop the validator server from processing it. But it turned out it didn't. So we'll have to wait for the server to finish processing the rest of the data.


Are those the ones I saw that had result files of up to 11 MB?

Mark Brown

Joined: 8 Aug 09
Posts: 21
Credit: 602,685
RAC: 0
Message 63295 - Posted: 12 Sep 2009, 1:14:14 UTC - in response to Message 63292.  

Hello all,
The WUs with validate state "Workunit error - check skipped" now have credit granted.
Claimed credit = Granted credit.
Thanks team.
I do wonder if these WUs still have any scientific value.

Path7.


Not all have credit:
https://boinc.bakerlab.org/rosetta/result.php?resultid=280016038
Yifan Song
Volunteer moderator
Project developer
Project scientist

Joined: 26 May 09
Posts: 62
Credit: 7,322
RAC: 0
Message 63297 - Posted: 12 Sep 2009, 1:27:55 UTC - in response to Message 63294.  

The jobs that were canceled are the ones that created the IO problem for the server. DEK and I thought that if we removed the job, it would stop the validator server from processing it. But it turned out it didn't. So we'll have to wait for the server to finish processing the rest of the data.


Are those the ones I saw that had result files of up to 11 MB?


Yep. They are large because those are often large protein complexes. Plus we needed to save the full Cartesian coordinates for this system.
As for how credit is handled here, let me check with DEK.
Trotador

Joined: 30 May 09
Posts: 108
Credit: 290,158,311
RAC: 277,620
Message 63302 - Posted: 12 Sep 2009, 7:28:12 UTC - in response to Message 63290.  

Those 0 work units were granted credit at some point today. They are strange ones. The work units only ran for about an hour and they had really weird names and the graphics were strange looking too, maybe that is why they show up like they do.


Right, mine too. Most of them had actually run much longer than half an hour, however. Let's hope the team can bring out useful science from them.

Thanks
Mark Brown

Joined: 8 Aug 09
Posts: 21
Credit: 602,685
RAC: 0
Message 63304 - Posted: 12 Sep 2009, 12:19:43 UTC - in response to Message 63302.  

Those 0 work units were granted credit at some point today. They are strange ones. The work units only ran for about an hour and they had really weird names and the graphics were strange looking too, maybe that is why they show up like they do.


Right, mine too. Most of them had actually run much longer than half an hour, however. Let's hope the team can bring out useful science from them.

Thanks


I'm backing up again. The pending I understand, but why so many 0 credits?

Task ID: 280068534
Name: 1STF.bound.mppk.min.pdb_dock_score12_ddg.xml_yfsong_14675_2669_0
Workunit: 255383041
CPU time: 7406.906
Outcome: Success
Client state: Done
Validate state: Workunit error - check skipped
Claimed credit: 19.27640105732
Granted credit: 0

Yifan Song
Volunteer moderator
Project developer
Project scientist

Joined: 26 May 09
Posts: 62
Credit: 7,322
RAC: 0
Message 63305 - Posted: 12 Sep 2009, 19:11:23 UTC
Last modified: 12 Sep 2009, 19:13:21 UTC

I'm seeing some odd behavior on the server side. Half of the jobs came back with just a quarter of the results compared to the other half, and the good and bad ones alternate in file name order. Somehow those jobs are not giving the results back and are not being given credit either. I wonder if it is a bug in the validator. Let me spend some time today to dig a little deeper.

Here is the number of results I get back for each sub-batch.
188812 1A0O
47621 1ACB
231038 1AHW
45743 1ATN
231584 1AVW
43186 1AVZ
229673 1BQL
48359 1BRC
233888 1BRS
47867 1BVK
229795 1CGI
49298 1CHO
228705 1CSE
46636 1DFJ
229493 1DQJ
47060 1EFU
223912 1EO8
48390 1FBI
219990 1FIN
45040 1FQ1
213427 1FSS
45614 1GLA
195231 1GOT
47590 1IAI
223155 1IGC
47341 1JHL
224621 1MAH
44385 1MDA
236081 1MEL
48276 1MLC
203274 1NCA
43052 1NMB
231571 1PPE
47207 1QFU
228290 1SPB
48206 1STF
226570 1TAB
48645 1TGS
236948 1UDI
49967 1UGH
229513 1WEJ
43105 1WQ1
219482 2BTF
48810 2JEL
234686 2KAI
48160 2PCC
230399 2PTC
48777 2SIC
231341 2SNI
48764 2TEC
224725 2VIR
47338 3HHR
233194 4HTC
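
The alternation in the counts above can be checked with a short script. This is only a sketch: it uses just the first six rows of the list, and the names `RAW`, `parse_counts` and `alternating_ratio` are made up for illustration rather than taken from any project tool.

```python
# Subset of the per-sub-batch result counts listed above (first six rows only).
RAW = """\
188812 1A0O
47621 1ACB
231038 1AHW
45743 1ATN
231584 1AVW
43186 1AVZ
"""


def parse_counts(text):
    """Return a list of (name, count) pairs from 'count name' lines."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        count, name = line.split()
        rows.append((name, int(count)))
    return rows


def alternating_ratio(rows):
    """Mean count at every second row divided by mean count at the rows in between."""
    high = [count for _, count in rows[0::2]]  # 1st, 3rd, 5th ... sub-batch
    low = [count for _, count in rows[1::2]]   # 2nd, 4th, 6th ... sub-batch
    return (sum(low) / len(low)) / (sum(high) / len(high))


rows = parse_counts(RAW)
ratio = alternating_ratio(rows)
print(f"low/high ratio: {ratio:.2f}")
```

Pasting the full list into `RAW` would give the ratio for the whole batch.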
Gen_X_Accord

Joined: 5 Jun 06
Posts: 154
Credit: 279,018
RAC: 0
Message 63309 - Posted: 12 Sep 2009, 22:47:43 UTC - in response to Message 63293.  

I actually do still use these data. I just need to figure out a way to make them give me back fewer but better structures in the future.


Or build a new validator server that can handle these intense work units. I'm sure there would be a few opinions as to exactly which processors and memory you should choose to build a new one, too.
Yifan Song
Volunteer moderator
Project developer
Project scientist

Joined: 26 May 09
Posts: 62
Credit: 7,322
RAC: 0
Message 63310 - Posted: 13 Sep 2009, 6:42:02 UTC - in response to Message 63309.  
Last modified: 13 Sep 2009, 6:42:23 UTC

I actually do still use these data. I just need to figure out a way to make them give me back fewer but better structures in the future.


Or build a new validator server that can handle these intense work units. I'm sure there would be a few opinions as to exactly which processors and memory you should choose to build a new one, too.


DEK and I thought about this too. But since the validator spends most of its time reading and merging the data, the bottleneck is the disk IO. Adding another server wouldn't help much since the data is stored on the same file system. The only way to improve the rate of processing is to add another file system and divide the validator to work on multiple file systems. This is a lot harder to do and has a high potential to screw up the entire R@H server. So we decided not to do that at this moment.
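
The reasoning above can be illustrated with a toy model. This is a sketch under stated assumptions, not the project's validator code: the function `validator_throughput` and every rate in it are hypothetical numbers chosen only to show why extra servers do not help while the shared file system is the limit.

```python
def validator_throughput(workers, per_worker_rate, disk_rate_per_fs, file_systems=1):
    """Results per second in a toy model.

    Workers can process at most workers * per_worker_rate results per second,
    but the file systems they read from can deliver at most
    file_systems * disk_rate_per_fs. The smaller of the two is the throughput.
    """
    cpu_bound = workers * per_worker_rate
    io_bound = file_systems * disk_rate_per_fs
    return min(cpu_bound, io_bound)


# Hypothetical rates: each server could validate 500 results/s,
# but one file system can only feed 100 results/s.
one_server = validator_throughput(workers=1, per_worker_rate=500, disk_rate_per_fs=100)
two_servers = validator_throughput(workers=2, per_worker_rate=500, disk_rate_per_fs=100)
two_file_systems = validator_throughput(
    workers=2, per_worker_rate=500, disk_rate_per_fs=100, file_systems=2
)

print(one_server, two_servers, two_file_systems)
```

In this model a second server changes nothing, while a second file system doubles the rate, which matches the argument in the post.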
Michael H.W. Weber

Joined: 18 Sep 05
Posts: 13
Credit: 6,672,462
RAC: 0
Message 63312 - Posted: 13 Sep 2009, 11:52:17 UTC

Hello all,
The WUs with validate state "Workunit error - check skipped" now have credit granted.
Claimed credit = Granted credit.

...

Path7.

Not for me. I hooked up my new AMD 955 BE (4x 3.2 GHz) to Rosetta@home on the 9th of September. Since then, I have returned 220 WUs, and the machine is processing 24/7 for your project. So far, only 18 (!!!) jobs have been handled by the server; the rest are set to "pending". For an additional 4 jobs, credit was set to ZERO for no obvious reason. Those 4 tasks are:

https://boinc.bakerlab.org/rosetta/result.php?resultid=279920775
https://boinc.bakerlab.org/rosetta/result.php?resultid=279914807
https://boinc.bakerlab.org/rosetta/result.php?resultid=279861161
https://boinc.bakerlab.org/rosetta/result.php?resultid=279861159

None of these were cancelled on my side. I would really like to know what is going on here. I was wondering whether it might have something to do with the operating system I use (Win XP Pro x64). Are stricter homogeneous redundancy validation checks enabled, such that these WUs are only validated correctly when processed by another 64-bit Win XP machine? That might cause a significant slowdown during the validation process.

If you cannot solve this problem quickly, please let me know ASAP, because in that case I will have to move my systems to a more productive project due to limited electricity funds. Unlike other DC projects, you have RALPH as a good testing environment to make sure no such problems occur in the production Rosetta@home environment. In the future, please make better use of that. If you do not have enough processing power with RALPH, please also let me know so that I can put some systems on that project (then at least I know I have to expect issues).

Michael.
President of Rechenkraft.net e.V.

http://www.rechenkraft.net - The world's first and largest distributed computing association. We make those things possible that supercomputers don't.
LizzieBarry

Joined: 25 Feb 08
Posts: 76
Credit: 201,862
RAC: 0
Message 63314 - Posted: 13 Sep 2009, 13:34:48 UTC

Do I spy a second validator, rah_validator_mini on server bk1?

Good luck with that as I can't see a lot of catching up with the existing validation.
Michael G.R.

Joined: 11 Nov 05
Posts: 264
Credit: 11,247,510
RAC: 0
Message 63315 - Posted: 13 Sep 2009, 15:31:57 UTC

The TeraFLOPS estimate on the frontpage is down to 9, which probably means that nobody's WUs are getting validated right now.

I suspect that they'll fix it soon and it will process the backlog.
tiger

Joined: 16 Jul 06
Posts: 17
Credit: 1,083,385
RAC: 0
Message 63333 - Posted: 14 Sep 2009, 13:52:34 UTC - in response to Message 63315.  

For a project that aspires to reach 150 Tflops, I think a new attitude is needed. One does not just accidentally stumble upon success, no matter what the goal is.

The TeraFLOPS estimate on the frontpage is down to 9, which probably means that nobody's WUs are getting validated right now.

I suspect that they'll fix it soon and it will process the backlog.


Greg_BE

Joined: 30 May 06
Posts: 5664
Credit: 5,847,149
RAC: 2,368
Message 63334 - Posted: 14 Sep 2009, 14:12:29 UTC - in response to Message 63333.  

The project seems to be more focused on the scientific outcome rather than the IT credits and teraflops performance.
It seems that there is one major IT person in the project trying to keep up with it all, and then there are others that help him.
They are doing their best with what people they have.
Hopefully they will learn that they need a more active IT approach to keep this project rolling smoothly. The science results will only come from those that stay or new people that join and crunch, but if IT troubles drive them away, it's a big loss for the science that this project is working on.

For a project that aspires to reach 150 Tflops, I think a new attitude is needed. One does not just accidentally stumble upon success, no matter what the goal is.

The TeraFLOPS estimate on the frontpage is down to 9, which probably means that nobody's WUs are getting validated right now.

I suspect that they'll fix it soon and it will process the backlog.


Yifan Song
Volunteer moderator
Project developer
Project scientist

Joined: 26 May 09
Posts: 62
Credit: 7,322
RAC: 0
Message 63339 - Posted: 14 Sep 2009, 18:35:38 UTC

The validator server has pretty much cleaned up the job that created the IO problem. Now it is catching up with the rest of the jobs. It's a little hard to estimate how long that is going to take, but hopefully the worst is over.

I agree that we need to somehow balance our effort between science and IT. I'm still relatively new to this team and still feeling my way through the IT part of the project. Hopefully over time, I'll be able to help DEK on this.
LizzieBarry

Joined: 25 Feb 08
Posts: 76
Credit: 201,862
RAC: 0
Message 63340 - Posted: 14 Sep 2009, 18:56:52 UTC - in response to Message 63334.  
Last modified: 14 Sep 2009, 18:57:50 UTC

For a project that aspires to reach 150 Tflops, I think a new attitude is needed. One does not just accidentally stumble upon success, no matter what the goal is.

The project seems to be more focused on the scientific outcome rather than the IT credits and teraflops performance.

I agree, Greg, and that's exactly as it should be too. I'm here because I believe in the project, not because I believe in the volume of credits it offers.

The science results will only come from those that stay or new people that join and crunch, but if IT troubles drive them away, it's a big loss for the science that this project is working on.

Agreed again, but if someone was genuinely driven away by the slowness of awarding credits, that would be quite facile.

At some point the current issues will clear up and everyone who stayed will be rewarded for their persistence (and those who walked away won't). That seems quite equitable to me.

The validator server has pretty much cleaned up the job that created the IO problem. Now it is catching up with the rest of the jobs. It's a little hard to estimate how long that is going to take, but hopefully the worst is over.

Thanks Yifan. Let's hope so. Though I note the bk1 and bk2 servers aren't running right now. Part of the problem or part of the solution?
David E K
Volunteer moderator
Project administrator
Project developer
Project scientist

Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 63341 - Posted: 14 Sep 2009, 19:09:45 UTC

You can ignore the server status page for now. I stopped the non-minirosetta daemons and fired up more assimilators and validators for the minirosetta jobs. 8 assimilators and 4 validators are running on bk1 and bk2. The load on these servers is very high and we're doing what we can with what we have.

The only issue is pending credits. Users will just have to wait a bit longer for their credits to be awarded as our system catches up. The more important issue is that our work unit generators continue to make new work and on that front we're doing fine.
Yifan Song
Volunteer moderator
Project developer
Project scientist

Joined: 26 May 09
Posts: 62
Credit: 7,322
RAC: 0
Message 63342 - Posted: 14 Sep 2009, 19:10:27 UTC - in response to Message 63340.  


Thanks Yifan. Let's hope so. Though I note the bk1 and bk2 servers aren't running right now. Part of the problem or part of the solution?


It's part of the solution. DEK rearranged the validator servers a bit. They are just temporarily not showing properly on the webpage.
Greg_BE

Joined: 30 May 06
Posts: 5664
Credit: 5,847,149
RAC: 2,368
Message 63343 - Posted: 14 Sep 2009, 20:16:30 UTC - in response to Message 63341.  

You can ignore the server status page for now. I stopped the non-minirosetta daemons and fired up more assimilators and validators for the minirosetta jobs. 8 assimilators and 4 validators are running on bk1 and bk2. The load on these servers is very high and we're doing what we can with what we have.

The only issue is pending credits. Users will just have to wait a bit longer for their credits to be awarded as our system catches up. The more important issue is that our work unit generators continue to make new work and on that front we're doing fine.



Kind of bouncing back and forth with the various servers these days it seems.
Fighting between work generation and then some unchecked code and now the validators. Being that things supposedly happen in threes (so to speak), the problems should theoretically be over (knock on wood, fingers crossed and all that).
Hope to see some stability in the project before the year ends. Good luck keeping up with it all. You are doing a good job for one or two people.



©2024 University of Washington
https://www.bakerlab.org