Posts by LizzieBarry

1) Message boards : Number crunching : no work units (Message 67450)
Posted 30 Aug 2010 by LizzieBarry
Post:
IMO, if a Web page can't be made and kept accurate, it should be taken down. It serves no useful purpose if it can't be believed. Either ensure that the Server Status page is accurate, or remove it from the Web site because it helps nobody.

Earlier this evening the page was taken down and seems to be a little more representative now. Still no available work yet, but WiP is rising, which is promising. Fingers crossed.
2) Message boards : Number crunching : Why so much variation? (Message 66378)
Posted 30 May 2010 by LizzieBarry
Post:
I tried the suspend option on another long-running model earlier today and it didn't end the task shortly after. It carried on until the watchdog kicked in - same as the others.

It's certainly worth a try though.

I agree it depends on where the last checkpoint came. If there hasn't been one since the problem decoy began, it seems to me the task will go back to the last checkpoint (and, more significantly, to the time recorded at that checkpoint) and then make the same decision to continue as it did before. Wall clock time doesn't matter, as I understand things.

Suspend, unsuspend, hope! ;)
3) Message boards : Number crunching : Why so much variation? (Message 66372)
Posted 30 May 2010 by LizzieBarry
Post:
After watching several more of these long running tasks float through my systems today I really have to believe that there is something seriously wrong...

For example, I just had a job running on a dedicated AMD Phenom II 925 – standard clocking and 4 gig of memory. This job ran over 10 hours, claimed a credit of about 227 credits, and was granted just 32 credits.

This nets me just a little over 3 credits per CPU hour.

On jobs which terminate “normally” at the 6 hour point I currently have set in my preferences I seem to net about 20 to 25 credits per CPU hour.

That is a factor of about 7 – far greater than what I would expect due to the differences in the various X86 systems. And the opposing factors in the averaging algorithm necessary to knock this task from 227 to 32 credits boggle the mind.
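As a rough check on the figures quoted above, here is a minimal Python sketch of the credits-per-CPU-hour arithmetic. The function name is made up for illustration, and the inputs are the rounded numbers from the post (about 10 hours, 32 granted credits, 20 to 25 credits per hour on normal tasks), so the ratio is only approximate.

```python
def credits_per_cpu_hour(granted_credit, cpu_hours):
    """Granted credit divided by CPU time in hours."""
    return granted_credit / cpu_hours


# Rounded figures from the post: ~10 CPU hours, 32 credits granted.
long_task = credits_per_cpu_hour(32, 10)

# Typical rate quoted for tasks that end normally at the 6 hour preference.
normal_low, normal_high = 20, 25

print(f"long-running task: {long_task:.1f} credits/hour")
print(f"normal tasks earn {normal_low / long_task:.1f}x to "
      f"{normal_high / long_task:.1f}x as much per hour")
```

On these rounded inputs the gap works out at roughly six to eight times, in line with the "factor of about 7" mentioned above.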

I agree, Chris. I'm not obsessive on credits at all, but it does seem to me this isn't working right. With the watchdog kicking in after 3 or 4 hours over-run, and assuming there weren't any long-running models among the initial decoys, I can't see how a watchdog-truncated model will do more than halve the credits on a 3-4 hour runtime, or cut them by a third on an 8 hour runtime.

I'm seeing much the same as you with my WUs.
4) Message boards : Number crunching : Report long-running models here (Message 66225)
Posted 19 May 2010 by LizzieBarry
Post:
(Report format copied from above - seems to make sense)

A long-running model on this task, running on a 32-bit Vista laptop:

rhoA15May2010_1lb1_2j49_ProteinInterfaceDesign_15May2010_20686_35_0
<core_client_version>6.10.43</core_client_version>
[...]
# cpu_run_time_pref: 21600
BOINC:: CPU time: 36425.4s, 14400s + 21600s[2010- 5-19 11:49:16:] :: BOINC
InternalDecoyCount: 1206
======================================================
DONE :: 2 starting structures 36425.4 cpu seconds
This process generated 1206 decoys from 1206 attempts
======================================================
called boinc_finish
[...]
Claimed credit 98.1363273365197
Granted credit 81.0898045070925
application version 2.14


What gets me about this is that 1205 decoys seemed to run within my 6 hour runtime, then the last decoy had to get shut-down by the watchdog after exceeding 4 hours. Was I just unlucky? The credit award was still reasonable.
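The timing in the log above can be checked with a short Python sketch. It assumes that, in the `BOINC:: CPU time` line, the two terms after the comma are the watchdog allowance and the runtime preference (the second matches `cpu_run_time_pref: 21600`). That reading is inferred from this one log, not taken from project documentation, and the regular expression and function name are illustrative only.

```python
import re

# Line copied from the task report above.
LOG_LINE = "BOINC:: CPU time: 36425.4s, 14400s + 21600s[2010- 5-19 11:49:16:] :: BOINC"


def parse_cpu_time_line(line):
    """Return (cpu_seconds, first_term, second_term) from the log line."""
    match = re.search(r"CPU time: ([\d.]+)s, (\d+)s \+ (\d+)s", line)
    if match is None:
        raise ValueError("unrecognised log line")
    cpu, first, second = match.groups()
    return float(cpu), int(first), int(second)


# Assumption: first term = watchdog allowance, second = runtime preference.
cpu_seconds, watchdog_allowance, runtime_pref = parse_cpu_time_line(LOG_LINE)
limit = watchdog_allowance + runtime_pref

print(f"limit: {limit / 3600:.1f} h, used: {cpu_seconds / 3600:.2f} h")
print(f"over the limit by {cpu_seconds - limit:.1f} s")
```

Under that assumption the task stopped a few minutes past the combined ten-hour total, which fits the watchdog ending the final decoy.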
5) Message boards : Number crunching : Results pending and results still uploading (Message 65397)
Posted 23 Feb 2010 by LizzieBarry
Post:
I'm not whining about a few hundred credits here...you can keep 'em, I have more. I really believe project administration should take a look at this validator's performance since it came back up. But as the philosopher says, "that's just my opinion, I could be wrong."

If I'm not mistaken another job runs each day (or two) that picks up all the validate errors and awards them the same value as the claimed credit. Isn't that right, someone?

If so, I'd be more concerned that the project gets the benefit of the processing. Do the job's results get accepted properly for the science part?
6) Message boards : Number crunching : Scheduling request completed: got 0 new tasks (Message 64798)
Posted 4 Jan 2010 by LizzieBarry
Post:
Perhaps he meant this rather old thread?

http://boinc.bakerlab.org/rosetta/forum_thread.php?id=1269

Very likely, thanks. I searched for this 'promise' and target and couldn't find either. Reading from here it's clear that nothing whatsoever would happen unless it's volunteer-led as the project people were swamped. I don't imagine this has changed in the last 4 years either.
7) Message boards : Number crunching : Not getting any new tasks (Message 64767)
Posted 4 Jan 2010 by LizzieBarry
Post:
47000 and rising, 10 arrived here already. Now for the bunfight!

I'm going to adjust my runtime to limit what the manager asks for until the rush has died down, like others have done. Maybe 6 hours makes more sense.

Thanks to the guys coming in on a Sunday to help us out.
8) Message boards : Number crunching : Scheduling request completed: got 0 new tasks (Message 64765)
Posted 4 Jan 2010 by LizzieBarry
Post:
I'm sorry they haven't thrown themselves at your feet in gratitude quite enough for you. Perhaps they'll beg your forgiveness on their return. If you could provide a spec for what an appropriate level of kow-towing would be I'm sure it would be appreciated.

Your sarcasm is immature.

100% Correct. I pitched it at the same level as your "I'm moving over to POEM because at the moment, they appear to be more grateful for donated computing power." Strange how you could recognise my parody of your comment, but you couldn't recognise the childishness of yours. Not grateful enough?

I have read the grandiose plans about the project eventually hitting 150 TFlops, and I get excited for the accelerated progress that will bring. These clownish outages leave me conflicted that TPTB aren't serious, though. And that, is JMHO.

Where on earth did you read this? Do you have a link? I doubt it very much.

It seems to me you got yourself worked up about a promise no-one ever made, then upset yourself because you think a few days downtime means this imaginary promise won't ever be realised. A more ridiculous self-manufactured indignance I struggle to imagine.

A pity, too, you didn't see fit to comment on your failure to manage your own remaining units. Nothing to do with you too, I suppose? But of course.

Sorry my sarcasm was so immature. I can't work out where that idea came from.
9) Message boards : Cafe Rosetta : seti at home dont know where to ask? (Message 64763)
Posted 4 Jan 2010 by LizzieBarry
Post:
And yes this thread has gotten waaaaaay off topic!!

Is there such a thing as 'off topic' in a cafe? Carry on.
10) Message boards : Number crunching : Scheduling request completed: got 0 new tasks (Message 64709)
Posted 1 Jan 2010 by LizzieBarry
Post:
My own insight is that it's Christmas and New Year and everyone on the project is having a well-deserved break and worrying about more important things, like their respective families, instead of washing their heads out about some downtime.

I'm sorry, but I have to disagree 100%. Get it fixed, then go swill egg nog with the wife-n-kids. I have six Q9550 systems. I'm moving over to POEM because at the moment, they appear to be more grateful for donated computing power.

Then perhaps you missed that this is exactly what happened when the validation problem was fixed on Christmas Eve. And lo, they then went to swill that egg nog.

I'm sorry they haven't thrown themselves at your feet in gratitude quite enough for you. Perhaps they'll beg your forgiveness on their return. If you could provide a spec for what an appropriate level of kow-towing would be I'm sure it would be appreciated. Personally I think we owe more to the project team for giving us the opportunity to use our unused clock cycles 360 days of the year than they owe us for the other 5. After all, we've always had the option of sharing our time among other projects so this only affects those who chose to limit ourselves here (as I do too, which is our responsibility).

Personally I don't micro-manage my usage here, but for those who do, what responsibility do you take for running out of work? I see some people noted the lack of new work and increased their runtimes to 24hrs once it was clear there'd be no resolution between Christmas and New Year. Did you do the same to eke out what work you were sure of? A quick look at your tasks shows you kept your "6 quads" racing through as if everything was normal, with a 3 hour runtime. That's a user responsibility too. Or irresponsibility if you like.

Given we made our choices in this matter it ill behoves either of us to cast undue aspersions at the project team. I'm not exactly delighted either, but I bear my part of the responsibility and realise I'm hardly in a position to throw the first stone.

Nothing to get your knickers in a twist about.

That's an antagonistic and unnecessary comment.

I think you'll find it was both whimsical and placatory, though admittedly I found your comment funnier than mine.

There is a connection between the organizers and the volunteers that is being disregarded, and I don't think it's being too assertive to say so directly.

I agree it's not too assertive. It is a little hysterical and a bit ridiculous though IMO.

If that were the issue, I wouldn't criticize. But how can we know if that's the issue when no one took 15 minutes sometime during the last week to let us know?

Of course no-one can know, but a quick look at the calendar would be the most obvious clue, don't you think? But you criticised anyway. Well done you.
11) Message boards : Number crunching : Scheduling request completed: got 0 new tasks (Message 64684)
Posted 1 Jan 2010 by LizzieBarry
Post:
Lots of general advice among people of good will, but if you look on the Rosetta homepage, there's no communication from people who really know what the problem is. So we don't really know what the problem is. When the lack of communication gets bad enough, can we be blamed for imagining the Project may never come back?

Thanks for your insights.

My own insight is that it's Christmas and New Year and everyone on the project is having a well-deserved break and worrying about more important things, like their respective families, instead of washing their heads out about some downtime. Nothing to get your knickers in a twist about.

Meanwhile, back in the world that matters, happy new year to one and all.
12) Message boards : Number crunching : lr8_combine_smooth_torsion_it00 - All Errors? (Message 64204)
Posted 25 Nov 2009 by LizzieBarry
Post:
Shouldn't take long at 30 seconds each.

I'd hope not, but the one I had to abort had been running nearly 30 minutes (elapsed) but properties had no checkpoints and just a couple of seconds of processing. Didn't take long, but not immediate compute errors either.

I think I'm going to abort on sight, just to hurry this all along. YMMV.
13) Message boards : Number crunching : lr8_combine_smooth_torsion_it00 - All Errors? (Message 64195)
Posted 25 Nov 2009 by LizzieBarry
Post:
Ditto. I've got one running at the moment but with a lot of restarts, so I'm just going to abort it. I see some more coming down as well, which I'll keep my eye on.

I do have some other jobs going, so if they keep me going while someone checks and advises whether we can abort on sight I'd appreciate it.
14) Message boards : Number crunching : bаd WUs (Message 63517)
Posted 29 Sep 2009 by LizzieBarry
Post:
So, yes, running things, even when they fail, helps the project. This is why credit is given even for failures. On the other hand, you can't encourage failure or people will waste server resources creating more of them.

On the other hand, once you see several posts all pointing to specific problems with specific WUs, it could be wise to suspend the ones of that name that you already have and let other work run until the Project Team has had time to assess and respond with any recommendation to cancel specific tasks. This is one advantage of having a cache of about a day of work.

It looks like there's some advantage in having three hands too... ;)
15) Message boards : Number crunching : bаd WUs (Message 63505)
Posted 28 Sep 2009 by LizzieBarry
Post:
And I have had another "histone" work unit error out on me. I'm just going to abort any of them that get sent my way. I don't feel like wasting time running junk work units.

Note - the results of the few histone workunits my computers have run, and the results from wingmates, suggest that they may run properly under Windows XP but not under 64-bit Windows Vista. Those of you reporting failed histone workunits may want to include a mention of which type of operating system you're using, in order to check this idea.

I'm running them now with Vista SP2 and having no problems. I can view properties which show a past checkpoint and see the graphics window.

32-bit or 64-bit? Both of my failed histone workunits ran on my 64-bit Vista SP2 machine.

32-bit SP2 here.
16) Message boards : Number crunching : bаd WUs (Message 63484)
Posted 27 Sep 2009 by LizzieBarry
Post:
And I have had another "histone" work unit error out on me. I'm just going to abort any of them that get sent my way. I don't feel like wasting time running junk work units.

Note - the results of the few histone workunits my computers have run, and the results from wingmates, suggest that they may run properly under Windows XP but not under 64-bit Windows Vista. Those of you reporting failed histone workunits may want to include a mention of which type of operating system you're using, in order to check this idea.

I'm running them now with Vista SP2 and having no problems. I can view properties which show a past checkpoint and see the graphics window.

The one issue is that I'm currently on Model 0 Step 3875.

Oops. Just double checked and after 20 minutes it's on Model 1 Step 920. Is it a problem? Not that I can recognise.
17) Message boards : Number crunching : Receiving Low Credit on a 8/16 Core System! Help! (Message 63465)
Posted 27 Sep 2009 by LizzieBarry
Post:
Well I am just completing a number of Rosetta work units that have taken nearly 7 hours each, the claimed credit was around 87, but the granted credit was only 20.

One of my 8 hour long-running jobs claimed 77 and granted 108. Inconsistent, yes, though that's hardly a problem. Parsimonious, you're probably right. It depends if you're running for the credits or for the project. For me it's the latter.
18) Message boards : Number crunching : Report long-running models here (Message 63416)
Posted 21 Sep 2009 by LizzieBarry
Post:
A couple of strange, long-running WUs here. Both successfully completed and credit awarded, but both ran in excess of 8 hours with a 4 hour default runtime:

lr5_score12_gb_run01_rlbd_1unp_SAVE_ALL_OUT_IGNORE_THE_REST_DECOY_14707_62_0
CPU time 29385.66 [...]

# cpu_run_time_pref: 14400
Hbond tripped: [2009- 9-16 4:58:53:]
BOINC:: CPU time: 29383.6s, 14400s + 14400s[2009- 9-16 13:13:29:] :: BOINC
WARNING! cannot get file size for default.out.gz: could not open file.
Output exists: default.out.gz Size: -1
InternalDecoyCount: 0 (GZ)
-----
0
-----
Stream information inconsistent.
Writing W_0000001
======================================================
DONE :: 1 starting structures 29383.6 cpu seconds
This process generated 1 decoys from 1 attempts
======================================================
called boinc_finish


Exactly the same for me too.

lr5_score12_gb_run01_rlbd_1ugh_SAVE_ALL_OUT_IGNORE_THE_REST_DECOY_14707_136_0
CPU time 28841.62 [...]

# cpu_run_time_pref: 14400
Hbond tripped: [2009- 9-21 2:13:17:]
BOINC:: CPU time: 28839.4s, 14400s + 14400s[2009- 9-21 11: 7:12:] :: BOINC
WARNING! cannot get file size for default.out.gz: could not open file.
Output exists: default.out.gz Size: -1
InternalDecoyCount: 0 (GZ)
-----
0
-----
Stream information inconsistent.
Writing W_0000001
======================================================
DONE :: 1 starting structures 28839.4 cpu seconds
This process generated 1 decoys from 1 attempts
======================================================

19) Message boards : Number crunching : Problems with web site (Message 63362)
Posted 15 Sep 2009 by LizzieBarry
Post:
These issues all seem to be related to the current delayed awarding of credit.

Leave it one more day and most of it will be resolved - validation is rapidly ploughing through the massive backlog. Nothing to worry about at the user end and the r@h end is dealing with it now.
20) Message boards : Number crunching : Granted Credit taking forever.... (Message 63348)
Posted 14 Sep 2009 by LizzieBarry
Post:
Thanks Yifan. Let's hope so. Though I note the bk1 and bk2 servers aren't running right now. Part of the problem or part of the solution?

It's part of the solution. DEK rearranged the validator servers a bit. They are just temporarily not showing properly on the webpage.

Looking at the credits being awarded now I can see it's working.

Looking forward to a couple of days of high credits now.

I guess people are really used to getting credit right away with our project.

True. At one time they'd be available within 30 seconds, while some other projects routinely take days and more. Thanks for the solution.


©2024 University of Washington
https://www.bakerlab.org