Problems with Rosetta version 5.93

Message boards : Number crunching : Problems with Rosetta version 5.93

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4015
Credit: 0
RAC: 0
Message 50954 - Posted: 24 Jan 2008, 20:53:48 UTC

Being ended by the watchdog and running beyond the runtime target are two rather different things.
Rosetta Moderator: Mod.Sense

FalconFly
Joined: 11 Jan 08
Posts: 23
Credit: 2,163,056
RAC: 0
Message 50955 - Posted: 24 Jan 2008, 21:32:45 UTC - in response to Message 50946.  
Last modified: 24 Jan 2008, 21:37:54 UTC

Falcon, what is your Rosetta Preference for target runtime?
Please see related info. in this thread.


Was set at 6 hours until this evening, when I reduced it to 4 (4x4h no progress is at least better than 4x6h no progress)

Typical WorkUnits that finished already :
Watchdog Terminated
Watchdog Terminated + Segmentation Violation (still valid though)
Watchdog Terminated
Watchdog Terminated

----------
If the WorkUnit just takes that long (and can't finish within 4 or 6 hours on a modern Athlon64 X2), I don't mind the increased runtime. I don't expect that to take 24 hours though (unless the Models are really much more complex than expected, which could be in theory for all I know)

Looking at Claimed vs. Granted Credit however, it seems that approx. 50-70% of the runtime is simply lost due to Watchdog not cutting in until 4x the set runtime (not sure what the Client actually does in that time).

AMD_is_logical
Joined: 20 Dec 05
Posts: 299
Credit: 31,460,681
RAC: 0
Message 50957 - Posted: 25 Jan 2008, 3:17:55 UTC

I think there is something seriously wrong with the 2h4o_ WUs. They just seem to sit there using CPU, but not writing anything to the output files. They never end until the watchdog says they've used 4x the CPU time preference.

csbyseti
Joined: 24 Dec 05
Posts: 10
Credit: 3,758,617
RAC: 2
Message 50959 - Posted: 25 Jan 2008, 8:20:08 UTC - in response to Message 50954.  

Being ended by the watchdog and running beyond the runtime target are two rather different things.


I think I mean the same thing as FalconFly.
The 2h4o WUs have a problem. I restarted one WU on the quad-core; the CPU time jumped down to 1h:xx (last working checkpoint?) and it seemed to be running. It finished with a wrong CPU time of 11337 sec (3h:08), but with a heartbeat error.

https://boinc.bakerlab.org/rosetta/result.php?resultid=135428650

On the X2 the CPU time jumped down to 0h:0x after the restart (from 6h:59). The WU seems to run but does no more work until the watchdog stops it. This WU would consume 4x3h + 6h:59 = 19h of CPU time.
If such a WU is stopped and restarted by the scheduler and that resets the CPU time, it will be a never-ending loop.
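
For anyone who wants to redo that arithmetic with their own numbers, here is a minimal sketch. It assumes the watchdog only counts the CPU time shown after the reset, so everything accumulated before the restart is added on top. The function name is made up for illustration and is not part of Rosetta or BOINC.

```python
def worst_case_cpu_hours(pref_hours, lost_hours, factor=4):
    """CPU hours burned if the counter resets after `lost_hours` and the
    watchdog (assumed) only fires at `factor` times the runtime preference."""
    return factor * pref_hours + lost_hours


# 3 hour preference, 6h:59 already spent before the restart
lost = 6 + 59 / 60
total = worst_case_cpu_hours(3, lost)
print(f"{total:.2f} h of CPU time in the worst case")
```

With a 3 hour preference and 6h:59 lost this comes out just under 19 hours, which matches the estimate above.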

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4015
Credit: 0
RAC: 0
Message 50962 - Posted: 25 Jan 2008, 14:32:19 UTC

If such a WU is stopped and restarted by the scheduler and that resets the CPU time, it will be a never-ending loop.


The watchdog will catch such a thing and abort it for you. In this case, it would notice that the task was restarted 5 times from the exact same point. In other words, "I've started this thing 5 times and never reached a checkpoint, so I'm going to abort it".

The basic idea is that whatever it is about that task is not well suited to how you are using your computer, so the watchdog ends it, reports it back, and gets another task, which will tend to behave differently.
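
The two watchdog rules mentioned in this thread (ending a task at 4x the CPU time preference, and aborting after 5 restarts from the same point) can be sketched roughly like this. This is only an illustration of the rules as described here, not the actual Rosetta watchdog code; the class, constant, and function names are all hypothetical.

```python
from dataclasses import dataclass

CPU_TIME_FACTOR = 4      # "4x the CPU time preference", as described in the thread
MAX_SAME_RESTARTS = 5    # "restarted 5 times from the exact same point"


@dataclass
class TaskState:
    cpu_seconds: float                 # CPU time used so far
    runtime_pref_seconds: float        # user's target runtime preference
    restarts_without_checkpoint: int   # restarts since the last new checkpoint


def watchdog_verdict(task: TaskState) -> str:
    """Return what a watchdog following the two described rules would do."""
    if task.restarts_without_checkpoint >= MAX_SAME_RESTARTS:
        return "abort: no checkpoint reached"
    if task.cpu_seconds > CPU_TIME_FACTOR * task.runtime_pref_seconds:
        return "end: cpu time exceeded"
    return "keep running"


# A task with a 3 hour preference that has used about 12.5 hours of CPU time
print(watchdog_verdict(TaskState(45123.3, 10800, 0)))
```

The order of the two checks is an arbitrary choice in this sketch; the thread does not say which one the real watchdog evaluates first.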
Rosetta Moderator: Mod.Sense

Astro
Joined: 2 Oct 05
Posts: 987
Credit: 500,253
RAC: 0
Message 50965 - Posted: 25 Jan 2008, 17:36:45 UTC
Last modified: 25 Jan 2008, 18:25:05 UTC

Here's a snapshot of what others might be describing about the 2h4o WUs. This is on my wife's laptop, which was set to a 1 hour run time pref, but I changed it at some point last night to 6 hours (note: I changed it before I knew about this one; her laptop is WAAAAY out in the dining room, which never sees meals on the table, so I'm seldom there). Either way, we're way past that. Its longest recorded decoy (out of 912 recorded) was a 1gida which lasted 16627 seconds (4.62 hrs). I suspended the other projects to see what happens with it.



[edit] After 1 hour of run time the CPU time has progressed one hour, and the "% complete" has progressed from 98.558 to 98.664, but the "to comp" has remained unchanged.

Astro
Joined: 2 Oct 05
Posts: 987
Credit: 500,253
RAC: 0
Message 50974 - Posted: 25 Jan 2008, 20:31:20 UTC

I can't edit after 60 min.

After 3 hours the cpu time seems right, % comp has progressed up to 98.846, and "to comp" has gone up one second to 00:09:54. Hmmm, at .1%/hour there's just 11 more hours to go making it 25 hours/decoy...gotta be a record. I'll not post again until it's nearly over. (yes...I know 98.848% seems like it'd be nearly over...LOL)

Mod.Sense
Volunteer moderator
Joined: 22 Aug 06
Posts: 4015
Credit: 0
RAC: 0
Message 50975 - Posted: 25 Jan 2008, 21:14:45 UTC

Astro, the estimates are based on the time to completion as compared to the target runtime... except for the final 10min to completion. They make increasingly fine adjustments to show things are still moving forward, but the client really doesn't know how long that model is going to take.

Once the model completes, your time to completion will zip from wherever it was near 10min to zero. You can't take a .1% adjustment and extrapolate that into a prediction of the final time to completion. The last 10-12min of the time to completion do not work that way. And the time prior to that is just based on the time spent, as compared to the target runtime. So, until you've completed a model on that task, there really isn't a great method to arrive at a true predicted time to completion. For most tasks, which take less than an hour per model, this method works fairly well. These 6+hr per model tasks are basically the worst case for the time estimate calculations.
Rosetta Moderator: Mod.Sense

Astro
Joined: 2 Oct 05
Posts: 987
Credit: 500,253
RAC: 0
Message 50976 - Posted: 25 Jan 2008, 21:28:14 UTC - in response to Message 50975.  

These 6+hr per model tasks are basically the worst case for the time estimate calculations.

I do think something's wrong.

For the last 3 hours, it's progressed .1%/hour. If that were true for the full length of the WU, and given I'm 15 hours into it, then I should only be at 1.5% complete. At some point the "% comp" had to have progressed faster, and then at some point it went into slow motion mode. I'm aware of how the "to comp" works and have NO issue with that. Also, if I'd had a 1 hr, 2 hr, 3 hr, and shortly a 4 hour run time preference, then this one would have been ended by the watchdog. Of course, I'm assuming it'll finish at all. If the .1%/hour holds, then a 6 hour pref would have been ended by the watchdog too (I'll have to wait and see the total run time before I can say that definitively).

I guess, if these are really that long, then admin should change the % comp mechanism, and say something about having some "unusually LARGE" WUs in the system ATM. Otherwise, you're going to get a lot of questions, and who knows how many users will "abort" just because they don't know it might be "normal".

Heck, I feel that I'm doing them a favor even running it, as my gut feeling (without admin acknowledgement that this is normal) is I'm going to get nada for a day's work.
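
The back-of-the-envelope numbers in this exchange can be reproduced with a couple of lines. This is a naive linear extrapolation, which is exactly what Mod.Sense says the "% complete" display does not support near the end of a model, so treat it as a sanity check on the arithmetic only. The function names are made up for illustration.

```python
def hours_to_finish(percent_done, rate_percent_per_hour):
    """Hours left if progress kept advancing at a constant rate."""
    return (100.0 - percent_done) / rate_percent_per_hour


def percent_expected(hours_elapsed, rate_percent_per_hour):
    """Percent complete if the same rate had applied from the very start."""
    return hours_elapsed * rate_percent_per_hour


remaining = hours_to_finish(98.846, 0.1)   # progress reading after 3 hours of watching
expected = percent_expected(15, 0.1)       # 15 hours in at .1%/hour
print(f"{remaining:.1f} h left, {expected:.1f}% expected at a constant rate")
```

The gap between the 1.5% a constant rate would give and the roughly 98.8% actually displayed is the point being made: the reported progress cannot have been linear over the whole run.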

hedera
Joined: 15 Jul 06
Posts: 70
Credit: 4,530,841
RAC: 597
Message 50977 - Posted: 25 Jan 2008, 21:49:18 UTC

Another 2h4o oddity - I had this one, work unit 123329393:

2h4o__BOINC_TWIST_RINGS_TWIST_ANGLE_SYMM_FOLD_AND_DOCK-2h4o_-native__2668_12846

that ran 12.53 hours of CPU time! My outcome says "Success" and "Done" (and I got plenty of credit for it) - but when I look at the details, I see:

<core_client_version>5.10.20</core_client_version>
<![CDATA[
<stderr_txt>
# cpu_run_time_pref: 10800
# random seed: 1745555
# cpu_run_time_pref: 10800
# random seed: 1745555
# cpu_run_time_pref: 10800
**********************************************************************
Rosetta score is stuck or going too long. Watchdog is ending the run!
CPU time: 45123.3 seconds. Greater than 4X preferred time: 10800 seconds
**********************************************************************
GZIP SILENT FILE: .xx2h4o.out

But it shows a "Validate state" of VALID. I certainly am not complaining about the credit, but how can it be done and valid if Watchdog shut it down??
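
As a quick check on the numbers in that stderr output, here is a small sketch that pulls the watchdog line apart and compares the CPU time against the 4x limit. The regular expression is written against the exact wording pasted above and is an assumption about the format; other Rosetta versions may word the message differently. The function name is hypothetical.

```python
import re

# Excerpt of the stderr text quoted in the post above
STDERR = """\
# cpu_run_time_pref: 10800
Rosetta score is stuck or going too long. Watchdog is ending the run!
CPU time: 45123.3 seconds. Greater than 4X preferred time: 10800 seconds
"""


def parse_watchdog(stderr_text):
    """Extract CPU time and the watchdog limit from the message, or return None."""
    m = re.search(
        r"CPU time: ([\d.]+) seconds\. Greater than (\d+)X preferred time: (\d+) seconds",
        stderr_text,
    )
    if m is None:
        return None
    cpu, factor, pref = float(m.group(1)), int(m.group(2)), int(m.group(3))
    return {
        "cpu_hours": cpu / 3600,
        "limit_hours": factor * pref / 3600,
        "over_limit": cpu > factor * pref,
    }


info = parse_watchdog(STDERR)
print(info)
```

For this task the limit works out to 12 hours (4 x 10800 seconds) and the CPU time to about 12.53 hours, which is consistent with the watchdog having stepped in shortly after the limit was crossed.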
--hedera

Never be afraid to try something new. Remember that amateurs built the ark. Professionals built the Titanic.


Astro
Joined: 2 Oct 05
Posts: 987
Credit: 500,253
RAC: 0
Message 50981 - Posted: 25 Jan 2008, 22:48:34 UTC
Last modified: 25 Jan 2008, 22:54:53 UTC

That's kind of my point. It was ended by the watchdog at 4X his/her 3 hour run pref, at 12 hours plus a bit. The Task ID shows NO decoy info at all. Was any scientifically worthwhile work performed? Or is it just credit for time served?? This is going to be very typical for all participants except those with a "cpu run time pref" exceeding 8-12 hours (depending on processor, etc).

And now that I think about my one WU and my 6 hour run time: it looks like I should bump that up to the next step above 6 hrs or suffer the same fate as everyone else. It's at 99.005 percent after 16:38:00, so it is holding to the .1%/hour.

[edit] moved up to 8 hour pref

hedera
Joined: 15 Jul 06
Posts: 70
Credit: 4,530,841
RAC: 597
Message 50982 - Posted: 25 Jan 2008, 23:17:18 UTC

What's the default on CPU runtime limits? I looked at my settings and target CPU runtime is "not selected", so I have whatever you get when you don't specify. Do we have a consensus on what it should be?? I'm a little confused.
--hedera

Never be afraid to try something new. Remember that amateurs built the ark. Professionals built the Titanic.


rochester new york
Joined: 2 Jul 06
Posts: 2767
Credit: 1,763,844
RAC: 316
Message 50983 - Posted: 25 Jan 2008, 23:35:41 UTC - in response to Message 50982.  

What's the default on CPU runtime limits? I looked at my settings and target CPU runtime is "not selected", so I have whatever you get when you don't specify. Do we have a consensus on what it should be?? I'm a little confused.

I've got mine set for 1 day; however, I think it defaults to 3 hours.

Astro
Joined: 2 Oct 05
Posts: 987
Credit: 500,253
RAC: 0
Message 50984 - Posted: 25 Jan 2008, 23:39:27 UTC
Last modified: 25 Jan 2008, 23:41:44 UTC

Yes, "not selected" is the default of 3 hours.

Also, that's set on the "web", so your client must "call home" in order to see and apply the change. This happens when it gets/reports work. But if you want it to change in the middle of a run, you must do a "project update". You can manually update projects from the "projects" tab in the manager. Highlight the project name in the right-hand box by clicking on it. Then click the "update" button to the left.

Astro
Joined: 2 Oct 05
Posts: 987
Credit: 500,253
RAC: 0
Message 50986 - Posted: 26 Jan 2008, 11:16:21 UTC
Last modified: 26 Jan 2008, 11:32:07 UTC

Here's an updated pic taken 17+ hours later (from the first pic posted) of the same WU. I changed my runtime pref to 8 last night, but that only gets me up to 32 hours before the watchdog kicks in. Perhaps I'll go to a 12 hour pref so it'll be able to finish normally, as long as it doesn't take more than "48 HOURS to do ONE decoy" on a Mobile AMD64 3700 w/ 1 GB RAM. Boy, if the others I have in cache take anywhere near this long... all my Boincsimap, Einstein, and the rest of my Rosetta will be past the deadline.

Also notice the rate of completion seems to be continually slowing (at least I assume so), since it's only progressed .3% overnight while I slept, instead of the .1%/hour I was seeing. At 28+ hours, this decoy has already taken more than 6 times as long as its previous "longest decoy".




[Edit] I went to 12 hours "run time pref", so hopefully it'll finish in the next 19 hours.

AdeB
Joined: 12 Dec 06
Posts: 45
Credit: 4,187,319
RAC: 433
Message 50987 - Posted: 26 Jan 2008, 12:37:22 UTC - in response to Message 50986.  



[Edit] I went to 12 hours "run time pref", so hopefully it'll finish in the next 19 hours.


You really are on a mission to find out how long it will take.
Go for it, Astro!

Astro
Joined: 2 Oct 05
Posts: 987
Credit: 500,253
RAC: 0
Message 50988 - Posted: 26 Jan 2008, 12:48:43 UTC - in response to Message 50987.  
Last modified: 26 Jan 2008, 12:55:36 UTC



[Edit] I went to 12 hours "run time pref", so hopefully it'll finish in the next 19 hours.


You really are on a mission to find out how long it will take.
Go for it, Astro!

Someone's got to be the guinea pig. "show me the little plastic wheel, and I'll take her for a spin".

I'll take this posting opportunity for an update:

CPU time 30:21:01, 99.453% complete, 00:09:56 remaining, using the benchmark claiming method, this wu is worth 429.75 credits so far. I wonder....

transient
Joined: 30 Sep 06
Posts: 376
Credit: 10,098,723
RAC: 6,336
Message 50989 - Posted: 26 Jan 2008, 13:25:01 UTC - in response to Message 50988.  
Last modified: 26 Jan 2008, 13:26:32 UTC


Someone's got to be the guinea pig. "show me the little plastic wheel, and I'll take her for a spin".

I'll take this posting opportunity for an update:

CPU time 30:21:01, 99.453% complete, 00:09:56 remaining, using the benchmark claiming method, this wu is worth 429.75 credits so far. I wonder....


My box crunched one in the same category as yours (TWIST RINGS TWIST ANGLE). It ran a bit over 2.5 times the normal runtime, which is 6 hrs for me. A personal record! ;)

transient
Joined: 30 Sep 06
Posts: 376
Credit: 10,098,723
RAC: 6,336
Message 50990 - Posted: 26 Jan 2008, 13:44:01 UTC - in response to Message 50988.  

CPU time 30:21:01, 99.453% complete, 00:09:56 remaining, using the benchmark claiming method, this wu is worth 429.75 credits so far. I wonder....

Mine claimed 329 credits, but only got the standard 20 credits for a watchdog-ended task.

https://boinc.bakerlab.org/rosetta/result.php?resultid=135624272


Astro
Joined: 2 Oct 05
Posts: 987
Credit: 500,253
RAC: 0
Message 50991 - Posted: 26 Jan 2008, 13:48:16 UTC
Last modified: 26 Jan 2008, 13:48:37 UTC

Strange. hedera received 88 of his 98 claimed for his watchdog ended task resultid=135513724. I wonder what the difference was?



©2020 University of Washington
https://www.bakerlab.org