Report stuck & aborted WU here please - II

Runaway1956
Joined: 5 Nov 05
Posts: 19
Credit: 535,400
RAC: 0
Message 14518 - Posted: 24 Apr 2006, 4:06:36 UTC

What to do about upload errors? This isn't the first one I've seen - but this is the first 600 point upload error, lol


4/23/2006 22:55:04 PM||Benchmark results:
4/23/2006 22:55:04 PM|| Number of CPUs: 1
4/23/2006 22:55:04 PM|| 2931 double precision MIPS (Whetstone) per CPU
4/23/2006 22:55:04 PM|| 9825 integer MIPS (Dhrystone) per CPU
4/23/2006 22:55:04 PM||Finished CPU benchmarks
4/23/2006 22:55:05 PM|rosetta@home|Resuming computation for result 7521_largescale_large_fullatom_relax_dec7521_1_09_2.pdb_437_69_1 using rosetta version 498
4/23/2006 22:55:05 PM||Resuming computation
4/23/2006 22:55:05 PM||Rescheduling CPU: Resuming computation
4/23/2006 22:55:05 PM||Using earliest-deadline-first scheduling because computer is overcommitted.
4/23/2006 22:56:06 PM|rosetta@home|Error on file upload: length of file /f/boinc/projects/rosetta/upload/275/7515_largescale_large_fullatom_relax_dec7515_1_66_1.pdb_436_146_0_0 35688 bytes != offset 0 bytes



Most of those errors have been on the slower machines, before I set my prefs to run for a whole day.
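
For anyone collecting these upload errors from the client log, here is a minimal Python sketch that pulls the two byte counts out of such a line. It assumes the `timestamp|project|message` layout seen in the log excerpt above; `parse_upload_error` is a hypothetical helper name for illustration, not part of BOINC.

```python
import re

# Matches the message text seen in the log excerpt in this post.
UPLOAD_ERROR = re.compile(
    r"Error on file upload: length of file (?P<path>\S+) "
    r"(?P<length>\d+) bytes != offset (?P<offset>\d+) bytes"
)

def parse_upload_error(line):
    """Return a dict describing an upload error line, or None if it is not one."""
    # Assumed layout: timestamp|project|message (project may be empty).
    parts = line.split("|", 2)
    if len(parts) != 3:
        return None
    timestamp, project, message = parts
    match = UPLOAD_ERROR.search(message)
    if match is None:
        return None
    return {
        "timestamp": timestamp,
        "project": project,
        "path": match.group("path"),
        "length": int(match.group("length")),  # size of the local file
        "offset": int(match.group("offset")),  # offset reported in the error
    }
```

Lines that are not upload errors (benchmark results, scheduling notices) simply return `None`, so the function can be mapped over a whole log.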


ID: 14518
JZ-power

Joined: 9 Nov 05
Posts: 1
Credit: 374,157
RAC: 0
Message 14553 - Posted: 24 Apr 2006, 22:41:59 UTC

I have 3 WUs, all on version 4.98.
I ended them because they got stuck at 1.04%.

TRUNCATE_TERMINI_FULLRELAX_2tif__433_230_0 ResultID: 16980143

TRUNCATE_TERMINI_FULLRELAX_1b3aA_433_219_1 ResultID: 16991986

TRUNCATE_TERMINI_FULLRELAX_1enh__433_303_0 ResultID: 16987980

ID: 14553
tralala

Joined: 8 Apr 06
Posts: 376
Credit: 581,806
RAC: 0
Message 14609 - Posted: 25 Apr 2006, 18:51:24 UTC - in response to Message 14207.  


I like the idea below of not passing on bad jobs to another client when they fail, so only 1 computer will have the problem, not 4. I'm running this idea by David Baker and David Kim now. Unlike other BOINC projects, it's not critical for every single workunit to get processed. It's way more important to keep bad workunits from causing trouble!


What's the status on the idea to set max results to 1? Any decision taken yet?
ID: 14609
surrealchereal
Joined: 6 Nov 05
Posts: 23
Credit: 243,559
RAC: 0
Message 14658 - Posted: 26 Apr 2006, 11:10:24 UTC

I had one stuck at 1.04% too, but now it's gone, and so is everything else.
I can't connect to the server now either. What should I do?
Come BOINC with me!

USALUG !!
ID: 14658
AMD_is_logical

Joined: 20 Dec 05
Posts: 299
Credit: 31,460,681
RAC: 0
Message 14676 - Posted: 26 Apr 2006, 15:34:46 UTC - in response to Message 14609.  

What's the status on the idea to set max results to 1? Any decision taken yet?

With the current version being tested in Ralph, if the watchdog aborts a WU it is considered "valid" and so it's not sent out again.

ID: 14676
Slaughtercult

Joined: 4 Nov 05
Posts: 1
Credit: 2,118,187
RAC: 2,434
Message 14702 - Posted: 26 Apr 2006, 21:03:30 UTC
Last modified: 26 Apr 2006, 21:04:07 UTC

I aborted WU 13416703 (HBLR_1.0_1mky_420_7360) after 12.5 hours at 2%. A few hours earlier it was at 3.x%.

greetings


ID: 14702
Bommer

Joined: 26 Nov 05
Posts: 3
Credit: 4,603,378
RAC: 0
Message 15088 - Posted: 30 Apr 2006, 17:19:00 UTC
Last modified: 30 Apr 2006, 17:21:30 UTC

What should I do with this one?

FARELAX_NOFILTERS_1rnbA_413_201_3

4.97% after 26:27:13 hours of crunching, but still very active with the graphics.
If it's not an error or a stuck WU, I don't mind that it takes its time :)

RESULT ID 18302618
WORKUNIT ID 12816946

I haven't aborted it. The Deadline is on 10 May 2006. The WU is on Hold.

Thanx Bommer

ID: 15088
Moderator9
Volunteer moderator

Joined: 22 Jan 06
Posts: 1014
Credit: 0
RAC: 0
Message 15091 - Posted: 30 Apr 2006, 18:12:18 UTC - in response to Message 15088.  

What should I do with this one?

FARELAX_NOFILTERS_1rnbA_413_201_3

4.97% after 26:27:13 hours of crunching, but still very active with the graphics.
If it's not an error or a stuck WU, I don't mind that it takes its time :)

RESULT ID 18302618
WORKUNIT ID 12816946

I haven't aborted it. The Deadline is on 10 May 2006. The WU is on Hold.

Thanx Bommer

Your computers are hidden, so I cannot tell what version of the Rosetta application you are running, or whether your system is slower than normal.

But if you are using version 5.07 (look at the "TASKS" tab; the application version is shown there) and the Work Unit has run longer than 4 times your preference setting for "run time", you should abort the Work Unit. If you have a slow system and your time setting is 24 hours, then you should let it run a while. If you ever see the percent complete fall back to a value below what it is now, abort the Work Unit and report it with a link to the result so we can find it for analysis.
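
The rule of thumb above can be sketched as a small Python check. This is only an illustration of the advice in this post for version 5.07, not anything the Rosetta application runs itself; the function name, its parameters, and the idea of keeping a list of progress readings are all assumptions made for the example.

```python
def should_abort(cpu_hours, runtime_pref_hours, progress_history, factor=4.0):
    """Apply the two manual abort rules from this post.

    cpu_hours          -- CPU time the work unit has used so far
    runtime_pref_hours -- the user's "run time" preference, in hours
    progress_history   -- percent-complete readings, oldest first (hypothetical)
    factor             -- multiple of the preference allowed before aborting
    Returns (abort, reason).
    """
    # Rule 1: percent complete fell back below an earlier reading.
    for earlier, later in zip(progress_history, progress_history[1:]):
        if later < earlier:
            return True, "progress fell back; abort and report the result link"
    # Rule 2: ran longer than `factor` times the runtime preference.
    if cpu_hours > factor * runtime_pref_hours:
        return True, "ran longer than the allowed multiple of the runtime preference"
    return False, "keep running"
```

For instance, with a 2-hour preference a work unit at 26.5 CPU hours is past the 4x limit of 8 hours, while the same work unit under a 24-hour preference would be left to run.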

Moderator9
ROSETTA@home FAQ
Moderator Contact
ID: 15091
AMD_is_logical

Joined: 20 Dec 05
Posts: 299
Credit: 31,460,681
RAC: 0
Message 15098 - Posted: 30 Apr 2006, 19:05:10 UTC - in response to Message 15088.  

What should I do with this one?

FARELAX_NOFILTERS_1rnbA_413_201_3

4.97% after 26:27:13 hours of crunching, but still very active with the graphics.
If it's not an error or a stuck WU, I don't mind that it takes its time :)


If your computer is still using rosetta version 5.01, then the WU is probably in an infinite loop and should be aborted.

Version 5.07 has a watchdog thread, and it's best to let the watchdog do any aborting (if needed) as it will then send back information that is useful to the project.
ID: 15098
Bommer

Joined: 26 Nov 05
Posts: 3
Credit: 4,603,378
RAC: 0
Message 15141 - Posted: 1 May 2006, 7:42:55 UTC
Last modified: 1 May 2006, 8:18:39 UTC

Hello

@Moderator9: My computers are now shown on the web site.

The WU is using rosetta version 5.01. The current processor time is 40 hours at 7%. The WU is now on HOLD.

RESULT ID 18302618
WORKUNIT ID 12816946

My Computer is an AMD X2 4600+ with WIN XP Prof Service Pack 2.

Greets Bommer

ID: 15141
belldandy from pleiades
Joined: 2 Nov 05
Posts: 6
Credit: 102,731
RAC: 0
Message 15144 - Posted: 1 May 2006, 8:55:19 UTC

Two WUs that I aborted because they were taking way too much time (the usual is 2-3 hours). They didn't hang, though.

https://boinc.bakerlab.org/rosetta/result.php?resultid=17827510
FACONTACTS_NOFILTERS_1r69__441_248_1

https://boinc.bakerlab.org/rosetta/result.php?resultid=17773776
HBLR_1.0_2tif_420_9927_1

Version for both is 5.01
Campeones everywhere!
ID: 15144
tralala

Joined: 8 Apr 06
Posts: 376
Credit: 581,806
RAC: 0
Message 15147 - Posted: 1 May 2006, 9:54:47 UTC - in response to Message 15141.  

Hello

@Moderator9: My computers are now shown on the web site.

The WU is using rosetta version 5.01. The current processor time is 40 hours at 7%. The WU is now on HOLD.

RESULT ID 18302618
WORKUNIT ID 12816946

My Computer is an AMD X2 4600+ with WIN XP Prof Service Pack 2.

Greets Bommer


Abort! It's a faulty WU which won't get aborted by 5.01 in time.

ID: 15147
Moderator9
Volunteer moderator

Joined: 22 Jan 06
Posts: 1014
Credit: 0
RAC: 0
Message 15161 - Posted: 1 May 2006, 14:58:54 UTC - in response to Message 15147.  

Hello

@Moderator9: My computers are now shown on the web site.

The WU is using rosetta version 5.01. The current processor time is 40 hours at 7%. The WU is now on HOLD.

RESULT ID 18302618
WORKUNIT ID 12816946

My Computer is an AMD X2 4600+ with WIN XP Prof Service Pack 2.

Greets Bommer


Abort! It's a faulty WU which won't get aborted by 5.01 in time.


I would agree. This one has gone on too long to be normal. Version 5.01 will probably let it create 10 models before it completes, assuming it completes at all. I would abort it.


Moderator9
ROSETTA@home FAQ
Moderator Contact
ID: 15161
Bommer

Joined: 26 Nov 05
Posts: 3
Credit: 4,603,378
RAC: 0
Message 15180 - Posted: 1 May 2006, 16:39:16 UTC

Hello

Now my last question: how many credits do I get for the aborted WU?

Here is the link:

https://boinc.bakerlab.org/rosetta/result.php?resultid=18302618


Greets Bommer
ID: 15180
cduk

Joined: 10 Dec 05
Posts: 3
Credit: 27,710
RAC: 0
Message 15187 - Posted: 1 May 2006, 16:54:25 UTC
Last modified: 1 May 2006, 16:55:40 UTC

One stuck at 1.04% I'm afraid...

Link here

Will the fact that I have "Leave applications in memory while preempted" set to "no" have any bearing on this?

CD
ID: 15187
Feet1st
Joined: 30 Dec 05
Posts: 1755
Credit: 4,690,520
RAC: 0
Message 15193 - Posted: 1 May 2006, 17:41:33 UTC - in response to Message 15187.  
Last modified: 1 May 2006, 17:42:15 UTC

One stuck at 1.04% I'm afraid...

Link here

Will the fact that I have "Leave applications in memory while preempted" set to "no" have any bearing on this?

CD

How long has it been running? What is your Rosetta preference for runtime? (looks like the default of 2hrs) What Rosetta application version is shown in the Work tab?

Yes, if you change your General Preference to leave applications in memory, you will produce more work for your projects. Your PC just swaps Rosetta out to the paging file on disk while it is not running, so "leave in memory" is kind of a poor word choice. It just means that the application isn't completely ended. This allows it to pick up where it left off, except for when you power down your PC. The new, more frequent checkpointing helps do much the same thing, which is especially important on these large proteins.

Those CASP WUs are "large" proteins, and it takes them much longer to complete each model. If it is truly "stuck", the watchdog will find it and end it.

Each WU must complete at least one model, regardless of your time preference. So, if you have a short (2-4 hr) preference, a single model may still take 6 hours to complete. Once it does complete, it will see the time preference is exceeded, zip to 100% progress, and report back the result.

Please let it run at least 10 hrs before you abort it. If the steps are progressing, you've probably got a normal one there, it's just large. The user time preference is not an absolute thing. You have to crunch one model in order to have any results to report. You will find that 10-24hr runtimes are very predictable from one WU to the next. It's when the runtime preference is short and the WU is large that disparities occur.
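
The "at least one model" behavior described above can be illustrated with a short Python simulation. It is a simplified sketch under stated assumptions, not the Rosetta scheduler: `run_work_unit` and its inputs are hypothetical, and it assumes the time preference is only checked after each model finishes.

```python
def run_work_unit(model_hours, runtime_pref_hours):
    """Simulate a work unit that checks the time preference between models.

    model_hours        -- how long each successive model would take, in hours
    runtime_pref_hours -- the user's runtime preference, in hours
    Returns (models_completed, elapsed_hours).
    """
    elapsed = 0.0
    completed = 0
    for hours in model_hours:
        # A model always runs to completion once started.
        elapsed += hours
        completed += 1
        # Only now is the preference consulted (assumption of this sketch).
        if elapsed >= runtime_pref_hours:
            break
    return completed, elapsed
```

With a 2-hour preference and 6-hour models, the simulation still finishes one model and reports after 6 hours, which is why a large WU can overshoot a short preference.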
Add this signature to your EMail:
Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might!
https://boinc.bakerlab.org/rosetta/
ID: 15193
cduk

Joined: 10 Dec 05
Posts: 3
Credit: 27,710
RAC: 0
Message 15203 - Posted: 1 May 2006, 18:59:02 UTC - in response to Message 15193.  

How long has it been running?

So far, nearly two hours.

What is your Rosetta preference for runtime? (looks like the default of 2hrs)


"Not selected" (default: 4 hours?)

What Rosetta application version is shown in the Work tab?

5.07.

It was stuck at 1.044x (steps creeping up slowly). While in the process of writing this reply, it has completed and reported successfully...? Confusing.

Sorry if I've wasted your time - but many thanks indeed for your explanation.
ID: 15203
Feet1st
Joined: 30 Dec 05
Posts: 1755
Credit: 4,690,520
RAC: 0
Message 15208 - Posted: 1 May 2006, 19:46:35 UTC - in response to Message 15203.  

It was stuck at 1.044x ...Confusing.

Let me attempt to explain. Model 1 will show 1% something. The fractions have something to do with specific points within the model where they have completed one portion and are beginning another. They revise the completion fractionally and it is really sort of a counter on where within model 1 they presently are. They used this information for debugging and helping to resolve these "hung WU" issues.

Once model 1 completes, it basically looks at the target runtime and compares it with how long you've run already. Your next model will tend to take roughly as long as the last one did. Now that we've COMPLETED one, we have some idea how long the future models will take. So, let's say the target runtime is 4 hrs and model 1 took 1 hr: it would recompute progress to be about 25%, and then begin model 2. The estimated time remaining would then be recalculated to roughly 3 hrs. As model 2 progresses, you'll see fractional increases, again as it reaches various points within the model. The estimated runtime remaining will be shown to INCREASE during model 2. Once model 2 completes (another hour), progress is recalculated to be 50% and the estimated remaining time is recalculated too, so it takes a sudden drop from what it was near the end of the model.

In short, with these proteins varying so widely in size, it really doesn't KNOW how far done it is until it completes that first model to give it a frame of reference. In your case, the first model exceeded your runtime preference and so you zipped right to 100%.
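
The arithmetic in the example above works out as follows. This is a minimal Python sketch of the recalculation as described in this post, assuming progress is simply elapsed time over target runtime, capped at 100%; the function name is hypothetical and the real application may compute it differently.

```python
def progress_after_model(elapsed_hours, target_hours):
    """Recompute percent complete and hours remaining after a model finishes."""
    # Progress is the share of the target runtime already used, capped at 100%.
    fraction = min(elapsed_hours / target_hours, 1.0)
    # Remaining time cannot go negative once the target is exceeded.
    remaining = max(target_hours - elapsed_hours, 0.0)
    return fraction * 100.0, remaining

# Target runtime 4 hrs, each model takes about 1 hr:
print(progress_after_model(1, 4))  # after model 1
print(progress_after_model(2, 4))  # after model 2
# First model alone exceeds a 2 hr preference:
print(progress_after_model(6, 2))
```

The last call corresponds to the case in this thread, where the first model ran past the runtime preference and progress jumped straight to 100%.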

Making this more intuitive is definitely on the list of desirable things to improve. What's really unfortunate is that it's the MOST confusing for the short runtimes... which is the default :(

Add this signature to your EMail:
Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might!
https://boinc.bakerlab.org/rosetta/
ID: 15208
cduk

Joined: 10 Dec 05
Posts: 3
Credit: 27,710
RAC: 0
Message 15209 - Posted: 1 May 2006, 20:26:24 UTC

Feet1st,

Since posting I've re-read the FAQs (which have changed quite a bit since the last time I looked - I'll make a mental note to re-visit more often).

After doing this and after your excellent explanation, I now understand what was happening. It wasn't actually stuck, but since the progress %age wasn't moving and I hadn't seen this before in this or other BOINC projects, I mistakenly thought it had.

I appreciate that time estimation can be extremely difficult....!

Many thanks for your help.

CD
ID: 15209
Moderator9
Volunteer moderator

Joined: 22 Jan 06
Posts: 1014
Credit: 0
RAC: 0
Message 15210 - Posted: 1 May 2006, 20:48:58 UTC - in response to Message 15180.  

Hello

Now my last question: how many credits do I get for the aborted WU?

... Greets Bommer


This post will in part answer your questions.

Moderator9
ROSETTA@home FAQ
Moderator Contact
ID: 15210



©2024 University of Washington
https://www.bakerlab.org