Report stuck & aborted WU here please - II

Message boards : Number crunching : Report stuck & aborted WU here please - II

Grutte Pier [Wa Oars]~MAB The Frisian
Joined: 6 Nov 05
Posts: 87
Credit: 497,588
RAC: 0
Message 13737 - Posted: 14 Apr 2006, 17:24:02 UTC - in response to Message 13719.  

Just a reminder to those who are posting stuck WU's -- please abort the 4 work units below. We know why they're hanging, are no longer sending them out, and are giving credit to any of these jobs that have timed out! Thanks.



I read the above as saying that no credit is granted when we actually abort the stuck units, although credit is granted when we leave them to time out. Will this now be the standard way of dealing with all workunit timeouts, or only in this case?


Random, I suppose.
I have let a WU go on (graphics were moving), so it timed out or it aborted itself, and I got no credits.
Can't be positive anymore about this project.
Have run it for about 5 months; another fortnight and then it's over.

ID: 13737 · Rating: 0
Chilcotin
Joined: 5 Nov 05
Posts: 15
Credit: 16,969,500
RAC: 0
Message 13743 - Posted: 14 Apr 2006, 18:15:57 UTC
Last modified: 14 Apr 2006, 18:17:57 UTC

Workunit aborted after 23 hours. Stuck at 1.04 %.

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=13940837



edit: looks like this may be one of the 4 already flagged in the postings above ...
ID: 13743 · Rating: 0
Laurenu2
Joined: 6 Nov 05
Posts: 57
Credit: 3,818,778
RAC: 0
Message 13780 - Posted: 14 Apr 2006, 21:29:03 UTC

The 1.4 stalls are still coming. I am very tired of aborting them and losing the tens of thousands of points that are NOT granted for wasted CPU time.
If this project is going to keep letting out BAD WU's, Rosetta needs to find a way to purge these bad WU's from their servers when they are found to cause problems like these have, and/or send commands to the user's client to delete or abort the bad WU's on any upload/download to the Rosetta servers.
Keeping all the bad WU's in the system or on the Rosetta servers, and forcing us to run them to purge them from the Rosetta system, is unfair to us and damages the project's reputation.
If this continues without relief, people will start to abandon this project.
If You Want The Best You Must forget The Rest
---------------And Join Free-DC----------------
ID: 13780 · Rating: 0
David Baker
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 17 Sep 05
Posts: 705
Credit: 559,847
RAC: 0
Message 13782 - Posted: 14 Apr 2006, 21:46:52 UTC - in response to Message 13780.  

The 1.4 stalls are still coming. I am very tired of aborting them and losing the tens of thousands of points that are NOT granted for wasted CPU time.
If this project is going to keep letting out BAD WU's, Rosetta needs to find a way to purge these bad WU's from their servers when they are found to cause problems like these have, and/or send commands to the user's client to delete or abort the bad WU's on any upload/download to the Rosetta servers.
Keeping all the bad WU's in the system or on the Rosetta servers, and forcing us to run them to purge them from the Rosetta system, is unfair to us and damages the project's reputation.
If this continues without relief, people will start to abandon this project.


Again, we are very sorry for the problems of the recent days. We have spent most of today taking steps to ensure that these problems do not occur again. All the problem work units have been cancelled, and everything should be back to normal very soon (once the jobs that have already been downloaded have left your machines).

Since CASP is starting soon, and many of the proteins will be larger, we wanted to do some calculations on a broader range of sizes. Before pursuing this much further, we need some way of ensuring that these jobs are only sent out to machines appropriate for them, which is difficult with the current BOINC setup; we hope Rom can help us with this.
ID: 13782 · Rating: 0
rbpeake
Joined: 25 Sep 05
Posts: 168
Credit: 247,828
RAC: 0
Message 13784 - Posted: 14 Apr 2006, 22:06:15 UTC - in response to Message 13782.  

Since CASP is starting soon, and many of the proteins will be larger, we wanted to do some calculations on a broader range of sizes. Before pursuing this much further, we need some way of ensuring that these jobs are only sent out to machines appropriate for them, which is difficult with the current BOINC setup; we hope Rom can help us with this.

Beta 5.00 under Ralph@home preliminarily seems to be successfully processing work units that had previously failed under earlier versions of Rosetta. So it may not be the machines that are at fault, but the underlying Rosetta software itself (which seems to be on the way to being cleaned up if these early successes continue to hold up).

Regards,
Bob P.
ID: 13784 · Rating: 0
BennyRop
Joined: 17 Dec 05
Posts: 555
Credit: 140,800
RAC: 0
Message 13797 - Posted: 14 Apr 2006, 23:53:59 UTC - in response to Message 13782.  

Since CASP is starting soon, and many of the proteins will be larger, we wanted to do some calculations on a broader range of sizes. Before pursuing this much further, we need some way of ensuring that these jobs are only sent out to machines appropriate for them, which is difficult with the current BOINC setup; we hope Rom can help us with this.

When you upload information to the server, does it verify who it's coming from, or just blindly accept it, and then process it to see if it came from an actual machine running Rosetta?
If BOINC sends a request from hostid=121218 for another workunit, can't the amount of RAM (and CPU speed) be looked up from the database that displays this info: https://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=121218
Oops, it doesn't list speed, just the text CPU ID. (Speed would thus be based on the floating point and integer ratings.)

And then use something like this to determine what to send to each machine?
(RAM = RAM / number of CPU cores)
If hostid(121218).Ram > 750 Megs, then send EvenBiggerRamWU.
Else if hostid(121218).Ram > 500 Megs, then send BigRamWU.
Else if hostid(121218).Ram > 225 Megs, then send NormalWU.
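
A minimal sketch of the tiering idea above, in Python. It is only an illustration: `pick_wu_class` and the `TIERS` table are hypothetical names, the thresholds and class names are the ones from the post, and nothing here reflects how the BOINC scheduler actually assigns work.

```python
from typing import Optional

# Minimum RAM per core (in MB) for each work-unit class, largest first.
# Thresholds and class names are the placeholders used in the post.
TIERS = [
    (750, "EvenBiggerRamWU"),
    (500, "BigRamWU"),
    (225, "NormalWU"),
]


def pick_wu_class(total_ram_mb: float, n_cores: int) -> Optional[str]:
    """Return the largest class the host qualifies for, or None if it is too small."""
    if n_cores < 1:
        raise ValueError("n_cores must be at least 1")
    ram_per_core = total_ram_mb / n_cores
    # Tiers are ordered from largest to smallest, so the first match wins.
    for min_mb, wu_class in TIERS:
        if ram_per_core > min_mb:
            return wu_class
    return None


if __name__ == "__main__":
    # Example hosts: (total RAM in MB, number of cores); values are made up.
    for ram, cores in [(1024, 1), (1024, 2), (256, 1), (128, 1)]:
        print(ram, cores, pick_wu_class(ram, cores))
```

Checking the tiers from largest to smallest matters: with three independent "if" tests, a host with plenty of RAM would match every tier.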

ID: 13797 · Rating: 0
Laurenu2
Joined: 6 Nov 05
Posts: 57
Credit: 3,818,778
RAC: 0
Message 13798 - Posted: 15 Apr 2006, 0:12:13 UTC

But what about the removal of bad WU's from your servers? You must set up a way to stop the resending of the BAD WU's. Letting the system purge itself is not right. You have the capability to do auto upgrades; you should have the capability to auto-abort bad WU's on the client side. To let bad WU's run on your system or ours is a BAD THING.
If You Want The Best You Must forget The Rest
---------------And Join Free-DC----------------
ID: 13798 · Rating: 0
David Baker
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 17 Sep 05
Posts: 705
Credit: 559,847
RAC: 0
Message 13799 - Posted: 15 Apr 2006, 1:43:52 UTC - in response to Message 13798.  

But what about the removal of bad WU's from your servers? You must set up a way to stop the resending of the BAD WU's. Letting the system purge itself is not right. You have the capability to do auto upgrades; you should have the capability to auto-abort bad WU's on the client side. To let bad WU's run on your system or ours is a BAD THING.


The bad WU's are removed from our servers, but we can't remove them from your machines. Hopefully there will be no more bad WU's at all so this won't be a problem anymore.

ID: 13799 · Rating: 0
Delk
Joined: 20 Feb 06
Posts: 25
Credit: 995,624
RAC: 0
Message 13801 - Posted: 15 Apr 2006, 5:28:36 UTC

https://boinc.bakerlab.org/rosetta/result.php?resultid=17102919

Manually aborted at 1% after 52 hrs; it was a dodgy truncate unit I missed when checking systems.

The thing is, though, it did run for 52 hours. Does that mean the workunit timeouts only work on Windows systems, not Linux? It would be good to know, as I won't always be able to check up on these servers as often as I have been.

Thanks

ID: 13801 · Rating: 0
Rhiju
Volunteer moderator
Joined: 8 Jan 06
Posts: 223
Credit: 3,546
RAC: 0
Message 13802 - Posted: 15 Apr 2006, 5:44:59 UTC - in response to Message 13801.  

Thanks for the report. Timeouts should happen on all machines -- but in BOINC they're tied to the number of floating point operations, not wall clock time, unfortunately. Because the Windows app appears to be a little bit faster than the Linux app, timeouts will generally take more time on the Linux app.
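
As a rough illustration of why a FLOP-based limit turns into different amounts of time on different machines, here is a small Python sketch. It assumes the time limit is simply the work unit's FLOP budget divided by the host's measured floating point speed; the function name and the numbers are hypothetical, not values from the project.

```python
def cpu_time_limit_seconds(fpops_budget: float, measured_flops: float) -> float:
    """Time limit implied by a FLOP budget and a measured speed (assumed formula)."""
    if measured_flops <= 0:
        raise ValueError("measured_flops must be positive")
    return fpops_budget / measured_flops


if __name__ == "__main__":
    budget = 1.0e14  # hypothetical FLOP budget for one work unit
    # Two hypothetical hosts with different measured floating point speeds.
    for label, flops in [("faster host", 1.5e9), ("slower host", 1.0e9)]:
        hours = cpu_time_limit_seconds(budget, flops) / 3600
        print(f"{label}: limit of about {hours:.1f} hours")
```

Under that assumption, the same stuck work unit is allowed to run longer on the host with the lower measured speed before it times out.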

We completely understand that you can't check up on the servers all the time -- we're being extra vigilant to make sure this sort of problem doesn't happen again.


https://boinc.bakerlab.org/rosetta/result.php?resultid=17102919

Manually aborted at 1% after 52 hrs; it was a dodgy truncate unit I missed when checking systems.

The thing is, though, it did run for 52 hours. Does that mean the workunit timeouts only work on Windows systems, not Linux? It would be good to know, as I won't always be able to check up on these servers as often as I have been.

Thanks


ID: 13802 · Rating: 0
[B@H] Ray
Joined: 20 Sep 05
Posts: 118
Credit: 100,251
RAC: 0
Message 13805 - Posted: 15 Apr 2006, 6:48:01 UTC

I just aborted WU TRUNCATE_TERMINI_FULLRELAX_1b3aA_433_20_2, as it was still at 1.04% after close to 10 hours (CPU time = 35,419.17 seconds).

When I checked the graphics a few times in the last hour, nothing was happening there either.

Ray


Pizza@Home Rays Place Rays place Forums
ID: 13805 · Rating: 0
Carlos_Pfitzner
Joined: 22 Dec 05
Posts: 71
Credit: 138,867
RAC: 0
Message 13806 - Posted: 15 Apr 2006, 6:53:25 UTC
Last modified: 15 Apr 2006, 6:57:04 UTC

Preferred run time 2 hours
actual cpu time 5 hours 50 minutes
Measured floating point speed 1417.88 million ops/sec
Measured integer speed 3114.3 million ops/sec
Done 1.48%
Windows XP
Rosetta 4.98
<message>aborted by user

https://boinc.bakerlab.org/rosetta/result.php?resultid=17217056

Click signature for global team stats
ID: 13806 · Rating: 0
anders n
Joined: 19 Sep 05
Posts: 403
Credit: 537,991
RAC: 0
Message 13807 - Posted: 15 Apr 2006, 6:59:29 UTC - in response to Message 13806.  

Preferred run time 2 hours
actual cpu time 5 hours 50 minutes
Measured floating point speed 1417.88 million ops/sec
Measured integer speed 3114.3 million ops/sec
Done 1.48%
Windows XP
Rosetta 4.98
<message>aborted by user

https://boinc.bakerlab.org/rosetta/result.php?resultid=17217056


Hi Carlos

I had 1 of those on a P4 2,8.

It took 3+ hours to complete it.

Anders n

ID: 13807 · Rating: 0
[DPC]Division_Brabant~OldButNotSoWise
Joined: 23 Jan 06
Posts: 42
Credit: 371,797
RAC: 0
Message 13832 - Posted: 15 Apr 2006, 13:52:23 UTC

https://boinc.bakerlab.org/rosetta/result.php?resultid=17162086

Aborted at 1.8% after 5 hours or so, because the graphics also stopped.
ID: 13832 · Rating: 0
Bin Qian
Joined: 13 Jul 05
Posts: 33
Credit: 36,897
RAC: 0
Message 13840 - Posted: 15 Apr 2006, 16:25:29 UTC - in response to Message 13832.  
Last modified: 15 Apr 2006, 16:28:34 UTC

Hi,

Thanks for reporting this. We are working on maintaining a relatively stable run time on these WUs. Also see my reply to this thread.

Bin

https://boinc.bakerlab.org/rosetta/result.php?resultid=17162086

Aborted at 1.8% after 5 hours or so, because the graphics also stopped.


ID: 13840 · Rating: 0
Cedomir Igaly
Joined: 5 Dec 05
Posts: 2
Credit: 66,345
RAC: 0
Message 13841 - Posted: 15 Apr 2006, 17:08:30 UTC
Last modified: 15 Apr 2006, 17:30:43 UTC

7521_largescale_large_fullatom_relax_dec7521_1_02_9.pdb_437_133_0

7486_largescale_large_fullatom_relax_dec7486_1_10_4.pdb_435_177_0

stuck (and aborted) at ~ 1%
ID: 13841 · Rating: 0
keitaisamurai
Joined: 21 Mar 06
Posts: 2
Credit: 55,037
RAC: 0
Message 13843 - Posted: 15 Apr 2006, 17:25:17 UTC

Work unit stuck at 1.04% with more than 175 HOURS of processing time (I really should check this computer more often).

https://boinc.bakerlab.org/rosetta/result.php?resultid=15389604

Needless to say, I think I'll abort it...
ID: 13843 · Rating: 0
Sybr_E-N
Joined: 26 Nov 05
Posts: 2
Credit: 164,851
RAC: 0
Message 13854 - Posted: 15 Apr 2006, 20:33:31 UTC

50% error rate today...

https://boinc.bakerlab.org/rosetta/result.php?resultid=17261491
https://boinc.bakerlab.org/rosetta/result.php?resultid=17261490
https://boinc.bakerlab.org/rosetta/result.php?resultid=17261489
https://boinc.bakerlab.org/rosetta/result.php?resultid=17261456

All with the same error:
<core_client_version>5.2.13</core_client_version>
<message>Onjuiste functie. (0x1) - exit code 1 (0x1)
</message>
<stderr_txt>
# random seed: 3316837
# random seed: 3316837
# cpu_run_time_pref: 7200
# Exception caught in nstruct loop ii=1 i=2
#   num_decoys:1 attempts:2 cpu_run_time:5760.47
ERROR:: Exit at: .nblist.cc line:541

</stderr_txt>

"Onjuiste functie" is Dutch ( :) ) for "wrong function"
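
For anyone comparing several failed results, here is a rough Python sketch for pulling the numeric fields out of a stderr block like the one above. The helper name `parse_stderr_fields` is made up for illustration, and the sample text is an abridged copy of the log in this post; the stderr of other results may be laid out differently.

```python
import re

# Abridged copy of the stderr text quoted in the post above.
SAMPLE_STDERR = """\
# random seed: 3316837
# cpu_run_time_pref: 7200
# Exception caught in nstruct loop ii=1 i=2
#   num_decoys:1 attempts:2 cpu_run_time:5760.47
ERROR:: Exit at: .nblist.cc line:541
"""


def parse_stderr_fields(text: str) -> dict:
    """Collect every `name: number` pair in the text into a dict of floats."""
    fields = {}
    # A word, a colon, optional spaces, then an integer or decimal number.
    for key, value in re.findall(r"(\w+):\s*(\d+(?:\.\d+)?)", text):
        fields[key] = float(value)
    return fields


if __name__ == "__main__":
    fields = parse_stderr_fields(SAMPLE_STDERR)
    print("preferred run time (s):", fields["cpu_run_time_pref"])
    print("CPU time at failure (s):", fields["cpu_run_time"])
    print("failed at source line:", int(fields["line"]))
```

In this log the reported cpu_run_time (5760.47 s) is below cpu_run_time_pref (7200 s), so the exception was raised before the preferred run time had been reached.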
ID: 13854 · Rating: 0
Angus
Joined: 17 Sep 05
Posts: 412
Credit: 321,053
RAC: 0
Message 13858 - Posted: 15 Apr 2006, 21:15:28 UTC
Last modified: 15 Apr 2006, 21:21:25 UTC

4/15/2006 12:26:24 PM|rosetta@home|Unrecoverable error for result ALL_TOPOLOGY_CODES_1shfA_434_201_0 (aborted by user)
aborted result

This seemed to freeze at 2.42% (I think; I didn't take a screenshot), and the CPU time in BOINC Manager had been stopped for almost an hour. Couldn't get into "show graphics" to look at progress. See this thread for the no-graphics problem.

Proudly Banned from Predictator@Home and now Cosmology@home as well. Added SETI to the list today. Temporary ban only - so need to work harder :)



"You can't fix stupid" (Ron White)
ID: 13858 · Rating: -1
Gretchen
Joined: 1 Nov 05
Posts: 1
Credit: 21,277
RAC: 0
Message 13874 - Posted: 16 Apr 2006, 2:19:07 UTC - in response to Message 13331.  
Last modified: 16 Apr 2006, 2:20:39 UTC

This one was up to 8 hours and 1.04% done, with 16 hours to go. So I aborted it.

  • WU created on: 11 Apr 2006 17:46:10 UTC
  • Name: TRUNCATE_TERMINI_FULLRELAX_2tif__433_637
  • Computer: AuthenticAMD AMD Athlon(tm) Processor
  • Operating system: Microsoft Windows XP Home Edition, Service Pack 2, (05.01.2600.00)
  • Memory: 639.42 MB


ID: 13874 · Rating: 0



©2024 University of Washington
https://www.bakerlab.org