Report stuck & aborted WU here please - II

TCU Computer Science
Joined: 7 Dec 05
Posts: 28
Credit: 12,861,977
RAC: 0
Message 13881 - Posted: 16 Apr 2006, 6:23:01 UTC - in response to Message 13331.  
Last modified: 16 Apr 2006, 6:32:19 UTC

I aborted nine WUs today.

These four showed 20-50 hours of accumulated time:

17028012
TRUNCATE_TERMINI_FULLRELAX_1b3aA_433_678

17051917
TRUNCATE_TERMINI_FULLRELAX_1ptq__433_905

17050886
TRUNCATE_TERMINI_FULLRELAX_1enh__433_896

16238549
FA_RLXpt_hom006_1ptq__361_440


The following five showed little or no accumulated time but had been running for 4-11 days:

17016383
TRUNCATE_TERMINI_FULLRELAX_1ptq__433_569

16970141
TRUNCATE_TERMINI_FULLRELAX_2tif__433_104

16995174
TRUNCATE_TERMINI_FULLRELAX_2tif__433_369

16196147
FA_RLXpt_hom002_1ptq__361_379

16227211
FARELAX_NOFILTERS_1bm8__417_637
ID: 13881

XS_DDT's_Cattle_Prods
Joined: 24 Mar 06
Posts: 12
Credit: 1,180,072
RAC: 0
Message 13901 - Posted: 16 Apr 2006, 17:14:27 UTC

17049214

Stuck url, just found it; almost 17 or so hours wasted. Do we get any kind of credit for these things?
ID: 13901

Rhiju
Volunteer moderator
Joined: 8 Jan 06
Posts: 223
Credit: 3,546
RAC: 0
Message 13905 - Posted: 16 Apr 2006, 20:28:21 UTC - in response to Message 13901.  

These work units were problematic -- getting your reports has helped us
fix an important bug in Rosetta. So it wasn't a total waste of CPU time!
And about once a week, we're running a script to grant credit to work units that claimed credit.

17049214

Stuck url, just found it; almost 17 or so hours wasted. Do we get any kind of credit for these things?


ID: 13905

keitaisamurai
Joined: 21 Mar 06
Posts: 2
Credit: 55,037
RAC: 0
Message 13923 - Posted: 17 Apr 2006, 2:24:45 UTC

Found stuck at 1.02% after 16 hours of processing time.

16784959
ID: 13923

Laurenu2
Joined: 6 Nov 05
Posts: 57
Credit: 3,818,778
RAC: 0
Message 13932 - Posted: 17 Apr 2006, 3:57:28 UTC - in response to Message 13799.  

But what about the removal of bad WU's from your servers? You must set up a way to stop the resending of the BAD WU's. Letting the system purge itself is not right. You have the capability to do auto upgrades, so you should have the capability to auto-abort bad WU's on the client side. To let bad WU's run on your systems or ours is a BAD THING.


The bad WU's are removed from our servers, but we can't remove them from your machines. Hopefully there will be no more bad WU's at all so this won't be a problem anymore.

Well David, I let this one, FA_RLXpt_hom002_1ptq__361_380_3, run through 48 hrs. It did not self-abort and send itself in as you said it would. IT JUST RESTARTED, and it most likely was the 3rd time it restarted. That is 142 hrs of wasted CPU time.
And I will not get credit for it as you said we would, because your timeout did not work.
This is why I said you must come up with a way to auto-abort on our clients.
A script to do a project reset or something.
To expect us to clean up after you send out bad WU's is not right, and you know it. You must come up with a better plan than doing nothing.
If You Want The Best You Must forget The Rest
---------------And Join Free-DC----------------
ID: 13932

rbpeake
Joined: 25 Sep 05
Posts: 168
Credit: 247,828
RAC: 0
Message 13951 - Posted: 17 Apr 2006, 14:30:50 UTC

Here is a work unit that got stuck at a little over 1%:

17203004
Regards,
Bob P.
ID: 13951

ebahapo
Joined: 17 Sep 05
Posts: 29
Credit: 413,302
RAC: 0
Message 13953 - Posted: 17 Apr 2006, 14:53:17 UTC

I aborted result #16087136 after it got stuck at about 75% with no progress.

HTH

ID: 13953

[DPC]FOKschaap~devzero
Joined: 9 Dec 05
Posts: 1
Credit: 2,785,811
RAC: 0
Message 13957 - Posted: 17 Apr 2006, 16:06:40 UTC
Last modified: 17 Apr 2006, 16:14:22 UTC

I aborted TRUNCATE_TERMINI_FULLRELAX_2tif__433_736 after it got stuck at 1%.
ID: 13957

Rhiju
Volunteer moderator
Joined: 8 Jan 06
Posts: 223
Credit: 3,546
RAC: 0
Message 13964 - Posted: 17 Apr 2006, 18:25:07 UTC - in response to Message 13932.  

Can you post a link to the result? I think you will actually get credit for this. The result is reported to us as a large amount of "claimed credit"; we are going through on a weekly basis and granting credit for these jobs that caused big problems and returned "invalid" results. The results and your posting are still useful to us -- in this case, the postings on this work unit helped us track down a pretty esoteric bug in Rosetta. Let me know what the result number is -- if it's not flagged to get credit, I can see why.


But what about the removal of bad WU's from your servers? You must set up a way to stop the resending of the BAD WU's. Letting the system purge itself is not right. You have the capability to do auto upgrades, so you should have the capability to auto-abort bad WU's on the client side. To let bad WU's run on your systems or ours is a BAD THING.


The bad WU's are removed from our servers, but we can't remove them from your machines. Hopefully there will be no more bad WU's at all so this won't be a problem anymore.

Well David, I let this one, FA_RLXpt_hom002_1ptq__361_380_3, run through 48 hrs. It did not self-abort and send itself in as you said it would. IT JUST RESTARTED, and it most likely was the 3rd time it restarted. That is 142 hrs of wasted CPU time.
And I will not get credit for it as you said we would, because your timeout did not work.
This is why I said you must come up with a way to auto-abort on our clients.
A script to do a project reset or something.
To expect us to clean up after you send out bad WU's is not right, and you know it. You must come up with a better plan than doing nothing.


ID: 13964

Kurre
Joined: 12 Apr 06
Posts: 9
Credit: 69,240
RAC: 0
Message 13968 - Posted: 17 Apr 2006, 19:33:06 UTC

Result IDs 17407773 and 17407809 aborted.
ID: 13968

Kurre
Joined: 12 Apr 06
Posts: 9
Credit: 69,240
RAC: 0
Message 13971 - Posted: 17 Apr 2006, 19:44:18 UTC

Result ID 17188258: the server was down and the transfer of the result failed. Seems like there is some work to do on the client to handle recovery from this kind of situation.
ID: 13971

Laurenu2
Joined: 6 Nov 05
Posts: 57
Credit: 3,818,778
RAC: 0
Message 13982 - Posted: 17 Apr 2006, 22:24:35 UTC - in response to Message 13964.  

Can you post a link to the result? I think you will actually get credit for this. The result is reported to us as a large amount of "claimed credit"; we are going through on a weekly basis and granting credit for these jobs that caused big problems and returned "invalid" results. The results and your posting are still useful to us -- in this case, the postings on this work unit helped us track down a pretty esoteric bug in Rosetta. Let me know what the result number is -- if it's not flagged to get credit, I can see why.



I'm sorry, I cannot; I looked for it but could not find it. Your system for tracking WU's is hard for me to use. It might work OK if I had only a few nodes working on this, but I have over 50 nodes working on this project, and jobs get lost with so many pages of WU's. It might help if you put in page numbers (1 to 10, 20, 30, 40, 50) instead of just NEXT PAGE.

But getting the points for lost jobs was not the point of my post. Putting a PC into an endless LOOP: that was the point of my post, to let you see a problem you (Rosetta) need to address. You have sent out bad WU's in the past and you will send out more bad ones in the future; to think otherwise is unwise. The wise thing is to plan for it and figure out a way to reset the project client-side. Doing this would flush all the WU's on the client.
I know I would rather lose a few points in a flush than have a PC stuck in an endless LOOP for hundreds of hours until I check on it and abort it, and still get no points for it, or just a fraction of the points for the total hrs spent.

If You Want The Best You Must forget The Rest
---------------And Join Free-DC----------------
ID: 13982

Dimitris Hatzopoulos
Joined: 5 Jan 06
Posts: 336
Credit: 80,939
RAC: 0
Message 13985 - Posted: 17 Apr 2006, 22:52:29 UTC - in response to Message 13982.  

But getting the points for lost jobs was not the point of my post. Putting a PC into an endless LOOP: that was the point of my post, to let you see a problem you (Rosetta) need to address. You have sent out bad WU's in the past and you will send out more bad ones in the future; to think otherwise is unwise. The wise thing is to plan for it and figure out a way to reset the project client-side. Doing this would flush all the WU's on the client.
I know I would rather lose a few points in a flush than have a PC stuck in an endless LOOP for hundreds of hours until I check on it and abort it, and still get no points for it, or just a fraction of the points for the total hrs spent.


Lauren, I think what you say makes a lot of sense. The truth is that people like you, with 20+ nodes, bring a lot of TeraFLOPS to any project at very little "expense" (tech support).

I don't know how well the Time-To-Live timer works (after which a WU self-destructs). This needs to (ideally) be handled either with a "watchdog" thread for the Rosetta executable, or otherwise by BOINC-client itself.
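For what it's worth, here is a minimal sketch of the "watchdog" idea in Python. It is not Rosetta or BOINC code; the class name, the timeout values and the `on_stuck` callback are all made up for illustration. The working code calls `report_progress()` whenever it advances, and a background thread fires the callback once if no progress is seen for longer than the timeout.

```python
import threading
import time


class Watchdog:
    """Hypothetical watchdog: flags a job as stuck when no progress is reported."""

    def __init__(self, timeout_s, on_stuck, poll_s=0.05):
        self.timeout_s = timeout_s      # allowed idle time before the job counts as stuck
        self.on_stuck = on_stuck        # callback, e.g. abort the job and report it
        self.poll_s = poll_s            # how often the watchdog thread checks
        self._last_progress = time.monotonic()
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def report_progress(self):
        # Called by the worker each time it makes real progress.
        with self._lock:
            self._last_progress = time.monotonic()

    def stop(self):
        # Called by the worker when the job finishes normally.
        self._stop.set()
        if self._thread.is_alive():
            self._thread.join()

    def _run(self):
        # Event.wait returns False on timeout, True once stop() has been called.
        while not self._stop.wait(self.poll_s):
            with self._lock:
                idle = time.monotonic() - self._last_progress
            if idle > self.timeout_s:
                self.on_stuck(idle)
                return
```

In a real client the callback would presumably end the work unit and send back whatever was computed, rather than letting it restart from scratch.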

Also, BOINC server should provide a mechanism for projects to cancel jobs, after they've been let loose on volunteers' PCs (e.g. the batch of bad jobs sent out last week).

Another badly needed feature would be the ability of BOINC's scheduler/feeder to allocate WUs to eligible PCs based on capability and/or preference "flags" (e.g. >512MB, BigWU / BigMem flag, 24/7 operation, leave-in-mem = yes etc)

Finally, some optional but still useful features would be to 7zip (or bzip2) the files, which would halve the overall bandwidth per WU.
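A rough way to check the bandwidth claim on your own files, sketched in Python with the standard `bz2` module. The function name is an invention for this example, and the actual saving depends entirely on the file contents, so treat "halve" as an estimate rather than a measured figure.

```python
import bz2


def compression_ratio(data: bytes) -> float:
    """Return compressed size divided by original size (smaller is better)."""
    if not data:
        raise ValueError("need at least one byte of input")
    compressed = bz2.compress(data)
    # Sanity check: the data must survive a round trip unchanged.
    if bz2.decompress(compressed) != data:
        raise RuntimeError("round trip failed")
    return len(compressed) / len(data)
```

Running it over a real work unit's input files (read in binary mode) would show whether the saving is worth the extra CPU time on the client.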

Unfortunately, a lot of the features mentioned above are almost "unique" to Rosetta.

One would hope that in this whole wide world, some code wizard would roll up his sleeves and code these things (at least in BOINC's open source) in a few afternoons for fun, like the Hungarian akosf did with Einstein's executables, but maybe we're asking too much.

PS: I have to admit that I'm disappointed wrt the big WUs sent out indiscriminately lately, despite the results being easy to anticipate... :-(
Best UFO Resources
Wikipedia R@h
How-To: Join Distributed Computing projects that benefit humanity
ID: 13985

AMD_is_logical
Joined: 20 Dec 05
Posts: 299
Credit: 31,460,681
RAC: 0
Message 13988 - Posted: 18 Apr 2006, 0:01:58 UTC - in response to Message 13932.  

Well David I let this one FA_RLXpt_hom002_1ptq__361_380_3 run through 48 HR it did not self abort as you said it would ...


The FA_* WUs are old ones from mid-March so they don't have the timeout enabled. As far as I know, these FA_* WUs were never cancelled, so they pop up now and then and get sent out again until 4 people have rejected them.
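To make that concrete, here is a small Python sketch of the reissue rule as I understand it. This is not the BOINC server code: the class, the outcome names and the limit of 4 are assumptions for illustration, and counting a timeout as a failure is my guess.

```python
from dataclasses import dataclass, field

# Assumed limit, taken from "until 4 people have rejected them".
MAX_FAILED_RESULTS = 4


@dataclass
class WorkUnit:
    name: str
    cancelled: bool = False
    # Hypothetical outcome labels: "success", "error", "aborted", "timed_out".
    outcomes: list = field(default_factory=list)

    def record(self, outcome: str) -> None:
        self.outcomes.append(outcome)

    def failures(self) -> int:
        return sum(1 for o in self.outcomes if o != "success")

    def should_reissue(self) -> bool:
        # A cancelled or already completed work unit is never sent out again.
        if self.cancelled or "success" in self.outcomes:
            return False
        # Otherwise it keeps going out until enough hosts have failed it.
        return self.failures() < MAX_FAILED_RESULTS
```

With this rule a bad work unit that is never cancelled can burn time on up to four machines before it is retired.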
ID: 13988

Laurenu2
Joined: 6 Nov 05
Posts: 57
Credit: 3,818,778
RAC: 0
Message 13997 - Posted: 18 Apr 2006, 1:14:19 UTC - in response to Message 13988.  

Well David I let this one FA_RLXpt_hom002_1ptq__361_380_3 run through 48 HR it did not self abort as you said it would ...


The FA_* WUs are old ones from mid-March so they don't have the timeout enabled. As far as I know, these FA_* WUs were never cancelled, so they pop up now and then and get sent out again until 4 people have rejected them.


Are you telling me this WU was not running for the 100 to 150 hrs I thought it was, but instead was running for 1,000+ hrs in an endless loop? (Grrrr)
If You Want The Best You Must forget The Rest
---------------And Join Free-DC----------------
ID: 13997

Rhiju
Volunteer moderator
Joined: 8 Jan 06
Posts: 223
Credit: 3,546
RAC: 0
Message 13998 - Posted: 18 Apr 2006, 1:24:29 UTC - in response to Message 13985.  

The "Time-to-Live" thread is a great idea, and something we can implement within Rosetta. I'll look into it.

As for the other stuff, we've made similar suggestions to the BOINC programmers. They're currently concentrating on their next stable release, but might be able to help us out after that. You're right about these being BOINC issues that mainly are a hassle for this Rosetta project. More than other BOINC applications, we're trying to improve our code continually, and thus we stumble upon more glitches when we try to make these scientific improvements.

But getting the points for lost jobs was not the point of my post. Putting a PC into an endless LOOP: that was the point of my post, to let you see a problem you (Rosetta) need to address. You have sent out bad WU's in the past and you will send out more bad ones in the future; to think otherwise is unwise. The wise thing is to plan for it and figure out a way to reset the project client-side. Doing this would flush all the WU's on the client.
I know I would rather lose a few points in a flush than have a PC stuck in an endless LOOP for hundreds of hours until I check on it and abort it, and still get no points for it, or just a fraction of the points for the total hrs spent.


Lauren, I think what you say makes a lot of sense. The truth is that people like you, with 20+ nodes, bring a lot of TeraFLOPS to any project at very little "expense" (tech support).

I don't know how well the Time-To-Live timer works (after which a WU self-destructs). This needs to (ideally) be handled either with a "watchdog" thread for the Rosetta executable, or otherwise by BOINC-client itself.

Also, BOINC server should provide a mechanism for projects to cancel jobs, after they've been let loose on volunteers' PCs (e.g. the batch of bad jobs sent out last week).

Another badly needed feature would be the ability of BOINC's scheduler/feeder to allocate WUs to eligible PCs based on capability and/or preference "flags" (e.g. >512MB, BigWU / BigMem flag, 24/7 operation, leave-in-mem = yes etc)

Finally, some optional but still useful features would be to 7zip (or bzip2) the files, which would halve the overall bandwidth per WU.

Unfortunately, a lot of the features mentioned above are almost "unique" to Rosetta.

One would hope that in this whole wide world, some code wizard would roll up his sleeves and code these things (at least in BOINC's open source) in a few afternoons for fun, like the Hungarian akosf did with Einstein's executables, but maybe we're asking too much.

PS: I have to admit that I'm disappointed wrt the big WUs sent out indiscriminately lately, despite the results being easy to anticipate... :-(


ID: 13998

AMD_is_logical
Joined: 20 Dec 05
Posts: 299
Credit: 31,460,681
RAC: 0
Message 14003 - Posted: 18 Apr 2006, 2:59:56 UTC - in response to Message 13997.  

Are you telling me this WU was not running for the 100 to 150 hrs I thought it was, but instead was running for 1,000+ hrs in an endless loop? (Grrrr)


It probably sat in someone else's queue for a few weeks until it timed out. Then it may have been sent to someone else who then aborted it. Then it was sent to you and wasted 100 to 150 Hrs of your CPU time.

If it had been cancelled it would not have been sent out again.
ID: 14003

TCU Computer Science
Joined: 7 Dec 05
Posts: 28
Credit: 12,861,977
RAC: 0
Message 14012 - Posted: 18 Apr 2006, 5:04:00 UTC - in response to Message 13964.  

Let me know what the result number is -- if it's not flagged to get credit, I can see why.


Here is one that ran for 88 hours. I aborted it on 28 March
https://boinc.bakerlab.org/rosetta/result.php?resultid=14764800
ID: 14012

Interboy
Joined: 28 Sep 05
Posts: 3
Credit: 726,750
RAC: 340
Message 14019 - Posted: 18 Apr 2006, 8:06:48 UTC
Last modified: 18 Apr 2006, 8:07:37 UTC

ID: 14019

Branislav
Joined: 23 Mar 06
Posts: 6
Credit: 450,417
RAC: 0
Message 14079 - Posted: 18 Apr 2006, 21:29:03 UTC - in response to Message 13331.  

Work unit aborted at 1.04% - CPU time used ~1 hour 30 minutes.
WU Name "VP_PRODUCTION_1qgtA_442_6769"
Application "Rosetta"
Workunit = 14313267;
Result ID = 17432540;
System = GenuineIntel Intel(R) Pentium(R) III Mobile CPU 866MHz
Microsoft Windows Millennium 04.90.3000.00
The workunit was aborted manually.

ID: 14079



©2024 University of Washington
https://www.bakerlab.org