Please abort WUs with

R/B

Joined: 8 Dec 05
Posts: 195
Credit: 28,095
RAC: 0
Message 7847 - Posted: 28 Dec 2005, 22:25:23 UTC
Last modified: 28 Dec 2005, 22:26:45 UTC

None of you understood what I meant. I said the posting on the HOME THREAD alerting people to the problems with the 205 work units..

Not my own postings! Geez. I know how to edit those.

Go to rosetta home page and see how it is written. Should say in ALL CAPS not to abort DEFAULT units other than 205 ones....

(edit) This is the posting I was talking about.....From the HOME PAGE.

***************************************************

News

December 20, 2005
A bad batch of work units was created; they can be identified by work unit names that start with "DEFAULT_xxxxx_205_". If you are running one of these work units, please abort it. We will grant credit to those who have run and aborted these work units. Details about this error and recent changes can be found in our Technical News page.

Founder of BOINC GROUP - Objectivists - Philosophically minded rational data crunchers.


Tern
Joined: 25 Oct 05
Posts: 576
Credit: 4,695,450
RAC: 15
Message 7849 - Posted: 28 Dec 2005, 23:02:46 UTC - in response to Message 7847.  

None of you understood what I meant. I said the posting on the HOME THREAD alerting people to the problems with the 205 work units..

Not my own postings! Geez. I know how to edit those.


We understood what you said - but then the comment was made

Only mods can re-write existing postings.


and I spoke up to say that no, as far as I know, nobody can modify existing _postings_ (and thread titles, as THIS THREAD originally referred to "205"s) except the person who wrote it (for an hour), and that only STAFF can change the home page, not a mere mod.

The home page does say "start with DEFAULT_xxxxx_205_" and not "start with DEFAULT", but I agree that it can be confusing: it could be read as "starting with 205 and continuing with other numbers" instead of "starting with this string of characters, specifically including the '205'", so it should be expanded on. However, by the time staff are around to change the wording, they'll also be around to delete the WUs, so there's no point...
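For anyone who wants to check names mechanically, here is a minimal sketch of the "starts with this exact string, including the 205" reading. It assumes the "xxxxx" in the news item stands for a short run of letters and digits (as in the "1n0u" of DEFAULT_1n0u_218_344_8); the real naming scheme may differ, and the first name in the list is made up for illustration.

```python
import re

# Assumption: "xxxxx" is a placeholder for an identifier made of letters and
# digits. This is a guess from names seen in this thread, not a project rule.
BAD_BATCH = re.compile(r"^DEFAULT_[A-Za-z0-9]+_205_")


def is_bad_batch(name: str) -> bool:
    """Return True only for names that start with DEFAULT_<id>_205_."""
    return BAD_BATCH.match(name) is not None


names = [
    "DEFAULT_1n0u_205_344_8",           # hypothetical name, for illustration
    "DEFAULT_1n0u_218_344_8",           # DEFAULT, but batch 218
    "INCREASE_CYCLES_10_1ogw_226_937",  # different prefix entirely
]

for name in names:
    verdict = "abort" if is_bad_batch(name) else "keep"
    print(f"{name}: {verdict}")
```

Only the first name matches, so other DEFAULT work units are left alone.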

Ingleside
Joined: 25 Sep 05
Posts: 107
Credit: 1,514,472
RAC: 0
Message 7852 - Posted: 28 Dec 2005, 23:46:46 UTC - in response to Message 7264.  

I don't think there is anyone physically _at_ Rosetta today that knows how to kill them; I think that probably would require getting info from someone at SETI that has had to do it a few times. If there's anyone _there_... U.S. colleges are all on break.


Well, since BOINC is open-source, it took under a minute to locate http://setiathome2.ssl.berkeley.edu/cgi-bin/cvsweb.cgi/boinc/html/ops/

From a quick look at things, if the project has set up the administrator pages, it's just a matter of logging in here, selecting "Cancel workunits" from among the many options, entering the first and last WU ID, and letting the server do its job of cancelling the WUs...


As for SETI@Home, well, they've just had their routine backup outage, so they're definitely still working, and BOINC check-ins can happen even if it's a holiday...

rbpeake
Joined: 25 Sep 05
Posts: 168
Credit: 247,828
RAC: 0
Message 7858 - Posted: 29 Dec 2005, 2:48:22 UTC - in response to Message 7852.  
Last modified: 29 Dec 2005, 2:49:35 UTC


As for SETI@Home, well, they've just had their routine backup outage, so they're definitely still working, and BOINC check-ins can happen even if it's a holiday...

Certainly true, but I think two factors contributed to no checking in:
1. The project is fairly new for them, and I'm guessing they did not fully realize the TLC (tender loving care) it might require;
2. They worked really intensely up to the holidays, and I am guessing that DB said for everyone to take a real vacation/total break and not even think about the project.

At any rate, it is only a relatively short period until they are back again next week. :)

Regards,
Bob P.

Tern
Joined: 25 Oct 05
Posts: 576
Credit: 4,695,450
RAC: 15
Message 7859 - Posted: 29 Dec 2005, 3:00:44 UTC - in response to Message 7858.  

At any rate, it is only a relatively short period until they are back again next week. :)


They haven't completely abandoned ship for the week; I've seen a couple of postings made... I think it's just a matter of not having dealt with all the different crises SETI has dealt with, and being hesitant to do anything without being sure it's not going to make matters worse. I wasn't aware that the server-side stuff could even _be_ web-controlled; I sure wouldn't want to do it for the first time from home or a hotel room or something, without being able to refer to all the notes, and maybe do a backup first...

River~~
Joined: 15 Dec 05
Posts: 761
Credit: 285,578
RAC: 0
Message 7868 - Posted: 29 Dec 2005, 8:59:20 UTC - in response to Message 7859.  

At any rate, it is only a relatively short period until they are back again next week. :)


They haven't completely abandoned ship for the week; I've seen a couple of postings made...



Too right. We are all grateful to the team for taking time out from their break to turn up on these boards. It is kinda like phoning into the office when you are on vacation - effort beyond the call of duty.

But just like phoning in, there is only so much you can do by remote. When they are back in the office I am sure we will see real progress on all the outstanding issues.

I am hoping the first thing the server admins do is to steal the files off the server that belong to all the bad jobs it keeps recycling, so the server can't send them out again even if it tries to.

River~~



Tern
Joined: 25 Oct 05
Posts: 576
Credit: 4,695,450
RAC: 15
Message 7978 - Posted: 30 Dec 2005, 9:23:11 UTC

Anyone with a "suspended" DEFAULT_xxxxx_205 please check the webpage for your results, and look at that one - if the "errors" line at the top says "Cancelled", you can unsuspend it and abort it. That will let it get back to the server and be finished. Thanks!

[B@H] Ray
Joined: 20 Sep 05
Posts: 118
Credit: 100,251
RAC: 0
Message 8059 - Posted: 31 Dec 2005, 17:10:48 UTC
Last modified: 31 Dec 2005, 17:14:12 UTC

226 also has some bad ones. Check this one; it runs 5 to 8 hours before erroring out.

INCREASE_CYCLES_10_1ogw_226_937

Reason: Access Violation (0xc0000005) at address 0x006047A8 write attempt to address 0x08567DA4

Exiting...

Cheers
Ray


Tern
Joined: 25 Oct 05
Posts: 576
Credit: 4,695,450
RAC: 15
Message 8063 - Posted: 31 Dec 2005, 17:22:03 UTC - in response to Message 8059.  

226 also has some bad ones. Check this one; it runs 5 to 8 hours before erroring out.

INCREASE_CYCLES_10_1ogw_226_937


That one does have _some_ problem... but it's not the same as the DEFAULT_xxxx_205's. They error out because of maximum_cpu_time_exceeded. And I've had a number of those "INCREASE_CYCLES" WUs that completed just fine, although on my PC they ran about 4 hours instead of 2.

rbpeake
Joined: 25 Sep 05
Posts: 168
Credit: 247,828
RAC: 0
Message 8064 - Posted: 31 Dec 2005, 18:03:33 UTC - in response to Message 8063.  

That one does have _some_ problem... but it's not the same as the DEFAULT_xxxx_205's. They error out because of maximum_cpu_time_exceeded. And I've had a number of those "INCREASE_CYCLES" WUs that completed just fine, although on my PC they ran about 4 hours instead of 2.


Here is one I got:

NO_BARCODE_FRAGS_1ogw_227_2815

<core_client_version>5.2.15</core_client_version>
<message> - exit code -1073741819 (0xc0000005)
</message>
<stderr_txt>
No heartbeat from core client for 31 sec - exiting

***UNHANDLED EXCEPTION****
Reason: Access Violation (0xc0000005) at address 0x7C911BF4 write attempt to address 0x00000000

Exiting...

Plus no credit received for slightly over 5 hours of work.
Regards,
Bob P.
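
The two numbers in that message line are the same value written two ways: -1073741819 is the signed 32-bit reading of 0xc0000005, the Windows access-violation status that also appears in the "Reason:" line. A minimal sketch of the arithmetic (the function names are just illustrative):

```python
def to_unsigned32(code: int) -> int:
    """Reinterpret a signed 32-bit exit code as an unsigned value."""
    return code & 0xFFFFFFFF


def to_signed32(value: int) -> int:
    """Reinterpret an unsigned 32-bit value as a signed exit code."""
    return value - 2**32 if value >= 2**31 else value


exit_code = -1073741819                # as reported in the <message> line
print(hex(to_unsigned32(exit_code)))   # 0xc0000005
print(to_signed32(0xC0000005))         # -1073741819
```

So the "exit code" and the "Access Violation" reason are one report, not two separate errors.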

[B@H] Ray
Joined: 20 Sep 05
Posts: 118
Credit: 100,251
RAC: 0
Message 8066 - Posted: 1 Jan 2006, 0:08:56 UTC - in response to Message 8063.  

226 also has some bad ones. Check this one; it runs 5 to 8 hours before erroring out.

INCREASE_CYCLES_10_1ogw_226_937


That one does have _some_ problem... but it's not the same as the DEFAULT_xxxx_205's. They error out because of maximum_cpu_time_exceeded. And I've had a number of those "INCREASE_CYCLES" WUs that completed just fine, although on my PC they ran about 4 hours instead of 2.


I have had other "INCREASE_CYCLES" units that ran fine too, but as you say they took a lot longer, between 3 and 6 hours. I don't mind them running longer as long as they finish up.
Ray


kb7rzf
Joined: 7 Oct 05
Posts: 16
Credit: 35,427
RAC: 0
Message 8192 - Posted: 2 Jan 2006, 18:04:49 UTC
Last modified: 2 Jan 2006, 18:07:16 UTC

I also just got an error for result MORE_FRAGS_1di2_222_4350_0

stderr out:
<core_client_version>5.2.13</core_client_version>
<message> - exit code -1073741819 (0xc0000005)
</message>
<stderr_txt>
No heartbeat from core client for 31 sec - exiting

***UNHANDLED EXCEPTION****
Reason: Access Violation (0xc0000005) at address 0x7C911E58 read attempt to address 0xBE02E900

Exiting...

</stderr_txt>
Here's the link to the WU
Here

edit

Here's the message from my messages tab.

1/2/2006 9:08:32 AM|rosetta@home|Resuming result MORE_FRAGS_1di2_222_4350_0 using rosetta version 481
1/2/2006 9:08:32 AM|SETI@home|Pausing result 03no03aa.21211.32001.292318.1.183_3 (left in memory)
1/2/2006 9:29:12 AM|rosetta@home|Unrecoverable error for result MORE_FRAGS_1di2_222_4350_0 ( - exit code -1073741819 (0xc0000005))
1/2/2006 9:29:13 AM|SETI@home|Result 03no03aa.21211.32001.292318.1.183_3 exited with zero status but no 'finished' file
1/2/2006 9:29:13 AM|SETI@home|If this happens repeatedly you may need to reset the project.
1/2/2006 9:29:13 AM||request_reschedule_cpus: process exited
1/2/2006 9:29:13 AM|rosetta@home|Computation for result MORE_FRAGS_1di2_222_4350_0 finished
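
For sorting through a longer messages tab, here is a minimal sketch that splits lines like the ones above into their three fields. It assumes the layout is always timestamp|project|message, as in this paste; the "(client)" label for lines with an empty project field is my own placeholder, not something BOINC prints.

```python
from typing import NamedTuple


class LogEntry(NamedTuple):
    timestamp: str
    project: str
    message: str


def parse_line(line: str) -> LogEntry:
    """Split one 'timestamp|project|message' line into its fields."""
    # maxsplit=2 keeps any later '|' characters inside the message text.
    timestamp, project, message = line.split("|", 2)
    # "(client)" is a hypothetical label for lines with no project name.
    return LogEntry(timestamp, project or "(client)", message)


lines = [
    "1/2/2006 9:29:12 AM|rosetta@home|Unrecoverable error for result "
    "MORE_FRAGS_1di2_222_4350_0 ( - exit code -1073741819 (0xc0000005))",
    "1/2/2006 9:29:13 AM||request_reschedule_cpus: process exited",
]

for entry in map(parse_line, lines):
    if "error" in entry.message.lower():
        print(entry.project, "->", entry.message)
```

Filtering on the project field this way makes it easier to see that the Rosetta error and the SETI "no 'finished' file" message came from different projects within the same second or two.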


bartsob5&alicjam
Joined: 17 Sep 05
Posts: 6
Credit: 183,280
RAC: 0
Message 8194 - Posted: 2 Jan 2006, 18:32:07 UTC

And I'd like to ask: what about WUs from the NEW_SOFT_CENTROID_PACKING_2reb_225 series? After 9 hours it is still at 1%! The full name of the workunit is NEW_SOFT_CENTROID_PACKING_2reb_225_3842

kb7rzf
Joined: 7 Oct 05
Posts: 16
Credit: 35,427
RAC: 0
Message 8196 - Posted: 2 Jan 2006, 18:57:43 UTC - in response to Message 8194.  

And I'd like to ask: what about WUs from the NEW_SOFT_CENTROID_PACKING_2reb_225 series? After 9 hours it is still at 1%! The full name of the workunit is NEW_SOFT_CENTROID_PACKING_2reb_225_3842


I have a WU called NEW_SOFT_CENTROID_PACKING_2reb_225_4338_0; it has run for 29 minutes and is 10% done. But then my computer switched to crunching more SETI WUs, so I don't know yet whether it's a bad WU. Will keep an eye on it and see if anything strange happens.

Jeremy

bartsob5&alicjam
Joined: 17 Sep 05
Posts: 6
Credit: 183,280
RAC: 0
Message 8199 - Posted: 2 Jan 2006, 19:35:06 UTC

OK... it came back to normal after rebooting the PC, so it was only a false alarm. Now, after 1 hour, it is at 30%.

bartsob5&alicjam
Joined: 17 Sep 05
Posts: 6
Credit: 183,280
RAC: 0
Message 8302 - Posted: 3 Jan 2006, 21:17:42 UTC

And again... I had a WU named MORE_FRAGS_1ogw_222_4890 that errored out after almost 2 hours of computing. As I see, another user also had a problem with this WU, but is it another bad series?

Padanian
Joined: 27 Sep 05
Posts: 14
Credit: 15,190
RAC: 0
Message 8315 - Posted: 4 Jan 2006, 0:24:37 UTC

Have a look at this

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=3833695

It seems like a recursive computing error.

O&O
Joined: 11 Dec 05
Posts: 25
Credit: 66,900
RAC: 0
Message 8383 - Posted: 4 Jan 2006, 22:45:52 UTC

I aborted DEFAULT_1n0u_218_344_8...
For the reason..in bold.

Did I do the right thing?

O&O

Tern
Joined: 25 Oct 05
Posts: 576
Credit: 4,695,450
RAC: 15
Message 8393 - Posted: 4 Jan 2006, 23:10:00 UTC - in response to Message 8383.  

Did I do the right thing?


Your computers are hidden, so we can't look at the WU, so no idea... I _think_ so, but would have to look at the web page to be sure.

O&O
Joined: 11 Dec 05
Posts: 25
Credit: 66,900
RAC: 0
Message 8405 - Posted: 5 Jan 2006, 6:10:47 UTC - in response to Message 8393.  
Last modified: 5 Jan 2006, 6:29:32 UTC




©2024 University of Washington
https://www.bakerlab.org