Computational Error

Fuzzy Hollynoodles
Joined: 7 Oct 05
Posts: 234
Credit: 15,020
RAC: 0
Message 1135 - Posted: 9 Oct 2005, 3:32:25 UTC - in response to Message 1129.  
Last modified: 9 Oct 2005, 3:40:42 UTC

Holly,

No.

....



Answer from Einstein moved to here:

Oh, was it me you posted something to?

People usually address me as Fuzzy, or Ms. Noodles for some.

No, I couldn't use your post for anything, except the part about the person who crunched that WU before me having a Mac OS that's not compatible. I didn't know that, as I don't have a Mac.

No, it was NOT a benchmark that made the WU crash! My first WU went fine, and the one I'm crunching now seems to be doing OK. It has reached the critical point of 83.33%, so we'll have to see. So I think that particular WU is bad.

I don't know if you bothered to look at the specs of my computer, as I don't have it hidden here (it's only over at SETI that I have it hidden, for reasons I won't touch on now), so I don't know why you think my computer is memory limited. And no, it didn't crash because a benchmark was running. I haven't had an automatic benchmark run for several days now, but if one should come while my computer is crunching Rosetta and the WU crashes, so be it, until they solve the problem here.

But thanks anyway.



[b]"I'm trying to maintain a shred of dignity in this world." - Me[/b]

Fuzzy Hollynoodles
Joined: 7 Oct 05
Posts: 234
Credit: 15,020
RAC: 0
Message 1137 - Posted: 9 Oct 2005, 4:13:27 UTC
Last modified: 9 Oct 2005, 4:55:32 UTC

It just happened again!

OK, let's see if we can dissect this problem from my log:

10/8/2005 8:30:04 PM||Starting BOINC client version 4.72 for windows_intelx86
10/8/2005 8:30:04 PM||Data directory: C:\Programmer\BOINC
10/8/2005 8:30:04 PM||Processor Inventory: 1 GenuineIntel Intel(R) Pentium(R) 4 CPU 2.80GHz Processor(s)
10/8/2005 8:30:04 PM||Memory Inventory: Memory total - 503.36 MB, Swap total - 1.20 GB
10/8/2005 8:30:04 PM||Disk Inventory: Disk total - 37.25 GB, Disk available - 27.46 GB
10/8/2005 8:30:05 PM|rosetta@home|Computer ID: 12228; location: home; project prefs: home
10/8/2005 8:30:05 PM|LHC@home|Computer ID: 64638; location: home; project prefs: home
10/8/2005 8:30:05 PM|SETI@home|Computer ID: 1489784; location: home; project prefs: home
10/8/2005 8:30:05 PM||General prefs: from rosetta@home (last modified 2005-10-08 20:23:30)
10/8/2005 8:30:05 PM||General prefs: using separate prefs for home
10/8/2005 8:30:05 PM||Remote control not allowed; using loopback address
10/8/2005 8:30:05 PM|rosetta@home|Deferring computation for result 1cfyA_abrelax_13371_1
10/8/2005 8:30:05 PM|SETI@home|Deferring computation for result 18oc03ab.11910.20178.754822.194_0
10/8/2005 8:30:05 PM|LHC@home|Resuming computation for result wjun4D_v6s4hhpac_mqx__10__64.3304_59.3467__6_8__6__60_1_sixvf_boinc29130_2 using sixtrack version 4.67
10/8/2005 8:30:05 PM|SETI@home|Deferring communication with project for 1 minutes and 48 seconds
10/8/2005 8:41:35 PM||request_reschedule_cpus: project op
10/8/2005 8:41:36 PM|LHC@home|Sending scheduler request to http://lhcathome-sched1.cern.ch/scheduler/cgi

.... A lot of contacts to LHC ....

10/8/2005 9:59:40 PM|LHC@home|Sending scheduler request to http://lhcathome-sched1.cern.ch/scheduler/cgi
10/8/2005 9:59:40 PM|LHC@home|Reason: To fetch work
10/8/2005 9:59:40 PM|LHC@home|Requesting 8450 seconds of work, returning 0 results
10/8/2005 9:59:41 PM|LHC@home|Scheduler request to http://lhcathome-sched1.cern.ch/scheduler/cgi succeeded
10/8/2005 9:59:41 PM|LHC@home|No work from project
10/8/2005 9:59:42 PM|LHC@home|Deferring communication with project for 16 minutes and 7 seconds
10/8/2005 9:59:52 PM||request_reschedule_cpus: process exited
10/8/2005 9:59:52 PM|LHC@home|Computation for result wjun4D_v6s4hhpac_mqx__10__64.3304_59.3467__6_8__6__60_1_sixvf_boinc29130_2 finished
10/8/2005 9:59:52 PM|rosetta@home|Restarting result 1cfyA_abrelax_13371_1 using rosetta version 4.77
10/8/2005 9:59:53 PM|LHC@home|Started upload of wjun4D_v6s4hhpac_mqx__10__64.3304_59.3467__6_8__6__60_1_sixvf_boinc29130_2_0
10/8/2005 10:00:00 PM|LHC@home|Finished upload of wjun4D_v6s4hhpac_mqx__10__64.3304_59.3467__6_8__6__60_1_sixvf_boinc29130_2_0
10/8/2005 10:00:00 PM|LHC@home|Throughput 7152 bytes/sec
10/8/2005 10:02:42 PM||request_reschedule_cpus: project op
10/8/2005 10:02:42 PM|LHC@home|Sending scheduler request to http://lhcathome-sched1.cern.ch/scheduler/cgi
10/8/2005 10:02:42 PM|LHC@home|Reason: Requested by user
10/8/2005 10:02:42 PM|LHC@home|Requesting 8640 seconds of work, returning 1 results

.... a lot of contacts to LHC ....

10/8/2005 10:29:20 PM|LHC@home|Sending scheduler request to http://lhcathome-sched1.cern.ch/scheduler/cgi
10/8/2005 10:29:20 PM|LHC@home|Reason: To fetch work
10/8/2005 10:29:20 PM|LHC@home|Requesting 8640 seconds of work, returning 0 results
10/8/2005 10:29:21 PM|LHC@home|Scheduler request to http://lhcathome-sched1.cern.ch/scheduler/cgi succeeded
10/8/2005 10:29:21 PM|LHC@home|No work from project
10/8/2005 10:29:22 PM|LHC@home|Deferring communication with project for 31 minutes and 9 seconds
10/8/2005 10:57:05 PM|rosetta@home|Sending scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi
10/8/2005 10:57:05 PM|rosetta@home|Reason: To fetch work
10/8/2005 10:57:05 PM|rosetta@home|Requesting 2800 seconds of work, returning 0 results
10/8/2005 10:57:08 PM|rosetta@home|Scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi succeeded
10/8/2005 10:57:09 PM|rosetta@home|Deferring communication with project for 5 seconds
10/8/2005 10:57:09 PM|rosetta@home|Started download of aa1acf_03_05.200_v1_3.gz
10/8/2005 10:57:09 PM|rosetta@home|Started download of aa1acf_09_05.200_v1_3.gz
10/8/2005 10:58:24 PM|rosetta@home|Finished download of aa1acf_03_05.200_v1_3.gz
10/8/2005 10:58:24 PM|rosetta@home|Throughput 14749 bytes/sec
10/8/2005 10:58:24 PM|rosetta@home|Started download of 1acf_.fasta
10/8/2005 10:58:25 PM|rosetta@home|Finished download of 1acf_.fasta
10/8/2005 10:58:25 PM|rosetta@home|Throughput 188 bytes/sec
10/8/2005 10:58:25 PM|rosetta@home|Started download of 1acf_.psipred_ss2.gz
10/8/2005 10:58:26 PM|rosetta@home|Finished download of 1acf_.psipred_ss2.gz
10/8/2005 10:58:26 PM|rosetta@home|Throughput 1429 bytes/sec
10/8/2005 10:58:26 PM|rosetta@home|Started download of 1acf.pdb.gz
10/8/2005 10:58:28 PM|rosetta@home|Finished download of 1acf.pdb.gz
10/8/2005 10:58:28 PM|rosetta@home|Throughput 16846 bytes/sec
10/8/2005 10:58:28 PM|rosetta@home|Started download of 1acf_.1d1jA.3dpair.base.pairmin_fixc3.cst.gz
10/8/2005 10:58:29 PM|rosetta@home|Finished download of 1acf_.1d1jA.3dpair.base.pairmin_fixc3.cst.gz
10/8/2005 10:58:29 PM|rosetta@home|Throughput 4522 bytes/sec
10/8/2005 10:59:39 PM|rosetta@home|Finished download of aa1acf_09_05.200_v1_3.gz
10/8/2005 10:59:39 PM|rosetta@home|Throughput 20819 bytes/sec
10/8/2005 10:59:39 PM||request_reschedule_cpus: files downloaded
10/8/2005 10:59:39 PM|rosetta@home|Pausing result 1cfyA_abrelax_13371_1 (removed from memory)

10/8/2005 10:59:40 PM|SETI@home|Restarting result 18oc03ab.11910.20178.754822.194_0 using setiathome version 4.18
10/8/2005 10:59:40 PM|rosetta@home|Unrecoverable error for result 1cfyA_abrelax_13371_1 ( - exit code -1073741819 (0xc0000005))
10/8/2005 10:59:41 PM||request_reschedule_cpus: process exited
10/8/2005 10:59:41 PM|rosetta@home|Deferring communication with project for 1 minutes and 0 seconds
10/8/2005 10:59:41 PM|rosetta@home|Computation for result 1cfyA_abrelax_13371_1 finished

10/8/2005 11:00:33 PM|LHC@home|Sending scheduler request to http://lhcathome-sched1.cern.ch/scheduler/cgi
10/8/2005 11:00:33 PM|LHC@home|Reason: To fetch work
10/8/2005 11:00:33 PM|LHC@home|Requesting 8640 seconds of work, returning 0 results
10/8/2005 11:00:34 PM|LHC@home|Scheduler request to http://lhcathome-sched1.cern.ch/scheduler/cgi succeeded
10/8/2005 11:00:34 PM|LHC@home|No work from project
10/8/2005 11:00:35 PM|LHC@home|Deferring communication with project for 18 minutes and 41 seconds

.... a lot of contacts to LHC ....

10/8/2005 11:19:18 PM|LHC@home|Scheduler request to http://lhcathome-sched1.cern.ch/scheduler/cgi succeeded
10/8/2005 11:19:18 PM|LHC@home|No work from project
10/8/2005 11:19:19 PM|LHC@home|Deferring communication with project for 1 hours, 53 minutes, and 37 seconds
10/8/2005 11:25:52 PM||request_reschedule_cpus: project op
10/8/2005 11:25:52 PM|SETI@home|Pausing result 18oc03ab.11910.20178.754822.194_0 (removed from memory)
10/8/2005 11:25:53 PM|rosetta@home|Starting result 1acf__abrelax_no_cst_06323_0 using rosetta version 4.77
10/8/2005 11:25:53 PM||request_reschedule_cpus: process exited
10/8/2005 11:25:53 PM|rosetta@home|Sending scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi
10/8/2005 11:25:53 PM|rosetta@home|Reason: Requested by user
10/8/2005 11:25:53 PM|rosetta@home|Requesting 0 seconds of work, returning 1 results // Here I return the first crashed WU
10/8/2005 11:25:55 PM|rosetta@home|Scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi succeeded

10/9/2005 12:19:20 AM|LHC@home|Deferring communication with project for 53 minutes and 37 seconds
10/9/2005 1:12:58 AM|LHC@home|Sending scheduler request to http://lhcathome-sched1.cern.ch/scheduler/cgi
10/9/2005 1:12:58 AM|LHC@home|Reason: To fetch work
10/9/2005 1:12:58 AM|LHC@home|Requesting 8640 seconds of work, returning 0 results
10/9/2005 1:12:59 AM|LHC@home|Scheduler request to http://lhcathome-sched1.cern.ch/scheduler/cgi succeeded
10/9/2005 1:12:59 AM|LHC@home|No work from project
10/9/2005 1:13:00 AM|LHC@home|Deferring communication with project for 58 seconds
.... more contacts to LHC ....

10/9/2005 1:23:44 AM|LHC@home|Sending scheduler request to http://lhcathome-sched1.cern.ch/scheduler/cgi
10/9/2005 1:23:44 AM|LHC@home|Reason: To fetch work
10/9/2005 1:23:44 AM|LHC@home|Requesting 8640 seconds of work, returning 0 results
10/9/2005 1:23:45 AM|LHC@home|Scheduler request to http://lhcathome-sched1.cern.ch/scheduler/cgi succeeded
10/9/2005 1:23:46 AM|LHC@home|No work from project
10/9/2005 1:23:47 AM|LHC@home|Deferring communication with project for 46 minutes and 53 seconds
10/9/2005 1:25:53 AM|SETI@home|Restarting result 18oc03ab.11910.20178.754822.194_0 using setiathome version 4.18
10/9/2005 1:25:53 AM|rosetta@home|Pausing result 1acf__abrelax_no_cst_06323_0 (removed from memory)
10/9/2005 1:25:55 AM||request_reschedule_cpus: process exited

10/9/2005 2:10:41 AM|LHC@home|Sending scheduler request to http://lhcathome-sched1.cern.ch/scheduler/cgi
10/9/2005 2:10:41 AM|LHC@home|Reason: To fetch work
10/9/2005 2:10:41 AM|LHC@home|Requesting 8640 seconds of work, returning 0 results
10/9/2005 2:10:42 AM|LHC@home|Scheduler request to http://lhcathome-sched1.cern.ch/scheduler/cgi succeeded
10/9/2005 2:10:42 AM|LHC@home|No work from project
10/9/2005 2:10:43 AM|LHC@home|Deferring communication with project for 3 minutes and 29 seconds
.... again contacts to LHC ....

10/9/2005 3:14:15 AM|LHC@home|Deferring communication with project for 1 hours, 1 minutes, and 13 seconds
10/9/2005 3:25:55 AM|SETI@home|Pausing result 18oc03ab.11910.20178.754822.194_0 (removed from memory)
10/9/2005 3:25:56 AM|rosetta@home|Restarting result 1acf__abrelax_no_cst_06323_0 using rosetta version 4.77
10/9/2005 3:25:56 AM||request_reschedule_cpus: process exited

10/9/2005 4:14:16 AM|LHC@home|Deferring communication with project for 1 minutes and 12 seconds
10/9/2005 4:15:29 AM|LHC@home|Sending scheduler request to http://lhcathome-sched1.cern.ch/scheduler/cgi
10/9/2005 4:15:29 AM|LHC@home|Reason: To fetch work
10/9/2005 4:15:29 AM|LHC@home|Requesting 8640 seconds of work, returning 0 results
10/9/2005 4:15:30 AM|LHC@home|Scheduler request to http://lhcathome-sched1.cern.ch/scheduler/cgi succeeded
10/9/2005 4:15:30 AM|LHC@home|No work from project
10/9/2005 4:15:31 AM|LHC@home|Deferring communication with project for 58 seconds
10/9/2005 4:16:31 AM|LHC@home|Fetching master file
10/9/2005 4:16:32 AM|LHC@home|Master page download succeeded

.... contacts to LHC .....

10/9/2005 4:43:54 AM|LHC@home|Sending scheduler request to http://lhcathome-sched1.cern.ch/scheduler/cgi
10/9/2005 4:43:54 AM|LHC@home|Reason: To fetch work
10/9/2005 4:43:54 AM|LHC@home|Requesting 8640 seconds of work, returning 0 results
10/9/2005 4:43:55 AM|LHC@home|Scheduler request to http://lhcathome-sched1.cern.ch/scheduler/cgi succeeded
10/9/2005 4:43:55 AM|LHC@home|No work from project
10/9/2005 4:43:56 AM|LHC@home|Deferring communication with project for 1 hours, 54 minutes, and 20 seconds
10/9/2005 5:25:56 AM|SETI@home|Restarting result 18oc03ab.11910.20178.754822.194_0 using setiathome version 4.18
10/9/2005 5:25:56 AM|rosetta@home|Pausing result 1acf__abrelax_no_cst_06323_0 (removed from memory)
10/9/2005 5:25:58 AM|rosetta@home|Unrecoverable error for result 1acf__abrelax_no_cst_06323_0 ( - exit code -1073741819 (0xc0000005))
10/9/2005 5:25:58 AM||request_reschedule_cpus: process exited
10/9/2005 5:25:58 AM|rosetta@home|Deferring communication with project for 1 minutes and 0 seconds
10/9/2005 5:25:58 AM|rosetta@home|Computation for result 1acf__abrelax_no_cst_06323_0 finished

10/9/2005 5:26:00 AM|SETI@home|Sending scheduler request to http://setiboinc.ssl.berkeley.edu/sah_cgi/cgi
10/9/2005 5:26:00 AM|SETI@home|Reason: To fetch work
10/9/2005 5:26:00 AM|SETI@home|Requesting 2171 seconds of work, returning 0 results
10/9/2005 5:26:02 AM|SETI@home|Scheduler request to http://setiboinc.ssl.berkeley.edu/sah_cgi/cgi succeeded
10/9/2005 5:26:03 AM|SETI@home|Deferring communication with project for 10 minutes and 4 seconds
10/9/2005 5:26:03 AM|SETI@home|Started download of 29ap04ab.4305.12480.803406.195
10/9/2005 5:26:16 AM|SETI@home|Finished download of 29ap04ab.4305.12480.803406.195
10/9/2005 5:26:16 AM|SETI@home|Throughput 29700 bytes/sec
10/9/2005 5:26:16 AM||request_reschedule_cpus: files downloaded
10/9/2005 5:26:58 AM|rosetta@home|Sending scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi
10/9/2005 5:26:58 AM|rosetta@home|Reason: To fetch work
10/9/2005 5:26:58 AM|rosetta@home|Requesting 8640 seconds of work, returning 1 results // here BOINC manager returns the second crashed WU. I didn't update this time!
10/9/2005 5:27:00 AM|rosetta@home|Scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi succeeded

10/9/2005 5:27:01 AM||request_reschedule_cpus: files downloaded
10/9/2005 5:42:58 AM||request_reschedule_cpus: project op
10/9/2005 5:42:58 AM|SETI@home|Pausing result 18oc03ab.11910.20178.754822.194_0 (removed from memory)
10/9/2005 5:42:59 AM|rosetta@home|Starting result 1acf__abrelax_no_cst_07670_0 using rosetta version 4.77
10/9/2005 5:42:59 AM||request_reschedule_cpus: process exited

10/9/2005 5:43:57 AM|LHC@home|Deferring communication with project for 54 minutes and 19 seconds

I have set Rosetta to No New Work for now.

And I have to look in the other threads and posts to see what I can help with to sort this out.


The last WU didn't seem stuck at any percentage, unlike the first one. The last time I looked at it, it was at about 87% with less than half an hour to go.

Maybe this is another bug of some kind.
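
For reference, the exit code in the two "Unrecoverable error" lines of the log can be decoded by hand. The sketch below is only an illustration (the helper name `to_ntstatus` and the small lookup table are made up for this post): it reinterprets the signed number BOINC prints as an unsigned 32-bit value, which gives the 0xc0000005 shown in parentheses. On Windows that value is the access violation status, so the application touched memory it was not allowed to, which by itself does not say why.

```python
def to_ntstatus(exit_code: int) -> str:
    """Reinterpret a signed 32-bit exit code as an unsigned hex string."""
    return f"0x{exit_code & 0xFFFFFFFF:08x}"


# Only the one code seen in this thread; not a complete table.
KNOWN_CODES = {"0xc0000005": "STATUS_ACCESS_VIOLATION"}

code = to_ntstatus(-1073741819)
print(code, KNOWN_CODES.get(code, "unknown"))
```
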

I'll save the files I have in my BOINC library right now, so David, if you're interested in them, you can contact me on fuzzy dot hollynoodles at gmail dot com.


[b]"I'm trying to maintain a shred of dignity in this world." - Me[/b]

J D K
Joined: 23 Sep 05
Posts: 168
Credit: 101,266
RAC: 0
Message 1138 - Posted: 9 Oct 2005, 4:25:29 UTC
Last modified: 9 Oct 2005, 4:26:10 UTC

You must keep Rosetta in memory when doing another project.
BOINC Wiki

Fuzzy Hollynoodles
Joined: 7 Oct 05
Posts: 234
Credit: 15,020
RAC: 0
Message 1139 - Posted: 9 Oct 2005, 4:31:39 UTC - in response to Message 1138.  
Last modified: 9 Oct 2005, 4:39:56 UTC

You must keep Rosetta in memory when doing another project.


// Where do I set it to that in 4.72?

EDIT: Found it!!!!

Let's see how it works out!

Thanks! :-)




[b]"I'm trying to maintain a shred of dignity in this world." - Me[/b]

[B^S] sTrey
Joined: 25 Sep 05
Posts: 16
Credit: 15,524
RAC: 0
Message 1140 - Posted: 9 Oct 2005, 4:35:33 UTC - in response to Message 1111.  
Last modified: 9 Oct 2005, 4:41:21 UTC


So it seems the benchmarking issue is isolated to multiple CPU machines, whether logical (HT) or physical (dual core/dual processors)


For what it's worth: I've crunched only 17 wu's but haven't had any problems yet and have gone through a couple of benchmarks. I'm running it on a HT cpu, but I've told it to limit cpus to one, since that keeps the cpu temp reasonable.
(& it's set to keep projects in memory)
Hardware is P4 3.2Ghz, 1G ram, running XP Pro

Just another data point.

Paul D. Buck
Joined: 17 Sep 05
Posts: 815
Credit: 1,812,737
RAC: 0
Message 1144 - Posted: 9 Oct 2005, 7:38:07 UTC - in response to Message 1135.  

Holly,

Fuzzy Holly, or whatever ... :)

For whatever reason I thought the name was Holly. Then again, what do I know... sorry if it offended.

But, the explanation was to tell you that there is a problem when Rosetta@Home is suspended and removed from memory.

Using your logs:

10/8/2005 10:59:39 PM|rosetta@home|Pausing result 1cfyA_abrelax_13371_1 (removed from memory)
10/8/2005 10:59:40 PM|SETI@home|Restarting result 18oc03ab.11910.20178.754822.194_0 using setiathome version 4.18
10/8/2005 10:59:40 PM|rosetta@home|Unrecoverable error for result 1cfyA_abrelax_13371_1 ( - exit code -1073741819 (0xc0000005))

The first and third of these are the two key lines. The first suspends the process and removes it from memory, which triggers the fatal error reported in the third. That is what I was trying to say and failing.
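
For anyone who wants to check their own log for this pattern, here is a minimal Python sketch. It assumes the message wording shown in the log lines quoted in this thread; the function name `crashed_after_removal` and the two regular expressions are invented for illustration and are not part of BOINC.

```python
import re

# Patterns based on the message wording seen in the logs in this thread.
PAUSE = re.compile(r"Pausing result (\S+) \(removed from memory\)")
ERROR = re.compile(r"Unrecoverable error for result (\S+) \(")


def crashed_after_removal(lines):
    """Return results that were removed from memory and later errored out."""
    removed = set()
    crashed = []
    for line in lines:
        match = PAUSE.search(line)
        if match:
            removed.add(match.group(1))
            continue
        match = ERROR.search(line)
        if match and match.group(1) in removed:
            crashed.append(match.group(1))
    return crashed


LOG = """\
10/8/2005 10:59:39 PM|rosetta@home|Pausing result 1cfyA_abrelax_13371_1 (removed from memory)
10/8/2005 10:59:40 PM|SETI@home|Restarting result 18oc03ab.11910.20178.754822.194_0 using setiathome version 4.18
10/8/2005 10:59:40 PM|rosetta@home|Unrecoverable error for result 1cfyA_abrelax_13371_1 ( - exit code -1073741819 (0xc0000005))
""".splitlines()

print(crashed_after_removal(LOG))
```

A result that shows up in the returned list was paused with "removed from memory" before it failed, which is the sequence described above. A result that fails without a preceding removal is not reported, so this does not catch every kind of client error.
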

Some get this error when Rosetta has work in memory and benchmarks run. I have been running a bit of Rosetta work, and of the 200 or so results I think I have only lost one to client error.

But I leave it in memory, as all my machines have at least 1G RAM ... PowerMac has 2.5G :)

I looked at the one work unit you complained of, and looked at the result of the other user and was just trying to say that it is not a bad work unit. He/she did not process the work because of an OS-X problem, you because of the suspend problem ... anyway, it seems you are ok now ...

Fuzzy Hollynoodles
Joined: 7 Oct 05
Posts: 234
Credit: 15,020
RAC: 0
Message 1145 - Posted: 9 Oct 2005, 7:55:22 UTC - in response to Message 1144.  
Last modified: 9 Oct 2005, 7:57:14 UTC

Thanks, Mr. Buck. No, I didn't get your meaning.

But let's see how things go now. I'm just puzzled that my first WU apparently went well, as I had the round-robin switch interval at the default 60 minutes and have two other projects running at the same time. So it has been taken in and out of memory a couple of times at least! Hmmm....

So maybe they have solved the problem and made some WUs that are not so sensitive, and I got one????

But no matter what, I'll most probably get the problem again if I'm unlucky enough that the automatic benchmarking kicks in while I have a Rosetta WU running. I can't change to any client above 4.* before LHC will let me. But then I'll know what's going on and just carry on. This seems to be the price at this project, unless they get this solved in the near future.


[b]"I'm trying to maintain a shred of dignity in this world." - Me[/b]

Paul D. Buck
Joined: 17 Sep 05
Posts: 815
Credit: 1,812,737
RAC: 0
Message 1147 - Posted: 9 Oct 2005, 9:17:05 UTC

It's PAUL!

Geeze ...

:)

There are other people who are saying that they are not seeing the problem. So, it may be intermittent too ... :(

As far as my meaning, I have not been doing all that hot for some weeks now so I am not surprised I was not as clear as I would have liked. Heck, earlier today I was typing and I lost that skill too ... not good signs ...

But, seem to be back to normal levels of bad, so we shall see if we improve.

I know David Kim has been hard at work, I can hear his brain grinding away all the way over here ... and he is several states away from where I live ... or, maybe it is just the trash truck ...

devn
Joined: 17 Sep 05
Posts: 18
Credit: 2,063
RAC: 0
Message 1160 - Posted: 9 Oct 2005, 14:00:58 UTC - in response to Message 1140.  


So it seems the benchmarking issue is isolated to multiple CPU machines, whether logical (HT) or physical (dual core/dual processors)


For what it's worth: I've crunched only 17 wu's but haven't had any problems yet and have gone through a couple of benchmarks. I'm running it on a HT cpu, but I've told it to limit cpus to one, since that keeps the cpu temp reasonable.
(& it's set to keep projects in memory)
Hardware is P4 3.2Ghz, 1G ram, running XP Pro

Just another data point.




I have HT but have also tried setting Rosetta to use 1 CPU to see if it would make a difference. Auto benchmarks caused an "unrecoverable error" on a WU yesterday. Rosetta is set to remain in memory when preempted, but auto benchmarks throw it out of memory.

STE\/E
Joined: 17 Sep 05
Posts: 125
Credit: 3,251,762
RAC: 1,321
Message 1161 - Posted: 9 Oct 2005, 14:08:09 UTC

You can run 2 CPU's with the HT CPU's & overcome the "unrecoverable error" by simply Suspending the Rosetta Project & doing a Manual BenchMark.

Do this and mark down when you did it & then just do it again before 5 Days are up, when the Server will ask for a Benchmark. I know this probably isn't practical for people with a Ton of Computers, but for those with not so many it's a simple workaround for the Error until the Dev's can figure out why it's happening ...
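
To keep track of the date, something like the following small Python sketch would do. It only adds the 5 Days mentioned above to the time of the last Manual BenchMark; the function name and the example timestamp are made up for illustration.

```python
from datetime import datetime, timedelta

# Interval taken from the post above; adjust if your client behaves differently.
BENCHMARK_INTERVAL = timedelta(days=5)


def next_benchmark_deadline(last_manual_run: datetime) -> datetime:
    """Latest time by which the next manual benchmark should be run."""
    return last_manual_run + BENCHMARK_INTERVAL


last_run = datetime(2005, 10, 9, 14, 8)  # example timestamp only
print(next_benchmark_deadline(last_run))
```
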

Paul D. Buck
Joined: 17 Sep 05
Posts: 815
Credit: 1,812,737
RAC: 0
Message 1166 - Posted: 9 Oct 2005, 14:47:09 UTC

If you use BoincView it should not be much more painful than doing it on one computer.

devn
Joined: 17 Sep 05
Posts: 18
Credit: 2,063
RAC: 0
Message 1171 - Posted: 9 Oct 2005, 15:21:02 UTC - in response to Message 1161.  
Last modified: 9 Oct 2005, 15:22:45 UTC

You can run 2 CPU's with the HT CPU's & overcome the "unrecoverable error" by simply Suspending the Rosetta Project & doing a Manual BenchMark.

Do this and mark down when you did it & then just do it again before 5 Days are up when the Server will ask for a Benchmark. I know this isn't probably practical for people with a Ton of Computers but for those with not so many it's a simple work around the Error until the Dev's can figure out why it's happening ...



wanted to try the 1 cpu idea since it had been noted that those computers with 1 cpu weren't having the problem. it didn't work for me but sTrey had success with it.

STE\/E
Joined: 17 Sep 05
Posts: 125
Credit: 3,251,762
RAC: 1,321
Message 1175 - Posted: 9 Oct 2005, 15:58:09 UTC
Last modified: 9 Oct 2005, 15:58:47 UTC

wanted to try the 1 cpu idea since it had been noted that those computers with 1 cpu weren't having the problem. it didn't work for me but sTrey had success with it.
==========

Well, if you have an HT CPU and run it only as 1 instead of 2, you're giving up around 15% to 25% of your Performance Crunching the WU's, so you may as well Crunch for a Project that's not having a Problem with the HT CPU's.
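
To put that range in everyday terms, here is a rough Python sketch that converts a throughput loss into crunching hours per day. The 15% and 25% figures are the estimate from this post, not measurements, and the function name is made up for illustration.

```python
def hours_lost_per_day(loss_fraction: float, hours_per_day: float = 24.0) -> float:
    """Equivalent crunching hours given up per day at a given throughput loss."""
    return loss_fraction * hours_per_day


# The two ends of the range estimated above.
for loss in (0.15, 0.25):
    print(f"{loss:.0%} loss = {hours_lost_per_day(loss):.1f} h/day")
```
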

But that can be hard to do also, since it seems all the Projects have some sort of Problem with them. Almost any Project I've run WU's for has a Problem from time to time starting up the Second or next WU when 1 of the 2 running finishes. It's an on-again, off-again thing, but all the Projects have the problem. Who knows ... :)

Fuzzy Hollynoodles
Joined: 7 Oct 05
Posts: 234
Credit: 15,020
RAC: 0
Message 1178 - Posted: 9 Oct 2005, 16:52:50 UTC

Problem solved! For me at least!

But this problem was very confusing, and yes, I have read about it in the other threads, but because my first WU went fine, I didn't think my problem was about letting the WUs stay in memory on my computer. Very confusing, when it didn't crash the first time under the very same conditions as later!

But I have returned the next valid WU and have the third one crunching now.

So GO Rosetta!!!


[b]"I'm trying to maintain a shred of dignity in this world." - Me[/b]

Angus
Joined: 17 Sep 05
Posts: 412
Credit: 321,053
RAC: 0
Message 1180 - Posted: 9 Oct 2005, 16:59:23 UTC
Last modified: 9 Oct 2005, 17:03:38 UTC

Here are 4 WUs that all failed (dual Xeon HT, host ID=1779) when the auto benchmark ran:

10/9/2005 10:00:10 AM||Suspending computation and network activity - running CPU benchmarks
10/9/2005 10:00:10 AM|rosetta@home|Pausing result 1acf__abrelax_no_cst_02636_0 (removed from memory) // set to leave in memory
10/9/2005 10:00:10 AM|rosetta@home|Pausing result 1acf__abrelax_no_cst_04179_0 (removed from memory)
10/9/2005 10:00:10 AM|rosetta@home|Pausing result 1acf__abrelax_no_cst_04241_0 (removed from memory)
10/9/2005 10:00:10 AM|rosetta@home|Pausing result 1acf__abrelax_04274_0 (removed from memory)
10/9/2005 10:00:10 AM|rosetta@home|Unrecoverable error for result 1acf__abrelax_no_cst_04179_0 ( - exit code -1073741819 (0xc0000005))
10/9/2005 10:00:10 AM||request_reschedule_cpus: process exited
10/9/2005 10:00:11 AM|rosetta@home|Unrecoverable error for result 1acf__abrelax_no_cst_02636_0 ( - exit code -1073741819 (0xc0000005))
10/9/2005 10:00:11 AM|rosetta@home|Unrecoverable error for result 1acf__abrelax_no_cst_04241_0 ( - exit code -1073741819 (0xc0000005))
10/9/2005 10:00:11 AM|rosetta@home|Unrecoverable error for result 1acf__abrelax_04274_0 ( - exit code -1073741819 (0xc0000005))
10/9/2005 10:00:11 AM||request_reschedule_cpus: process exited

10/9/2005 10:00:12 AM||Running CPU benchmarks
10/9/2005 10:01:09 AM||Benchmark results:
10/9/2005 10:01:09 AM|| Number of CPUs: 4
10/9/2005 10:01:09 AM|| 1222 double precision MIPS (Whetstone) per CPU
10/9/2005 10:01:09 AM|| 1044 integer MIPS (Dhrystone) per CPU
10/9/2005 10:01:09 AM||Finished CPU benchmarks
10/9/2005 10:01:09 AM||Resuming computation and network activity
10/9/2005 10:01:09 AM||request_reschedule_cpus: Resuming activities
10/9/2005 10:01:09 AM|rosetta@home|Deferring communication with project for 2 seconds
10/9/2005 10:01:09 AM|rosetta@home|Computation for result 1acf__abrelax_no_cst_02636_0 finished
10/9/2005 10:01:09 AM|rosetta@home|Computation for result 1acf__abrelax_no_cst_04241_0 finished
10/9/2005 10:01:09 AM|rosetta@home|resume_or_start(): unexpected process state 2
10/9/2005 10:01:09 AM|rosetta@home|resume_or_start(): unexpected process state 2
10/9/2005 10:01:09 AM|rosetta@home|Starting result 1acf__abrelax_04604_0 using rosetta version 4.77
10/9/2005 10:01:10 AM|rosetta@home|Starting result 1acf__abrelax_no_cst_05010_0 using rosetta version 4.77
10/9/2005 10:01:10 AM|rosetta@home|Computation for result 1acf__abrelax_no_cst_04179_0 finished
10/9/2005 10:01:11 AM|rosetta@home|Computation for result 1acf__abrelax_04274_0 finished

10/9/2005 10:01:12 AM|rosetta@home|Sending scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi
10/9/2005 10:01:12 AM|rosetta@home|Reason: To fetch work
10/9/2005 10:01:12 AM|rosetta@home|Requesting 73006 seconds of work, returning 5 results
10/9/2005 10:01:13 AM|rosetta@home|Scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi succeeded
10/9/2005 10:01:14 AM||request_reschedule_cpus: files downloaded
10/9/2005 10:01:14 AM||request_reschedule_cpus: files downloaded
10/9/2005 10:01:14 AM||request_reschedule_cpus: files downloaded
10/9/2005 10:01:14 AM||request_reschedule_cpus: files downloaded
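
For anyone comparing their own logs, here is a rough Python sketch that counts the "Unrecoverable error" lines falling between the benchmark suspension message and "Finished CPU benchmarks". It assumes the message wording in the log above; the function name is made up for illustration and the sample is a shortened copy of those lines.

```python
def errors_during_benchmark(lines):
    """Count unrecoverable errors logged while benchmarks had work suspended."""
    in_benchmark = False
    count = 0
    for line in lines:
        if "running CPU benchmarks" in line:
            # "Suspending computation and network activity - running CPU benchmarks"
            in_benchmark = True
        elif "Finished CPU benchmarks" in line:
            in_benchmark = False
        elif in_benchmark and "Unrecoverable error for result" in line:
            count += 1
    return count


SAMPLE = """\
10/9/2005 10:00:10 AM||Suspending computation and network activity - running CPU benchmarks
10/9/2005 10:00:10 AM|rosetta@home|Unrecoverable error for result 1acf__abrelax_no_cst_04179_0 ( - exit code -1073741819 (0xc0000005))
10/9/2005 10:00:11 AM|rosetta@home|Unrecoverable error for result 1acf__abrelax_no_cst_02636_0 ( - exit code -1073741819 (0xc0000005))
10/9/2005 10:00:11 AM|rosetta@home|Unrecoverable error for result 1acf__abrelax_no_cst_04241_0 ( - exit code -1073741819 (0xc0000005))
10/9/2005 10:00:11 AM|rosetta@home|Unrecoverable error for result 1acf__abrelax_04274_0 ( - exit code -1073741819 (0xc0000005))
10/9/2005 10:00:12 AM||Running CPU benchmarks
10/9/2005 10:01:09 AM||Finished CPU benchmarks
""".splitlines()

print(errors_during_benchmark(SAMPLE))
```
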

Proudly Banned from Predictator@Home and now Cosmology@home as well. Added SETI to the list today. Temporary ban only - so need to work harder :)



"You can't fix stupid" (Ron White)

devn
Joined: 17 Sep 05
Posts: 18
Credit: 2,063
RAC: 0
Message 1182 - Posted: 9 Oct 2005, 17:34:26 UTC - in response to Message 1175.  

Well, if you have an HT CPU and run it only as 1 instead of 2, you're giving up around 15% to 25% of your Performance Crunching the WU's, so you may as well Crunch for a Project that's not having a Problem with the HT CPU's.

But that can be hard to do also, since it seems all the Projects have some sort of Problem with them. Almost any Project I've run WU's for has a Problem from time to time starting up the Second or next WU when 1 of the 2 running finishes. It's an on-again, off-again thing, but all the Projects have the problem. Who knows ... :)


i wouldn't mind giving up a little performance if using 1 cpu had worked and allowed rosetta to run w/out errors. haven't had problems with HT and other projects so far.

Paul D. Buck
Joined: 17 Sep 05
Posts: 815
Credit: 1,812,737
RAC: 0
Message 1189 - Posted: 9 Oct 2005, 23:17:36 UTC

Angus,

I am way behind on log file work, but can you zip up the TXT and OLD files and send them to me at p.d.buck@comcast.net?

I hope to get at least one example out of them ... I don't like to pull from the pages here as I always seem to be missing something (when I compare the logs to posts).

Anyway, Thanks!

Mike Gelvin
Joined: 7 Oct 05
Posts: 65
Credit: 10,612,039
RAC: 0
Message 1190 - Posted: 10 Oct 2005, 0:53:27 UTC
Last modified: 10 Oct 2005, 0:58:49 UTC

I just had an HT machine error out without benchmarks being activated. I have had several computers (4) error out; 3 were HT, 1 was single-threaded. I have successfully returned one work unit, from a single-threaded CPU. I'm pretty sure benchmarks were run on that computer, but I'm not sure Rosetta was actually interrupted (it could have been SETI). Anyway, here are the logs from the latest that failed.
10/09/05 13:24:46||request_reschedule_cpus: files downloaded
10/09/05 13:24:46|LHC@home|Restarting result w5_lhc_coll_IP15_trip_meas__30__64.31_59.32__4_6__6__28.5_1_sixvf_boinc14060_2 using sixtrack version 4.67
10/09/05 13:24:46|rosetta@home|Restarting result 1cfyA_abrelax_18090_0 using rosetta version 4.77
10/09/05 13:24:46|SETI@home|Pausing result 19fe04ab.3149.13793.42344.183_2 (removed from memory)
10/09/05 13:24:46|SETI@home|Pausing result 19fe04ab.3149.14129.848566.207_1 (removed from memory)
10/09/05 13:24:47||request_reschedule_cpus: process exited
10/09/05 13:24:47|LHC@home|Pausing result w5_lhc_coll_IP15_trip_meas__30__64.31_59.32__4_6__6__28.5_1_sixvf_boinc14060_2 (removed from memory)
10/09/05 13:24:47|rosetta@home|Starting result 1acf__abrelax_05497_2 using rosetta version 4.77
10/09/05 13:24:48||request_reschedule_cpus: process exited
10/09/05 15:24:49|LHC@home|Restarting result w5_lhc_coll_IP15_trip_meas__30__64.31_59.32__4_6__6__28.5_1_sixvf_boinc14060_2 using sixtrack version 4.67
10/09/05 15:24:49|rosetta@home|Pausing result 1cfyA_abrelax_18090_0 (removed from memory)
10/09/05 15:24:49|SETI@home|Restarting result 19fe04ab.3149.13793.42344.183_2 using setiathome version 4.18
10/09/05 15:24:49|rosetta@home|Pausing result 1acf__abrelax_05497_2 (removed from memory)
10/09/05 15:24:51|rosetta@home|Unrecoverable error for result 1cfyA_abrelax_18090_0 ( - exit code -1073741819 (0xc0000005))
10/09/05 15:24:52|rosetta@home|Unrecoverable error for result 1acf__abrelax_05497_2 ( - exit code -1073741819 (0xc0000005))
10/09/05 15:24:52||request_reschedule_cpus: process exited
10/09/05 15:24:52|rosetta@home|Computation for result 1cfyA_abrelax_18090_0 finished
10/09/05 15:24:52|rosetta@home|Computation for result 1acf__abrelax_05497_2 finished
10/09/05 15:24:52|LHC@home|Pausing result w5_lhc_coll_IP15_trip_meas__30__64.31_59.32__4_6__6__28.5_1_sixvf_boinc14060_2 (removed from memory)
10/09/05 15:24:52|SETI@home|Restarting result 19fe04ab.3149.14129.848566.207_1 using setiathome version 4.18

Angus
Joined: 17 Sep 05
Posts: 412
Credit: 321,053
RAC: 0
Message 1209 - Posted: 10 Oct 2005, 17:00:55 UTC - in response to Message 1189.  
Last modified: 10 Oct 2005, 17:03:15 UTC

Angus,

I am way behind on log file work, but, can you zip up the TXT and OLD files and send them to me p.d.buck@comcast.net

I hope to get at least one example out of them ... I don't like to pull from the pages here as I always seem to be missing something (when I compare the logs to posts).

Anyway, Thanks!


Paul - Sent. Let me know if they don't arrive - danged mail filters around here...


edit - Since I just saw in another post that 5.x.x may fix the problem, I'll update one of the dual Xeon HT boxes to v5 to see if it changes anything.

Proudly Banned from Predictator@Home and now Cosmology@home as well. Added SETI to the list today. Temporary ban only - so need to work harder :)



"You can't fix stupid" (Ron White)

Jord
Joined: 16 Sep 05
Posts: 41
Credit: 204,120
RAC: 0
Message 1211 - Posted: 10 Oct 2005, 17:19:16 UTC - in response to Message 1209.  

Best wait a day or two longer. Release Client v5.2 is on its way, expected sometime this week. If only so it saves you from having to change BOINC yet again in a couple of days.



©2024 University of Washington
https://www.bakerlab.org