Computational Error

Message boards : Number crunching : Computational Error

Betting Slip

Joined: 26 Sep 05
Posts: 71
Credit: 5,702,246
RAC: 0
Message 1057 - Posted: 7 Oct 2005, 12:25:21 UTC

Rosetta returns a computational error when BOINC CC 4.45 runs its CPU benchmarks.

This is probably due to Rosetta being removed from memory while BOINC runs the benchmark.
ID: 1057 · Rating: 0
Ocean Archer
Joined: 22 Sep 05
Posts: 32
Credit: 49,302
RAC: 0
Message 1059 - Posted: 7 Oct 2005, 12:42:11 UTC

This reinforces my comment that Rosetta does not like to share, regardless of the reason. LevelStake --- do you have the option to leave the program in memory selected?
ID: 1059 · Rating: 0
Betting Slip
Joined: 26 Sep 05
Posts: 71
Credit: 5,702,246
RAC: 0
Message 1065 - Posted: 7 Oct 2005, 15:59:49 UTC - in response to Message 1059.  

This reinforces my comment that Rosetta does not like to share, regardless of the reason. LevelStake --- do you have the option to leave the program in memory selected?



Yes, all the BOINC projects I run are set to stay in memory, but BOINC CC seems to throw them out when doing an auto benchmark.
ID: 1065 · Rating: 0
Paul D. Buck
Joined: 17 Sep 05
Posts: 815
Credit: 1,812,737
RAC: 0
Message 1068 - Posted: 7 Oct 2005, 17:04:09 UTC

To make the system consistent for measurement purposes, you have to unload the system as much as possible. That means halting the science applications and removing them from memory.

Now here is a question: what happens if BOINC is halted while the application is in memory? Does it equally abend the work when restarted? Or is there some subtle difference in the way the application responds to the two unloads?
ID: 1068 · Rating: 0
Ocean Archer
Joined: 22 Sep 05
Posts: 32
Credit: 49,302
RAC: 0
Message 1077 - Posted: 7 Oct 2005, 20:12:09 UTC
Last modified: 7 Oct 2005, 20:16:42 UTC

Paul --

I cannot speak for others, but on my small, old, slow machines running Windows ME or Windows 2000, I do not have the same problem with LHC, SETI, Einstein, Predictor or PrimeGrid. Since my machines are old and slow, one would think they would be the first to show problems; such is not the case.

A couple of my machines have been upgraded to BOINC 5.x.y, so I cannot retest all machines with all combinations unless I trash the upgrade and go back to version 4.45.

Excluding Climate Prediction (I don't run it), the bottom line is - Rosetta remains the only project that I cannot run in conjunction with other BOINC projects.

As far as stopping and restarting Rosetta, I have found on my machines that even when power to the computer is interrupted and the system later restarted, the project WU does not fail, and the system simply picks up and continues to process ...

(edited)

ID: 1077 · Rating: 0
Paul D. Buck
Joined: 17 Sep 05
Posts: 815
Credit: 1,812,737
RAC: 0
Message 1078 - Posted: 7 Oct 2005, 21:15:50 UTC

Ocean Archer,

The "unload"/client error problem *IS* with the Rosetta science application; there is no question about that in anyone's mind. Why this is so is the burning question. And the answer is, right now, no one knows.

If you rummage through my account you can see I have had one. But my machines are almost always higher end than most people's (I have not bought or run a system with less than 1 GB of RAM for several years).

So, I am not finding fault, and I will be delighted if we can figure this one out. I know David Kim is looking into the problem ...
ID: 1078 · Rating: 0
Ocean Archer
Joined: 22 Sep 05
Posts: 32
Credit: 49,302
RAC: 0
Message 1082 - Posted: 7 Oct 2005, 23:05:07 UTC

Paul and David --

I'm not the expert in this area, but I'm willing to break into any of my machines and configure them for testing purposes. How can I be of service??
ID: 1082 · Rating: 0
Jord
Joined: 16 Sep 05
Posts: 41
Credit: 204,120
RAC: 0
Message 1084 - Posted: 7 Oct 2005, 23:34:19 UTC

The unloading of science applications from memory while a benchmark runs is fixed in the next version of BOINC.

Using 5.1.8 I haven't found any problems with Rosetta being in memory together with Seti, Einstein, Seti Beta, Boinc Alpha and two other projects, or with Rosetta being in memory at the time of the automated benchmark or a forced manual one.

So in my opinion, if everyone may need to get 5.2.0 in about a week's time anyway, does time need to be spent on fixing it for older versions of BOINC? In that case, you'd best find the "leave in memory while benchmarking" fix in the CVS and build a couple of older BOINC versions with that fix. ;)
ID: 1084 · Rating: 0
David E K
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 1085 - Posted: 8 Oct 2005, 0:29:00 UTC - in response to Message 1084.  

The unloading of science applications from memory while a benchmark runs is fixed in the next version of BOINC.

Using 5.1.8 I haven't found any problems with Rosetta being in memory together with Seti, Einstein, Seti Beta, Boinc Alpha and two other projects, or with Rosetta being in memory at the time of the automated benchmark or a forced manual one.

So in my opinion, if everyone may need to get 5.2.0 in about a week's time anyway, does time need to be spent on fixing it for older versions of BOINC? In that case, you'd best find the "leave in memory while benchmarking" fix in the CVS and build a couple of older BOINC versions with that fix. ;)


Can anyone else confirm this (just wondering if this holds for other platforms/Windows versions)? I guess people with limited memory may still want to have apps removed from memory when running multiple projects. I am wondering how the other projects dealt with this issue/bug. Does anyone know?
ID: 1085 · Rating: 0
Jord
Joined: 16 Sep 05
Posts: 41
Credit: 204,120
RAC: 0
Message 1089 - Posted: 8 Oct 2005, 1:26:02 UTC
Last modified: 8 Oct 2005, 1:39:03 UTC

As far as I know it is for all platforms, David.

But look:

08/10/2005 03:19:15||Suspending computation and network activity - running CPU benchmarks
08/10/2005 03:19:15|rosetta@home|Pausing result 1cfyA_abrelax_no_cst_10632_1 (left in memory)
08/10/2005 03:19:17||Running CPU benchmarks
08/10/2005 03:20:16||Benchmark results:
08/10/2005 03:20:16|| Number of CPUs: 1
08/10/2005 03:20:16|| 1217 double precision MIPS (Whetstone) per CPU
08/10/2005 03:20:16|| 2423 integer MIPS (Dhrystone) per CPU
08/10/2005 03:20:16||Finished CPU benchmarks
08/10/2005 03:20:17||Resuming computation and network activity
08/10/2005 03:20:17||request_reschedule_cpus: Resuming activities
08/10/2005 03:20:17|rosetta@home|Resuming result 1cfyA_abrelax_no_cst_10632_1 using rosetta version 477

(Link to the fix in CVS: here .. excuse me, was linking to the wrong fix first. Linking to the correct one of the 21st of September 2005 now.)
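
For anyone who wants to check their own message log for the same thing, here is a minimal sketch of how lines in the format shown above could be scanned for the pause mode. It assumes the log is pipe-delimited as `timestamp|project|message`, exactly as in the excerpt; the function names `parse_line` and `pause_events` are made up for this example and are not part of BOINC.

```python
import re

# Matches the two pause variants that appear in this thread's logs.
PAUSE_RE = re.compile(r"Pausing result (\S+) \((left in|removed from) memory\)")


def parse_line(line):
    """Split one pipe-delimited log line into (timestamp, project, message)."""
    timestamp, project, message = line.rstrip("\n").split("|", 2)
    return timestamp, project, message


def pause_events(lines):
    """Yield (project, result_name, mode) for every 'Pausing result' line."""
    for line in lines:
        if line.count("|") < 2:
            continue  # not in the assumed timestamp|project|message layout
        _, project, message = parse_line(line)
        match = PAUSE_RE.search(message)
        if match:
            yield project, match.group(1), match.group(2) + " memory"


# Three lines copied from the log excerpt above.
LOG = [
    "08/10/2005 03:19:15||Suspending computation and network activity - running CPU benchmarks",
    "08/10/2005 03:19:15|rosetta@home|Pausing result 1cfyA_abrelax_no_cst_10632_1 (left in memory)",
    "08/10/2005 03:19:17||Running CPU benchmarks",
]

EVENTS = list(pause_events(LOG))
print(EVENTS)
```

A result reported as "left in memory" here is the behaviour the newer client shows during benchmarks; "removed from memory" is the case the failing posts in this thread report.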
ID: 1089 · Rating: 0
Aurora Borealis
Joined: 7 Oct 05
Posts: 15
Credit: 352,300
RAC: 0
Message 1092 - Posted: 8 Oct 2005, 5:40:46 UTC
Last modified: 8 Oct 2005, 5:42:07 UTC

I can't leave the project in memory because it causes other problems with Windows 98.
The main difficulty is that you end up with several projects apparently running at the same time, and very long crunch times that go nowhere.

Any ideas!!

Questions? Answers are in the BOINC Wiki.

Boinc V6.12.41
Win 7 i5 GPU Nvidia 470
ID: 1092 · Rating: 0
Webmaster Yoda
Joined: 17 Sep 05
Posts: 161
Credit: 162,253
RAC: 0
Message 1096 - Posted: 8 Oct 2005, 7:54:25 UTC - in response to Message 1092.  
Last modified: 8 Oct 2005, 7:56:02 UTC

I can't leave the project in memory because it causes other problems with Windows 98. Any ideas!!


I had a look at the specs of your system. With only 223MB of RAM you will likely continue to have problems running Rosetta, at least until such time that the devs can sort this problem out. I'd suggest running something else for the time being, or running only Rosetta.

*** Join BOINC@Australia today ***
ID: 1096 · Rating: 0
Solblekt
Joined: 27 Sep 05
Posts: 8
Credit: 3,302
RAC: 0
Message 1098 - Posted: 8 Oct 2005, 9:25:06 UTC

I can report that on my two computers Rosetta crashes after about 80% is done. Before 80% is done it swaps between projects without any problem. One computer has 256 MB, the other 512 MB. The crash occurs when there is a swap from Rosetta to any other project. I do not leave the projects in memory.
I have noticed that Rosetta uses a lot of memory at this stage, over 160 MB.
Could the problem have something to do with memory allocation?
For a few seconds, the message in BOINC says that something went wrong in a calculation.

I do not believe that it is a good idea to leave all projects in memory,
not as long as it uses that much memory anyway.
You tend to end up with a computer that spends its time paging instead.

For now I have stopped crunching for Rosetta.
I hope they will give us a hint on the front page when the problem has been solved.
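
The paging concern can be put into rough numbers. The sketch below is only an illustration: the 160 MB Rosetta figure and the 256 MB / 512 MB machine sizes come from this post, while the footprints of the other two projects and the amount reserved for the operating system are hypothetical placeholders, not measurements.

```python
def resident_total_mb(footprints_mb):
    """Sum the memory of every application that is kept resident."""
    return sum(footprints_mb.values())


def fits_in_ram(ram_mb, footprints_mb, reserved_mb):
    """True if all resident applications plus the reserved amount fit in RAM."""
    return resident_total_mb(footprints_mb) + reserved_mb <= ram_mb


# Only the Rosetta figure is from the thread; the rest are assumed values.
FOOTPRINTS_MB = {
    "rosetta": 160,          # "over 160 MB" as observed above
    "other_project_a": 40,   # hypothetical
    "other_project_b": 40,   # hypothetical
}
RESERVED_FOR_OS_MB = 96      # hypothetical allowance for the OS and desktop

for ram in (256, 512):
    print(ram, "MB:", fits_in_ram(ram, FOOTPRINTS_MB, RESERVED_FOR_OS_MB))
```

With these assumed numbers the 256 MB machine would not fit everything and would likely start paging, while the 512 MB machine would have headroom. Substitute real figures from the task manager before drawing any conclusion.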
ID: 1098 · Rating: 0
kb7rzf
Joined: 7 Oct 05
Posts: 16
Credit: 35,427
RAC: 0
Message 1103 - Posted: 8 Oct 2005, 12:37:04 UTC

10/7/2005 8:51:28 PM|rosetta@home|Pausing result 1cfyA_abrelax_16893_0 (removed from memory)
10/7/2005 8:51:30 PM|rosetta@home|Unrecoverable error for result 1cfyA_abrelax_16893_0 ( - exit code -1073741819 (0xc0000005))
10/7/2005 8:51:31 PM||request_reschedule_cpus: process exited
10/7/2005 8:51:31 PM|rosetta@home|Deferring communication with project for 1 minutes and 0 seconds
10/7/2005 8:51:31 PM|rosetta@home|Computation for result 1cfyA_abrelax_16893_0 finished

I had one error come up with mine and, as Solblekt said, it happened after 80% was done. I currently have 1 WU in this project that is not going any higher: it's sitting at 83.33%, with the time to complete slowly going up and the percentage not moving.

Jeremy
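
A note on the exit code in the log above: the decimal and hexadecimal numbers are the same 32-bit value, printed once as a signed integer and once as unsigned. The short sketch below shows the conversion. The lookup table is deliberately tiny and only covers the one code seen in this thread; on Windows, 0xC0000005 is the NTSTATUS value for an access violation, which would fit a memory-related fault but does not by itself show where the fault is.

```python
def to_unsigned32(exit_code):
    """Reinterpret a signed 32-bit exit code as an unsigned 32-bit value."""
    return exit_code & 0xFFFFFFFF


# Deliberately incomplete: only the status code that appears in this thread.
KNOWN_STATUS = {0xC0000005: "STATUS_ACCESS_VIOLATION"}


def describe(exit_code):
    """Format an exit code as hex plus a name when the code is in the table."""
    status = to_unsigned32(exit_code)
    return f"0x{status:08X} {KNOWN_STATUS.get(status, 'unknown')}"


print(describe(-1073741819))
```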

ID: 1103 · Rating: 0
Aurora Borealis
Joined: 7 Oct 05
Posts: 15
Credit: 352,300
RAC: 0
Message 1109 - Posted: 8 Oct 2005, 14:06:32 UTC - in response to Message 1096.  
Last modified: 8 Oct 2005, 14:07:17 UTC

I can't leave the project in memory because it causes other problems with Windows 98. Any ideas!!


I had a look at the specs of your system. With only 223MB of RAM you will likely continue to have problems running Rosetta, at least until such time that the devs can sort this problem out. I'd suggest running something else for the time being, or running only Rosetta.

Thanks for the reply. I've temporarily suspended the other projects and set Rosetta to NO NEW WORK until I run through my current queue.
Hopefully a solution will be found and I can reactivate this project in the future.

Questions? Answers are in the BOINC Wiki.

Boinc V6.12.41
Win 7 i5 GPU Nvidia 470
ID: 1109 · Rating: 0
Webmaster Yoda
Joined: 17 Sep 05
Posts: 161
Credit: 162,253
RAC: 0
Message 1111 - Posted: 8 Oct 2005, 15:15:39 UTC
Last modified: 8 Oct 2005, 15:42:04 UTC

For what it's worth.

I just had one crash on me, after 3.75 hours of crunching. Happened when BOINC did its automatic benchmarks. First one with this problem on this machine that I'm aware of. 3.4GHz Pentium 4 with Windows SBS 2003 Server, 1GB RAM, HT enabled with Rosetta on one logical CPU, CPDN on the other.

8/10/2005 11:08:04 PM 720 Suspending computation and network activity - running CPU benchmarks
8/10/2005 11:08:04 PM 721 Pausing result 3i29_000185019_0 (removed from memory)
8/10/2005 11:08:04 PM 722 Pausing result 1cfyA_abrelax_09628_0 (removed from memory)
8/10/2005 11:08:05 PM 723 Unrecoverable error for result 1cfyA_abrelax_09628_0 ( - exit code -1073741819 (0xc0000005))

No problem with the CPDN work unit, only Rosetta.

However... At approximately the same time, my Laptop (2.8GHz Mobile Pentium 4, Win XP Pro, 512MB RAM, no HT) and my other PC (overclocked Athlon XP 3000+, Win2K Pro, 512MB RAM) ran benchmarks as well, and had NO problem.

So it seems the benchmarking issue is isolated to multi-CPU machines, whether the CPUs are logical (HT) or physical (dual core/dual processor).

*** Join BOINC@Australia today ***
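
The pattern in the log above (a result is paused with "removed from memory" and then immediately fails) can be picked out of a longer log mechanically. This is a rough sketch that matches on the message text only, so it should tolerate the different timestamp prefixes seen in this thread; the function name `failures_after_unload` is invented for the example.

```python
import re

REMOVED_RE = re.compile(r"Pausing result (\S+) \(removed from memory\)")
ERROR_RE = re.compile(r"Unrecoverable error for result (\S+) .*exit code (-?\d+)")


def failures_after_unload(lines):
    """Return (result_name, exit_code) for results that errored after unload."""
    unloaded = set()
    failures = []
    for line in lines:
        removed = REMOVED_RE.search(line)
        if removed:
            unloaded.add(removed.group(1))
            continue
        error = ERROR_RE.search(line)
        if error and error.group(1) in unloaded:
            failures.append((error.group(1), int(error.group(2))))
    return failures


# Lines copied from the log excerpt above.
LOG = [
    "8/10/2005 11:08:04 PM 720 Suspending computation and network activity - running CPU benchmarks",
    "8/10/2005 11:08:04 PM 721 Pausing result 3i29_000185019_0 (removed from memory)",
    "8/10/2005 11:08:04 PM 722 Pausing result 1cfyA_abrelax_09628_0 (removed from memory)",
    "8/10/2005 11:08:05 PM 723 Unrecoverable error for result 1cfyA_abrelax_09628_0 ( - exit code -1073741819 (0xc0000005))",
]

FAILURES = failures_after_unload(LOG)
print(FAILURES)
```

On this excerpt only the Rosetta result is reported; the CPDN result was unloaded as well but never produced an error line.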
ID: 1111 · Rating: 0
Betting Slip
Joined: 26 Sep 05
Posts: 71
Credit: 5,702,246
RAC: 0
Message 1112 - Posted: 8 Oct 2005, 15:35:26 UTC - in response to Message 1068.  

To make the system consistent for measurement purposes, you have to unload the system as much as possible. That means halting the science applications and removing them from memory.

Now here is a question: what happens if BOINC is halted while the application is in memory? Does it equally abend the work when restarted? Or is there some subtle difference in the way the application responds to the two unloads?



I have noticed that if you exit BOINC and reboot the computer, Rosetta carries on as normal. Rosetta only has problems during benchmarking, when it is removed from memory by CC; if it is left in memory, there is no problem.
ID: 1112 · Rating: 0
Fuzzy Hollynoodles
Joined: 7 Oct 05
Posts: 234
Credit: 15,020
RAC: 0
Message 1127 - Posted: 8 Oct 2005, 21:50:28 UTC

Another one:

10/8/2005 10:59:40 PM|rosetta@home|Unrecoverable error for result 1cfyA_abrelax_13371_1 ( - exit code -1073741819 (0xc0000005))
10/8/2005 10:59:41 PM||request_reschedule_cpus: process exited
10/8/2005 10:59:41 PM|rosetta@home|Deferring communication with project for 1 minutes and 0 seconds
10/8/2005 10:59:41 PM|rosetta@home|Computation for result 1cfyA_abrelax_13371_1 finished

It was this WU and I can see it's getting ready to get sent again. I don't want it!

Maybe it's just bad!


"I'm trying to maintain a shred of dignity in this world." - Me

ID: 1127 · Rating: 0
Paul D. Buck
Joined: 17 Sep 05
Posts: 815
Credit: 1,812,737
RAC: 0
Message 1129 - Posted: 8 Oct 2005, 22:50:12 UTC

Holly,

No.

The first death was related to a problem with Rosetta@Home and some versions of OS-X. The person with that computer will never complete a work unit successfully.

Your error is MOST LIKELY the problem that David Kim is tearing his hair out over. Rosetta@Home seems to have problems when it is removed from memory. There are variations, but they seem to center on running the benchmarks and on pausing with removal from memory.

The solution for these is to let them stay in memory. If your computer is memory limited you may want to put RAH on hold for a bit.

Another issue has to do with checkpointing and that is being looked into so the work is saved more often.

Last point, BOINC policy is to never send a failed work unit to the same account. Not just computer, account (unless they changed the rules on me behind my back, in which case Ingleside will usually yell at me ...). So, if it dies, you should never see it again.

In contrast, classic SETI@Home never made these kinds of tests so the overall science had the potential of being compromised.
ID: 1129 · Rating: 0
The Pirate
Joined: 22 Sep 05
Posts: 20
Credit: 7,090,933
RAC: 0
Message 1132 - Posted: 9 Oct 2005, 2:38:34 UTC

Paul, let me add to this: it is only happening on my dual-processor computers. One is running dual AMD MP 2100s on 32-bit Windows SP2, one is running dual AMD MP 2600s on Linux, and one is running a pair of Opteron 275s (4 CPUs) on 64-bit Windows. I do have BOINC set to leave it in memory.

ID: 1132 · Rating: 0

©2024 University of Washington
https://www.bakerlab.org