Problems with Rosetta version 5.41

martino.corti

Joined: 23 Nov 06
Posts: 1
Credit: 180
RAC: 0
Message 32329 - Posted: 9 Dec 2006, 14:09:30 UTC

Hi all,
I am working with Rosetta 5.41 and have had a few "Computation error" results in the last few weeks.
In one case I noticed that the error occurred immediately after an "Activity / Suspend" command (issued by me through the standard BOINC Manager, since from time to time I need to have the whole PC available), but I wasn't sure of the evidence.

Today I had some spare time to verify the possible causal correlation:
-) the PC was dedicated to BOINC;
-) Rosetta 5.41 was running on a WU;
-) I issued the "Activity / Suspend" command using BOINC Manager and the WU was immediately reported as "Computational Error".

I suspect that the same happens when the BOINC system preempts a WU: in one case the CPU time was close to the configured 60' (59'54"), but this is harder to verify explicitly, since a WU could be left anywhere when the PC is shut down (which does not cause the problem).

I hope this observation can be of help.

Martino


ID: 32329 · Rating: 0
Thomas F. Bates IV
Joined: 10 May 06
Posts: 5
Credit: 2,853,254
RAC: 0
Message 32332 - Posted: 9 Dec 2006, 15:12:21 UTC - in response to Message 31881.  

FYI -
Two recent WUs caused 5.41 to consume 100% CPU and it refused to suspend on user activity; I had to kill the process. Win2kPro, BOINC 5.4.9
ID: 32332 · Rating: 0
Conan
Joined: 11 Oct 05
Posts: 150
Credit: 3,818,279
RAC: 456
Message 32338 - Posted: 9 Dec 2006, 16:07:52 UTC

> Have had 8 WUs fail with the same error code and all on the same machine, Host 264297, a P4 2.53 GHz @ 2.75 GHz, not running Boinc screensaver just a standard Windows screensaver. Has been fine till 2 days ago.

https://boinc.bakerlab.org/rosetta/result.php?resultid=50937552
https://boinc.bakerlab.org/rosetta/result.php?resultid=50895262
https://boinc.bakerlab.org/rosetta/result.php?resultid=50854943
https://boinc.bakerlab.org/rosetta/result.php?resultid=50734311
https://boinc.bakerlab.org/rosetta/result.php?resultid=50414918
https://boinc.bakerlab.org/rosetta/result.php?resultid=50327688 (debug info)
https://boinc.bakerlab.org/rosetta/result.php?resultid=50288762

All have "exit code -1073741819"

A few WUs have processed in between the failed ones but so far none have run on the 9/12/06.
I turned off the Boinc graphics 3 weeks ago and all has been running error free till now.
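
For reference, a minimal Python sketch of how a signed exit code like the one above can be reinterpreted as the unsigned 32-bit status value that Windows uses. The helper name and the small lookup table are made up for illustration. 0xC0000005 is the standard Windows access-violation status; on its own it does not say which part of the application faulted.

```python
def to_unsigned_status(exit_code: int) -> str:
    """Reinterpret a signed 32-bit exit code as an unsigned hex status."""
    return f"0x{exit_code & 0xFFFFFFFF:08X}"


# Hypothetical lookup table with a single well-known Windows status value.
KNOWN_STATUSES = {"0xC0000005": "STATUS_ACCESS_VIOLATION"}

status = to_unsigned_status(-1073741819)
print(status, KNOWN_STATUSES.get(status, "unknown"))
```

The mask with 0xFFFFFFFF is what turns the negative number shown on the result pages into the hex form usually quoted in Windows documentation.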
ID: 32338 · Rating: -1
Faust
Joined: 7 Sep 06
Posts: 14
Credit: 49,559
RAC: 0
Message 32359 - Posted: 9 Dec 2006, 22:29:05 UTC
Last modified: 9 Dec 2006, 22:30:27 UTC

I don't know if that's a problem - but I think it's very unlikely for my fastest machine to only get 3.29 credits for a completed WU :) (claimed credit 42.04)

I don't see any errors and it also didn't crash.
Screensaver was also off.

Result

I'm also having the same problem Feet1st described in a dedicated thread - 'RAC dropping, BOINC dropping comms'. It happens a lot lately, but I guess that's a Boinc issue.

Faust.



ID: 32359 · Rating: 0
Conan
Joined: 11 Oct 05
Posts: 150
Credit: 3,818,279
RAC: 456
Message 32385 - Posted: 10 Dec 2006, 7:24:21 UTC - in response to Message 32338.


Another two on the same machine:
https://boinc.bakerlab.org/rosetta/result.php?resultid=50984487 (This one hung for hours so I aborted it; Boinc Manager showed nothing happening.)

https://boinc.bakerlab.org/rosetta/result.php?resultid=51029592 (Same error as before: "Exit code -1073741819")

Have now updated Boinc from 5.5.0 to 5.4.11 to see what happens.

ID: 32385 · Rating: 0
Conan
Joined: 11 Oct 05
Posts: 150
Credit: 3,818,279
RAC: 456
Message 32401 - Posted: 10 Dec 2006, 11:36:10 UTC - in response to Message 32385.


> Well that was a great success, first 2 WUs with new Boinc version and had 2 more failures, still with the same error (exit code -1073741819)

https://boinc.bakerlab.org/rosetta/result.php?resultid=51110957
https://boinc.bakerlab.org/rosetta/result.php?resultid=51130389

Also had one lock up for hours on one of my Linux machines with the cpu time not moving nor the % done or time left, so had to abort
https://boinc.bakerlab.org/rosetta/result.php?resultid=50256485.

The only thing I have done with the Pentium 4 machine is change that host from always running to run as per preferences so as to give the cpu a break.
ID: 32401 · Rating: 0
Jnargus
Joined: 4 Oct 06
Posts: 5
Credit: 6,955,745
RAC: 2,892
Message 32406 - Posted: 10 Dec 2006, 16:26:18 UTC

"Nice" to see that others are having the same problem I am. My linux box 366920 still seems to be getting Rosetta credit but most of the time the WU just stops updating with no messages. I have now suspended Rosetta on this machine as it "wastes" one of my cores when the WU is running but nothing is happening. My windows boxes don't seem to be having this problem so Rosetta will still run there.


ID: 32406 · Rating: 0
[AF>Slappyto] popolito
Joined: 8 Mar 06
Posts: 13
Credit: 998,822
RAC: 1,440
Message 32413 - Posted: 10 Dec 2006, 17:09:39 UTC

The latest applications use a lot of memory (I have only 256 MB of RAM, so I can't crunch for Rosetta).
ID: 32413 · Rating: 0
Feet1st
Joined: 30 Dec 05
Posts: 1755
Credit: 4,690,520
RAC: 0
Message 32416 - Posted: 10 Dec 2006, 18:26:52 UTC

Futura Sciences, the project added the Docking work units and found they took considerably more memory than other work units, so they changed things so that these work units would only be sent to computers with more memory. That was around mid-October. The system requirements still show that 256MB should work fine. Most work units only use around 110MB. Those that are known to use more will only be sent to machines with more than 256MB of memory.

So, unless you are having the screensaver problems, you should be crunching fine with 256MB. And if you are having screensaver problems, please just turn the screensaver to (none) until the problems are resolved.
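
To make the rule above concrete, here is a rough Python sketch of that kind of filter. It is only an illustration under assumptions: the `WorkUnit` class, the `needs_extra_memory` flag and the `eligible_work` function are invented names, not the project's scheduler code. Only the 256MB threshold comes from the description above.

```python
from dataclasses import dataclass

# Threshold taken from the post above; everything else is hypothetical.
BASE_MEMORY_MB = 256


@dataclass
class WorkUnit:
    name: str
    needs_extra_memory: bool  # True for work units known to use more memory


def eligible_work(work_units, host_memory_mb):
    """Return the work units that may be sent to a host with this much RAM."""
    return [
        wu for wu in work_units
        if not wu.needs_extra_memory or host_memory_mb > BASE_MEMORY_MB
    ]


queue = [WorkUnit("ordinary_wu", False), WorkUnit("docking_wu", True)]
print([wu.name for wu in eligible_work(queue, 256)])
print([wu.name for wu in eligible_work(queue, 512)])
```

With this rule a 256MB host still receives the ordinary work units and is only skipped for the ones flagged as memory hungry.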
Add this signature to your EMail:
Running Microsoft's "System Idle Process" will never help cure cancer, AIDS nor Alzheimer's. But running Rosetta@home just might!
https://boinc.bakerlab.org/rosetta/
ID: 32416 · Rating: 0
Chu
Joined: 23 Feb 06
Posts: 120
Credit: 112,439
RAC: 0
Message 32419 - Posted: 10 Dec 2006, 19:27:07 UTC - in response to Message 32413.  

There are plenty of jobs in the queue which do not have a memory requirement, and you should be able to receive jobs to crunch. What messages did you get?


ID: 32419 · Rating: 0
Chu
Joined: 23 Feb 06
Posts: 120
Credit: 112,439
RAC: 0
Message 32420 - Posted: 10 Dec 2006, 19:39:35 UTC - in response to Message 32401.  

Exit code -1073741819 is graphics-related. I'm not sure about the failed one on the Linux machine; it may just have gotten stuck. Does it happen very often?


ID: 32420 · Rating: 0
Chu
Joined: 23 Feb 06
Posts: 120
Credit: 112,439
RAC: 0
Message 32421 - Posted: 10 Dec 2006, 19:48:17 UTC - in response to Message 32406.  

Hi, in your results from the Linux host, I saw "segmentation violation" in those "client error" WUs. I assume these are the WUs you are reporting here. Can you describe a little more of what you have seen? Did those jobs get stuck as well? Did you manually abort those WUs? This will help us understand those stderr.txt files better and decide whether your reported problems are the same as the ones Conan has reported in his post below. Thanks.


ID: 32421 · Rating: 0
Jnargus
Joined: 4 Oct 06
Posts: 5
Credit: 6,955,745
RAC: 2,892
Message 32426 - Posted: 10 Dec 2006, 21:51:20 UTC
Last modified: 10 Dec 2006, 21:52:27 UTC

Sun 10 Dec 2006 09:25:27 AM EST|rosetta@home|Pausing task BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R7_R40_filters_1441_553_0 (removed from memory)
Sun 10 Dec 2006 09:25:27 AM EST|rosetta@home|Pausing task BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R7_R41_filters_1441_553_0 (removed from memory)

These are the two WUs I currently have. I just resumed the Rosetta project and BOINC nicely paused the two Einstein WUs that were going but never gave me a message saying that the Rosetta WUs had been restarted. The BOINC manager task pane shows the two Rosetta WUs as "Running" but the time and percentages are not changing and, from what I've seen in the past, the processes that are running in memory are not actually doing anything.

I have reset the Rosetta project several times to clean out the WUs that are in the state these two are in now. After resetting the project it seems to work fine for a while, then I get some WUs that error out, and then the rest just seem to hang. I have had them hang at around 1 hour of CPU time, which is where the two I have now are, and sometimes it gets over two hours before it hangs. All the WUs that have not been reported were ones that just hung, and I have since cleared them out by resetting the project.

If you need any more info let me know

2006-12-10 08:23:53 [rosetta@home] Unrecoverable error for result BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R10_R13_filters_1441_512_0 (process exited with code 131 (0x83))
2006-12-10 08:23:55 [rosetta@home] Unrecoverable error for result BENCH_ABRELAX_SAVE_ALL_OUT_4ubpA_BARCODE_R2_R77_filters_1441_493_0 (process got signal 11)

These are the last two lines from the stderrdae.txt file.
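
In case it helps anyone going through a longer log, here is a small Python sketch that pulls the result name and failure reason out of lines shaped like the two above. The regular expression and the function name are my own assumptions based only on these two sample lines, not a documented BOINC log format. Signal 11 is SIGSEGV, which would match the segmentation violation mentioned earlier in the thread.

```python
import re
import signal

# Pattern inferred from the two sample lines only; other log lines may differ.
LINE_RE = re.compile(
    r"^(?P<time>\S+ \S+) \[(?P<project>[^\]]+)\] "
    r"Unrecoverable error for result (?P<result>\S+) "
    r"\((?P<reason>.*)\)$"
)


def parse_unrecoverable(line):
    """Return a dict describing the failure, or None if the line does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    info = match.groupdict()
    sig = re.search(r"signal (\d+)", info["reason"])
    code = re.search(r"exited with code (\d+)", info["reason"])
    # Translate a signal number into its symbolic name where one was reported.
    info["signal"] = signal.Signals(int(sig.group(1))).name if sig else None
    info["exit_code"] = int(code.group(1)) if code else None
    return info


LOG_LINES = [
    "2006-12-10 08:23:53 [rosetta@home] Unrecoverable error for result "
    "BENCH_ABRELAX_SAVE_ALL_OUT_1ctf__BARCODE_R10_R13_filters_1441_512_0 "
    "(process exited with code 131 (0x83))",
    "2006-12-10 08:23:55 [rosetta@home] Unrecoverable error for result "
    "BENCH_ABRELAX_SAVE_ALL_OUT_4ubpA_BARCODE_R2_R77_filters_1441_493_0 "
    "(process got signal 11)",
]

for entry in map(parse_unrecoverable, LOG_LINES):
    print(entry["result"], entry["exit_code"], entry["signal"])
```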

(I would have included the previous post but I'm still too new at this to know how ;-)
(Now I see the Reply to this Post button!)(Doh)

ID: 32426 · Rating: 0
P . P . L .
Joined: 20 Aug 06
Posts: 581
Credit: 4,865,274
RAC: 0
Message 32447 - Posted: 11 Dec 2006, 7:37:15 UTC
Last modified: 11 Dec 2006, 7:38:18 UTC

This one clapped out after 6hrs 40min not long after it restarted.

On 10hr run time. No graphics or screensaver.

https://boinc.bakerlab.org/rosetta/result.php?resultid=51329026

ID: 32447 · Rating: 0
Conan
Joined: 11 Oct 05
Posts: 150
Credit: 3,818,279
RAC: 456
Message 32453 - Posted: 11 Dec 2006, 11:42:17 UTC - in response to Message 32420.

> Thanks Chu, no, it does not happen often, in fact hardly at all since I turned the Boinc screensaver off 3-4 weeks ago. But if you say that error is graphics related then there must be a problem with running ANY screensaver with Rosetta, as I only run a standard Windows one and have since even turned that off.
The machine has started to process again OK now.
Spoke too soon. I came home today and found that a power glitch of some sort had turned off all my computers bar one (on a UPS). On turning them back on, the Intel that has been creating all these reports, and was just working again, had fried its power supply, so further testing will have to wait (this may have been the reason for the errors: a failing PSU?).
The workunits that lock up occur now and then and, if caught early, don't cause a problem, but I have had a couple go for days before I found them and lost heaps of processing because of it.




ID: 32453 · Rating: 0
Jnargus
Joined: 4 Oct 06
Posts: 5
Credit: 6,955,745
RAC: 2,892
Message 32454 - Posted: 11 Dec 2006, 13:02:39 UTC - in response to Message 32421.  

https://boinc.bakerlab.org/result.php?resultid=51444398
https://boinc.bakerlab.org/result.php?resultid=51444399

I just aborted these two WU for failing to do anything. My system was still happily "Running" these WUs but nothing was happening. I also started Rosetta on my other linux (Debian) box and I will let you know if it has problems.

Hope this helps







ID: 32454 · Rating: 0
googloo
Joined: 15 Sep 06
Posts: 133
Credit: 21,715,841
RAC: 5,856
Message 32458 - Posted: 11 Dec 2006, 16:06:13 UTC

Getting quite a few of the messages "rosetta@home|rosetta not responding to screensaver, exiting" roughly one a day. I have reset the project with no change in frequency. Was happening occasionally with 5.40 but higher frequency with 5.41.

I can't tell if it's graphics related. I have the screensaver set to start after 2 minutes idle, and the screen goes blank after 3 minutes.

Processor: 2 GenuineIntel Intel(R) Pentium(R) D CPU 3.40GHz
Memory: 2.00 GB physical, 3.85 GB virtual
Disk: 222.65 GB total, 180.39 GB free
Windows XP

BOINC runs all the time, projects left in memory.

ID: 32458 · Rating: 0
[AF>Slappyto] popolito
Joined: 8 Mar 06
Posts: 13
Credit: 998,822
RAC: 1,440
Message 32463 - Posted: 11 Dec 2006, 17:25:02 UTC

There are plenty of jobs in the queue which do not have a memory requirement, and you should be able to receive jobs to crunch. What messages did you get?

No, what I meant was that I can't crunch for Rosetta because the application takes too much memory, and that is annoying when I use the computer (the application takes more than 100 MB).
ID: 32463 · Rating: 0
The_Bad_Penguin
Joined: 5 Jun 06
Posts: 2751
Credit: 4,271,025
RAC: 0
Message 32466 - Posted: 11 Dec 2006, 19:18:13 UTC - in response to Message 32463.  

Problem? My end? Rosetta's end?

Result ID 51580518

CPU time 17330.796875
stderr out <core_client_version>5.4.11</core_client_version>
<stderr_txt>
# random seed: 1782254
# cpu_run_time_pref: 21600
**********************************************************************
Rosetta score is stuck or going too long. Watchdog is ending the run!
Stuck at score 3.29644 for 3600 seconds
**********************************************************************
GZIP SILENT FILE: .cc1tig.out

</stderr_txt>


Validate state Valid
Claimed credit 62.638971651533
Granted credit 20
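
For anyone wondering what the watchdog message in the stderr output means, here is a rough Python sketch of the general idea: end the run when the score has not changed for a set time. This is a guess at the behaviour for illustration only. The class and method names are invented and this is not Rosetta's actual watchdog code; only the 3600 second limit is taken from the output above.

```python
class ScoreWatchdog:
    """Flags a run whose score has not changed for stuck_limit_s seconds."""

    def __init__(self, stuck_limit_s=3600.0):
        self.stuck_limit_s = stuck_limit_s
        self.last_score = None
        self.last_change_time = None

    def update(self, score, now):
        """Record the current score; return True if the run should be ended."""
        if self.last_score is None or score != self.last_score:
            # The score moved (or this is the first sample), so restart the timer.
            self.last_score = score
            self.last_change_time = now
            return False
        return now - self.last_change_time >= self.stuck_limit_s


watchdog = ScoreWatchdog()
for t in (0, 1800, 3600):
    if watchdog.update(3.29644, now=t):
        print(f"Stuck at score {watchdog.last_score} for {t} seconds")
```

Timestamps are passed in explicitly so the sketch can be checked without waiting an hour; a real watchdog would presumably read a clock on its own thread.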
ID: 32466 · Rating: 0
P . P . L .
Joined: 20 Aug 06
Posts: 581
Credit: 4,865,274
RAC: 0
Message 32471 - Posted: 11 Dec 2006, 20:35:12 UTC

Are these WUs bad? I had another error last night; it was the same type as the other one I reported before, FRA_t369 something. I was the second to do it, if anyone is interested. It ran for 1 hr 20-something.

https://boinc.bakerlab.org/rosetta/result.php?resultid=51542218



ID: 32471 · Rating: 0



©2024 University of Washington
https://www.bakerlab.org