Rosetta crashes on pausing

Ron Peterson

Joined: 6 Oct 05
Posts: 23
Credit: 4,268,694
RAC: 0
Message 1097 - Posted: 8 Oct 2005, 9:12:06 UTC

I'm new to Rosetta and this is bugging me. I get a WU about 75% done and:

10/8/2005 3:24:42 AM|rosetta@home|Pausing result 1acf__abrelax_00304_0 (removed from memory)
10/8/2005 3:24:43 AM|rosetta@home|Unrecoverable error for result 1acf__abrelax_00304_0 ( - exit code -1073741819 (0xc0000005))

What gives and how do I fix it???

Ron
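
For reference, the exit code in the log above can be decoded by hand. The following is a minimal Python sketch, not anything taken from BOINC itself; the function name `to_ntstatus_hex` is made up for illustration. It reinterprets the signed 32-bit exit code as an unsigned value, which is how Windows status codes are normally written.

```python
def to_ntstatus_hex(exit_code: int) -> str:
    """Reinterpret a signed 32-bit exit code as an unsigned hex status value."""
    # Masking with 0xFFFFFFFF maps negative values onto their two's-complement form.
    return f"0x{exit_code & 0xFFFFFFFF:08X}"


print(to_ntstatus_hex(-1073741819))
```

The result, 0xC0000005, matches the value in the log line. On Windows that status is STATUS_ACCESS_VIOLATION, meaning the process touched memory it was not allowed to access.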

Solblekt
Joined: 27 Sep 05
Posts: 8
Credit: 3,302
RAC: 0
Message 1099 - Posted: 8 Oct 2005, 9:34:17 UTC

We all have the same problem.
https://boinc.bakerlab.org/rosetta/forum_thread.php?id=132
https://boinc.bakerlab.org/rosetta/forum_thread.php?id=85
https://boinc.bakerlab.org/rosetta/forum_thread.php?id=126

There are more links for this subject.

Solblekt
Joined: 27 Sep 05
Posts: 8
Credit: 3,302
RAC: 0
Message 1100 - Posted: 8 Oct 2005, 9:35:59 UTC

Well now why can't I click on the links I just posted?

cah_user_1217
Joined: 17 Sep 05
Posts: 3
Credit: 2,187
RAC: 0
Message 1101 - Posted: 8 Oct 2005, 10:05:47 UTC - in response to Message 1100.  

Well now why can't I click on the links I just posted?


Perhaps you didn't use the 'url=' BBCode? Take a look here for the correct usage.
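
As an illustration of the 'url=' usage mentioned above, here is a small Python sketch that builds a clickable BBCode link. The helper name `bbcode_link` is hypothetical, and the `[url=...]label[/url]` form is the common BBCode syntax, assumed here to be what this board accepts.

```python
def bbcode_link(url: str, label: str = "") -> str:
    """Wrap a URL in BBCode url tags so the forum renders it as a link."""
    # With no label, fall back to showing the URL itself as the link text.
    text = label or url
    return f"[url={url}]{text}[/url]"


print(bbcode_link("https://boinc.bakerlab.org/rosetta/forum_thread.php?id=132"))
```

Pasting the printed string into a post, instead of the bare address, should give a link that can be clicked.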

Ron Peterson
Joined: 6 Oct 05
Posts: 23
Credit: 4,268,694
RAC: 0
Message 1106 - Posted: 8 Oct 2005, 13:27:44 UTC

Ah well, looks like trouble for everyone who runs more than just Rosetta. Not me. Good.

FZB
Joined: 17 Sep 05
Posts: 84
Credit: 4,607,091
RAC: 827
Message 1151 - Posted: 9 Oct 2005, 10:08:27 UTC

You can avoid this error (except when a benchmark run interrupts the task) by setting your preferences to "leave app in memory" until science app 4.77 is replaced with a newer one. Note, though, that this will increase your memory usage.
--
Florian
www.domplatz1.de

Charles Dennett
Joined: 27 Sep 05
Posts: 102
Credit: 2,070,826
RAC: 0
Message 1157 - Posted: 9 Oct 2005, 13:31:04 UTC
Last modified: 9 Oct 2005, 13:35:19 UTC

Well, here is another data point. Running BOINC CC 4.72, compiled myself on a Fedora Core 3 box with an AMD XP 2600+ CPU. It appears it tried to run benchmarks and failed to stop the rosetta app that was running. Computation did not resume. When I found it, the CPU usage was down to 0% and the rosetta processes and the BOINC CC process were still there but doing nothing. (Actually, there were 3 rosetta processes, which seems to be normal.) I had to run my stop script (boincctl, found on the Add-on page; I'm its author) to abort all the processes. I do have my preferences set to leave applications in memory.

Here is a copy of stdout:

2005-10-09 08:59:19 [---] Suspending computation and network activity - running CPU benchmarks
2005-10-09 08:59:19 [rosetta@home] Pausing result 1acf__abrelax_no_cst_06642_0 (removed from memory)
2005-10-09 08:59:21 [---] Running CPU benchmarks
2005-10-09 08:59:29 [---] Failed to stop applications; aborting CPU benchmarks
2005-10-09 08:59:29 [---] Resuming computation and network activity
2005-10-09 08:59:29 [---] request_reschedule_cpus: Resuming activities
2005-10-09 08:59:29 [---] ACTIVE_TASK_SET::check_app_exited(): pid 21432 not found

Then I stopped things:

2005-10-09 09:12:45 [---] Received signal 15
2005-10-09 09:12:45 [---] Exit requested by user
2005-10-09 09:12:51 [---] request_reschedule_cpus: exit_tasks

Then I restarted everything:
2005-10-09 09:13:04 [---] Starting BOINC client version 4.72 for i686-pc-linux-gnu
2005-10-09 09:13:04 [---] Data directory: /home/charlie/Boinc
2005-10-09 09:13:04 [---] Processor Inventory: 1 AuthenticAMD AMD Athlon(TM) XP 2600+ Processor(s)
2005-10-09 09:13:04 [---] Memory Inventory: Memory total - 503.37 MB, Swap total - 1019.74 MB
2005-10-09 09:13:04 [---] Disk Inventory: Disk total - 55.39 GB, Disk available - 47.40 GB
2005-10-09 09:13:04 [Predictor @ Home] Computer ID: 116444; location: home; project prefs: default
2005-10-09 09:13:04 [rosetta@home] Computer ID: 4375; location: home; project prefs: default
2005-10-09 09:13:04 [SETI@home] Computer ID: 850659; location: home; project prefs: default
2005-10-09 09:13:04 [---] General prefs: from rosetta@home (last modified 2005-10-07 19:31:33)
2005-10-09 09:13:04 [---] General prefs: no separate prefs for home; using your defaults
2005-10-09 09:13:04 [---] Remote control allowed
2005-10-09 09:13:04 [rosetta@home] Resuming computation for result 1acf__abrelax_no_cst_06642_0 using rosetta version 4.77




-Charlie
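
A log like the one above can also be scanned automatically. The following Python sketch assumes the line layout seen in this post (`date time [project] message`); the regular expression and the function name `find_removed_from_memory` are illustrative, not part of BOINC.

```python
import re

# Layout assumed from the log excerpt: "YYYY-MM-DD HH:MM:SS [project] message"
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[(.*?)\] (.*)$")


def find_removed_from_memory(lines):
    """Return (timestamp, project, result name) for each pause that dropped a task from memory."""
    hits = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip lines that do not follow the assumed layout
        timestamp, project, message = match.groups()
        if message.startswith("Pausing result ") and message.endswith("(removed from memory)"):
            # The result name is the third whitespace-separated token of the message.
            hits.append((timestamp, project, message.split()[2]))
    return hits


sample = [
    "2005-10-09 08:59:19 [---] Suspending computation and network activity - running CPU benchmarks",
    "2005-10-09 08:59:19 [rosetta@home] Pausing result 1acf__abrelax_no_cst_06642_0 (removed from memory)",
    "2005-10-09 08:59:21 [---] Running CPU benchmarks",
]
print(find_removed_from_memory(sample))
```

Note that the pause in this log was triggered by the CPU benchmarks, and the client reports the task as removed from memory even though the preference was set to leave applications in memory.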

Ron Peterson
Joined: 6 Oct 05
Posts: 23
Credit: 4,268,694
RAC: 0
Message 1173 - Posted: 9 Oct 2005, 15:50:07 UTC - in response to Message 1151.  

You can avoid this error (except when a benchmark run interrupts the task) by setting your preferences to "leave app in memory" until science app 4.77 is replaced with a newer one. Note, though, that this will increase your memory usage.

I'll try this...

Charles Dennett
Joined: 27 Sep 05
Posts: 102
Credit: 2,070,826
RAC: 0
Message 1179 - Posted: 9 Oct 2005, 16:58:34 UTC - in response to Message 1157.  
Last modified: 9 Oct 2005, 16:59:11 UTC

Hmm. The result actually finished and can be found here:
https://boinc.bakerlab.org/rosetta/result.php?resultid=222089

It's marked valid, but the stderr.txt output shown on that page would seem to indicate otherwise.
-Charlie

Ron Peterson
Joined: 6 Oct 05
Posts: 23
Credit: 4,268,694
RAC: 0
Message 1201 - Posted: 10 Oct 2005, 14:47:41 UTC - in response to Message 1151.  

You can avoid this error (except when a benchmark run interrupts the task) by setting your preferences to "leave app in memory" until science app 4.77 is replaced with a newer one. Note, though, that this will increase your memory usage.

Didn't work. Still crashed on a pause. How did such buggy code get released?

STE\/E
Joined: 17 Sep 05
Posts: 125
Credit: 2,808,014
RAC: 0
Message 1202 - Posted: 10 Oct 2005, 14:59:47 UTC - in response to Message 1201.  
Last modified: 10 Oct 2005, 15:03:48 UTC

You can avoid this error (except when a benchmark run interrupts the task) by setting your preferences to "leave app in memory" until science app 4.77 is replaced with a newer one. Note, though, that this will increase your memory usage.

Didn't work. Still crashed on a pause. How did such buggy code get released?


I found that didn't work either. About the only way you can avoid it is to Suspend all other Projects and run Rosetta exclusively, then Suspend Rosetta and run the other Projects when ready.

Also, as I suggested in another Thread, you can run the Benchmarks manually with Rosetta Suspended and keep track of when you do, then just make sure you do it again before 5 days are up ...

I find this Project to be the most time-consuming of all the Projects I run. I have to constantly be on the lookout for WUs that are hung or stuck at a certain %, or I could end up with 50% or more of my Computers just spinning their wheels and accomplishing nothing ...

PS: I have another hung WU right now. That's the 4th one this morning on 4 different Computers. Things are running well though, according to the Devs ... ;)

Jord
Joined: 16 Sep 05
Posts: 41
Credit: 177,527
RAC: 128
Message 1203 - Posted: 10 Oct 2005, 15:22:00 UTC - in response to Message 1202.  

You can avoid this error (except when a benchmark run interrupts the task) by setting your preferences to "leave app in memory" until science app 4.77 is replaced with a newer one. Note, though, that this will increase your memory usage.

Didn't work. Still crashed on a pause. How did such buggy code get released?


I found that didn't work either

May I ask how you set it? Did you just go to your preferences webpage here and set the option to leave the application in memory? Or did you also Update RAH through Boinc Manager afterwards?

The memory usage will not go up much. When switching between work units, if you have the option to leave them in memory turned on, BOINC will write the units to your page file (swap file, virtual memory). Only a very small part is kept active in RAM, even less than the Windows Task Manager shows!

STE\/E
Joined: 17 Sep 05
Posts: 125
Credit: 2,808,014
RAC: 0
Message 1204 - Posted: 10 Oct 2005, 15:47:10 UTC

May I ask how you set it? Did you just go to your preferences webpage here and set the option to leave the application in memory? Or did you also Update RAH through Boinc Manager afterwards?
==========

As for myself, I already had my Preferences set to Leave In Memory when I joined the Rosetta Project, so it should have Propagated across to it when I Attached to the Project.

I also checked later on to make sure it was showing Leave In Memory here at this Project's Preferences ...

Paul D. Buck
Joined: 17 Sep 05
Posts: 815
Credit: 1,812,737
RAC: 0
Message 1212 - Posted: 10 Oct 2005, 17:20:22 UTC - in response to Message 1201.  

Didn't work. Still crashed on a pause. How did such buggy code get released?

Perhaps because the test beds were set up to never remove from memory. Perhaps because with 200 machines to monitor, the fact that one or two machines died over a work unit was missed. Perhaps because, like many things, the project is on a low budget, has few people on staff, and there is lots to do.

I try to work on the Wiki every day, for as long as I can. As many hours as I put in, there are still hundreds, if not thousands, of errors. As a former developer I can tell you that no matter how rigorous the testing regime, the software will always fail in the field.

Lastly, though perfection, that is, fail-free operation, is desired, the BOINC system is designed to be robust in the presence of error. Note that this does not mean your personal experience will be without problems. But the scientific results are protected, as in the cases we see here: your result failed, someone else's will succeed, and the project moves on.

I am not saying that they don't want to fix this, and "knowing" the project people as I do (yes, I do have a little more "access" than many people, but it is not THAT much more), all project members on all projects take all problems seriously. But, there are only so many hours in a day...

Last point: I know that I have problems similar to what the projects have: too much to do, and not enough Paul. So, the thought is hostile. They really do care. But none of us is well served by comments like these. Yes, worse examples abound, but it starts small, like this, and it is not fair to those who work so hard. To sum it up, we are all working on it. Please be kind ...

Scott Brown
Joined: 19 Sep 05
Posts: 19
Credit: 8,739
RAC: 0
Message 1215 - Posted: 10 Oct 2005, 17:36:32 UTC - in response to Message 1204.  

As for myself, I already had my Preferences set to Leave In Memory when I joined the Rosetta Project, so it should have Propagated across to it when I Attached to the Project.

I also checked later on to make sure it was showing Leave In Memory here at this Project's Preferences ...


Are you attached to other projects? If so, you need to make sure that the prefs are set to leave in memory at all of them. Otherwise, your machine will alternate between settings as it contacts the separate projects (I had this happen to me when I first joined SZTAKI and forgot to switch the default pref).
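
The check described above can be written down as a small Python sketch. The dictionary below is made-up example data, not read from any BOINC file or API, and the function name is hypothetical; it simply lists which attached projects still have the leave-in-memory preference switched off.

```python
def projects_not_leaving_in_memory(prefs: dict[str, bool]) -> list[str]:
    """Return the projects whose 'leave applications in memory' preference is off."""
    return sorted(name for name, leave_in_memory in prefs.items() if not leave_in_memory)


# Hypothetical settings, entered by hand after checking each project's preferences page.
example_prefs = {
    "rosetta@home": True,
    "SETI@home": True,
    "SZTAKI": False,
}
print(projects_not_leaving_in_memory(example_prefs))
```

Any project the function returns is one whose preferences page still needs the setting changed.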

Ron Peterson
Joined: 6 Oct 05
Posts: 23
Credit: 4,268,694
RAC: 0
Message 1220 - Posted: 10 Oct 2005, 18:26:24 UTC - in response to Message 1203.  

May I ask how you set it? Did you just go to your preferences webpage here and set the option to leave the application in memory? Or did you also Update RAH through Boinc Manager afterwards?

The memory usage will not go up much. When switching between work units, if you have the option to leave them in memory turned on, BOINC will write the units to your page file (swap file, virtual memory). Only a very small part is kept active in RAM, even less than the Windows Task Manager shows!

I went to the preferences web page, clicked yes next to leave application in memory, then updated RAH. This is correct, yes?

Ron Peterson
Joined: 6 Oct 05
Posts: 23
Credit: 4,268,694
RAC: 0
Message 1221 - Posted: 10 Oct 2005, 18:34:37 UTC - in response to Message 1212.  

Sorry, I didn't mean to slam anyone. It's just that in the past 5 days or so, I've yet to have a single RAH WU complete. I'm 0 for 18 on two different computers. As a QA person, and someone who has beta tested software, this seems extreme to me. I'm glad that it is being worked on.

Fuzzy Hollynoodles
Joined: 7 Oct 05
Posts: 234
Credit: 15,020
RAC: 0
Message 1225 - Posted: 10 Oct 2005, 20:25:58 UTC - in response to Message 1220.  

May I ask how you set it? Did you just go to your preferences webpage here and set the option to leave the application in memory? Or did you also Update RAH through Boinc Manager afterwards?

The memory usage will not go up much. When switching between work units, if you have the option to leave them in memory turned on, BOINC will write the units to your page file (swap file, virtual memory). Only a very small part is kept active in RAM, even less than the Windows Task Manager shows!

I went to the preferences web page, clicked yes next to leave application in memory, then updated RAH. This is correct, yes?


Remember to click the Update button in your BOINC manager, so the changes can take place.



"I'm trying to maintain a shred of dignity in this world." - Me

STE\/E
Joined: 17 Sep 05
Posts: 125
Credit: 2,808,014
RAC: 0
Message 1227 - Posted: 10 Oct 2005, 20:59:33 UTC

Remember to click the Update button in your BOINC manager, so the changes can take place.
==========

Although that is a good idea, Fuzzy, and something I do myself anytime I make a Preference change, the changes will (or should, anyway) take place automatically the next time the Client contacts the Server.

Paul D. Buck
Joined: 17 Sep 05
Posts: 815
Credit: 1,812,737
RAC: 0
Message 1232 - Posted: 10 Oct 2005, 23:17:46 UTC

Ron,

Time to take a break. Rosetta@Home is not right for you AT THIS TIME ... that could change tomorrow. Heck, I just stopped Predictor@Home and SETI@Home for a while. LHC@Home stopped itself. But I added SZTAKI and Rosetta@Home, so all is well with the world.

Somewhere I did the lecture about our vs. project goals; heck, it might even have been here, but I am too tired to go look ...

Anyway, this will be fixed. Not so sure Predictor@Home will stop popping up dialog boxes on deaths, but I can hope. Till then, well, they can live without me for a bit. I may add one of my slower machines back in. But I find that even worse.

With Rosetta@Home you can mitigate the problems; the only way I can do it with Predictor is to stay up all the time and watch the screens of my computers to see if they have a dialog box up. I had one machine that cost me well over 24 hours because I had not noticed the problem.



©2021 University of Washington
https://www.bakerlab.org