Message boards : Number crunching : Rosetta crashes on pausing
Author | Message |
---|---|
Ron Peterson Joined: 6 Oct 05 Posts: 23 Credit: 4,268,694 RAC: 0 |
I'm new to Rosetta and this is bugging me. I get a WU about 75% done and: 10/8/2005 3:24:42 AM|rosetta@home|Pausing result 1acf__abrelax_00304_0 (removed from memory) 10/8/2005 3:24:43 AM|rosetta@home|Unrecoverable error for result 1acf__abrelax_00304_0 ( - exit code -1073741819 (0xc0000005)) What gives and how do I fix it??? Ron |
Solblekt Joined: 27 Sep 05 Posts: 8 Credit: 3,302 RAC: 0 |
We all have the same problem. https://boinc.bakerlab.org/rosetta/forum_thread.php?id=132 https://boinc.bakerlab.org/rosetta/forum_thread.php?id=85 https://boinc.bakerlab.org/rosetta/forum_thread.php?id=126 There are more links on this subject. |
Solblekt Joined: 27 Sep 05 Posts: 8 Credit: 3,302 RAC: 0 |
Well now why can't I click on the links I just posted? |
cah_user_1217 Joined: 17 Sep 05 Posts: 3 Credit: 2,187 RAC: 0 |
"Well now why can't I click on the links I just posted?" Perhaps you didn't use the 'url=' BBCode? Take a look here for the correct usage. |
Ron Peterson Joined: 6 Oct 05 Posts: 23 Credit: 4,268,694 RAC: 0 |
Ah well, looks like trouble for everyone who runs more than just Rosetta. Not me. Good. |
FZB Joined: 17 Sep 05 Posts: 84 Credit: 4,948,999 RAC: 0 |
You can avoid this error (except when it is interrupted by the benchmark) by setting your preferences to "leave app in memory" until science app 4.77 is replaced with a newer one. Note, though, that this will increase your memory usage. -- Florian www.domplatz1.de |
Charles Dennett Joined: 27 Sep 05 Posts: 102 Credit: 2,081,660 RAC: 421 |
Well, here is another datapoint. Running BOINC CC 4.72, compiled myself on a Fedora Core 3 box with an AMD XP 2600+ CPU. It appears it tried to run benchmarks and failed to stop the Rosetta app that was running. Computation did not resume. When I found it, the CPU usage was down to 0%, and the Rosetta processes and BOINC CC process were still there but doing nothing. (Actually, there were 3 Rosetta processes, which seems to be normal.) I had to run my stop script (boincctl, found on the Add-on page; I'm its author) to abort all the processes. I do have my preferences set to leave applications in memory. Here is a copy of stdout: 2005-10-09 08:59:19 [---] Suspending computation and network activity - running CPU benchmarks 2005-10-09 08:59:19 [rosetta@home] Pausing result 1acf__abrelax_no_cst_06642_0 (removed from memory) 2005-10-09 08:59:21 [---] Running CPU benchmarks 2005-10-09 08:59:29 [---] Failed to stop applications; aborting CPU benchmarks 2005-10-09 08:59:29 [---] Resuming computation and network activity 2005-10-09 08:59:29 [---] request_reschedule_cpus: Resuming activities 2005-10-09 08:59:29 [---] ACTIVE_TASK_SET::check_app_exited(): pid 21432 not found Then I stopped things: 2005-10-09 09:12:45 [---] Received signal 15 2005-10-09 09:12:45 [---] Exit requested by user 2005-10-09 09:12:51 [---] request_reschedule_cpus: exit_tasks Then I restarted everything: 2005-10-09 09:13:04 [---] Starting BOINC client version 4.72 for i686-pc-linux-gnu 2005-10-09 09:13:04 [---] Data directory: /home/charlie/Boinc 2005-10-09 09:13:04 [---] Processor Inventory: 1 AuthenticAMD AMD Athlon(TM) XP 2600+ Processor(s) 2005-10-09 09:13:04 [---] Memory Inventory: Memory total - 503.37 MB, Swap total - 1019.74 MB 2005-10-09 09:13:04 [---] Disk Inventory: Disk total - 55.39 GB, Disk available - 47.40 GB 2005-10-09 09:13:04 [Predictor @ Home] Computer ID: 116444; location: home; project prefs: default 2005-10-09 09:13:04 [rosetta@home] Computer ID: 4375; location: home; project prefs: default 2005-10-09 09:13:04 [SETI@home] Computer ID: 850659; location: home; project prefs: default 2005-10-09 09:13:04 [---] General prefs: from rosetta@home (last modified 2005-10-07 19:31:33) 2005-10-09 09:13:04 [---] General prefs: no separate prefs for home; using your defaults 2005-10-09 09:13:04 [---] Remote control allowed 2005-10-09 09:13:04 [rosetta@home] Resuming computation for result 1acf__abrelax_no_cst_06642_0 using rosetta version 4.77 -Charlie |
Ron Peterson Joined: 6 Oct 05 Posts: 23 Credit: 4,268,694 RAC: 0 |
"You can avoid this error (except when it is interrupted by the benchmark) by setting your preferences to 'leave app in memory' until science app 4.77 is replaced with a newer one. Note, though, that this will increase your memory usage." I'll try this... |
Charles Dennett Joined: 27 Sep 05 Posts: 102 Credit: 2,081,660 RAC: 421 |
"Well, here is another datapoint. Running BOINC CC 4.72, compiled myself on a Fedora Core 3 box with an AMD XP 2600+ CPU. It appears it tried to run benchmarks and failed to stop the Rosetta app that was running. Computation did not resume. When I found it, the CPU usage was down to 0%, and the Rosetta processes and BOINC CC process were still there but doing nothing. (Actually, there were 3 Rosetta processes, which seems to be normal.) I had to run my stop script (boincctl, found on the Add-on page; I'm its author) to abort all the processes. I do have my preferences set to leave applications in memory." Hmm. The result actually finished and can be found here: https://boinc.bakerlab.org/rosetta/result.php?resultid=222089 It's marked valid, but the stderr.txt output included at this URL would seem to indicate otherwise. -Charlie |
Ron Peterson Joined: 6 Oct 05 Posts: 23 Credit: 4,268,694 RAC: 0 |
"You can avoid this error (except when it is interrupted by the benchmark) by setting your preferences to 'leave app in memory' until science app 4.77 is replaced with a newer one. Note, though, that this will increase your memory usage." Didn't work. Still crashed on a pause. How did such buggy code get released? |
STE\/E Joined: 17 Sep 05 Posts: 125 Credit: 4,100,301 RAC: 114 |
"You can avoid this error (except when it is interrupted by the benchmark) by setting your preferences to 'leave app in memory' until science app 4.77 is replaced with a newer one. Note, though, that this will increase your memory usage." I found that didn't work either. About the only way you can avoid it is to just Suspend all other Projects and run Rosetta exclusively, then Suspend Rosetta and run the other Projects when ready. Also, as I suggested in another Thread, run the Benchmarks manually with Rosetta Suspended, keep track of when you do, and just make sure you do it again before 5 days are up... I find this Project to be the most Time-consuming of all the Projects I run. I have to constantly be on the lookout for WUs that are hung or stuck at a certain %, or I could end up with 50% or more of my Computers just spinning their wheels and accomplishing nothing... PS: I have another hung WU right now; that's the 4th one this morning on 4 different Computers. Things are running well, though, according to the Devs... ;) |
Jord Joined: 16 Sep 05 Posts: 41 Credit: 204,120 RAC: 0 |
"You can avoid this error (except when it is interrupted by the benchmark) by setting your preferences to 'leave app in memory' until science app 4.77 is replaced with a newer one. Note, though, that this will increase your memory usage." May I ask how you set it? Did you just go to your preferences webpage here and set the option to leave the application in memory? Or did you also update RAH through BOINC Manager afterwards? The memory usage will not go up much. When switching between work units with the option to leave them in memory turned on, BOINC will write the units to your page file (swap file, virtual memory). Only a very small part is kept active in RAM, even less than the Windows Task Manager shows! |
STE\/E Joined: 17 Sep 05 Posts: 125 Credit: 4,100,301 RAC: 114 |
"May I ask how you set it? Did you just go to your preferences webpage here and set the option to leave the application in memory? Or did you also update RAH through BOINC Manager afterwards?" As for myself, I already had my Preferences set to Leave In Memory when I joined the Rosetta Project, so it should have Propagated across to it when I Attached to the Project. I also checked later on to make sure it was showing Leave In Memory here in this Project's Preferences... |
Paul D. Buck Joined: 17 Sep 05 Posts: 815 Credit: 1,812,737 RAC: 0 |
"Didn't work. Still crashed on a pause. How did such buggy code get released?" Perhaps because the test beds were set up to never remove from memory. Perhaps because, with 200 machines to monitor, the fact that one or two machines died over a work unit was missed. Perhaps because, like many things, the project is on a low budget, has few people on staff, and has lots to do. I try to work on the Wiki every day, for as long as I can. As many hours as I put in, there are still hundreds, if not thousands, of errors. As a former developer I can tell you that no matter how rigorous the testing regime, the software will always fail in the field. Lastly, though perfection (fail-free operation) is desired, the BOINC system is designed to be robust in the presence of error. Note that this does not mean your personal experience will be without problems. But the scientific results are protected, as in the cases we see here: your result failed, someone else's will succeed, and the project moves on. I am not saying that they don't want to fix this, and "knowing" the project people as I do (yes, I do have a little more "access" than many people, but it is not THAT much more), all project members on all projects take all problems seriously. But there are only so many hours in a day... Last point: I know that I have similar problems to what the projects do, too much to do and not enough Paul. So, the thought is hostile. They really do care. But none of us is well served by comments like these. Yes, worse examples abound, but it starts small, like this, and it is not fair to those who work so hard. To sum it up, we are all working on it. Please be kind... |
Scott Brown Joined: 19 Sep 05 Posts: 19 Credit: 8,739 RAC: 0 |
"As for myself, I already had my Preferences set to Leave In Memory when I joined the Rosetta Project, so it should have Propagated across to it when I Attached to the Project." Are you attached to other projects? If so, you need to make sure that the prefs are set to leave in memory at all of them. Otherwise, your machine will alternate between settings as it contacts the separate projects (this happened to me when I first joined SZTAKI and forgot to switch the default pref). |
Ron Peterson Joined: 6 Oct 05 Posts: 23 Credit: 4,268,694 RAC: 0 |
"May I ask how you set it? Did you just go to your preferences webpage here and set the option to leave the application in memory? Or did you also update RAH through BOINC Manager afterwards?" I went to the preferences web page, clicked yes next to leave application in memory, then updated RAH. This is correct, yes? |
Ron Peterson Joined: 6 Oct 05 Posts: 23 Credit: 4,268,694 RAC: 0 |
"Didn't work. Still crashed on a pause. How did such buggy code get released?" Sorry, I didn't mean to slam anyone. It's just that in the past 5 days or so, I've yet to have a single RAH WU complete. I'm 0 for 18 on two different computers. As a QA person, and someone who has beta tested software, this seems extreme to me. I'm glad that it is being worked on. |
Fuzzy Hollynoodles Joined: 7 Oct 05 Posts: 234 Credit: 15,020 RAC: 0 |
"May I ask how you set it? Did you just go to your preferences webpage here and set the option to leave the application in memory? Or did you also update RAH through BOINC Manager afterwards?" Remember to click the Update button in your BOINC Manager so the changes can take effect. "I'm trying to maintain a shred of dignity in this world." - Me |
STE\/E Joined: 17 Sep 05 Posts: 125 Credit: 4,100,301 RAC: 114 |
"Remember to click the Update button in your BOINC Manager so the changes can take effect." Although that is a good idea, Fuzzy, and something I do myself anytime I make a Preference change, the changes will (or should, anyway) take place Automatically the next time the Client contacts the Server. |
Paul D. Buck Joined: 17 Sep 05 Posts: 815 Credit: 1,812,737 RAC: 0 |
Ron, time to take a break. Rosetta@Home is not right for you AT THIS TIME... that could change tomorrow. Heck, I just stopped Predictor@Home and SETI@Home for a while. LHC@Home stopped itself. But I added SZTAKI and Rosetta@Home, so all is well with the world. Somewhere I gave the lecture about our goals vs. project goals; heck, it might even have been here, but I am too tired to go look... Anyway, this will be fixed. I'm not so sure Predictor@Home will stop popping up dialog boxes on deaths, but I can hope. Till then, well, they can live without me for a bit. I may add one of my slower machines back in. But I find that even worse. With Rosetta@Home you can mitigate the problems; the only way I can do it with Predictor is to stay up all the time and watch the screens of my computers to see if they have a dialog box up. I had one machine that cost me well over 24 hours because I had not noticed the problem. |
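
The exit code in the opening post is one 32-bit value printed two ways: BOINC shows it as a signed integer, -1073741819, and then in hexadecimal, 0xc0000005. On Windows, 0xC0000005 is the status code STATUS_ACCESS_VIOLATION, meaning the application tried to read or write memory it was not allowed to touch. A minimal Python sketch of the conversion follows; the helper name `exit_code_to_hex` is hypothetical and not part of BOINC.

```python
def exit_code_to_hex(code: int) -> str:
    """Reinterpret a signed 32-bit exit code as unsigned and format it as hex.

    Hypothetical helper for illustration; BOINC does not ship this function.
    """
    # Masking with 0xFFFFFFFF keeps the low 32 bits, which turns a negative
    # signed value into its unsigned two's-complement equivalent.
    return f"0x{code & 0xFFFFFFFF:08x}"


print(exit_code_to_hex(-1073741819))  # prints 0xc0000005
```

The mask works because Python integers are unbounded; in C the same reinterpretation is simply a cast of the exit code to a 32-bit unsigned type.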
©2024 University of Washington
https://www.bakerlab.org