Message boards : Number crunching : Problems with Rosetta version 5.68 and 5.70
Author | Message |
---|---|
Rhiju Volunteer moderator Send message Joined: 8 Jan 06 Posts: 223 Credit: 3,546 RAC: 0 |
Hi all -- please keep posting your problems here. One note -- there are now two Rosetta applications running (5.68 and 5.70). It's probably going to be a big pain for you to figure out which one was used for which workunit... it's probably best to post issues for both here! If you can post a link to your workunit we should be able to figure out which application had the problem. |
dcdc Send message Joined: 3 Nov 05 Posts: 1832 Credit: 119,860,059 RAC: 4,566 |
Hi Rhiju, I've just noticed the new 5.70 tasks show as 'rosetta beta 5.70' in BOINC Manager & BoincView - I'm sure that will confuse some people, so it's probably worth changing that label if you can! |
Rhiju Volunteer moderator Send message Joined: 8 Jan 06 Posts: 223 Credit: 3,546 RAC: 0 |
Unfortunately, we can't label the new app plain "rosetta" because we need to keep the name of the stable app "rosetta". But I agree, maybe "rosetta_new" would be a better name than "rosetta_beta"... I'll talk to David K. about this. |
` Send message Joined: 21 Oct 06 Posts: 254 Credit: 56,691 RAC: 0 |
-edit- Found the answer, never mind. :) |
Marky-UK Send message Joined: 1 Nov 05 Posts: 73 Credit: 1,689,495 RAC: 0 |
"It's probably going to be a big pain for you to figure out which one was used for which workunit... it's probably best to post issues for both here!" The application version is also shown at the bottom of the Result page. |
Dead2 Send message Joined: 8 Jun 07 Posts: 4 Credit: 16,463,862 RAC: 0 |
https://boinc.bakerlab.org/rosetta/workunit.php?wuid=80447961 This workunit failed on the 5.70 client. |
Dean Send message Joined: 11 Feb 07 Posts: 4 Credit: 631,230 RAC: 0 |
This post was originally posted on the 5.68 thread. A moderator asked that it be moved over here. With 5.68 on a Debian Linux 2.6x machine, most Rosetta tasks will run to about 84% completion, and then hang. The "CPU time" does not increment for the task, and the task will remain hung for as long as it is the executable task. I am also running World Community Grid, and there is no problem with the WCG tasks. But when WCG releases BOINC to Rosetta, the Rosetta tasks go nowhere. I have seen this on multiple tasks, most recently with: CNTRL_01ABRELAX_SAVE_ALL_OUT_-1elwA-_filters_1782_11292_1 and CNTRL_01ABRELAX_SAVE_ALL_OUT_-1iibA-_filters_1782_128542_1. I have paused the tasks and then resumed, restarted BOINC, reset the Rosetta project, and left it to run for several days, all to no avail. A new Rosetta task will run to the 84% completion, and then hang. Once in a while, a task will actually complete, usually right after I reset Rosetta. I am also running Rosetta on Windows XP and 2000 machines with no problems. Since I despise Microsoft products, I am very motivated to get this fixed on Linux ;) "I'm an American, I believe in the American Way, I worry if the government encourages open source, and I don't think we've done enough education of policy makers to understand the threat." Jim Allchin, OS Chief, Microsoft |
dentaku Send message Joined: 24 Jun 07 Posts: 3 Credit: 13,468 RAC: 0 |
After about 80-90% the task finishes with a "computation error". WUs of other projects don't fail ... (Ubuntu 7.04 64 Bit.) Earthlings: http://video.google.com/videoplay?docid=3664359489218547625 |
dentaku Send message Joined: 24 Jun 07 Posts: 3 Credit: 13,468 RAC: 0 |
"After about 80-90% the task finishes with a "computation error". WUs of other projects don't fail ..." The results for these work units show this: Server state: Over; Outcome: Client error; Client state: Compute error; Exit status: 193 (0xc1); CPU time: 7938.456121; stderr out: <core_client_version>5.10.8</core_client_version> <![CDATA[ <message> process exited with code 193 (0xc1, -63) </message> <stderr_txt> Graphics are disabled due to configuration... # cpu_run_time_pref: 14400 # random seed: 2070819 Graphics are disabled due to configuration... # cpu_run_time_pref: 14400 # random seed: 2070819 SIGSEGV: segmentation violation Stack trace (13 frames): [0x8cdfdab] [0x8cdabdc] [0xffffe500] [0x8c4a1a7] [0x8b51232] [0x8c31c24] [0x849a832] [0x80dad6d] [0x85c5a97] [0x86eda1b] [0x86edac6] [0x8d43ca4] [0x8048111] Exiting... Graphics are disabled due to configuration... # cpu_run_time_pref: 14400 SIGSEGV: segmentation violation Stack trace (13 frames): [0x8cdfdab] [0x8cdabdc] [0xffffe500] [0x8c4a1a7] [0x8b51232] [0x8c31c24] [0x849a857] [0x80dad6d] [0x85c5a97] [0x86eda1b] [0x86edac6] [0x8d43ca4] [0x8048111] Exiting... SIGSEGV: segmentation violation SIGABRT: abort called SIGABRT: abort called SIGABRT: abort called ... several hundred times .... SIGABRT: abort called SIGABRT: abort called SIGABRT: abort called </stderr_txt> ]]> Validate state: Invalid; Claimed credit: 23.9044403037559; Granted credit: 0; application version: 5.68. Earthlings: http://video.google.com/videoplay?docid=3664359489218547625 |
TCU Computer Science Send message Joined: 7 Dec 05 Posts: 28 Credit: 12,861,977 RAC: 0 |
"With 5.68 on a Debian Linux 2.6x machine, most Rosetta tasks will run to about 84% completion, and then hang. The "CPU time" does not increment for the task, and the task will remain hung for as long as it is the executable task." I had a problem similar to this a year ago. On CentOS (and Mac OS X) the Rosetta task would hang. boincmgr showed Rosetta running but the accumulated CPU time did not increase. Usually, the Rosetta task would remain in the process list after I stopped boinc. I had to manually kill the Rosetta task. Then when I restarted boinc, the Rosetta task would resume accumulating CPU time. I switched most of my Linux boxes and all of my Macs to Einstein because I didn't have time to check those machines for hung tasks. Recently, I tried switching back to Rosetta. On machines with CentOS 4.1 (kernel 2.6.9-11) Rosetta still hung, but machines with CentOS 4.5 (kernel 2.6.9-55) have not experienced that problem. So, all of my Linux boxes have been updated and most switched back to Rosetta. I still have the problem on Mac OS X. |
Greg_BE Send message Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
This workunit is stuck at 55.273% complete and says it's waiting for memory. I suspended all other work units and tried to get it to run, but it insists it needs more memory. Not sure how much more it needs. I don't have that many processes running. In the meantime rosie has moved on to an abrelax WU instead. Should I just abort this WU with the memory problem, or what? |
Greg_BE Send message Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
"This workunit is stuck at 55.273% complete and says it's waiting for memory." UPDATE - This WU completed but got stuck at the same percent completion as mentioned above. It showed as a success but I only got half credit for it. The result data is here |
Greg_BE Send message Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
One other odd thing: I stopped some work units from running to get rid of the beta stuff and some other minority WUs. When those cleared out I set everything back to run. I have 6 or so WUs that are due on the 3rd, and it started on one and then stopped running it and went to a WU that is due on the 4th and had run a few secs when I was suspending things. Why would RAH start one WU and then stop it and jump to the first WU of a different date? In the meantime I have suspended everything from the 4th onwards to get RAH to focus on the stuff due on the 3rd. |
Susie HomeMaker Send message Joined: 12 Nov 06 Posts: 22 Credit: 2,511,881 RAC: 0 |
ok... here's a post with no probs !! :-) Except no graphics. The Cruncher: Mem now fully popped (2gb); Graphics Ati x800 (512mb); Os = debian 64 / dual boot with win XP that ONLY gets used for NLE. More graphics PRETTY please. Oh.. and a port for AmigaOs4 (PPC) :-) |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
"Why would RAH start one WU and then stop it and jump to the first WU of a different date?" This is what BOINC does when it feels there is not currently enough memory that BOINC is allowed to use. It starts another task and runs it as long as it can before possibly hitting the same need for memory. Did you happen to look in the task manager at the memory that task (and any other active BOINC tasks) was using? You can use the View pulldown to select the Mem Usage column for display. Looks like your machine has 512MB of memory and only a single CPU, so BOINC should only have one active task at a time. A single Rosetta task, even a "large" one, should only need about half of that. How are your general preferences set for memory that BOINC is allowed to use? What may have happened is that you allow BOINC to use a greater % of memory when the machine is idle. So, the task needed more and more memory as it progressed in that specific model, it reached the upper limit for memory while your computer is in use (or perhaps you stopped in to check on it and so your computer went from idle to in-use), and then you saw the task waiting for memory. Then later, perhaps you left the computer and it went idle again and was allowed enough memory to complete the first task. The above is assuming that you allow more memory while idle than when in use. That is what most people do if they limit memory usage. Rosetta Moderator: Mod.Sense |
Greg_BE Send message Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
It's set for 100% when not in use and 75% when in use. BOINC is currently using 5800K with a peak of 9944K; BOINCMGR is using 3928K with a peak of 9188K. From the BOINC Manager messages: "6/29/2007 8:01:24 PM||Preferences limit memory usage when active to 255.74MB" and "6/29/2007 8:01:24 PM||Preferences limit memory usage when idle to 460.34MB". "Why would RAH start one WU and then stop it and jump to the first WU of a different date?" |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
"BOINC is using currently 5800K with a peak of 9944K" BOINC is looking at the memory used by the "Rosetta_xxxxx" process. So the figures above are only a minor portion of the picture. What I'd be curious to know is how large the Rosetta task got when it ran. Too late to check this time, but that was what I was trying to ask about. It's going to be something north of 110,000K. Just a question of how far north. Rosetta Moderator: Mod.Sense |
Greg_BE Send message Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
"BOINC is using currently 5800K with a peak of 9944K" I see what you mean - but what gets me is that there are at least 6 more WUs that were next in line to run, all with the same due date, but BOINC chose to go to the next day and start work on a unit that had already started but was suspended when I was trying to run selected work units to get everything the same for a straight run of nothing but bench-0512 units. |
Mod.Sense Volunteer moderator Send message Joined: 22 Aug 06 Posts: 4018 Credit: 0 RAC: 0 |
Rhiju? Is there any variation in these tasks as to some requesting large memory and some not? I see now why you are saying that was an odd thing for BOINC to do. I had originally thought you were just confused about starting multiple tasks. You've got much more to your picture. If the project sends out a large memory task, I believe the BOINC client knows that, and so it may have skipped a few of those in favor of a lower memory task on the later due date. Often the task names will be similar enough that they look the same. Rosetta Moderator: Mod.Sense |
Greg_BE Send message Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
Have a look at the message on this WU; it says something to the effect of "Can't set up shared mem: -1 Will run in standalone mode." I think this is the one I had to force to finish. The other one you see in my post from earlier today, which stalled and then reported as complete but was only 50% done. Everything else has run ok today. I have a 6hr run cycle per WU. |
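The two memory limits quoted from the BOINC Manager messages above (255.74MB when active, 460.34MB when idle) can be sanity-checked with a little arithmetic. The sketch below assumes each limit is simply a percentage of total RAM; that assumption, the helper names, and the candidate percentage pairs are illustrative and not taken from BOINC itself.

```python
# Limits quoted in the thread from the BOINC Manager messages (in MB).
ACTIVE_LIMIT_MB = 255.74
IDLE_LIMIT_MB = 460.34


def implied_total_ram(limit_mb, percent):
    """Total RAM implied by a limit, ASSUMING limit = percent% of total RAM."""
    return limit_mb / (percent / 100.0)


def consistent(active_pct, idle_pct, tolerance_mb=0.1):
    """True if both limits imply the same total RAM for the given percentages."""
    from_active = implied_total_ram(ACTIVE_LIMIT_MB, active_pct)
    from_idle = implied_total_ram(IDLE_LIMIT_MB, idle_pct)
    return abs(from_active - from_idle) <= tolerance_mb


if __name__ == "__main__":
    # (in-use %, idle %): the settings described in the post, then a
    # hypothetical alternative pair chosen for comparison.
    for active_pct, idle_pct in [(75, 100), (50, 90)]:
        print(
            active_pct,
            idle_pct,
            round(implied_total_ram(ACTIVE_LIMIT_MB, active_pct), 2),
            round(implied_total_ram(IDLE_LIMIT_MB, idle_pct), 2),
            consistent(active_pct, idle_pct),
        )
```

Under that assumption, the 75%/100% settings described in the post do not reproduce both logged limits, whereas 50%/90% on a machine with roughly 511.5MB (close to the 512MB mentioned earlier) does. That hints, but does not prove, that the client was applying different memory preferences than the ones described.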
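For anyone comparing failed results, the stderr output posted above can be summarized mechanically. This is a minimal sketch: the function names are hypothetical, the sample text is an abbreviated copy of the posted output (the several hundred repeated SIGABRT lines are cut to three), and the regular expressions assume the layout seen in that one post.

```python
import re
from collections import Counter

# Abbreviated excerpt of the stderr output posted in the thread.
SAMPLE = """
<message> process exited with code 193 (0xc1, -63) </message>
SIGSEGV: segmentation violation
Stack trace (13 frames): [0x8cdfdab] [0x8cdabdc] [0xffffe500] [0x8c4a1a7]
[0x8b51232] [0x8c31c24] [0x849a832] [0x80dad6d] [0x85c5a97] [0x86eda1b]
[0x86edac6] [0x8d43ca4] [0x8048111] Exiting...
SIGSEGV: segmentation violation
Stack trace (13 frames): [0x8cdfdab] [0x8cdabdc] [0xffffe500] [0x8c4a1a7]
[0x8b51232] [0x8c31c24] [0x849a857] [0x80dad6d] [0x85c5a97] [0x86eda1b]
[0x86edac6] [0x8d43ca4] [0x8048111] Exiting...
SIGSEGV: segmentation violation
SIGABRT: abort called
SIGABRT: abort called
SIGABRT: abort called
"""


def summarize_stderr(text):
    """Extract the exit code, signal counts, and stack traces from stderr text."""
    exit_match = re.search(r"process exited with code (\d+)", text)
    signals = Counter(re.findall(r"\b(SIG[A-Z]+):", text))
    traces = []
    trace_pattern = r"Stack trace \(\d+ frames\):((?:\s*\[0x[0-9a-fA-F]+\])+)"
    for match in re.finditer(trace_pattern, text):
        traces.append(re.findall(r"\[(0x[0-9a-fA-F]+)\]", match.group(1)))
    return {
        "exit_code": int(exit_match.group(1)) if exit_match else None,
        "signals": dict(signals),
        "traces": traces,
    }


def differing_frames(trace_a, trace_b):
    """List (index, frame_a, frame_b) for positions where two traces differ."""
    return [(i, a, b) for i, (a, b) in enumerate(zip(trace_a, trace_b)) if a != b]


if __name__ == "__main__":
    summary = summarize_stderr(SAMPLE)
    print(summary["exit_code"], summary["signals"])
    print(differing_frames(*summary["traces"][:2]))
```

On the posted output, the two stack traces match in every frame except one, so both crashes appear to come from nearly the same place in the program. Posting a summary like this next to the workunit link might make it easier to tell whether different hosts are hitting the same failure.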
©2024 University of Washington
https://www.bakerlab.org