Problems with Rosetta version 5.68 and 5.70

Message boards : Number crunching : Problems with Rosetta version 5.68 and 5.70

Rhiju
Volunteer moderator

Joined: 8 Jan 06
Posts: 223
Credit: 3,546
RAC: 0
Message 42676 - Posted: 26 Jun 2007, 19:38:17 UTC

Hi all -- please keep posting your problems here. One note -- there are now two Rosetta applications running (5.68 and 5.70). It's probably going to be a big pain for you to figure out which one was used for which workunit, so it's probably best to post issues for both here!
If you can post a link to your workunit we should be able to figure out which application had the problem.
ID: 42676 · Rating: 0
dcdc

Joined: 3 Nov 05
Posts: 1832
Credit: 119,860,059
RAC: 4,566
Message 42678 - Posted: 26 Jun 2007, 20:23:56 UTC

Hi Rhiju

I've just noticed the new 5.70 tasks show as 'rosetta beta 5.70' in BOINC Manager & BoincView - I'm sure that will confuse some people, so it's probably worth changing that label if you can!
ID: 42678 · Rating: 0
Rhiju
Volunteer moderator

Joined: 8 Jan 06
Posts: 223
Credit: 3,546
RAC: 0
Message 42679 - Posted: 26 Jun 2007, 20:40:27 UTC - in response to Message 42678.  

Unfortunately, we can't label the new app plain "rosetta" because we need to keep the name of the stable app "rosetta". But I agree, maybe "rosetta_new" would be a better name than "rosetta_beta"... I'll talk to David K. about this.

Hi Rhiju

I've just noticed the new 5.70 tasks show as 'rosetta beta 5.70' in BOINC Manager & BoincView - I'm sure that will confuse some people, so it's probably worth changing that label if you can!


ID: 42679 · Rating: 0
`

Joined: 21 Oct 06
Posts: 254
Credit: 56,691
RAC: 0
Message 42686 - Posted: 27 Jun 2007, 2:13:14 UTC
Last modified: 27 Jun 2007, 2:14:47 UTC

-edit- Found the answer, never mind. :)
ID: 42686 · Rating: 0
Marky-UK

Joined: 1 Nov 05
Posts: 73
Credit: 1,689,495
RAC: 0
Message 42697 - Posted: 27 Jun 2007, 9:54:06 UTC - in response to Message 42676.  

It's probably going to be a big pain for you to figure out which one was used for which workunit, so it's probably best to post issues for both here!
If you can post a link to your workunit we should be able to figure out which application had the problem.

The application version is also shown at the bottom of the Result page.
ID: 42697 · Rating: 0
Dead2

Joined: 8 Jun 07
Posts: 4
Credit: 16,463,862
RAC: 0
Message 42712 - Posted: 27 Jun 2007, 17:01:00 UTC

https://boinc.bakerlab.org/rosetta/workunit.php?wuid=80447961

This workunit failed on the 5.70 client.
ID: 42712 · Rating: 0
Dean

Joined: 11 Feb 07
Posts: 4
Credit: 631,230
RAC: 0
Message 42728 - Posted: 27 Jun 2007, 19:36:42 UTC

This post was originally posted on the 5.68 thread. A moderator asked that it be moved over here.

With 5.68 on a Debian Linux 2.6x machine, most Rosetta tasks will run to about 84% completion, and then hang. The "CPU time" does not increment for the task, and the task will remain hung for as long as it is the executable task.

I am also running World Community Grid, and there is no problem with the WCG tasks. But when WCG releases BOINC to Rosetta, the Rosetta tasks go nowhere. I have seen this on multiple tasks, most recently with CNTRL_01ABRELAX_SAVE_ALL_OUT_-1elwA-_filters_1782_11292_1 and CNTRL_01ABRELAX_SAVE_ALL_OUT_-1iibA-_filters_1782_128542_1.

I have paused and then resumed the tasks, restarted BOINC, reset the Rosetta project, and left it to run for several days, all to no avail. A new Rosetta task will run to 84% completion and then hang. Once in a while a task will actually complete, usually right after I reset Rosetta.

I am also running Rosetta on Windows XP and 2000 machines with no problems. Since I despise Microsoft products, I am very motivated to get this fixed on Linux ;)
"I'm an American, I believe in the American Way, I worry if the government encourages open source, and I don't think we've done enough education of policy makers to understand the threat." Jim Allchin, OS Chief, Microsoft
ID: 42728 · Rating: 1
dentaku

Joined: 24 Jun 07
Posts: 3
Credit: 13,468
RAC: 0
Message 42760 - Posted: 28 Jun 2007, 11:25:29 UTC

After about 80-90% the task finishes with a "computation error". WUs of other projects don't fail ...

(Ubuntu 7.04, 64-bit.)
Earthlings: http://video.google.com/videoplay?docid=3664359489218547625
ID: 42760 · Rating: 0
dentaku

Joined: 24 Jun 07
Posts: 3
Credit: 13,468
RAC: 0
Message 42782 - Posted: 28 Jun 2007, 18:25:47 UTC - in response to Message 42760.  

After about 80-90% the task finishes with a "computation error". WUs of other projects don't fail ...

(Ubuntu 7.04, 64-bit.)


The results for these work units show this:


Server state	Over
Outcome	Client error
Client state	Compute error
Exit status	193 (0xc1)

CPU time	7938.456121
stderr out	<core_client_version>5.10.8</core_client_version>
<![CDATA[
<message>
process exited with code 193 (0xc1, -63)
</message>
<stderr_txt>
Graphics are disabled due to configuration...
# cpu_run_time_pref: 14400
# random seed: 2070819
Graphics are disabled due to configuration...
# cpu_run_time_pref: 14400
# random seed: 2070819
SIGSEGV: segmentation violation
Stack trace (13 frames):
[0x8cdfdab]
[0x8cdabdc]
[0xffffe500]
[0x8c4a1a7]
[0x8b51232]
[0x8c31c24]
[0x849a832]
[0x80dad6d]
[0x85c5a97]
[0x86eda1b]
[0x86edac6]
[0x8d43ca4]
[0x8048111]

Exiting...
Graphics are disabled due to configuration...
# cpu_run_time_pref: 14400
SIGSEGV: segmentation violation
Stack trace (13 frames):
[0x8cdfdab]
[0x8cdabdc]
[0xffffe500]
[0x8c4a1a7]
[0x8b51232]
[0x8c31c24]
[0x849a857]
[0x80dad6d]
[0x85c5a97]
[0x86eda1b]
[0x86edac6]
[0x8d43ca4]
[0x8048111]

Exiting...
SIGSEGV: segmentation violation
SIGABRT: abort called
SIGABRT: abort called
SIGABRT: abort called
... several hundred times ....
SIGABRT: abort called
SIGABRT: abort called
SIGABRT: abort called

</stderr_txt>
]]>
Validate state	Invalid
Claimed credit	23.9044403037559
Granted credit	0
application version	5.68
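
For readers puzzling over the numbers in this result: the line "process exited with code 193 (0xc1, -63)" shows one value three ways, and the two stack traces are not identical. The snippet below is only an illustrative sketch (the helper names are made up, and nothing here comes from BOINC or Rosetta). It reinterprets the exit status as a signed 8-bit value and lists which frames differ between the two traces copied from the output above.

```python
def as_signed_byte(status: int) -> int:
    """Reinterpret an unsigned 8-bit exit status as a signed 8-bit value."""
    status &= 0xFF
    return status - 256 if status >= 128 else status


def differing_frames(first, second):
    """Return (index, frame_in_first, frame_in_second) for frames that differ."""
    return [(i, a, b) for i, (a, b) in enumerate(zip(first, second)) if a != b]


# Frames copied from the two 13-frame traces in the stderr output above.
FIRST_TRACE = [
    "0x8cdfdab", "0x8cdabdc", "0xffffe500", "0x8c4a1a7", "0x8b51232",
    "0x8c31c24", "0x849a832", "0x80dad6d", "0x85c5a97", "0x86eda1b",
    "0x86edac6", "0x8d43ca4", "0x8048111",
]
SECOND_TRACE = list(FIRST_TRACE)
SECOND_TRACE[6] = "0x849a857"  # the only frame that differs in the second trace

print(hex(193), as_signed_byte(193))
print(differing_frames(FIRST_TRACE, SECOND_TRACE))
```

Only the seventh frame differs, and the two addresses are 0x25 bytes apart, so both crashes appear to pass through nearby code. Without symbols for the 5.68 binary that is as far as this output can be read.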

Earthlings: http://video.google.com/videoplay?docid=3664359489218547625
ID: 42782 · Rating: 0
TCU Computer Science

Joined: 7 Dec 05
Posts: 28
Credit: 12,861,977
RAC: 0
Message 42807 - Posted: 29 Jun 2007, 5:07:19 UTC - in response to Message 42728.  

With 5.68 on a Debian Linux 2.6x machine, most Rosetta tasks will run to about 84% completion, and then hang. The "CPU time" does not increment for the task, and the task will remain hung for as long as it is the executable task.


I had a problem similar to this a year ago. On CentOS (and Mac OS X) the Rosetta task would hang. boincmgr showed Rosetta running but the accumulated CPU time did not increase. Usually, the Rosetta task would remain in the process list after I stopped boinc. I had to manually kill the Rosetta task. Then when I restarted boinc, the Rosetta task would resume accumulating CPU time. I switched most of my Linux boxes and all of my Macs to Einstein because I didn't have time to check those machines for hung tasks.
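
One way to spot the kind of hang described here, where the task stays in the process list but its CPU time stops increasing, is to sample the process's accumulated CPU time twice and compare. The following is only a sketch for Linux, assuming the standard `/proc/<pid>/stat` layout. The function names and the 0.5-second threshold are made up for illustration and are not part of BOINC or Rosetta.

```python
import os
import time


def cpu_seconds_from_stat(stat_line: str, ticks_per_second: int) -> float:
    """Return user + system CPU seconds from one /proc/<pid>/stat line."""
    # The command name (field 2) is wrapped in parentheses and may contain
    # spaces, so split after the last ')' before counting fields.
    fields = stat_line.rsplit(")", 1)[1].split()
    utime_ticks = int(fields[11])  # field 14 of the full line
    stime_ticks = int(fields[12])  # field 15 of the full line
    return (utime_ticks + stime_ticks) / ticks_per_second


def is_hung(cpu_before: float, cpu_after: float, min_progress: float = 0.5) -> bool:
    """Treat a task as hung if it gained less than min_progress CPU seconds."""
    return (cpu_after - cpu_before) < min_progress


def sample_cpu_seconds(pid: int) -> float:
    """Read the current CPU time of a process (Linux only)."""
    with open(f"/proc/{pid}/stat") as handle:
        return cpu_seconds_from_stat(handle.read(), os.sysconf("SC_CLK_TCK"))


def check_task(pid: int, interval_seconds: float = 60.0) -> bool:
    """Sample a process twice and report whether it looks hung."""
    before = sample_cpu_seconds(pid)
    time.sleep(interval_seconds)
    return is_hung(before, sample_cpu_seconds(pid))
```

Calling `check_task(pid)` with the PID of the Rosetta process would return True if it accumulated almost no CPU time over the interval. A task that BOINC has deliberately suspended or preempted would also look hung by this measure, so the check only makes sense for the task BOINC shows as running.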

Recently, I tried switching back to Rosetta. On machines with CentOS 4.1 (kernel 2.6.9-11) Rosetta still hung but machines with CentOS 4.5 (kernel 2.6.9-55) have not experienced that problem. So, all of my Linux boxes have been updated and most switched back to Rosetta.

I still have the problem on Mac OS X.
ID: 42807 · Rating: 0
Greg_BE

Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 42809 - Posted: 29 Jun 2007, 8:55:59 UTC
Last modified: 29 Jun 2007, 8:56:15 UTC

This workunit is stuck at 55.273% complete and says it's waiting for memory.
I suspended all other work units and tried to get it to run, but it insists it needs more memory. I'm not sure how much more it needs; I don't have that many processes running. In the meantime rosie has moved on to an abrelax WU instead.

Should I just abort the WU with the memory problem, or what?
ID: 42809 · Rating: 0
Greg_BE

Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 42817 - Posted: 29 Jun 2007, 10:38:08 UTC - in response to Message 42809.  
Last modified: 29 Jun 2007, 10:38:42 UTC

This workunit is stuck at 55.273% complete and says it's waiting for memory.
I suspended all other work units and tried to get it to run, but it insists it needs more memory. I'm not sure how much more it needs; I don't have that many processes running. In the meantime rosie has moved on to an abrelax WU instead.

Should I just abort the WU with the memory problem, or what?


UPDATE - This WU completed but got stuck at the same percent completion as mentioned above. It showed as a success but I only got half credit for it.

The result data is here
ID: 42817 · Rating: 0
Greg_BE

Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 42823 - Posted: 29 Jun 2007, 12:00:31 UTC

One other odd thing: I stopped some work units from running to get rid of the beta stuff and some other minority WUs. When those cleared out, I set everything back to run. I have six or so WUs that are due on the 3rd. BOINC started on one of them, then stopped running it and went to a WU that is due on the 4th, which had run for a few seconds when I was suspending things. Why would RAH start one WU and then stop it and jump to the first WU of a different date? In the meantime I have suspended everything due on the 4th and later to get RAH to focus on the work due on the 3rd.
ID: 42823 · Rating: 0
Susie HomeMaker

Joined: 12 Nov 06
Posts: 22
Credit: 2,511,881
RAC: 0
Message 42826 - Posted: 29 Jun 2007, 12:40:06 UTC

ok... here's a post with no probs !!

:-)

Except no graphics

The Cruncher

Mem now fully popped (2gb)

Graphics Ati x800 (512mb)

Os = debian 64 / Dual boot with win XP that ONLY gets used for NLE


More graphics PRETTY please

Oh.. and a port for AmigaOs4 (PPC)

:-)
ID: 42826 · Rating: 0
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 42835 - Posted: 29 Jun 2007, 14:26:07 UTC - in response to Message 42823.  
Last modified: 29 Jun 2007, 14:34:52 UTC

Why would RAH start one WU and then stop it and jump to the first WU of a different date?


This is what BOINC does when a task needs more memory than BOINC is currently allowed to use. It starts another task and runs it as long as it can before possibly hitting the same need for memory.

Did you happen to look in the task manager at the memory that task (and any other active BOINC tasks) was using? You can use the view pulldown to select the Mem Usage column for display.

Looks like your machine has 512MB of memory and only a single CPU, so BOINC should only have one active task at a time. A single Rosetta task, even a "large" one, should only need about half of that. How are your general preferences set for memory that BOINC is allowed to use?

What may have happened is that you allow BOINC to use a greater % of memory when the machine is idle. The task needed more and more memory as it progressed in that specific model and reached the upper limit for memory while your computer was in use (or perhaps you stopped in to check on it, so your computer went from idle to in use), and then you saw the task waiting for memory. Later, perhaps you left the computer, it went idle again, and BOINC was allowed enough memory to complete the first task. The above assumes that you allow more memory while idle than when in use. That is what most people do if they limit memory usage.
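
The scenario above can be sketched as a toy model. This is not BOINC's actual scheduling code; the class, the function, and the 50%/90% figures are assumptions chosen only to show how the same task can be "waiting for memory" while the computer is in use and runnable once it goes idle.

```python
from dataclasses import dataclass


@dataclass
class MemoryPrefs:
    """Hypothetical model of the in-use and idle memory preferences."""
    total_mb: float
    percent_in_use: float
    percent_idle: float

    def limit_mb(self, user_active: bool) -> float:
        """Memory BOINC may use, depending on whether the user is active."""
        percent = self.percent_in_use if user_active else self.percent_idle
        return self.total_mb * percent / 100.0


def task_state(working_set_mb: float, prefs: MemoryPrefs, user_active: bool) -> str:
    """Return the state a task of the given size would be in under this model."""
    if working_set_mb <= prefs.limit_mb(user_active):
        return "running"
    return "waiting for memory"


# Illustrative numbers only: a 512 MB machine and a task that has grown to 300 MB.
prefs = MemoryPrefs(total_mb=512.0, percent_in_use=50.0, percent_idle=90.0)
for user_active in (True, False):
    print("user active:", user_active, "->", task_state(300.0, prefs, user_active))
```

In this model the 300 MB task waits while the user is active (limit 256 MB) and runs once the machine is idle (limit 460.8 MB). The real client's memory accounting is more involved than a single comparison.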
Rosetta Moderator: Mod.Sense
ID: 42835 · Rating: 0
Greg_BE

Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 42842 - Posted: 29 Jun 2007, 17:57:37 UTC - in response to Message 42835.  
Last modified: 29 Jun 2007, 18:04:07 UTC

It's set to 100% when not in use and 75% when in use.
BOINC is currently using 5800K with a peak of 9944K.
BOINCMGR is using 3928K with a peak of 9188K.

From the BOINC manager messages:
6/29/2007 8:01:24 PM||Preferences limit memory usage when active to 255.74MB
6/29/2007 8:01:24 PM||Preferences limit memory usage when idle to 460.34MB
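
As a quick cross-check of the two log lines above, the limits can be converted back into percentages of installed memory. This is a rough sketch that assumes the machine has 512 MB of RAM, which the log does not state, and the function name is made up for illustration.

```python
TOTAL_RAM_MB = 512.0  # assumption: installed memory, not taken from the log


def implied_percent(limit_mb: float, total_mb: float = TOTAL_RAM_MB) -> float:
    """Return the share of total memory that a logged limit corresponds to."""
    return 100.0 * limit_mb / total_mb


in_use = implied_percent(255.74)  # "limit memory usage when active"
idle = implied_percent(460.34)    # "limit memory usage when idle"
print(f"in use: {in_use:.1f}%  idle: {idle:.1f}%")
```

Under that assumption the logged limits work out to roughly 50% while in use and 90% while idle. That suggests, without proving it, that the preferences in effect on this host may not be the 75%/100% values expected.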

Why would RAH start one WU and then stop it and jump to the first WU of a different date?


This is what BOINC does when a task needs more memory than BOINC is currently allowed to use. It starts another task and runs it as long as it can before possibly hitting the same need for memory.

Did you happen to look in the task manager at the memory that task (and any other active BOINC tasks) was using? You can use the view pulldown to select the Mem Usage column for display.

Looks like your machine has 512MB of memory and only a single CPU, so BOINC should only have one active task at a time. A single Rosetta task, even a "large" one, should only need about half of that. How are your general preferences set for memory that BOINC is allowed to use?

What may have happened is that you allow BOINC to use a greater % of memory when the machine is idle. The task needed more and more memory as it progressed in that specific model and reached the upper limit for memory while your computer was in use (or perhaps you stopped in to check on it, so your computer went from idle to in use), and then you saw the task waiting for memory. Later, perhaps you left the computer, it went idle again, and BOINC was allowed enough memory to complete the first task. The above assumes that you allow more memory while idle than when in use. That is what most people do if they limit memory usage.


ID: 42842 · Rating: 0
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 42844 - Posted: 29 Jun 2007, 18:15:11 UTC - in response to Message 42842.  

BOINC is using currently 5800K with a peak of 9944K
BOINCMGR is using 3928K with a peak of 9188K


BOINC is looking at the memory used by the "Rosetta_xxxxx" process. So the above are only the minor portion of the picture.

What I'd be curious to know is how large the Rosetta task got while it ran. It's too late to check this time, but that was what I was trying to ask about. It's going to be something north of 110,000K; it's just a question of how far north.


Rosetta Moderator: Mod.Sense
ID: 42844 · Rating: 0
Greg_BE

Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 42847 - Posted: 29 Jun 2007, 19:03:24 UTC - in response to Message 42844.  

BOINC is using currently 5800K with a peak of 9944K
BOINCMGR is using 3928K with a peak of 9188K


BOINC is looking at the memory used by the "Rosetta_xxxxx" process. So the above are only the minor portion of the picture.

What I'd be curious to know is how large the Rosetta task got while it ran. It's too late to check this time, but that was what I was trying to ask about. It's going to be something north of 110,000K; it's just a question of how far north.



I see what you mean, but what gets me is that there were at least 6 more WUs next in line to run, all with the same due date, yet BOINC chose to go to the next day and start work on a unit that had already started but had been suspended while I was running selected work units to get everything set up for a straight run of nothing but bench-0512 units.
ID: 42847 · Rating: 0
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 42850 - Posted: 29 Jun 2007, 19:26:21 UTC

Rhiju, is there any variation among these tasks, with some requesting large memory and some not?

I see now why you are saying that was an odd thing for BOINC to do. I had originally thought you were just confused about starting multiple tasks. You've got much more to your picture.

If the project sends out a large memory task, I believe the BOINC client knows that, and so it may have skipped a few of those in favor of a lower memory task on the later due date. Often the task names will be similar enough that they look the same.
Rosetta Moderator: Mod.Sense
ID: 42850 · Rating: 0
Greg_BE

Joined: 30 May 06
Posts: 5691
Credit: 5,859,226
RAC: 0
Message 42862 - Posted: 29 Jun 2007, 21:31:15 UTC
Last modified: 29 Jun 2007, 21:34:28 UTC

Have a look at the message on this WU.

It says something to the effect of "Can't set up shared mem: -1
Will run in standalone mode."

I think this is the one I had to force to finish.
The other one is in my post from earlier today: it stalled and then reported as complete but was only 50% done.

Everything else has run OK today. I have a 6hr run cycle per WU.
ID: 42862 · Rating: 0

©2024 University of Washington
https://www.bakerlab.org