Rosetta@home

minirosetta 2.05

David E K
Forum moderator
Project administrator
Project developer
Project scientist

Joined: Jul 1 05
Posts: 961
ID: 14
Credit: 2,369,109
RAC: 1,381
Message 64951 - Posted 13 Jan 2010 18:11:01 UTC

This app update includes a fix for checkpointing.

Please report issues and bugs here!

thanks,

DK

Sarel

Joined: May 11 06
Posts: 51
ID: 81994
Credit: 81,712
RAC: 0
Message 64953 - Posted 13 Jan 2010 19:21:01 UTC

Hi,

I'll be resubmitting the *gbnnotyr* protein design trajectories to boinc over the next few hours. The tests I ran on ralph showed that the checkpointing issue is resolved. To make sure that there are no other issues, I will submit these trajectories 'slowly' starting with a modest sized batch, and according to the responses I get on the thread I will increase the number of work units over the next few days. Please keep me posted about these problems. Your reports have been invaluable in tracking this problem down!

Sarel.
____________

hellotheworld

Joined: Feb 27 08
Posts: 3
ID: 244310
Credit: 131,814
RAC: 46
Message 64959 - Posted 14 Jan 2010 9:03:30 UTC - in response to Message ID 64951.

This app update includes a fix for checkpointing.

Please report issues and bugs here!

thanks,

DK


Hi,

I have a strange graphic I wanted to show you... I *think* there *might* be a problem...

Please take a look at this screenshot:

http://www.flickr.com/photos/37828392@N08/4273
(Capitain Flam is my account on Flickr)


Possible bug for the application BOINC / ROSETTA, because the protein is *completely* folded, in a tiny meat ball ;-)

I hope this is NOT a bug, or even, I hope it will help you to solve it ;)

hellotheworld

Joined: Feb 27 08
Posts: 3
ID: 244310
Credit: 131,814
RAC: 46
Message 64960 - Posted 14 Jan 2010 9:23:40 UTC - in response to Message ID 64959.

This app update includes a fix for checkpointing.

Please report issues and bugs here!

thanks,

DK


Hi,

I have a strange graphic I wanted to show you... I *think* there *might* be a problem...

Please take a look at this screenshot:

http://www.flickr.com/photos/37828392@N08/4273

(Capitain Flam is my account on Flickr)


Possible bug for the application BOINC / ROSETTA, because the protein is *completely* folded, in a tiny meat ball ;-)

I hope this is NOT a bug, or even, I hope it will help you to solve it ;)


Sorry, I didn't copy and paste the link properly... Here it is!

http://www.flickr.com/photos/37828392@N08/4273113531/

Sorry sorry sorry :-|

Admin

Joined: Apr 13 07
Posts: 42
ID: 164784
Credit: 260,782
RAC: 0
Message 64967 - Posted 14 Jan 2010 14:35:05 UTC

Bad news, guys: I just woke up today and my homopt_cstmc WU is stuck at 40%, using no CPU time, although 3-4 other differently named WUs have gone through and been totally fine. Just thought I'd let you know.

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3389
ID: 106194
Credit: 0
RAC: 0
Message 64969 - Posted 14 Jan 2010 16:40:36 UTC

Admin, please double check the application version those are running under. (it is shown in the tasks tab of the advanced view under the application column)
____________
Rosetta Moderator: Mod.Sense

hellotheworld

Joined: Feb 27 08
Posts: 3
ID: 244310
Credit: 131,814
RAC: 46
Message 64971 - Posted 14 Jan 2010 16:58:37 UTC - in response to Message ID 64969.

Admin, please double check the application version those are running under. (it is shown in the tasks tab of the advanced view under the application column)


About http://www.flickr.com/photos/37828392@N08/4273113531/ :

I confirm it was running under Rosetta mini 2.03.

Admin

Joined: Apr 13 07
Posts: 42
ID: 164784
Credit: 260,782
RAC: 0
Message 64972 - Posted 14 Jan 2010 17:05:05 UTC

I can 100% confirm I am/was running the new version, minirosetta 2.05, when I got the stuck homopt WU. Here's the WU link: http://boinc.bakerlab.org/rosetta/workunit.php?wuid=282419440. A wingman seems to have also had a compute error, but I can confirm I was running the updated 2.05 client.

Rabinovitch

Joined: Apr 28 07
Posts: 28
ID: 170444
Credit: 1,483,610
RAC: 2,997
Message 64974 - Posted 14 Jan 2010 17:10:04 UTC

The new app is working well. And it seems that the WUs now need less RAM (about 100 MB per WU). Is that true? If so, maybe this is a step toward Rosetta's GPU client? :-)

Admin

Joined: Apr 13 07
Posts: 42
ID: 164784
Credit: 260,782
RAC: 0
Message 64975 - Posted 14 Jan 2010 17:14:42 UTC
Last modified: 14 Jan 2010 17:19:18 UTC

Although I didn't grab a screenshot, the task details of the work unit show "application version 2.05". You can check it out at http://boinc.bakerlab.org/rosetta/result.php?resultid=310562856. I wish I could give you guys more information; is there anything else I can do to help you solve this issue? All other work so far has gone through fine, but upon further investigation the common factor is Windows 7. I have a boinc_filtered loopbuild_threading task running now at 33%, a type which gave me problems on 2.03, so I will see how it goes on 2.05 and give an update.

Oxfez

Joined: May 28 07
Posts: 1
ID: 180596
Credit: 161,558
RAC: 0
Message 64977 - Posted 14 Jan 2010 19:43:55 UTC

One of my tasks has "meatballed" too:

lr5_no_pro_close_no_dun_A_rlbd_1rnb_SAVE_ALL_OUT_IGNORE_THE_REST_DECOY_16701_583_0

Running new 2.05

According to the time to completion, it's going to be a long old process too.

Sarel

Joined: May 11 06
Posts: 51
ID: 81994
Credit: 81,712
RAC: 0
Message 64979 - Posted 14 Jan 2010 20:47:33 UTC - in response to Message ID 64974.

Thanks! If these were the *gbn* runs, then they have a memory-efficient first step, but they /might/ then go on to a memory-intensive step requiring 300-500 MB...

The new app is working well. And it seems that the WUs now need less RAM (about 100 MB per WU). Is that true? If so, maybe this is a step toward Rosetta's GPU client? :-)


____________

Evan

Joined: Dec 23 05
Posts: 268
ID: 42505
Credit: 402,585
RAC: 0
Message 64984 - Posted 15 Jan 2010 0:40:43 UTC - in response to Message ID 64975.

Although I didn't grab a screenshot, the task details of the work unit show "application version 2.05". You can check it out at http://boinc.bakerlab.org/rosetta/result.php?resultid=310562856. I wish I could give you guys more information; is there anything else I can do to help you solve this issue? All other work so far has gone through fine, but upon further investigation the common factor is Windows 7. I have a boinc_filtered loopbuild_threading task running now at 33%, a type which gave me problems on 2.03, so I will see how it goes on 2.05 and give an update.


I wouldn't worry about it. A number of these have failed. I have just sent in two that failed on their second run.

____________

Admin

Joined: Apr 13 07
Posts: 42
ID: 164784
Credit: 260,782
RAC: 0
Message 64985 - Posted 15 Jan 2010 1:08:25 UTC

While the boinc_filtered WU went through fine, I have another that has stalled: opttest2.2d4f..... Just thought I'd give an update: it froze at 18.046%. Other than that, 2.05 seems stable, although sometimes the graphics crash when I try to look at them.

Admin

Joined: Apr 13 07
Posts: 42
ID: 164784
Credit: 260,782
RAC: 0
Message 64986 - Posted 15 Jan 2010 3:42:21 UTC
Last modified: 15 Jan 2010 3:43:42 UTC

I just had to shut down BOINC, which I did properly, to run a few programs quickly. It seems both WUs the computer was working on started from model 0 when the client restarted. Both units had between 10 and 15 models done at around 20% complete, which is where they are currently (20% complete and now working on model 1). Did the units really just start over from 0 and erase all the previous work? Is this another issue we are tracking? Just trying to be helpful!

robertmiles

Joined: Jun 16 08
Posts: 658
ID: 264600
Credit: 3,743,487
RAC: 7,157
Message 64987 - Posted 15 Jan 2010 3:55:59 UTC

In another thread, I've seen something about workunits using one of the new features not having working checkpointing while that feature is running. Checkpointing still works for workunits that don't use that feature.

Admin

Joined: Apr 13 07
Posts: 42
ID: 164784
Credit: 260,782
RAC: 0
Message 64988 - Posted 15 Jan 2010 4:02:26 UTC

I was reading the 2.03 thread and saw something about the checkpoint issue, which I just saw myself; that's why I thought I would point it out. You're saying everything is fine even though the model count says it's starting from 1 again, correct? Thanks for the help!

Mad_Max

Joined: Dec 31 09
Posts: 150
ID: 365007
Credit: 5,175,766
RAC: 6,775
Message 64993 - Posted 15 Jan 2010 15:12:43 UTC - in response to Message ID 64974.

The new app is working well. And it seems that the WUs now need less RAM (about 100 MB per WU). Is that true? If so, maybe this is a step toward Rosetta's GPU client? :-)


I too notice that version 2.05 uses less RAM, and not only on *gbn* tasks: somewhere around 200-250 MB instead of 300-350 MB in version 2.03.
Is this one of the "and other minor updates" mentioned in the "Version Release Log"?
If so, it seems to me not entirely "minor" :)

Mad_Max

Joined: Dec 31 09
Posts: 150
ID: 365007
Credit: 5,175,766
RAC: 6,775
Message 64994 - Posted 15 Jan 2010 16:05:29 UTC

I noticed something in the new version (though it may be a feature of this particular WU; I never came across this type of WU in version 2.03). During model calculation the first steps go very fast: for example, 36000 steps were calculated in just 6 minutes. After that, the calculation became very slow, and the following 10 steps took more than 10 minutes.
Is this intended?
Task example: job_boinc_1bm8__broker_random_pairings_from_psipred_16 906_1305_1

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3389
ID: 106194
Credit: 0
RAC: 0
Message 64995 - Posted 15 Jan 2010 16:45:34 UTC

Please don't presume that the information from the Project Team is an inaccurate description and that your memory observations are a new and permanent condition for all to enjoy going forward. As Sarel points out, they introduced a new type of work unit which has a new low-memory phase of execution. And so you are only going to see the lower memory usage when that specific type of task is being worked on. And this new type of work unit was introduced in prior versions, so the actual delta to v2.05 is small. Since this new type of work is a current area of review, you may see a high concentration of this type of work for a period of time. But it doesn't mean we can presume more than was stated.
____________
Rosetta Moderator: Mod.Sense

Aroundomaha

Joined: Sep 11 08
Posts: 14
ID: 278107
Credit: 41,518,185
RAC: 13,084
Message 64996 - Posted 15 Jan 2010 21:46:29 UTC - in response to Message ID 64951.

For the past two days my Windows 7 machine has been bombing with occasional blue screen of death crashes. I ran the Microsoft debugger and it points to an issue with minirosetta 2.05.


--------- enclosed debug information -----------------
3: kd> !analyze -v
*******************************************************************************
* *
* Bugcheck Analysis *
* *
*******************************************************************************

MULTIPLE_IRP_COMPLETE_REQUESTS (44)
A driver has requested that an IRP be completed (IoCompleteRequest()), but
the packet has already been completed. This is a tough bug to find because
the easiest case, a driver actually attempted to complete its own packet
twice, is generally not what happened. Rather, two separate drivers each
believe that they own the packet, and each attempts to complete it. The
first actually works, and the second fails. Tracking down which drivers
in the system actually did this is difficult, generally because the trails
of the first driver have been covered by the second. However, the driver
stack for the current request can be found by examining the DeviceObject
fields in each of the stack locations.
Arguments:
Arg1: fffffa800afb3320, Address of the IRP
Arg2: 0000000000000eae
Arg3: 0000000000000000
Arg4: 0000000000000000

Debugging Details:
------------------


IRP_ADDRESS: fffffa800afb3320

CUSTOMER_CRASH_COUNT: 1

DEFAULT_BUCKET_ID: VISTA_DRIVER_FAULT

BUGCHECK_STR: 0x44

PROCESS_NAME: minirosetta_2.

CURRENT_IRQL: 2

LAST_CONTROL_TRANSFER: from fffff8000285fb95 to fffff80002875f00


Mad_Max

Joined: Dec 31 09
Posts: 150
ID: 365007
Credit: 5,175,766
RAC: 6,775
Message 65002 - Posted 16 Jan 2010 3:02:32 UTC - in response to Message ID 64953.
Last modified: 16 Jan 2010 3:04:02 UTC

Hi,

I'll be resubmitting the *gbnnotyr* protein design trajectories to boinc over the next few hours. The tests I ran on ralph showed that the checkpointing issue is resolved. To make sure that there are no other issues, I will submit these trajectories 'slowly' starting with a modest sized batch, and according to the responses I get on the thread I will increase the number of work units over the next few days. Please keep me posted about these problems. Your reports have been invaluable in tracking this problem down!

Sarel.


At last I have received enough WUs of this type to check. My conclusion: there are still problems with checkpointing. Unlike in version 2.03, the "CPU time at last checkpoint" information is now displayed correctly, which lets the BOINC client switch between projects, but after a restart the calculation still starts from the beginning.
Here is an example task which I watched: 8gbnnotyr_3gbn_2iug_9Jan2010_16915_7_0
Before the restart it had used 0:33 hours of CPU time with 27 models done; after the restart, another 1:27 hours passed and 72 more models were calculated.
But apparently only the 72 models computed after the restart are reflected in the report; the 27 earlier models are missing, and the task also completed with a Validate error.

Here is another example: 8gbnnotyr_3gbn_1ijt_9Jan2010_16915_1_0
The same result: the report contains only the models computed after the restart, and again a Validate error.

For comparison, here is a task of this type which was computed without breaks: 8gbnnotyr_3gbn_1woj_9Jan2010_16909_12_0
Without interruption, 2 hours of CPU time resulted in 94 models (compare with 72 and 67 in the previous cases for the same 2 hours of CPU time) and Validate state = Valid.
The difference corresponds to roughly 0.5 hours of CPU time, which is about how much time had passed before the restarts.
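
As a rough check of the first example, the figures quoted in this post can be put through a little arithmetic. This is only a sketch in Python: the variable names are illustrative, the numbers are the ones reported above, and it assumes the model rate stays roughly constant over the run.

```python
# Figures quoted above for task 8gbnnotyr_3gbn_2iug_9Jan2010_16915_7_0.
minutes_before_restart = 33   # 0:33 h of CPU time before the restart
models_before_restart = 27    # models done before the restart
minutes_after_restart = 87    # 1:27 h of CPU time after the restart
models_after_restart = 72     # the only models that appear in the report

total_minutes = minutes_before_restart + minutes_after_restart
total_models = models_before_restart + models_after_restart

# Share of CPU time and share of models that fall before the restart.
time_share_lost = minutes_before_restart / total_minutes
model_share_lost = models_before_restart / total_models

print(f"total CPU time: {total_minutes} min, models computed: {total_models}")
print(f"CPU time spent before restart: {time_share_lost:.1%}")
print(f"models missing from the report: {model_share_lost:.1%}")
```

Both shares come out at a little over a quarter, which fits the observation that the missing models are the ones computed before the restart. The total of 99 models is also in the same range as the 94 from the uninterrupted run.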

Mad_Max

Joined: Dec 31 09
Posts: 150
ID: 365007
Credit: 5,175,766
RAC: 6,775
Message 65003 - Posted 16 Jan 2010 3:22:05 UTC - in response to Message ID 64995.

Please don't presume that the information from the Project Team is an inaccurate description and that your memory observations are a new and permanent condition for all to enjoy going forward. As Sarel points out, they introduced a new type of work unit which has a new low-memory phase of execution. And so you are only going to see the lower memory usage when that specific type of task is being worked on. And this new type of work unit was introduced in prior versions, so the actual delta to v2.05 is small. Since this new type of work is a current area of review, you may see a high concentration of this type of work for a period of time. But it doesn't mean we can presume more than was stated.

Yes, I was mistaken here. It is just that for some time after version 2.05 came out I received ONLY the new types of WU, which use little RAM, and from that I came to a (wrong) conclusion.
But now some WUs of the old types are arriving, and for them memory usage is about the same as in version 2.03.

Evan

Joined: Dec 23 05
Posts: 268
ID: 42505
Credit: 402,585
RAC: 0
Message 65011 - Posted 16 Jan 2010 23:27:01 UTC

http://boinc.bakerlab.org/rosetta/result.php?resultid=310901552
This one stalled twice at about 5 hrs 35 mins but was running for over 9 hours. I restarted boinc and it then stalled again in the same place.
____________

Mike_Solo

Joined: Nov 16 09
Posts: 2
ID: 358616
Credit: 67,261
RAC: 0
Message 65013 - Posted 17 Jan 2010 11:06:30 UTC

Soooo... this new version hangs too often. 2.0.3 was much more stable.
It hangs on my 2xAthlonMP 2800 as well as on the Intel E8400, so the CPU is not the issue.
I think about 15% of tasks get stuck in the middle, consuming >200 MB of RAM but no CPU.
I'm thinking of leaving Rosetta for a while until a new version is ready, as I'm tired of kicking off broken tasks every morning :(

____________

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3389
ID: 106194
Credit: 0
RAC: 0
Message 65015 - Posted 17 Jan 2010 11:47:55 UTC

Looks like Mike Solo has 3 machines:
One WinXP using BOINC version 6.10.18
One WinXP using BOINC version 6.10.18
One WinServer 2003 using BOINC version 6.10.18
____________
Rosetta Moderator: Mod.Sense

Mad_Max

Joined: Dec 31 09
Posts: 150
ID: 365007
Credit: 5,175,766
RAC: 6,775
Message 65020 - Posted 17 Jan 2010 18:11:01 UTC
Last modified: 17 Jan 2010 18:12:57 UTC

2 more tasks of type *gbnnotyr* with the same result: when run without stops everything works normally, but if there was a break during the calculation, the results from before the break disappear and the task ends with a validate error.
In total I have:
2 WUs processed without stops, and both seem OK:
http://boinc.bakerlab.org/rosetta/result.php?resultid=310752146
http://boinc.bakerlab.org/rosetta/result.php?resultid=311145245

And 3 WUs with a break in processing, all of which completed with validate errors:
http://boinc.bakerlab.org/rosetta/result.php?resultid=310935403
http://boinc.bakerlab.org/rosetta/result.php?resultid=310946429
http://boinc.bakerlab.org/rosetta/result.php?resultid=311163725

P.S.
The last of these 3 (id 311163725) was stopped at the very beginning of its run, before the 1st checkpoint had even been written. However, after restarting, it still completed with a validate error.
So it is possible that validate errors in this type of WU are not directly linked to checkpoints, and that these are 2 different bugs.

Sarel

Joined: May 11 06
Posts: 51
ID: 81994
Credit: 81,712
RAC: 0
Message 65021 - Posted 17 Jan 2010 19:06:18 UTC - in response to Message ID 65020.

Thanks! We'll have a look at this as soon as possible and let you know what we find. Best, Sarel.

2 more tasks of type *gbnnotyr* with the same result: when run without stops everything works normally, but if there was a break during the calculation, the results from before the break disappear and the task ends with a validate error.
In total I have:
2 WUs processed without stops, and both seem OK:
http://boinc.bakerlab.org/rosetta/result.php?resultid=310752146
http://boinc.bakerlab.org/rosetta/result.php?resultid=311145245

And 3 WUs with a break in processing, all of which completed with validate errors:
http://boinc.bakerlab.org/rosetta/result.php?resultid=310935403
http://boinc.bakerlab.org/rosetta/result.php?resultid=310946429
http://boinc.bakerlab.org/rosetta/result.php?resultid=311163725

P.S.
The last of these 3 (id 311163725) was stopped at the very beginning of its run, before the 1st checkpoint had even been written. However, after restarting, it still completed with a validate error.
So it is possible that validate errors in this type of WU are not directly linked to checkpoints, and that these are 2 different bugs.


____________

svincent

Joined: Dec 30 05
Posts: 202
ID: 44923
Credit: 4,404,794
RAC: 6,286
Message 65022 - Posted 17 Jan 2010 19:40:25 UTC

In the last week I've had to abort 11 tasks on W7 because the tasks are hung consuming 0% CPU time. I was hoping that the combination of upgrading to the latest BOINC and the new 2.05 version of R@h would fix the problem but no: it continues as before. Tasks on Mac OS X seem to be unaffected by this problem. Until there's some indication this problem is fixed I'm not getting any more tasks for W7.

AdeB

Joined: Dec 12 06
Posts: 45
ID: 135244
Credit: 2,473,178
RAC: 1,976
Message 65023 - Posted 17 Jan 2010 21:10:47 UTC

Task: 311103842
Workunit: homopt_nat2.t368_.t368_.IGNORE_THE_REST.S_00003_0000018_07.pdb_00003.pdb.JOB_16835_29

ERROR: No values of the appropriate type specified for multi-valued option -loops:loop_file

AdeB
____________

P . P . L .

Joined: Aug 20 06
Posts: 581
ID: 105843
Credit: 4,864,105
RAC: 0
Message 65024 - Posted 17 Jan 2010 21:17:05 UTC
Last modified: 17 Jan 2010 22:15:46 UTC

Here's another Validate error, it didn't seem to have any problems running.

Edit/ This was on 64bit linux.

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=283574991

8gbnnotyr_3gbn_1s68_9Jan2010_16915_22_0

# cpu_run_time_pref: 14400
======================================================
DONE :: 37 starting structures 14469.9 cpu seconds
This process generated 37 decoys from 37 attempts
======================================================

BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down cleanly ...
called boinc_finish

Validate error__Done__14,470.06
=========================================================================
Edit/ added this.

This one was on linux 32bit, again didn't seem to have a problem.

Very low credits.

8gbnnotyr_3gbn_1opd_9Jan2010_16915_42_0

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=283817716

# cpu_run_time_pref: 14400
======================================================
DONE :: 8 starting structures 12134.6 cpu seconds
This process generated 8 decoys from 8 attempts
======================================================

BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down cleanly ...
called boinc_finish

Success__Done__12,135.35__28.60__4.61
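
To compare runs like these two, the "DONE ::" summary lines can be turned into an average CPU time per starting structure. The following is a minimal Python sketch, not part of BOINC or Rosetta: the helper name `seconds_per_structure` is made up for illustration, and the regular expression assumes the summary line always has the form shown in the output above.

```python
import re

# The two summary lines copied from the task output quoted above.
LOG_LINES = [
    "DONE :: 37 starting structures 14469.9 cpu seconds",
    "DONE :: 8 starting structures 12134.6 cpu seconds",
]

# Assumed line format: "DONE :: <count> starting structures <seconds> cpu seconds"
SUMMARY = re.compile(r"DONE ::\s+(\d+) starting structures\s+([\d.]+) cpu seconds")


def seconds_per_structure(line):
    """Return average CPU seconds per starting structure, or None if the
    line is not a summary line (or reports zero structures)."""
    match = SUMMARY.search(line)
    if match is None:
        return None
    structures = int(match.group(1))
    cpu_seconds = float(match.group(2))
    if structures == 0:
        return None
    return cpu_seconds / structures


for line in LOG_LINES:
    print(f"{seconds_per_structure(line):.0f} CPU seconds per structure")
```

On these two lines the second task averages roughly four times as many CPU seconds per structure as the first, which may be related to the low credit mentioned for it.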
____________


Admin

Joined: Apr 13 07
Posts: 42
ID: 164784
Credit: 260,782
RAC: 0
Message 65025 - Posted 17 Jan 2010 23:06:39 UTC

Validate Error on Win7, successfully completed by a wingman on win xp
http://boinc.bakerlab.org/rosetta/result.php?resultid=311128874
name: 8gbnnotyr_3gbn_1iuk_9Jan2010_16915_131_0

Sid Celery

Joined: Feb 11 08
Posts: 806
ID: 241409
Credit: 10,030,156
RAC: 9,347
Message 65026 - Posted 18 Jan 2010 1:15:29 UTC
Last modified: 18 Jan 2010 1:17:58 UTC

About time I updated my recent fault lists. I've had several errors under 2.03, but only this under 2.05:

On Intel T5500 laptop running W7 and Boinc 6.10.18

Outcome Validate error
8gbnnotyr_3gbn_2onu_9Jan2010_16909_17_0

# cpu_run_time_pref: 28800
======================================================
DONE :: 345 starting structures 28787.1 cpu seconds
This process generated 345 decoys from 345 attempts
======================================================


Note: On several occasions the following line appears:

No heartbeat from core client for 30 sec - exiting


Edit: Wingman running XP also received a validate error on apparently successful completion.
____________

MeGaBeSuNTa

Joined: Oct 15 07
Posts: 1
ID: 212621
Credit: 778,249
RAC: 2
Message 65029 - Posted 18 Jan 2010 12:24:34 UTC

Hi guys, let me just tell you.
If you're using Windows 7, the beta version 6.10.24 or even the new beta 6.10.29 is much more stable.
I've used the beta 6.10.24 for a long time and I had no problems at all with Rosetta.
For me it's much more stable than 6.10.18, on Windows 7 of course. Anyway, that's just my case.

Mad_Max

Joined: Dec 31 09
Posts: 150
ID: 365007
Credit: 5,175,766
RAC: 6,775
Message 65031 - Posted 18 Jan 2010 13:45:20 UTC - in response to Message ID 65023.

Task: 311103842
Workunit: homopt_nat2.t368_.t368_.IGNORE_THE_REST.S_00003_0000018_07.pdb_00003.pdb.JOB_16835_29

ERROR: No values of the appropriate type specified for multi-valued option -loops:loop_file

AdeB


I too had the same error in this type of WU: http://boinc.bakerlab.org/rosetta/result.php?resultid=310238605
And on the 2nd computer that processed this WU, too: http://boinc.bakerlab.org/rosetta/result.php?resultid=310471681
True, that was still version 2.03, which is why I did not write about it, but the post above is an example of the same error in version 2.05 as well.

Mad_Max

Joined: Dec 31 09
Posts: 150
ID: 365007
Credit: 5,175,766
RAC: 6,775
Message 65032 - Posted 18 Jan 2010 14:51:56 UTC - in response to Message ID 65024.

Here's another Validate error, it didn't seem to have any problems running.

Edit/ This was on 64bit linux.
http://boinc.bakerlab.org/rosetta/workunit.php?wuid=283574991
8gbnnotyr_3gbn_1s68_9Jan2010_16915_22_0

It seems there is only one problem with that WU: it was restarted too (maybe a switch to another project?), and hit the bug related to that.


This one was on linux 32bit, again didn't seem to have a problem.

Very low credits.

8gbnnotyr_3gbn_1opd_9Jan2010_16915_42_0
http://boinc.bakerlab.org/rosetta/workunit.php?wuid=283817716

I have such an example too: http://boinc.bakerlab.org/rosetta/result.php?resultid=311202691
Claimed credit = 54.35 vs Granted credit = 1.83 (about 30 times lower)
And I can even tell exactly what happened with it:
Usually in this type of WU the models are computed very fast, around 1 or a few minutes per model. This task started the same way: in approximately 15 minutes, 13 models were calculated (about 500 steps each). But on the 14th something happened: the calculation did not stop at the 500th step and went on much longer. I saw the counter pass 40000 steps and did not watch any further (I think it was about 60000-70000 steps in total).
I was already thinking of aborting this task, since I thought the calculation had gone into a loop, but after 5 hours (instead of several minutes) the calculation of the 14th model did complete. That is, 13 models took about 15 minutes, and the 14th about 5 hours.
Hence such a small Granted credit, since it is calculated in proportion to the number of models. (If not for this 14th model, about 300 models would have been calculated in 5 hours instead of 14, and Granted credit would have been close to Claimed credit.)
I think much the same happened with your task...
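
The rough figures in this post can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the variable names are illustrative, the inputs are the approximate numbers quoted above, and it takes at face value the poster's assumption that granted credit scales with the number of models.

```python
# Approximate figures quoted in the post above.
claimed_credit = 54.35
granted_credit = 1.83
fast_models = 13        # models finished in the first ~15 minutes
fast_minutes = 15
slow_model_hours = 5    # time spent on the 14th model alone
actual_models = fast_models + 1

# How many times lower the granted credit is than the claim.
credit_ratio = claimed_credit / granted_credit

# Models per hour while everything was running fast.
fast_rate_per_hour = fast_models / (fast_minutes / 60)

# Models that would have fit into the 5 hours at that fast rate.
hypothetical_models = fast_rate_per_hour * slow_model_hours

print(f"claimed / granted: {credit_ratio:.1f}x")
print(f"fast rate: {fast_rate_per_hour:.0f} models per hour")
print(f"models in {slow_model_hours} h at that rate: {hypothetical_models:.0f}")
print(f"hypothetical / actual models: {hypothetical_models / actual_models:.1f}x")
```

The two ratios do not match exactly, which is expected given how approximate the inputs are, but they are of the same order of magnitude and so are consistent with the explanation that one very slow model accounts for the low granted credit.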

P.S.
Quite probably this is NOT an error but a feature of the algorithm: if it finds something interesting, a more detailed calculation of that model probably starts. It would be good for the scientists responsible for this type of WU to clarify this.

Sarel

Joined: May 11 06
Posts: 51
ID: 81994
Credit: 81,712
RAC: 0
Message 65034 - Posted 18 Jan 2010 18:39:22 UTC

Hello,

based on the reports of validator issues, David Kim has now fixed the validator. He also asked me to remind people that credit is granted based on the client's claimed credit, regardless of validator results.

Let us know if you see more such problems.

Thanks, Sarel.
____________

Sid Celery

Joined: Feb 11 08
Posts: 806
ID: 241409
Credit: 10,030,156
RAC: 9,347
Message 65035 - Posted 18 Jan 2010 19:27:57 UTC

Thanks for the information Sarel - and David for the fix.

No further errors today, but a cursory check has revealed I haven't re-booted my desktop since Dec 15th! I'm sure I've had various updates since then, but that's a ridiculous amount of uptime for me... Back in 5... ;)
____________

Link

Joined: May 4 07
Posts: 260
ID: 173059
Credit: 338,704
RAC: 3
Message 65036 - Posted 18 Jan 2010 19:57:23 UTC - in response to Message ID 65034.

credit is granted based on the client's claimed credit, regardless of validator results.

Does that not apply only to results with compute errors or validate errors?
____________
.

Mad_Max

Joined: Dec 31 09
Posts: 150
ID: 365007
Credit: 5,175,766
RAC: 6,775
Message 65037 - Posted 18 Jan 2010 23:49:52 UTC - in response to Message ID 64959.
Last modified: 18 Jan 2010 23:56:15 UTC

hellotheworld wrote:

Hi,
I have a strange graphic I wanted to show you... I *think* there *might* be a problem...
Please take a look at this screenshot:
http://www.flickr.com/photos/37828392@N08/4273113531/
(Capitain Flam is my account on Flickr)

Possible bug for the application BOINC / ROSETTA, because the protein is *completely* folded, in a tiny meat ball ;-)
I hope this is NOT a bug, or even, I hope it will help you to solve it ;)

Oxfez wrote:
One of my tasks has "meatballed" too:

lr5_no_pro_close_no_dun_A_rlbd_1rnb_SAVE_ALL_OUT_IGNORE_THE_REST_DECOY_16701_583_0

Running new 2.05
According to the time to completion, it's going to be a long old process too.


I have another "meatball" too.
Task: http://boinc.bakerlab.org/rosetta/result.php?resultid=311361747
Some screenshots:
http://s001.radikal.ru/i193/1001/1f/cffd2181b53b.jpg
http://i073.radikal.ru/1001/d9/c87d3083bfb9.jpg
http://s41.radikal.ru/i094/1001/8e/a86dfd3a7d6a.jpg
Plus, for about the last 2 hours of computation (or ~20 steps) there were no changes in Energy or RMSD at all. (I did not take more screenshots, since nothing changed further except CPU Time and Steps count.)

I do not think that it is an error in the software, but probably a weak spot in the scientific algorithm itself, so it should be addressed not to the programmers but to the scientists.

Greg_BE

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,969,735
RAC: 81
Message 65038 - Posted 19 Jan 2010 0:17:29 UTC
Last modified: 19 Jan 2010 0:20:37 UTC

{wrong version area for my posts}

namtraf

Joined: Jun 6 06
Posts: 1
ID: 91525
Credit: 535,242
RAC: 0
Message 65040 - Posted 19 Jan 2010 5:49:43 UTC

minirosetta 2.05 hangs on my computer frequently. It's a Windows Vista machine. When it hangs, the CPU meter shows no activity, the time to completion is incrementing instead of decrementing and the screen saver for R@h is blank. I've shut my machine off and then on 3 times, and R@h runs normally after that: the CPU meter shows activity, the time to completion is decrementing (it has decreased from more than 10 hours to around 2 hours) and the screen saver works. This started happening in the second week of January. My machine was off from December 18 to January 8. After a few hours of running, R@h hangs again.

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3389
ID: 106194
Credit: 0
RAC: 0
Message 65043 - Posted 19 Jan 2010 14:40:44 UTC - in response to Message ID 65034.

Hello,

based on the reports of validator issues, David Kim has now fixed the validator. He also asked me to remind people that credit is granted based on the client's claimed credit, regardless of validator results.

Let us know if you see more such problems.

Thanks, Sarel.


If I have my facts straight, Sarel means to say that credit is issued as normal. This means based on the average credit claims PER MODEL of the tasks reported before yours. This is a bit odd for Sarel's tasks because, as he's been explaining, there is a new technique where a quick cursory review of a given model is performed, and then some small percentage of those are deemed worth a more detailed review. And so model runtimes can vary from around 60 seconds, to several hours. So you will see credit all over the map. But it seems that on average most tasks spend the majority of their time crunching on one low level model, and so over time credit is still comparable with other types of Rosetta work.

If you somehow run through 60 models, and none require low level analysis, and you only allow a 1hr runtime preference, then you would probably see considerably more credit granted than your claim. As I say, this would be rather rare. If you run for a 24hr runtime preference, then you'll probably see several low level models. But then that is over a longer period of crunching too. But once you've run through several such tasks the credit will average out, as it always does.
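
The per-model averaging described above can be sketched in a few lines of Python. This is a hypothetical illustration, not the project's actual server code: the function name `granted_credit`, the pooled way the average is formed (total claims divided by total models) and the example numbers are all assumptions.

```python
def granted_credit(previous_reports, models_completed):
    """Estimate granted credit from earlier reports for the same kind of task.

    previous_reports: list of (claimed_credit, models_completed) pairs from
    tasks reported earlier. Returns None if there is no history to average.
    """
    total_claimed = sum(claimed for claimed, _ in previous_reports)
    total_models = sum(models for _, models in previous_reports)
    if total_models == 0:
        return None
    average_per_model = total_claimed / total_models
    return average_per_model * models_completed


# Made-up history: two earlier tasks that claimed 40 and 60 credits
# for 20 and 30 models, i.e. 2 credits per model on average.
history = [(40.0, 20), (60.0, 30)]

# A task that happens to run through 60 quick models is granted far more
# than a task that completes 14 models, whatever either one claimed.
print(granted_credit(history, 60))
print(granted_credit(history, 14))
```

Under this assumption a task's own claim does not enter its grant at all, which is why a task dominated by one long model can be granted much less than it claimed, and a task of many quick models much more.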
____________
Rosetta Moderator: Mod.Sense

Sarel

Joined: May 11 06
Posts: 51
ID: 81994
Credit: 81,712
RAC: 0
Message 65044 - Posted 19 Jan 2010 19:33:12 UTC

Thanks RosettaMod for the clarification!

On another note, I've isolated why on restart the *gbn* runs report starting over from model 1. The fix for this will be part of the next update of the minirosetta application. Despite the confusion, the models that we get are unharmed and credit is allocated correctly.

Many thanks to the users who reported this for another bug catch!
____________

macko Profile

Joined: Jun 25 09
Posts: 32
ID: 323638
Credit: 152,285
RAC: 0
Message 65047 - Posted 20 Jan 2010 13:28:01 UTC


Hi

These WUs, "8gbnnotyr" and the older "dock" types, won't be listed on the results pages?

With regards
____________

Sarel Profile

Joined: May 11 06
Posts: 51
ID: 81994
Credit: 81,712
RAC: 0
Message 65048 - Posted 20 Jan 2010 16:57:44 UTC - in response to Message ID 65047.

Could you elaborate on what you're seeing? These types of jobs are treated like any others in this respect.


Hi

These WUs, "8gbnnotyr" and the older "dock" types, won't be listed on the results pages?

With regards


____________

macko Profile

Joined: Jun 25 09
Posts: 32
ID: 323638
Credit: 152,285
RAC: 0
Message 65049 - Posted 20 Jan 2010 21:07:17 UTC - in response to Message ID 65048.

Could you elaborate on what you're seeing? These types of jobs are treated like any others in this respect.


Hi

These WUs, "8gbnnotyr" and the older "dock" types, won't be listed on the results pages?

With regards


Hi

There were some WUs not shown on the results page. Here is a small (incomplete) collection from the last months:
aTt13
histone
1 famA
foldit WUs
denovo_design_rossmann2x3_flxbb (really RAM-eating and long-running ones)
NeR103A
CGR26A
and finally these two from 2010: CtR69A_2KRU_BOINC_ABRELAX, 3gbn bla-bla&gz_dock

And now the 8gbnnotyr WUs seem to have a similar fate: crunching only for credit.

____________

P . P . L .

Joined: Aug 20 06
Posts: 581
ID: 105843
Credit: 4,864,105
RAC: 0
Message 65050 - Posted 21 Jan 2010 6:47:12 UTC

This only ran for 19 min, no idea what happened.

boinc.loopbuild_threading_hb_2kruA_IGNORE_THE_REST_17084_3403_0

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=284668104

# cpu_run_time_pref: 14400
======================================================
DONE :: 5 starting structures 1201 cpu seconds
This process generated 5 decoys from 5 attempts
======================================================

BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down cleanly ...
called boinc_finish

Over__Validate error__Done__1,152.13
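
For anyone comparing reports like this one, here is a small Python sketch that pulls the run summary out of a pasted stderr fragment and relates the CPU time to the runtime preference. The regular expressions are assumptions based only on the lines quoted above, and `summarize` is a hypothetical helper, not part of BOINC or Rosetta.

```python
import re

# Patterns inferred from the log lines quoted in this post (assumption).
PREF_RE = re.compile(r"#\s*cpu_run_time_pref:\s*(\d+)")
DONE_RE = re.compile(
    r"DONE\s*::\s*(\d+)\s+starting structures\s+([\d.]+)\s+cpu seconds"
)
DECOY_RE = re.compile(r"generated\s+(\d+)\s+decoys from\s+(\d+)\s+attempts")


def summarize(stderr_text):
    """Return a dict of run figures, or None if a line is missing."""
    pref = PREF_RE.search(stderr_text)
    done = DONE_RE.search(stderr_text)
    decoys = DECOY_RE.search(stderr_text)
    if not (pref and done and decoys):
        return None
    target = int(pref.group(1))
    cpu = float(done.group(2))
    return {
        "target_seconds": target,
        "cpu_seconds": cpu,
        "decoys": int(decoys.group(1)),
        "attempts": int(decoys.group(2)),
        "fraction_of_target": cpu / target,
    }


sample = """# cpu_run_time_pref: 14400
DONE :: 5 starting structures 1201 cpu seconds
This process generated 5 decoys from 5 attempts
"""
report = summarize(sample)
print(report)
```

For the task above this gives 1201 CPU seconds against a 14400 second preference, so the run used under a tenth of its target time before finishing.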

____________


AMD_is_logical

Joined: Dec 20 05
Posts: 299
ID: 41207
Credit: 31,460,681
RAC: 0
Message 65053 - Posted 21 Jan 2010 15:24:13 UTC

Here's a couple of relaxopt_grow WUs that exited after a few seconds with the error:

ERROR: LoopRebuild::ERROR Loop definition out of boundary
ERROR:: Exit from: src/protocols/loops/Loops.cc line: 595
BOINC:: Error reading and gzipping output datafile: default.out

http://boinc.bakerlab.org/rosetta/result.php?resultid=312109858
http://boinc.bakerlab.org/rosetta/result.php?resultid=312068049

P . P . L .

Joined: Aug 20 06
Posts: 581
ID: 105843
Credit: 4,864,105
RAC: 0
Message 65058 - Posted 22 Jan 2010 6:49:19 UTC

This ran for 1 hr 30 min, then fell over.

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=284866813

tyrsim_3gbn_2c2p_20Jan2010_17119_14_0

<message>
process exited with code 193 (0xc1, -63)
</message>

# cpu_run_time_pref: 14400
SIGSEGV: segmentation violation
Stack trace (64 frames):
[0x96c49b3]
[0x96ee888]
[0xb7fad420]

JUST A FEW OF THEM.

____________


TomaszPawel

Joined: Apr 28 07
Posts: 54
ID: 170716
Credit: 2,791,145
RAC: 0
Message 65068 - Posted 22 Jan 2010 18:51:02 UTC - in response to Message ID 65058.

Error
Error
____________
WWW of Polish National Team - Join! Crunch! Win!

Admin

Joined: Apr 13 07
Posts: 42
ID: 164784
Credit: 260,782
RAC: 0
Message 65069 - Posted 22 Jan 2010 19:03:58 UTC

Compute Error

relaxopt_grow.1bk2.1bk2.IGNORE_THE_REST.S_00066_0000013_0_0000_noncon_00066.pdb.JOB_16957_8

ERROR: LoopRebuild::ERROR Loop definition out of boundary

ERROR:: Exit from: ..\..\src\protocols\loops\Loops.cc line: 595
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

macko Profile

Joined: Jun 25 09
Posts: 32
ID: 323638
Credit: 152,285
RAC: 0
Message 65070 - Posted 22 Jan 2010 19:13:26 UTC - in response to Message ID 65053.

Here's a couple of relaxopt_grow WUs that exited after a few seconds with the error:

ERROR: LoopRebuild::ERROR Loop definition out of boundary
ERROR:: Exit from: src/protocols/loops/Loops.cc line: 595
BOINC:: Error reading and gzipping output datafile: default.out

http://boinc.bakerlab.org/rosetta/result.php?resultid=312109858
http://boinc.bakerlab.org/rosetta/result.php?resultid=312068049


Same error, same wus relaxopt_grow.1ctf.1ctf.
____________

svincent

Joined: Dec 30 05
Posts: 202
ID: 44923
Credit: 4,404,794
RAC: 6,286
Message 65072 - Posted 22 Jan 2010 19:48:39 UTC

Another "Loop definition out of boundary" error as reported by others. On Mac OS X 10.6.

Task : 312278520
Name : relaxopt_grow.1c9o.1c9o.IGNORE_THE_REST.S_00082_0000671_0.pdb.JOB_16963_8_0

ERROR: LoopRebuild::ERROR Loop definition out of boundary

ERROR:: Exit from: src/protocols/loops/Loops.cc line: 595
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

</stderr_txt>

Mad_Max

Joined: Dec 31 09
Posts: 150
ID: 365007
Credit: 5,175,766
RAC: 6,775
Message 65077 - Posted 23 Jan 2010 14:49:56 UTC - in response to Message ID 65034.

Hello,

based on the reports of validator issues, David Kim has now fixed the validator. He also asked me to remind people that credit is granted based on the client's claimed credit, regardless of validator results.

Let us know if you see more such problems.

Thanks, Sarel.


The last 20 tasks on my computer completed without any validation errors.
(Among them were *gbnnotyr* tasks and tasks that were restarted partway through execution.)
So it seems this problem is solved.

Mad_Max

Joined: Dec 31 09
Posts: 150
ID: 365007
Credit: 5,175,766
RAC: 6,775
Message 65078 - Posted 23 Jan 2010 15:00:24 UTC - in response to Message ID 65044.

Thanks RosettaMod for the clarification!

On another note, I've isolated why on restart the *gnb* runs report starting over from model 1. The fix for this will be part of the next update of the minirosetta application. Despite the confusion, the models that we get are unharmed and credit is allocated correctly.

Many thanks to the users who reported this for another bug catch!


Besides the *gnb* WUs, the bug of showing only 1 model after a restart occurs in many other types of tasks. There it does not seem to affect the results sent to the server, only the display in the graphics. So it is not a significant error. Does it make sense to report such cases?

P . P . L .
Avatar

Joined: Aug 20 06
Posts: 581
ID: 105843
Credit: 4,864,105
RAC: 0
Message 65081 - Posted 23 Jan 2010 22:45:11 UTC

Here's another one of these, zero run time.

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=284355878

homopt_nat2.t331_.t331_.IGNORE_THE_REST.S_00004_0000011_06.pdb_00004.pdb.JOB_16832_30_1

<message>
process exited with code 1 (0x1, -255)
</message>

ERROR: No values of the appropriate type specified for multi-valued option -loops:loop_file

____________


Evan

Joined: Dec 23 05
Posts: 268
ID: 42505
Credit: 402,585
RAC: 0
Message 65091 - Posted 24 Jan 2010 21:45:03 UTC

This one failed after about 40 seconds
cst2.loopbuild_threading_hb_i1705_IGNORE_THE_REST_17160_389_0

- exit code -1073741819 (0xc0000005)

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x0054FC53 read attempt to address 0xFFFFFFC0
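
The negative exit code and the hex value in this report are the same 32-bit number seen as signed and unsigned. A minimal Python sketch of the conversion follows; the helper name is hypothetical, and the lookup table only contains the two status names that appear in logs posted in this thread.

```python
# Status names taken from the logs quoted in this thread; the table is
# deliberately incomplete.
KNOWN_STATUS = {
    0xC0000005: "Access Violation",
    0x80000003: "Breakpoint Encountered",
}


def describe_exit_code(code):
    """Show a signed 32-bit exit code next to its unsigned hex form."""
    unsigned = code & 0xFFFFFFFF  # reinterpret as unsigned 32-bit
    name = KNOWN_STATUS.get(unsigned, "unknown")
    return f"{code} = 0x{unsigned:08x} ({name})"


print(describe_exit_code(-1073741819))
# -1073741819 = 0xc0000005 (Access Violation)
```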


____________

P . P . L .

Joined: Aug 20 06
Posts: 581
ID: 105843
Credit: 4,864,105
RAC: 0
Message 65093 - Posted 25 Jan 2010 1:59:23 UTC - in response to Message ID 65044.
Last modified: 25 Jan 2010 1:59:57 UTC

Thanks RosettaMod for the clarification!

On another note, I've isolated why on restart the *gnb* runs report starting over from model 1. The fix for this will be part of the next update of the minirosetta application. Despite the confusion, the models that we get are unharmed and credit is allocated correctly.

Many thanks to the users who reported this for another bug catch!

=================================================================================

I'm assuming that these two tasks have been affected by this bug. Both had a few hundred models showing in the graphics before the rig was rebooted, then they're gone. The credits are O.K.

tyrsim_3gbn_2znr_20Jan2010_17119_66_0

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=285166061

# cpu_run_time_pref: 14400
======================================================
DONE :: 2 starting structures 13670.3 cpu seconds
This process generated 2 decoys from 2 attempts
======================================================

--------------------------------------------------------------
tyrsim_3gbn_1s2x_20Jan2010_17119_291_0

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=285312382

# cpu_run_time_pref: 14400
======================================================
DONE :: 2 starting structures 9627.95 cpu seconds
This process generated 2 decoys from 2 attempts
======================================================
____________


Rabinovitch Profile

Joined: Apr 28 07
Posts: 28
ID: 170444
Credit: 1,483,610
RAC: 2,997
Message 65095 - Posted 25 Jan 2010 3:44:38 UTC - in response to Message ID 64974.

New app working well. And it seems that the WUs now need less RAM (about 100 MB per WU). Is that true? If it is, then maybe this is a step towards Rosetta's GPU client? :-)


Well, now I see two WUs being processed, and one is consuming about 510 MB of RAM, and the other 480. I like such heavy WUs, give me more please! :-)

ofry

Joined: Jan 21 10
Posts: 4
ID: 367430
Credit: 164,475
RAC: 0
Message 65103 - Posted 25 Jan 2010 16:17:19 UTC

Hello :)

I have an "antimeatball" with energy values like 2K, 16K, etc.
Sample screenshot:


This bug might occur in tasks such as "boinc_filtered_loopbuild_threading_"

And these tasks usually run for about 5 hours, even though "Target CPU run time" is not set in my preferences (default 3 hours).

P.S. Sorry for my English, I speak Russian.

Admin

Joined: Apr 13 07
Posts: 42
ID: 164784
Credit: 260,782
RAC: 0
Message 65104 - Posted 25 Jan 2010 16:29:02 UTC

Just thought I'd add to the post above mine. I can also confirm that energy levels for t311 (same WU) have been sky high, with some values like 76053423 and RMSD running around 700. Thought it was strange, so I'd let you guys know.

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3389
ID: 106194
Credit: 0
RAC: 0
Message 65106 - Posted 25 Jan 2010 17:08:09 UTC

To help clarify, ofry has an anti-meatball. I don't see any problem in their screenshot though.

Admin, ofry's screenshot is of protein t374, so if you are doing t311, then it is a different protein... although perhaps using the same methods to study it.

Admin, how long would you see such high numbers? I'd think they'd settle down pretty quickly.

I don't believe these are Sarel's new ones, so you can see why he's working on the approach that makes that initial 60-100 second survey of a given model and then moves on to something more promising much of the time.

These proteins are very large, so when they are out of position and perhaps are nowhere near the natural conformation, the numbers can get pretty high... but 76m!?
____________
Rosetta Moderator: Mod.Sense

Mad_Max

Joined: Dec 31 09
Posts: 150
ID: 365007
Credit: 5,175,766
RAC: 6,775
Message 65108 - Posted 25 Jan 2010 17:47:16 UTC - in response to Message ID 65103.

Hello :)

I have an "antimeatball" with energy values like 2K, 16K, etc.
Sample screenshot:
This bug might occur in tasks such as "boinc_filtered_loopbuild_threading_"

And these tasks usually run for about 5 hours, even though "Target CPU run time" is not set in my preferences (default 3 hours).
P.S. Sorry for my English, I speak Russian.


I haven't seen ones like this before (though I quite often see the protein crumple into a ball).
In Russian I would call this an "explosion at a pasta factory" :)
In any case, it's not certain that this is a problem. It may simply be one of the early stages of modeling, since modeling initially starts with the protein stretched out into one long "rope". And unlike folding@home, the intermediate stages of modeling do not proceed exactly (the way it happens in nature), but approximately and quite chaotically. So the intermediate shapes can be very bizarre and far from the original.
This comes from the different goals of the projects. In Folding the scientists want to know HOW a protein folds from a chain into its natural form/structure. In Rosetta the goal is to determine only the final spatial structure of the protein (or the interaction of 2 proteins) from its known "amino acid formula", but to do so orders of magnitude (tens or hundreds of times) faster than Folding, with its brute-force modeling (at the level of individual atoms with a step on the order of 1 picosecond).

The "ball" (meatball), however, is a problem, since there seems to be some error there: the modeling overshoots the natural form and simply starts crumpling the protein into a ball, moving further and further from the original (rather than approaching it).

ofry

Joined: Jan 21 10
Posts: 4
ID: 367430
Credit: 164,475
RAC: 0
Message 65109 - Posted 25 Jan 2010 17:58:21 UTC

Mad_Max, thanks for the translation.

Now in Russian :)

The same problem exists on t303 too (but only on boinc_filtered_loopbuild_threading_ task types; the others don't have it). I just didn't post similar screenshots.

"translate"

Admin, ofry's screenshot is of protein t374, so if you are doing t311, then it is a different protein... although perhaps using the same methods to study it.

This problem shows up in many proteins, e.g. t303 too. But with other methods (not boinc_filtered_loopbuild_threading_) I don't see it. Maybe this method is buggy.

ofry

Joined: Jan 21 10
Posts: 4
ID: 367430
Credit: 164,475
RAC: 0
Message 65110 - Posted 25 Jan 2010 18:09:50 UTC

http://boinc.bakerlab.org/rosetta/rah_results.php?BatchID=16901&SubBatchName=t364__boinc_filtered_loopbuild_threading_cst_relax_tex_&UserID=367430

# t364__boinc_filtered_loopbuild_threading_cst_relax_tex_
5.126 3.418 5627.5 -384.673 17 14390 2010-01-25

My "best score" = 5627.5!

Admin

Joined: Apr 13 07
Posts: 42
ID: 164784
Credit: 260,782
RAC: 0
Message 65111 - Posted 25 Jan 2010 18:14:17 UTC

Mod,

I checked it last night, and it went through fine this morning, but the values were definitely very high, either 7.6 million or 760k. The protein wasn't even in the window, so I know it was a high value. It doesn't seem to be occurring with any other such WU right now, but I'll keep an eye out.

Mike Tyka

Joined: Oct 20 05
Posts: 96
ID: 5612
Credit: 2,190
RAC: 0
Message 65113 - Posted 26 Jan 2010 0:36:08 UTC - in response to Message ID 65111.

If it's the high energy of 4K you're worried about - that's not unusual when runs are submitted with constraints - looks all good to me ..

Mike

____________
http://beautifulproteins.blogspot.com/
http://www.miketyka.com/

AMD_is_logical

Joined: Dec 20 05
Posts: 299
ID: 41207
Credit: 31,460,681
RAC: 0
Message 65118 - Posted 26 Jan 2010 16:40:16 UTC

Each of these cl1 WUs gave an error for both crunchers:

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=285818853
http://boinc.bakerlab.org/rosetta/workunit.php?wuid=285786792

svincent

Joined: Dec 30 05
Posts: 202
ID: 44923
Credit: 4,404,794
RAC: 6,286
Message 65121 - Posted 26 Jan 2010 17:44:37 UTC

A couple of tasks that failed early on Mac OS X 10.6 with the same error

SIGPIPE: write on a pipe with no reader
0 0x006e2839 SIGPIPE: write on a pipe with no reader


cl1.1cc8.1cc8.IGNORE_THE_REST.c.0.20.pdb.pdb.JOB_17236_2_0
cl1.1s12.1s12.IGNORE_THE_REST.c.3.0.pdb.pdb.JOB_17313_1_0

Evan

Joined: Dec 23 05
Posts: 268
ID: 42505
Credit: 402,585
RAC: 0
Message 65122 - Posted 26 Jan 2010 23:17:21 UTC

Add me to the list with these two that failed within 17 seconds

cl1.1enh.1enh.IGNORE_THE_REST.c.2.32.pdb.pdb.JOB_17243_1
cl1.1enh.1enh.IGNORE_THE_REST.c.2.21.pdb.pdb.JOB_17243_1_0
____________

Admin

Joined: Apr 13 07
Posts: 42
ID: 164784
Credit: 260,782
RAC: 0
Message 65124 - Posted 27 Jan 2010 0:05:13 UTC

Compute error occurred - Exit status -1073741819 (0xc0000005). Debug info is far too advanced for me to get any info from, so a team member will need to look at it. Occurred with cl1.2cmx.2cmx.IGNORE_THE_REST.c.0.25.pdb.pdb.JOB_17322_1_1.

Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x006D2D46 read attempt to address 0x00000000

Link: http://boinc.bakerlab.org/rosetta/result.php?resultid=313485467.

Wingman also received compute error with same WU.

ofry

Joined: Jan 21 10
Posts: 4
ID: 367430
Credit: 164,475
RAC: 0
Message 65131 - Posted 27 Jan 2010 20:03:11 UTC
Last modified: 27 Jan 2010 20:05:28 UTC

Compute errors in these WUs:

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=285897731
http://boinc.bakerlab.org/rosetta/workunit.php?wuid=286008240

("Access Violation" type)

(errors after 15-42 sec. of CPU time)

Admin

Joined: Apr 13 07
Posts: 42
ID: 164784
Credit: 260,782
RAC: 0
Message 65132 - Posted 27 Jan 2010 20:58:43 UTC

Unhandled Exception Error - cl1.1ail.1ail.IGNORE_THE_REST.c.4.22.pdb.pdb.JOB_17227_9_1

http://boinc.bakerlab.org/rosetta/result.php?resultid=313773510

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x006D2D46 read attempt to address 0x00000000

Debug Info in link.. failed at 21 seconds

Mike_Solo

Joined: Nov 16 09
Posts: 2
ID: 358616
Credit: 67,261
RAC: 0
Message 65136 - Posted 28 Jan 2010 10:01:56 UTC - in response to Message ID 65015.

Looks like Mike Solo has 3 machines:
One WinXP using BOINC version 6.10.18
One WinXP using BOINC version 6.10.18
One WinServer 2003 using BOINC version 6.10.18

Yes, sorry, missed the OS info.
I introduced some Linux machines (Debian) instead of MS Win.
All looks stable under Linux.

l_mckeon

Joined: Jun 5 07
Posts: 44
ID: 182403
Credit: 180,717
RAC: 0
Message 65147 - Posted 30 Jan 2010 0:01:05 UTC

"This app update includes a fix for checkpointing.

Please report issues and bugs here!"

I had a task yesterday that restarted from model one when the computer was switched off then on again.

The task had been saving checkpoints.

The upload was >200kB and the task ended a few minutes after restarting. I don't know the task number but I'm fairly sure it was an lr5 task.

Admin

Joined: Apr 13 07
Posts: 42
ID: 164784
Credit: 260,782
RAC: 0
Message 65148 - Posted 30 Jan 2010 3:03:01 UTC

Unhandled Exception Error - cl1.1ail.1ail.IGNORE_THE_REST.c.6.3.pdb.pdb.JOB_17227_4_0
http://boinc.bakerlab.org/rosetta/result.php?resultid=313910991

WU froze at 61% complete. I found it and had to abort it after 10 hours of run time.

- Unhandled Exception Record -
Reason: Breakpoint Encountered (0x80000003) at address 0x75E31AF3

Debug info in link as usual

SFCC

Joined: Sep 3 09
Posts: 10
ID: 342915
Credit: 227,659
RAC: 0
Message 65149 - Posted 30 Jan 2010 4:27:05 UTC

I am running Rosetta@home (along with some other projects) on four machines, three running XP and one running Win7. Two of the machines have AMD 64 single core processors and two have AMD 64 dual core processors. The Rosetta app is 'rosetta mini 2.05' and the BOINC version is 6.10.18. About once a week or so I will notice that the Rosetta WU running on at least one of the machines has been running longer than usual and using little or no CPU time. I abort it, and the next WU runs fine. I have not seen this occur with any of my other projects (Climate Prediction, Malaria Control or World Community Grid).

transient

Joined: Sep 30 06
Posts: 376
ID: 115553
Credit: 7,834,811
RAC: 4,046
Message 65152 - Posted 30 Jan 2010 11:35:59 UTC - in response to Message ID 65149.

I am running Rosetta@home (along with some other projects) on four machines, three running XP and one running Win7. Two of the machines have AMD 64 single core processors and two have AMD 64 dual core processors. The Rosetta app is 'rosetta mini 2.05' and the BOINC version is 6.10.18. About once a week or so I will notice that the Rosetta WU running on at least one of the machines has been running longer than usual and using little or no CPU time. I abort it, and the next WU runs fine. I have not seen this occur with any of my other projects (Climate Prediction, Malaria Control or World Community Grid).


Try suspending/resuming, or even entirely turn-off/restart BOINC, before aborting the WU. I have found this works often, not always, for me.

____________

svincent

Joined: Dec 30 05
Posts: 202
ID: 44923
Credit: 4,404,794
RAC: 6,286
Message 65155 - Posted 30 Jan 2010 21:38:29 UTC - in response to Message ID 65152.


Try suspending/resuming, or even entirely turn-off/restart BOINC, before aborting the WU. I have found this works often, not always, for me.

Even if this works, it shouldn't be necessary to babysit BOINC/Rosetta in this way. This hanging certainly seems to be a widespread issue but one that only affects Windows in its various incarnations. The fact that it's irreproducible means a fix may be some time in coming but I hope the project team find it soon.

Mad_Max

Joined: Dec 31 09
Posts: 150
ID: 365007
Credit: 5,175,766
RAC: 6,775
Message 65157 - Posted 31 Jan 2010 1:14:32 UTC
Last modified: 31 Jan 2010 1:17:39 UTC

Some tasks (for example, type abinitio_relax_homfrag_natfrag ....) make a lot of steps, up to 200000 - 400000 for 1 model. Is this normal?

And on another type of job (critStubs_profiled_1dnA_...) one task got abnormally small credit (1.97 cr and 9 decoys for 7k CPU seconds): http://boinc.bakerlab.org/rosetta/result.php?resultid=314040179
Other WUs of the same type were credited normally (about ~20 cr and ~80 decoys for the same CPU time).
Is this a bug with partial loss of the calculation results? Or is crediting just uneven in this type of task? (As in the *gbnnotyr* tasks, which combine "small" and "huge" models in the same type of task.)
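
A quick way to check the figures above is to compare credit per decoy rather than credit per task. The sketch below uses only the numbers quoted in this post; the "typical" values are the rough ~20 cr and ~80 decoys, so the result is indicative only, and the helper name is hypothetical.

```python
def credit_per_decoy(credit, decoys):
    """Credit earned per completed decoy (model)."""
    return credit / decoys


low_task = credit_per_decoy(1.97, 9)       # the task with small credit
typical_task = credit_per_decoy(20.0, 80)  # rough figures for other WUs

print(f"low task:     {low_task:.3f} cr/decoy")
print(f"typical task: {typical_task:.3f} cr/decoy")
print(f"ratio:        {low_task / typical_task:.2f}")
```

The two per-decoy rates come out close to each other (about 0.22 versus 0.25), which would be consistent with credit being granted per model and the low-credit task simply producing far fewer decoys in the same CPU time.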

transient

Joined: Sep 30 06
Posts: 376
ID: 115553
Credit: 7,834,811
RAC: 4,046
Message 65158 - Posted 31 Jan 2010 9:31:05 UTC - in response to Message ID 65155.



Try suspending/resuming, or even entirely turn-off/restart BOINC, before aborting the WU. I have found this works often, not always, for me.


Even if this works, it shouldn't be necessary to babysit BOINC/Rosetta in this way. This hanging certainly seems to be a widespread issue but one that only affects Windows in its various incarnations. The fact that it's irreproducible means a fix may be some time in coming but I hope the project team find it soon.


The fact that someone noticed this occurring suggests babysitting to begin with.

____________

Sarel Profile

Joined: May 11 06
Posts: 51
ID: 81994
Credit: 81,712
RAC: 0
Message 65159 - Posted 31 Jan 2010 20:32:50 UTC - in response to Message ID 65157.

Hello, yes! The critstubs jobs are of the same sort of unevenly crediting trajectories as the *gbnnotyr*. You can have a look at Eva-Maria Strauch's message on the Protein-interface design thread for details.

Some tasks (for example, type abinitio_relax_homfrag_natfrag ....) make a lot of steps, up to 200000 - 400000 for 1 model. Is this normal?

And on another type of job (critStubs_profiled_1dnA_...) one task got abnormally small credit (1.97 cr and 9 decoys for 7k CPU seconds): http://boinc.bakerlab.org/rosetta/result.php?resultid=314040179
Other WUs of the same type were credited normally (about ~20 cr and ~80 decoys for the same CPU time).
Is this a bug with partial loss of the calculation results? Or is crediting just uneven in this type of task? (As in the *gbnnotyr* tasks, which combine "small" and "huge" models in the same type of task.)


____________

Sarel Profile

Joined: May 11 06
Posts: 51
ID: 81994
Credit: 81,712
RAC: 0
Message 65160 - Posted 31 Jan 2010 21:43:31 UTC - in response to Message ID 65159.

Rosetta @ Home has produced many very high-quality designs for our Protein-interface design team! So we're likely to submit many more jobs to Rosetta @ Home. To help you recognize these jobs, we'll add a _Protein_Interface_Design_ note to every job name that is related to these jobs from now on. This way you'll be able to follow these jobs. I also hope that this will help you see where the variable-credit issue is coming from more easily.

Hello, yes! The critstubs jobs are of the same sort of unevenly crediting trajectories as the *gbnnotyr*. You can have a look at Eva-Maria Strauch's message on the Protein-interface design thread for details.

Some tasks (for example, type abinitio_relax_homfrag_natfrag ....) make a lot of steps, up to 200000 - 400000 for 1 model. Is this normal?

And on another type of job (critStubs_profiled_1dnA_...) one task got abnormally small credit (1.97 cr and 9 decoys for 7k CPU seconds): http://boinc.bakerlab.org/rosetta/result.php?resultid=314040179
Other WUs of the same type were credited normally (about ~20 cr and ~80 decoys for the same CPU time).
Is this a bug with partial loss of the calculation results? Or is crediting just uneven in this type of task? (As in the *gbnnotyr* tasks, which combine "small" and "huge" models in the same type of task.)



____________

fredmeyer2470

Joined: Jun 6 09
Posts: 1
ID: 319967
Credit: 1,741,466
RAC: 0
Message 65161 - Posted 31 Jan 2010 23:20:50 UTC

The Rosetta application is spinning its wheels. It is continually running a task even though the task is 100% complete. There is another task to run, but Rosetta won't switch to it.

Mad_Max

Joined: Dec 31 09
Posts: 150
ID: 365007
Credit: 5,175,766
RAC: 6,775
Message 65162 - Posted 1 Feb 2010 3:14:03 UTC

To Sarel:
Thanks for the explanation.

And what about this?
> Some tasks (for example, type abinitio_relax_homfrag_natfrag ....) make a great many steps, up to 200000 - 400000 for 1 model. Is this normal?

And at the same time, another note: it seems that jobs of this type: resa_sel_core_1.5_low200_beta_low200_nostart_texcst_05_hb_t328__IGNORE_THE_REST_17378_267_0 ignore the target CPU time. For example, this WU calculated 1 model in about 2.5 hours (already longer than the target time), but after the 1st model, instead of sending the result, it started calculating a 2nd model. Total: 18850 seconds vs. cpu_run_time_pref = 7200 seconds.
In this example all ended well, but in other circumstances this could exceed cpu_run_time_pref by more than 3 times, trigger the watchdog, and lose the results. In addition, some members may think that the task is stuck and abort it...

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3389
ID: 106194
Credit: 0
RAC: 0
Message 65165 - Posted 1 Feb 2010 16:27:56 UTC

Max, the watchdog should be expected to kick in if a model exceeds the target runtime by more than 4 hours. And so with a short 2 hour preference, you will tend to see much more variation in runtimes than if you had a longer preference.

However, I would expect that once the target runtime is exceeded, and a model is completed, work on the WU would end and it would be reported back. So if you have a 2hr preference, and the first model took more than 2hrs to complete, I would have expected the WU to end once the first model was completed. In fact, if the first model took more than an hour, the predicted completion time of the second model would have exceeded your preference, and so I would expect that scenario to end after the first model as well.

Have you changed your runtime preference? Perhaps this machine hadn't received the updated preference yet? (looking at the WU you linked, it appears to have had the 7200sec preference during the entire run). So, it sounds like your expectations are correct. More observations would be helpful. Reference to the specific WU (as you've done) is very helpful.
____________
Rosetta Moderator: Mod.Sense

AMD_is_logical

Joined: Dec 20 05
Posts: 299
ID: 41207
Credit: 31,460,681
RAC: 0
Message 65169 - Posted 1 Feb 2010 21:39:59 UTC

A couple of t287__boinc_filtered_loopbuild_threading_cst_relax_tex_IGNORE_THE_REST_16901 WUs on two different Linux machines failed after a few seconds claiming "process got signal 11".

http://boinc.bakerlab.org/rosetta/result.php?resultid=314826769
http://boinc.bakerlab.org/rosetta/result.php?resultid=314751622

Mad_Max

Joined: Dec 31 09
Posts: 150
ID: 365007
Credit: 5,175,766
RAC: 6,775
Message 65170 - Posted 2 Feb 2010 1:09:41 UTC

To Mod.Sense:
Thanks for the clarification on the watchdog. Previously I had seen it kick in after 6 hours of calculation and thought that it fired after exceeding CPU TT x 3 (2h * 3 = 6h in my case). So in fact the correct formula is CPU TT + 4h, right? (In my case it just happens to give the same result: 2h + 4h = 6h.)

In fact, if the first model took more than an hour, the predicted completion time of the second model would have exceeded your preference, and so I would expect that scenario to end after the first model as well.

Yes, it usually does so. Here's an example of such a task: http://boinc.bakerlab.org/rosetta/result.php?resultid=313861637
Calculation of the 1st model took 5145 sec and the program ended processing, because a second model would exceed the CPU TT (5145 * 2 = 10290 > 7200).
Or another example: http://boinc.bakerlab.org/rosetta/result.php?resultid=314455813
Calculation of the two models took 4995 sec and the program ended processing, because a third model would exceed the CPU TT ((4995 / 2) * 3 = 7492.5 > 7200).
In these cases (and most others) the logic of the program works correctly.
But in the example above, this algorithm seems to fail.
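
The arithmetic in the two examples above can be written as a small Python check. This is only a sketch of the rule as inferred from those examples (average time per finished model, projected over one more model); the function name is hypothetical and the real application logic may differ.

```python
def should_start_next_model(cpu_seconds_so_far, models_done, target_seconds):
    """Decide whether one more model is predicted to fit the CPU target.

    Assumes the next model takes the average time of those already done.
    """
    if models_done == 0:
        return True  # nothing finished yet, so always try the first model
    average = cpu_seconds_so_far / models_done
    projected_total = average * (models_done + 1)
    return projected_total <= target_seconds


TARGET = 7200  # cpu_run_time_pref of 2 hours, as in the examples above

print(should_start_next_model(5145, 1, TARGET))  # 10290 > 7200
print(should_start_next_model(4995, 2, TARGET))  # 7492.5 > 7200
print(should_start_next_model(5100, 6, TARGET))  # hypothetical: 5950 <= 7200
```

The first two calls return False, matching the tasks that stopped. The third (made-up) case returns True, and if that extra model then turns out to be a long one, the task overruns its target, which is one way the behaviour reported above could arise.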

Have you changed your runtime preference? Perhaps this machine hadn't received the updated preference yet? (looking at the WU you linked, it appears to have had the 7200sec preference during the entire run). So, it sounds like your expectations are correct. More observations would be helpful. Reference to the specific WU (as you've done) is very helpful.

No, I have not changed the runtime preference in the last 2 weeks.
I have no more recent examples yet, but earlier I had 2 other tasks that also seemed to ignore the runtime preference. (Although I'm not 100% sure about it, because I did not follow their progress: perhaps just a 1st model was designed quickly, and the last took much longer than expected...)
Here they are:
cst2.loopbuild_threading_hb_i1496_IGNORE_THE_REST_17154_387_0
t364__boinc_filtered_loopbuild_threading_cst_all_tex_IGNORE_THE_REST_16902_4455_0

KnopperHarley

Joined: Nov 1 06
Posts: 2
ID: 126620
Credit: 788,560
RAC: 0
Message 65175 - Posted 2 Feb 2010 11:18:57 UTC

Hey there!

I got a problem with two tasks at the moment.
Yesterday I wondered why the remaining time was set to 30.5h per WU when I saw it, but I didn't care about it ... perhaps a test with more work per WU ... who knows. ;-)

But now one task is 'stuck' at 58.285% (+0.002% in more than 12h now) and the other one at 82.419% work done.
Runtimes for these WUs are at around 28h and 11.75h, and both elapsed and remaining time keep counting up and up -_- .

So I asked the task manager for help and it says the following:
these two WUs are using 218 MB and 300 MB of memory ... and not using ANY CPU resources any more ... 0% for both (CPU time is still counting up at 1 sec/sec).

Did something go wrong on my PC while crunching? Or what's the matter here?

Tasks
http://boinc.bakerlab.org/rosetta/workunit.php?wuid=286264240
http://boinc.bakerlab.org/rosetta/workunit.php?wuid=287080918


greetings

PS: both paused for now

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3389
ID: 106194
Credit: 0
RAC: 0
Message 65177 - Posted 2 Feb 2010 15:21:50 UTC
Last modified: 2 Feb 2010 15:24:05 UTC

Max:

perhaps just a 1st model was designed quickly, and the last took much longer than expected


Right and that is exactly what Sarel's new tasks do. Run 5 models in 5 minutes, then hit one that looks interesting and run for (for example) 80 minutes. Now 6 models have been completed in 85 minutes and with a 2hr runtime preference, we guess we can complete more models in the 2 hours. If that next one happens to be interesting as well, you run long.

Some of the improvements Sarel is making and working on will help the longer models run faster. So this should avoid some of those that were taking several hours for a single model, and make completion times closer to your preference.

Yes, Max. The watchdog USED to be based on 4 times the runtime preference. This was fine for short runtime preferences, but those with preference set to over 12 hours wanted to kill the task sooner and get on with others. Now it is runtime pref. plus 4 hrs, with the thought that all properly running models will complete in less than 4 hours.
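To make the difference between the two rules concrete, here is a small sketch. The function names are made up for illustration; only the two formulas (4 times the preference, and preference plus 4 hours) come from the description above.

```python
def watchdog_limit_old(runtime_pref_hours):
    """Old rule: the watchdog ends the task at 4 x the runtime preference."""
    return 4 * runtime_pref_hours


def watchdog_limit_new(runtime_pref_hours):
    """New rule: the watchdog ends the task at runtime preference + 4 hours."""
    return runtime_pref_hours + 4


# Compare both limits for a short, a long and a very long preference.
for pref in (2, 12, 24):
    print(pref, watchdog_limit_old(pref), watchdog_limit_new(pref))
```

With a 12 hour preference the old rule would let a stuck task run for 48 hours, while the new rule stops it after 16.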

The rest of what you say sounds right to me. So keep watching, and jotting notes. This information will be key to resolving this issue.

KnopperHarley
This is one of the few remaining problems that some people are seeing in version 2.05. It seems to be rather rare, and perhaps only to occur on Windows. I see you are running Win XP (I highlight that just to make it easy for the Project Team to see it, not because it should be a problem). I believe suspending and resuming the tasks seems to get them going again.

Could I ask you how your machine is configured? Specifically, do you leave tasks in memory while preempted? Do you run other BOINC projects? Do you allow BOINC to run 100% of CPU? Do you power your machine off each day?
____________
Rosetta Moderator: Mod.Sense

KnopperHarley

Joined: Nov 1 06
Posts: 2
ID: 126620
Credit: 788,560
RAC: 0
Message 65178 - Posted 2 Feb 2010 15:47:13 UTC

Uhm, well ...

I tried around a bit (restarted BOINC) and (you might guess): it works. ^^'

Cpu-time jumped back to 3h and 6h or something and it's using the cores again.
Seems like something really screwed up the Rosetta-apps while working.

So nevermind ... ignore my posting above. ;-)

I lost a bit of time, but the WUs are obviously (hopefully?!) undamaged and one has been completed in the meantime, so happy crunching again. \o/


greetings

PS: Would it make sense to send the WUs a second time to another participant to confirm the results ... just to be sure?!
Especially the second WU mentioned in my post above (probably more than 7,5h in the end) plus another WU with almost 6,75h
(t293__boinc_filtered_loopbuild_threading_cst_relax_tex_IGNORE_THE_REST_16901_4919)
that has been finished last night are, let's say ... (maybe not impossible but) 'unusual' (to me :-) ).

PPS: for the protocol *g*
- Leave applications in memory while suspended? no
- Rosetta + SETI (50:50)
- Use at most 100 percent of CPU time
- it's almost every day off for a period of time (except weekend once in a while)

Greg_BE Profile

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,969,735
RAC: 81
Message 65182 - Posted 2 Feb 2010 23:12:43 UTC

compute error
t323__boinc_filtered_loopbuild_threading_cst_lb_tex_IGNORE_THE_REST_16900_2006_0
http://boinc.bakerlab.org/rosetta/result.php?resultid=314347348

<core_client_version>6.10.18</core_client_version>
<![CDATA[
<message>
Maximum elapsed time exceeded
</message>
]]>

Greg_BE Profile

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,969,735
RAC: 81
Message 65183 - Posted 2 Feb 2010 23:14:45 UTC

compute error with unhandled exception dump
http://boinc.bakerlab.org/rosetta/result.php?resultid=310017128
homopt_fa_cstmc_1.t370_.t370_.IGNORE_THE_REST.S_00003_0000784_010.pdb.JOB_16898_23_0

Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Breakpoint Encountered (0x80000003) at address 0x7C90120E

l_mckeon

Joined: Jun 5 07
Posts: 44
ID: 182403
Credit: 180,717
RAC: 0
Message 65184 - Posted 3 Feb 2010 0:40:23 UTC

I aborted lr15clus_opt_.1ptq.1ptq.IGNORE_THE_REST.c.3.21.pdb.pdb.JOB_17471_3_0 after 50 minutes.

Stuck on model 1, step 0, with funny looking graphics.

I no longer have the patience to see how these turn out.

Evan

Joined: Dec 23 05
Posts: 268
ID: 42505
Credit: 402,585
RAC: 0
Message 65187 - Posted 3 Feb 2010 11:54:35 UTC - in response to Message ID 65184.

I aborted lr15clus_opt_.1ptq.1ptq.IGNORE_THE_REST.c.3.21.pdb.pdb.JOB_17471_3_0 after 50 minutes.

Stuck on model 1, step 0, with funny looking graphics.

I no longer have the patience to see how these turn out.

Instead of aborting just try closing and restarting Boinc. That often does the trick.
____________

John Hunt Profile

Joined: Sep 18 05
Posts: 446
ID: 455
Credit: 128,172
RAC: 0
Message 65189 - Posted 3 Feb 2010 15:20:10 UTC
Last modified: 3 Feb 2010 15:21:06 UTC

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=287053961
has been running now for 56 hrs and still only 57.019% complete.

Core2Quad Q6600 @ 2.4GHz & Windows XP Home.

Keep going or abort?

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3389
ID: 106194
Credit: 0
RAC: 0
Message 65190 - Posted 3 Feb 2010 16:50:20 UTC

Keep going or abort?


As Evan points out, often such conditions get reset if you suspend and resume the task, or end and restart BOINC...

But first, I'd like to ask you to go to the advanced view, tasks tab, select the task that's been running so long, and then click the properties button that appears over on the left. There are three time figures there that I would like you to report:

CPU time at last checkpoint:
CPU time:
and Elapsed time:

It will take you a minute or so to jot that down, then close the window, and click again on the properties button for the task and see if the CPU time has changed at all.
____________
Rosetta Moderator: Mod.Sense

John Hunt Profile

Joined: Sep 18 05
Posts: 446
ID: 455
Credit: 128,172
RAC: 0
Message 65191 - Posted 3 Feb 2010 17:52:51 UTC
Last modified: 3 Feb 2010 18:12:49 UTC

O.K. I've suspended the WU and then re-started.

Here are the figures requested (when suspended) -
CPU time at last checkpoint: 02:05:26
CPU time: 02:05:27
and Elapsed time: 58:38:24

After re-start -
CPU time at last checkpoint: 02:05:26
CPU time: 02:10:22
and Elapsed time: 58:43:35

WU completed shortly afterwards with a computation error.

Thank you!
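For anyone who wants to check a long-running task the same way, the comparison of two readings can be sketched like this. It is only a rough illustration using the figures above; the helper names are made up, and the times are typed in by hand from the task properties window.

```python
def to_seconds(hms):
    """Convert an 'hh:mm:ss' string from the task properties to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def cpu_fraction(cpu_before, elapsed_before, cpu_after, elapsed_after):
    """Fraction of wall-clock time spent on the CPU between two readings."""
    cpu_delta = to_seconds(cpu_after) - to_seconds(cpu_before)
    wall_delta = to_seconds(elapsed_after) - to_seconds(elapsed_before)
    if wall_delta <= 0:
        raise ValueError("second reading must be later than the first")
    return cpu_delta / wall_delta


# Whole run up to the suspend: about 2 hours of CPU in over 58 hours elapsed.
print(to_seconds("02:05:27") / to_seconds("58:38:24"))

# Between the two readings after the restart the task is crunching again.
print(cpu_fraction("02:05:27", "58:38:24", "02:10:22", "58:43:35"))
```

A value near 0 between two readings means the task is sitting idle; a value near 1 means it is using a full core.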

Admin

Joined: Apr 13 07
Posts: 42
ID: 164784
Credit: 260,782
RAC: 0
Message 65195 - Posted 4 Feb 2010 1:09:32 UTC

Just took a look at my graphics and saw this. Is it normal? I've been watching it for a while now and it seems to be stuck on model 2, step 0. Any ideas on what I should do?

Admin

Joined: Apr 13 07
Posts: 42
ID: 164784
Credit: 260,782
RAC: 0
Message 65198 - Posted 4 Feb 2010 3:08:44 UTC

Strange, it seems to be fine now; you can disregard my earlier post.

Mad_Max

Joined: Dec 31 09
Posts: 150
ID: 365007
Credit: 5,175,766
RAC: 6,775
Message 65208 - Posted 4 Feb 2010 20:14:30 UTC
Last modified: 4 Feb 2010 20:24:06 UTC

I do not think that should be ignored. This type of task is behaving very strangely on my computer too. Here's an example where the protein is coiled into a ring (click to enlarge):



The model has been in this state for about 30 minutes already. Sometimes the ring begins to unfold, but then it rolls back into a ring.

Admin

Joined: Apr 13 07
Posts: 42
ID: 164784
Credit: 260,782
RAC: 0
Message 65211 - Posted 5 Feb 2010 0:41:07 UTC

It seems that if I give it some time it finds the protein structure again; it was quite strange. Also, I wanted to give a heads-up that I'm having a huge issue with the boinc_filtered_loopbuild_threading WUs. Most of the new ones I have received have stalled at about 5 percent and I've had to abort them. Are we any closer to fixing this issue? It seems to be getting worse. I'll give you some info on my current one though: protein: t385, CPU time at last checkpoint: 33:20, CPU time: 34:24, elapsed time: 14:21:01.

Admin

Joined: Apr 13 07
Posts: 42
ID: 164784
Credit: 260,782
RAC: 0
Message 65213 - Posted 5 Feb 2010 17:40:07 UTC

Access Violation Error - lr15clusfa_opt_.1bgf.1bgf.IGNORE_THE_REST.c.85.0.pdb.pdb.JOB_17562_3_0

Link: http://boinc.bakerlab.org/rosetta/result.php?resultid=315684378

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x006D2D46 read attempt to address 0x00000000

Debug info in link as usual - Wingman also had same error

P . P . L .

Joined: Aug 20 06
Posts: 581
ID: 105843
Credit: 4,864,105
RAC: 0
Message 65219 - Posted 6 Feb 2010 20:52:26 UTC

This one errored on Ubuntu x64 after 10sec.

lr15clusfa_opt_.1hz6.1hz6.IGNORE_THE_REST.c.3.21.pdb.pdb.JOB_17586_2_1

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=288116061

<core_client_version>6.2.14</core_client_version>
<![CDATA[
<message>
process exited with code 193 (0xc1, -63)
</message>

Watchdog active.
SIGSEGV: segmentation violation
Stack trace (8 frames):
[0x96c49b3]
[0x96ee888]
[0xffffe500]
[0x80a8721]
[0x808fcc1]
[0x804985f]
[0x974c15c]
[0x8048121]

Exiting...

</stderr_txt>


____________


Max DesGeorges

Joined: Oct 1 05
Posts: 35
ID: 2201
Credit: 942,527
RAC: 0
Message 65221 - Posted 7 Feb 2010 6:28:32 UTC

I have an “igfhum ProteinInterfaceDesign” that takes from 280 MB up to 1,2 GB of memory!! Is that normal?
If it is, I think that in future most of the people will not have enough memory to run Rosetta anymore…

____________

P . P . L .

Joined: Aug 20 06
Posts: 581
ID: 105843
Credit: 4,864,105
RAC: 0
Message 65222 - Posted 7 Feb 2010 6:41:29 UTC

Another error after 10sec.

lr15clusfa_opt_.1bgf.1bgf.IGNORE_THE_REST.c.12.1.pdb.pdb.JOB_17562_9_0

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=288315508

<core_client_version>6.2.14</core_client_version>
<![CDATA[
<message>
process exited with code 193 (0xc1, -63)
</message>

Watchdog active.
SIGSEGV: segmentation violation
Stack trace (8 frames):
[0x96c49b3]
[0x96ee888]
[0xffffe500]
[0x80a8721]
[0x808fcc1]
[0x804985f]
[0x974c15c]
[0x8048121]

Exiting...
</stderr_txt>

____________


Max DesGeorges

Joined: Oct 1 05
Posts: 35
ID: 2201
Credit: 942,527
RAC: 0
Message 65223 - Posted 7 Feb 2010 9:16:09 UTC - in response to Message ID 65221.

I have an “igfhum ProteinInterfaceDesign” that takes from 280 MB up to 1,2 GB of memory!! Is that normal?
If it is, I think that in future most of the people will not have enough memory to run Rosetta anymore…

The name of the WU is:
igfhum_looprefine_placestub2_2dsrI_1B6E_ProteinInterfaceDesign_2Feb2010_17660_331_0
After 45 minutes I restarted BOINC and the WU restarted from zero. Now, after 2 hours, the properties show me that the CPU time at last checkpoint is still without any number (“---“), as if the WU had worked for only a few minutes.
Looking at the task manager, it seems that the WU continuously asks for more memory until it reaches the limit set in the preferences. Then it drops rapidly to 280 MB and again increases up to around 1,2 GB.

Vista 32 bit, Core Duo T7250, 2 GB DDR2, BOINC 6.10.29

____________

Max DesGeorges

Joined: Oct 1 05
Posts: 35
ID: 2201
Credit: 942,527
RAC: 0
Message 65225 - Posted 7 Feb 2010 12:12:14 UTC - in response to Message ID 65223.

I have an “igfhum ProteinInterfaceDesign” that takes from 280 MB up to 1,2 GB of memory!! Is that normal?
If it is, I think that in future most of the people will not have enough memory to run Rosetta anymore…

The name of the WU is:
igfhum_looprefine_placestub2_2dsrI_1B6E_ProteinInterfaceDesign_2Feb2010_17660_331_0
After 45 minutes I restarted BOINC and the WU restarted from zero. Now, after 2 hours, the properties show me that the CPU time at last checkpoint is still without any number (“---“), as if the WU had worked for only a few minutes.
Looking at the task manager, it seems that the WU continuously asks for more memory until it reaches the limit set in the preferences. Then it drops rapidly to 280 MB and again increases up to around 1,2 GB.

Vista 32 bit, Core Duo T7250, 2 GB DDR2, BOINC 6.10.29


UPDATE: The WU finished without errors.
Looking at the graphics I noticed that whenever the WU froze in the “request memory loop”, it was always in the “kic_refine_r2” stage and the accepted energy didn’t vary.
I hope this info is useful. :)

____________

johndad5

Joined: Aug 12 09
Posts: 1
ID: 336865
Credit: 513,712
RAC: 0
Message 65226 - Posted 7 Feb 2010 13:14:59 UTC - in response to Message ID 64951.

This app update includes a fix for checkpointing.

Please report issues and bugs here!

thanks,

DK

For some reason I am not getting new work. When I update the project it simply says "Not reporting or requesting tasks". I am using BOINC version 6.10.18 .

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3389
ID: 106194
Credit: 0
RAC: 0
Message 65227 - Posted 7 Feb 2010 16:38:19 UTC - in response to Message ID 65226.

For some reason I am not getting new work. When I update the project it simply says "Not reporting or requesting tasks". I am using BOINC version 6.10.18 .

John, it sounds like BOINC has decided to schedule work from other projects for the near term on your machine. It is trying to run within the resource shares between projects that you have established. It's normal, and once some work for the other projects has been done, it will come back and ask for work from Rosetta automatically.
____________
Rosetta Moderator: Mod.Sense

Neo2

Joined: Feb 3 10
Posts: 2
ID: 368922
Credit: 307,819
RAC: 0
Message 65229 - Posted 8 Feb 2010 8:52:51 UTC

Hi!
I don't know if this happened also with older versions of rosetta since I started computing on the 3rd of February.
I'm running on an amd64 Linux system, a pretty powerful one. Looking at my tasks log, I have had about 120 WUs assigned until today, but only 3-4 of them completed successfully. The others show "Outcome - Client error" / "Client state - Compute error". Looking at boinc.log gave me no information because it doesn't contain any error line except "output file .... absent", which according to the FAQ is safe to ignore. I'm running lhc, seti, milkyway, einstein, ralph and cosmology, and with the exception of the einstein tasks, which also seem to end up in computation errors, every other project is running fine. Milkyway in particular granted me 2500 credits in the last four days (from which I assume the machine is stable). I have never observed problems with the machine itself (occasional lockups, strange sudden shutdowns etc).

This is the /proc/cpuinfo file (I've omitted the other 3 cores):
# cat /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 16
model : 4
model name : AMD Phenom(tm) II X4 920 Processor
stepping : 2
cpu MHz : 2800.000
cache size : 512 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 4
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nonstop_tsc extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt
bogomips : 5619.47
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate

The machine is equipped with 8 GB of RAM. Everything is running at stock speed; I'm not overclocking. If any other information is needed I can provide it, and I'm not scared to do some debugging. :)

Thanks
Neo2

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3389
ID: 106194
Credit: 0
RAC: 0
Message 65231 - Posted 8 Feb 2010 17:23:26 UTC

Neo2, thanks for joining Rosetta. I see you have two machines. The 4 core that you described is here. And at present, it doesn't show any successfully completed work units. If you look at the task details for that host, such as this one, they each report an error opening a file. The file name seems to vary with each task.

This implies a security setup problem on your machine. The executable and the user that is running the BOINC core client, need authority to the files that are downloaded. Is it possible your BOINC installation is conflicting with some anti-virus software? Or other security measures?
____________
Rosetta Moderator: Mod.Sense

Neo2

Joined: Feb 3 10
Posts: 2
ID: 368922
Credit: 307,819
RAC: 0
Message 65232 - Posted 8 Feb 2010 18:52:55 UTC - in response to Message ID 65231.
Last modified: 8 Feb 2010 18:57:14 UTC

I don't think so. Currently I have clamd up and running, but it is only a daemon that fulfills requests from userspace programs, not real-time antivirus software.
I'm running 2.6.33 git kernel, without any extra security measures: no grsecurity, no firewall, no external security hooks of any sort, no SElinux.
The directory in which BOINC runs is owned by user and group boinc, both existing, no file in the directory is owned by other users. Every file (except executables which have 0755) has got permission 0644 while the directories have 0755. The BOINC executable runs with boinc:boinc also.
Before starting BOINC for the first time I tuned the directory parameters, so every file in the BOINC directory has been created by BOINC itself.
Gentoo by default installs a stock /etc/conf.d file through which the BOINC service is started. I only modified the paths for data storage and logging, nothing else.

The file is the following:
# Config file for /etc/init.d/boinc

# Owner of BOINC process (must be existing)
USER="boinc"
GROUP="boinc"

# Directory with runtime data: Work units, project binaries, user info etc.
RUNTIMEDIR="/mnt/storage/boinc"

# Location of the boinc command line binary
BOINCBIN="/usr/bin/boinc_client"

# Logfile (/dev/null for nowhere)
LOGFILE="/mnt/storage/boinc/boinc.log"

# Allow remote gui RPC yes or no
ALLOW_REMOTE_RPC="yes"

# nice level
NICELEVEL="17"

# scheduling parameters, arguments to chrt(1)
SCHED_PARAM="--batch 0"

# Relative CPU allocation for boinc user, default is 1024,
# requires CONFIG_FAIR_GROUP_SCHED and CONFIG_USER_SCHED,
# see /usr/src/linux/Documentation/scheduler/sched-design-CFS.txt
CPU_SHARE="768"
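The ownership and permission checks described above can also be repeated with a small script. This is just a sketch: the function name is made up, it only tests read access for whoever runs it (so it would have to be run as the boinc user to be meaningful), and the path is simply the RUNTIMEDIR from the config file.

```python
import os


def find_unreadable(root):
    """Return paths under root that the current user cannot read."""
    problems = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # os.access checks with the real uid/gid of this process.
            if not os.access(path, os.R_OK):
                problems.append(path)
    return problems


if __name__ == "__main__":
    # RUNTIMEDIR taken from the config file; adjust it for other setups.
    for path in find_unreadable("/mnt/storage/boinc"):
        print("cannot read:", path)
```

An empty result only rules out plain read-permission problems; it says nothing about other reasons why opening a file might fail.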

Now I'm a bit disappointed.
Would the manual removal of the rosetta files and the re-sync with the project be of any use?

P . P . L .

Joined: Aug 20 06
Posts: 581
ID: 105843
Credit: 4,864,105
RAC: 0
Message 65236 - Posted 8 Feb 2010 21:21:01 UTC

This errored after 47min.

igfhum_looprefine_placestub2_2dsrI_1P6F_ProteinInterfaceDesign_2Feb2010_17660_271_0

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=288505225

<core_client_version>6.2.14</core_client_version>
<![CDATA[
<message>
Maximum memory exceeded
</message>

Mon 08 Feb 2010 22:14:56 EST|rosetta@home|Aborting task igfhum_looprefine_placestub2_2dsrI_1P6F_ProteinInterfaceDesign_2Feb2010_17660_271_0: exceeded memory limit 918.79MB > 909.78MB

Mon 08 Feb 2010 22:14:59 EST|rosetta@home|Output file igfhum_looprefine_placestub2_2dsrI_1P6F_ProteinInterfaceDesign_2Feb2010_17660_271_0_0 for task absent
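For anyone collecting these aborts from their own message log, the two figures can be pulled out of the line with a short script. This is a sketch only: the pattern is modelled on the log line above and is an assumption about how other client versions word the message.

```python
import re

# Modelled on: "... exceeded memory limit 918.79MB > 909.78MB"
LIMIT_RE = re.compile(r"exceeded memory limit ([\d.]+)MB > ([\d.]+)MB")


def parse_memory_abort(line):
    """Return (used_mb, limit_mb) from an abort message, or None."""
    match = LIMIT_RE.search(line)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


line = ("Mon 08 Feb 2010 22:14:56 EST|rosetta@home|Aborting task "
        "igfhum_looprefine_placestub2_2dsrI_1P6F_ProteinInterfaceDesign_"
        "2Feb2010_17660_271_0: exceeded memory limit 918.79MB > 909.78MB")
used, limit = parse_memory_abort(line)
print(f"over the limit by {used - limit:.2f} MB")
```

In this case the task was only about 9 MB over the limit when it was aborted.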


____________


Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3389
ID: 106194
Credit: 0
RAC: 0
Message 65237 - Posted 8 Feb 2010 21:25:39 UTC - in response to Message ID 65232.

Would the manual removal of the rosetta files and the re-sync with the project be of any use?


...doubtful. I would have suggested that if I felt it stood a good chance of helping your situation. But it can't hurt anything (costs you some bandwidth to reload everything).

Now that I think about it, if security setup were the problem, you should have same issue with other projects.

Anyone else have any ideas why Linux would be unable to open an application file?
____________
Rosetta Moderator: Mod.Sense

jcorn

Joined: Jan 27 06
Posts: 6
ID: 54746
Credit: 198,437
RAC: 0
Message 65238 - Posted 8 Feb 2010 21:30:19 UTC

Hi Manuel and P.P.L.

The large memory requirements are a once-in-a-while occurrence, but not something entirely unexpected. These jobs occasionally find a very interesting possible solution and spend a lot of resources testing it. I had submitted these jobs with the requirement for 512 MB RAM allocated for boinc. But based on your observations, I'll increase that requirement to 1 GB in the future. Thanks very much for the reports!
____________

Craig Dickinson

Joined: May 7 07
Posts: 8
ID: 174326
Credit: 604,896
RAC: 215
Message 65239 - Posted 8 Feb 2010 22:23:47 UTC

Anyone else seeing the following consistent error:-

File - minirosetta_graphics_1.92_windows_x86_64.exe stops downloading at 4.57/5.10 MB

Message section is showing this as a HTTP error followed by Internet access OK - project servers may temporarily be down.

I have reset the project (more than once), and also detached and waited until the next PC boot to re-attach. All this had no impact, and it's been doing this for several days now. So I am unable to process any work units as the application hasn't finished downloading.

Running on Boinc 6.10.18 for Windows 64Bit on Windows 7, AMD 64Bit Dual Core, 4GB RAM

I am also running Seti@Home and this is running error free in both the standard and astropulse projects.

P . P . L .

Joined: Aug 20 06
Posts: 581
ID: 105843
Credit: 4,864,105
RAC: 0
Message 65240 - Posted 8 Feb 2010 22:24:56 UTC - in response to Message ID 65221.

Hi jcorn.

I have an “igfhum ProteinInterfaceDesign” that takes from 280 MB up to 1,2 GB of memory!! Is that normal?
If it is, I think that in future most of the people will not have enough memory to run Rosetta anymore…

===============================================================================
Going by this, if i can make a suggestion you might want to up the memory limit to 1.5GB for those tasks.

My rig that had the error has 1GB total, less with O.S. taken out that's not going to be enough.




____________


Snagletooth

Joined: Feb 22 07
Posts: 193
ID: 149031
Credit: 1,425,415
RAC: 236
Message 65242 - Posted 8 Feb 2010 23:09:16 UTC

lr15clusfa_opt_.1hz6.1hz6.IGNORE_THE_REST.c.0.34.pdb.pdb.JOB_17586_4
Exit status 193 (0xc1)
SIGBUS: bus error

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3389
ID: 106194
Credit: 0
RAC: 0
Message 65244 - Posted 9 Feb 2010 4:27:01 UTC - in response to Message ID 65239.

Anyone else seeing the following consistent error:-

File - minirosetta_graphics_1.92_windows_x86_64.exe stops downloading at 4.57/5.10 MB

Message section is showing this as a HTTP error followed by Internet access OK - project servers may temporarily be down.

I have reset the project (more than once), and also detached and waited until the next PC boot to re-attach. All this had no impact, and it's been doing this for several days now. So I am unable to process any work units as the application hasn't finished downloading.

Running on Boinc 6.10.18 for Windows 64Bit on Windows 7, AMD 64Bit Dual Core, 4GB RAM

I am also running Seti@Home and this is running error free in both the standard and astropulse projects.


It should recover the transfer from where it left off and get the rest of the file. But it seems it must have a hiccup along the way. Are you using a caching proxy server or something?

Sounds like you've enabled the http tracing. Which Rosetta server does it say it is trying to get the file from? It should actually cycle through all of them as it does the retries. This should confuse a proxy enough that it would start fresh.

You could always download it with your browser and drop it in the rosetta project directory. Here is one of the direct URLs:
http://srv4.bakerlab.org/download/minirosetta_graphics_1.92_windows_x86_64.exe
____________
Rosetta Moderator: Mod.Sense

Max DesGeorges

Joined: Oct 1 05
Posts: 35
ID: 2201
Credit: 942,527
RAC: 0
Message 65248 - Posted 9 Feb 2010 13:54:46 UTC - in response to Message ID 65238.

Hi Manuel and P.P.L.

The large memory requirements are a once-in-a-while occurrence, but not something entirely unexpected. These jobs occasionally find a very interesting possible solution and spend a lot of resources testing it. I had submitted these jobs with the requirement for 512 MB RAM allocated for boinc. But based on your observations, I'll increase that requirement to 1 GB in the future. Thanks very much for the reports!

This is a good idea, but I think the specific WU I mentioned had another problem. It continued to take memory until the maximum available was reached. So maybe it would have taken more RAM if I had more in my PC.
So far I'm the only one who has noticed this problem; maybe it is an isolated case.

____________

TomaszPawel

Joined: Apr 28 07
Posts: 54
ID: 170716
Credit: 2,791,145
RAC: 0
Message 65250 - Posted 9 Feb 2010 19:02:22 UTC - in response to Message ID 65248.

Valid, but with errors!?... and "great" score...
____________
WWW of Polish National Team - Join! Crunch! Win!

Craig Dickinson

Joined: May 7 07
Posts: 8
ID: 174326
Credit: 604,896
RAC: 215
Message 65251 - Posted 9 Feb 2010 22:16:02 UTC - in response to Message ID 65244.

Anyone else seeing the following consistent error:-

File - minirosetta_graphics_1.92_windows_x86_64.exe stops downloading at 4.57/5.10 MB

Message section is showing this as a HTTP error followed by Internet access OK - project servers may temporarily be down.

I have reset the project (more than once), and also detached and waited until the next PC boot to re-attach. All this had no impact, and it's been doing this for several days now. So I am unable to process any work units as the application hasn't finished downloading.

Running on Boinc 6.10.18 for Windows 64Bit on Windows 7, AMD 64Bit Dual Core, 4GB RAM

I am also running Seti@Home and this is running error free in both the standard and astropulse projects.


It should recover the transfer from where it left off and get the rest of the file. But it seems it must have a hiccup along the way. Are you using a caching proxy server or something?

Sounds like you've enabled the http tracing. Which Rosetta server does it say it is trying to get the file from? It should actually cycle through all of them as it does the retries. This should confuse a proxy enough that it would start fresh.

You could always download it with your browser and drop it in the rosetta project directory. Here is one of the direct URLs:
http://srv4.bakerlab.org/download/minirosetta_graphics_1.92_windows_x86_64.exe


It does not show which server it is connecting to, I tried the url you provided and srv3 as well. Both of these download at 136KB/sec consistently until the download hits 4.57MB then the transfer rate drops to 0KB/sec and after 5 minutes or so it times out the connection.

Evan

Joined: Dec 23 05
Posts: 268
ID: 42505
Credit: 402,585
RAC: 0
Message 65252 - Posted 9 Feb 2010 23:21:22 UTC

This one failed after about 20 seconds

lr15clusfa_opt_.1ail.1ail.IGNORE_THE_REST.c.1.24.pdb.pdb.JOB_17559_3

Exit status -1073741819 (0xc0000005)
Reason: Access Violation (0xc0000005) at address 0x006D2D46 read attempt to address 0x00000000

____________

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3389
ID: 106194
Credit: 0
RAC: 0
Message 65254 - Posted 10 Feb 2010 3:55:31 UTC - in response to Message ID 65251.


It does not show which server it is connecting to, I tried the url you provided and srv3 as well. Both of these download at 136KB/sec consistently until the download hits 4.57MB then the transfer rate drops to 0KB/sec and after 5 minutes or so it times out the connection.


...even so, the reaction on the client should be to see that it did not receive the entire length of the file, and then try a second request for just the remainder of the file (this is done with an HTTP range header).
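For reference, a resumed request of that kind looks roughly like the sketch below. It is not the BOINC client's code; the function name is made up, and the byte offset is a hypothetical stand-in for "about 4.57 MB received so far".

```python
import urllib.request


def build_resume_request(url, bytes_received):
    """Build a GET request for the part of the file not yet received."""
    request = urllib.request.Request(url)
    # "bytes=N-" asks for everything from byte offset N to the end.
    request.add_header("Range", f"bytes={bytes_received}-")
    return request


req = build_resume_request(
    "http://srv4.bakerlab.org/download/"
    "minirosetta_graphics_1.92_windows_x86_64.exe",
    4_792_000,  # hypothetical offset, roughly where the transfer stalls
)
print(req.get_header("Range"))
```

A server that honours the header answers with status 206 (Partial Content) and sends only the missing tail, which the client appends to what it already has.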

I just tried the link from home and it worked fine for me 5.1MB. So that would explain why others are not seeing the same.

Is it possible your ISP is limiting the time of each connection or something? Even so, I'm still puzzled why it doesn't sound like it is doing a retry from where it left off. What is the reaction on the client after the connection gets the 4.57M and then times out?? It should schedule a retry on the file until it gets it. And the retry should pick up where the first attempt left off at the 4.57M. You should see this in the advanced view, in the transfers tab.

Perhaps there is something up with Win7? I'd be curious to have a look at a Wireshark trace if you would take the time to gather one.
____________
Rosetta Moderator: Mod.Sense

P . P . L .

Joined: Aug 20 06
Posts: 581
ID: 105843
Credit: 4,864,105
RAC: 0
Message 65255 - Posted 10 Feb 2010 5:54:05 UTC

Hi jcorn.

Either this is an old task or the memory limit hasn't been changed yet, this one

had the same problem on the same rig, would you believe!

Only ran for 10min this time.

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=288907163

igfhum_looprefine_placestub2_2dsrI_2R99_ProteinInterfaceDesign_2Feb2010_17660_441_0


Wed 10 Feb 2010 16:19:03 EST|rosetta@home|Aborting task igfhum_looprefine_placestub2_2dsrI_2R99_ProteinInterfaceDesign_2Feb2010_17660_441_0: exceeded memory limit 910.28MB > 909.78MB

Wed 10 Feb 2010 16:19:05 EST|rosetta@home|Output file igfhum_looprefine_placestub2_2dsrI_2R99_ProteinInterfaceDesign_2Feb2010_17660_441_0_0 for task absent

<core_client_version>6.2.14</core_client_version>
<![CDATA[
<message>
Maximum memory exceeded
</message>



____________


AMD_is_logical

Joined: Dec 20 05
Posts: 299
ID: 41207
Credit: 31,460,681
RAC: 0
Message 65263 - Posted 10 Feb 2010 15:45:34 UTC

Some lr15clusfa_opt WUs are failing after a few seconds:

http://boinc.bakerlab.org/rosetta/result.php?resultid=316845967
http://boinc.bakerlab.org/rosetta/result.php?resultid=316788455
http://boinc.bakerlab.org/rosetta/result.php?resultid=316769175
http://boinc.bakerlab.org/rosetta/result.php?resultid=316754661
http://boinc.bakerlab.org/rosetta/result.php?resultid=316741225

Craig Dickinson

Joined: May 7 07
Posts: 8
ID: 174326
Credit: 604,896
RAC: 215
Message 65264 - Posted 10 Feb 2010 16:52:10 UTC - in response to Message ID 65254.


It does not show which server it is connecting to, I tried the url you provided and srv3 as well. Both of these download at 136KB/sec consistently until the download hits 4.57MB then the transfer rate drops to 0KB/sec and after 5 minutes or so it times out the connection.


...even so, the reaction on the client should be to see that it did not receive the entire length of the file, and then try a second request for just the remainder of the file (this is done with an HTTP range header).

I just tried the link from home and it worked fine for me 5.1MB. So that would explain why others are not seeing the same.

Is it possible your ISP is limiting the time of each connection or something? Even so, I'm still puzzled why it doesn't sound like it is doing a retry from where it left off. What is the reaction on the client after the connection gets the 4.57M and then times out?? It should schedule a retry on the file until it gets it. And the retry should pick up where the first attempt left off at the 4.57M. You should see this in the advanced view, in the transfers tab.

Perhaps there is something up with Win7? I'd be curious to have a look at a Wireshark trace if you would take the time to gather one.




I have the Wireshark trace from re-attaching the Rosetta project to the BOINC client, showing the original download failure of the graphics file and the first 2 attempts to resume the download. Once again, all other Rosetta files and the first work unit downloaded successfully.

Where do you want me to send the Wireshark trace report?

Mad_Max

Joined: Dec 31 09
Posts: 150
ID: 365007
Credit: 5,175,766
RAC: 6,775
Message 65267 - Posted 11 Feb 2010 2:15:03 UTC - in response to Message ID 65248.

This is a good idea, but I think the specific WU I mentioned had another problem. It continued to take memory until the maximum available was reached. So maybe it would have taken more RAM if I had more in my PC.
So far I'm the only one who has noticed this problem; maybe it is an isolated case.

By the way - it looks like a typical memory leak...
A fairly common error in computer programs

Hi jcorn.
Either this is an old task or the memory limit hasn't been changed yet, this one
had the same problem on the same rig, would you believe!
Only ran for 10min this time.
http://boinc.bakerlab.org/rosetta/workunit.php?wuid=288907163
igfhum_looprefine_placestub2_2dsrI_2R99_ProteinInterfaceDesign_2Feb2010_17660_441_0

Yes, it's old.
Hint: the name of the task contains the date when it was scheduled: 2 Feb 2010 in this case.
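
As a rough sketch of that hint, the date can be pulled out of a task name with a regular expression. The pattern below assumes the date always appears as day, three-letter English month, and four-digit year between underscores, as in the names posted in this thread; names without such a token (for example the lr15clusfa ones) simply return None. The function name is hypothetical.

```python
import re
from datetime import date

MONTH_NAMES = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
               "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
MONTHS = {name: number for number, name in enumerate(MONTH_NAMES, start=1)}

# Matches tokens such as "_2Feb2010_" or "_01Feb2010_".
DATE_TOKEN = re.compile(r"_(\d{1,2})(" + "|".join(MONTH_NAMES) + r")(\d{4})_")


def task_date(task_name):
    """Return the date embedded in a task name, or None if there is none."""
    match = DATE_TOKEN.search(task_name)
    if match is None:
        return None
    day, month, year = match.groups()
    return date(int(year), MONTHS[month], int(day))


name = "igfhum_looprefine_placestub2_2dsrI_2R99_ProteinInterfaceDesign_2Feb2010_17660_441_0"
print(task_date(name))
```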

Greg_BE Profile

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,969,735
RAC: 81
Message 65270 - Posted 11 Feb 2010 8:35:40 UTC

Credit wise, this task: http://boinc.bakerlab.org/rosetta/result.php?resultid=315895911 (igfhum_looprefine_placestub2_2dsrI_2WA0_ProteinInterfaceDesign_2Feb2010_17660_334_0) wasn't even worth my CPU time. I claimed 99 credit and got only 3!
It ran the full length of time, 15000+ seconds, ran 44 models and generated 2 decoys.

Something is wrong with those numbers. Especially granted credit.

Greg_BE Profile

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,969,735
RAC: 81
Message 65271 - Posted 11 Feb 2010 8:36:58 UTC

http://boinc.bakerlab.org/rosetta/result.php?resultid=315583449
lr15clusfa_opt_.1dhn.1dhn.IGNORE_THE_REST.c.14.1.pdb.pdb.JOB_17574_1_0

Compute error -177 (0xffffff4f)

Got full credit though.

Sid Celery

Joined: Feb 11 08
Posts: 806
ID: 241409
Credit: 10,030,156
RAC: 9,347
Message 65276 - Posted 11 Feb 2010 11:34:26 UTC - in response to Message ID 65270.

Credit wise, this task: http://boinc.bakerlab.org/rosetta/result.php?resultid=315895911 (igfhum_looprefine_placestub2_2dsrI_2WA0_ProteinInterfaceDesign_2Feb2010_17660_334_0) wasn't even worth my CPU time. I claimed 99 credit and got only 3!
It ran the full length of time, 15000+ seconds, ran 44 models and generated 2 decoys.

Something is wrong with those numbers. Especially granted credit.

But the times you were awarded more than claimed credit weren't a problem? Funny how that works.

It's an average and you're ahead of average generally. I am too but I thought best not to mention it ;)
____________

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3389
ID: 106194
Credit: 0
RAC: 0
Message 65277 - Posted 11 Feb 2010 11:57:27 UTC

Let's not get testy Sid. It looks like he ran 46 models and got credit for only the last 2. I've asked the Project Team to look in to these "double headers" as I call them. Thanks for reporting it Greg. If you have any hints about any rare events that may have occurred on your PC about the time those last two models would have been run, that would be great. Did you happen to power off or shutdown BOINC about that time?
____________
Rosetta Moderator: Mod.Sense

Evan

Joined: Dec 23 05
Posts: 268
ID: 42505
Credit: 402,585
RAC: 0
Message 65278 - Posted 11 Feb 2010 12:41:47 UTC

It would appear that some of these lr15clusfa.. work units have a problem.

lr15clusfa_opt_.2cmx.2cmx.SAVE_ALL_OUT_IGNORE_THE_REST.c.2.28.pdb.pdb.JOB_17759_1_0

The previous one I reported has also failed on its second attempt
____________

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3389
ID: 106194
Credit: 0
RAC: 0
Message 65286 - Posted 11 Feb 2010 17:45:13 UTC - in response to Message ID 65264.


I have the Wireshark trace from re-attaching the Rosetta project to the BOINC client, showing the original download failure of the graphics file and the first 2 attempts to resume the download of the graphics file. Once again, all other Rosetta files and the first work unit have successfully downloaded.

Where do you want me to send the Wireshark trace report?


Craig and I arranged exchange of the trace via PMs... Craig, I've reviewed the trace and it seems the BOINC client is requesting a partial transfer from the next server in Rosetta's cluster, just as expected. But your machine never acknowledges receipt of any of the packets. And so the 3rd and 4th tries are asking for the same point in the file that the second did. Plenty of time elapses, and the frame is resent about 10 times. But never acknowledged. Eventually the connection is closed.

The only times I've seen such fundamental things not work properly is when (bear with me :) there is a fundamental problem. Some examples from my past... bad ethernet port in the hub. Bad cable. But most commonly there's a router between the client and the server that's in need of a code refresh. That, or perhaps Win7 itself or the driver for your LAN adapter need an update. Because it sounds like it is happening so consistently at the same point, I tend to think there is a code problem here, because a hardware problem like a bad cable or port would probably be more intermittent.

The only other thing I can think of to try is that the downloads started out looking good, so after that first one gets most of the file and fails, shutdown BOINC, and shutdown the LAN adapter ("disable"), reenable the adapter, and fire up BOINC again. Now the retry will occur with a fresh start on the adapter and perhaps get you over the hump. But ultimately, that is just a work around. I think the true fix is going to be drivers, Win7, or router/firewall code updates. You might also try disabling the IPv6 support on the adapter.

Is anyone aware of any specific TCP fixes for Win7?

Notes to self (cuz this little scrap of paper is sure to disappear as soon as anyone asks me anything else about it!): client starts on srv1, retries occur to srv5, srv6, and srv3. Local ports used are 51221, 51229, 51235, 51241. These make good filters, such as: tcp.port == 51235 in the Wireshark display.
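
The per-port filters in the note above can also be combined into one display filter covering all four attempts. A minimal sketch in Python that only builds the filter string (it does not run Wireshark); the helper name is hypothetical.

```python
def port_filter(ports):
    """Build a Wireshark display filter matching any of the given TCP ports."""
    return " || ".join(f"tcp.port == {port}" for port in ports)


# Local ports noted for the first attempt and the three retries.
print(port_filter([51221, 51229, 51235, 51241]))
```

Pasting the resulting expression into the display filter bar should show all four connections side by side, which makes it easier to compare where each attempt stalls.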
____________
Rosetta Moderator: Mod.Sense

mfbabb2

Joined: Oct 10 08
Posts: 4
ID: 283282
Credit: 10,345
RAC: 0
Message 65289 - Posted 12 Feb 2010 0:43:24 UTC

What is up with the low credit?
316913595 289066389 10 Feb 2010 15:09:57 UTC 12 Feb 2010 0:36:02 UTC Over Success Done 12,371.91 36.80 2.13

____________

Sid Celery

Joined: Feb 11 08
Posts: 806
ID: 241409
Credit: 10,030,156
RAC: 9,347
Message 65290 - Posted 12 Feb 2010 2:39:21 UTC - in response to Message ID 65277.

Let's not get testy Sid.

I didn't mean it that way - sorry if that's how it came across. I just recalled Sarel's comment way up the thread that "The critstubs jobs are of the same sort of unevenly crediting trajectories as the *gbnnotyr*. You can have a look at Eva-Maria Strauch's message on the Protein-interface design thread for details" so I'm pretty much ignoring all the vagaries of credit awards against claims. It averages out so we win some, we lose some. Is that not right?

If it's not then I can report quite a few too, for what it's worth.

Probably of more benefit I should report some compute errors, much the same as reported by others:

BOINC client version 6.10.18 for windows_x86_64
Processor: 2 GenuineIntel Intel(R) Core(TM)2 Duo CPU T6600@2.20GHz [Intel64 Family 6 Model 23 Stepping 10]
OS: Microsoft Windows 7: Home Premium x64 Edition, (06.01.7600.00)
Memory: 4.00 GB physical

# cpu_run_time_pref: 28800
Reason: Access Violation (0xc0000005) at address 0x006D2D46 read attempt to address 0x00000000

CPU time 20.65453
lr15clusfa_opt_.1ctf.1ctf.IGNORE_THE_REST.c.18.2.pdb.pdb.JOB_17573_10_0

BOINC client version 6.10.18 for windows_x86_64
Processor: AMD Phenom(tm) 9850 Quad-Core Processor [AMD64 Family 16 Model 2 Stepping 3]
OS: Microsoft Windows Vista Home Premium x64 Edition, Service Pack 2, 06.00.6002.00)
Memory: 8.00 GB physical

# cpu_run_time_pref: 28800
Reason: Access Violation (0xc0000005) at address 0x006D2D46 read attempt to address 0x00000000

CPU time 14.77329
lr15clusfa_opt_.1scj.1scj.IGNORE_THE_REST.c.2.32.pdb.pdb.JOB_17610_1_0
CPU time 15.2101
lr15clusfa_opt_.1iib.1iib.IGNORE_THE_REST.c.9.2.pdb.pdb.JOB_17588_5_1
CPU time 15.1477
lr15clusfa_opt_.1ttz.1ttz.IGNORE_THE_REST.c.0.27.pdb.pdb.JOB_17619_4_1
CPU time 15.0073
lr15clusfa_opt_.1ail.1ail.IGNORE_THE_REST.c.4.11.pdb.pdb.JOB_17559_8_1
____________

Copelco

Joined: Feb 11 10
Posts: 1
ID: 369754
Credit: 8,097
RAC: 0
Message 65292 - Posted 12 Feb 2010 4:20:36 UTC

I'm a new user running latest version. The first work unit you sent ran fine to about 70% then stopped and dropped off the task list as submitted. Account shows no work units submitted. May be a problem.

Thanks,
TC

Link

Joined: May 4 07
Posts: 260
ID: 173059
Credit: 338,704
RAC: 3
Message 65298 - Posted 12 Feb 2010 14:33:07 UTC

Now I've got also quite low credit: WU 288293546.
I usually need something like 450-650 CPU-seconds for 1Cr, on this WU I got 1Cr/1972sec.
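
The comparison above is a plain ratio of CPU seconds to credit. A small Python sketch, assuming the three trailing numbers in the result row posted earlier in the thread (12,371.91 36.80 2.13) are CPU seconds, claimed credit, and granted credit; the function name is hypothetical.

```python
def seconds_per_credit(cpu_seconds, credit):
    """CPU seconds spent per credit; a higher value means a lower payout rate."""
    if credit <= 0:
        raise ValueError("credit must be positive")
    return cpu_seconds / credit


# Result row quoted earlier in the thread (column meaning is an assumption).
cpu_seconds, claimed, granted = 12371.91, 36.80, 2.13
print(round(seconds_per_credit(cpu_seconds, claimed)))  # rate implied by the claim
print(round(seconds_per_credit(cpu_seconds, granted)))  # rate actually granted
```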
____________
.

Admin

Joined: Apr 13 07
Posts: 42
ID: 164784
Credit: 260,782
RAC: 0
Message 65300 - Posted 12 Feb 2010 15:59:56 UTC

Compute error - exit status 1 lrmixclus_opt_.1hz6.1hz6.SAVE_ALL_OUT_IGNORE_THE_REST.c.20.2.pdb.pdb.JOB_17816_1_0

http://boinc.bakerlab.org/rosetta/result.php?resultid=317250268

ERROR: start_res != middle_res
ERROR:: Exit from: ..\..\src\protocols\moves\KinematicMover.cc line: 132
BOINC:: Error reading and gzipping output datafile: default.out

P . P . L .

Joined: Aug 20 06
Posts: 581
ID: 105843
Credit: 4,864,105
RAC: 0
Message 65302 - Posted 12 Feb 2010 20:35:24 UTC

This one failed after just 14 sec.

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=289171483

lr15clusfa_opt_.1ail.1ail.SAVE_ALL_OUT_IGNORE_THE_REST.c.4.20.pdb.pdb.JOB_17676_7_0

Fri 12 Feb 2010 21:40:02 EST|rosetta@home|Output file lr15clusfa_opt_.1ail.1ail.SAVE_ALL_OUT_IGNORE_THE_REST.c.4.20.pdb.pdb.JOB_17676_7_0_0 for task absent

<core_client_version>6.2.14</core_client_version>
<![CDATA[
<message>
process exited with code 193 (0xc1, -63)
</message>

BOINC:: Worker startup.
Starting watchdog...
Watchdog active.
SIGSEGV: segmentation violation
Stack trace (8 frames):
[0x96c49b3]
[0x96ee888]
[0xb7fd1420]
[0x80a8721]
[0x808fcc1]
[0x804985f]
[0x974c15c]
[0x8048121]

Exiting...

</stderr_txt>



____________


P . P . L .

Joined: Aug 20 06
Posts: 581
ID: 105843
Credit: 4,864,105
RAC: 0
Message 65310 - Posted 14 Feb 2010 1:53:36 UTC

This one ran for 11min.

lr15clus_3fa_opt_.1bm8.1bm8.SAVE_ALL_OUT_IGNORE_THE_REST.c.13.6.pdb.pdb.JOB_17967_1_0

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=289719139

<core_client_version>6.2.14</core_client_version>
<![CDATA[
<message>
process exited with code 1 (0x1, -255)
</message>

BOINC:: Worker startup.
Starting watchdog...
Watchdog active.
# cpu_run_time_pref: 14400

ERROR: start_res != middle_res
ERROR:: Exit from: src/protocols/moves/KinematicMover.cc line: 132
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

</stderr_txt>

____________


AdeB Profile

Joined: Dec 12 06
Posts: 45
ID: 135244
Credit: 2,473,178
RAC: 1,976
Message 65311 - Posted 14 Feb 2010 10:24:55 UTC

The same error as P.P.L. and Admin.

ERROR: start_res != middle_res
ERROR:: Exit from: src/protocols/moves/KinematicMover.cc line: 132
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish


Task 317684657

AdeB
____________

Mad_Max

Joined: Dec 31 09
Posts: 150
ID: 365007
Credit: 5,175,766
RAC: 6,775
Message 65313 - Posted 14 Feb 2010 16:39:05 UTC - in response to Message ID 65165.
Last modified: 14 Feb 2010 16:47:19 UTC

Max, the watchdog should be expected to kick in if a model exceeds target runtime by more than 4 hours. And so with a short 2 hour preference, you will tend to see much more variation in runtimes than if you had a longer preference.

However, I would expect that once the target runtime is exceeded, and a model is completed, that work on the WU should be ended and it would be reported back. So if you have a 2hr preference, and the first model took more than 2hrs to complete, I would have expected the WU to end once the first model was completed. In fact, if the first model took more than an hour, the predicted completion time of the second model would have exceeded your preference, and so I would expect that scenario to end after the first model as well.

Have you changed your runtime preference? Perhaps this machine hadn't received the updated preference yet? (looking at the WU you linked, it appears to have had the 7200sec preference during the entire run). So, it sounds like your expectations are correct. More observations would be helpful. Reference to the specific WU (as you've done) is very helpful.


The rest of what you say sounds right to me. So keep watching, and jotting notes. This information will be key to resolving this issue.


I have recently received a lot of tasks that ignore the Target CPU Time in preferences.
It seems most of them belong to the type * boinc_filtered_loopbuild_threading *
Examples of such tasks:
t380__boinc_filtered_loopbuild_threading_cst_lb_tex_IGNORE_THE_REST_16900_2906_0 - 15002.5 cpu seconds, 2 decoys

t347__boinc_filtered_loopbuild_threading_cst_relax_tex_IGNORE_THE_REST_16901_9452_0 - 20591.2 cpu seconds, 2 decoys

t330__boinc_filtered_loopbuild_threading_cst_relax_tex_IGNORE_THE_REST_16901_3175_0 - 16323.3 cpu seconds, 2 decoys

t322__boinc_filtered_loopbuild_threading_cst_all_tex_IGNORE_THE_REST_16902_3299_0 - 21789.4 cpu seconds, 3 decoys

In all the examples cpu_run_time_pref was set to 7200 sec, and all of them generated 2 or more decoys (for 2 of them I saw that the 1st model took about 2hr or more), so the program could have stopped correctly after the 1st decoy. But for some reason it did not do so.
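
The expected behaviour from the quoted explanation above can be written down as two small checks. This is only a sketch of the rule as described in this thread, not minirosetta's actual logic: predicting the next model's duration from the average so far, and applying the 4-hour watchdog grace to total CPU time, are both assumptions, as are the function names.

```python
WATCHDOG_GRACE_SECONDS = 4 * 3600  # "more than 4 hours" past the target


def should_start_another_model(elapsed, models_done, target):
    """Start a new model only if it is predicted to finish within the target."""
    if models_done == 0:
        return True  # always attempt the first model
    average = elapsed / models_done  # assumed predictor of the next model's time
    return elapsed + average <= target


def watchdog_should_end(elapsed, target):
    """Watchdog ends the task once CPU time exceeds the target plus the grace."""
    return elapsed > target + WATCHDOG_GRACE_SECONDS


# 2 hour preference, first model took a bit over an hour: do not start a second.
print(should_start_another_model(3700, 1, 7200))
# The 21789.4 second example is past 7200 + 14400, so the watchdog would apply.
print(watchdog_should_end(21789.4, 7200))
```

Under this reading, the 15002.5, 16323.3 and 20591.2 second examples fall between the target and the watchdog limit, which is the range where the task would have been expected to stop on its own after a completed model.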

transient

Joined: Sep 30 06
Posts: 376
ID: 115553
Credit: 7,834,811
RAC: 4,046
Message 65323 - Posted 15 Feb 2010 6:17:47 UTC
Last modified: 15 Feb 2010 6:18:02 UTC

I had a computation error on this task. My 'wingman' also got a computation error on this task. The task errored out after 17 seconds.
lr15clusfa_opt_.1wd6.1wd6.SAVE_ALL_OUT_IGNORE_THE_REST.c.0.10.pdb.pdb.JOB_17747_2
____________

banditwolf Profile

Joined: Jan 10 06
Posts: 28
ID: 49031
Credit: 139,737
RAC: 0
Message 65329 - Posted 15 Feb 2010 15:46:14 UTC

I am seeing the 2.05 use more memory than previous versions. Up to 160k from ~60-100k.
____________

Sarel Profile

Joined: May 11 06
Posts: 51
ID: 81994
Credit: 81,712
RAC: 0
Message 65331 - Posted 15 Feb 2010 18:27:47 UTC - in response to Message ID 65329.

Hello, if these are the Protein-interface Design jobs then this is expected since they work with very large complexes of proteins. If you turn on the graphics you'll see that the protein systems are much larger than the typical ones on Rosetta @ Home. These jobs are sent out with a requirement for 512Mb of memory to ensure that large-memory jobs are not sent out to low-resource machines.

Best, Sarel.

I am seeing the 2.05 use more memory than previous versions. Up to 160k from ~60-100k.


____________

svincent

Joined: Dec 30 05
Posts: 202
ID: 44923
Credit: 4,404,794
RAC: 6,286
Message 65333 - Posted 15 Feb 2010 20:41:29 UTC

Task 317544195, lr15clus_opt_.1a32.1a32.IGNORE_THE_REST.c.2.8.pdb.pdb.JOB_17418_5_1 behaved strangely on Mac OS X. It got hung at Model 2: step 0 and had to be aborted. In the Searching... pane in the graphics window the protein was compressed into a furball: the other protein displays seemed pretty normal.

Craig Dickinson

Joined: May 7 07
Posts: 8
ID: 174326
Credit: 604,896
RAC: 215
Message 65338 - Posted 15 Feb 2010 23:30:05 UTC - in response to Message ID 65286.


I have the Wireshark trace from re-attaching the Rosetta project to the BOINC client, showing the original download failure of the graphics file and the first 2 attempts to resume the download of the graphics file. Once again, all other Rosetta files and the first work unit have successfully downloaded.

Where do you want me to send the Wireshark trace report?


Craig and I arranged exchange of the trace via PMs... Craig, I've reviewed the trace and it seems the BOINC client is requesting a partial transfer from the next server in Rosetta's cluster, just as expected. But your machine never acknowledges receipt of any of the packets. And so the 3rd and 4th tries are asking for the same point in the file that the second did. Plenty of time elapses, and the frame is resent about 10 times. But never acknowledged. Eventually the connection is closed.

The only times I've seen such fundamental things not work properly is when (bear with me :) there is a fundamental problem. Some examples from my past... bad ethernet port in the hub. Bad cable. But most commonly there's a router between the client and the server that's in need of a code refresh. That, or perhaps Win7 itself or the driver for your LAN adapter need an update. Because it sounds like it is happening so consistently at the same point, I tend to think there is a code problem here, because a hardware problem like a bad cable or port would probably be more intermittent.

The only other thing I can think of to try is that the downloads started out looking good, so after that first one gets most of the file and fails, shutdown BOINC, and shutdown the LAN adapter ("disable"), reenable the adapter, and fire up BOINC again. Now the retry will occur with a fresh start on the adapter and perhaps get you over the hump. But ultimately, that is just a work around. I think the true fix is going to be drivers, Win7, or router/firewall code updates. You might also try disabling the IPv6 support on the adapter.

Is anyone aware of any specific TCP fixes for Win7?

Notes to self (cuz this little scrap of paper is sure to disappear as soon as anyone asks me anything else about it!): client starts on srv1, retries occur to srv5, srv6, and srv3. Local ports used are 51221, 51229, 51235, 51241. These make good filters, such as: tcp.port == 51235 in the Wireshark display.



I am using a router so re-loaded the firmware as suggested and this has fixed the problem. Didn't think about the router being the cause as I would have expected that to have caused problems with other BOINC project updates or other software auto updaters.

pvh

Joined: Feb 7 10
Posts: 3
ID: 369324
Credit: 1,792,756
RAC: 0
Message 65346 - Posted 16 Feb 2010 16:50:58 UTC

I am seeing WUs that seem to be "stuck". If you look at the properties of the WU, you typically see something like:

CPU time at last checkpoint 00:35:26
CPU time 06:02:29

If you look at the graphics, you see that the protein is not changing shape at all and the energy and RMSD are perfectly constant. These jobs run on for around 25,000 seconds and (I assume) are then terminated by the watchdog. You get very low credit for these jobs. I assume this is a bug in the code. If so, please fix it quickly since it is wasting lots of CPU time.

When I see such a WU, should I abort it, or is it better to leave it running?

This is with Rosetta Mini 2.05 on a 64-bit Linux system. I have seen this on both of my OpenSUSE 11.2 systems with the 2.6.31.8-0.1-desktop kernel (so hardware problems are ruled out). I have so far not seen this on my dual-core OpenSUSE 11.0 system with a custom 2.6.28.2-vanilla kernel. The latter is by far the least performant machine, so there is a (small) chance that this is just random chance. I do not see an obvious pattern which WUs suffer from this.
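
The symptom described above (CPU time running far ahead of the last checkpoint) can be measured as a simple gap. A minimal Python sketch using the two values from the task properties; the function names and the idea of comparing the gap against a threshold of your own choosing are assumptions, not anything BOINC provides.

```python
def hms_to_seconds(text):
    """Convert an HH:MM:SS string, as shown in task properties, to seconds."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def checkpoint_gap(cpu_time, last_checkpoint):
    """Seconds of CPU time accumulated since the last checkpoint."""
    return hms_to_seconds(cpu_time) - hms_to_seconds(last_checkpoint)


gap = checkpoint_gap("06:02:29", "00:35:26")
print(gap, "seconds without a checkpoint")
```

A gap of several hours is also the amount of work that would be lost if the task were restarted from its last checkpoint.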

Evan

Joined: Dec 23 05
Posts: 268
ID: 42505
Credit: 402,585
RAC: 0
Message 65348 - Posted 16 Feb 2010 18:38:46 UTC

CPU time at last checkpoint 00:35:26
CPU time 06:02:29

Try closing down BOINC and re-opening it. That seems to do the trick.
____________

pvh

Joined: Feb 7 10
Posts: 3
ID: 369324
Credit: 1,792,756
RAC: 0
Message 65349 - Posted 16 Feb 2010 20:57:50 UTC - in response to Message ID 65348.

Try closing down BOINC and re-opening it. That seems to do the trick.


Thanks for the tip, but why did you remove my post?

Greg_BE Profile

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,969,735
RAC: 81
Message 65365 - Posted 19 Feb 2010 13:11:40 UTC

lr15clusfa_opt_.1o5u.1o5u.SAVE_ALL_OUT_IGNORE_THE_REST.c.13.8.pdb.pdb.JOB_17715_7_0

http://boinc.bakerlab.org/rosetta/result.php?resultid=317120311

Outcome Client error
Client state Compute error
Exit status -1073741819 (0xc0000005)
CPU time 15.39063
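
The two forms of the exit status above are the same 32-bit value read as signed and as unsigned. A short Python sketch of the conversion (helper names are hypothetical); the same arithmetic relates the earlier "-177 (0xffffff4f)" report.

```python
def as_unsigned32(code):
    """Reinterpret a (possibly negative) exit status as an unsigned 32-bit value."""
    return code & 0xFFFFFFFF


def as_signed32(value):
    """Reinterpret an unsigned 32-bit value as a signed exit status."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value & 0x80000000 else value


print(hex(as_unsigned32(-1073741819)))
print(hex(as_unsigned32(-177)))
print(as_signed32(0xC0000005))
```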

P . P . L .

Joined: Aug 20 06
Posts: 581
ID: 105843
Credit: 4,864,105
RAC: 0
Message 65393 - Posted 22 Feb 2010 21:05:31 UTC

Hi.

I don't know if this is a task problem or because of the validator problems, I'll put it here anyway.

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=291130717

2cgq_Jan28_2cgq_3cp0_ProteinInterfaceDesign_15Feb2010_18083_187_0

Validate error

# cpu_run_time_pref: 14400
======================================================
DONE :: 2 starting structures 14487.9 cpu seconds
This process generated 2 decoys from 2 attempts
======================================================

BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down cleanly ...
called boinc_finish

</stderr_txt>

____________


sam_spade

Joined: Dec 2 08
Posts: 1
ID: 290882
Credit: 232,974
RAC: 0
Message 65394 - Posted 22 Feb 2010 22:08:14 UTC
Last modified: 22 Feb 2010 22:11:10 UTC

For almost a week I have been getting an error while downloading the app:
[error] Can't create HTTP response output file projects/boinc.bakerlab.org_rosetta/minirosetta_2.05_windows_x86_64.exe
What can I do? I already tried to reset the project.
The rosetta_beta version_598 app works well.

[AF>Le_Pommier>MacBidouille.com] BlueG3

Joined: Mar 16 08
Posts: 1
ID: 247562
Credit: 43,585
RAC: 0
Message 65403 - Posted 23 Feb 2010 22:21:11 UTC
Last modified: 23 Feb 2010 22:24:47 UTC

ProteinInterfaceDesign seems to finish in validate error:
error 1
error 2
error 3
error 4

markj

Joined: Jun 21 08
Posts: 3
ID: 265343
Credit: 7,525,781
RAC: 11,091
Message 65409 - Posted 24 Feb 2010 9:53:19 UTC
Last modified: 24 Feb 2010 9:54:12 UTC

All, or at least most, of the ProteinInterface jobs cause validate errors - would it be possible to fix this and post in this thread when the fix is performed? It occurs on three different computers (PC, Mac), so appears to be platform-independent. In the meantime, I am aborting all ProteinInterface jobs, leaving the others which run ok.
markj

J

Joined: Feb 23 10
Posts: 4
ID: 371593
Credit: 68,995
RAC: 0
Message 65420 - Posted 26 Feb 2010 3:16:43 UTC

http://img80.imageshack.us/img80/4378/roserr1.jpg

Haven't been on this project long. No noticeable problems outside of punching 'ok'. Briefly searched the forums for c++ runtime error and didn't find anything, so cheers, here's a pic.

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3389
ID: 106194
Credit: 0
RAC: 0
Message 65424 - Posted 26 Feb 2010 14:42:10 UTC
Last modified: 26 Feb 2010 14:43:33 UTC

Looks like "J" has had a few compute errors reported on Win XP running BOINC 6.5.0:

abinitio_withrelax_nodisulf_nohomfrag_cst0.1_129_B_1wjgA_SAVE_ALL_OUT_17405_1983_0
abinitio_withrelax_nodisulf_nohomfrag_cst0.1_129_B_2hx5A_SAVE_ALL_OUT_17405_468_0
lrmixclus_opt_.5cro.5cro.SAVE_ALL_OUT_IGNORE_THE_REST.c.5.6.pdb.pdb.JOB_17886_5_1

The first was completed ok by a wingman
The second is out being worked on right now
The third failed on a wingman as well after 2 min. with an error: The system cannot find the path specified. (0x3) - exit code 3 (0x3)

____________
Rosetta Moderator: Mod.Sense

robertmiles Profile

Joined: Jun 16 08
Posts: 658
ID: 264600
Credit: 3,743,487
RAC: 7,157
Message 65429 - Posted 27 Feb 2010 3:00:02 UTC - in response to Message ID 65152.
Last modified: 27 Feb 2010 3:55:14 UTC

I am running Rosetta@home (along with some other projects) on four machines, three running XP and one running Win7. Two of the machines have AMD 64 single core processors and two have AMD 64 dual core processors. The rosetta app is 'rosetta mini 2.05' and the BOINC version is 6.10.18. About once a week or so I will notice that the Rosetta WU running on at least one of the machines has been running longer than usual and using little or no CPU time. I abort it, and the next WU runs fine. I have not seen this occur with any of my other projects (climate prediction, malaria control or world community grid).


Try suspending/resuming, or even entirely turn-off/restart BOINC, before aborting the WU. I have found this works often, not always, for me.


Thanks - one of my 2.05 workunits had the same problem, but now seems to be running well after a reboot.

http://boinc.bakerlab.org/rosetta/result.php?resultid=320652086

64-bit Vista SP2, BOINC 6.10.18, quad-core Intel, not using keep in memory when suspended (something tends to tie up lots of memory and make the computer unresponsive to the mouse and keyboard; haven't found what, though)

t311__boinc_filtered_loopbuild_threading type workunit

Before the reboot, showed CPU time 03:39:05, last checkpoint 03:39:03, elapsed time so far 20:29:26, not using any CPU time

Rebooted, that workunit restarted at about 4 hours elapsed time, but is now using a CPU core again.

robertmiles Profile

Joined: Jun 16 08
Posts: 658
ID: 264600
Credit: 3,743,487
RAC: 7,157
Message 65430 - Posted 27 Feb 2010 3:32:00 UTC - in response to Message ID 65229.
Last modified: 27 Feb 2010 3:53:38 UTC

Hi!
I don't know if this happened also with older versions of rosetta since I started computing on the 3rd of February.
I'm running on an amd64 Linux system, a pretty powerful one. Looking at my tasks log, I had about 120 WUs assigned until today, but only 3-4 of them completed successfully. Others show "Outcome - Client error" / "Client state - Compute error". Looking at boinc.log gave me no information because it doesn't contain any error line except "output file .... absent", which I'm told from the FAQ it is safe to ignore. I'm running lhc, seti, milkyway, einstein, ralph, cosmology and with the exception of einstein tasks which seem to end up in computation errors also, every other program is running fine. Milkyway in particular granted me 2500 credits in the last four days (from which I assume the machine is stable). I have never observed problems with the machine itself (occasional lockups, strange sudden shutdowns etc).

Thanks
Neo2


One thing to look for: I've found that when the output file absent error occurs, it's a good idea to search the logfile for any reference to boinc_lockfile. Errors that refer to that file tend to cascade from one workunit to the next, at least with the older versions of BOINC, but not with some of the newer versions like the 6.10.18 I'm now using. They can also cascade to other BOINC projects that use a file with the same name, again for the older BOINC versions.
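
Searching the log as suggested above takes only a few lines of Python. A minimal sketch; the function name is hypothetical and it works on any iterable of lines, so the log file's location and name are left to the reader.

```python
def lines_mentioning(lines, needle="boinc_lockfile"):
    """Return (line number, text) pairs for every line containing `needle`."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if needle in line
    ]


# Small made-up sample; real use would pass an open log file instead.
sample = [
    "Starting task example_task_0\n",
    "Can't acquire lockfile boinc_lockfile\n",
    "Output file example_task_0_0 for task absent\n",
]
for number, text in lines_mentioning(sample):
    print(number, text)
```

For a real log, open the file and pass the file object directly, since iterating over it yields lines one at a time.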

Minardi

Joined: Jan 19 10
Posts: 1
ID: 367201
Credit: 779,670
RAC: 305
Message 65460 - Posted 5 Mar 2010 2:11:31 UTC

I have had several tasks stall out and stop using CPU over the past few days. I am finishing up my rosetta tasks, then taking this machine off the project. I was running an XP machine and had no problems. It died, and I replaced it with a W7 64-bit machine and some tasks started stalling out on me. In reviewing this thread, it appears there is a problem with mini Rosetta 2.05 running on W7 machines.

ramostol

Joined: Feb 6 07
Posts: 64
ID: 145835
Credit: 584,052
RAC: 0
Message 65480 - Posted 7 Mar 2010 11:42:31 UTC

My first Protein_interface (validation related?) error as far as I know - MacOS 10.5:

tyrsim_3gbn_2esa_Protein_interface_design_01Feb2010_17949_9_2


Outcome Success
Client state Done
Exit status 0 (0x0)

CPU time 21540.8

<core_client_version>6.10.36</core_client_version>
<![CDATA[
<stderr_txt>

[...]

# cpu_run_time_pref: 21600
======================================================
DONE :: 327 starting structures 21540.3 cpu seconds
This process generated 327 decoys from 327 attempts
======================================================

BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down cleanly ...
called boinc_finish

</stderr_txt>
]]>

Validate state Workunit error - check skipped


One of two wingmen validated successfully after his deadline, but with far fewer decoys completed.

AdeB Profile
Avatar

Joined: Dec 12 06
Posts: 45
ID: 135244
Credit: 2,473,178
RAC: 1,976
Message 65481 - Posted 7 Mar 2010 21:53:11 UTC - in response to Message ID 65480.
Last modified: 7 Mar 2010 22:00:19 UTC

My first Protein_interface (validation related?) error as far as I know - MacOS 10.5:

tyrsim_3gbn_2esa_Protein_interface_design_01Feb2010_17949_9_2


Outcome Success
Client state Done
Exit status 0 (0x0)

CPU time 21540.8

<core_client_version>6.10.36</core_client_version>
<![CDATA[
<stderr_txt>

[...]

# cpu_run_time_pref: 21600
======================================================
DONE :: 327 starting structures 21540.3 cpu seconds
This process generated 327 decoys from 327 attempts
======================================================

BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down cleanly ...
called boinc_finish

</stderr_txt>
]]>

Validate state Workunit error - check skipped


One of two wingmen validated successfully after his deadline, but with far fewer decoys completed.


There is nothing wrong on your end. This is a very old (and rare) bug in the boinc server software. Take a look here.
Wait a second, the trac item claims that the bug is fixed. Maybe it is time for Rosetta to update the server-code.

AdeB
____________

Greg_BE Profile

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,969,735
RAC: 81
Message 65482 - Posted 7 Mar 2010 22:49:27 UTC

http://boinc.bakerlab.org/rosetta/result.php?resultid=322413556
tyrsim_3gbn_q.gz_Protein_interface_design_25Feb2010_18415_276_1
Outcome Client error
Client state Compute error
Exit status 1 (0x1)
CPU time 4.4375
stderr out

<core_client_version>6.10.18</core_client_version>
<![CDATA[
<message>
Incorrect function. (0x1) - exit code 1 (0x1)

P . P . L .

Joined: Aug 20 06
Posts: 581
ID: 105843
Credit: 4,864,105
RAC: 0
Message 65492 - Posted 9 Mar 2010 0:22:41 UTC

Looks like there are still problems with this app, same task

it just restarted near the end and i got it in the neck, not impressed.

tyrsim_3gbn_1c81_Protein_interface_design_25Feb2010_18415_410_0

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=294414088


# cpu_run_time_pref: 14400
======================================================
DONE :: 348 starting structures 14397.5 cpu seconds
This process generated 348 decoys from 348 attempts
======================================================


# cpu_run_time_pref: 14400
======================================================
DONE :: 2 starting structures 14498.9 cpu seconds
This process generated 2 decoys from 2 attempts
======================================================

BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down cleanly ...
called boinc_finish

</stderr_txt>
]]>

Validate state Valid

Claimed credit 102.297287162446

Granted credit 0.384433279143336

application version 2.05
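
The pattern in the output above (two DONE summaries in one stderr, with only the last one apparently counted) can be spotted mechanically. A minimal Python sketch, assuming the summary lines always have the form shown in this thread; the function name is hypothetical.

```python
import re

DONE_LINE = re.compile(
    r"DONE\s*::\s*(\d+)\s+starting structures\s+([\d.]+)\s+cpu seconds"
)


def done_summaries(stderr_text):
    """Return (structures, cpu_seconds) for every DONE summary in the text."""
    return [
        (int(count), float(seconds))
        for count, seconds in DONE_LINE.findall(stderr_text)
    ]


stderr_text = """
DONE :: 348 starting structures 14397.5 cpu seconds
This process generated 348 decoys from 348 attempts
DONE :: 2 starting structures 14498.9 cpu seconds
This process generated 2 decoys from 2 attempts
"""
summaries = done_summaries(stderr_text)
if len(summaries) > 1:
    print("task reported more than one summary:", summaries)
```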

____________


apohawk Profile

Joined: Sep 13 08
Posts: 5
ID: 278595
Credit: 1,072,002
RAC: 135
Message 65530 - Posted 12 Mar 2010 10:55:16 UTC

This work unit reports "success" despite having errors in the end.

http://boinc.bakerlab.org/rosetta/result.php?resultid=323517090

application: minirosetta 2.05
name of work unit: ina2inaN_to_NOE__18638_5045_0
Outcome: Success
Exit status: 0 (0x0)

CPU time: 2212.594

but at the end of the result we got:
# cpu_run_time_pref: 7200

ERROR: Unrecognized edge type!
ERROR:: Exit from: ..\..\src\core\kinematics\util.cc line: 1422
called boinc_finish


CPU: Phenom II 945
OS: WinXP 64 SP2
____________

Duzz

Joined: Nov 14 05
Posts: 1
ID: 11953
Credit: 13,148
RAC: 0
Message 65544 - Posted 13 Mar 2010 13:16:48 UTC
Last modified: 13 Mar 2010 13:17:53 UTC

During the last days I had several WUs staying idle after some time of computation. Windows XP task manager shows no CPU activity. If one does not notice this, many hours of WU processing get lost, which is very unproductive for the project.
____________

AdeB Profile

Joined: Dec 12 06
Posts: 45
ID: 135244
Credit: 2,473,178
RAC: 1,976
Message 65547 - Posted 13 Mar 2010 22:39:05 UTC

In workunit gunn_fragments_SAVE_ALL_OUT_-1wtyA__18642_1106 both tasks (324092645 and 323994500) ended with the same error:

ERROR: ct == final_atoms
ERROR:: Exit from: ..\..\src\core\scoring\rms_util.cc line: 397
BOINC:: Error reading and gzipping output datafile: default.out

AdeB
____________

Mad_Max

Joined: Dec 31 09
Posts: 150
ID: 365007
Credit: 5,175,766
RAC: 6,775
Message 65555 - Posted 15 Mar 2010 3:44:52 UTC

Today I got strange validation errors: "Task was reported too late to validate"
But there are 4 days until deadline (19 Mar)!
Links to the tasks:
http://boinc.bakerlab.org/rosetta/result.php?resultid=323161767
http://boinc.bakerlab.org/rosetta/result.php?resultid=323181972
http://boinc.bakerlab.org/rosetta/result.php?resultid=323205144

transient

Joined: Sep 30 06
Posts: 376
ID: 115553
Credit: 7,834,811
RAC: 4,046
Message 65560 - Posted 15 Mar 2010 17:35:23 UTC - in response to Message ID 65555.

Today I got strange validation errors: "Task was reported too late to validate"
But there are 4 days until deadline (19 Mar)!


I see more things happened during the weekend. I'm seeing you detached or something which might cause all incomplete units to report. Did you perhaps restore a backup which tried to continue from an earlier point?
____________

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3389
ID: 106194
Credit: 0
RAC: 0
Message 65561 - Posted 15 Mar 2010 23:09:15 UTC

What is odd is the way the tasks were reissued before he reported the completed ones back. That wouldn't normally happen. That isn't dependent upon Mad Max's machine, so I doubt they did a restore or anything. I'll have to see what we can find out.
____________
Rosetta Moderator: Mod.Sense

Mad_Max

Joined: Dec 31 09
Posts: 150
ID: 365007
Credit: 5,175,766
RAC: 6,775
Message 65564 - Posted 16 Mar 2010 2:38:42 UTC - in response to Message ID 65560.


I see more things happened during the weekend. I'm seeing you detached or something which might cause all incomplete units to report. Did you perhaps restore a backup which tried to continue from an earlier point?


The "detached" error is BOINC related.
I did not actually detach from the project; I connected a new computer. After that the BOINC client initially went mad: first it started downloading to the new computer (Athlon II X2 250) tasks that had already been downloaded to the old computer (Athlon XP 2600+), then at some point it thought better of it, registered the new computer on the server under a new ID, and deleted the mistakenly downloaded tasks. (I think this is the point that was recorded on the server as "detached".)

Note: there was no transfer of any BOINC-related files from the old computer to the new one. The new client was a clean install from the distribution, so I do not know what caused this behavior. Maybe the fact that the computer connects to the internet from the same IP?

Hmm, now I think that, in principle, such a validation error could happen because of this. If one computer "cancels" the (mistakenly downloaded) tasks while the second one is still working on them, could the server issue the same WU to another volunteer's computer and shift the deadline?

transient

Joined: Sep 30 06
Posts: 376
ID: 115553
Credit: 7,834,811
RAC: 4,046
Message 65565 - Posted 16 Mar 2010 5:33:06 UTC

You still would've gotten credits if you had managed to report before the other computer. :) Anyway, from what you're saying about the other computer, I do think the "too late to validate" error was more likely related to the new PC than to a bug in the science application. Maybe a problem with the BOINC manager itself?
____________

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3389
ID: 106194
Credit: 0
RAC: 0
Message 65567 - Posted 16 Mar 2010 16:32:04 UTC

True, not a problem specific to v2.05 Rosetta. Perhaps BOINC server, or client. Either way, we should start another thread if further problem tasks are found.

Certainly many users that have multiple machines are connecting from same IP address (I'm talking the router's public IP address that the project servers see). And many other users come in via dynamic IPs, and so it is always different. My understanding is that BOINC uses many factors to determine if a given machine is the same as an existing registered one to keep it all straight and separated correctly. Factors such as the user ID, host name, any existing BOINC host ID, machine type, installed OS, last RPC sequence number... so a fresh install should not have caused the client to "go mad" on either machine. Indeed many users have identically configured machines at same site coming in via same IP.
____________
Rosetta Moderator: Mod.Sense

P . P . L .

Joined: Aug 20 06
Posts: 581
ID: 105843
Credit: 4,864,105
RAC: 0
Message 65570 - Posted 17 Mar 2010 3:07:31 UTC

This took 8 hrs 2 min on my 3 GHz Intel, with a four-hour run time preference.

aqp9__boinc_aqp9_fast_run01_yfsong_loopbuild_threading_cst_relax_superfast_yfsong_IGNORE_THE_REST_18658_1421_0

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=296064742

# cpu_run_time_pref: 14400
Continuing computation from checkpoint: chk_S_2B6OA_15_0001_Remodel__loop_1_0_0_S ... success!
BOINC:: CPU time: 28914.7s, 14400s + 14400s[2010- 3-17 13:39:17:] :: BOINC
InternalDecoyCount: 0
======================================================
DONE :: 1 starting structures 28914.7 cpu seconds
This process generated 1 decoys from 1 attempts
======================================================
called boinc_finish
SIGSEGV: segmentation violation
Stack trace (15 frames):
[0x96c49b3]
[0x96ee888]
[0xb7fe9420]
[0x91d6455]
[0x842671e]
[0x83e85d3]
[0x80a7840]
[0x84381fe]
[0x812a54a]
[0x812b82d]
[0x86aa16b]
[0x8243cf5]
[0x8049897]
[0x974c15c]
[0x8048121]

Exiting...

</stderr_txt>
]]>
Validate state Valid
Claimed credit 69.3077894676244
Granted credit 25.52312719487 -- for 8hrs.
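
For anyone who wants to check how far a task ran past its run time preference, here is a minimal Python sketch. It only reads the two lines quoted in the stderr output above; the function names `runtime_overrun` and `hms` are made up for illustration and are not part of BOINC or Rosetta.

```python
import re

def runtime_overrun(stderr_text):
    """Parse the run time preference and the final CPU time from stderr text.

    Returns (pref_seconds, cpu_seconds, ratio), or None if either line is missing.
    """
    pref = re.search(r"#\s*cpu_run_time_pref:\s*(\d+)", stderr_text)
    done = re.search(
        r"DONE\s*::\s*\d+\s+starting structures\s+([\d.]+)\s+cpu seconds",
        stderr_text,
    )
    if not pref or not done:
        return None
    pref_s = int(pref.group(1))
    cpu_s = float(done.group(1))
    return pref_s, cpu_s, cpu_s / pref_s

def hms(seconds):
    """Format a number of seconds as hours, minutes and seconds."""
    total = int(round(seconds))
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours}h {minutes:02d}m {secs:02d}s"

# Lines copied from the report above.
sample = """# cpu_run_time_pref: 14400
DONE :: 1 starting structures 28914.7 cpu seconds
"""

pref_s, cpu_s, ratio = runtime_overrun(sample)
print(f"preference {hms(pref_s)}, used {hms(cpu_s)}, ratio {ratio:.2f}x")
```

For this task the ratio comes out at roughly twice the preference, which matches the "14400s + 14400s" in the BOINC line of the log.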



____________


Sid Celery

Joined: Feb 11 08
Posts: 806
ID: 241409
Credit: 10,030,156
RAC: 9,347
Message 65572 - Posted 17 Mar 2010 3:22:49 UTC
Last modified: 17 Mar 2010 3:24:29 UTC

On this desktop I got a Compute error Exit status -177 (0xffffff4f) in the following task:
aqp9__boinc_aqp9_fast_run01_blast_yfsong_loopbuild_threading_cst_relax_superfast_yfsong_IGNORE_THE_REST_18653_30510_0

<message>
Maximum disk usage exceeded
</message>

I did notice while it was running that it was about 2 hours over my 8 hour runtime, on Model 6 Step 19051, but it reported 0 CPU time in the end.

I allow 10 GB of disk space for BOINC and have about 581 MB in use on 5 current or waiting tasks, with 9.43 GB free.

Also, on this laptop I got a validate error on the following task a few days back:
t290__boinc_filtered_loopbuild_threading_cst_lb_tex_IGNORE_THE_REST_16900_8451_0
____________

Mad_Max

Joined: Dec 31 09
Posts: 150
ID: 365007
Credit: 5,175,766
RAC: 6,775
Message 65575 - Posted 17 Mar 2010 14:13:21 UTC

To Mod.Sense:
Yes, it is certainly not a problem with minirosetta 2.05. It looks like a rare bug in the BOINC server, probably connected with the fact that the computer had the same IP (not only the "external" router IP, but the internal one too) and the same network name. The new computer was a replacement for the old one, so I gave it the same name as the previous one, after first renaming the old one. Actually, this should not be a factor, because BOINC identifies hosts by internal ID (such as 1211592) and not by Windows names. But a bug is a bug, and something did not go as intended :)
In any case, I have not come across any more such errors, so I think this can be forgotten.

To Sid Celery:
I also had a lot of errors in tasks such as *__boinc_filtered_loopbuild_threading_*. In fact, every second job terminated with an error. On top of that, all tasks of this type overran the target CPU time, and there were strange-looking things in the graphics (such as RMSD from 20 to 50 and odd-looking models).
So now I cancel all jobs of this type if I see them in the job queue.

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3389
ID: 106194
Credit: 0
RAC: 0
Message 65576 - Posted 17 Mar 2010 20:51:24 UTC

Sid, each task also has a configured maximum disk space. So that must be the limit that was hit by the task you mention. This is just one more failsafe that is in place to help assure things keep running smoothly.
____________
Rosetta Moderator: Mod.Sense

Sid Celery

Joined: Feb 11 08
Posts: 806
ID: 241409
Credit: 10,030,156
RAC: 9,347
Message 65578 - Posted 17 Mar 2010 21:03:05 UTC - in response to Message ID 65575.

I also had a lot of errors in tasks such as *__boinc_filtered_loopbuild_threading_*. In fact, every second job terminated with an error. On top of that, all tasks of this type overran the target CPU time, and there were strange-looking things in the graphics (such as RMSD from 20 to 50 and odd-looking models).
So now I cancel all jobs of this type if I see them in the job queue.

It's the only error I've had in the last week on that W7 laptop, and credit was granted in the clean-up job, so I'm not worried by it - I don't understand any of these validate errors but while I was reporting the other one I thought I'd just mention it. I don't think my errors are the same as yours in that case.

I'm more surprised by the disk-usage issue on the Vista desktop which is otherwise very well behaved. I did suspect the task type, but others have gone through now with no problem at all, so maybe it just went a bit 'rogue' on me. I just thought it was worth describing seeing as I noticed it was a bit odd while running for 10 hours, yet the task details didn't indicate anything more than it failed on startup, which wasn't actually the case.

One for the backroom team to ponder.
____________

svincent

Joined: Dec 30 05
Posts: 202
ID: 44923
Credit: 4,404,794
RAC: 6,286
Message 65655 - Posted 28 Mar 2010 0:24:31 UTC

Miscellaneous computation errors:

----

327069193 (v2FcInnerW_1dAl_3GM3_ProteinInterfaceDesign_15Mar2010_18672_254_0) failed on Mac OS X. Similar failure from wingman.

ERROR: f.check_fold_tree()
ERROR:: Exit from: src/protocols/docking/DockingProtocol.cc line: 405
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

</stderr_txt>
]]>

----

326722657 (placestub_alt_denovo_1zvy_1z2m_ProteinInterfaceDesign_21Mar2010_18705_22_0) failed on W7

ERROR: in::file::zip minirosetta_database.zip does not exist!
ERROR:: Exit from: ..\..\src\apps\public\boinc\minirosetta.cc line: 137
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

</stderr_txt>

----

326721814 (tedor-cs_-tdonly-1-calbindin__18708_33_1) failed on W7. Similar failure from wingman.


ERROR: rsd_type_list.size()
ERROR:: Exit from: ..\..\src\core\fragment\Frame.cc line: 62
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

</stderr_txt>
]]>

Evan

Joined: Dec 23 05
Posts: 268
ID: 42505
Credit: 402,585
RAC: 0
Message 65658 - Posted 28 Mar 2010 8:01:15 UTC
Last modified: 28 Mar 2010 8:09:45 UTC

326722657 (placestub_alt_denovo_1zvy_1z2m_ProteinInterfaceDesign_21Mar2010_18705_22_0) failed on W7

ERROR: in::file::zip minirosetta_database.zip does not exist!
ERROR:: Exit from: ..\..\src\apps\public\boinc\minirosetta.cc line: 137
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish


Add me to the list with

tedor-cs_-tdonly-1-gb3__18708_4647
ERROR: rsd_type_list.size()
ERROR:: Exit from: ..\..\src\core\fragment\Frame.cc line: 62
BOINC:: Error reading and gzipping output datafile: default.out
____________

allenandholmes

Joined: Dec 17 07
Posts: 1
ID: 227594
Credit: 7,563
RAC: 0
Message 65659 - Posted 28 Mar 2010 8:17:07 UTC

I have been processing my current minirosetta task for 4 or 5 days now and have had a suspicion about its checkpointing capabilities. I shut my PC down each night and restart it the next morning for BOINC processing. However the elapsed time displayed resets to 0, the time to completion continues to increase all day long (and between sessions) and the processed percentage is dramatically different from a ratio of elapsed/completion times. Am I wasting my time?

Sid Celery

Joined: Feb 11 08
Posts: 806
ID: 241409
Credit: 10,030,156
RAC: 9,347
Message 65663 - Posted 28 Mar 2010 15:14:17 UTC

One unusual error I haven't seen before - W7-64bit laptop:

Rossmann3x3_abinitio_SAVE_ALL_OUT_design_k031_001_18698_1551_0

Outcome Client error
Client state Compute error
Exit status 1 (0x1)
[...]
<core_client_version>6.10.36</core_client_version>
[...]
# cpu_run_time_pref: 28800
Starting work on structure: _00018
Continuing computation from checkpoint: chk_S_00000018_ClassicAbinitio__stage_3_iter1_10 ... success!
Continuing computation from checkpoint: chk_S_00000018_ClassicAbinitio__stage4_kk_1 ... success!
Continuing computation from checkpoint: chk_S_00000018_ClassicAbinitio__stage4_kk_2 ... success!
std::cerr: Exception was thrown:
no success reading silent file chk_S_00000018_ClassicAbinitio__stage4_kk_3.out

____________

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3389
ID: 106194
Credit: 0
RAC: 0
Message 65666 - Posted 28 Mar 2010 21:21:41 UTC - in response to Message ID 65659.

I have been processing my current minirosetta task for 4 or 5 days now and have had a suspicion about its checkpointing capabilities. I shut my PC down each night and restart it the next morning for BOINC processing. However the elapsed time displayed resets to 0, the time to completion continues to increase all day long (and between sessions) and the processed percentage is dramatically different from a ratio of elapsed/completion times. Am I wasting my time?


If you would like to see detail on when checkpoints are being taken, you can create or modify the cc_config.xml file. Add the tag for:
<checkpoint_debug>1</checkpoint_debug>
to the log_flags section.

The frequency of checkpoints varies with the various types of tasks.
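
If you would rather not edit the file by hand, here is a minimal Python sketch that adds the flag to existing cc_config.xml text, or builds a new file body if you have none. It assumes the usual layout with a `cc_config` root element containing `log_flags`; the function name `enable_checkpoint_debug` is made up for illustration. It only returns text, so saving the result as cc_config.xml in your BOINC data directory and getting the client to re-read it is still up to you.

```python
import xml.etree.ElementTree as ET

def enable_checkpoint_debug(xml_text=None):
    """Return cc_config.xml text with checkpoint_debug set to 1 in log_flags.

    xml_text is the current file content, or None/empty to start from scratch.
    Other settings already present in the file are left untouched.
    """
    if xml_text and xml_text.strip():
        root = ET.fromstring(xml_text)
    else:
        root = ET.Element("cc_config")

    # Create the log_flags section only if it is not there yet.
    log_flags = root.find("log_flags")
    if log_flags is None:
        log_flags = ET.SubElement(root, "log_flags")

    # Reuse an existing checkpoint_debug tag rather than adding a second one.
    flag = log_flags.find("checkpoint_debug")
    if flag is None:
        flag = ET.SubElement(log_flags, "checkpoint_debug")
    flag.text = "1"

    return ET.tostring(root, encoding="unicode")

# Starting with no config file at all:
print(enable_checkpoint_debug())
```

Once the client has picked up the change, the extra checkpoint messages should appear in the BOINC message log.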
____________
Rosetta Moderator: Mod.Sense

marc

Joined: Feb 20 10
Posts: 1
ID: 371185
Credit: 4,053
RAC: 0
Message 65668 - Posted 28 Mar 2010 22:50:33 UTC - in response to Message ID 64951.

Hi guys,
I sometimes have trouble with the images: there are no images at all.
When I try to open it, Windows 7 keeps trying and trying to open it, then tells me that the program can't be opened, so I have to stop the program.
There are no problems with the tasks, they keep running, but the images are not available.
Marc Denis




This app update includes a fix for checkpointing.

Please report issues and bugs here!

thanks,

DK

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3389
ID: 106194
Credit: 0
RAC: 0
Message 65675 - Posted 29 Mar 2010 16:24:01 UTC

Marc, I'm not clear on what you are saying you are doing. You should open the BOINC Manager, go to the advanced view, go to the tasks tab, select a task with a status of "running", and then click the show graphics button at the left. Then you should expect to see something like what is shown here.

...or are you talking about the screensaver? ...or are you trying to "open it" from Windows explorer? (don't)
____________
Rosetta Moderator: Mod.Sense

Mad_Max

Joined: Dec 31 09
Posts: 150
ID: 365007
Credit: 5,175,766
RAC: 6,775
Message 65679 - Posted 30 Mar 2010 12:30:23 UTC

I think he means that the graphical part of the application (minirosetta_graphics_1.92) hangs.
Yesterday I saw the same bug twice on my computer: the calculation of the tasks goes fine (it seems), but when I click "show graphics" the graphical application does not show anything, and after some time it hangs.
After killing the process "minirosetta_graphics_1.92_windows_intelx86.exe" everything seems fine, but if I click "show graphics" again, it hangs again.
The job type was tedor-cs*. Unfortunately I have no link to a specific WU, because I did not record them, and in the task log they all appear to be successful.

Evan

Joined: Dec 23 05
Posts: 268
ID: 42505
Credit: 402,585
RAC: 0
Message 65680 - Posted 30 Mar 2010 13:39:28 UTC

I have four of the tedor-cs... programs running at the moment on this computer. They all are working perfectly but only three graphs can be activated. The fourth does not respond.
____________

Sid Celery

Joined: Feb 11 08
Posts: 806
ID: 241409
Credit: 10,030,156
RAC: 9,347
Message 65681 - Posted 30 Mar 2010 17:36:37 UTC

Ditto here. I've had several long-running WUs recently and a click on the "Show graphics" button to see how or if it's progressing at all results in no display.

I have one now: tedor-cs_-csonly-1-Alg13__18706_6784_0

Cpu_run_time_pref: 28800 = 8 hours
Elapsed time 10h 6m 20s
CPU time 9h 45m 47s
Last Checkpoint 9h 44m 41s

No graphics. But I have an LR5_cus8 job that displays just fine.
____________

svincent

Joined: Dec 30 05
Posts: 202
ID: 44923
Credit: 4,404,794
RAC: 6,286
Message 65703 - Posted 6 Apr 2010 5:21:31 UTC

A couple of recent failures with an error message I haven't seen before. Mac OS X

329517823 t308__boinc_corebuild_round2_rerun_cstrun2_sel_core_1.5.broker_corebuild_mtyka_IGNORE_THE_REST_19026_913_0

ERROR: Could not find disulfide partner for residue 6
ERROR:: Exit from: src/core/scoring/disulfides/FullatomDisulfideEnergyContainer.cc line: 566
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

</stderr_txt>
]]>

329552087 t322__boinc_corebuild_round2_rerun_cstrun2_withcst_sel_core_1.5.broker_corebuild_mtyka_IGNORE_THE_REST_19063_575_0

ERROR: Could not find disulfide partner for residue 49
ERROR:: Exit from: src/core/scoring/disulfides/FullatomDisulfideEnergyContainer.cc line: 566
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

</stderr_txt>
]]>

Evan

Joined: Dec 23 05
Posts: 268
ID: 42505
Credit: 402,585
RAC: 0
Message 65704 - Posted 6 Apr 2010 20:53:47 UTC

Another disulphide error
t293__boinc_corebuild_round2_rerun_cstrun2_withcst_sel_core_1.5.broker_corebuild_mtyka_IGNORE_THE_REST_19053_181_0
ERROR: Could not find disulfide partner for residue 97
ERROR:: Exit from: ..\..\src\core\scoring\disulfides\FullatomDisulfideEnergyContainer.cc line: 566

A validation error:
t303__boinc_corebuild_round2_rerun_cstrun2_sel_core_1.5.broker_corebuild_mtyka_IGNORE_THE_REST_19023_811_1
The previous run ended in an error
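
The same failure shows up in this thread with both Windows-style and Unix-style source paths, which makes it harder to see that the reports match. Here is a minimal Python sketch for grouping pasted "Exit from" lines by source file and line number. It is not part of any BOINC or Rosetta tooling, and the names `normalize_path` and `error_signatures` are made up for illustration.

```python
import re
from collections import Counter

# Matches lines such as: ERROR:: Exit from: <path> line: <number>
EXIT_RE = re.compile(r"ERROR::\s*Exit from:\s*(\S+)\s+line:\s*(\d+)")

def normalize_path(path):
    # Use forward slashes and drop leading "..", so Windows and
    # Mac/Linux reports of the same source file compare equal.
    parts = path.replace("\\", "/").split("/")
    return "/".join(p for p in parts if p not in ("..", ".", ""))

def error_signatures(stderr_text):
    """Return a list of (source_file, line_number) pairs found in the text."""
    return [
        (normalize_path(m.group(1)), int(m.group(2)))
        for m in EXIT_RE.finditer(stderr_text)
    ]

# Lines copied from reports in this thread.
sample = r"""
ERROR: Could not find disulfide partner for residue 49
ERROR:: Exit from: src/core/scoring/disulfides/FullatomDisulfideEnergyContainer.cc line: 566
ERROR: Could not find disulfide partner for residue 97
ERROR:: Exit from: ..\..\src\core\scoring\disulfides\FullatomDisulfideEnergyContainer.cc line: 566
ERROR: rsd_type_list.size()
ERROR:: Exit from: ..\..\src\core\fragment\Frame.cc line: 62
"""

counts = Counter(error_signatures(sample))
for (source_file, line_no), n in counts.most_common():
    print(f"{n} x {source_file}:{line_no}")
```

With the sample above, the two disulfide reports collapse into one entry even though one came from a Mac and the other from Windows.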


____________

.clair.

Joined: Jan 2 07
Posts: 45
ID: 139198
Credit: 6,166,224
RAC: 4,604
Message 65706 - Posted 7 Apr 2010 21:00:58 UTC
Last modified: 7 Apr 2010 21:04:19 UTC

It is unusual that I get any errors,
I don't know if it was a fault at my end or not,
here are my two :-

t322__boinc_corebuild_round2_rerun_cstrun2_withcst_sel_core_1.5.broker_corebuild_mtyka_IGNORE_THE_REST_19063_460_0
ERROR: Could not find disulfide partner for residue 49
ERROR:: Exit from: src/core/scoring/disulfides/FullatomDisulfideEnergyContainer.cc line: 566


Ross3X3_SAVE_ALL_OUT_relax_t012_18981_82_0
SIGSEGV: segmentation violation
____________

damon

Joined: Sep 3 09
Posts: 1
ID: 342719
Credit: 33,051
RAC: 667
Message 65718 - Posted 9 Apr 2010 23:31:28 UTC

Has anyone run Rosetta on a netbook (Acer AspireONE AOA150-1777)? My specs:

--
CPU type GenuineIntel
Intel(R) Atom(TM) CPU N270 @ 1.60GHz [x86 Family 6 Model 28 Stepping 2]
Number of CPUs 2
Operating System Microsoft Windows XP
Home x86 Edition, Service Pack 3, (05.01.2600.00)
Memory 1523.88 MB
Cache 512 KB
Swap space 4401.18 MB
Total disk space 144.17 GB
Free Disk Space 120.95 GB
Measured floating point speed 689.15 million ops/sec
Measured integer speed 1752.15 million ops/sec
Maximum daily WU quota per CPU 89/day
--
I've been seeing the following Windows error for a few months:
--
Event Type: Information
Event Source: Application Popup
Event Category: None
Event ID: 26
Date: 4/9/2010
Time: 1:19:22 AM
User: N/A
Computer: ULTRAPORT
Description:
Application popup: Microsoft Visual C++ Runtime Library : Runtime Error!

Program: ...kerlab.org_rosetta\minirosetta_2.05_windows_intelx86.exe


This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.


For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
--
Unfortunately, I can't locate the text file that keeps track of what the project was doing at the time of the error. Initially, I thought that the amount of RAM was the problem, so I increased the amount from 1 GB to 1.5 GB. No go.

One of the other symptoms of a problem is the gradual filling of my swapfile with multiple instances of the Rosetta executable shown above. (Windows keeps announcing the resizing of the swapfile. Answering "OK" to each error pop-up doesn't appear to free the space allocated in RAM and the swapfile taken by the problem Rosetta executable instance.)

I have a larger Dell Inspiron B130 that has 2088316K (2 GB) of RAM, is at least three years older than the Netbook, runs only one work unit at a time, but functions smoothly without filling up the swapfile.

Am I asking too much of my Netbook by running Rosetta on it?

Greg_BE Profile

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,969,735
RAC: 81
Message 65746 - Posted 15 Apr 2010 13:38:07 UTC

t293__boinc_corebuild_round2_rerun_cstrun2_sel_core_1.5.broker_corebuild_mtyka_IGNORE_THE_REST_19020_211_0

# cpu_run_time_pref: 14400
ERROR: Could not find disulfide partner for residue 97
ERROR:: Exit from: ..\..\src\core\scoring\disulfides\FullatomDisulfideEnergyContainer.cc line: 566
BOINC:: Error reading and gzipping output datafile: default.out


rosetta2_relax_cst_0.1_cm.loopbuild_threading_hb_t297__IGNORE_THE_REST_19126_320_0

disk space error (I have 23.4 GB free, so there is no way I don't have enough space)


Copyright © 2017 University of Washington

Last Modified: 10 Nov 2010 1:51:38 UTC