Rosetta@home

minirosetta 2.03

Mike Tyka

Joined: Oct 20 05
Posts: 96
ID: 5612
Credit: 2,190
RAC: 0
Message 64478 - Posted 15 Dec 2009 1:22:02 UTC

This version fixes a stack overflow error that we didn't catch in 2.02.

Please post issues here, thanks!


____________
http://beautifulproteins.blogspot.com/
http://www.miketyka.com/

Darmok

Joined: Sep 4 09
Posts: 6
ID: 343146
Credit: 178,639
RAC: 0
Message 64480 - Posted 15 Dec 2009 1:58:04 UTC

Just got this error message on reporting tasks 5 minutes ago:
12/14/2009 8:51:04 PM rosetta@home Message from server: Server error: can't attach shared memory

____________

Bruce Downing

Joined: Jul 19 08
Posts: 16
ID: 269702
Credit: 1,546,335
RAC: 326
Message 64483 - Posted 15 Dec 2009 5:36:41 UTC

Same message here.

MrWizard

Joined: Oct 30 05
Posts: 3
ID: 7716
Credit: 86,494
RAC: 0
Message 64484 - Posted 15 Dec 2009 6:28:20 UTC

Ditto.
____________

mfbabb2

Joined: Oct 10 08
Posts: 4
ID: 283282
Credit: 10,345
RAC: 0
Message 64485 - Posted 15 Dec 2009 6:31:00 UTC

Me 2
____________

Boris K.

Joined: Nov 14 09
Posts: 1
ID: 358410
Credit: 25,435
RAC: 0
Message 64486 - Posted 15 Dec 2009 6:47:13 UTC

Same error message here! Can't report or get new tasks.

robertmiles

Joined: Jun 16 08
Posts: 656
ID: 264600
Credit: 3,462,248
RAC: 2,198
Message 64487 - Posted 15 Dec 2009 7:56:40 UTC

Same message here. Also, your home page seems confused about whether this version is 2.02 again.

Dirk Broer

Joined: Nov 16 05
Posts: 16
ID: 12707
Credit: 1,194,135
RAC: 67
Message 64488 - Posted 15 Dec 2009 8:27:41 UTC

Same here

____________

Telescope Adrian

Joined: Nov 14 06
Posts: 9
ID: 129278
Credit: 1,906,378
RAC: 0
Message 64489 - Posted 15 Dec 2009 8:33:20 UTC - in response to Message ID 64480.

Just got this error message on reporting tasks 5 minutes ago:
12/14/2009 8:51:04 PM rosetta@home Message from server: Server error: can't attach shared memory

Ditto .
____________

Dmitry V. Silaev

Joined: Aug 13 07
Posts: 2
ID: 198075
Credit: 1,426,564
RAC: 0
Message 64490 - Posted 15 Dec 2009 9:04:52 UTC

I have the same message on both Windows- and Debian-based clients. Work is neither being submitted nor requested.

Neon

Joined: May 21 06
Posts: 2
ID: 83709
Credit: 253,387
RAC: 0
Message 64491 - Posted 15 Dec 2009 9:18:05 UTC

The same thing is happening to me ===> Server error: can't attach shared memory
____________

Mark W. Patton

Joined: May 16 09
Posts: 1
ID: 315978
Credit: 845,164
RAC: 0
Message 64492 - Posted 15 Dec 2009 10:37:06 UTC

I have a number of units completed but keep getting the message: Message from server: Can't connect to shared memory.

Mates

Joined: Oct 7 09
Posts: 1
ID: 353325
Credit: 561,355
RAC: 0
Message 64494 - Posted 15 Dec 2009 12:13:20 UTC

I have the same problem:
15.12.2009 13:03:05 rosetta@home
Message from server: Server error: can't attach shared memory
Boinc ver.6.10.18

Evan

Joined: Dec 23 05
Posts: 268
ID: 42505
Credit: 402,585
RAC: 0
Message 64495 - Posted 15 Dec 2009 12:23:30 UTC - in response to Message ID 64494.

I have the same problem:
15.12.2009 13:03:05 rosetta@home
Message from server: Server error: can't attach shared memory
Boinc ver.6.10.18


Yes, we have all got the same message. We will have to wait until the sun rises on the East Coast of America.

____________

Sid Celery

Joined: Feb 11 08
Posts: 796
ID: 241409
Credit: 9,546,016
RAC: 7,460
Message 64498 - Posted 15 Dec 2009 12:40:08 UTC - in response to Message ID 64495.
Last modified: 15 Dec 2009 12:42:35 UTC

Yes, we have all got the same message. We will have to wait until the sun rises on the East Coast of America.

Or even the West Coast, where Washington State is, 3 hours later... UTC -8

...and afterwards a massive logjam as we all try to upload and download further WUs.

At least it isn't Friday or Saturday night!
____________

CarreraGT

Joined: Aug 16 09
Posts: 2
ID: 338320
Credit: 1,585,189
RAC: 0
Message 64499 - Posted 15 Dec 2009 12:42:33 UTC

Hi,

I've tried to update and receive this error: 12/15/2009 7:40:01 AM rosetta@home Message from server: Server error: can't attach shared memory


Evan

Joined: Dec 23 05
Posts: 268
ID: 42505
Credit: 402,585
RAC: 0
Message 64500 - Posted 15 Dec 2009 12:55:31 UTC

Or even the West Coast, where Washington State is, 3 hours later... UTC -8
Geography has never been my strong point!
____________

Aroundomaha

Joined: Sep 11 08
Posts: 14
ID: 278107
Credit: 40,746,712
RAC: 12,824
Message 64502 - Posted 15 Dec 2009 14:51:13 UTC - in response to Message ID 64478.

This version fixes a stack overflow error that we didn't catch in 2.02.

Please post issues here, thanks!


All 5 of my machines are getting the same "Server error: can't attach shared memory" message that the others in this thread are reporting.

____________

Bill G

Joined: Dec 28 07
Posts: 6
ID: 230650
Credit: 1,124,444
RAC: 0
Message 64505 - Posted 15 Dec 2009 14:56:30 UTC

Getting the same problem here on both Vista and Windows 7. All my WUs have finished and they will not upload, nor can I get new ones.
____________

dvr

Joined: May 19 08
Posts: 2
ID: 259774
Credit: 416,591
RAC: 0
Message 64507 - Posted 15 Dec 2009 15:16:51 UTC

15.12.2009 18:09:40 rosetta@home Reporting 16 completed tasks, requesting new tasks for GPU
15.12.2009 18:09:43 rosetta@home Started upload of 1gvp__FRAG_NNMAKE_NNMAKE_BOINC_abrelax.v1_SAVE_ALL_OUT_16311_536_0_0
15.12.2009 18:09:43 rosetta@home Started upload of folditP_4793A_ABRELAX_BOINC_SAVE_ALL_OUT_16522_8507_0_0
15.12.2009 18:09:54 rosetta@home Finished upload of folditP_4793A_ABRELAX_BOINC_SAVE_ALL_OUT_16522_8507_0_0
15.12.2009 18:09:54 rosetta@home Started upload of mix_score13_env_rlbd_4icb__IGNORE_THE_RESTlr10_DECOY_16523_119_0_0
15.12.2009 18:10:07 rosetta@home update requested by user
15.12.2009 18:10:12 rosetta@home Scheduler request completed: got 0 new tasks
15.12.2009 18:10:12 rosetta@home Message from server: Server error: can't attach shared memory


from yesterday.

LogixGeer

Joined: Jan 29 09
Posts: 1
ID: 299134
Credit: 453,984
RAC: 0
Message 64508 - Posted 15 Dec 2009 15:17:25 UTC

Same here:

Message from server: Server error: can't attach shared memory

Kristaps

Joined: Jun 11 07
Posts: 1
ID: 183424
Credit: 81,256
RAC: 0
Message 64509 - Posted 15 Dec 2009 15:29:36 UTC

Tue 15 Dec 2009 05:24:30 PM EET|rosetta@home|Message from server: Server error: can't attach shared memory

Panoramix

Joined: Dec 4 07
Posts: 1
ID: 224197
Credit: 12,620,793
RAC: 1,686
Message 64511 - Posted 15 Dec 2009 16:52:04 UTC

Same here, all computers respond with:
Message from server: Server error: can't attach shared memory

BarryAZ

Joined: Dec 27 05
Posts: 149
ID: 43659
Credit: 28,500,234
RAC: 13,501
Message 64512 - Posted 15 Dec 2009 17:00:06 UTC - in response to Message ID 64511.

OK --so anyone reporting can confirm the error. Haven't seen anything in the way of an acknowledging response from the project though.

Same here, all computers respond with:
Message from server: Server error: can't attach shared memory


____________

MrWizard

Joined: Oct 30 05
Posts: 3
ID: 7716
Credit: 86,494
RAC: 0
Message 64515 - Posted 15 Dec 2009 17:06:03 UTC - in response to Message ID 64512.

OK --so anyone reporting can confirm the error. Haven't seen anything in the way of an acknowledging response from the project though.

Same here, all computers respond with:
Message from server: Server error: can't attach shared memory



It's 9:05am at their location now. Should get some action soon...
____________

Mike Tyka

Joined: Oct 20 05
Posts: 96
ID: 5612
Credit: 2,190
RAC: 0
Message 64516 - Posted 15 Dec 2009 17:07:55 UTC

Hi! 9.07 here now :)

Not sure what the problem is - we're looking into it now. This version
worked fine on RALPH, so we suspect something went awry during the actual update.


____________
http://beautifulproteins.blogspot.com/
http://www.miketyka.com/

Darmok

Joined: Sep 4 09
Posts: 6
ID: 343146
Credit: 178,639
RAC: 0
Message 64517 - Posted 15 Dec 2009 17:08:44 UTC - in response to Message ID 64515.

OK --so anyone reporting can confirm the error. Haven't seen anything in the way of an acknowledging response from the project though.

Same here, all computers respond with:
Message from server: Server error: can't attach shared memory



It's 9:05am at their location now. Should get some action soon...


I was just going to say that. It's now been 15 hours w/out comm. They will let us know soon...

MrWizard

Joined: Oct 30 05
Posts: 3
ID: 7716
Credit: 86,494
RAC: 0
Message 64518 - Posted 15 Dec 2009 17:25:54 UTC - in response to Message ID 64516.

Hi! 9.07 here now :)

Not sure what the problem is - we're looking into it now. This version
worked fine on RALPH, so we suspect something went awry during the actual update.


I'm under the impression it's a server problem not an application problem. Correct me if I'm wrong...
____________

bill brandt-gasuen

Joined: Jun 9 09
Posts: 1
ID: 320372
Credit: 1,341,795
RAC: 0
Message 64520 - Posted 15 Dec 2009 18:05:32 UTC

So how do we remedy this situation? Is there something we can do on our end or do we just sit tight in the rowboat waiting to be rescued? If WUs were passengers, I've got a cruise ship full of passengers that need evacuation! Googling brings to light past similar occurrences where tinkering directly with the WUs solved the problem, but what I'm picking up here doesn't seem to indicate a physical server switch as much as a software issue.

Sid Celery

Joined: Feb 11 08
Posts: 796
ID: 241409
Credit: 9,546,016
RAC: 7,460
Message 64521 - Posted 15 Dec 2009 18:07:39 UTC - in response to Message ID 64518.

I'm under the impression it's a server problem not an application problem. Correct me if I'm wrong...

Sounds correct to me. Every post in this thread has nothing to do with 2.03 yet.

My uploads have all gone through now. Now for the big fight over new WUs!
____________

Mike Tyka

Joined: Oct 20 05
Posts: 96
ID: 5612
Credit: 2,190
RAC: 0
Message 64522 - Posted 15 Dec 2009 18:12:14 UTC

OK, it appears we had too many old applications backlogged. It was indeed a server problem - it should be resolved now :) - sorry for the hiccup.
____________
http://beautifulproteins.blogspot.com/
http://www.miketyka.com/

Aroundomaha

Joined: Sep 11 08
Posts: 14
ID: 278107
Credit: 40,746,712
RAC: 12,824
Message 64525 - Posted 15 Dec 2009 20:31:43 UTC - in response to Message ID 64478.

This version fixes a stack overflow error that we didn't catch in 2.02.

Please post issues here, thanks!


I'm seeing work units rolling in again. Thank you to the Rosetta team for a quick resolution.

Speedy

Joined: Sep 25 05
Posts: 159
ID: 1058
Credit: 507,926
RAC: 0
Message 64527 - Posted 15 Dec 2009 22:59:12 UTC

Is the stack overflow issue the reason why mix_score13_env_rlbd_1hz6__IGNORE_THE_RESTlr8_DECOY_16523_77_09 made 129 decoys from 129 attempts?

____________
Have a crunching good day!!

Sid Celery

Joined: Feb 11 08
Posts: 796
ID: 241409
Credit: 9,546,016
RAC: 7,460
Message 64529 - Posted 16 Dec 2009 1:15:39 UTC

One machine, having run out of work, started a 2.03 WU very quickly and closed just as quickly after just 12 decoys. A validate error, but no error messages within the task details:

mix_score13_hb_rlbd_1shf__IGNORE_THE_RESTlr13_DECOY_16324_352_1
____________

Mike Tyka

Joined: Oct 20 05
Posts: 96
ID: 5612
Credit: 2,190
RAC: 0
Message 64530 - Posted 16 Dec 2009 2:37:52 UTC - in response to Message ID 64527.

Is the stack overflow issue the reason why mix_score13_env_rlbd_1hz6__IGNORE_THE_RESTlr8_DECOY_16523_77_09 made 129 decoys from 129 attempts?


No, but 129 decoys is a good thing, isn't it?

____________
http://beautifulproteins.blogspot.com/
http://www.miketyka.com/

AMD_is_logical

Joined: Dec 20 05
Posts: 299
ID: 41207
Credit: 31,460,681
RAC: 0
Message 64531 - Posted 16 Dec 2009 4:30:22 UTC

This broker_idealclose_kic_in20_hb_t308__IGNORE_THE_REST_16512_810 WU gave an error for both crunchers.

Speedy

Joined: Sep 25 05
Posts: 159
ID: 1058
Credit: 507,926
RAC: 0
Message 64532 - Posted 16 Dec 2009 5:40:52 UTC - in response to Message ID 64530.
Last modified: 16 Dec 2009 5:44:38 UTC



No, but 129 decoys is a good thing, isn't it?

Absolutely. Reason I asked was because I thought there was a 100 decoy limit for tasks that had a high model count. I think it was limited because there were upload problems for tasks that had over 100 decoys.
____________
Have a crunching good day!!

jjwhalen

Joined: Dec 20 06
Posts: 4
ID: 137022
Credit: 399,398
RAC: 0
Message 64537 - Posted 16 Dec 2009 18:34:18 UTC

(Hint) It sure would be great if someone from project administration would comment in this forum about this issue, even if just to say "we're looking at the problem."
____________
Best wishes:)

svincent

Joined: Dec 30 05
Posts: 202
ID: 44923
Credit: 4,102,500
RAC: 5,735
Message 64538 - Posted 16 Dec 2009 19:24:34 UTC

304932456 (lr8_combine_smooth_torsion_it00_rama02_A_rlbd_2hng_IGNORE_THE_REST_DECOY_14887_678_2) failed on Mac OS X 10.6

Unpacking data: ../../projects/boinc.bakerlab.org_rosetta/yfsong_lr8_combine_smooth_torsion_it00_rama02_A.zip
Unpacking data: ../../projects/boinc.bakerlab.org_rosetta/lr8_2hng.out.zip
Setting database description ...
Setting up checkpointing ...
Setting up graphics native ...
std::cerr: Exception was thrown:
failure to read decoy F_00003_0004346_0 from silent-file lr8_2hng.out

</stderr_txt>
]]>

Caius Corp.

Joined: Dec 10 05
Posts: 1
ID: 34389
Credit: 242,480
RAC: 0
Message 64539 - Posted 16 Dec 2009 20:30:26 UTC

I get an issue on this workunit during CPU benchmarks:

mer. 16 déc. 2009 19:57:17 CET||Running CPU benchmarks
mer. 16 déc. 2009 19:57:17 CET||Suspending computation - running CPU benchmarks
mer. 16 déc. 2009 19:57:28 CET|rosetta@home|Task mix_score13_env_rlbd_2apb__IGNORE_THE_RESTlr10_DECOY_16523_731_0: no shared memory segment
mer. 16 déc. 2009 19:57:28 CET|rosetta@home|Task mix_score13_env_rlbd_2apb__IGNORE_THE_RESTlr10_DECOY_16523_731_0 exited with zero status but no 'finished' file
mer. 16 déc. 2009 19:57:28 CET|rosetta@home|If this happens repeatedly you may need to reset the project.
mer. 16 déc. 2009 19:57:49 CET||Benchmark results:
mer. 16 déc. 2009 19:57:49 CET|| Number of CPUs: 2
mer. 16 déc. 2009 19:57:49 CET|| 2217 floating point MIPS (Whetstone) per CPU
mer. 16 déc. 2009 19:57:49 CET|| 6149 integer MIPS (Dhrystone) per CPU
mer. 16 déc. 2009 19:57:50 CET||Resuming computation


But now the task is still running well on Rosetta mini 2.03, no other message.

robertmiles

Joined: Jun 16 08
Posts: 656
ID: 264600
Credit: 3,462,248
RAC: 2,198
Message 64547 - Posted 20 Dec 2009 1:34:51 UTC - in response to Message ID 64539.
Last modified: 20 Dec 2009 1:38:42 UTC

I get an issue on this workunit during CPU benchmarks:
mer. 16 déc. 2009 19:57:17 CET||Running CPU benchmarks
mer. 16 déc. 2009 19:57:17 CET||Suspending computation - running CPU benchmarks
mer. 16 déc. 2009 19:57:28 CET|rosetta@home|Task mix_score13_env_rlbd_2apb__IGNORE_THE_RESTlr10_DECOY_16523_731_0: no shared memory segment
mer. 16 déc. 2009 19:57:28 CET|rosetta@home|Task mix_score13_env_rlbd_2apb__IGNORE_THE_RESTlr10_DECOY_16523_731_0 exited with zero status but no 'finished' file
mer. 16 déc. 2009 19:57:28 CET|rosetta@home|If this happens repeatedly you may need to reset the project.
mer. 16 déc. 2009 19:57:49 CET||Benchmark results:
mer. 16 déc. 2009 19:57:49 CET|| Number of CPUs: 2
mer. 16 déc. 2009 19:57:49 CET|| 2217 floating point MIPS (Whetstone) per CPU
mer. 16 déc. 2009 19:57:49 CET|| 6149 integer MIPS (Dhrystone) per CPU
mer. 16 déc. 2009 19:57:50 CET||Resuming computation



But now the task is still running well on Rosetta mini 2.03, no other message.


To me, this one looks like SOMETHING won't run correctly during the CPU benchmarks, but the application is able to recover afterwards. Since the CPU benchmarks aren't run very often, this problem won't be seen very often either.

svincent

Joined: Dec 30 05
Posts: 202
ID: 44923
Credit: 4,102,500
RAC: 5,735
Message 64554 - Posted 20 Dec 2009 17:58:31 UTC

broker_idealclose_kic_in20_hb_t312__IGNORE_THE_REST_16513_879_0, task 305259567, failed on Windows 7 with a Compute Error after about an hour.

Setting up checkpointing ...
Setting up graphics native ...
FNAME: native.pdb
FNAME: ss_core_native.pdb
FNAME: ss_core_native_radical.pdb
FNAME: native_notails.pdb
FNAME: native.pdb
BOINC:: Worker startup.
Starting watchdog...
Watchdog active.
CLOSING with IDEALIZATION
CLOSING with IDEALIZATION
CLOSING with IDEALIZATION
CLOSING with IDEALIZATION


Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x5B202020 read attempt to address 0x5B202020

Engaging BOINC Windows Runtime Debugger...

Followed by pages of W7 debug info

SekeRob

Joined: Sep 7 06
Posts: 35
ID: 110407
Credit: 8,731
RAC: 391
Message 64562 - Posted 21 Dec 2009 11:34:09 UTC - in response to Message ID 64554.
Last modified: 21 Dec 2009 12:24:58 UTC

Just to try it out, I set a 1 hour run time preference in my home profile [because Rosetta is on a small share], saved, and received 2 tasks with about a 1 hour run time, but now this one is running:

21/12/2009 12:23:40 rosetta@home [checkpoint_debug] result broker_idealclose_hb_t293__IGNORE_THE_REST_16362_82629_0 checkpointed

1:25 CPU time, 1:26 elapsed, and at 28 percent. It's checkpointing regularly, so I don't consider this a bad task. Why this long one in between? Mini 2.03 release.

PS: The first 2 validated; this long one is now Pending Validation at 2.11 hours... twice as long as specified.
____________
Coelum Non Animum Mutant, Qui Trans Mare Currunt

robertmiles

Joined: Jun 16 08
Posts: 656
ID: 264600
Credit: 3,462,248
RAC: 2,198
Message 64563 - Posted 21 Dec 2009 13:30:44 UTC - in response to Message ID 64562.

Just to try it out, I set a 1 hour run time preference in my home profile [because Rosetta is on a small share], saved, and received 2 tasks with about a 1 hour run time, but now this one is running:

21/12/2009 12:23:40 rosetta@home [checkpoint_debug] result broker_idealclose_hb_t293__IGNORE_THE_REST_16362_82629_0 checkpointed

1:25 CPU time, 1:26 elapsed, and at 28 percent. It's checkpointing regularly, so I don't consider this a bad task. Why this long one in between? Mini 2.03 release.

PS: The first 2 validated; this long one is now Pending Validation at 2.11 hours... twice as long as specified.


At least partly because the test for whether to end the task normally occurs only at the end of a decoy, so if the last decoy it started runs significantly longer than expected, you'd be likely to exceed your time preference.

Also, I remember some discussion about setting the minimum allowed time preference longer than one hour, so you might want to check for signs that it's actually now set longer.
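
The end-of-decoy check described above can be sketched in a few lines. This is only a toy simulation, not Rosetta's actual code: the function name `run_task` and the decoy durations are made up for illustration, and the one assumption carried over from the explanation is that the run-time test happens between decoys, never in the middle of one.

```python
def run_task(decoy_durations, target_seconds):
    """Toy model: the run-time check happens only between decoys.

    decoy_durations: hypothetical CPU seconds each decoy would take.
    Returns (decoys_completed, total_cpu_seconds).
    """
    elapsed = 0.0
    completed = 0
    for duration in decoy_durations:
        # The check runs before starting a decoy, never in the middle of one.
        if elapsed >= target_seconds:
            break
        elapsed += duration
        completed += 1
    return completed, elapsed


# Two short decoys followed by one slow one, with a 1 hour preference.
decoys, cpu = run_task([1500, 1500, 4600, 1500], target_seconds=3600)
print(decoys, cpu)
```

With these made-up durations, the third decoy starts at 3,000 seconds (still under the preference) and is allowed to finish, so the task ends well past the one hour mark.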

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 64564 - Posted 21 Dec 2009 13:49:17 UTC

I don't believe any change in minimum runtime was made, robert.

But the minimum amount of useful work is one model (or "decoy"). If that takes longer than an hour, then that does happen and is normal for this project. The watchdog is still there keeping everyone in line if need be.

With such a short runtime you should expect the % completed to vary widely. BOINC's expectations and Rosetta's estimates will have trouble settling in when one task completed in 45 min. and the next in 2 hours.
____________
Rosetta Moderator: Mod.Sense

SekeRob

Joined: Sep 7 06
Posts: 35
ID: 110407
Credit: 8,731
RAC: 391
Message 64566 - Posted 21 Dec 2009 14:43:12 UTC - in response to Message ID 64564.
Last modified: 21 Dec 2009 14:57:49 UTC

The present test standing is:

Task ID    Work unit ID  Sent                      Time reported             Server state  Outcome  Client state  CPU time (sec)  Claimed credit  Granted credit
306433095  279395159     21 Dec 2009 13:45:43 UTC  21 Dec 2009 14:53:27 UTC  Over          Success  Done          3,366.56        17.47           15.56
306421142  279383866     21 Dec 2009 12:39:42 UTC  21 Dec 2009 13:49:54 UTC  Over          Success  Done          3,503.61        18.18           15.74
306417911  279381434     21 Dec 2009 12:22:40 UTC  21 Dec 2009 13:33:02 UTC  Over          Success  Done          3,542.00        18.38           16.13
306408510  279371956     21 Dec 2009 11:33:58 UTC  21 Dec 2009 12:43:56 UTC  Over          Success  Done          3,594.54        18.65           16.79
306393079  279357285     21 Dec 2009 10:11:58 UTC  21 Dec 2009 12:26:53 UTC  Over          Success  Done          7,661.61        39.76           40.86
306381783  279347432     21 Dec 2009 9:11:43 UTC   21 Dec 2009 11:38:12 UTC  Over          Success  Done          3,444.83        17.88           15.05
306365927  279333133     21 Dec 2009 7:44:11 UTC   21 Dec 2009 9:07:33 UTC   Over          Success  Done          3,497.34        18.15           16.63

Looks like it's pretty well figured out that an hour is an hour most of the time. I do appreciate that there is a non-deterministic element and if just incidental, a good project to act as filler when on a shutdown schedule. From 4 o'clock it's power-off.
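
As a quick check of that claim, the CPU times in the listing above (the values such as 3,366.56, assumed here to be CPU seconds) can be compared against the 3,600 second preference. The 10% tolerance and the helper name `within_target` are arbitrary choices for illustration only.

```python
# CPU times copied from the task listing, assumed to be in seconds.
cpu_seconds = [3366.56, 3503.61, 3542.00, 3594.54, 7661.61, 3444.83, 3497.34]
target = 3600.0    # 1 hour run time preference
tolerance = 0.10   # arbitrary: "close enough" means within 10% of target


def within_target(seconds, target, tolerance):
    """True if a run time is within the given fraction of the target."""
    return abs(seconds - target) <= tolerance * target


on_target = [s for s in cpu_seconds if within_target(s, target, tolerance)]
outliers = [s for s in cpu_seconds if not within_target(s, target, tolerance)]

print(f"{len(on_target)} of {len(cpu_seconds)} tasks within 10% of target")
print("outliers:", outliers)
```

Under that tolerance, six of the seven tasks land near the preference and only the 7,661.61 second task stands out.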
____________
Coelum Non Animum Mutant, Qui Trans Mare Currunt

svincent

Joined: Dec 30 05
Posts: 202
ID: 44923
Credit: 4,102,500
RAC: 5,735
Message 64850 - Posted 7 Jan 2010 18:32:45 UTC

Workunit 281289676 failed on Windows 7: it appeared to hang (not using processor time) and had to be aborted. It was successfully completed by a wingman on XP.

AdeB

Joined: Dec 12 06
Posts: 45
ID: 135244
Credit: 2,358,915
RAC: 2,105
Message 64868 - Posted 9 Jan 2010 3:25:16 UTC

Task: 309276026
Workunit: homopt_nat2.t370_.t370_.IGNORE_THE_REST.S_00003_0000009_04.pdb_00003.pdb.JOB_16836_1
stderr out:

...
ERROR: [ERROR] Error opening RBSeg file 'S_00011_0000013_0_0_00060.pdb_00029.pdb_00011.pdb.loopfile'
ERROR:: Exit from: src/protocols/loops/Loops.cc line: 483
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish


AdeB
____________

coturnix

Joined: Oct 8 09
Posts: 4
ID: 353496
Credit: 729,171
RAC: 0
Message 64879 - Posted 9 Jan 2010 19:21:58 UTC

Task: 309472876
Workunit: homopt_nat2.t368_.t368_.IGNORE_THE_REST.S_00002_0000023_04.pdb_00002.pdb.JOB_16835_15

ERROR: No values of the appropriate type specified for multi-valued option -loops:loop_file

svincent

Joined: Dec 30 05
Posts: 202
ID: 44923
Credit: 4,102,500
RAC: 5,735
Message 64883 - Posted 9 Jan 2010 22:33:23 UTC

Same error as AdeB reported

ERROR: [ERROR] Error opening RBSeg file 'native_0001_2.pdb.loopfile'
ERROR:: Exit from: src/protocols/loops/Loops.cc line: 483
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

</stderr_txt>

Task 309256017

Mac OS X10.6

P . P . L .

Joined: Aug 20 06
Posts: 581
ID: 105843
Credit: 4,864,105
RAC: 0
Message 64886 - Posted 10 Jan 2010 3:05:39 UTC

Had these two errors within seconds of each other.

homopt_cstmc_1.t308_.t308_.IGNORE_THE_REST.S_00002_0000618_00037.pdb.JOB_16846_1

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=282054734

ERROR: [ERROR] Unable to open constraints file: /work/tex/projects/cm/benchmark/cross_filt/t308_/t308_.aln_list_mike_chosen_bestaln.alns.combined.csts
ERROR:: Exit from: src/core/scoring/constraints/ConstraintIO.cc line: 332
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

=======================================================================================

homopt_nat2.t312_.t312_.IGNORE_THE_REST.S_00022_0000017_04.pdb_00022.pdb.JOB_16828_5

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=282085871

ERROR: [ERROR] Error opening RBSeg file 'S_00002_0000020_0_0_030.pdb_00002.pdb_00002.pdb.loopfile'
ERROR:: Exit from: src/protocols/loops/Loops.cc line: 483
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

____________


Mad_Max

Joined: Dec 31 09
Posts: 150
ID: 365007
Credit: 4,704,855
RAC: 9,235
Message 64892 - Posted 10 Jan 2010 14:16:16 UTC
Last modified: 10 Jan 2010 14:23:18 UTC

It seems that some of Rosetta's WUs do not checkpoint at all, or do it incorrectly.
For example, with WUs named "ha_notyr_...", "CPU time at last checkpoint" stays at "---" (none) even after several hours of computing. If I restart or shut down the computer (or just the BOINC client) while such a WU is running, all results are lost and after the restart the computation starts from the very beginning.
Here are examples of such tasks:
http://boinc.bakerlab.org/rosetta/result.php?resultid=308985993
http://boinc.bakerlab.org/rosetta/result.php?resultid=309233711

And this is how they look in BOINC Manager:


Some other tasks report checkpoints but apparently do not actually make them (or, more likely, make them but do not use them after a restart). It looks like this: "CPU time at last checkpoint" looks correct (only a few minutes less than the total CPU time), BUT before exiting BOINC I look at "Show graphics" and, say, it shows that 38 models have already been calculated. After the restart, the model count starts again from 0, 1, 2 and so on. The task then reports fewer models to the server than had been computed before the restart (for example only 20), apparently only those computed after the last restart.

It is a significant problem for me, because I NEED to restart this computer from time to time, and it is always shut down when I sleep (because of noise).
The same thing also seems to happen when BOINC automatically switches between projects, if it is running several projects at once (for example R@H and E@H).

For now my workaround has been to set a small "CPU target time" (currently 2 hours instead of the 10 I used earlier) to keep the possible loss of useful calculations to a minimum.
But I think that is only a partial solution and far from the best one...

P.S.
I am running minirosetta 2.03, BOINC 6.10.18 on Windows XP SP3.
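
For illustration, the behaviour one would expect from working checkpoints (the model count surviving a restart) can be sketched as follows. This is a generic sketch and not minirosetta's implementation: the file name `checkpoint.json`, the `models_done` field, and the helper names are all hypothetical.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the saved model count, or 0 if no checkpoint exists."""
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return json.load(f)["models_done"]


def save_checkpoint(path, models_done):
    """Write to a temporary file, then rename it over the old checkpoint,
    so an interrupted write cannot leave a half-written file behind."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump({"models_done": models_done}, f)
    os.replace(tmp_path, path)


def run(path, models_this_session):
    """Resume from the saved count instead of restarting at zero."""
    done = load_checkpoint(path)
    for _ in range(models_this_session):
        done += 1  # stand-in for computing one model
        save_checkpoint(path, done)
    return done


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        ckpt = os.path.join(workdir, "checkpoint.json")
        run(ckpt, 38)         # first session, then the client is restarted
        print(run(ckpt, 20))  # second session continues from the saved count
```

In this sketch the second session picks up at 38 models rather than 0, which is the opposite of what is described above.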

Snagletooth

Joined: Feb 22 07
Posts: 192
ID: 149031
Credit: 1,396,123
RAC: 1,318
Message 64898 - Posted 10 Jan 2010 16:04:51 UTC

9gbnnotyr_3gbn_2bk8_9Jan2010_16860_5_0
Completed successfully... 2713 models! It ran for the entire 10 hour target run time without stopping despite having a 60 minute switch interval, plenty of work on board from other projects and a STD and resource share which leads most 10 hour rosetta units to break at least once, usually twice before finishing up. I don't have checkpoint flags enabled and I'm running BOINC 6.2.18 so I have no information on checkpoints. I posted a similar report (more than 100 models, no switching) about a different type of WU over on Ralph.

Snags

svincent

Joined: Dec 30 05
Posts: 202
ID: 44923
Credit: 4,102,500
RAC: 5,735
Message 64899 - Posted 10 Jan 2010 17:17:47 UTC

These homopt_nat2.t* models are causing real problems.

On Mac I get a bunch erroring out immediately, e.g.

Task 309606751
Task 309641678

gave this same
ERROR: [ERROR] Error opening RBSeg file 'S_00002_0000022_0_0_00009.pdb_00001.pdb_00002.pdb.loopfile'
ERROR:: Exit from: src/protocols/loops/Loops.cc line: 483
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

while Task 309628562 (http://boinc.bakerlab.org/rosetta/result.php?resultid=309628562)

failed like this

Options::initialize() End reached
ERROR: No values of the appropriate type specified for multi-valued option -loops:loop_file

On Windows 7 I still get a bunch that have to be aborted as they're hanging but not taking up CPU time

Task 308815202 (http://boinc.bakerlab.org/rosetta/result.php?resultid=308815202)
Task 308815003 (http://boinc.bakerlab.org/rosetta/result.php?resultid=308815003)
Task 308586455 (http://boinc.bakerlab.org/rosetta/result.php?resultid=308586455)



Sid Celery

Joined: Feb 11 08
Posts: 796
ID: 241409
Credit: 9,546,016
RAC: 7,460
Message 64902 - Posted 10 Jan 2010 19:24:33 UTC
Last modified: 10 Jan 2010 19:41:50 UTC

A disappointing validate error on my W7 laptop:
ha_notyr_3gbn_2hpj_6Jan2010_16806_6_1

And a Compute Error on my Vista desktop:
ha_notyr_3gbn_1.gz_6Jan2010_16806_2_1

ERROR: data
ERROR:: Exit from: ..\..\src\protocols\ProteinInterfaceDesign\read_patchdock.cc line: 70
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish


And another 2 for the laptop:
homopt4.t367_.t367_.IGNORE_THE_REST.S_00006_0000013_0_0_00086.pdb_00004.pdb_00006.pdb.JOB_16819_17_1
homopt4.t328_.t328_.IGNORE_THE_REST.S_00001_0000007_0_0_0_0020.pdb_00001.pdb_00001.pdb.JOB_16816_23_1
Both showing:
ERROR: No values of the appropriate type specified for multi-valued option -loops:loop_file

____________

Sarel

Joined: May 11 06
Posts: 51
ID: 81994
Credit: 81,712
RAC: 0
Message 64903 - Posted 10 Jan 2010 20:32:27 UTC - in response to Message ID 64902.

Thanks, Sid! The error you report in ha_notyr_3gbn_1.gz_6Jan2010_16806_2_1 is due to some scripting bug, which I've now fixed. Thanks for the bug report! Sarel.

A disappointing validate error on my W7 laptop:
ha_notyr_3gbn_2hpj_6Jan2010_16806_6_1

And a Compute Error on my Vista desktop:
ha_notyr_3gbn_1.gz_6Jan2010_16806_2_1
ERROR: data
ERROR:: Exit from: ..\..\src\protocols\ProteinInterfaceDesign\read_patchdock.cc line: 70
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish


And another 2 for the laptop:
homopt4.t367_.t367_.IGNORE_THE_REST.S_00006_0000013_0_0_00086.pdb_00004.pdb_00006.pdb.JOB_16819_17_1
homopt4.t328_.t328_.IGNORE_THE_REST.S_00001_0000007_0_0_0_0020.pdb_00001.pdb_00001.pdb.JOB_16816_23_1
Both showing:
ERROR: No values of the appropriate type specified for multi-valued option -loops:loop_file


____________

P . P . L .

Joined: Aug 20 06
Posts: 581
ID: 105843
Credit: 4,864,105
RAC: 0
Message 64904 - Posted 10 Jan 2010 21:05:56 UTC
Last modified: 10 Jan 2010 21:19:30 UTC

Two more overnight: one errored, the other was odd and I'm not impressed.
---------------------------------------------------------------------
homopt4.t290_.t290_.IGNORE_THE_REST.S_00002_000.pdb.JOB_16809_13_0.

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=282226265

ERROR: [ERROR] Error opening RBSeg file 'S_00001_0000038_04.pdb_00001.pdb.loopfile'
ERROR:: Exit from: src/protocols/loops/Loops.cc line: 483
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish

=====================================================================
This ran for over 8 hrs non-stop and didn't let other tasks run; the last model
seems to have taken four hours. My run time is set at 4 hrs.

ha_notyr_3gbn_2oeb_8Jan2010_16808_18_0

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=282041421

BOINC:: CPU time: 29392.7s, 14400s + 14400s[2010- 1-10 18:36:45:] :: BOINC
InternalDecoyCount: 87
======================================================
DONE :: 2 starting structures 29392.7 cpu seconds
This process generated 87 decoys from 87 attempts
======================================================
called boinc_finish

Over__Success__Done__29,394.75__69.44__9.95
____________


Sarel

Joined: May 11 06
Posts: 51
ID: 81994
Credit: 81,712
RAC: 0
Message 64905 - Posted 10 Jan 2010 21:10:33 UTC - in response to Message ID 64892.

Hello,

Thanks for your comments! In this sort of WU trajectories are typically not very long, but successful trajectories are extremely rare (credit is allocated per time spent computing, so you get credit for the work regardless of success). Accordingly, don't worry about turning off the computer in the middle of one of these trajectories. The most you're going to lose is a couple of minutes of computation.

Thanks, Sarel.

It seems that some of Rosetta's WUs do not checkpoint at all, or do it incorrectly.
For example, with WUs named "ha_notyr_...", "CPU time at last checkpoint" stays at "---" (none) even after several hours of computing. If I restart or shut down the computer (or just the BOINC client) while such a WU is running, all results are lost and after the restart the computation starts from the very beginning.
Here are examples of such tasks:
http://boinc.bakerlab.org/rosetta/result.php?resultid=308985993
http://boinc.bakerlab.org/rosetta/result.php?resultid=309233711

And this is how they look in BOINC Manager:


Some other tasks report checkpoints but apparently do not actually make them (or, more likely, make them but do not use them after a restart). It looks like this: "CPU time at last checkpoint" looks correct (only a few minutes less than the total CPU time), BUT before exiting BOINC I look at "Show graphics" and, say, it shows that 38 models have already been calculated. After the restart, the model count starts again from 0, 1, 2 and so on. The task then reports fewer models to the server than had been computed before the restart (for example only 20), apparently only those computed after the last restart.

It is a significant problem for me, because I NEED to restart this computer from time to time, and it is always shut down when I sleep (because of noise).
The same thing also seems to happen when BOINC automatically switches between projects, if it is running several projects at once (for example R@H and E@H).

For now my workaround has been to set a small "CPU target time" (currently 2 hours instead of the 10 I used earlier) to keep the possible loss of useful calculations to a minimum.
But I think that is only a partial solution and far from the best one...

P.S.
I am running minirosetta 2.03, BOINC 6.10.18 on Windows XP SP3.


____________

frederick corse

Joined: Oct 7 05
Posts: 10
ID: 3142
Credit: 1,496,201
RAC: 175
Message 64906 - Posted 10 Jan 2010 22:07:30 UTC

Hello


I ran 9gbnnotyr_3gbn_3bfm_9jan2010_16860_ 11. It ran for 14399.07 secs and had 844 decoys; the most I had ever seen was 100.





regards
____________

Sarel Profile

Joined: May 11 06
Posts: 51
ID: 81994
Credit: 81,712
RAC: 0
Message 64908 - Posted 11 Jan 2010 2:21:26 UTC - in response to Message ID 64906.

That's right. Again, this is a different sort of simulation (see http://boinc.bakerlab.org/forum_thread.php?id=4477&nowrap=true#64838 for details).

In these runs many of the trajectories are cut short early on because they are unlikely to yield useful results. Credit for runs is allocated for computational time and we need to know how many times simulations were started on your computers and those are reported as decoys. The amount of information that is sent back to our servers per triaged trajectory is very small though to limit bandwidth loss.

Hello


I ran 9gbnnotyr_3gbn_3bfm_9jan2010_16860_ 11. It ran for 14399.07 secs and had 844 decoys; the most I had ever seen was 100.





regards


____________

P . P . L .

Joined: Aug 20 06
Posts: 581
ID: 105843
Credit: 4,864,105
RAC: 0
Message 64920 - Posted 11 Jan 2010 20:41:48 UTC

This died last night, same as others.

homopt4.t328_.t328_.IGNORE_THE_REST.S_00002_0000009_0_0_0_0001.pdb_00004.pdb_00002.pdb.JOB_16816_14_0

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=282247578

ERROR: [ERROR] Error opening RBSeg file 'S_00001_0000002_07.pdb_00001.pdb.loopfile'
ERROR:: Exit from: src/protocols/loops/Loops.cc line: 483
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish



____________


Sarel Profile

Joined: May 11 06
Posts: 51
ID: 81994
Credit: 81,712
RAC: 0
Message 64923 - Posted 11 Jan 2010 23:20:01 UTC - in response to Message ID 64905.

Hi again,

David Kim and I have tracked down this problem and I'm going to test a fix to it in the upcoming release. The problem was that per-decoy checkpointing was not on in this batch of simulations. When I mentioned that these protocols do not need checkpointing I only meant within-trajectory checkpointing.
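To illustrate the distinction, here is a minimal sketch (not Rosetta's actual code). With per-decoy checkpointing, each finished decoy is recorded so that a restart resumes from the count already completed; only the short trajectory in progress has to be rerun. The checkpoint file name, its JSON layout, and the function names are hypothetical.

```python
import json
import os
import tempfile

def load_checkpoint(path):
    """Return the number of decoys already completed, or 0 if no checkpoint exists."""
    if not os.path.exists(path):
        return 0
    with open(path) as fh:
        return json.load(fh)["decoys_done"]

def save_checkpoint(path, decoys_done):
    """Write to a temporary file first, then rename, so a crash cannot leave a half-written checkpoint."""
    tmp = path + ".tmp"
    with open(tmp, "w") as fh:
        json.dump({"decoys_done": decoys_done}, fh)
    os.replace(tmp, path)

def run_decoys(path, total, stop_after=None):
    """Resume from the last checkpoint; stop_after simulates a shutdown after that many decoys."""
    done = load_checkpoint(path)
    ran = 0
    while done < total:
        if stop_after is not None and ran >= stop_after:
            break
        # One short trajectory would run here (placeholder, nothing is computed).
        done += 1
        ran += 1
        save_checkpoint(path, done)  # per-decoy checkpoint
    return done

with tempfile.TemporaryDirectory() as workdir:
    ckpt = os.path.join(workdir, "decoys.json")
    before_restart = run_decoys(ckpt, total=10, stop_after=4)  # interrupted after 4 decoys
    after_restart = run_decoys(ckpt, total=10)                 # resumes at decoy 5
print(before_restart, after_restart)
```

Without the save_checkpoint call, the second run would start again from zero, which matches the symptom reported for this batch.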

For the time being, I've stopped sending out this type of simulation, though over the next few days your computers might still work on them as quite a few have already been sent out. To assure you, the results of these simulations are certainly useful to us and in most cases credit will be allocated correctly.

Thanks a lot for sending specific comments that allowed us to figure this out!

Hello,

Thanks for your comments! In this sort of WU trajectories are typically not very long, but successful trajectories are extremely rare (credit is allocated per time spent computing, so you get credit for the work regardless of success). Accordingly, don't worry about turning off the computer in the middle of one of these trajectories. The most you're going to lose is a couple of minutes of computation.

Thanks, Sarel.

It seems that some of Rosetta's WUs do not checkpoint at all, or do it incorrectly.
For example, for WUs named "ha_notyr_...", "CPU time at last checkpoint" stays at "---" (none) even after several hours of computing. If I restart or shut down the computer (or just the BOINC client) while such a WU is running, all results are lost and the computation starts again from the very beginning.
Here are examples of such tasks:
http://boinc.bakerlab.org/rosetta/result.php?resultid=308985993
http://boinc.bakerlab.org/rosetta/result.php?resultid=309233711

This is how they look in BOINC Manager:


Some other tasks report checkpoints but apparently do not actually write them (or, more likely, write them but do not use them after a restart). It looks like this: "CPU time at last checkpoint" looks correct (only a few minutes less than the total CPU time), BUT before exiting BOINC I check "Show graphics", which shows, say, that 38 models have already been calculated. After a restart, the model count starts again from 0, 1, 2 and so on. The report sent to the server then lists fewer models than had been calculated before the restart (for example, only 20), apparently only those computed after the last restart.

This is a significant problem for me because I NEED to restart this computer from time to time, and it is always shut down while I sleep (because of the noise).
Most likely the same thing also happens on automatic switching between projects when BOINC runs several projects at once (for example R@H and E@H).

For now I have worked around it by setting a small "CPU target time" (currently 2 hours instead of the 10 I used earlier) to keep possible losses of useful calculations to a minimum.
But I think this is only a partial fix and far from the best solution...

P.S.
I am running minirosetta 2.03, BOINC 6.10.18 on Windows XP SP3.



____________

P . P . L .

Joined: Aug 20 06
Posts: 581
ID: 105843
Credit: 4,864,105
RAC: 0
Message 64925 - Posted 12 Jan 2010 5:15:01 UTC
Last modified: 12 Jan 2010 5:47:01 UTC

I don't know if this is a validator problem or a problem with the task. Any ideas?

Edit: It ran for over 4 hrs nonstop to finish.

9gbnnotyr_3gbn_2hxm_9Jan2010_16880_35_0

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=282555003

# cpu_run_time_pref: 14400
======================================================
DONE :: 27 starting structures 15475.2 cpu seconds
This process generated 27 decoys from 27 attempts
======================================================

BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down cleanly ...
called boinc_finish

</stderr_txt>

Over__Validate error__Done__15,475.75
____________


Sid Celery

Joined: Feb 11 08
Posts: 796
ID: 241409
Credit: 9,546,016
RAC: 7,460
Message 64926 - Posted 12 Jan 2010 7:34:40 UTC

More of the previously reported errors here on what's usually a very reliable error-free machine.

One new odd one though, relating to credits rather than anything else:

9gbnnotyr_3gbn_2p8g_9Jan2010_16860_4_0

Outcome Success
Client state Done
Exit status 0 (0x0)

<core_client_version>6.10.18</core_client_version>

# cpu_run_time_pref: 28800
======================================================
DONE :: 10 starting structures 28348.9 cpu seconds
This process generated 10 decoys from 10 attempts
======================================================

BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down cleanly ...
called boinc_finish

Claimed credit 133.53157877111
Granted credit 1.50234135281901
application version 2.03

Generally my granted credit on this W7 laptop is close to the claimed credit, with the occasional one being 30% less or 50% more, but 99% less seems very odd. Any ideas, or just a one-off?
____________

Admin

Joined: Apr 13 07
Posts: 42
ID: 164784
Credit: 260,782
RAC: 0
Message 64927 - Posted 12 Jan 2010 8:22:10 UTC

I've been having a slight issue with WUs freezing on my computer for the last few days. Most seem fine, but there have been a few odd ones lately. Most are Homopt WUs, and now I'm having problems with boinc_filtered_loopbuild_threading (the 2nd one that has frozen). They seem to get to 20-70%, then for some reason stop at some point and just tick away with the process sitting idle. I've chosen to manually abort these; has anyone else been having this issue? Also, when I go to "show graphics" the graphics window freezes, which forces me to kill the process. I don't want to keep aborting these WUs, but there doesn't seem to be anything else I can do... Can anyone help me out?

Admin

Joined: Apr 13 07
Posts: 42
ID: 164784
Credit: 260,782
RAC: 0
Message 64937 - Posted 12 Jan 2010 19:29:24 UTC

Just downloaded 2 more WUs of the same type and they are stuck at 8 and 9%... Can anyone tell me what's going on?

Mod.Sense
Forum moderator
Project administrator

Joined: Aug 22 06
Posts: 3381
ID: 106194
Credit: 0
RAC: 0
Message 64938 - Posted 12 Jan 2010 20:19:12 UTC

Admin, I've not heard of such problems until just the past few days. I've emailed the Project Team asking them to look into it.

Do you spot any pattern in the WU names that are working vs those hanging up? Your profile looks like you are running Win7.
____________
Rosetta Moderator: Mod.Sense

svincent

Joined: Dec 30 05
Posts: 202
ID: 44923
Credit: 4,102,500
RAC: 5,735
Message 64940 - Posted 12 Jan 2010 20:25:17 UTC - in response to Message ID 64938.

Admin, I've not heard of such problems until just the past few days. I've emailed the Project Team asking them to look into it.

Do you spot any pattern in the WU names that are working vs those hanging up? Your profile looks like you are running Win7.


Admin, I see you have one computer and it's running Windows 7. I've had identical problems with R@h tasks running under this OS, some of which I've reported above. There seems to be no common pattern to the tasks that have to be aborted: given two tasks with names identical apart from the digits at the end, one may complete successfully while the other has to be aborted. For the tasks I've looked at, though, it always seems that the task gets completed successfully by a wingman running under a different OS.

Admin

Joined: Apr 13 07
Posts: 42
ID: 164784
Credit: 260,782
RAC: 0
Message 64941 - Posted 12 Jan 2010 20:29:53 UTC
Last modified: 12 Jan 2010 20:31:06 UTC

It's really random, so I can't quite say which ones work, but I've listed the ones that don't work for me above. If you check the tasks I've aborted, those are the WUs that have been faulty. It's been happening more and more over the past few days, so I've aborted the last 2 bad ones and won't get any more for now until the issue is looked at. Homopt and boinc_filtered_loopbuild_threading seem to be the biggest issues for me, and they have been the only ones I've been getting. Anything else you need to know? Yes, I'm running Windows 7 RC right now.

P . P . L .

Joined: Aug 20 06
Posts: 581
ID: 105843
Credit: 4,864,105
RAC: 0
Message 64942 - Posted 13 Jan 2010 7:01:16 UTC

This lasted about 11sec.

t287__boinc_filtered_loopbuild_threading_cst_lb_tex_IGNORE_THE_REST_16900_576_0

http://boinc.bakerlab.org/rosetta/workunit.php?wuid=282963623

<core_client_version>6.2.14</core_client_version>
<![CDATA[
<message>
process got signal 11
</message>

Wed 13 Jan 2010 16:27:59 EST|rosetta@home|Output file t287__boinc_filtered_loopbuild_threading_cst_lb_tex_IGNORE_THE_REST_16900_576_0_0 for task absent

____________


Mad_Max

Joined: Dec 31 09
Posts: 150
ID: 365007
Credit: 4,704,855
RAC: 9,235
Message 64946 - Posted 13 Jan 2010 13:40:04 UTC - in response to Message ID 64905.

Hello,

Thanks for your comments! In this sort of WU trajectories are typically not very long, but successful trajectories are extremely rare (credit is allocated per time spent computing, so you get credit for the work regardless of success). Accordingly, don't worry about turning off the computer in the middle of one of these trajectories. The most you're going to lose is a couple of minutes of computation.

Thanks, Sarel.


I am not worried about losing 1 uncompleted model: in these tasks the models are small, so the loss really would be no more than a few minutes of CPU time.
But what about losing all the models calculated before the shutdown (or a reboot of the computer, or a restart of the BOINC client)? Judging by the screenshot posted above, this type of WU does not write any checkpoints at all during the whole computation.
Or are the results of finished (fully calculated) models saved in some other way (not through the checkpoint mechanism), with checkpoints needed only for saving intermediate state within 1 model?
And does BOINC simply not know about this, and so shows no "CPU time at last checkpoint"?

Mad_Max

Joined: Dec 31 09
Posts: 150
ID: 365007
Credit: 4,704,855
RAC: 9,235
Message 64947 - Posted 13 Jan 2010 14:12:59 UTC - in response to Message ID 64923.

Hi again,

David Kim and I have tracked down this problem and I'm going to test a fix to it in the upcoming release. The problem was that per-decoy checkpointing was not on in this batch of simulations. When I mentioned that these protocols do not need checkpointing I only meant within-trajectory checkpointing.

For the time being, I've stopped sending out this type of simulation, though over the next few days your computers might still work on them as quite a few have already been sent out. To assure you, the results of these simulations are certainly useful to us and in most cases credit will be allocated correctly.

Thanks a lot for sending specific comments that allowed us to figure this out!

Oh, I wrote my previous post before reading this one.
Glad to hear that the problem has been pinned down. It is always pleasant to squash one more bug in software. :)
(My main job involves programming in general, and testing and debugging in particular. The projects are much simpler than scientific ones, but programming has a lot in common everywhere.)

Mad_Max

Joined: Dec 31 09
Posts: 150
ID: 365007
Credit: 4,704,855
RAC: 9,235
Message 64948 - Posted 13 Jan 2010 15:17:46 UTC - in response to Message ID 64926.

More of the previously reported errors here on what's usually a very reliable error-free machine.

One new odd one though, relating to credits rather than anything else:

9gbnnotyr_3gbn_2p8g_9Jan2010_16860_4_0
Outcome Success
Client state Done
Exit status 0 (0x0)

<core_client_version>6.10.18</core_client_version>

# cpu_run_time_pref: 28800
======================================================
DONE :: 10 starting structures 28348.9 cpu seconds
This process generated 10 decoys from 10 attempts
======================================================

BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down cleanly ...
called boinc_finish

Claimed credit 133.53157877111
Granted credit 1.50234135281901
application version 2.03

Generally my granted credit on this W7 laptop is close to the claimed credit, with the occasional one being 30% less or 50% more, but 99% less seems very odd. Any ideas, or just a one-off?


I think this, too, is a problem with saving the results of the calculations. Your computer reported "10 decoys", which is very few for this type of WU.
For comparison, here is the result of a similar WU on my processor: http://boinc.bakerlab.org/rosetta/result.php?resultid=309983219
As you can see, my processor calculated "96 decoys" in 7138.77 CPU seconds.
And your result: 10 decoys in 28348.9 CPU seconds, despite a more powerful processor.
The credits seem to be calculated correctly:
15.15 Cr for 96 decoys (my result)
1.5 Cr for 10 decoys (your result)
I.e. about 0.15 Cr per decoy in both cases.
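
The same arithmetic as a quick Python check, using only the figures quoted in this thread. The dictionary layout and function names are just for illustration.

```python
# Figures copied from the two task reports discussed in this thread.
reports = {
    "Mad_Max": {"granted": 15.15, "decoys": 96, "cpu_seconds": 7138.77},
    "Sid Celery": {"granted": 1.50234135281901, "decoys": 10, "cpu_seconds": 28348.9},
}

def credit_per_decoy(report):
    """Granted credit divided by the number of decoys reported."""
    return report["granted"] / report["decoys"]

def seconds_per_decoy(report):
    """CPU seconds spent per reported decoy."""
    return report["cpu_seconds"] / report["decoys"]

for name, report in reports.items():
    print(f"{name}: {credit_per_decoy(report):.3f} Cr/decoy, "
          f"{seconds_per_decoy(report):.0f} CPU s/decoy")

# Granted credit as a fraction of the claimed credit on Sid Celery's task.
claimed = 133.53157877111
granted_fraction = reports["Sid Celery"]["granted"] / claimed
print(f"granted/claimed = {granted_fraction:.1%}")
```

Credit per decoy is nearly the same for both reports, while CPU seconds per reported decoy differ by a factor of almost 40. That is consistent with credit following the number of decoys reported rather than the CPU time used.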

So I think the problem is on your computer's side rather than on the server.
Unless the computer has a serious problem capable of slowing calculations down many times over (for example heavy swapping), your computer most likely calculates many more "decoys", but most of them were lost for some reason and only 10 were included in the report.

Mad_Max

Joined: Dec 31 09
Posts: 150
ID: 365007
Credit: 4,704,855
RAC: 9,235
Message 64983 - Posted 14 Jan 2010 22:44:50 UTC
Last modified: 14 Jan 2010 23:06:32 UTC

From my observations (after I ran into a similar problem, I watched the Rosetta application's disk writes for some time), the majority of WUs wrote checkpoints even more often: about once every 1-2 minutes (I think according to the setting in BOINC, which defaults to 60 seconds).
The exceptions were two types of WU: one did not write checkpoints at all (as you noted, this problem has already been tracked down and a fix should be included in the new version, Rosetta mini 2.05), and the other wrote checkpoints as usual but for some reason could not use them after a restart (or did not try at all).
If a job of the 2nd type comes my way again, I will try to catch it.
I think an indirect sign of such tasks should be a bad ratio between "claimed credit" and "granted credit".


I think I have caught the second checkpoint bug. This time it is not "between small models" but "inside one big one".
Here is one such task: http://boinc.bakerlab.org/rosetta/result.php?resultid=310448366
As you can see, the ratio between "Claimed credit" and "Granted credit" is very bad, which indirectly points to a problem (too few useful results for that much CPU time). Tasks that were never interrupted while running usually show a much better ratio on my computer.

And now, how this job ran on my computer: it was executed in 3 stages with 2 restarts between them (the 1st was the computer being turned off for the night; the 2nd was me deliberately restarting BOINC as a test).
At the end of the first stage (before the 1st restart) CPU time was about 2.5 hours, progress was ~88%, and "show graphics" showed 1 model and a lot of steps (several thousand).

The next day, at startup, progress dropped at once to ~47%, though I think it was really reset to zero and BOINC simply calculated it as 2:49 hours (CPU time already used) divided by 6 hours (the maximum allowed time = target CPU time x 3 = 6 h). "Show graphics" showed the following:
http://s004.radikal.ru/i206/1001/e5/15254410b960.jpg
http://s005.radikal.ru/i210/1001/d5/a235df07123e.jpg
It looks as though the computation started from the very beginning.
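
A small sketch of the progress guess above. It assumes (this is only the guess made in this post, not confirmed project behaviour) that after a restart the displayed progress is simply the CPU time already used divided by the maximum allowed time of 3 x target CPU time. The function name and the cap_multiplier parameter are made up for illustration.

```python
def estimated_progress(cpu_seconds_used, target_cpu_seconds, cap_multiplier=3):
    """Fraction done if progress is CPU time used over the maximum allowed time."""
    max_allowed = target_cpu_seconds * cap_multiplier
    return min(cpu_seconds_used / max_allowed, 1.0)

used = 2 * 3600 + 49 * 60   # 2:49 hours of CPU time already used
target = 2 * 3600           # target CPU time of 2 hours
progress = estimated_progress(used, target)
print(f"{progress:.1%}")    # prints 46.9%
```

That is close to the ~47% shown after the restart, so the displayed figure could be explained by CPU time alone, without any model progress having been restored.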

After two hours of computing I restarted BOINC a 2nd time (Exit from the tray icon); after startup "show graphics" looked like this:
http://i069.radikal.ru/1001/1f/f431840cb759.jpg
Again the count of models and steps starts from 0...

The task log (stderr out) does contain a record of a checkpoint being read, but only one, although the job was interrupted and restarted twice. Besides, the working folder contained many more checkpoint-related files.

Greg_BE Profile

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,948,921
RAC: 243
Message 65039 - Posted 19 Jan 2010 0:25:07 UTC
Last modified: 19 Jan 2010 0:25:29 UTC

There seems to be a common theme with these tasks:

homopt4.t367_.t367_.IGNORE_THE_REST.S_00006_0000001_0_0_10089.pdb_00002.pdb_00006.pdb.JOB_16819_18_1
http://boinc.bakerlab.org/rosetta/result.php?resultid=309699920

homopt4.t328_.t328_.IGNORE_THE_REST.S_00001_0000022_0_0_0_0077.pdb_00001.pdb_00001.pdb.JOB_16816_16_1
http://boinc.bakerlab.org/rosetta/result.php?resultid=309816167

homopt4.t322_.t322_.IGNORE_THE_REST.S_00006_0000023_0_0_00034.pdb_00008.pdb_00006.pdb.JOB_16815_24_1
http://boinc.bakerlab.org/rosetta/result.php?resultid=309824444

They all died immediately due to:
ERROR: No values of the appropriate type specified for multi-valued option -loops:loop_file

Greg_BE Profile

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,948,921
RAC: 243
Message 65051 - Posted 21 Jan 2010 11:43:21 UTC

http://boinc.bakerlab.org/rosetta/result.php?resultid=310017128
homopt_fa_cstmc_1.t370_.t370_.IGNORE_THE_REST.S_00003_0000784_010.pdb.JOB_16898_23_0


Outcome Client error
Client state Compute error
Exit status -177 (0xffffff4f)
Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Breakpoint Encountered (0x80000003) at address 0x7C90120E

Engaging BOINC Windows Runtime Debugger...

BOINC Windows Runtime Debugger Version 6.5.0

Dump Timestamp : 01/21/10 00:25:36
LoadLibraryA( E:xxxxx: GetLastError = 126
Loaded Library : version.dll
Debugger Engine : 4.0.5.0

Greg_BE Profile

Joined: May 30 06
Posts: 4835
ID: 85645
Credit: 2,948,921
RAC: 243
Message 65052 - Posted 21 Jan 2010 11:44:29 UTC

http://boinc.bakerlab.org/rosetta/result.php?resultid=310017144
t308__boinc_filtered_loopbuild_threading_cst_all_tex_IGNORE_THE_REST_16902_69_0

Outcome Client error
Client state Compute error
Exit status -177 (0xffffff4f)

<core_client_version>6.10.18</core_client_version>
<![CDATA[
<message>
Maximum elapsed time exceeded
</message>
]]>


Copyright © 2017 University of Washington

Last Modified: 10 Nov 2010 1:51:38 UTC